Scraping and Visualizing Weather Data

Introduction

Weather data is an important class of data resource, with wide and significant applications across many industries. Weather forecast data is generally easy to obtain as a public resource, but getting historical weather data covering a large area and a long time span takes considerably more effort.

There are online APIs for historical weather data, such as AccuWeather, Dark Sky, and OpenWeatherMap, but the free tiers usually cap the number of calls, and the paid tiers are beyond what a beginner would consider. Many weather websites also publish historical data, and plain page access is normally unlimited (as long as you are not too aggressive), so we can work around the problem by fetching the pages and then extracting and organizing the data we need.

In what follows, we will use the Requests, Selenium, and Beautiful Soup Python packages to scrape weather data from the Wunderground website.

Preparation

About Wunderground

Wunderground (https://www.wunderground.com ) is short for Weather Underground (now owned by IBM subsidiary The Weather Company). Founded in 1995, Weather Underground has long been dedicated to sharing weather information with the public. Its data comes directly from weather stations: currently more than 180,000 stations in the United States and more than 290,000 in other countries around the world, which makes it a very convenient source of weather data.

The Python Packages

The main Python packages we use in the scraping process are listed below (an install command follows the list):

  • Requests, the de facto standard library for making HTTP requests in Python (the built-in urllib is less convenient), with a clean, elegant API for handling complex requests;
  • Beautiful Soup, a library for parsing HTML or XML documents that lets us search, extract, or modify the data we want from a page;
  • Selenium, a package for automating web application testing; many pages nowadays are generated dynamically and cannot be scraped as static HTML, so we use Selenium to render and load those dynamic pages;
  • Scrapy, a Python web-crawling framework, used to fetch the weather site's many pages.
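
A minimal install, assuming pip (lxml is included because the parsing code below uses the lxml parser; pandas, matplotlib, and seaborn are needed for the visualization section):

pip install requests beautifulsoup4 selenium lxml scrapy pandas matplotlib seaborn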

Scraping Wunderground Weather Data

The scraping process can be broken into the following steps:

  1. Scrapy/Requests issues the request;
  2. Selenium loads the page;
  3. Beautiful Soup parses the page.

So we first need to define some tools:

import os
from datetime import datetime, timedelta
from time import sleep

import bs4
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

Defining Basic Utilities

Creating a Browser Object

def get_browser():
    PROXY = "socks5://127.0.0.1:13579"  # IP:PORT or HOST:PORT
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    options.add_argument(f'--proxy-server={PROXY}')
    browser = webdriver.Chrome(options=options,
                               executable_path=r'C:\ProgramData\WebDriver\chromedriver.exe')
    return browser

The selenium.webdriver module provides all the WebDriver implementations; the currently supported WebDrivers are Firefox, Chrome, IE, and Remote.
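
Note that the executable_path argument used above was deprecated in Selenium 4 in favor of a Service object. A sketch of the newer style (the driver path is carried over from the example above):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def get_browser_v4():
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    # Selenium 4 style: wrap the driver path in a Service object
    service = Service(r'C:\ProgramData\WebDriver\chromedriver.exe')
    return webdriver.Chrome(service=service, options=options)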

Constructing the Request URL

def get_url(location, day, month, year):
    """
    Takes in variables to create a valid url which, when a get request
    is sent to that url, redirects us to the results page the user is
    looking for.

    :param location: formatted string containing the location the
        user wants temperature data for.
    :param day: validated string for a day that exists
    :param month: validated string for a month that exists
    :param year: validated string for a year that exists
    :return formatted_url: a url to be used in a get request which will
        point us to the results page for the data provided
    """
    lookup_URL = 'http://www.wunderground.com/history/daily/{}/date/{}-{}-{}'
    formatted_url = lookup_URL.format(location, year, month, day)
    return formatted_url


def scrape_station(station, start, end):
    '''
    This function scrapes the weather data web pages from wunderground.com
    for the station you provide it.
    You can look up your city's weather station by performing a search for
    it on wunderground.com and then clicking on the "History" section.
    The 4-letter name of the station will appear on that page.
    '''
    # Make sure a directory exists for the station web pages
    os.makedirs(station, exist_ok=True)
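
As a quick sanity check, calling get_url with the station and date used later in this post produces the following URL (note that month and day are not zero-padded):

>>> get_url('KSFO', 1, 1, 2018)
'http://www.wunderground.com/history/daily/KSFO/date/2018-1-1'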

Loading Pages with Selenium

def parse_page(url, browser):
    """
    Scrapes the web page specified by url and passes the resulting HTML
    to a Beautiful Soup object.
    Because we have to wait for JavaScript to load the data, we have to
    use Selenium, even though it is really slower than PhantomJS.
    browser.refresh() may also be used.
    """
    try:
        browser.get(url)
        browser.implicitly_wait(20)
    except WebDriverException:
        browser.get(url)
        browser.implicitly_wait(20)

    html = browser.page_source
    soup = bs4.BeautifulSoup(html, 'lxml')

    # If the history table has not been rendered yet, signal failure
    try:
        soup.find_all('tbody')[1]
    except IndexError:
        return None

    return soup


def scrape_the_underground(url, browser):
    '''
    Keep trying until we get the right soup.
    '''
    soup = parse_page(url, browser)

    while soup is None:
        sleep(10)
        soup = parse_page(url, browser)

    return soup
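
A caveat: scrape_the_underground will loop forever if a page never renders correctly. A bounded variant might look like the sketch below (max_tries and the error message are my own additions, not part of the original code):

def scrape_with_retries(url, browser, max_tries=5, delay=10):
    """Like scrape_the_underground, but give up after max_tries attempts."""
    for _ in range(max_tries):
        soup = parse_page(url, browser)
        if soup is not None:
            return soup
        sleep(delay)  # give the page some time before retrying
    raise RuntimeError('Could not scrape {} after {} tries'.format(url, max_tries))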

Parsing Pages with Beautiful Soup

# Retrieve temperature data

def get_tmp_data(soup):
    '''
    Find the 'Temperature' section using the second 'tbody' directly.
    However, if you want to get more variables, it is better to use:
    history_table = soup.find(class_='summary-table').

    :param soup: BeautifulSoup object containing the whole target web page
    :return temperature_dict: dict mapping variable names to formatted values
    '''
    history_table = soup.find_all('tbody')[1]
    history_table_rows = history_table.findAll('tr')

    temperature_dict = dict()

    for row in history_table_rows:
        row_text = row.get_text()
        if 'Day Average Temp' in row_text:
            cells = row.findAll('td')
            temperature_dict['Actual Mean Temperature'] = get_cell_data(cells[0])
            temperature_dict['Average Mean Temperature'] = get_cell_data(cells[1])
        elif 'High Temp' in row_text:
            cells = row.findAll('td')
            temperature_dict['Actual Max Temperature'] = get_cell_data(cells[0])
            temperature_dict['Average Max Temperature'] = get_cell_data(cells[1])
            temperature_dict['Record Max Temperature'] = get_cell_data(cells[2])
        elif 'Low Temp' in row_text:
            cells = row.findAll('td')
            temperature_dict['Actual Min Temperature'] = get_cell_data(cells[0])
            temperature_dict['Average Min Temperature'] = get_cell_data(cells[1])
            temperature_dict['Record Min Temperature'] = get_cell_data(cells[2])

    return temperature_dict


# Parse a table cell (i.e., td)
def get_cell_data(cell):
    """
    Receives a cell containing temperature data to be parsed out,
    formatted, and returned.

    :param cell: A cell from the table containing temperature data
    :return temperature: String with the formatted temperature value
    """
    temperature = cell.get_text().strip()
    temperature = temperature.replace('\xa0', '')
    temperature = temperature.replace('\n', ' ')
    temperature = temperature.replace('°', '')

    return temperature
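
Tracing get_cell_data through a synthetic cell (the HTML snippet here is my own, purely for illustration) shows how the non-breaking space and the degree sign are stripped:

>>> cell = bs4.BeautifulSoup('<td>57\xa0°F</td>', 'lxml').td
>>> get_cell_data(cell)
'57F'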

Scraper Example

Setting the Parameters

station = 'KSFO'
start = datetime(year=2018, month=1, day=1)
end = datetime(year=2018, month=12, day=31)

Scraping the Pages in a Loop

browser = get_browser()
data = []
with open('{}.csv'.format(station), 'w+') as out_file:
    out_file.write('date,actual_mean_temp,actual_min_temp,actual_max_temp,'
                   'average_min_temp,average_max_temp,'
                   'record_min_temp,record_max_temp\n')
    for i in range((end - start).days + 1):
        current_date = start + timedelta(days=i)
        print(str(current_date))
        full_url = get_url(station, current_date.day,
                           current_date.month,
                           current_date.year)

        try:
            soup = scrape_the_underground(full_url, browser)
            temperature_dict = get_tmp_data(soup)
        except Exception:  # retry once
            sleep(10)
            soup = scrape_the_underground(full_url, browser)
            temperature_dict = get_tmp_data(soup)
        data.append(temperature_dict)
        out_file.write('{}-{}-{},'.format(current_date.year, current_date.month, current_date.day))
        out_file.write(','.join([temperature_dict['Actual Mean Temperature'],
                                 temperature_dict['Actual Min Temperature'],
                                 temperature_dict['Actual Max Temperature'],
                                 temperature_dict['Average Min Temperature'],
                                 temperature_dict['Average Max Temperature'],
                                 temperature_dict['Record Min Temperature'],
                                 temperature_dict['Record Max Temperature']]))
        out_file.write('\n')
        # sleep(10)  # optionally pause between requests to be polite to the server

browser.quit()
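
The data list filled in above is not used again, but it offers an alternative to hand-writing the CSV: the same table can be assembled with pandas. A sketch, assuming every day was scraped successfully (the output filename is my own choice):

import pandas as pd

df_scraped = pd.DataFrame(data)
df_scraped['date'] = [start + timedelta(days=i) for i in range(len(data))]
df_scraped.to_csv('{}_from_dicts.csv'.format(station), index=False)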

Visualizing the Weather Data

Let's first look at the day-by-day temperatures for 2018, and at the relationship between the daily maximum, minimum, and mean temperatures:

import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime
from pandas.plotting import scatter_matrix

weather_data = pd.read_csv('KSFO.csv', parse_dates=['date'])
print(weather_data.describe())

df = weather_data
df['Datetime'] = pd.to_datetime(df['date'])
df = df.drop(columns=['date'])
df = df.set_index('Datetime')
scatter_matrix(df.iloc[:,[0,1,2]], alpha=0.2, figsize=(10,10))
plt.show()
       actual_mean_temp  actual_min_temp  actual_max_temp  average_min_temp  \
count        365.000000       365.000000       365.000000        365.000000   
mean          59.484932        51.479452        66.975342         50.194521   
std            5.907522         4.950607         7.911040          6.202000   
min           45.000000        36.000000        53.000000          0.000000   
25%           55.000000        48.000000        61.000000         47.000000   
50%           60.000000        52.000000        67.000000         51.000000   
75%           63.000000        55.000000        72.000000         55.000000   
max           81.000000        64.000000       102.000000         56.000000   

       average_max_temp  record_min_temp  record_max_temp  
count        365.000000       365.000000       365.000000  
mean          65.317808        40.243836        83.635616  
std            8.496466         6.546809        10.027782  
min            0.000000        24.000000        65.000000  
25%           60.000000        35.000000        74.000000  
50%           67.000000        41.000000        85.000000  
75%           72.000000        46.000000        92.000000  
max           74.000000        50.000000       106.000000  

(Figure: scatter matrix of actual_mean_temp, actual_min_temp, and actual_max_temp)

I found visualization code with a rather nice result on GitHub (https://github.com/fivethirtyeight ), included here for reference:

# Plot each day's actual temperature range against the historical
# average and record ranges.
with plt.style.context('https://gist.githubusercontent.com/rhiever/d0a7332fe0beebfdc3d5/raw/1b807615235ff6f4c919b5b70b01a609619e1e9c/tableau10.mplstyle'):

    # Make sure we're only plotting temperatures for 2018 - 2019
    weather_data_subset = weather_data[weather_data['date'] >= datetime(year=2018, month=1, day=1)]
    weather_data_subset = weather_data_subset[weather_data_subset['date'] < datetime(year=2019, month=1, day=1)].copy()
    weather_data_subset['day_order'] = range(len(weather_data_subset))

    day_order = weather_data_subset['day_order']
    record_max_temps = weather_data_subset['record_max_temp'].values
    record_min_temps = weather_data_subset['record_min_temp'].values
    average_max_temps = weather_data_subset['average_max_temp'].values
    average_min_temps = weather_data_subset['average_min_temp'].values
    actual_max_temps = weather_data_subset['actual_max_temp'].values
    actual_min_temps = weather_data_subset['actual_min_temp'].values

    fig, ax1 = plt.subplots(figsize=(15, 7))

    # Create the bars showing all-time record highs and lows
    plt.bar(day_order, record_max_temps - record_min_temps, bottom=record_min_temps,
            edgecolor='none', color='#C3BBA4', width=1)

    # Create the bars showing average highs and lows
    plt.bar(day_order, average_max_temps - average_min_temps, bottom=average_min_temps,
            edgecolor='none', color='#9A9180', width=1)

    # Create the bars showing this year's highs and lows
    plt.bar(day_order, actual_max_temps - actual_min_temps, bottom=actual_min_temps,
            edgecolor='black', linewidth=0.5, color='#5A3B49', width=1)

    new_max_records = weather_data_subset[weather_data_subset.record_max_temp <= weather_data_subset.actual_max_temp]
    new_min_records = weather_data_subset[weather_data_subset.record_min_temp >= weather_data_subset.actual_min_temp]

    # Create the dots marking record highs and lows set this year
    plt.scatter(new_max_records['day_order'].values + 0.5,
                new_max_records['actual_max_temp'].values + 1.25,
                s=15, zorder=10, color='#d62728', alpha=0.75, linewidth=0)

    plt.scatter(new_min_records['day_order'].values + 0.5,
                new_min_records['actual_min_temp'].values - 1.25,
                s=15, zorder=10, color='#1f77b4', alpha=0.75, linewidth=0)

    plt.ylim(-15, 111)
    plt.xlim(-5, 370)

    plt.yticks(range(-10, 111, 10),
               [r'{}$^\circ$'.format(x) for x in range(-10, 111, 10)],
               fontsize=10)
    plt.ylabel(r'Temperature ($^\circ$F)', fontsize=12)

    month_beginning_df = weather_data_subset[weather_data_subset['date'].apply(lambda x: x.day == 1)]
    month_beginning_indices = list(month_beginning_df['day_order'].values)
    month_beginning_names = list(month_beginning_df['date'].apply(lambda x: x.strftime("%B")).values)
    month_beginning_names[0] += '\n\'18'

    # Add the last month label manually
    month_beginning_indices += [weather_data_subset['day_order'].values[-1]]
    month_beginning_names += ['January']
    month_beginning_names[12] += '\n\'19'

    plt.xticks(month_beginning_indices,
               month_beginning_names,
               fontsize=10)
    ax2 = ax1.twiny()
    plt.xticks(month_beginning_indices,
               month_beginning_names,
               fontsize=10)

    plt.xlim(-5, 370)
    plt.grid(False)

    ax3 = ax1.twinx()
    plt.yticks(range(-10, 111, 10),
               [r'{}$^\circ$'.format(x) for x in range(-10, 111, 10)],
               fontsize=10)
    plt.ylim(-15, 111)
    plt.grid(False)
    plt.show()

(Figure: 2018 daily temperature ranges at KSFO plotted against the record and average bands)

The frame used for the ridgeline plot below reuses the datetime-indexed df from the scatter-matrix example, with a month column added; its ddf.head() output looks like this:

            actual_mean_temp    month
Datetime
2018-01-01                51  January
2018-01-02                55  January
2018-01-03                55  January
2018-01-04                58  January
2018-01-05                59  January
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
month_lst = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
             'August', 'September', 'October', 'November', 'December']

# Create the data: reuse the datetime-indexed df from the scatter-matrix
# example above and add a month column
ddf = df.copy()
ddf['month'] = ddf.index.strftime('%B')
ddf.head()

# Initialize the FacetGrid object, one row per month in calendar order
pal = sns.cubehelix_palette(10, rot=-.25, light=.7)
g = sns.FacetGrid(ddf, row="month", hue="month", row_order=month_lst,
                  aspect=15, height=1, palette=pal)

# Draw the densities in a few steps
g.map(sns.kdeplot, "actual_mean_temp", clip_on=False, shade=True, alpha=1, lw=1.5, bw=1)
g.map(sns.kdeplot, "actual_mean_temp", clip_on=False, color="w", lw=2, bw=1)
g.map(plt.axhline, y=0, lw=2, clip_on=False)


# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
    ax = plt.gca()
    ax.text(0, .2, label, fontweight="bold", color=color,
            ha="left", va="center", transform=ax.transAxes)


g.map(label, "actual_mean_temp")

# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)

# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)
plt.show()

(Figure: ridgeline plot of the monthly distributions of actual_mean_temp)
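
A compatibility note: seaborn 0.11 and later deprecate the shade and bw keywords used in the kdeplot calls above. If they trigger warnings or errors, the roughly equivalent modern spellings are:

g.map(sns.kdeplot, "actual_mean_temp", clip_on=False, fill=True, alpha=1, lw=1.5, bw_adjust=1)
g.map(sns.kdeplot, "actual_mean_temp", clip_on=False, color="w", lw=2, bw_adjust=1)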

Other Approaches

The biggest problem with rendering pages through Selenium is that it is slow. If we inspect Wunderground's pages carefully, we find that the pages actually fetch their data in JSON format through an API. So we could instead obtain the data by calling that API directly; the specific method is left for a later post.
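
As a rough illustration of that approach (the endpoint below is a placeholder, not Wunderground's real API; the actual JSON URL and any required key can be discovered by watching the XHR requests in the browser's developer-tools network panel):

import requests

# Hypothetical endpoint, for illustration only; substitute the JSON URL
# observed in the browser's network panel.
api_url = 'https://example.com/v1/history/daily?station=KSFO&date=20180101'
resp = requests.get(api_url, timeout=30)
resp.raise_for_status()
payload = resp.json()  # parsed JSON, no browser rendering required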