1. Data Source
Kaggle - Wind Speed Prediction Dataset
Meteorologists face the challenge of producing accurate and reliable wind speed forecasts. Severe wind driven by convective storms causes considerable damage (large-scale forest destruction, power outages, damage to buildings and houses, etc.), so wind speed prediction is an important task for issuing advance severe-weather warnings. This dataset contains the responses of meteorological sensors that record different weather variables such as temperature and precipitation.
The dataset contains 6574 daily-average records from a weather station equipped with a sensor array covering five weather variables. The station sits in a relatively open area at an elevation of 21 m. The records run from January 1961 to December 1978 (18 years). Ground-truth daily averages of precipitation, maximum and minimum temperature, and grass minimum temperature are provided.
Attribute information:
- DATE (YYYY-MM-DD)
- WIND: Average wind speed [knots]
- IND: First indicator value
- RAIN: Precipitation Amount (mm)
- IND.1: Second indicator value
- T.MAX: Maximum Temperature (°C)
- IND.2: Third indicator value
- T.MIN: Minimum Temperature (°C)
- T.MIN.G: Grass Minimum Temperature at 09 UTC (°C)
2. Data Preprocessing
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('./input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings(action='ignore')
import datetime
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf
df=pd.read_csv('./input/wind_dataset.csv')
df.head()
df.info()
df.shape
# process the date column
df['DATE']=pd.to_datetime(df['DATE'])
# extract year, month and day from DATE
df['year']=df['DATE'].dt.year
df['month']=df['DATE'].dt.month
df['day']=df['DATE'].dt.day
The output is as follows:
./input\wind_dataset.csv
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   DATE     6574 non-null   object
 1   WIND     6574 non-null   float64
 2   IND      6574 non-null   int64
 3   RAIN     6574 non-null   float64
 4   IND.1    6513 non-null   float64
 5   T.MAX    5953 non-null   float64
 6   IND.2    6513 non-null   float64
 7   T.MIN    5900 non-null   float64
 8   T.MIN.G  6214 non-null   float64
dtypes: float64(7), int64(1), object(1)
memory usage: 462.4+ KB
        DATE   WIND  IND  RAIN  IND.1  T.MAX  IND.2  T.MIN  T.MIN.G  year  month  day
0 1961-01-01  13.67    0   0.2    0.0    9.5    0.0    3.7     -1.0  1961      1    1
1 1961-01-02  11.50    0   5.1    0.0    7.2    0.0    4.2      1.1  1961      1    2
2 1961-01-03  11.25    0   0.4    0.0    5.5    0.0    0.5     -0.5  1961      1    3
3 1961-01-04   8.63    0   0.2    0.0    5.6    0.0    0.4     -3.2  1961      1    4
4 1961-01-05  11.92    0  10.4    0.0    7.2    1.0   -1.5     -7.5  1961      1    5
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   DATE     6574 non-null   datetime64[ns]
 1   WIND     6574 non-null   float64
 2   IND      6574 non-null   int64
 3   RAIN     6574 non-null   float64
 4   IND.1    6513 non-null   float64
 5   T.MAX    5953 non-null   float64
 6   IND.2    6513 non-null   float64
 7   T.MIN    5900 non-null   float64
 8   T.MIN.G  6214 non-null   float64
 9   year     6574 non-null   int64
 10  month    6574 non-null   int64
 11  day      6574 non-null   int64
dtypes: datetime64[ns](1), float64(7), int64(4)
memory usage: 616.4 KB
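As a side note, the date column can also be parsed directly while loading the CSV, which makes the separate pd.to_datetime call unnecessary. A minimal equivalent sketch (same file path as above):
df = pd.read_csv('./input/wind_dataset.csv', parse_dates=['DATE'])
df['year'], df['month'], df['day'] = df['DATE'].dt.year, df['DATE'].dt.month, df['DATE'].dt.day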
Check whether the data contains missing values:
df.isnull().sum()
The result is as follows:
DATE 0
WIND 0
IND 0
RAIN 0
IND.1 61
T.MAX 621
IND.2 61
T.MIN 674
T.MIN.G 360
year 0
month 0
day 0
dtype: int64
# fill nulls using backfill (bfill)
df = df.fillna(method='bfill')
Then call isnull() again to confirm that no missing values remain.
Missing values can also be filled with the most frequent value (mode) of each column:
colu = ["IND.1", "T.MAX", "IND.2", "T.MIN", "T.MIN.G"]
for i in colu:
    print(i)
    print(df[i].mode())  # mode() returns the most frequent value(s) of the column
    df[i].fillna(df[i].mode()[0], inplace=True)  # inplace=True modifies the DataFrame in place
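Because these columns are daily time series, interpolating along the time axis is another reasonable way to fill the gaps. A minimal sketch using the same colu list (my own addition, not part of the original notebook):
# interpolate missing sensor readings linearly along the row order, then backfill any leading NaNs
df[colu] = df[colu].interpolate(method='linear', limit_direction='forward')
df[colu] = df[colu].fillna(method='bfill')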
3. Data Visualization
Wind speed over time:
df.plot(x='DATE',y='WIND',figsize=(15,8))
The resulting plot is shown below:
Correlations between the features:
sns.heatmap(df.corr(),annot=True, cbar=False, cmap='Blues', fmt='.1f')
# annot=True shows the correlation coefficients, cbar=False hides the color bar, cmap='Blues' uses a blue palette, fmt='.1f' keeps one decimal place
The resulting heatmap is shown below:
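To make the heatmap easier to read, the correlation of each feature with WIND can also be printed as a ranked list; a minimal sketch (my own addition, not part of the original notebook):
# correlation of every other column with WIND, strongest first
print(df.corr()['WIND'].drop('WIND').sort_values(ascending=False))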
Box plots of wind speed, rainfall, maximum temperature, and minimum temperature:
fig, axes = plt.subplots(2, 2, figsize=(15, 8))
sns.boxplot(df['WIND'],ax=axes[0,0]) # Boxplot:'Wind'
# axes[0,0].set_title('Wind') # title
sns.boxplot(df['RAIN'],ax=axes[0,1]) # Boxplot:'Rain'
# axes[0,1].set_title('Rain') # title
sns.boxplot(df['T.MAX'],ax=axes[1,0]) # Boxplot:'Max Temperature'
# axes[1,0].set_title('Max Temperature') # title
sns.boxplot(df['T.MIN'],ax=axes[1,1]) # Boxplot:'Min Temperature'
# axes[1,1].set_title('Min Temperature') # title
The result is as follows:
From the box plots we can see that the average wind speed is around 9 knots, rainfall has many high outliers, the maximum temperature averages around 12 °C, the minimum temperature averages around 8 °C, and the minimum temperature also shows a few unusually low values.
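The rainfall outliers visible in the box plot can be counted explicitly with the usual 1.5 x IQR rule; a minimal sketch (my own addition, not part of the original notebook):
# count how many daily RAIN values lie above the upper whisker (Q3 + 1.5 * IQR)
q1, q3 = df['RAIN'].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
print((df['RAIN'] > upper).sum(), "days exceed", round(upper, 2), "mm")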
Plot the time series of all variables:
df = df.set_index('DATE')
df.head()
from matplotlib import pyplot
# load dataset
values = df.values
# specify columns to plot
groups = [0, 1, 2, 3,4, 5, 6, 7]
i = 1
# plot each column
pyplot.figure(figsize=(15,15))
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(df.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
The result is as follows:
4. Relationships Between Variables
4.1 Wind speed vs. rainfall
sns.scatterplot(x='WIND',y='RAIN',data=df)
The resulting scatter plot is shown below:
4.2 Monthly variation of wind speed
sns.boxplot(x=df['month'] ,y=df["WIND"])
The resulting box plot is shown below:
4.3 Yearly variation of rainfall (line plot)
sns.lineplot(x=df['year'] ,y=df["RAIN"])
# sns.lineplot(x=df['year'], y=df["RAIN"], ci=None)
# ci is the confidence-interval band; ci=None removes the error band
# x-axis: year, y-axis: rainfall
plt.title('Rainfall per year')
plt.xlabel('Year')
plt.ylabel('Rainfall')
# set the x-axis tick values
plt.xticks(np.arange(1960,1980,2))
The resulting plot is shown below:
5. Time Series Forecasting
5.1 LSTM
LSTM stands for Long Short-Term Memory network, a special kind of RNN that can learn long-term dependencies.
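Before looking at the training code, it helps to recall how a Keras LSTM layer handles shapes, since the models below rely on return_sequences and return_state. A minimal sketch with made-up shapes (my own addition, not part of the original notebook):
import tensorflow as tf
x = tf.random.normal((32, 10, 8))                            # batch of 32 windows, 10 days, 8 features
out = tf.keras.layers.LSTM(100)(x)                           # (32, 100): only the last timestep's output
seq = tf.keras.layers.LSTM(100, return_sequences=True)(x)    # (32, 10, 100): output at every timestep
seq, h, c = tf.keras.layers.LSTM(100, return_sequences=True, return_state=True)(x)
# h and c are the final hidden/cell states, each (32, 100); in the models below the encoder
# passes them to the decoder as initial_state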
5.2 Training
I don't fully understand this part yet; it will take some more time to digest.
# Train – Test Split
# drop the year, month and day columns
df = df.drop(['year','month','day'],axis=1)
train_df,test_df = df[1:4601], df[4601:]
# Scaling: fit MinMaxScaler on the training set, then apply the same scalers to the test set
train = train_df
scalers={}
for i in train_df.columns:
    scaler = MinMaxScaler(feature_range=(-1, 1))  # MinMaxScaler scales each column to [-1, 1]
    s_s = scaler.fit_transform(train[i].values.reshape(-1, 1))
    s_s = np.reshape(s_s, len(s_s))
    scalers['scaler_' + i] = scaler
    train[i] = s_s
test = test_df
for i in train_df.columns:
    scaler = scalers['scaler_' + i]
    s_s = scaler.transform(test[i].values.reshape(-1, 1))
    s_s = np.reshape(s_s, len(s_s))
    scalers['scaler_' + i] = scaler
    test[i] = s_s
# Use a sliding window to turn the series into supervised-learning samples:
# past observations become the input and future observations become the output.
def split_series(series, n_past, n_future):
    X, y = list(), list()
    for window_start in range(len(series)):
        past_end = window_start + n_past
        future_end = past_end + n_future
        if future_end > len(series):
            break
        past, future = series[window_start:past_end, :], series[past_end:future_end, :]
        X.append(past)
        y.append(future)
    return np.array(X), np.array(y)
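# quick sanity check of split_series on a tiny made-up array (my own addition, not part of the original):
# 10 timesteps x 2 features, look back 3 steps, forecast 2 steps ahead
demo = np.arange(20).reshape(10, 2)
Xd, yd = split_series(demo, 3, 2)
print(Xd.shape, yd.shape)  # (6, 3, 2) (6, 2, 2)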
# Suppose we need to predict the next 5 days of observations from the past 10 days.
n_past = 10
n_future = 5
n_features = 8
# convert the data into samples
X_train, y_train = split_series(train.values,n_past, n_future)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1],n_features))
y_train = y_train.reshape((y_train.shape[0], y_train.shape[1], n_features))
X_test, y_test = split_series(test.values,n_past, n_future)
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1],n_features))
y_test = y_test.reshape((y_test.shape[0], y_test.shape[1], n_features))
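# sanity check: with 4600 training rows, a 10-day window and a 5-day horizon there are
# 4600 - 10 - 5 + 1 = 4586 windows, so X_train should be (4586, 10, 8) and y_train (4586, 5, 8)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)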
# Modelling
## E1D1: a seq2seq model with one encoder LSTM layer and one decoder LSTM layer
encoder_inputs = tf.keras.layers.Input(shape=(n_past, n_features))
encoder_l1 = tf.keras.layers.LSTM(100, return_state=True)
encoder_outputs1 = encoder_l1(encoder_inputs)
encoder_states1 = encoder_outputs1[1:]
decoder_inputs = tf.keras.layers.RepeatVector(n_future)(encoder_outputs1[0])
decoder_l1 = tf.keras.layers.LSTM(100, return_sequences=True)(decoder_inputs,initial_state = encoder_states1)
decoder_outputs1 = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_features))(decoder_l1)
model_e1d1 = tf.keras.models.Model(encoder_inputs,decoder_outputs1)
model_e1d1.summary()
## E2D2
## a seq2seq model with two encoder LSTM layers and two decoder LSTM layers
encoder_inputs = tf.keras.layers.Input(shape=(n_past, n_features))
encoder_l1 = tf.keras.layers.LSTM(100,return_sequences = True, return_state=True)
encoder_outputs1 = encoder_l1(encoder_inputs)
encoder_states1 = encoder_outputs1[1:]
encoder_l2 = tf.keras.layers.LSTM(100, return_state=True)
encoder_outputs2 = encoder_l2(encoder_outputs1[0])
encoder_states2 = encoder_outputs2[1:]
decoder_inputs = tf.keras.layers.RepeatVector(n_future)(encoder_outputs2[0])
decoder_l1 = tf.keras.layers.LSTM(100, return_sequences=True)(decoder_inputs,initial_state = encoder_states1)
decoder_l2 = tf.keras.layers.LSTM(100, return_sequences=True)(decoder_l1,initial_state = encoder_states2)
decoder_outputs2 = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_features))(decoder_l2)
model_e2d2 = tf.keras.models.Model(encoder_inputs,decoder_outputs2)
model_e2d2.summary()
# Train the models
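# The LearningRateScheduler below starts the learning rate at 1e-3 and multiplies it
# by 0.9 after every epoch (x in the lambda is the epoch index)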
reduce_lr = tf.keras.callbacks.LearningRateScheduler(lambda x: 1e-3 * 0.90 ** x)
model_e1d1.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.Huber())
history_e1d1=model_e1d1.fit(X_train,y_train,epochs=25,validation_data=(X_test,y_test),batch_size=32,verbose=0,callbacks=[reduce_lr])
model_e2d2.compile(optimizer=tf.keras.optimizers.Adam(), loss=tf.keras.losses.Huber())
history_e2d2=model_e2d2.fit(X_train,y_train,epochs=25,validation_data=(X_test,y_test),batch_size=32,verbose=0,callbacks=[reduce_lr])
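# Optional: compare the training and validation loss curves of the two models
# (my own addition, not part of the original notebook)
plt.figure(figsize=(10, 4))
plt.plot(history_e1d1.history['loss'], label='E1D1 train')
plt.plot(history_e1d1.history['val_loss'], label='E1D1 val')
plt.plot(history_e2d2.history['loss'], label='E2D2 train')
plt.plot(history_e2d2.history['val_loss'], label='E2D2 val')
plt.legend()
plt.title('Huber loss per epoch')
plt.show()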
# get the predictions
pred_e1d1=model_e1d1.predict(X_test)
pred_e2d2=model_e2d2.predict(X_test)
# inverse scaling back to the original units
for index, i in enumerate(train_df.columns):
    scaler = scalers['scaler_' + i]
    pred_e1d1[:, :, index] = scaler.inverse_transform(pred_e1d1[:, :, index])
    pred_e2d2[:, :, index] = scaler.inverse_transform(pred_e2d2[:, :, index])
    y_train[:, :, index] = scaler.inverse_transform(y_train[:, :, index])
    y_test[:, :, index] = scaler.inverse_transform(y_test[:, :, index])
# check error
from sklearn.metrics import mean_absolute_error
for index, i in enumerate(train_df.columns):
    print(i)
    for j in range(1, 6):
        print("Day ", j, ":")
        print("MAE-E1D1 : ", mean_absolute_error(y_test[:, j-1, index], pred_e1d1[:, j-1, index]), end=", ")
        print("MAE-E2D2 : ", mean_absolute_error(y_test[:, j-1, index], pred_e2d2[:, j-1, index]))
    print()
    print()
# Evaluate the models
def evaluate_model(y_true, y_pred):
    mae = np.mean(np.abs(y_true - y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return mae, mape, rmse
# MAE is the mean absolute error, MAPE the mean absolute percentage error,
# and RMSE the root mean squared error between predictions and true values.
mae_e1d1,mape_e1d1,rmse_e1d1=evaluate_model(y_test,pred_e1d1)
print("MAE-E1D1 : ",mae_e1d1)
print("MAPE-E1D1 : ",mape_e1d1)
print("RMSE-E1D1 : ",rmse_e1d1)
mae_e2d2,mape_e2d2,rmse_e2d2=evaluate_model(y_test,pred_e2d2)
print("MAE-E2D2 : ",mae_e2d2)
print("MAPE-E2D2 : ",mape_e2d2)
print("RMSE-E2D2 : ",rmse_e2d2)
The output is as follows:
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 10, 8)] 0 []
lstm (LSTM) [(None, 100), 43600 ['input_1[0][0]']
(None, 100),
(None, 100)]
repeat_vector (RepeatVector) (None, 5, 100) 0 ['lstm[0][0]']
lstm_1 (LSTM) (None, 5, 100) 80400 ['repeat_vector[0][0]',
'lstm[0][1]',
'lstm[0][2]']
time_distributed (TimeDistribu (None, 5, 8) 808 ['lstm_1[0][0]']
ted)
==================================================================================================
Total params: 124,808
Trainable params: 124,808
Non-trainable params: 0
__________________________________________________________________________________________________
Model: "model_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_2 (InputLayer) [(None, 10, 8)] 0 []
lstm_2 (LSTM) [(None, 10, 100), 43600 ['input_2[0][0]']
(None, 100),
(None, 100)]
lstm_3 (LSTM) [(None, 100), 80400 ['lstm_2[0][0]']
(None, 100),
(None, 100)]
repeat_vector_1 (RepeatVector) (None, 5, 100) 0 ['lstm_3[0][0]']
lstm_4 (LSTM) (None, 5, 100) 80400 ['repeat_vector_1[0][0]',
'lstm_2[0][1]',
'lstm_2[0][2]']
lstm_5 (LSTM) (None, 5, 100) 80400 ['lstm_4[0][0]',
'lstm_3[0][1]',
'lstm_3[0][2]']
time_distributed_1 (TimeDistri (None, 5, 8) 808 ['lstm_5[0][0]']
buted)
==================================================================================================
Total params: 285,608
Trainable params: 285,608
Non-trainable params: 0
__________________________________________________________________________________________________
62/62 [==============================] - 1s 4ms/step
62/62 [==============================] - 2s 7ms/step
WIND
Day 1 :
MAE-E1D1 : 0.2030982378059999, MAE-E2D2 : 0.20359981017286832
Day 2 :
MAE-E1D1 : 0.23322182065839472, MAE-E2D2 : 0.2322014858929193
Day 3 :
MAE-E1D1 : 0.240528954285563, MAE-E2D2 : 0.2399887921500838
Day 4 :
MAE-E1D1 : 0.24262299357094172, MAE-E2D2 : 0.24300205558516846
Day 5 :
MAE-E1D1 : 0.24477795444422737, MAE-E2D2 : 0.24521872620795385
IND
Day 1 :
MAE-E1D1 : 0.8192329511403916, MAE-E2D2 : 0.8078464830086023
Day 2 :
MAE-E1D1 : 0.8205593273674974, MAE-E2D2 : 0.809087825709913
Day 3 :
MAE-E1D1 : 0.8095268123061028, MAE-E2D2 : 0.7957327934233981
Day 4 :
MAE-E1D1 : 0.8005428497982366, MAE-E2D2 : 0.7832877624393909
Day 5 :
MAE-E1D1 : 0.7966770311592184, MAE-E2D2 : 0.7773387461307403
RAIN
Day 1 :
MAE-E1D1 : 0.09180004186695194, MAE-E2D2 : 0.09083399148450579
Day 2 :
MAE-E1D1 : 0.0924991551961152, MAE-E2D2 : 0.09287003019686486
Day 3 :
MAE-E1D1 : 0.09169342586566709, MAE-E2D2 : 0.09171447431362073
Day 4 :
MAE-E1D1 : 0.0918647467209567, MAE-E2D2 : 0.09168521043800822
Day 5 :
MAE-E1D1 : 0.09274870878211129, MAE-E2D2 : 0.09302511114610458
IND.1
Day 1 :
MAE-E1D1 : 0.019823942188936086, MAE-E2D2 : 0.022393431063815607
Day 2 :
MAE-E1D1 : 0.019041566405447238, MAE-E2D2 : 0.021436605376209027
Day 3 :
MAE-E1D1 : 0.020100323505216622, MAE-E2D2 : 0.02054175316161681
Day 4 :
MAE-E1D1 : 0.02028246833968978, MAE-E2D2 : 0.019219360192129963
Day 5 :
MAE-E1D1 : 0.020312563383560998, MAE-E2D2 : 0.01931783006226548
T.MAX
Day 1 :
MAE-E1D1 : 0.1232421095326075, MAE-E2D2 : 0.12522745991132936
Day 2 :
MAE-E1D1 : 0.1474737389383158, MAE-E2D2 : 0.1490627993920697
Day 3 :
MAE-E1D1 : 0.15721180848764904, MAE-E2D2 : 0.15875159289268734
Day 4 :
MAE-E1D1 : 0.16539402954977084, MAE-E2D2 : 0.16673676071107582
Day 5 :
MAE-E1D1 : 0.16949716272784246, MAE-E2D2 : 0.17063818147683252
IND.2
Day 1 :
MAE-E1D1 : 0.0710885461311428, MAE-E2D2 : 0.07079118672775456
Day 2 :
MAE-E1D1 : 0.07985194428287154, MAE-E2D2 : 0.07748176526390453
Day 3 :
MAE-E1D1 : 0.0854636775923968, MAE-E2D2 : 0.08299859596065504
Day 4 :
MAE-E1D1 : 0.0882332985987768, MAE-E2D2 : 0.08443187568916721
Day 5 :
MAE-E1D1 : 0.09076995515531273, MAE-E2D2 : 0.08585589801614049
T.MIN
Day 1 :
MAE-E1D1 : 0.12947372942288168, MAE-E2D2 : 0.13142280593030686
Day 2 :
MAE-E1D1 : 0.15197962835745546, MAE-E2D2 : 0.15341848565558663
Day 3 :
MAE-E1D1 : 0.162795914776204, MAE-E2D2 : 0.16349701241098136
Day 4 :
MAE-E1D1 : 0.16716887079652246, MAE-E2D2 : 0.16748711679387449
Day 5 :
MAE-E1D1 : 0.17005420679692593, MAE-E2D2 : 0.17034509361844594
T.MIN.G
Day 1 :
MAE-E1D1 : 0.20059115168468492, MAE-E2D2 : 0.19822965409102764
Day 2 :
MAE-E1D1 : 0.22191297463551657, MAE-E2D2 : 0.22105030430725828
Day 3 :
MAE-E1D1 : 0.22813145943950772, MAE-E2D2 : 0.2277492541219974
Day 4 :
MAE-E1D1 : 0.23071047852604046, MAE-E2D2 : 0.2301128151892642
Day 5 :
MAE-E1D1 : 0.2318879832054546, MAE-E2D2 : 0.23180336898038414
MAE-E1D1 : 0.2210972135857277
MAPE-E1D1 : inf
RMSE-E1D1 : 0.5905665027284146
MAE-E2D2 : 0.21918585763242307
MAPE-E2D2 : inf
RMSE-E2D2 : 0.5937227770902509
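The infinite MAPE is expected here: columns such as RAIN and the indicator variables contain exact zeros, so the division by y_true inside evaluate_model blows up. On MAE the two models are essentially tied (E2D2 is marginally better), while E1D1 has a slightly lower RMSE. A common workaround for the MAPE issue is to evaluate it only where the true value is non-zero; a minimal sketch (my own addition, not part of the original notebook):
def masked_mape(y_true, y_pred):
    # ignore positions where the true value is zero to avoid division by zero
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask]))
print("masked MAPE-E1D1:", masked_mape(y_test, pred_e1d1))
print("masked MAPE-E2D2:", masked_mape(y_test, pred_e2d2))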