Regularized-Linear-Models-Xgboost-Keras

Original notebook: https://www.kaggle.com/apapiu/regularized-linear-models

Linear models:

Author: Alexandru Papiu (@apapiu, GitHub)

There have been a few great scripts on xgboost already, so I figured I'd try something simpler: a regularized linear regression model. Surprisingly, it does quite well with very few features. The key point is to log-transform the numeric variables, since most of them have skewed distributions.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats.stats import pearsonr


%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline
In [2]:
train = pd.read_csv("input/train.csv")
test = pd.read_csv("input/test.csv")
In [3]:
train.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

Data preprocessing:

We're not going to do anything fancy here:

  • First, log-transform the skewed numeric features to bring them closer to a normal distribution
  • Create dummy variables for the categorical features
  • Replace the numeric missing values (NaN) with the mean of their respective columns
In [5]:
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()
Out[5]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000023C846D16D8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000023C846482B0>]],
      dtype=object)
In [6]:
#log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
C:\Python\Anaconda3\lib\site-packages\ipykernel_launcher.py:11: RuntimeWarning: invalid value encountered in log1p
  # This is added back by InteractiveShellApp.init_path()
In [7]:
all_data = pd.get_dummies(all_data)
In [8]:
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())
In [9]:
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

Training the models

Now we'll use regularized linear regression models from the scikit-learn module. I'll try both l_1 (Lasso) and l_2 (Ridge) regularization. I'll also define a function that returns the cross-validation RMSE error so we can evaluate the models and pick the best tuning parameter.

In [10]:
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
In [11]:
model_ridge = Ridge()

The main tuning parameter for the Ridge model is alpha, a regularization parameter that controls how flexible the model is. The higher the regularization, the less prone the model is to overfitting. But it also loses flexibility and may fail to capture all of the signal in the data.

In [12]:
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean() 
            for alpha in alphas]
In [13]:
cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")
Out[13]:
Text(0,0.5,'rmse')

Note the U-shaped curve above. When alpha is too large, the regularization is too strong and the model cannot capture all the complexity in the data. If we make the model too flexible (small alpha), however, it starts to overfit. Based on the plot above, a value of around alpha = 10 is about right.

In [14]:
cv_ridge.min()
Out[14]:
0.12733734668670754

So the Ridge regression gives an RMSLE of roughly 0.127.

Let's try the Lasso model. We'll take a slightly different approach here and use the built-in LassoCV to find the best alpha for us. For some reason, the alphas in LassoCV are really the inverse of the alphas in Ridge.

In [15]:
model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)
In [16]:
rmse_cv(model_lasso).mean()
Out[16]:
0.1231442109097743

Nice! The Lasso performs even better, so we'll just use it to predict on the test set. Another neat thing about the Lasso is that it does feature selection for you, setting the coefficients of features it deems unimportant to zero. Let's look at the coefficients:

In [17]:
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
In [18]:
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")
Lasso picked 110 variables and eliminated the other 178 variables

Good job, Lasso. One thing to note, however, is that the selected features are not necessarily the "correct" ones, especially since there are many collinear features in this dataset. One thing to try is running the Lasso a few times on bootstrapped samples and seeing how stable the feature selection is, as sketched below.
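
A rough sketch of that stability check, assuming we reuse X_train, y and the alpha already chosen by LassoCV (the resample helper, the number of repetitions, and the max_iter value are illustrative choices, not part of the original script):

from sklearn.linear_model import Lasso
from sklearn.utils import resample

# Count how often each feature gets a non-zero coefficient across
# a handful of bootstrapped fits.
n_rounds = 10
selection_counts = pd.Series(0, index=X_train.columns)

for i in range(n_rounds):
    X_boot, y_boot = resample(X_train, y, random_state=i)
    lasso = Lasso(alpha=model_lasso.alpha_, max_iter=50000).fit(X_boot, y_boot)
    selection_counts += (lasso.coef_ != 0).astype(int)

# Features picked in (almost) every round are the stable ones.
print(selection_counts.sort_values(ascending=False).head(20))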

We can also take a look at what the most important features are:

In [19]:
imp_coef = pd.concat([coef.sort_values().head(10),
                     coef.sort_values().tail(10)])
In [20]:
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
Out[20]:
Text(0.5,1,'Coefficients in the Lasso Model')

The most important feature is GrLivArea. Some location and house-quality features also contribute positively. Some of the negative features make less sense; they may come from unbalanced categorical variables.

Also note that, unlike the feature importances you get from a random forest, these are the actual coefficients in the model, so you can say exactly why a predicted price is what it is. The only catch is that we applied a log transform, so the actual magnitudes are a bit harder to interpret.
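
Since SalePrice was modeled as log1p(price), one rough way to read a coefficient is as an approximate percentage effect; a minimal sketch (the exp-based conversion is only an approximation, and for the log-transformed numeric features it describes a one-unit change on the log scale):

# A coefficient c on a dummy variable corresponds to roughly a
# (exp(c) - 1) * 100 percent change in price when that dummy switches
# from 0 to 1, all else held fixed.
approx_pct_effect = (np.exp(coef[coef != 0]) - 1) * 100
print(approx_pct_effect.sort_values(ascending=False).head(10))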

In [21]:
#let's look at the residuals as well:
matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)

preds = pd.DataFrame({"preds":model_lasso.predict(X_train), "true":y})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x = "preds", y = "residuals",kind = "scatter")
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c85442d68>

The residual plot looks pretty good.

Adding xgboost:

Let's add an xgboost model to the linear model and see whether we can improve the score:

In [22]:
import xgboost as xgb
In [23]:
dtrain = xgb.DMatrix(X_train, label = y)
dtest = xgb.DMatrix(X_test)

params = {"max_depth":2, "eta":0.1}
model = xgb.cv(params, dtrain,  num_boost_round=500, early_stopping_rounds=100)
In [24]:
model.loc[30:,["test-rmse-mean", "train-rmse-mean"]].plot()
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c868e7f98>
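
The xgb.cv call above returns a DataFrame with one row per boosting round (trimmed by early stopping), so a rough way to inspect the round count and CV score behind the n_estimators choice below is (a sketch, not part of the original script):

# model here is the DataFrame returned by xgb.cv above.
best_rounds = model.shape[0]                  # rounds kept after early stopping
best_cv_rmse = model["test-rmse-mean"].min()  # best mean test RMSE across folds
print(best_rounds, best_cv_rmse)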
In [25]:
model_xgb = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1) #the params were tuned using xgb.cv
model_xgb.fit(X_train, y)
Out[25]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=360,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
In [26]:
xgb_preds = np.expm1(model_xgb.predict(X_test))
lasso_preds = np.expm1(model_lasso.predict(X_test))
In [27]:
predictions = pd.DataFrame({"xgb":xgb_preds, "lasso":lasso_preds})
predictions.plot(x = "xgb", y = "lasso", kind = "scatter")
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c86896dd8>

It often makes sense to take a weighted average of uncorrelated results; this usually improves the score, although in this case it doesn't help that much.

In [28]:
preds = 0.7*lasso_preds + 0.3*xgb_preds
In [29]:
solution = pd.DataFrame({"id":test.Id, "SalePrice":preds})
solution.to_csv("ridge_sol.csv", index = False)
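
The 0.7/0.3 split above is just one choice; here is a minimal sketch of comparing a few blend weights on a held-out split (the split, the weight grid, and refitting both models on the reduced training set are illustrative assumptions, and the blend is evaluated on the log scale to match the RMSE used elsewhere):

from sklearn.model_selection import train_test_split

# Hold out part of the training data, fit both models on the rest,
# and compare blend weights against the held-out log-scale targets.
X_tr_b, X_val_b, y_tr_b, y_val_b = train_test_split(X_train, y, random_state=3)

lasso_b = LassoCV(alphas=[1, 0.1, 0.001, 0.0005]).fit(X_tr_b, y_tr_b)
xgb_b = xgb.XGBRegressor(n_estimators=360, max_depth=2, learning_rate=0.1).fit(X_tr_b, y_tr_b)

for w in [0.5, 0.6, 0.7, 0.8, 0.9]:
    blend = w * lasso_b.predict(X_val_b) + (1 - w) * xgb_b.predict(X_val_b)
    rmse = np.sqrt(((blend - y_val_b) ** 2).mean())
    print(w, rmse)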

Trying out keras?

Feed-forward neural nets don't seem to work well at all... I'm not sure why.

In [30]:
from keras.layers import Dense
from keras.models import Sequential
from keras.regularizers import l1
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
C:\Python\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
In [31]:
X_train = StandardScaler().fit_transform(X_train)
In [32]:
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y, random_state = 3)
In [33]:
X_tr.shape
Out[33]:
(1095, 288)
In [34]:
X_tr
Out[34]:
array([[ 1.00573733,  0.68066137, -0.46001991, ..., -0.11785113,
         0.4676514 , -0.30599503],
       [-1.12520184,  0.60296111,  0.03113183, ..., -0.11785113,
         0.4676514 , -0.30599503],
       [-1.12520184, -0.02865265, -0.74027492, ..., -0.11785113,
         0.4676514 , -0.30599503],
       ...,
       [ 0.16426234, -0.87075036, -0.81954431, ..., -0.11785113,
        -2.13834494, -0.30599503],
       [ 0.92361154, -0.30038284, -0.44275864, ..., -0.11785113,
         0.4676514 , -0.30599503],
       [ 0.83656519,  1.98505948,  0.46455838, ..., -0.11785113,
         0.4676514 , -0.30599503]])
In [35]:
model = Sequential()
#model.add(Dense(256, activation="relu", input_dim = X_train.shape[1]))
model.add(Dense(1, input_dim = X_train.shape[1], W_regularizer=l1(0.001)))

model.compile(loss = "mse", optimizer = "adam")
C:\Python\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: UserWarning: Update your `Dense` call to the Keras 2 API: `Dense(1, input_dim=288, kernel_regularizer=<keras.reg...)`
  This is separate from the ipykernel package so we can avoid doing imports until
In [36]:
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1)                 289       
=================================================================
Total params: 289
Trainable params: 289
Non-trainable params: 0
_________________________________________________________________
In [37]:
hist = model.fit(X_tr, y_tr, validation_data = (X_val, y_val))
Train on 1095 samples, validate on 365 samples
Epoch 1/1
1095/1095 [==============================] - 0s 185us/step - loss: 147.0254 - val_loss: 149.9535
In [38]:
pd.Series(model.predict(X_val)[:,0]).hist()
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x23c8de79ba8>
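
One thing worth checking is that the fit above ran for only a single epoch (the Keras default), so the large train/validation losses may simply mean the network has not converged yet. A minimal sketch of a longer run with the same model and data (the epoch count and batch size are arbitrary choices):

# Keep training the same one-layer model for more epochs and watch the
# validation loss (this continues from the single-epoch fit above).
hist = model.fit(X_tr, y_tr,
                 validation_data=(X_val, y_val),
                 epochs=50, batch_size=32, verbose=0)

pd.Series(hist.history["val_loss"]).plot(title="validation loss by epoch")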