Stacked Regressions-Top 4% on LeaderBoard

原文地址:https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

Stacked Regressions to predict House Prices

Serigne

July 2017

这个比赛题目对我来说非常重要,因为它帮助我几个月前开始了我的Kaggle之旅。我在这里读过一些很棒的笔记本。仅举几例:

  1. Comprehensive data exploration with Python by Pedro Marcelino : 十分好的数据分析(https://jiangpz.github.io/articles/2018-08/Examining-Your-Data)

  2. A study on Regression applied to the Ames dataset by Julien Cohen-Solal : 彻底的特征工程和对线性回归分析的深入研究,同时对初学者来说非常容易理解。(https://jiangpz.github.io/articles/2018-08/Preprocessing-By-Linear-Regression)

  3. Regularized Linear Models by Alexandru Papiu :关于建模和交叉验证的Kernel(https://jiangpz.github.io/articles/2018-08/Regularized-Linear-Models-Xgboost-Keras)

我十分建议每个初学者都要仔细研究这些内核(当然还要通过其他很多内核)并获得他们在数据科学和Kaggle比赛中的第一次认识。

在那之后(以及一些基本的实践),你会更有信心理解 Human Analoghttps://www.kaggle.com/humananalog/xgboost-lasso 他做了令人印象深刻的功能改造工作。

由于数据集特别方便,我几天前决定回到本次比赛并应用我迄今学到的东西,特别是stacking模型。为此,我们构建了两个stacking类(最简单的方法和不太简单的方法)。

由于这些类是为通用目的而编写的,因此您可以轻松地调整它们和/或扩展它们以应对回归问题。希望这个方法简洁易懂。

特征工程很简单 :

  • Imputing missing values by proceeding sequentially through the data

  • Transforming 将一些看似数值型但是是表示种类的做转换

  • Label Encoding 一些分类变量实际上包含了顺序信息

  • Box Cox Transformation 通过Box Cox Transformation处理偏态 (而不是做对数变换) : 这在排行榜和交叉验证方面给了我一个稍好的结果。

  • Getting dummy variables 处理哑变量.

接着我们选择一些基本模型 (主要是 sklearn 基本模型 + sklearn API of DMLC's XGBoost and 微软的 LightGBM), 在 stacking/ensembling 他们之前做交叉验证. 这里的关键是使(线性)模型对异常值具有鲁棒性。这提高了LB排名和交叉验证的结果。

令我惊讶的是,这在LB上很好 ( 0.11420 and top 4% the last time I tested it : July 2, 2017 )

希望当你看完这篇文章,你能理解stacking这个概念,虽然以前我们认为这个概念不容易理解

In [1]:
#import some necessary librairies

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew #for some statistics


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points


from subprocess import check_output
print(check_output(["ls", "input"]).decode("utf8")) #check the files available in the directory
data_description.txt
sample_submission.csv
test.csv
train.csv
train1.csv

In [2]:
#Now let's import and put the train and test datasets in  pandas dataframe

train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')
In [3]:
##display the first five rows of the train dataset.
train.head(5)
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.000 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.000 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.000 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.000 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.000 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
##display the first five rows of the test dataset.
test.head(5)
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.000 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.000 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.000 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.000 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.000 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

In [5]:
#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the  'Id' colum since it's unnecessary for  the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))
The train data size before dropping Id feature is : (1460, 81) 
The test data size before dropping Id feature is : (1459, 80) 

The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79) 

数据预处理

异常值

Documentation Ames Housing数据文档中说明训练数据中存在异常值

让我们来探索一下这些异常值

In [6]:
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

我们可以看到,在图的右下角有两个点,表明有面积很大的房子卖了很便宜的价格。这些值是很违反常识的。 因此,我们可以安全的删除他们

In [7]:
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

注意:

删除异常值并不总是安全的。我们决定删除这两个值,因为它们值非常大且非常糟糕(价格非常低大面积房子)。

训练数据中可能还有其他异常值。 但是,如果测试数据中也存在异常值,则删除所有这些异常值可能会严重影响我们的模型。 这就是为什么我们有时候并不将它们全部删除,而是设法使我们的模型更加健壮。 你可以参考这个笔记本的建模部分。

目标变量

SalePrice 是我们需要预测的变量。 所以我们先对这个变量做一些分析。

In [8]:
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
 mu = 180932.92 and sigma = 79467.79

目标变量是右倾斜的。 由于(线性)模型更适用于正态分布的数据,我们需要对此变量做转换以使其更像正态分布。

对目标变量做对数变换

In [9]:
#We use the numpy fuction log1p which  applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
 mu = 12.02 and sigma = 0.40

现在偏差被纠正了,数据看起来更正态分布。

特征工程

我们先把训练数据和测试数据放到一个dataframe中。

In [10]:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
all_data size is : (2917, 79)

缺失值

In [11]:
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Out[11]:
Missing Ratio
PoolQC 99.691
MiscFeature 96.400
Alley 93.212
Fence 80.425
FireplaceQu 48.680
LotFrontage 16.661
GarageQual 5.451
GarageCond 5.451
GarageFinish 5.451
GarageYrBlt 5.451
GarageType 5.382
BsmtExposure 2.811
BsmtCond 2.811
BsmtQual 2.777
BsmtFinType2 2.743
BsmtFinType1 2.708
MasVnrType 0.823
MasVnrArea 0.788
MSZoning 0.137
BsmtFullBath 0.069

将数据缺失的比例用直方图画出来:

In [12]:
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
Out[12]:
Text(0.5,1,'Percent missing data by feature')

数据相关性

In [13]:
#Correlation map to see how features are correlated with SalePrice
corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x2a38d257198>

补全缺失值

我们对每个存在缺失值的数据列进行分析,然后对数据进行补全

  • PoolQC :泳池质量。 查看数据可以知道,这里面NA表示没有游泳池。这个很容易理解,因为缺失值的比例很大(+ 99%),而且大多数房屋一般都没有游泳池。
In [14]:
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
  • MiscFeature : 房屋的其他附属。NA表示"no misc feature"
In [15]:
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
  • Alley : 小巷类型。NA表示 "no alley access"
In [16]:
all_data["Alley"] = all_data["Alley"].fillna("None")
  • Fence : 围栏质量。NA表示 "no fence"
In [17]:
all_data["Fence"] = all_data["Fence"].fillna("None")
  • FireplaceQu :壁炉状态。 NA表示 "no fireplace"
In [18]:
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
  • LotFrontage : 路到房屋的距离。因为同一个街道的房屋到路的距离应该差不多,虽然没有这个房子到路的距离,但是可以用同社区房屋到路的中间值来填充。
In [19]:
#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
  • GarageType, GarageFinish, GarageQual and GarageCond : 车库位置、车库是否修建完成、车库质量、车库状况。我们用None来填充缺失值。
In [20]:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
  • GarageYrBlt, GarageArea and GarageCars : 年车库建成时间、车库面积、车库容量。 我们用0来填充缺失值
In [21]:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
  • BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath : 地下室相关属性,如果值缺失,就意味着没有地下室。填0
In [22]:
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
  • BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 : 地下室相关属性,如果值缺失,就意味着没有地下室。填None
In [23]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
  • MasVnrArea and MasVnrType : 表层砌体面积和表层砌体类型。NA很可能意味着这些房屋没有砌筑饰面。我们可以给区域面积填充0和给类型填充None。
In [24]:
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
  • MSZoning (The general zoning classification) : 一共1460条数据,其中'RL' 出现了1151次。所以我们使用'RL'填充缺失值。
In [25]:
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
  • Utilities : 这个属性中所有的值基本上都是"AllPub",其中有1个"NoSeWa"和2个NA。而且'NoSewa'在训练集中出现了,在测试集中没有出现。因此该特征无助于预测建模。所有我们可以删掉它
In [26]:
all_data = all_data.drop(['Utilities'], axis=1)
  • Functional : 在说明文档中,数据缺失就意味着取值应该是"Typ"
In [27]:
all_data["Functional"] = all_data["Functional"].fillna("Typ")
  • Electrical : 电气系统。只有一个缺失值,所以我们就用出现出现次数最多的值填充它。
In [28]:
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
  • KitchenQual: 厨房的品质。和Electrical一样,只有一个缺失值,所以我们就用出现出现次数最多的值填充它。
In [29]:
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
  • Exterior1st and Exterior2nd : 外墙面1、外墙面2.和上面一样
In [30]:
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
  • SaleType : 销售类型和上面一样
In [31]:
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
  • MSSubClass : 数据缺失可能表示它不是建筑类型。我们用None填充。
In [32]:
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

我们检查一下是否还有其他的缺失值

In [33]:
#Check remaining missing values if any 
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()
Out[33]:
Missing Ratio

这意味着没有缺失值了。

更多特征工程

将一些数值型但是实际表示的是类型的特征做变换

In [34]:
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

LabelEncoder可以将标签分配一个0—n_classes-1之间的编码 有些特征虽然表示种类,但是包含了顺序信息,使用LabelEncoder处理这些特征。

In [35]:
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape        
print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 78)

增加一个重要的特征

由于面积相关特征对于确定房价非常重要,我们还增加了一个特征,即每个房屋的地下室总面积加一楼和二楼面积

In [36]:
# Adding total sqfootage feature 
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

倾斜的特征(非正态分布的特征)

In [37]:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
Skew in numerical features: 

Out[37]:
Skew
MiscVal 21.940
PoolArea 17.689
LotArea 13.109
LowQualFinSF 12.085
3SsnPorch 11.372
LandSlope 4.973
KitchenAbvGr 4.301
BsmtFinSF2 4.145
EnclosedPorch 4.002
ScreenPorch 3.945

对(十分)倾斜特征做Box-Cox变换

Box-Cox变换是Box和Cox在1964年提出的一种广义幂变换方法,是统计建模中常用的一种数据变换,用于连续的响应变量不满足正态分布的情况。Box-Cox变换之后,可以一定程度上减小不可观测的误差和预测变量的相关性。Box-Cox变换的主要特点是引入一个参数,通过数据本身估计该参数进而确定应采取的数据变换形式,Box-Cox变换可以明显地改善数据的正态性、对称性和方差相等性,对许多实际数据都是行之有效的

我们使用scipy中的boxcox1p函数,它可以计算$1 + x$的Box-Cox变换

y = ((1+x)**lmbda - 1) / lmbda  if lmbda != 0
    log(1+x)             if lmbda == 0

从上面的公式我们可以看出,当 $ \lambda = 0 $ 时,它和我们上面对目标变量做的对数变换一样。

我们可以从下面两个页面中知道更多关于Box-Cox变换的细节this page and the scipy function's page

In [38]:
skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
    
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
There are 59 skewed numerical features to Box Cox transform

对哑变量编码

pandas使用get_dummies进行one-hot编码

In [39]:
all_data = pd.get_dummies(all_data)
print(all_data.shape)
(2917, 220)

获取新的训练集和测试集

In [40]:
train = all_data[:ntrain]
test = all_data[ntrain:]

训练模型

引入依赖库

In [41]:
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

定义一个交叉验证的策略

由于这个函数没有考虑随机性,所以我们在这自强增加一行代码,在交叉验证之前把数据打乱 我们使用Sklearn的cross_val_score来做K折交叉验证。

In [42]:
#Validation function
n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

基础模型

  • LASSO Regression : L1范数正则化(Lasso回归)的线性回归

该模型可能对异常值非常敏感。 所以我们需要让它们更加健壮。所以我们对pipeline使用sklearn的Robustscaler()方法

In [43]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
  • Elastic Net Regression : L1和L2正则化的线性回归

同上

In [44]:
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
  • Kernel Ridge Regression :L2正则线性回归
In [45]:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
  • Gradient Boosting Regression :背景梯度提升回归

为了使程序更健壮,我们使用huber loss

In [46]:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)
  • XGBoost :
In [47]:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
  • LightGBM :
In [48]:
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

基础模型评分

让我们通过评估交叉验证的rmsle来了解这些基本模型的运行表现

In [49]:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1115 (0.0074)

In [50]:
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
ElasticNet score: 0.1116 (0.0074)

In [51]:
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Kernel Ridge score: 0.1153 (0.0075)

In [52]:
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Gradient Boosting score: 0.1177 (0.0080)

In [53]:
score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Xgboost score: 0.1165 (0.0054)

In [54]:
score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))
LGBM score: 0.1162 (0.0071)

Stacking models

最简单的Stacking模型探索:基础模型平均

我们从这种平均基本模型的简单方法开始。 我们构建了一个新类来扩展scikit-learn与我们的模型,并且还包括封装和代码重用(inheritance)

Averaged base models class

In [55]:
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)   

Averaged base models score

我们只使用 ENet, GBoost, KRR and lasso这四个模型做平均。当然,我们可以轻松的增加模型。

In [56]:
averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
 Averaged base models score: 0.1091 (0.0075)

哇 ! 似乎最简单的Stacking模型确实提高了分数。 这鼓励了我们要进一步探索一个不那么简单的堆叠方法。

不太简单的Stacking模型:增加一个元模型(Meta-model)

在这种方法中,我们在平均基础模型上增加一个元模型,用这些基本模型的剩余的一折的预测值来训练我们的元模型

训练部分的程序符合下面的描述:

  1. 将整个训练集分成两个不相交的集(这里是train和.holdout

  2. 在第一部分(train)数据上训练几个基础模型

  3. 在第二部分(holdout)数据上测试(预测、打分)这些基础模型

  4. 使用来自3)的预测(称为out-of-folds predictions) 作为输入, and the correct responses (目标变量) 作为输出来训练更高级别的学习器称为元模型meta-model.

前三个步骤是迭代完成的。如果我们使用5折举例子,我们首先将训练数据分成5份。然后我们将进行5次迭代。在每次迭代中,我们使用4份数据训练基础模型,然后预测剩下的一份数据(holdout fold)。

因此,经过5次迭代后,整个数据集都被循环预测了,在第四步,我们使用它作为新的特征来训练我们的元模型。

对于预测部分,我们对测试数据上所有基础模型的预测进行平均,并将它们用作元特征meta-features,最终预测是使用元模型完成的。

Faron

(Image taken from Faron)

kaz

Gif taken from KazAnova's interview

在这个gif上,基本模型是算法0,1,2,元模型是算法3.整个训练数据集是A + B(目标变量y已知),我们可以分成训练部分(A)和holdout部分(B)。 测试数据集是C.

B1(来自holdout部分的预测值)是用于训练元模型3的新特征,并且C1(其是来自测试数据集的预测)是完成最终预测的元特征。

Stacking averaged Models Class

In [57]:
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

Stacking Averaged models Score

To make the two approaches comparable (by using the same number of models) , we just average Enet KRR and Gboost, then we add lasso as meta-model. 为了使两种方法具有可比性(通过使用相同数量的模型),我们使用Enet KRR 和 Gboost作为基础模型,增加lasso作为元模型。

In [58]:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
Stacking Averaged models score: 0.1085 (0.0074)

通过添加元学习器,我们再次获得更好的分数

集成 StackedRegressor, XGBoost 和 LightGBM

我们增加 XGBoost and LightGBM到前面定义的StackedRegressor上。

我们首先定义一个rmsle评估函数

In [59]:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

最终的训练与预测

StackedRegressor:

In [60]:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))
0.07815719379164107

XGBoost:

In [61]:
model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))
0.0815823282299117

LightGBM:

In [62]:
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))
0.07307464036005418
In [63]:
'''RMSE on the entire Train data when averaging'''

print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
               xgb_train_pred*0.15 + lgb_train_pred*0.15 ))
RMSLE score on train data:
0.07586557397142984

Ensemble prediction:

In [64]:
ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

Submission

In [65]:
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv',index=False)