原文地址：https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard

Stacked Regressions to predict House Prices¶

Serigne¶

July 2017

这个比赛题目对我来说非常重要，因为它帮助我几个月前开始了我的Kaggle之旅。我在这里读过一些很棒的笔记本。仅举几例：

Comprehensive data exploration with Python by Pedro Marcelino : 十分好的数据分析（https://jiangpz.github.io/articles/2018-08/Examining-Your-Data）
A study on Regression applied to the Ames dataset by Julien Cohen-Solal : 彻底的特征工程和对线性回归分析的深入研究，同时对初学者来说非常容易理解。（https://jiangpz.github.io/articles/2018-08/Preprocessing-By-Linear-Regression）
Regularized Linear Models by Alexandru Papiu :关于建模和交叉验证的Kernel（https://jiangpz.github.io/articles/2018-08/Regularized-Linear-Models-Xgboost-Keras）

我十分建议每个初学者都要仔细研究这些内核（当然还要通过其他很多内核）并获得他们在数据科学和Kaggle比赛中的第一次认识。

在那之后（以及一些基本的实践），你会更有信心理解 Human Analog 的https://www.kaggle.com/humananalog/xgboost-lasso 他做了令人印象深刻的功能改造工作。

由于数据集特别方便，我几天前决定回到本次比赛并应用我迄今学到的东西，特别是stacking模型。为此，我们构建了两个stacking类（最简单的方法和不太简单的方法）。

由于这些类是为通用目的而编写的，因此您可以轻松地调整它们和/或扩展它们以应对回归问题。希望这个方法简洁易懂。

特征工程很简单 :

Imputing missing values by proceeding sequentially through the data
Transforming 将一些看似数值型但是是表示种类的做转换
Label Encoding 一些分类变量实际上包含了顺序信息
Box Cox Transformation 通过Box Cox Transformation处理偏态 (而不是做对数变换) : 这在排行榜和交叉验证方面给了我一个稍好的结果。
Getting dummy variables 处理哑变量.

接着我们选择一些基本模型 (主要是 sklearn 基本模型 + sklearn API of DMLC's XGBoost and 微软的 LightGBM), 在 stacking/ensembling 他们之前做交叉验证. 这里的关键是使（线性）模型对异常值具有鲁棒性。这提高了LB排名和交叉验证的结果。

令我惊讶的是，这在LB上很好 ( 0.11420 and top 4% the last time I tested it : July 2, 2017 )

希望当你看完这篇文章，你能理解stacking这个概念，虽然以前我们认为这个概念不容易理解

#import some necessary librairies

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew #for some statistics


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points


from subprocess import check_output
print(check_output(["ls", "input"]).decode("utf8")) #check the files available in the directory

data_description.txt
sample_submission.csv
test.csv
train.csv
train1.csv

#Now let's import and put the train and test datasets in  pandas dataframe

train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')

##display the first five rows of the train dataset.
train.head(5)

##display the first five rows of the test dataset.
test.head(5)

#check the numbers of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#Now drop the  'Id' colum since it's unnecessary for  the prediction process.
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#check again the data size after dropping the 'Id' variable
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))

The train data size before dropping Id feature is : (1460, 81) 
The test data size before dropping Id feature is : (1459, 80) 

The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79)

数据预处理¶

异常值¶

Documentation Ames Housing数据文档中说明训练数据中存在异常值

让我们来探索一下这些异常值

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

我们可以看到，在图的右下角有两个点，表明有面积很大的房子卖了很便宜的价格。这些值是很违反常识的。因此，我们可以安全的删除他们

#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

注意：¶

删除异常值并不总是安全的。我们决定删除这两个值，因为它们值非常大且非常糟糕（价格非常低大面积房子）。

训练数据中可能还有其他异常值。但是，如果测试数据中也存在异常值，则删除所有这些异常值可能会严重影响我们的模型。这就是为什么我们有时候并不将它们全部删除，而是设法使我们的模型更加健壮。你可以参考这个笔记本的建模部分。

目标变量¶

SalePrice 是我们需要预测的变量。所以我们先对这个变量做一些分析。

sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

 mu = 180932.92 and sigma = 79467.79

目标变量是右倾斜的。由于（线性）模型更适用于正态分布的数据，我们需要对此变量做转换以使其更像正态分布。

对目标变量做对数变换

#We use the numpy fuction log1p which  applies log(1+x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])

#Check the new distribution 
sns.distplot(train['SalePrice'] , fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

 mu = 12.02 and sigma = 0.40

现在偏差被纠正了，数据看起来更正态分布。

特征工程¶

我们先把训练数据和测试数据放到一个dataframe中。

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

all_data size is : (2917, 79)

缺失值¶

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

将数据缺失的比例用直方图画出来：

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

Text(0.5,1,'Percent missing data by feature')

数据相关性

#Correlation map to see how features are correlated with SalePrice
corrmat = train.corr()
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, vmax=0.9, square=True)

<matplotlib.axes._subplots.AxesSubplot at 0x2a38d257198>

补全缺失值¶

我们对每个存在缺失值的数据列进行分析，然后对数据进行补全

PoolQC :泳池质量。查看数据可以知道，这里面NA表示没有游泳池。这个很容易理解，因为缺失值的比例很大（+ 99％），而且大多数房屋一般都没有游泳池。

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")

MiscFeature : 房屋的其他附属。NA表示"no misc feature"

all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")

Alley : 小巷类型。NA表示 "no alley access"

all_data["Alley"] = all_data["Alley"].fillna("None")

Fence : 围栏质量。NA表示 "no fence"

all_data["Fence"] = all_data["Fence"].fillna("None")

FireplaceQu :壁炉状态。 NA表示 "no fireplace"

all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")

LotFrontage : 路到房屋的距离。因为同一个街道的房屋到路的距离应该差不多，虽然没有这个房子到路的距离，但是可以用同社区房屋到路的中间值来填充。

#Group by neighborhood and fill in missing value by the median LotFrontage of all the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

GarageType, GarageFinish, GarageQual and GarageCond : 车库位置、车库是否修建完成、车库质量、车库状况。我们用None来填充缺失值。

for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')

GarageYrBlt, GarageArea and GarageCars : 年车库建成时间、车库面积、车库容量。我们用0来填充缺失值

for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)

BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath : 地下室相关属性，如果值缺失，就意味着没有地下室。填0

for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)

BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 : 地下室相关属性，如果值缺失，就意味着没有地下室。填None

for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

MasVnrArea and MasVnrType : 表层砌体面积和表层砌体类型。NA很可能意味着这些房屋没有砌筑饰面。我们可以给区域面积填充0和给类型填充None。

all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)

MSZoning (The general zoning classification) : 一共1460条数据，其中'RL' 出现了1151次。所以我们使用'RL'填充缺失值。

all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])

Utilities : 这个属性中所有的值基本上都是"AllPub"，其中有1个"NoSeWa"和2个NA。而且'NoSewa'在训练集中出现了，在测试集中没有出现。因此该特征无助于预测建模。所有我们可以删掉它

all_data = all_data.drop(['Utilities'], axis=1)

Functional : 在说明文档中，数据缺失就意味着取值应该是"Typ"

all_data["Functional"] = all_data["Functional"].fillna("Typ")

Electrical : 电气系统。只有一个缺失值，所以我们就用出现出现次数最多的值填充它。

all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])

KitchenQual: 厨房的品质。和Electrical一样，只有一个缺失值，所以我们就用出现出现次数最多的值填充它。

all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])

Exterior1st and Exterior2nd : 外墙面1、外墙面2.和上面一样

all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])

SaleType : 销售类型和上面一样

all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

MSSubClass : 数据缺失可能表示它不是建筑类型。我们用None填充。

all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

我们检查一下是否还有其他的缺失值

#Check remaining missing values if any 
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()

这意味着没有缺失值了。

训练模型¶

引入依赖库

from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

定义一个交叉验证的策略

由于这个函数没有考虑随机性，所以我们在这自强增加一行代码，在交叉验证之前把数据打乱我们使用Sklearn的cross_val_score来做K折交叉验证。

#Validation function
n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse= np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv = kf))
    return(rmse)

基础模型¶

LASSO Regression : L1范数正则化（Lasso回归）的线性回归

该模型可能对异常值非常敏感。所以我们需要让它们更加健壮。所以我们对pipeline使用sklearn的Robustscaler()方法

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

Elastic Net Regression : L1和L2正则化的线性回归

同上

ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))

Kernel Ridge Regression :L2正则线性回归

KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

Gradient Boosting Regression :背景梯度提升回归

为了使程序更健壮，我们使用huber loss

GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state =5)

XGBoost :

model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)

LightGBM :

model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)

基础模型评分¶

让我们通过评估交叉验证的rmsle来了解这些基本模型的运行表现

score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Lasso score: 0.1115 (0.0074)

score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

ElasticNet score: 0.1116 (0.0074)

score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Kernel Ridge score: 0.1153 (0.0075)

score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Gradient Boosting score: 0.1177 (0.0080)

score = rmsle_cv(model_xgb)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Xgboost score: 0.1165 (0.0054)

score = rmsle_cv(model_lgb)
print("LGBM score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

LGBM score: 0.1162 (0.0071)

Stacking models¶

最简单的Stacking模型探索：基础模型平均¶

我们从这种平均基本模型的简单方法开始。我们构建了一个新类来扩展scikit-learn与我们的模型，并且还包括封装和代码重用(inheritance)

Averaged base models class

class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models
        
    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)

        return self
    
    #Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)

Averaged base models score

我们只使用 ENet, GBoost, KRR and lasso这四个模型做平均。当然，我们可以轻松的增加模型。

averaged_models = AveragingModels(models = (ENet, GBoost, KRR, lasso))

score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

 Averaged base models score: 0.1091 (0.0075)

哇！似乎最简单的Stacking模型确实提高了分数。这鼓励了我们要进一步探索一个不那么简单的堆叠方法。

不太简单的Stacking模型：增加一个元模型（Meta-model）¶

在这种方法中，我们在平均基础模型上增加一个元模型，用这些基本模型的剩余的一折的预测值来训练我们的元模型

训练部分的程序符合下面的描述：

将整个训练集分成两个不相交的集（这里是train和.holdout）
在第一部分（train）数据上训练几个基础模型
在第二部分（holdout）数据上测试（预测、打分）这些基础模型
使用来自3）的预测(称为out-of-folds predictions) 作为输入, and the correct responses (目标变量) 作为输出来训练更高级别的学习器称为元模型meta-model.

前三个步骤是迭代完成的。如果我们使用5折举例子，我们首先将训练数据分成5份。然后我们将进行5次迭代。在每次迭代中，我们使用4份数据训练基础模型，然后预测剩下的一份数据(holdout fold)。

因此，经过5次迭代后，整个数据集都被循环预测了，在第四步，我们使用它作为新的特征来训练我们的元模型。

对于预测部分，我们对测试数据上所有基础模型的预测进行平均，并将它们用作元特征meta-features，最终预测是使用元模型完成的。

Faron

(Image taken from Faron)

kaz

Gif taken from KazAnova's interview

在这个gif上，基本模型是算法0,1,2，元模型是算法3.整个训练数据集是A + B（目标变量y已知），我们可以分成训练部分（A）和holdout部分（B）。测试数据集是C.

B1（来自holdout部分的预测值）是用于训练元模型3的新特征，并且C1（其是来自测试数据集的预测）是完成最终预测的元特征。

Stacking averaged Models Class

class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds
   
    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        
        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
                
        # Now train the cloned  meta-model using the out-of-fold predictions as new feature
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self
   
    #Do the predictions of all base models on the test data and use the averaged predictions as 
    #meta-features for the final prediction which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_ ])
        return self.meta_model_.predict(meta_features)

Stacking Averaged models Score

To make the two approaches comparable (by using the same number of models) , we just average Enet KRR and Gboost, then we add lasso as meta-model. 为了使两种方法具有可比性（通过使用相同数量的模型），我们使用Enet KRR 和 Gboost作为基础模型，增加lasso作为元模型。

stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR),
                                                 meta_model = lasso)

score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))

Stacking Averaged models score: 0.1085 (0.0074)

通过添加元学习器，我们再次获得更好的分数

集成 StackedRegressor, XGBoost 和 LightGBM¶

我们增加 XGBoost and LightGBM到前面定义的StackedRegressor上。

我们首先定义一个rmsle评估函数

def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

最终的训练与预测¶

StackedRegressor:

stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))

0.07815719379164107

XGBoost:

model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))

0.0815823282299117

LightGBM:

model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))

0.07307464036005418

'''RMSE on the entire Train data when averaging'''

print('RMSLE score on train data:')
print(rmsle(y_train,stacked_train_pred*0.70 +
               xgb_train_pred*0.15 + lgb_train_pred*0.15 ))

RMSLE score on train data:
0.07586557397142984

Ensemble prediction:

ensemble = stacked_pred*0.70 + xgb_pred*0.15 + lgb_pred*0.15

Submission

sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission.csv',index=False)

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.000	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.000	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.000	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.000	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.000	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1461	20	RH	80.000	11622	Pave	NaN	Reg	Lvl	AllPub	...	120	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1	1462	20	RL	81.000	14267	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
2	1463	60	RL	74.000	13830	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
3	1464	60	RL	78.000	9978	Pave	NaN	IR1	Lvl	AllPub	...	0	NaN	NaN	NaN	0	6	2010	WD	Normal
4	1465	120	RL	43.000	5005	Pave	NaN	IR1	HLS	AllPub	...	144	NaN	NaN	NaN	0	1	2010	WD	Normal

	Skew
MiscVal	21.940
PoolArea	17.689
LotArea	13.109
LowQualFinSF	12.085
3SsnPorch	11.372
LandSlope	4.973
KitchenAbvGr	4.301
BsmtFinSF2	4.145
EnclosedPorch	4.002
ScreenPorch	3.945

Stacked Regressions-Top 4% on LeaderBoard

Stacked Regressions to predict House Prices¶

Serigne¶

数据预处理¶

异常值¶

注意：¶

目标变量¶

特征工程¶

缺失值¶

补全缺失值¶

更多特征工程¶

训练模型¶

基础模型¶

基础模型评分¶

Stacking models¶

最简单的Stacking模型探索：基础模型平均¶

不太简单的Stacking模型：增加一个元模型（Meta-model）¶

集成 StackedRegressor, XGBoost 和 LightGBM¶

最终的训练与预测¶

	Missing Ratio
PoolQC	99.691
MiscFeature	96.400
Alley	93.212
Fence	80.425
FireplaceQu	48.680
LotFrontage	16.661
GarageQual	5.451
GarageCond	5.451
GarageFinish	5.451
GarageYrBlt	5.451
GarageType	5.382
BsmtExposure	2.811
BsmtCond	2.811
BsmtQual	2.777
BsmtFinType2	2.743
BsmtFinType1	2.708
MasVnrType	0.823
MasVnrArea	0.788
MSZoning	0.137
BsmtFullBath	0.069

Stacked Regressions to predict House Prices¶

Serigne¶

数据预处理¶

异常值¶

注意：¶

目标变量¶

特征工程¶

缺失值¶

补全缺失值¶

更多特征工程¶

训练模型¶

基础模型¶

基础模型评分¶

Stacking models¶

最简单的Stacking模型探索：基础模型平均¶

不太简单的Stacking模型：增加一个元模型（Meta-model）¶

集成 StackedRegressor, XGBoost 和 LightGBM¶

最终的训练与预测¶

Share on: