A first look at Kaggle: Titanic survival prediction
2018-03-18 15:41
Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.
Takeaway from the submissions: the ensemble model showed no clear improvement over the individual models, likely because the base models are strongly correlated, the training data is limited, and the one-hot encoding may introduce collinearity. Although the training and hold-out scores are close, the leaderboard score drops noticeably, probably due to too little data, insufficient training, and few, strongly correlated features; adding more features is worth considering.
Titanic for Machine Learning
Imports and data loading

# data processing
import numpy as np
import pandas as pd
import re

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

train = pd.read_csv('D:/data/titanic/train.csv')
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
The features are:
PassengerId: an identifier with no predictive meaning.
Pclass: ticket class. Does it affect survival? Did the higher classes have better chances?
Name: can help us infer sex and approximate age.
Sex: was the survival rate higher for women?
Age: did different age groups survive at different rates?
SibSp and Parch: counts of siblings/spouses and parents/children aboard. Did traveling with relatives raise or lower the survival rate?
Fare: did a higher fare buy a better chance?
Cabin and Embarked: cabin number and port of embarkation; intuitively these should not affect survival.
train.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train.describe(include=['O'])  # 'O' selects the object (categorical) columns
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | male | 1601 | C23 C25 C27 | S |
freq | 1 | 577 | 7 | 4 | 644 |
The target feature: Survived
survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0,0.1], autopct='%1.1f%%', labels=['died','survived'], shadow=True)
plt.show()

x = [0, 1]
plt.bar(x, survive_num, width=0.35)
plt.xticks(x, ('died', 'survived'))
plt.show()
Feature analysis
num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f] == 'object']
print('there are %d numerical features:' % len(num_f), num_f)
print('there are %d category features:' % len(cat_f), cat_f)
there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Feature types:
- numerical
- categorical: ordinal or non-ordinal
- non-ordinal categorical: Sex, Embarked
Categorical features
Sex
train.groupby(['Sex'])['Survived'].count()
Sex
female 314
male 577
Name: Survived, dtype: int64
f, ax = plt.subplots(figsize=(8,6))
fig = sns.countplot(x='Sex', hue='Survived', data=train)
fig.set_title('Sex: Survived vs Dead')
plt.show()

# survival rate per sex; equivalent to train.groupby('Sex')['Survived'].mean()
train.groupby(['Sex'])['Survived'].sum() / train.groupby(['Sex'])['Survived'].count()
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
There were far more men than women aboard, yet women survived at about 74%, far above the men's roughly 19%. Sex is clearly an important feature.
Embarked
sns.factorplot('Embarked', 'Survived', data=train)
plt.show()

f, ax = plt.subplots(1, 3, figsize=(24,6))
sns.countplot('Embarked', data=train, ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked', hue='Survived', data=train, ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot('Embarked', hue='Pclass', data=train, ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()

#pd.pivot_table(train, index='Embarked', columns='Pclass', values='Fare')
sns.boxplot(x='Embarked', y='Fare', hue='Pclass', data=train)
plt.show()
The plots show that most passengers boarded at port S, the majority of them in class 3, though S also has the most class 1 passengers of the three ports. Port C has the highest survival rate, about 0.55, because a larger share of its passengers are in class 1; port Q is almost entirely class 3. The mean fares for classes 1 and 2 at port C are higher, which may hint at higher social status. Logically, though, the embarkation port should not affect survival, so it can be converted to dummy variables or dropped.
Pclass
train.groupby('Pclass')['Survived'].value_counts()
Pclass Survived
1 1 136
0 80
2 0 97
1 87
3 0 372
1 119
Name: Survived, dtype: int64
plt.subplots(figsize=(8,6))
f = sns.countplot('Pclass', hue='Survived', data=train)

sns.factorplot('Pclass', 'Survived', hue='Sex', data=train)
plt.show()
Classes 1 and 2 survived at clearly higher rates: more than half of class 1 survived, class 2 is roughly even, and women in classes 1 and 2 are close to 100%. Ticket class therefore has a strong effect on survival.
SibSp
train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SibSp | Survived | |
---|---|---|
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
sns.factorplot('SibSp', 'Survived', data=train)
plt.show()

#pd.pivot_table(train, values='Survived', index='SibSp', columns='Pclass')
sns.countplot(x='SibSp', hue='Pclass', data=train)
plt.show()
With no siblings or spouse aboard, the survival rate is about 0.35; with one companion it peaks above 0.5, perhaps because such passengers are more often in classes 1 and 2. The rate then falls as the count grows, mainly because passengers with more than three companions are mostly in class 3, where such large groups rarely survived.
Parch
#pd.pivot_table(train, values='Survived', index='Parch', columns='Pclass')
sns.countplot(x='Parch', hue='Pclass', data=train)
plt.show()

sns.factorplot('Parch', 'Survived', data=train)
plt.show()
The trend resembles SibSp: traveling alone lowers survival, having 1-3 parents/children raises it, and it drops sharply beyond that, since most large families are in class 3.
Age
train.groupby('Survived')['Age'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Survived | ||||||||
0 | 424.0 | 30.626179 | 14.172110 | 1.00 | 21.0 | 28.0 | 39.0 | 74.0 |
1 | 290.0 | 28.343690 | 14.950952 | 0.42 | 19.0 | 28.0 | 36.0 | 80.0 |
f, ax = plt.subplots(1, 2, figsize=(16,6))
sns.violinplot('Pclass', 'Age', hue='Survived', data=train, split=True, ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot('Sex', 'Age', hue='Survived', data=train, split=True, ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()
In class 1 the rescued skew younger overall, but survival spans a wide age range, with ages roughly 20-50 surviving relatively well, perhaps because class 1 passengers are older overall. Children around 10 show a clear survival boost in classes 2 and 3, and likewise among males. Surviving women are concentrated in young and middle adulthood, while passengers aged roughly 20-40 account for the most deaths.
Name
The main uses of Name are to distinguish sex and to fill missing ages using the mean age of passengers sharing the same title.

# use a regular expression to pull the title out of each name
def getTitle(data):
    name_sal = []
    for i in range(len(data['Name'])):
        name_sal.append(re.findall(r'.\w*\.', data.Name[i]))
    Salut = []
    for i in range(len(name_sal)):
        name = str(name_sal[i])
        name = name[1:-1].replace("'", "")
        name = name.replace(".", "").strip()
        name = name.replace(" ", "")
        Salut.append(name)
    data['Title'] = Salut

getTitle(train)
train.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
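The loop-plus-regex routine above works, but pandas' vectorized string methods express the same idea more directly. A minimal sketch (assuming names follow the `Last, Title. First` pattern, as in this dataset):

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])
# Grab the word preceding the first ".", i.e. the honorific.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

Applied to the full Name column this yields the same Title values without the intermediate lists.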
pd.crosstab(train['Title'],train['Sex'])
Sex | female | male |
---|---|---|
Title | ||
Capt | 0 | 1 |
Col | 0 | 2 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dr | 1 | 6 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 40 |
Miss | 182 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 517 |
Mrs | 124 | 0 |
Mrs,L | 1 | 0 |
Ms | 1 | 0 |
Rev | 0 | 6 |
Sir | 0 | 1 |
A quick vocabulary note on the rarer titles: Mme (Madame) is the non-English equivalent of Mrs, also used for professional women; Jonkheer is a Dutch honorific for landed gentry; Capt is a captain; Lady is a noblewoman; Don is a Spanish honorific for nobles and men of standing; the Countess is a countess; Ms (or Mz) denotes a woman of unspecified marital status; Col is a colonel; Major is a major; Mlle (Mademoiselle) corresponds to Miss; Rev is a reverend.

Fare
train.groupby('Pclass')['Fare'].mean()
Pclass
1 84.154687
2 20.662183
3 13.675550
Name: Fare, dtype: float64
sns.distplot(train['Fare'].dropna())
plt.xlim((0,200))
plt.xticks(np.arange(0,200,10))
plt.show()
Preliminary conclusions:
- Women survived at a much higher rate than men.
- First class survived at a high rate and third class at a low one; women in classes 1 and 2 are close to 100%.
- Children around age 10 show a clear survival boost.
- SibSp and Parch behave similarly: traveling alone lowers survival, 1-2 siblings/spouses or 1-3 parents/children raise it, and larger groups fare much worse.
- Name and Age can be processed for all rows: extract the title from Name, then fill missing ages with the per-title mean.
Data processing
# combine the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train, test], keys=["train","test"])
all_data.shape
#all_data.head()
(1309, 13)
# count missing values
NAs = pd.concat([train.isnull().sum(),
                 train.isnull().sum()/train.isnull().count(),
                 test.isnull().sum(),
                 test.isnull().sum()/test.isnull().count()],
                axis=1, keys=["train","percent_train","test","percent"])
NAs[NAs.sum(axis=1)>1].sort_values(by="percent", ascending=False)
train | percent_train | test | percent | |
---|---|---|---|---|
Cabin | 687 | 0.771044 | 327.0 | 0.782297 |
Age | 177 | 0.198653 | 86.0 | 0.205742 |
Fare | 0 | 0.000000 | 1.0 | 0.002392 |
Embarked | 2 | 0.002245 | 0.0 | 0.000000 |
# drop features with no predictive value (Cabin is mostly missing)
all_data.drop(['PassengerId','Cabin'], axis=1, inplace=True)

all_data.head(2)
Age | Embarked | Fare | Name | Parch | Pclass | Sex | SibSp | Survived | Ticket | Title | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 22.0 | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | male | 1 | 0.0 | A/5 21171 | Mr |
1 | 38.0 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 1 | female | 1 | 1.0 | PC 17599 | Mrs |
Handling Age
# first extract the titles from Name
getTitle(all_data)

pd.crosstab(all_data['Title'], all_data['Sex'])
Sex | female | male |
---|---|---|
Title | ||
Capt | 0 | 1 |
Col | 0 | 4 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dona | 1 | 0 |
Dr | 1 | 7 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 61 |
Miss | 260 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 757 |
Mrs | 196 | 0 |
Mrs,L | 1 | 0 |
Ms | 2 | 0 |
Rev | 0 | 8 |
Sir | 0 | 1 |
all_data['Title'] = all_data['Title'].replace(['Lady','Dr','Dona','Mme','Countess'], 'Mrs')
all_data['Title'] = all_data['Title'].replace('Mlle', 'Miss')
all_data['Title'] = all_data['Title'].replace('Mrs,L', 'Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
#all_data['Title'] = all_data['Title'].replace('Mme', 'Mrs')
all_data['Title'] = all_data['Title'].replace(['Capt','Col','Don','Major','Rev','Jonkheer','Sir'], 'Mr')
'''
equivalent one-step mapping:
all_data['Title'] = all_data.Title.replace({'Mlle':'Miss','Mme':'Mrs','Ms':'Miss','Dr':'Mrs',
                                            'Major':'Mr','Lady':'Mrs','Countess':'Mrs',
                                            'Jonkheer':'Mr','Col':'Mr','Rev':'Mr',
                                            'Capt':'Mr','Sir':'Mr','Don':'Mr','Mrs,L':'Mrs'})
'''
all_data.Title.isnull().sum()
0
all_data[:train.shape[0]].groupby('Title')['Age'].mean()
Title
Master 4.574167
Miss 21.845638
Mr 32.891990
Mrs 36.188034
Name: Age, dtype: float64
# fill missing ages with the (rounded) mean age for the matching title in the training set
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mr'), 'Age'] = 32
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mrs'), 'Age'] = 36
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Master'), 'Age'] = 5
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Miss'), 'Age'] = 22
#all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='other'), 'Age'] = 46
all_data.Age.isnull().sum()
0
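Hard-coding the rounded means (32, 36, 5, 22) ties the fill values to one run of the data. An equivalent pattern, sketched here on toy data, computes each group's mean on the fly with `groupby(...).transform`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Miss", "Miss"],
    "Age": [30.0, np.nan, 20.0, np.nan],
})
# Fill each missing Age with the mean Age of rows sharing the same Title.
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("mean"))
print(df["Age"].tolist())  # [30.0, 30.0, 20.0, 20.0]
```

To avoid leaking test-set information, the group means could be computed on the training slice only, as the manual version above does.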
all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
Title | Survived | |
---|---|---|
0 | Master | 0.575000 |
1 | Miss | 0.702703 |
2 | Mr | 0.158192 |
3 | Mrs | 0.777778 |
f, ax = plt.subplots(1, 2, figsize=(16,6))
tr = all_data[:train.shape[0]]  # training slice
sns.distplot(tr.loc[tr.Sex=='female', 'Age'], color='red', ax=ax[0])
sns.distplot(tr.loc[tr.Sex=='male', 'Age'], color='blue', ax=ax[0])
sns.distplot(tr.loc[tr.Survived==0, 'Age'], color='red', label='Not Survived', ax=ax[1])
sns.distplot(tr.loc[tr.Survived==1, 'Age'], color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()
Children up to about 16 survived at a higher rate, and the oldest passenger (80) survived.
A large number of passengers aged 16-40 did not survive.
Most passengers are between 16 and 40.
To aid classification, bin the ages into a new feature and add a child indicator.
add isChild
def male_female_child(passenger):
    # unpack age and sex
    age, sex = passenger
    # flag children as their own category
    if age < 16:
        return 'child'
    else:
        return sex

# create the new feature
all_data['person'] = all_data[['Age','Sex']].apply(male_female_child, axis=1)

# ages run 0-80; split into three bands: young, adult, senior
all_data['Age_band'] = 0
all_data.loc[all_data['Age']<=16, 'Age_band'] = 0
all_data.loc[(all_data['Age']>16) & (all_data['Age']<=40), 'Age_band'] = 1
all_data.loc[all_data['Age']>40, 'Age_band'] = 2
Handling Name
df = pd.get_dummies(all_data['Title'], prefix='Title')
all_data = pd.concat([all_data, df], axis=1)

all_data.drop('Title', axis=1, inplace=True)

# drop Name
all_data.drop('Name', axis=1, inplace=True)
Fill missing Embarked
all_data.loc[all_data.Embarked.isnull()]
Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived | Ticket | Title | person | Age_band | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 61 | 38.0 | NaN | 80.0 | 0 | 1 | female | 0 | 1.0 | 113572 | 2 | female | 1 |
829 | 62.0 | NaN | 80.0 | 0 | 1 | female | 0 | 1.0 | 113572 | 3 | female | 2 |
Both passengers with a missing port paid a fare of 80 in first class, which most likely corresponds to port C.

all_data['Embarked'].fillna('C', inplace=True)
all_data.Embarked.isnull().any()
False
embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data, embark_dummy], axis=1)
all_data.head(2)
Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived | Ticket | person | Age_band | Title_Master | Title_Miss | Title_Mr | Title_Mrs | C | Q | S | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 22.0 | S | 7.2500 | 0 | 3 | male | 1 | 0.0 | A/5 21171 | male | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 38.0 | C | 71.2833 | 0 | 1 | female | 1 | 1.0 | PC 17599 | female | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Combine SibSp and Parch
# create two new features: family size and whether the passenger is alone
all_data['Family_size'] = all_data['SibSp'] + all_data['Parch']  # total relatives aboard
all_data['alone'] = 0
all_data.loc[all_data.Family_size==0, 'alone'] = 1  # 1 means traveling alone

f, ax = plt.subplots(1, 2, figsize=(16,6))
sns.factorplot('Family_size', 'Survived', data=all_data[:train.shape[0]], ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.factorplot('alone', 'Survived', data=all_data[:train.shape[0]], ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.close(2)
plt.close(3)
plt.show()
Passengers traveling alone survive at a low rate, about 0.3; having 1-3 family members raises the rate, but beyond 4 it drops sharply again.
# then bin family size
all_data['Family_size'] = np.where(all_data['Family_size']==0, 'solo',
                          np.where(all_data['Family_size']<=3, 'normal', 'big'))

sns.factorplot('alone', 'Survived', hue='Sex', data=all_data[:train.shape[0]], col='Pclass')
plt.show()
For women in classes 1 and 2, traveling alone makes little difference, but for women in class 3 traveling alone actually raises the survival rate.
all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex']=='female') & (all_data['Pclass']==3) & (all_data['alone']==1), 'poor_girl'] = 1
Fill and bin the continuous Fare variable
# fill the missing fare with the (rounded) class mean
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==1), 'Fare'] = 84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==2), 'Fare'] = 21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==3), 'Fare'] = 14

sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0, 'Fare'], color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1, 'Fare'], color='blue', label='Survived')
plt.xlim((0,100))
(0, 100)
sns.lmplot('Fare', 'Survived', data=all_data[:train.shape[0]])
plt.show()

# split Fare into three equal-frequency bands and check mean survival per band
all_data['Fare_band'] = pd.qcut(all_data['Fare'], 3)
all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()
Fare_band
(-0.001, 8.662] 0.198052
(8.662, 26.0] 0.402778
(26.0, 512.329] 0.559322
Name: Survived, dtype: float64
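The manual thresholding that follows reproduces these bins by hand; `pd.qcut` can instead return integer bin codes directly with `labels=False`, which avoids copying the 8.662 and 26.0 cut points into the code. A small sketch on toy fares:

```python
import pandas as pd

fares = pd.Series([5.0, 7.0, 10.0, 20.0, 30.0, 100.0])
# labels=False returns the bin index (0, 1, 2) instead of an Interval.
fare_cut = pd.qcut(fares, 3, labels=False)
print(fare_cut.tolist())  # [0, 0, 1, 1, 2, 2]
```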
# discretize the continuous Fare into the three bands
all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare']<=8.662, 'Fare_cut'] = 0
all_data.loc[(all_data['Fare']>8.662) & (all_data['Fare']<=26), 'Fare_cut'] = 1
#all_data.loc[(all_data['Fare']>14.454) & (all_data['Fare']<=31.275), 'Fare_cut'] = 2
all_data.loc[(all_data['Fare']>26) & (all_data['Fare']<513), 'Fare_cut'] = 2

sns.factorplot('Fare_cut', 'Survived', hue='Sex', data=all_data[:train.shape[0]])
plt.show()
Survival rises with fare, especially for men.
# create a feature marking wealthy men
all_data['rich_man'] = 0
all_data.loc[(all_data['Fare']>=80) & (all_data['Sex']=='male'), 'rich_man'] = 1
Encoding categorical features numerically
all_data.head()
Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived | Ticket | person | … | Title_Mrs | C | Q | S | Family_size | alone | poor_girl | Fare_band | Fare_cut | rich_man | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 22.0 | S | 7.2500 | 0 | 3 | male | 1 | 0.0 | A/5 21171 | male | … | 0 | 0 | 0 | 1 | normal | 0 | 0 | (-0.001, 8.662] | 0 | 0 |
1 | 38.0 | C | 71.2833 | 0 | 1 | female | 1 | 1.0 | PC 17599 | female | … | 1 | 1 | 0 | 0 | normal | 0 | 0 | (26.0, 512.329] | 2 | 0 | |
2 | 26.0 | S | 7.9250 | 0 | 3 | female | 0 | 1.0 | STON/O2. 3101282 | female | … | 0 | 0 | 0 | 1 | solo | 1 | 1 | (-0.001, 8.662] | 0 | 0 | |
3 | 35.0 | S | 53.1000 | 0 | 1 | female | 1 | 1.0 | 113803 | female | … | 1 | 0 | 0 | 1 | normal | 0 | 0 | (26.0, 512.329] | 2 | 0 | |
4 | 35.0 | S | 8.0500 | 0 | 3 | male | 0 | 0.0 | 373450 | male | … | 0 | 0 | 0 | 1 | solo | 1 | 0 | (-0.001, 8.662] | 0 | 0 |
Features to drop: Embarked (already one-hot encoded), Fare and Fare_band (replaced by Fare_cut), Sex (replaced by person), Age (replaced by Age_band), Ticket, C (one dummy dropped), SibSp, Parch.
'''
Drop features no longer needed:
Age: replaced by the binned Age_band
Fare, Fare_band: replaced by the binned Fare_cut
Ticket: carries no meaning
'''
#all_data.drop(['Age','Fare','Fare_band','Ticket'], axis=1, inplace=True)
#all_data.drop(['Age','Fare','Fare_band','Ticket','Embarked','C'], axis=1, inplace=True)
all_data.drop(['Age','Fare','Ticket','Embarked','C','Fare_band','SibSp','Parch'], axis=1, inplace=True)

all_data.head(2)
Pclass | Sex | Survived | person | Age_band | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Q | S | Family_size | alone | poor_girl | Fare_cut | rich_man | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 3 | male | 0.0 | male | 1 | 0 | 0 | 1 | 0 | 0 | 1 | normal | 0 | 0 | 0 | 0 |
1 | 1 | female | 1.0 | female | 1 | 0 | 0 | 0 | 1 | 0 | 0 | normal | 0 | 0 | 2 | 0 |
df1 = pd.get_dummies(all_data['Family_size'], prefix='Family_size')
df2 = pd.get_dummies(all_data['person'], prefix='person')
df3 = pd.get_dummies(all_data['Age_band'], prefix='age')
all_data = pd.concat([all_data, df1, df2, df3], axis=1)
all_data.head()
Pclass | Sex | Survived | person | Age_band | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Q | … | rich_man | Family_size_big | Family_size_normal | Family_size_solo | person_child | person_female | person_male | age_0 | age_1 | age_2 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 3 | male | 0.0 | male | 1 | 0 | 0 | 1 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 1 | female | 1.0 | female | 1 | 0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
2 | 3 | female | 1.0 | female | 1 | 0 | 1 | 0 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | |
3 | 1 | female | 1.0 | female | 1 | 0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
4 | 3 | male | 0.0 | male | 1 | 0 | 0 | 1 | 0 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
all_data.drop(['Sex','person','Age_band','Family_size'], axis=1, inplace=True)
all_data.head()
Pclass | Survived | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Q | S | alone | poor_girl | … | rich_man | Family_size_big | Family_size_normal | Family_size_solo | person_child | person_female | person_male | age_0 | age_1 | age_2 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 3 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 1 | 1.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
2 | 3 | 1.0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | … | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | |
3 | 1 | 1.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
4 | 3 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | … | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
Building models
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix            # array of predicted vs. true labels
from sklearn.model_selection import cross_val_predict   # out-of-fold cross-validation predictions
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
train_data = all_data[:train.shape[0]]
test_data = all_data[train.shape[0]:]
print('train data:' + str(train_data.shape))
print('test data:' + str(test_data.shape))
train data:(668, 21)
test data:(641, 21)
# note: this reuses the names train/test for the split frames
train, test = train_test_split(train_data, test_size=0.25, random_state=0, stratify=train_data['Survived'])

train_x = train.drop('Survived', axis=1)
train_y = train['Survived']
test_x = test.drop('Survived', axis=1)
test_y = test['Survived']

print(train_x.shape)
print(test_x.shape)
(668, 20)
(223, 20)
# cross-validated accuracy on the training and hold-out splits
def cv_score(model):
    cv_result = cross_val_score(model, train_x, train_y, cv=10, scoring="accuracy")
    return cv_result

def cv_score_test(model):
    cv_result_test = cross_val_score(model, test_x, test_y, cv=10, scoring="accuracy")
    return cv_result_test
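`confusion_matrix` and `cross_val_predict` are imported above but never used; a brief sketch of how they combine, on synthetic data rather than the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
# Out-of-fold predictions: each sample is predicted by a model that
# never saw it during fitting, so the confusion matrix is honest.
pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
print(confusion_matrix(y, pred).shape)  # (2, 2)
```

The off-diagonal cells show which class the models confuse more often, something a single accuracy number hides.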
RBF SVM
# RBF SVM model
param_grid = {'C': [1e3, 5e3, 1e4, 5e4, 1e5],
              'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]}
clf_svc = GridSearchCV(svm.SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf_svc = clf_svc.fit(train_x, train_y)
print("Best estimator found by grid search:")
print(clf_svc.best_estimator_)
acc_svc_train = cv_score(clf_svc.best_estimator_).mean()
acc_svc_test = cv_score_test(clf_svc.best_estimator_).mean()
print(acc_svc_train)
print(acc_svc_test)
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
0.826306967835
0.816196122718
Decision tree
# a simple tree
clf_tree = DecisionTreeClassifier()
clf_tree.fit(train_x, train_y)
acc_tree_train = cv_score(clf_tree).mean()
acc_tree_test = cv_score_test(clf_tree).mean()
print(acc_tree_train)
print(acc_tree_test)
0.808216271583
0.811631846414
KNN
# try n_neighbors from 1 to 10
pred = []
for i in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(train_x, train_y)
    pred.append(cv_score(model).mean())
n = list(range(1, 11))
plt.plot(n, pred)
plt.xticks(range(1, 11))
plt.show()

clf_knn = KNeighborsClassifier(n_neighbors=4)
clf_knn.fit(train_x, train_y)
acc_knn_train = cv_score(clf_knn).mean()
acc_knn_test = cv_score_test(clf_knn).mean()
print(acc_knn_train)
print(acc_knn_test)
0.826239790353
0.829653679654
Logistic regression
# logistic regression
clf_LR = LogisticRegression()
clf_LR.fit(train_x, train_y)
acc_LR_train = cv_score(clf_LR).mean()
acc_LR_test = cv_score_test(clf_LR).mean()
print(acc_LR_train)
print(acc_LR_test)
0.838226647511
0.811848296631
Gaussian naive Bayes
clf_gb = GaussianNB()
clf_gb.fit(train_x, train_y)
acc_gb_train = cv_score(clf_gb).mean()
acc_gb_test = cv_score_test(clf_gb).mean()
print(acc_gb_train)
print(acc_gb_test)
0.794959693511
0.789695087521
Random forest
n_estimators = range(100, 1000, 100)
grid = {'n_estimators': n_estimators}
clf_forest = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=grid, verbose=True)
clf_forest.fit(train_x, train_y)
print(clf_forest.best_estimator_)
print(clf_forest.best_score_)
#print(cv_score(clf_forest).mean())
#print(cv_score_test(clf_forest).mean())
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 32.2s finished
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_split=1e-07, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0,
    n_estimators=200, n_jobs=1, oob_score=False, random_state=0,
    verbose=0, warm_start=False)
0.817365269461
clf_forest = RandomForestClassifier(n_estimators=200)
clf_forest.fit(train_x, train_y)
acc_forest_train = cv_score(clf_forest).mean()
acc_forest_test = cv_score_test(clf_forest).mean()
print(acc_forest_train)
print(acc_forest_test)
0.811178066885
0.811434217956
pd.Series(clf_forest.feature_importances_, train_x.columns).sort_values(ascending=True).plot.barh(width=0.8)
plt.show()
models = pd.DataFrame({
    'model': ['SVM', 'Decision Tree', 'KNN', 'Logistic regression', 'Gaussian Bayes', 'Random Forest'],
    'score on train': [acc_svc_train, acc_tree_train, acc_knn_train, acc_LR_train, acc_gb_train, acc_forest_train],
    'score on test': [acc_svc_test, acc_tree_test, acc_knn_test, acc_LR_test, acc_gb_test, acc_forest_test]
})
models.sort_values(by='score on test', ascending=False)
|   | model | score on test | score on train |
|---|-------|---------------|----------------|
| 2 | KNN | 0.829654 | 0.826240 |
| 0 | SVM | 0.816196 | 0.826307 |
| 3 | Logistic regression | 0.811848 | 0.838227 |
| 1 | Decision Tree | 0.811632 | 0.808216 |
| 5 | Random Forest | 0.811434 | 0.811178 |
| 4 | Gaussian Bayes | 0.789695 | 0.794960 |
Ensemble
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier
# bagging the tuned SVM (despite the original "bagging Decision tree" label,
# base_estimator here is clf_svc.best_estimator_)
from sklearn.ensemble import BaggingClassifier
bag_tree = BaggingClassifier(base_estimator=clf_svc.best_estimator_, n_estimators=200, random_state=0)
bag_tree.fit(train_x, train_y)
acc_bagtree_train = cv_score(bag_tree).mean()
acc_bagtree_test = cv_score_test(bag_tree).mean()
print(acc_bagtree_train)
print(acc_bagtree_test)
0.82782211935
0.816196122718
AdaBoost
n_estimators = range(100, 1000, 100)
a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
grid = {'n_estimators': n_estimators, 'learning_rate': a}
ada = GridSearchCV(AdaBoostClassifier(), param_grid=grid, verbose=True)
ada.fit(train_x, train_y)
print(ada.best_estimator_)
print(ada.best_score_)
Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed: 5.4min finished
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
    learning_rate=0.05, n_estimators=200, random_state=None)
0.835329341317
ada = AdaBoostClassifier(n_estimators=200, random_state=0, learning_rate=0.2)
ada.fit(train_x, train_y)
acc_ada_train = cv_score(ada).mean()
acc_ada_test = cv_score_test(ada).mean()
print(acc_ada_train)
print(acc_ada_test)
0.829248144305
0.825719932242
# confusion matrix to inspect the predictions
y_pred = cross_val_predict(ada, test_x, test_y, cv=10)
sns.heatmap(confusion_matrix(test_y, y_pred), cmap='winter', annot=True, fmt='2.0f')
plt.show()
GradientBoosting
n_estimators = range(100, 1000, 100)
a = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
grid = {'n_estimators': n_estimators, 'learning_rate': a}
grad = GridSearchCV(GradientBoostingClassifier(), param_grid=grid, verbose=True)
grad.fit(train_x, train_y)
print(grad.best_estimator_)
print(grad.best_score_)
Fitting 3 folds for each of 90 candidates, totalling 270 fits
[Parallel(n_jobs=1)]: Done 270 out of 270 | elapsed: 2.4min finished
GradientBoostingClassifier(criterion='friedman_mse', init=None,
    learning_rate=0.05, loss='deviance', max_depth=3,
    max_features=None, max_leaf_nodes=None,
    min_impurity_split=1e-07, min_samples_leaf=1,
    min_samples_split=2, min_weight_fraction_leaf=0.0,
    n_estimators=200, presort='auto', random_state=None,
    subsample=1.0, verbose=0, warm_start=False)
0.824850299401
# use the best estimator found by the gradient-boosting grid search
clf_grad = GradientBoostingClassifier(n_estimators=200, random_state=0, learning_rate=0.05)
clf_grad.fit(train_x, train_y)
acc_grad_train = cv_score(clf_grad).mean()
acc_grad_test = cv_score_test(clf_grad).mean()
print(acc_grad_train)
print(acc_grad_test)
0.818709926304
0.807500470544
from sklearn.metrics import precision_score

class Ensemble(object):
    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        return self.clf.predict(x)

    def score(self, x, y):
        s = precision_score(y, self.predict(x))
        return s
ensem = Ensemble([('Ada', ada), ('Bag', bag_tree), ('SVM', clf_svc.best_estimator_),
                  ('LR', clf_LR), ('gbdt', clf_grad)])
score = 0
for i in range(0, 10):
    ensem.fit(train_x, train_y)
    sco = round(ensem.score(test_x, test_y) * 100, 2)
    score += sco
print(score / 10)
89.83
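Note that `VotingClassifier` is imported above but never used; it is scikit-learn's built-in alternative to this hand-rolled stacking class. A minimal sketch on synthetic placeholder data (plain hard voting rather than the logistic-regression meta-learner used here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# placeholder data standing in for the Titanic features
X, y = make_classification(n_samples=200, random_state=0)

# hard voting: each base model gets one vote, majority wins
voter = VotingClassifier(estimators=[
    ('ada', AdaBoostClassifier(n_estimators=50, random_state=0)),
    ('lr', LogisticRegression(max_iter=1000)),
    ('svc', SVC(kernel='rbf')),
], voting='hard')
voter.fit(X, y)
print(voter.score(X, y))
```

Unlike the custom `Ensemble` class, this fits no meta-model on the base predictions, so there is no extra layer that can overfit the training folds.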
Submission
pre = ensem.predict(test_data)
submission = pd.DataFrame({'PassengerId': passID, 'Survived': pre})
Judging from the submitted results, the ensemble model shows no clear improvement over the individual models. Likely causes: the base models are strongly correlated, the training data is small, and the one-hot encoding may introduce collinearity. Although the local train and test scores were close, the leaderboard score dropped noticeably, which suggests insufficient data, incomplete training, and few, strongly correlated features; introducing more engineered features is worth considering.
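On the one-hot collinearity point: when all k dummy columns of a category are kept, each row sums to 1, so the columns are perfectly collinear with an intercept (the dummy-variable trap). pandas' `get_dummies(..., drop_first=True)` drops one level to break that dependence; a small sketch with a toy `Embarked` column:

```python
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# all three levels: columns C, Q, S; every row sums to 1
full = pd.get_dummies(df['Embarked'])

# drop_first=True removes the first level (C), leaving Q and S;
# "C" is then the implicit baseline encoded by (Q=0, S=0)
reduced = pd.get_dummies(df['Embarked'], drop_first=True)

print(list(full.columns), list(reduced.columns))
```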