Data-analysis script: feature-label relationships, missing features, missing-value cross features, and correlation heatmaps
2017-05-21 18:06
https://www.kaggle.com/dollardollar/bosch-production-line-performance/eda-of-important-features/comments
Note: the data-analysis script at this link is worth borrowing. It does the following:
1. Analyzes the relationship between features and the label.
2. Compares, for samples with different labels, the proportion of missing values.
3. Draws correlation heatmaps.
The original text follows:
EDA of important features
An XGBoost model fitted on approximately 700 columns already achieves an LB score of ~0.25 without any feature engineering. What can we learn from the 20 most important numeric features suggested by that model?
The analysis in this notebook reveals that the positive and negative samples differ significantly both in their missing-value counts and in their missing-value correlation structure. On the other hand, the behavior on non-missing values is surprisingly similar for positive and negative samples.
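For context, the leaderboard for this competition (Bosch Production Line Performance) is scored by the Matthews correlation coefficient, so the ~0.25 figure is an MCC value. A minimal sketch of the metric in plain numpy (the function name mcc is ours, not the notebook's):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)      # true positives
    tn = np.sum(~y_true & ~y_pred)    # true negatives
    fp = np.sum(~y_true & y_pred)     # false positives
    fn = np.sum(y_true & ~y_pred)     # false negatives
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Convention: return 0 when any confusion-matrix margin is empty.
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(mcc([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect predictions -> 1.0
```

MCC stays informative under the heavy class imbalance of this data set, which is why plain accuracy is not used here.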
Data preparation
In the first step, we import standard libraries and fix the most essential features as suggested by an XGB oracle.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

feature_names = ['L3_S38_F3960', 'L3_S33_F3865', 'L3_S38_F3956', 'L3_S33_F3857',
                 'L3_S29_F3321', 'L1_S24_F1846', 'L3_S32_F3850', 'L3_S29_F3354',
                 'L3_S29_F3324', 'L3_S35_F3889', 'L0_S1_F28', 'L1_S24_F1844',
                 'L3_S29_F3376', 'L0_S0_F22', 'L3_S33_F3859', 'L3_S38_F3952',
                 'L3_S30_F3754', 'L2_S26_F3113', 'L3_S30_F3759', 'L0_S5_F114']
We determine the indices of the most important features. After that, the training data is loaded.
In [2]:
numeric_cols = pd.read_csv("../input/train_numeric.csv", nrows=1).columns.values
imp_idxs = [np.argwhere(feature_name == numeric_cols)[0][0]
            for feature_name in feature_names]
train = pd.read_csv("../input/train_numeric.csv", index_col=0, header=0,
                    usecols=[0, len(numeric_cols) - 1] + imp_idxs)
train = train[feature_names + ['Response']]
The data is split into positive and negative samples.
In [3]:
X_neg = train[train['Response'] == 0].iloc[:, :-1]
X_pos = train[train['Response'] == 1].iloc[:, :-1]
Univariate characteristics
In order to better understand the predictive power of single features, we compare the univariate distributions of the most important features. First, we divide the training data into batches column-wise to prepare the data for plotting.
In [4]:
BATCH_SIZE = 5
train_batch = [pd.melt(train[train.columns[batch: batch + BATCH_SIZE].append(np.array(['Response']))],
                       id_vars='Response',
                       value_vars=feature_names[batch: batch + BATCH_SIZE])
               for batch in list(range(0, train.shape[1] - 1, BATCH_SIZE))]
After this split, we can draw violin plots. For memory reasons, we have to split the presentation across several cells. For many of the distributions there is no clear difference between the positive and negative samples.
In [5]:
FIGSIZE = (12, 16)
_, axs = plt.subplots(len(train_batch), figsize=FIGSIZE)
plt.suptitle('Univariate distributions')
for data, ax in zip(train_batch, axs):
    sns.violinplot(x='variable', y='value', hue='Response',
                   data=data, ax=ax, split=True)
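To quantify "no clear difference" beyond eyeballing the violins, one could compute a two-sample Kolmogorov-Smirnov statistic per feature. This is not part of the original notebook; the helper ks_stat below is our own sketch, shown on synthetic data and dropping NaNs the way this data set would require:

```python
import numpy as np

def ks_stat(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    a = np.sort(a[~np.isnan(a)])
    b = np.sort(b[~np.isnan(b)])
    grid = np.concatenate([a, b])                          # evaluate CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 1000)   # stand-in for one feature, negative class
pos = rng.normal(0.5, 1.0, 1000)   # shifted stand-in for the positive class
print(ks_stat(neg, pos))           # larger values = better univariate separation
```

Ranking the 20 features by this statistic would give a numeric counterpart to the visual comparison above.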
The data set is characterized by a large proportion of missing values. When comparing the missing values between positive and negative samples, we see striking differences.
In [6]:
non_missing = pd.DataFrame(pd.concat([
    (X_neg.count() / X_neg.shape[0]).to_frame('negative samples'),
    (X_pos.count() / X_pos.shape[0]).to_frame('positive samples'),
], axis=1))
non_missing_sort = non_missing.sort_values(['negative samples'])
non_missing_sort.plot.barh(title='Proportion of non-missing values', figsize=FIGSIZE)
plt.gca().invert_yaxis()
Correlation structure
In the previous section we saw differences between negative and positive samples in univariate characteristics. We go down the rabbit hole a little further and analyze covariances for the negative and positive samples separately.
In [7]:
FIGSIZE = (13, 4)
_, (ax1, ax2) = plt.subplots(1, 2, figsize=FIGSIZE)
MIN_PERIODS = 100
triang_mask = np.zeros((X_pos.shape[1], X_pos.shape[1]))
triang_mask[np.triu_indices_from(triang_mask)] = True
ax1.set_title('Negative Class')
sns.heatmap(X_neg.corr(min_periods=MIN_PERIODS), mask=triang_mask, square=True, ax=ax1)
ax2.set_title('Positive Class')
sns.heatmap(X_pos.corr(min_periods=MIN_PERIODS), mask=triang_mask, square=True, ax=ax2)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd3d802b400>
The difference between the two matrices is sparse except for three specific feature combinations.
In [8]:
sns.heatmap(X_pos.corr(min_periods=MIN_PERIODS) - X_neg.corr(min_periods=MIN_PERIODS),
            mask=triang_mask, square=True)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd3d80b3c88>
Finally, as in the univariate case, we analyze correlations between missing values in different features.
In [9]:
nan_pos, nan_neg = np.isnan(X_pos), np.isnan(X_neg)
triang_mask = np.zeros((X_pos.shape[1], X_pos.shape[1]))
triang_mask[np.triu_indices_from(triang_mask)] = True
FIGSIZE = (13, 4)
_, (ax1, ax2) = plt.subplots(1, 2, figsize=FIGSIZE)
MIN_PERIODS = 100
ax1.set_title('Negative Class')
sns.heatmap(nan_neg.corr(), square=True, mask=triang_mask, ax=ax1)
ax2.set_title('Positive Class')
sns.heatmap(nan_pos.corr(), square=True, mask=triang_mask, ax=ax2)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd3d80b06a0>
For the difference of the missing-value correlation matrices, a striking pattern emerges. A further, more systematic analysis of such missing-value patterns has the potential to yield powerful features.
In [10]:
sns.heatmap(nan_neg.corr() - nan_pos.corr(), mask = triang_mask, square=True)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd3d05d05f8>
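The analysis above suggests one concrete direction: encode each row's missing-value pattern as features. A minimal sketch on a toy frame (the column names and the pattern-ID encoding are our own illustration, not the notebook's):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few selected numeric columns.
df = pd.DataFrame({'f1': [1.0, np.nan, 3.0, np.nan],
                   'f2': [np.nan, np.nan, 2.0, 2.0]})

nan_mask = df.isnull()
# Feature 1: how many of the selected columns are missing in each row.
df['nan_count'] = nan_mask.sum(axis=1)
# Feature 2: an integer ID for the exact missing-value pattern, so rows
# with identical patterns share a category.
pattern = nan_mask.apply(lambda r: ''.join(r.values.astype(int).astype(str)), axis=1)
df['nan_pattern'] = pattern.astype('category').cat.codes

print(df[['nan_count', 'nan_pattern']])
```

On the real data one would build these from the 20 selected columns (or all ~970 numeric columns) and feed them to the XGBoost model alongside the raw values.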