Looking for a way to handle missing values in a dataset

Time: 2016-04-27 19:34:02

Tags: python machine-learning

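My current dataset has about 28,000 observations and 35 features. My X matrix contains the first 34 features and my y matrix contains the last (35th) feature, which I have labeled HighLowMobility in the code below. I built a neural network to classify high versus low, but because of missing data points my algorithm's accuracy was 12%. The problem is that some of my features are missing a lot of data points. One way I worked around this was to fill in the missing values with the mean. That raised the algorithm's accuracy to 56%, but I don't like the idea of using the mean for missing values. I would like to find another approach.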

import pandas as pd

# Load the data into a DataFrame
X = pd.read_csv('raw_data_for_edits.csv')
# Impute the missing values with the column means (numeric columns only)
X = X.fillna(X.mean(numeric_only=True))
# Drop the categorical columns
X = X.drop(['county_name', 'statename', 'stateabbrv'], axis=1)
# Collect the output in the y variable
y = X['HighLowMobility']

I can't copy and paste the whole dataset because it is too large, but I have pasted the first 12 rows and 15 features:

birthcohort  countyfipscode  county_name  cty_pop2000  statename  state_id  stateabbrv  perm_res_p25_kr24  perm_res_p75_kr24  perm_res_p25_c1823  perm_res_p75_c1823  perm_res_p25_c19  perm_res_p75_c19  perm_res_p25_kr26  perm_res_p75_kr26
1980         1001            Autauga      43671        Alabama    1         AL          45.29939           60.7061                                                    20.79255          66.0626           40.33072           61.38815
1981         1001            Autauga      43671        Alabama    1         AL          42.61835           63.21074           29.72325            75.26598            18.54342          54.94438          39.72811           65.40214
1982         1001            Autauga      43671        Alabama    1         AL          48.26985           62.34378           38.06422            72.25443            21.53552          59.08011          44.65976           63.69386
1983         1001            Autauga      43671        Alabama    1         AL          42.63371           56.42043           38.25876            80.4664             15.57722          57.13945          40.6005            61.02879
1984         1001            Autauga      43671        Alabama    1         AL          44.01634           62.27992           38.12383            73.74701            23.0881           55.17943          43.34503           62.40761
1985         1001            Autauga      43671        Alabama    1         AL          45.71784           61.31874           40.93386            83.06611            25.66557          72.2912           42.42057           62.00612
1986         1001            Autauga      43671        Alabama    1         AL          47.92037           59.65535           47.48409            72.49103            28.89066          63.85233          42.06915           59.60703
1987         1001            Autauga      43671        Alabama    1         AL          48.31079           54.04203           53.19901            84.53795            35.28359          71.83407
1988         1001            Autauga      43671        Alabama    1         AL          47.98552           59.42001           52.89273            85.28442            30.55523          67.43595
1980         1003            Baldwin      140415       Alabama    1         AL          42.46106           51.41415                                                   19.86316          58.6601           41.89684           55.88935
1981         1003            Baldwin      140415       Alabama    1         AL          43.00288           55.10138           35.59233            76.98567            11.48056          40.79744          42.46521           57.31494

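Note that the feature "perm_res_p25_c1823" has missing values. This is a problem for my algorithm's accuracy. So what should I do about the missing values? I have read about interpolation; is that what I should use? If so, how would I code it?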

1 answer:

Answer 0: (score: 0)

One approach is to use a preprocessor; I suggest scikit-learn. For your case, my example uses the simple "mean" strategy to fill in the missing "NaN" values, as follows:

In [1]: import pandas as pd

In [2]: from sklearn.preprocessing import Imputer

# df is a copy from your sample data

In [6]: values = df[['perm_res_p25_kr26', 'perm_res_p75_kr26']].values

In [7]: values
Out[7]: 
array([[      nan,       nan],
       [ 39.72811,  65.40214],
       [ 44.65976,  63.69386],
       [ 40.6005 ,  61.02879],
       [ 43.34503,  62.40761],
       [ 42.42057,  62.00612],
       [ 42.06915,  59.60703],
       [      nan,       nan],
       [      nan,       nan],
       [      nan,       nan],
       [ 42.46521,  57.31494]])

# use an Imputer with the simple "mean" strategy to fill in your missing data
In [8]: imp = Imputer(missing_values="NaN", strategy="mean", axis=0)
# simple fit & transform operations
In [9]: imp.fit(values)
Out[9]: Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
# assign the imputed values back to the dataframe
In [10]: df.loc[:, 'perm_res_p25_kr26':'perm_res_p75_kr26'] = imp.transform(values)
# and your missing data is taken care of
In [12]: df[['perm_res_p25_kr26', 'perm_res_p75_kr26']]
Out[12]: 
    perm_res_p25_kr26  perm_res_p75_kr26
0           42.184047          61.637213
1           39.728110          65.402140
2           44.659760          63.693860
3           40.600500          61.028790
4           43.345030          62.407610
5           42.420570          62.006120
6           42.069150          59.607030
7           42.184047          61.637213
8           42.184047          61.637213
9           42.184047          61.637213
10          42.465210          57.314940
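
Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22. On newer versions the same mean imputation can be written with `sklearn.impute.SimpleImputer`. A minimal sketch, assuming a recent scikit-learn and rebuilding the two columns from the session above by hand:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ['perm_res_p25_kr26', 'perm_res_p75_kr26']

# The same two columns as in the session above, typed in by hand
df = pd.DataFrame({
    'perm_res_p25_kr26': [np.nan, 39.72811, 44.65976, 40.6005, 43.34503,
                          42.42057, 42.06915, np.nan, np.nan, np.nan, 42.46521],
    'perm_res_p75_kr26': [np.nan, 65.40214, 63.69386, 61.02879, 62.40761,
                          62.00612, 59.60703, np.nan, np.nan, np.nan, 57.31494],
})

# SimpleImputer always works column-wise, so there is no axis argument
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit on the columns and write the imputed values back in one step
df[cols] = imp.fit_transform(df[cols])
print(df)
```

Changing `strategy` to `'median'` or `'most_frequent'` swaps the fill value without touching the rest of the code.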

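This is just the simple "mean" strategy (not what you want), but you can learn more from Preprocessing data - custom-transformers and implement your own strategy for filling in the missing data.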
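
As one example of a different strategy, and since the question asks about interpolation: pandas can interpolate linearly within each county along `birthcohort`. The sketch below uses a small made-up frame. The `rate` column and its values are hypothetical stand-ins for a column such as `perm_res_p25_c1823`, and whether interpolating across cohorts is sensible for this data is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two counties, four birth cohorts each
df = pd.DataFrame({
    'birthcohort':    [1980, 1981, 1982, 1983, 1980, 1981, 1982, 1983],
    'countyfipscode': [1001, 1001, 1001, 1001, 1003, 1003, 1003, 1003],
    'rate':           [40.0, np.nan, 44.0, np.nan, np.nan, 50.0, np.nan, 56.0],
})

# Interpolation depends on row order, so sort by county and cohort first
df = df.sort_values(['countyfipscode', 'birthcohort'])

# Interpolate separately inside each county so values never leak
# from one county into the next
df['rate_filled'] = (
    df.groupby('countyfipscode')['rate']
      .transform(lambda s: s.interpolate(method='linear', limit_direction='both'))
)
print(df)
```

Gaps at the start or end of a county's series have a neighbor on only one side, so with `limit_direction='both'` they are filled with the nearest observed value rather than extrapolated.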

Hope this helps.