My current dataset has about 28,000 observations and 35 features. My X matrix contains the first 34 features, and my y vector contains the last (35th) feature, labelled HighLowMobility in the code below. I built a neural network to classify high vs. low, but because of missing data points my algorithm's accuracy was 12%. The problem I am running into is that several of my features are missing a lot of data points. One way I got around this was to fill the missing values with the column mean, which raised the accuracy to 56%, but I don't like the idea of using means for missing values, so I am looking for another approach.
#loading the data into a data frame
import pandas as pd
X = pd.read_csv('raw_data_for_edits.csv')
#Collect the output in the y variable before dropping the label from X
y = X['HighLowMobility']
#Dropping the categorical columns and the label itself, so the label does not leak into the features
X = X.drop(['county_name','statename','stateabbrv','HighLowMobility'], axis=1)
#Impute the missing values with column means
X = X.fillna(X.mean())
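For context, here is a minimal sketch of the classification step described above. My actual network is not shown here; scikit-learn's MLPClassifier is a stand-in, and the data, split, and layer sizes are purely illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Illustrative stand-in data: 200 rows, 5 numeric features, binary label
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)))
y = pd.Series(rng.integers(0, 2, size=200)).map({0: 'Low', 1: 'High'})

# Hold out a test set, fit a small network, and report accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```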
I can't copy and paste the whole dataset because it is too large, but I have pasted the first 12 rows and 15 features:
birthcohort countyfipscode county_name cty_pop2000 statename state_id stateabbrv perm_res_p25_kr24 perm_res_p75_kr24 perm_res_p25_c1823 perm_res_p75_c1823 perm_res_p25_c19 perm_res_p75_c19 perm_res_p25_kr26 perm_res_p75_kr26
1980 1001 Autauga 43671 Alabama 1 AL 45.29939 60.7061 20.79255 66.0626 40.33072 61.38815
1981 1001 Autauga 43671 Alabama 1 AL 42.61835 63.21074 29.72325 75.26598 18.54342 54.94438 39.72811 65.40214
1982 1001 Autauga 43671 Alabama 1 AL 48.26985 62.34378 38.06422 72.25443 21.53552 59.08011 44.65976 63.69386
1983 1001 Autauga 43671 Alabama 1 AL 42.63371 56.42043 38.25876 80.4664 15.57722 57.13945 40.6005 61.02879
1984 1001 Autauga 43671 Alabama 1 AL 44.01634 62.27992 38.12383 73.74701 23.0881 55.17943 43.34503 62.40761
1985 1001 Autauga 43671 Alabama 1 AL 45.71784 61.31874 40.93386 83.06611 25.66557 72.2912 42.42057 62.00612
1986 1001 Autauga 43671 Alabama 1 AL 47.92037 59.65535 47.48409 72.49103 28.89066 63.85233 42.06915 59.60703
1987 1001 Autauga 43671 Alabama 1 AL 48.31079 54.04203 53.19901 84.53795 35.28359 71.83407
1988 1001 Autauga 43671 Alabama 1 AL 47.98552 59.42001 52.89273 85.28442 30.55523 67.43595
1980 1003 Baldwin 140415 Alabama 1 AL 42.46106 51.41415 19.86316 58.6601 41.89684 55.88935
1981 1003 Baldwin 140415 Alabama 1 AL 43.00288 55.10138 35.59233 76.98567 11.48056 40.79744 42.46521 57.31494
Notice that the feature "perm_res_p25_c1823" is missing values, and this becomes a problem for the accuracy of my algorithm. So what should I do about the missing values? I have read about interpolation; is that what I should do, and if so, how would I code it?
Answer 0 (score: 0)
One approach is to use a preprocessor; I suggest scikit-learn. For your case, my example will use a simple "mean" strategy to transform the missing "NaN" data, as follows:
In [1]: import pandas as pd
In [2]: from sklearn.impute import SimpleImputer
# df is a copy of your sample data
In [6]: values = df[['perm_res_p25_kr26', 'perm_res_p75_kr26']].values
In [7]: values
Out[7]:
array([[ nan, nan],
[ 39.72811, 65.40214],
[ 44.65976, 63.69386],
[ 40.6005 , 61.02879],
[ 43.34503, 62.40761],
[ 42.42057, 62.00612],
[ 42.06915, 59.60703],
[ nan, nan],
[ nan, nan],
[ nan, nan],
[ 42.46521, 57.31494]])
# use a SimpleImputer with the "mean" strategy to preprocess your missing data
In [8]: imp = SimpleImputer(strategy="mean")
# simple fit & transform operations
In [9]: imp.fit(values)
Out[9]: SimpleImputer()
# assign the imputed values back to the dataframe
In [10]: df.loc[:, 'perm_res_p25_kr26':'perm_res_p75_kr26'] = imp.transform(values)
# and your missing data is taken care of
In [12]: df[['perm_res_p25_kr26', 'perm_res_p75_kr26']]
Out[12]:
perm_res_p25_kr26 perm_res_p75_kr26
0 42.184047 61.637213
1 39.728110 65.402140
2 44.659760 63.693860
3 40.600500 61.028790
4 43.345030 62.407610
5 42.420570 62.006120
6 42.069150 59.607030
7 42.184047 61.637213
8 42.184047 61.637213
9 42.184047 61.637213
10 42.465210 57.314940
This is just a simple "mean" strategy (not what you ultimately want), but you can learn more from Preprocessing data - custom-transformers and implement your own strategy for recovering the missing data.
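Since the question also asks about interpolation, here is a minimal sketch of two richer alternatives, using made-up values that mirror two of the perm_res columns above: pandas' interpolate(), which fills each gap from neighbouring rows, and scikit-learn's KNNImputer, which estimates each missing value from the most similar rows:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative values only, mirroring two of the perm_res columns above
df = pd.DataFrame({
    "perm_res_p25_kr26": [np.nan, 39.7, 44.7, 40.6, 42.5],
    "perm_res_p75_kr26": [61.0, np.nan, 63.7, 61.0, 57.3],
})

# Interpolation: fill each gap linearly from the surrounding rows, which is
# reasonable here because rows are consecutive birth cohorts for one county;
# limit_direction="both" also fills leading and trailing gaps
interp = df.interpolate(limit_direction="both")

# KNN imputation: estimate each missing value from the k rows that are most
# similar on the features that are present
knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                   columns=df.columns)
```

Note that interpolation only makes sense if the rows within each county are sorted by birthcohort, so sort (or group by countyfipscode) first; KNNImputer does not depend on row order.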
Hope this helps.