
Time: 2016-03-15 21:35:32

Tags: python python-2.7 numpy scikit-learn


                         Database      Target    Market_Description    Brand  \
0            CN_Milk powder_Incl_Others    NaN  Shanghai Hyper total  O.Brand   
1            CN_Milk powder_Incl_Others    NaN  Shanghai Hyper total  O.Brand   
2            CN_Milk powder_Incl_Others    NaN  Shanghai Hyper total  O.Brand   

  Sub_Brand Category                   Class_Category  
0       NaN      NaN  Hi Cal Adult Milk Powders- C1  
1       NaN      NaN  Hi Cal Adult Milk Powders- C1  
2       NaN      NaN  Hi Cal Adult Milk Powders- C1 


df3 = CountryDF.apply(preprocessing.LabelEncoder().fit_transform)   


>>> print pd.unique(CountryDF.Target.ravel())
[nan 'Elder' 'Others' 'Lady']

>>> print pd.unique(df3.Target.ravel())
[ 40749 667723 667725 ...,  43347  43346  43345]


EDIT: This dataset is a subset of a larger dataset. Does that have anything to do with it?

EDIT2: @Kevin I tried your suggestion, and the result is strange. See this: [image]

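1 Answer:

Answer 0 (score: 1)

I don't think the size of the dataset affects your result. LabelEncoder is meant for transforming the prediction target (in your case, I assume, the Target column). From the User Guide: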




from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd

CountryDF = pd.DataFrame(
    [['CN_Milk powder_Incl_Others', np.nan, 'Shanghai Hyper total', 'O.Brand', np.nan, np.nan, 'Hi Cal Adult Milk Powders- C1'],
     ['CN_Milk powder_Incl_Others', 'Elder', 'Shanghai Hyper total', 'O.Brand', np.nan, np.nan, 'Hi Cal Adult Milk Powders- C1'],
     ['CN_Milk powder_Incl_Others', 'Others', 'Shanghai Hyper total', 'O.Brand', np.nan, np.nan, 'Hi Cal Adult Milk Powders- C1'],
     ['CN_Milk powder_Incl_Others', 'Lady', 'Shanghai Hyper total', 'O.Brand', np.nan, np.nan, 'Hi Cal Adult Milk Powders- C1'],
     ['CN_Milk powder_Incl_Others', np.nan, 'Shanghai Hyper total', 'O.Brand', 'S_B1', np.nan, 'Hi Cal Adult Milk Powders- C1'],
     ['CN_Milk powder_Incl_Others', np.nan, 'Shanghai Hyper total', 'O.Brand', 'S_B2', np.nan, 'Hi Cal Adult Milk Powders- C1']],
    columns=['Database', 'Target', 'Market_Description', 'Brand', 'Sub_Brand', 'Category', 'Class_Category'])


le = LabelEncoder()  # initialize the LabelEncoder once

# Create a new column with the transformed values.
CountryDF['EncodedTarget'] = le.fit_transform(CountryDF['Target'])



                     Database  Target    Market_Description    Brand  Sub_Brand  Category                 Class_Category  EncodedTarget
0  CN_Milk powder_Incl_Others     NaN  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              0
1  CN_Milk powder_Incl_Others   Elder  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              1
2  CN_Milk powder_Incl_Others  Others  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              3
3  CN_Milk powder_Incl_Others    Lady  Shanghai Hyper total  O.Brand        NaN       NaN  Hi Cal Adult Milk Powders- C1              2
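
To see how the integer codes relate to the original labels, the fitted encoder can be inspected. The following is only a minimal sketch: the labels list is a made-up example that reuses the Target values without the NaN entries, so the codes differ from the EncodedTarget column above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label list (no missing values), not the question's data
labels = ['Elder', 'Others', 'Lady', 'Elder']

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's code is its position here
classes = le.classes_.tolist()

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(codes).tolist()

print(classes)
print(codes.tolist())
print(decoded)
```

Because the classes are sorted, the codes depend only on the set of labels seen during fit, not on the order in which they appear.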

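I hope this helps clarify LabelEncoder. If it doesn't fully answer your question, it may at least point you in the right direction for transforming your features (which may be what you are trying to do?): take a look at OneHotEncoder.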
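
As a rough illustration of that last suggestion, here is a minimal sketch of one-hot encoding a single string column. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string categories directly (older releases, including the 0.17 used in this answer, expected integer input). The array X is a made-up example, not data from the question.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single feature column; values borrowed from the Target column
X = np.array([['Elder'], ['Others'], ['Lady'], ['Elder']], dtype=object)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for display
one_hot = encoder.fit_transform(X).toarray()

# categories_[0] lists the categories found in the first (and only) column,
# in the same order as the output columns
categories = encoder.categories_[0].tolist()

print(categories)
print(one_hot)
```

Each row then has a single 1 in the column of its category, which avoids implying an order between categories the way integer codes do.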

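Edit: I added two more rows to CountryDF (see above), which give the Sub_Brand column two unique values that follow a run of consecutive NaNs. I'm stumped as to why you are seeing this behavior; it works for me with pandas 0.17.0 and scikit-learn 0.17.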

df3 = CountryDF.apply(LabelEncoder().fit_transform)
   Database  Target  Market_Description  Brand  Sub_Brand  Category  Class_Category
0         0       0                   0      0          0         0               0
1         0       1                   0      0          0         1               0
2         0       3                   0      0          0         2               0
3         0       2                   0      0          0         3               0
4         0       0                   0      0          1         4               0
5         0       0                   0      0          2         5               0


>>> pd.unique(CountryDF.Target.ravel())
array([nan, 'Elder', 'Others', 'Lady'], dtype=object)
>>> pd.unique(df3.Target.ravel())
array([0, 1, 3, 2])
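
One detail in the df3 output above may be worth a look: the Category column contains only NaN, yet it is encoded as 0 through 5. A possible explanation, which the thread does not confirm, is that NaN does not compare equal to itself, so each missing value can end up as its own class. Under that assumption, replacing NaN with a single placeholder before encoding should give every missing value the same code. The sketch below uses a small made-up frame, and the placeholder string 'missing' is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of CountryDF: one column with labels and gaps,
# one column that is entirely missing
df = pd.DataFrame({
    'Target': [np.nan, 'Elder', 'Others', 'Lady', np.nan, np.nan],
    'Category': [np.nan] * 6,
}, dtype=object)

# Replace every NaN with one placeholder string so all missing
# values fall into the same class
filled = df.fillna('missing')

# Encode each column with its own freshly fitted LabelEncoder
encoded = filled.apply(lambda col: LabelEncoder().fit_transform(col))

print(encoded)
```

With the placeholder in place, the all-missing column collapses to a single code, and Target gets exactly four codes (three labels plus the placeholder).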