Question

Suppose I have a data frame like the following:

      A      B
0   bar    one
1   bar  three
2  flux    six
3   bar  three
4   foo   five
5  flux    one
6   foo    two

I would like to apply dummy-coding contrasting to it so that I get:

(i.e. map each unique value to a different integer, per column).

I tried using scikit-learn's DictVectorizer, but I get:

> from sklearn.feature_extraction import DictVectorizer as DV
> vectorizer        = DV( sparse = False )
> dict_to_vectorize = df.T.to_dict().values()
> df_vec            = vectorizer.fit_transform(dict_to_vectorize )
> df_vec
array([[ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

That is because scikit-learn's DictVectorizer is designed to output a one-of-K encoding. What I want is a simple integer encoding (one column per variable).

How can I do this with scikit-learn and/or pandas? Beyond that, are there any other Python packages that help with general contrasting methods?

Answer 1

You can use pd.factorize:

In [124]: df.apply(lambda x: pd.factorize(x)[0])
Out[124]: 
   A  B
0  0  0
1  0  1
2  1  2
3  0  1
4  2  3
5  1  0
6  2  4
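
pd.factorize returns the unique values alongside the codes, so the mapping can be kept and reversed later. A minimal sketch, assuming the data frame from the question (the names codes, labels and decoded are only illustrative):

```python
import pandas as pd

# Data frame from the question
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

codes = pd.DataFrame(index=df.index)
labels = {}
for col in df.columns:
    # factorize returns (integer codes, unique values in order of first appearance)
    codes[col], labels[col] = pd.factorize(df[col])

# Map the integer codes back to the original values
decoded = pd.DataFrame(
    {col: labels[col].take(codes[col].to_numpy()) for col in df.columns}
)

print(codes)
print(decoded)
```

The integers follow the order in which each value first appears in its column, not alphabetical order.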

Answer 2

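The patsy package provides all the contrasts you need (and the ability to make more) [1]. AFAIK, statsmodels is currently the only statistics package that uses the patsy formula framework [2, 3].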

[1] https://patsy.readthedocs.org/en/latest/API-reference.html#handling-categorical-data

[2] http://statsmodels.sourceforge.net/devel/contrasts.html

[3] http://statsmodels.sourceforge.net/devel/example_formulas.html
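
As a rough illustration of the patsy approach, the sketch below builds treatment (dummy) and sum (deviation) contrasts for column A of the question's data. It assumes patsy and numpy are installed and passes the data as a plain dict; the variable names are only illustrative:

```python
import numpy as np
from patsy import dmatrix

# Column A from the question, passed as a plain dict of lists
data = {"A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"]}

# Treatment (dummy) coding: the first level, "bar", is the reference
treatment = dmatrix("C(A, Treatment)", data)

# Sum (deviation) coding: the last level, "foo", is coded as -1 in every column
sum_coded = dmatrix("C(A, Sum)", data)

print(treatment.design_info.column_names)
print(np.asarray(treatment))
print(sum_coded.design_info.column_names)
print(np.asarray(sum_coded))
```

Both matrices include an intercept column by default; the contrast only decides how the remaining columns encode the levels.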

Answer 3

Dummy encoding is what you get when you call DictVectorizer. The integer encoding you want is actually something different:

sklearn.preprocessing.LabelBinarizer or DictVectorizer gives dummy encoding (pandas equivalent: pandas.get_dummies)
sklearn.preprocessing.LabelEncoder gives integer categorical encoding (pandas equivalent: pandas.factorize)
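
The two encodings can be compared side by side. A minimal sketch, assuming the data frame from the question and that pandas and scikit-learn are installed (the names dummies and integer_codes are only illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Data frame from the question
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "bar", "foo", "flux", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two"],
})

# Dummy (one-of-K) encoding: one indicator column per distinct value
dummies = pd.get_dummies(df)

# Integer encoding: one column per variable, one integer per distinct value.
# LabelEncoder handles a single column at a time, so fit one per column.
integer_codes = pd.DataFrame(
    {col: LabelEncoder().fit_transform(df[col]) for col in df.columns},
    index=df.index,
)

print(dummies.columns.tolist())
print(integer_codes)
```

Note that LabelEncoder numbers the values in sorted order, whereas pandas.factorize numbers them in order of first appearance, so the integers in column B differ between the two even though both are valid integer encodings.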

Vectorizing/contrasting a dataframe with categorical variables
