Question

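I have a dataset with categorical data whose categories carry different weights; for example, Phd ranks higher than Masters, and MSc ranks higher than Bsc.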

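I know about LabelEncoder, but I don't want Python to assign codes to these values arbitrarily. I want the higher level to get the higher code: Phd = 4, Msc = 3, Bsc = 2, O Levels = 1 and No education = 0.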

Is there any way I can do this? Can anyone help?

Answer 1

LabelEncoder encodes the categories in alphabetical (sorted) order and stores them in the classes_ attribute. By default it looks like this:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['Phd', 'Msc','Bsc', 'O Levels','No education'])
le.classes_
# Output: array(['Bsc', 'Msc', 'No education', 'O Levels', 'Phd'], dtype='|S12')

How many categories do you have? If there are only a few, you can do the conversion with a dict, similar to this answer here:

import numpy as np

# Explicit label -> code mapping, highest qualification gets the highest code
my_dict = {'Phd':4, 'Msc':3 , 'Bsc':2, 'O Levels':1, 'No education':0}

y = ['No education', 'O Levels','Bsc', 'Msc','Phd']
np.vectorize(my_dict.get)(y)

# Output: array([0, 1, 2, 3, 4])
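
One thing to watch with the dict approach: my_dict.get returns None for any label that is not in the dict, so a typo in the data can slip through unnoticed. Below is a minimal pure-Python sketch of the same idea that fails loudly instead. The names EDUCATION_ORDER and encode_ordinal are hypothetical, made up for this illustration, and it assumes you can list the categories from lowest to highest.

```python
# Hypothetical names for illustration; not part of scikit-learn or NumPy.
# Categories listed from lowest to highest, so the list index is the code.
EDUCATION_ORDER = ['No education', 'O Levels', 'Bsc', 'Msc', 'Phd']

def encode_ordinal(values, order):
    """Map each label to its position in `order`; raise on unknown labels."""
    codes = {label: rank for rank, label in enumerate(order)}
    unknown = sorted(set(values) - set(codes))
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    return [codes[v] for v in values]

y = ['Phd', 'Bsc', 'No education', 'Msc', 'O Levels']
encoded = encode_ordinal(y, EDUCATION_ORDER)
print(encoded)  # [4, 2, 0, 3, 1]
```

Building the codes from an ordered list rather than typing the numbers by hand keeps the ranking in one place, so adding a level later only means inserting it at the right position.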
