Question

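I am trying to write code that uses Python (and scikit-learn) to classify my bank transactions automatically. So far I have 70 categories (labels) and about 1.7k classified transactions. I have roughly 3.5k rows in total, but not all of them are well organized yet, and this is my first attempt. Basically, I have imported a CSV file with the following content: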

Description             | Value | Label
RSHOP-SABORES DA -25/04 | -30   | Restaurants
RSHOP-MERCATTINO -28/04 | -23   | Bars
RSHOP-HORTISABOR -07/05 | -65   | Supermarket
TBI 3712.06663-9 tokpag | 1.000 | Salary

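Description and Value are my features, and Label is my label. The descriptions are a bit messy because of the varying characters and so on. So I learned that I should vectorize the descriptions with TF-IDF and encode the labels with LabelEncoder.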

Currently, I have:

# Loads data
data = pd.read_csv('classifications.csv',
                    encoding='latin1',
                    error_bad_lines=False,
                    delimiter=';')

# Assigns features and labels - I chose to use only the description to make it simpler for a first time. I want to use the value later as well.
data.columns = ['desc', 'value', 'label']
data_base    = data.values
features_base= data_base[:,[0]]
labels_base  = data_base[:,[2]]

# Printing features returns a (1722,1) array - looks good.
print(features_base.shape)

# Printing labels returns a (1722,1) array - looks good.
print(labels_base.shape)

# Encodes labels, printing returns (1722,) - don't know why the "1" is missing on the y.
encoder       = LabelEncoder()
label_encoded = encoder.fit_transform((labels_base.astype(str)).ravel())
print(label_encoded.shape)

# Encodes features. Printing returns (1722, 1012) - don't know what's the "1012" on the y axis... the only thing I can think of the number of unique values on the vector, but can't be sure.
vectorizer = TfidfVectorizer()
vectors     = vectorizer.fit_transform(features_base.ravel().astype('U'))
print(vectors.shape)


#Test
train_features, train_labels, test_features, test_labels = tts(vectors, label_encoded, test_size=0.2)

Then I try several estimators, each of which fails with a different error (written in the first comment line of each block):

# Random Forest Classifier - returns "ValueError: Unknown label type: 'continuous-multioutput'"
clf1 = RandomForestClassifier()
print("Using", clf1)
clf1.fit(train_features.toarray(), train_labels.toarray())
predictions1 = clf1.predict(test_features)
print( "\nPredictions:", predictions1)
score = 0
for i in range(len(predictions1)):
    if predictions[i] == test_labels[i]:
        score += 1
print("Accuracy:", (score / len(predictions)) * 100, "%")


# Decision Tree Classifier - returns "ValueError: Unknown label type: 'continuous-multioutput'"
clf2 = tree.DecisionTreeClassifier()
print("Using", clf2)
clf2.fit(train_features.toarray(), train_labels.toarray())
predictions2 = clf2.predict(test_features)
print( "\nPredictions:", predictions2)
score = 0
for i in range(len(predictions2)):
    if predictions[i] == test_labels[i]:
        score += 1
print("Accuracy:", (score / len(predictions)) * 100, "%")


#SVC Linear - returns "ValueError: bad input shape (345, 1012)"
clf3 = svm.SVC(kernel='linear')
print("Using", clf3)
clf3.fit(train_features, train_labels)
predictions3 = clf3.predict(test_features)
print( "\nPredictions:", predictions3)
score = 0
for i in range(len(predictions1)):
    if predictions[i] == test_labels[i]:
        score += 1
print("Accuracy:", (score / len(predictions)) * 100, "%")


# SVC Non Linear - returns "ValueError: bad input shape (345, 1012)"
clf4 = svm.SVC()
print("Using", clf4)
clf4.fit(train_features.toarray(), train_labels.toarray())
predictions4 = clf4.predict(test_features)
print( "\nPredictions:", predictions4)
score = 0
for i in range(len(predictions1)):
    if predictions[i] == test_labels[i]:
        score += 1
print("Accuracy:", (score / len(predictions)) * 100, "%")

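The end goal is to load a CSV file containing description/value pairs and have the code suggest a label for each one (knowing how confident the suggestion is would be very nice).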

To summarize:

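Is my approach to vectorizing the description text reasonable? Any suggestions?
Am I right to use LabelEncoder to encode the labels?
What am I doing wrong? What are the errors in the code?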

Thank you very much.

Answer 1

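The problem is with the labels: since you are using pandas, you should pass them to the classifier as categorical data. I will post some code in a few minutes.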

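Update: OK, so there are a few problems with your code. When developing an ML model for a new task, it is advisable to start with a simple model and increase its complexity only once you have a working prototype. I have implemented the code for RandomForestClassifier only; you should be able to replicate it easily for the other classifiers you are interested in. Here it is: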

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('classifications.csv',
                    encoding='latin1',
                    on_bad_lines='skip',  # pandas >= 1.3; replaces error_bad_lines=False, removed in pandas 2.0
                    delimiter=';')

data.columns = ['desc', 'value', 'label']
data['label'] = data['label'].astype('category')
data.info()
vectorizer = TfidfVectorizer()
vectors    = vectorizer.fit_transform(data['desc'])
print('Shape: ',vectors.shape)

clf = RandomForestClassifier(random_state=42)

clf.fit(vectors,data['label'])
print('Score: {}'.format(clf.score(vectors,data['label'])))
clf.predict(vectorizer.transform(data['desc']))

The output of this code is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
desc     4 non-null object
value    4 non-null float64
label    4 non-null category
dtypes: category(1), float64(1), object(1)
memory usage: 340.0+ bytes
Shape:  (4, 14)
Score: 1.0
array(['Restaurants', 'Bars', 'Supermarket', 'Salary'], dtype=object)
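The second number in the printed shape is the number of distinct tokens that TfidfVectorizer learned from the descriptions, which is also what the 1012 in the question's (1722, 1012) corresponds to. A minimal sketch, assuming the four sample rows from the question's table stand in for the real CSV and that the vectorizer keeps its default settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The four sample descriptions from the question's table
# (a stand-in for the real CSV, which is not available here).
descriptions = [
    "RSHOP-SABORES DA -25/04",
    "RSHOP-MERCATTINO -28/04",
    "RSHOP-HORTISABOR -07/05",
    "TBI 3712.06663-9 tokpag",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(descriptions)

# Rows are documents; columns are the distinct tokens in the learned vocabulary.
n_docs, n_tokens = vectors.shape
vocabulary = sorted(vectorizer.vocabulary_)

print(vectors.shape)
print(vocabulary)
```

With the default tokenizer, text is lowercased and single-character tokens such as the trailing "9" are dropped, so date fragments like "25" and "04" become features of their own. Whether that helps classification depends on the data.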

Some comments:

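1) If you use pandas, classification labels are best stored as categorical data (pandas.Categorical). This reduces the chance that the classifier interprets the labels as ordinal data and tries to order its predictions.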

2) If you want to chain several sklearn objects, such as a vectorizer and a classifier, it is best to instantiate a Pipeline object by writing

from sklearn.pipeline import Pipeline
pipeline = Pipeline([('vectorizer',TfidfVectorizer()),
                     ('classifier',RandomForestClassifier())])

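This saves you from having to pass the output of the vectorizer's .transform or .fit_transform method to the classifier every time you feed it new data, because the pipeline does this automatically.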

3) Set random_state for randomized classifiers to make the results reproducible.

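4) It is not clear why you compute the score manually: the classifier's .score() method computes the mean accuracy for you and protects you from mistakes involving calls such as len(predictions). In other situations, when you predict a probability distribution rather than a single point prediction, a habit of calling len(predictions) can lead you to compute over the wrong dimension of the array without noticing. If you want the score as a percentage rather than a value between 0 and 1, just multiply the value returned by .score() by 100.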
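Points 2 to 4 can be combined in one small sketch. It assumes the four sample rows from the question's table as training data and uses a made-up new description (`new_descriptions` is hypothetical, not taken from the question). It also shows predict_proba, which is one way to get the per-label confidence the question asks about:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Sample rows from the question's table (a stand-in for the real CSV).
descriptions = [
    "RSHOP-SABORES DA -25/04",
    "RSHOP-MERCATTINO -28/04",
    "RSHOP-HORTISABOR -07/05",
    "TBI 3712.06663-9 tokpag",
]
labels = ["Restaurants", "Bars", "Supermarket", "Salary"]

# The pipeline vectorizes raw text and classifies it in one object.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(random_state=42)),
])
pipeline.fit(descriptions, labels)

# Mean accuracy on the training data itself, so not an estimate of
# how well the model generalizes to unseen transactions.
train_accuracy = pipeline.score(descriptions, labels)
print("Training accuracy: {:.0f}%".format(train_accuracy * 100))

# Hypothetical new transaction, invented for illustration.
new_descriptions = ["RSHOP-SABORES DA -02/06"]
suggested = pipeline.predict(new_descriptions)
probabilities = pipeline.predict_proba(new_descriptions)

# One probability per known label, in the order of classes_.
classes = pipeline.named_steps["classifier"].classes_
for label, probability in zip(classes, probabilities[0]):
    print("{}: {:.2f}".format(label, probability))
```

For a random forest, these probabilities are averaged over the trees, so they are a rough indication of confidence rather than a calibrated one.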

Hope this helps.

Answer 2

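A bad input shape error usually arises when the number of features used in training does not equal the number of features in the test data.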
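One likely source of such a mismatch in the question's code, assuming `tts` there is an alias for sklearn's train_test_split, is the order in which the return values are unpacked. train_test_split returns X_train, X_test, y_train, y_test, so unpacking into train_features, train_labels, test_features, test_labels puts the test feature matrix into train_labels. A minimal sketch with stand-in arrays of the shapes reported in the question (the zeros and the synthetic labels are placeholders, not real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the shapes reported in the question:
# 1722 transactions, 1012 TF-IDF columns, 70 label classes.
X = np.zeros((1722, 1012))
y = np.arange(1722) % 70

# Return order is: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

With test_size=0.2 the test split has 345 rows, so X_test has shape (345, 1012), the same shape named in the question's "bad input shape (345, 1012)" message.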

Machine learning: automatic classification of strings - "Unknown label type" and "Bad input shape"
