Question

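I have a training dataset containing 144 feedback entries: 72 positive and 72 negative. There are two target labels, positive and negative. Consider the following code: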

import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data) 
                     data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data)
X = cv.transform(feedback_data)
X_test = cv.transform(feedback_data_test)

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i<72 else 0 for i in range(144)]
# the below line gives error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)

I don't understand what the problem is. Please help.

Answer 1

You are not using the count vectorizer correctly. You never transform the individual rows, and the vectorizer is not even fitted properly, because you passed it the entire DataFrame rather than just the corpus of comments. This is what you have now:

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(df)
X = cv.transform(df)
X
<2x2 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>
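
The 2x2 shape follows from how a DataFrame iterates: looping over it yields the column labels, not the rows, so the vectorizer sees only two "documents", the strings 'data' and 'target'. The following is a minimal sketch of that behavior, assuming two rows copied from the question's sample output stand in for output.csv; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: two rows from the question's printed sample replace output.csv.
df = pd.DataFrame({
    "data": ["lectures are very lengthy.", "good at teaching."],
    "target": ["negative", "positive"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
columns_seen = list(df)

# Fitting on the DataFrame therefore treats 'data' and 'target' as the documents.
cv = CountVectorizer(binary=True)
X_wrong = cv.fit_transform(df)
vocab_wrong = sorted(cv.vocabulary_)

# Fitting on the text column gives one row per feedback entry.
X_right = CountVectorizer(binary=True).fit_transform(df["data"])

print(columns_seen)
print(vocab_wrong, X_wrong.shape)
print(X_right.shape)
```

With the full 144-row file the same thing happens, which is why the split sees 2 samples in X against 144 labels in target.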

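So you are not getting what you want. To fix this, first make sure the counting works by fitting the vectorizer on the correct corpus, the data column: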

cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = cv.transform(df)
X
<2x23 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>

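We are getting closer: the vocabulary now has 23 terms. X still has only 2 rows, though, because transform was again given the whole DataFrame. We just need to transform it correctly, row by row: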

cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda x: cv.transform([x])).values
X
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>,
       ...
       <1x23 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>], dtype=object)

We now have a more suitable X, with one entry per row, so it can be split. It works!
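
One caveat about the row-by-row result: it is an object array of separate 1 x n sparse matrices, not a single feature matrix, and scikit-learn estimators generally expect one 2-D array or sparse matrix. The sketch below, assuming three rows from the question's sample as stand-in data, stacks the per-row matrices with scipy.sparse.vstack and checks that the result equals transforming the column in a single call. A list comprehension is used here in place of the apply call; it produces the same per-row matrices.

```python
import pandas as pd
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: three rows from the question's printed sample replace output.csv.
df = pd.DataFrame({
    "data": [
        "lectures are very lengthy.",
        "good at clearing the concepts.",
        "good at teaching.",
    ],
    "target": ["negative", "positive", "positive"],
})

cv = CountVectorizer(binary=True)
cv.fit(df["data"].values)

# One 1 x n_features sparse matrix per feedback row.
rows = [cv.transform([text]) for text in df["data"]]

# Stack the per-row matrices into a single (n_samples, n_features) matrix.
X_stacked = vstack(rows).tocsr()

# Transforming the whole column at once produces the same matrix directly.
X_direct = cv.transform(df["data"])

same = bool((X_stacked.toarray() == X_direct.toarray()).all())
print(X_stacked.shape, same)
```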

Make sure you understand what CountVectorizer does so that you can use it correctly.
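
For reference, here is a hedged end-to-end sketch of the question's pipeline with the text column passed to the vectorizer. It assumes the eight sample rows printed in the question stand in for output.csv. Deriving the labels from the target column (rather than from row positions), stratify=y and random_state=0 are additions for this small example, not part of the original code.

```python
import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: the eight rows shown in the question replace output.csv.
feedback_data = pd.DataFrame({
    "data": [
        "facilitates good student teacher communication.",
        "lectures are very lengthy.",
        "the teacher is very good at interaction.",
        "good at clearing the concepts.",
        "good at clearing the concepts.",
        "good at teaching.",
        "does not shows test copies.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive", "positive",
               "positive", "positive", "negative", "positive"],
})

# Fit and transform on the text column: one row of X per feedback entry.
cv = CountVectorizer(binary=True)
X = cv.fit_transform(feedback_data["data"])

# Labels come from the target column, so they always line up with the rows of X.
y = (feedback_data["target"] == "positive").astype(int)

# stratify and random_state are assumptions added to keep both classes in each half.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.50, stratify=y, random_state=0
)

clf = svm.SVC(kernel="linear")
clf.fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print(X.shape, X_train.shape, X_val.shape, val_accuracy)
```

Because X and y now have the same number of samples, train_test_split no longer raises the inconsistent-samples error.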

Found input variables with inconsistent numbers of samples: [2, 144]
