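I am trying to implement a Naive Bayes classifier on a dataset that contains text data from customer complaint forms (Complaint) and Reddit comments (General_Text). The full set has 250,000 texts per class. However, in the example posted here I use only 1,000 texts per class, and I get the same result as with the full dataset. I did the text preprocessing beforehand with the "tm" package, so that should not be the problem.
The data frame is structured as follows, with 1,000 Complaint and 1,000 General_Text entries: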
type text
"General_Text" "random words"
"Complaint" "other random words"
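For the classification task, I split the data into a training set that the algorithm learns from and a test set used to measure accuracy. The Naive Bayes algorithm comes from the "e1071" library.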
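library(plyr)
library(tm)        # Corpus, DocumentTermMatrix, removeSparseTerms
library(e1071)     # naiveBayes
library(caret)     # createDataPartition
library(MLmetrics) # ConfusionMatrix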
# Import data and rename columns into $type and $text
General_Text <- read.csv("General_Text.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
Complaints <- read.csv("Complaints.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
Data <- rbind(General_Text, Complaints)
colnames(Data) <- c("type", "text")
# $type as factor and $text as string
Data$text <- iconv(Data$text, to = "UTF-8")  # iconv() takes from/to, not encoding
Data$type <- factor(Data$type)
# Split the data into training set (1400 texts) and test set (600 texts)
set.seed(1234)
trainIndex <- createDataPartition(Data$type, p = 0.7, list = FALSE, times = 1)
trainData <- Data[trainIndex,]
testData <- Data[-trainIndex,]
# Create corpus for training data
corpus<- Corpus(VectorSource(trainData$text))
# Create Document Term Matrix for training data
docs_dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(2, Inf))))
# Remove Sparse Terms in DTM
docs_dtm_train <- removeSparseTerms(docs_dtm , 0.97)
# Convert counts into "Yes" or "No"
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}
# Apply convert_counts function to the training data
docs_dtm_train <- apply(docs_dtm_train, MARGIN = 2, convert_counts)
# Create Corpus for test set
corpus_2 <- Corpus(VectorSource(testData$text))
# Create Document Term Matrix for test data
docs_dtm_2 <- DocumentTermMatrix(corpus_2, control = list(bounds = list(global = c(2, Inf))))
# Remove Sparse Terms in DTM
docs_dtm_test <- removeSparseTerms(docs_dtm_2, 0.97)
# Apply convert_counts function to the test data
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)
# Naive Bayes Classification
nb_classifier <- naiveBayes(docs_dtm_train, trainData$type)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)
# Output as Confusion Matrix
ConfusionMatrix(nb_test_pred, testData$type)
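Sorry, I cannot share the data, so this is not a fully reproducible example. The result the code gives is very frustrating: it classifies every text as Complaint and none as General_Text.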
> ConfusionMatrix(nb_test_pred, testData$type)
              y_pred
y_true         Complaint General_Text
  Complaint          300            0
  General_Text       300            0
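One thing that may be worth checking, though it is not confirmed as the cause here: the test DTM is built independently of the training DTM, so its columns (terms) need not match the terms the classifier was trained on. In tm, the `dictionary` entry of `control` is one way to restrict a DTM to a fixed set of terms. The base R sketch below shows the same idea on toy matrices; `align_to_vocab` and the toy data are hypothetical and not part of the code above.

```r
# Toy count matrices standing in for the train and test DTMs (hypothetical data)
train_counts <- matrix(c(1, 0, 0, 2), nrow = 2,
                       dimnames = list(NULL, c("refund", "thanks")))
test_counts  <- matrix(c(0, 1, 3, 0), nrow = 2,
                       dimnames = list(NULL, c("thanks", "pizza")))

# Hypothetical helper: give new_counts exactly the columns listed in vocab.
# Terms missing from new_counts become all-zero columns; extra terms are dropped.
align_to_vocab <- function(new_counts, vocab) {
  aligned <- matrix(0, nrow = nrow(new_counts), ncol = length(vocab),
                    dimnames = list(rownames(new_counts), vocab))
  shared <- intersect(vocab, colnames(new_counts))
  aligned[, shared] <- new_counts[, shared, drop = FALSE]
  aligned
}

test_aligned <- align_to_vocab(test_counts, colnames(train_counts))
```

After this step the test matrix has the same columns, in the same order, as the training matrix, so each term the model knows about can be looked up by name.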
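I also get the following warning message: In data.matrix(newdata) : NAs introduced by coercion
If anyone has run into a similar problem, could you clarify whether I made a mistake in my code, or give me a heads-up?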
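A possible explanation for that message, not verified against this data: `apply()` does not keep the factor class returned by `convert_counts`, so the result is a character matrix of "No"/"Yes" strings, and a later numeric coercion of those strings may yield NAs. The minimal sketch below uses a toy count matrix (hypothetical data) to show the difference between `apply()` and a column-wise conversion that keeps factors.

```r
# Toy count matrix standing in for a document-term matrix (hypothetical data)
counts <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 3,
                 dimnames = list(NULL, c("refund", "thanks")))

convert_counts <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
}

# apply() drops the factor class and returns a character matrix
as_matrix <- apply(counts, MARGIN = 2, convert_counts)

# Converting column by column keeps each term as a factor with both levels
as_factors <- as.data.frame(lapply(as.data.frame(counts), convert_counts))

str(as_matrix)
str(as_factors)
```

Whether passing a data frame of factors like `as_factors` to `naiveBayes()` and `predict()` removes the warning would still need to be checked on the real data.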