Naive Bayes text classification problem with two classes

Time: 2018-04-15 17:52:39

Tags: r machine-learning classification text-mining naivebayes

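I am trying to implement a Naive Bayes classifier on a dataset that contains text data from customer complaint forms (Complaint) and Reddit comments (General_Text). The full set has 250,000 texts per class, but in the example posted here I use only 1,000 texts per class. I get the same result as with the full dataset. I preprocessed the text beforehand with the "tm" package, so that should not be the problem!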

The data frame has the following structure, with 1,000 Complaint and 1,000 General_Text entries:

type              text
"General_Text"    "random words"
"Complaint"       "other random words"

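For the classification task, I split the data into a training set that the algorithm learns from and a test set for measuring accuracy. The Naive Bayes algorithm comes from the "e1071" library.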

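library(plyr)
library(tm)        # Corpus, DocumentTermMatrix, removeSparseTerms
library(e1071)     # naiveBayes
library(caret)     # createDataPartition
library(MLmetrics) # ConfusionMatrix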

# Import data and rename columns into $type and $text
General_Text <- read.csv("General_Text.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
Complaints <- read.csv("Complaints.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
Data <- rbind(General_Text, Complaints)
colnames(Data) <- c("type", "text")

# $type as factor and $text as string
Data$text <- iconv(Data$text, to = "UTF-8")  # iconv() takes from/to, not "encoding"
Data$type <- factor(Data$type)

# Split the data into training set (1400 texts) and test set (600 texts)
set.seed(1234)
trainIndex <- createDataPartition(Data$type, p = 0.7, list = FALSE, times = 1)
trainData <- Data[trainIndex,]
testData <- Data[-trainIndex,]

# Create corpus for training data
corpus<- Corpus(VectorSource(trainData$text))

# Create Document Term Matrix for training data
# Keep only terms that occur in at least 2 documents (tm's "bounds" option)
docs_dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(2, Inf))))

# Remove Sparse Terms in DTM
docs_dtm_train <- removeSparseTerms(docs_dtm , 0.97)    

# Convert counts into "Yes" or "No"
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return(x)
}

# Apply convert_counts function to the training data
docs_dtm_train <- apply(docs_dtm_train, MARGIN = 2, convert_counts) 

# Create Corpus for test set
corpus_2 <- Corpus(VectorSource(testData$text))

# Create Document Term Matrix for test data
docs_dtm_2 <- DocumentTermMatrix(corpus_2, control = list(bounds = list(global = c(2, Inf))))

# Remove Sparse Terms in DTM
docs_dtm_test <- removeSparseTerms(docs_dtm_2, 0.97)

# Apply convert_counts function to the test data
docs_dtm_test <- apply(docs_dtm_test, MARGIN = 2, convert_counts)

# Naive Bayes Classification
nb_classifier <- naiveBayes(docs_dtm_train, trainData$type)
nb_test_pred <- predict(nb_classifier, newdata = docs_dtm_test)

# Output as Confusion Matrix
ConfusionMatrix(nb_test_pred, testData$type)

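I'm sorry that I can't share the data and therefore can't provide a reproducible example. The result the code gives is very frustrating: it labels every text as Complaint and none as General_Text.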

 > ConfusionMatrix(nb_test_pred, testData$type)
          y_pred
 y_true         Complaint General_Text
 Complaint          300            0
 General_Text       300            0
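
One thing that might be worth checking here (an assumption on my part, not a confirmed diagnosis): the training and test document-term matrices are built independently, so their term columns need not match, and the classifier may be asked to predict on features it never saw. The base-R sketch below uses made-up toy texts and hypothetical helpers (`tokenize`, `vocab_of`, `term_presence`) to show the mismatch and how building the test matrix over the training vocabulary removes it. In "tm", the `dictionary` control option of `DocumentTermMatrix` (for example with `Terms(docs_dtm_train)`) should serve the same purpose.

```r
# Toy texts, made up for illustration only
train_text <- c("refund my order", "great game last night")
test_text  <- c("order never arrived", "night game was fun")

# Hypothetical helper: split texts into lowercase word tokens
tokenize <- function(texts) strsplit(tolower(texts), "[^a-z]+")

# Hypothetical helper: sorted unique terms of a set of texts
vocab_of <- function(texts) sort(unique(unlist(tokenize(texts))))

# Hypothetical helper: 0/1 term-presence matrix over a fixed vocabulary
term_presence <- function(texts, vocab) {
  rows <- lapply(tokenize(texts), function(tok) as.integer(vocab %in% tok))
  m <- do.call(rbind, rows)
  colnames(m) <- vocab
  m
}

train_vocab <- vocab_of(train_text)
train_m     <- term_presence(train_text, train_vocab)

# Built independently, the test matrix gets its own, different columns
test_own <- term_presence(test_text, vocab_of(test_text))
unseen   <- setdiff(colnames(test_own), colnames(train_m))

# Built over the training vocabulary, the columns line up with the model
test_aligned <- term_presence(test_text, train_vocab)
same_columns <- identical(colnames(test_aligned), colnames(train_m))
```

Comparing `colnames()` of the real `docs_dtm_train` and `docs_dtm_test` in the same way would show quickly whether this applies to the data in question.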

I also get the following warning message: In data.matrix(newdata) : NAs introduced by coercion
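
A possibly related detail, shown as a minimal base-R sketch on a made-up count matrix: `apply()` with a function that returns a factor gives back a character matrix of "No"/"Yes" strings, not factors, and coercing such strings to numbers yields `NA` with the same "NAs introduced by coercion" wording. Whether this is what happens inside `predict()` here is an assumption. The data-frame variant at the end is one way to keep real factor columns.

```r
# Made-up term counts: 2 documents x 2 terms
counts <- matrix(c(0, 2, 1, 0), nrow = 2,
                 dimnames = list(NULL, c("refund", "thanks")))

# Same conversion as in the question
convert_counts <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
}

# apply() drops the factor class and returns a character matrix
yes_no <- apply(counts, MARGIN = 2, convert_counts)
is.character(yes_no)

# Coercing "No"/"Yes" to numeric gives NA (warning suppressed here)
coerced <- suppressWarnings(as.numeric(yes_no))

# Converting column by column into a data frame keeps the factors
yes_no_df <- as.data.frame(lapply(as.data.frame(counts), convert_counts))
all_factors <- all(sapply(yes_no_df, is.factor))
```

Running `str()` on `docs_dtm_train` and `docs_dtm_test` would show which of the two forms the real objects have.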

If anyone has run into a similar problem, could someone clarify whether I made a mistake in my code, or give me a heads-up?

0 answers:

No answers yet