POS tagging & R

Time: 2015-09-08 00:14:09

Tags: r themes text-analysis pos-tagger

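I am new to R and am exploring text mining. With the steps below I can get as far as stemming, but I also need to do POS tagging and extract text/theme patterns. The data I am working with is customer verbatims. Please help with how to proceed. Most of the articles I have checked do not explain how to POS-tag the data in a Corpus, and I could not find any details on pattern detection. Any help would be much appreciated. Thanks in advance.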

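library(tm)         # Corpus, tm_map, TermDocumentMatrix, stopwords
library(qdap)       # sent_detect_nlp
library(SnowballC)  # stemmer used by stemDocument

CSVfile = read.csv("Testfortextcsv.csv", stringsAsFactors = FALSE)
# Split the comments into sentences, one sentence per row
TestSplit = data.frame(Comment = sent_detect_nlp(CSVfile$Comment), stringsAsFactors = FALSE)
TestCorpus = Corpus(VectorSource(TestSplit$Comment))
# Wrap base functions in content_transformer() so the corpus structure is kept;
# the extra PlainTextDocument step is then no longer needed
TestCorpus = tm_map(TestCorpus, content_transformer(tolower))
TestCorpus = tm_map(TestCorpus, removePunctuation)
# The text is already lower case, so the custom stop word must be lower case too
TestCorpus = tm_map(TestCorpus, removeWords, c("test", stopwords("SMART"), stopwords("english")))
TestCorpus = tm_map(TestCorpus, stripWhitespace)
TestCorpus = tm_map(TestCorpus, stemDocument)
# Despite its name, dtm is a term-document matrix (terms in rows)
dtm <- TermDocumentMatrix(TestCorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)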

This is what I use to get the word cloud, the associations and the bar plot.

WordCloud
----------
library(wordcloud)
library(RColorBrewer)  # brewer.pal
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 200,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

Find Frequent Terms
-----------------
findFreqTerms(dtm, lowfreq = 15)

Find Association:
-----------------------
findAssocs(dtm, terms = "account", corlimit = 0.3)

Bar Plot for frequencies
--------------------------
barplot(d[1:10,]$freq, las = 2, names.arg = d[1:10,]$word,col ="lightblue", main ="Most frequent words",ylab = "Word frequencies")

1 answer:

Answer 0: (score: 2)

The qdap package lets you identify the part of speech of each word in a string:

library(qdap)
s1<-c("Hello World")  
pos(s1)
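
The same call should also accept a whole character vector, so a rough sketch for the sentence column from the question could look like the one below. The sample sentences are made up, and the element names `POStagged` and `POSfreq` are taken from my reading of the qdap documentation, so check them with `names(tagged)` on your install. Tagging is probably best done on the raw sentences, before lower-casing, stop word removal and stemming, since a tagger relies on the full sentence context.

```r
# Hypothetical sample sentences standing in for TestSplit$Comment.
comments <- c("The account was closed without notice.",
              "Customer service was very helpful.")

if (requireNamespace("qdap", quietly = TRUE)) {
  # Tag every sentence; progress.bar = FALSE just keeps the console quiet.
  tagged <- qdap::pos(comments, progress.bar = FALSE)
  # Expected: one row per sentence, with tokens written as word/TAG.
  print(tagged$POStagged)
  # Expected: counts of each tag per sentence.
  print(tagged$POSfreq)
} else {
  message("qdap is not installed; skipping the tagging step")
}
```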

You may also find other resources useful, such as openNLP and RTextTools, as well as another possibility
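
For the pattern-detection part of the question, one simple option is to search the tagged text for tag sequences. The sketch below uses only base R and assumes the tagger output has already been collected as strings of `word/TAG` tokens with Penn Treebank tags. The sample strings, the `pattern` regular expression and the helper `extract_adj_noun` are illustrative, not part of any package.

```r
# Hypothetical tagged sentences in word/TAG form (Penn Treebank tags).
tagged <- c("the/DT new/JJ account/NN was/VBD closed/VBN",
            "great/JJ service/NN and/CC friendly/JJ staff/NN")

# An adjective (JJ, JJR, JJS) immediately followed by a noun (NN, NNS, NNP, NNPS).
pattern <- "[^ /]+/JJ[RS]? [^ /]+/NN[PS]*"

extract_adj_noun <- function(x) {
  # All non-overlapping matches of the tag pattern, one list element per sentence.
  hits <- regmatches(x, gregexpr(pattern, x))
  # Drop the /TAG suffixes so only the words remain.
  lapply(hits, function(h) gsub("/[A-Z]+", "", h))
}

phrases <- extract_adj_noun(tagged)
# Frequency of each adjective-noun phrase across all sentences, most common first.
sort(table(unlist(phrases)), decreasing = TRUE)
```

The resulting phrase counts can be fed into the same word cloud or bar plot code as the single-word frequencies. Other tag sequences, for example noun followed by verb, only need a different `pattern`.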