I'm trying to do some text mining. The main goal is to take the words from the data.frame below, but group together words that share the same root:
+-------------+------+
| word | freq |
+-------------+------+
| best | 897 |
| see | 768 |
| received | 701 |
| questions | 686 |
| contact | 663 |
| use | 659 |
| seat | 643 |
| information | 640 |
| shipping | 617 |
| help | 589 |
| want | 577 |
| discount | 549 |
| purchase | 545 |
| code | 528 |
| team | 524 |
| sale | 503 |
| unsubscribe | 460 |
| website | 426 |
| love | 414 |
| buy | 399 |
| ’m | 394 |
| furniture | 388 |
| return | 387 |
| privacy | 385 |
| looking | 383 |
| customer | 382 |
| receive | 380 |
| fabric | 375 |
| interested | 370 |
| delivery | 348 |
| intended | 322 |
| ship | 322 |
| financing | 314 |
| • | 314 |
+-------------+------+
The best example is received and receive. I'd like the end result to look like this:
+----------+------+
| word | freq |
+----------+------+
| best | 897 |
| see | 768 |
| received | 1081 |
+----------+------+
So received and receive are now combined into a single entry, with their frequencies summed. Also, how can I remove entries like ’m and •?
Answer 0 (score: 0)
Personally, I'd suggest using a different lemmatizer. For example, the one provided by spaCy can be used from R via spacyr:
# install.packages("spacyr")
library("spacyr")
# install spacy if running for first time
# spacy_install()
spacy_initialize()
spacy_parse("received and receive")
  doc_id sentence_id token_id    token   lemma   pos entity
1  text1           1        1 received receive  VERB
2  text1           1        2      and     and CCONJ
3  text1           1        3  receive receive  VERB
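Once each word has a lemma, you can sum the frequencies per lemma and drop the non-word tokens. A minimal sketch in base R (the lemma column here is hard-coded for illustration; in practice it would come from joining your frequency table against the spacy_parse() output):

```r
# Illustrative data: a few rows from the frequency table, plus an assumed
# lemma column (in practice produced by spacy_parse()).
df <- data.frame(
  word  = c("received", "receive", "best", "\u2019m", "\u2022"),
  freq  = c(701, 380, 897, 394, 314),
  lemma = c("receive", "receive", "best", "\u2019m", "\u2022"),
  stringsAsFactors = FALSE
)

# Drop tokens that are not plain words, e.g. ’m and •
df <- df[grepl("^[a-z]+$", df$lemma), ]

# Sum the frequencies of all words sharing the same lemma
res <- aggregate(freq ~ lemma, data = df, FUN = sum)
res
#>     lemma freq
#> 1    best  897
#> 2 receive 1081
```

This collapses received and receive into one row with freq 701 + 380 = 1081, matching the desired output above.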