I'm trying to do some text mining. The main goal is to take the words from the data.frame below, but group together words that share the same root:
+-------------+------+
| word | freq |
+-------------+------+
| best | 897 |
| see | 768 |
| received | 701 |
| questions | 686 |
| contact | 663 |
| use | 659 |
| seat | 643 |
| information | 640 |
| shipping | 617 |
| help | 589 |
| want | 577 |
| discount | 549 |
| purchase | 545 |
| code | 528 |
| team | 524 |
| sale | 503 |
| unsubscribe | 460 |
| website | 426 |
| love | 414 |
| buy | 399 |
| ’m | 394 |
| furniture | 388 |
| return | 387 |
| privacy | 385 |
| looking | 383 |
| customer | 382 |
| receive | 380 |
| fabric | 375 |
| interested | 370 |
| delivery | 348 |
| intended | 322 |
| ship | 322 |
| financing | 314 |
| • | 314 |
+-------------+------+
The best example is received and receive. I'd like the end result to look like this:
+----------+------+
| word | freq |
+----------+------+
| best | 897 |
| see | 768 |
| received | 1081 |
+----------+------+
So received and receive are now combined into a single entry, with their frequencies summed. Also, how can I remove entries like ’m and •?
Answer 0 (score: 0)
Personally, I'd suggest using a different lemmatizer. For example, the one provided by spaCy can be used from R via spacyr:
# install.packages("spacyr")
library("spacyr")
# install spacy if running for first time
# spacy_install()
spacy_initialize()
spacy_parse("received and receive")
  doc_id sentence_id token_id    token   lemma   pos entity
1  text1           1        1 received receive  VERB
2  text1           1        2      and     and CCONJ
3  text1           1        3  receive receive  VERB
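Once each word has a lemma, you can sum the frequencies per lemma and drop the non-word tokens. A minimal sketch in base R (the lemma column here is hard-coded for illustration; in practice it would come from joining your frequency table against the spacy_parse() output):

```r
# Illustrative data: a few rows from the frequency table, plus an assumed
# lemma column (in practice produced by spacy_parse()).
df <- data.frame(
  word  = c("received", "receive", "best", "\u2019m", "\u2022"),
  freq  = c(701, 380, 897, 394, 314),
  lemma = c("receive", "receive", "best", "\u2019m", "\u2022"),
  stringsAsFactors = FALSE
)

# Drop tokens that are not plain words, e.g. ’m and •
df <- df[grepl("^[a-z]+$", df$lemma), ]

# Sum the frequencies of all words sharing the same lemma
res <- aggregate(freq ~ lemma, data = df, FUN = sum)
res
#>     lemma freq
#> 1    best  897
#> 2 receive 1081
```

This collapses received and receive into one row with freq 701 + 380 = 1081, matching the desired output above.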