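I have a lot of phrases (single-word and multi-word; some of them overlap) and a lot of documents. In the end, I only want to store, for each document, the list of phrases it contains (taken from the large phrase list) rather than the whole document. What is an efficient way to achieve this? (Preferably in Python.)
Example: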
phrase_list = ['cat', 'dog', 'tree', 'tree house']  # actually a few thousand, if not a million
# a few thousand documents with longer text
doc_dictionary = {'doc1': """the cat sat under the tree""",
                  'doc2': """the dog chased the cat""",
                  'doc3': """the boy loves his tree house"""}
# desired result: for each document, only the phrases it contains
result_dict = {'doc1': ['cat', 'tree'], 'doc2': ['dog', 'cat'], 'doc3': ['tree house']}
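
To make the desired behaviour concrete, here is a minimal sketch of one possible approach, not a definitive answer. It assumes that phrases should match on whole lowercase word tokens, and that when phrases overlap the longest one wins (which is what the 'tree house' entry in the expected result suggests). The helper names tokenize, build_phrase_index and extract_phrases are made up for this illustration. Each phrase is stored as a tuple of tokens in a set, so every lookup is a hash check and the cost per document depends on its length and on the longest phrase, not on the number of phrases.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (assumption: \\w+ is a good enough tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_phrase_index(phrases):
    """Return (set of token tuples, length of the longest phrase in tokens)."""
    index = {tuple(tokenize(p)) for p in phrases}
    index.discard(())  # ignore phrases that contain no word characters
    max_len = max((len(t) for t in index), default=0)
    return index, max_len


def extract_phrases(text, index, max_len):
    """List the known phrases found in text, in order of first appearance.

    At each token position the longest matching phrase is taken and the
    scan jumps past it, so 'tree house' suppresses the shorter 'tree'.
    """
    tokens = tokenize(text)
    found, seen = [], set()
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in index:
                phrase = " ".join(candidate)
                if phrase not in seen:
                    seen.add(phrase)
                    found.append(phrase)
                step = n
                break
        i += step
    return found


phrase_list = ['cat', 'dog', 'tree', 'tree house']
doc_dictionary = {'doc1': "the cat sat under the tree",
                  'doc2': "the dog chased the cat",
                  'doc3': "the boy loves his tree house"}

index, max_len = build_phrase_index(phrase_list)
result_dict = {name: extract_phrases(text, index, max_len)
               for name, text in doc_dictionary.items()}
```

If overlapping phrases should all be reported (both 'tree' and 'tree house'), the jump past the match would have to be dropped. For very large phrase lists, a trie or an Aho-Corasick automaton (for example via the pyahocorasick package) may be worth comparing against this set-based version.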