How to efficiently find phrases in documents

Time: 2011-05-05 17:00:48

Tags: python language-agnostic

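I have a lot of phrases (single-word and multi-word; some of them overlap) and a lot of documents. In the end, I want to store for each document only the list of phrases it contains (taken from the big phrase list), not the whole document. What is an efficient way to achieve this? (Preferably in Python.)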

Example:

phrase_list = ['cat', 'dog', 'tree', 'tree house']  # actually a few thousand, if not a million

# a list of a few thousand documents with longer text
doc_dictionary = {'doc1': """the cat sat under the tree""",
                  'doc2': """the dog chased the cat""",
                  'doc3': """the boy loves his tree house"""}

result_dict = {'doc1': ['cat', 'tree'], 'doc2': ['dog', 'cat'], 'doc3': ['tree house']}

1 answer:

Answer 0 (score: 2)

It sounds like you need an indexer and search engine, such as Lucene for Java. Perhaps PyLucene, which makes Lucene usable from Python, would help.
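
If the phrase list fits in memory, a lighter alternative to a full search engine might be enough. The following is only a minimal sketch of that idea, not Lucene and not anything the answer above prescribes: it stores every phrase as a tuple of words in a set and scans each document with a greedy longest-match window, so that 'tree house' wins over 'tree' as in the expected result of the question. The function names build_phrase_index and find_phrases are hypothetical, and splitting on whitespace is an assumption; real text would likely need lowercasing and punctuation handling.

```python
def build_phrase_index(phrases):
    """Turn each phrase into a tuple of words and record the longest phrase length."""
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=0)
    return phrase_set, max_len


def find_phrases(text, phrase_set, max_len):
    """Return the phrases found in text, in order of first appearance."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    found, seen = [], set()
    i = 0
    while i < len(tokens):
        # Try the longest window first so multi-word phrases take priority.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrase_set:
                if candidate not in seen:
                    seen.add(candidate)
                    found.append(" ".join(candidate))
                i += n  # skip past the matched words
                break
        else:
            i += 1  # no phrase starts at this word
    return found


phrase_list = ['cat', 'dog', 'tree', 'tree house']
doc_dictionary = {'doc1': "the cat sat under the tree",
                  'doc2': "the dog chased the cat",
                  'doc3': "the boy loves his tree house"}

phrase_set, max_len = build_phrase_index(phrase_list)
result_dict = {name: find_phrases(text, phrase_set, max_len)
               for name, text in doc_dictionary.items()}
print(result_dict)
```

Each word position costs at most max_len set lookups, so the work per document grows with its length and the longest phrase, not with the number of phrases. If overlapping matches should all be reported (both 'tree' and 'tree house'), the greedy skip would have to be replaced by checking every window at every position.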