How do I extract snippets of text around a specific word?

Date: 2019-07-10 19:02:26

Tags: python string

I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.

I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.

def occurs(word1, word2, filename):
    import os

    infile = open(filename,'r')     #opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]       #this list allows for multiple words
    wordsString = ' '.join(lines)   #rejoin the lines with spaces so words at line breaks stay separate
    words = wordsString.split()     #split into individual words

    f = open(filename, 'w')         #note: this overwrites the input file with the snippets
    f.write("start")
    f.write(os.linesep)

    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]

        for m in matches:
            #max() keeps the slice from wrapping around to the end of the list near the start
            l = " ".join(words[max(0, m - 15):m + 16])
            f.write(f"...{l}...")   #writes the snippet to the external file
            f.write(os.linesep)
    f.close()                       #close() must be called, not just referenced

So far, when two instances of the same word land too close together, the program simply skips one of them. Instead, I want to produce one longer snippet that runs from 15 words before the earliest occurrence to 15 words after the latest.
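For illustration, the behavior being asked for amounts to merging overlapping context windows around each match. The sketch below is not from the post; the name `merged_windows` is invented:

```python
def merged_windows(words, target, radius=15):
    # Build a [start, end) window around every occurrence, then merge
    # windows that overlap, so nearby hits yield one long snippet.
    hits = [i for i, w in enumerate(words) if target in w.lower()]
    windows = [(max(0, i - radius), min(len(words), i + radius + 1))
               for i in hits]
    merged = []
    for start, end in windows:
        if merged and start <= merged[-1][1]:
            # overlaps the previous window: extend it instead of adding
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [" ".join(words[s:e]) for s, e in merged]
```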

2 answers:

Answer 0 (score: 1)

As always, there are multiple ways to solve this. A fun one might be a recursive wordFind that searches the next 15 words and, if it finds the target word there, calls itself.
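That recursive idea might be sketched as follows (untested against the original post's data; `find_snippets` and `extend_end` are invented names, not from the answer):

```python
def find_snippets(words, target, radius=15):
    # Scan once; at each hit, recursively push the window's end forward
    # past any further hits that fall within `radius` words.
    snippets = []
    i = 0
    while i < len(words):
        if target in words[i].lower():
            start = max(0, i - radius)
            end = extend_end(words, target, i, radius)
            snippets.append(" ".join(words[start:end]))
            i = end          # resume scanning after the snippet
        else:
            i += 1
    return snippets

def extend_end(words, target, i, radius):
    # Look at the next `radius` words; if the target occurs again,
    # recurse from that occurrence so overlapping windows merge.
    for j in range(i + 1, min(i + radius + 1, len(words))):
        if target in words[j].lower():
            return extend_end(words, target, j, radius)
    return min(i + radius + 1, len(words))
```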

A simpler, though perhaps less efficient, approach is to add one word at a time.


Or, if you would rather drop the subsequent occurrences...


Note: this is untested, so it may need some debugging. But the gist is clear: when you hit the target word, keep appending words one at a time, extending the collection as you go. This also lets you extend the second if condition to match target words other than the current one.
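A minimal sketch of that word-by-word idea, under assumed naming (`snippets_word_by_word` is my own, not from the original answer):

```python
def snippets_word_by_word(words, target, radius=15):
    # Walk the word list once; when the target is seen, open a snippet
    # that starts `radius` words back, then keep appending one word at
    # a time until `radius` words pass without another hit.
    snippets = []
    current = None        # words of the snippet being built, or None
    remaining = 0         # how many more words to take after the last hit
    for i, w in enumerate(words):
        hit = target in w.lower()
        if hit and current is None:
            current = words[max(0, i - radius):i]   # leading context
        if current is not None:
            current.append(w)
            # another hit resets the countdown, merging nearby matches
            remaining = radius if hit else remaining - 1
            if remaining == 0:
                snippets.append(" ".join(current))
                current = None
    if current:                     # flush a snippet cut off by the end
        snippets.append(" ".join(current))
    return snippets
```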

Answer 1 (score: 0)

This snippet will grab the specified number of words around the chosen keyword. If several keywords fall close together, it joins them:

s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''

words = s.split()

from itertools import groupby, chain

word = 'xxx'

def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    # group consecutive words into runs: v is True for a run of
    # non-keyword words, False for a run of keyword occurrences
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                # short gap between two keywords: keep it whole,
                # so the surrounding snippets merge into one
                current_snippet += [w]
            else:
                # long gap: close the current snippet with l words of
                # trailing context, then open the next snippet with
                # l words of leading context for the upcoming keyword
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            # keyword run: attach it to the open snippet (or start one)
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]

    # flush the final snippet if it actually contains a keyword
    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])

    return snippets

for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))

Prints:

xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx

The same data, but with a different window length:

for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))

Prints:

xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx