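Handling a text file when searching for keywords

Time: 2019-05-05 17:37:56

Tags: python python-3.x text iteration

I have been working on a program that finds words that appear only once in a text. However, when the program finds such a word, I would like it to give some context for that word.

Here is my code.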

from collections import Counter
from string import punctuation

text = str("bible.txt")
with open(text) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

unique = [word.lower() for word, count in word_counts.items() if count == 1]

with open(text, 'r') as myfile:
    wordlist = myfile.read().lower()

print(unique)
print(len(unique), " unique words found.")

for word in unique:
    first = 1
    second = 1
    index = wordlist.index(word)
    if wordlist[index - first:index] is not int():
        first += 1
    if wordlist[index:index + second] is not ".":
        second += 1
    print(" ")

    first_part = wordlist[index - first:index]
    second_part = wordlist[index:index + second]
    print(word)
    print("%s %s" % ("".join(first_part), "".join(second_part)))

This is the input text.

Ideally, it would display

sojournings
1 Jacob lived in the land of his father's sojournings, in the land of 
Canaan.

generations
2 These are the generations of Jacob.

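Basically, I want it to show the sentence the word appears in, starting with the verse number. I know I will have to do something with indexes, but honestly I don't know how to go about it.

Any help would be greatly appreciated.

Thanks, Ben

2 answers:

Answer 0: (score: 1)

I would find the index of the first letter of the chosen word (within the whole string, which for the Bible will be long ;') and then find the first "." before that letter. I would also find the next "." after it, possibly enforcing a minimum length so that short sentences still give enough context. That gives you the range to include/print/display.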

def stringer():

    mystring = """ the quick brown fox. Which jumped over the lazy dog and died a horrible death. ad ipsum valorem"""

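    # Index of the first letter of the target word in the whole string.
    word_posn = mystring.find("lazy")
    # One past the last "." before the word (0 if there is none).
    start_posn = mystring[:word_posn].rfind(".") + 1
    # One past the first "." after the word.
    end_posn = mystring[word_posn:].find(".") + word_posn + 1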

    return '"' + mystring[start_posn:end_posn].strip() + '"'

This code was written very quickly, so apologies for any errors in it.
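
The minimum-length idea mentioned in this answer is not part of stringer(). A rough sketch of how it might look is below, assuming sentences end with "." only; the function name sentence_around and the parameter min_len are made up for illustration and do not come from the original post.

```python
def sentence_around(text, word, min_len=40):
    """Return the sentence containing `word`, widened to at least `min_len` characters.

    Hypothetical helper: the name and the `min_len` parameter are illustrative.
    """
    word_posn = text.find(word)
    if word_posn == -1:
        return None  # the word does not occur in the text

    # Start just after the previous full stop (0 if there is none).
    start = text.rfind(".", 0, word_posn) + 1

    # End just after the next full stop (end of text if there is none).
    end = text.find(".", word_posn)
    end = len(text) if end == -1 else end + 1

    # If the sentence is too short, keep absorbing following sentences.
    while end - start < min_len and end < len(text):
        nxt = text.find(".", end)
        end = len(text) if nxt == -1 else nxt + 1

    return text[start:end].strip()


sample = "The quick brown fox. It ran. Which jumped over the lazy dog and died a horrible death. ad ipsum valorem"
print(sentence_around(sample, "lazy"))
print(sentence_around(sample, "ran", min_len=20))
```

Widening forwards only is an arbitrary choice here; extending backwards to the previous sentence would work just as well if the earlier context matters more.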

Answer 1: (score: 1)

I will leave the complete code here for anyone who comes across this later.

from collections import Counter
from string import punctuation
import time

path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

# Join lines with a space so words at line boundaries do not run together.
with open(path) as f:
    wordlist = f.read().replace('\n', ' ')

unique = [word for word, count in word_counts.items() if count == 1]

print(unique)
print(len(unique), " unique words found.")

for word in unique:
    print(" ")
    word_posn = wordlist.find(word)
    # Python evaluates "." or "," or "!" or "?" to ".", so only full stops act as boundaries.
    start_posn = wordlist[:word_posn].rfind("." or "," or "!" or "?") + 1
    end_posn = wordlist[word_posn:].find("." or "," or "!" or "?") + word_posn + 1
    print(word)
    print(wordlist[start_posn:end_posn])
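
If several punctuation marks should really count as sentence boundaries, a regular expression is one way to do it. The sketch below is only an illustration: the function name context is made up, and it assumes that sentences end with ".", "!" or "?" and that each verse number directly follows the end of the previous sentence, as in the expected output in the question. Commas are left out on purpose so that the "sojournings" verse is not cut in half.

```python
import re

# Assumption: these three marks end a sentence; commas do not.
SENTENCE_END = re.compile(r"[.!?]")


def context(text, word):
    """Return the sentence around the first whole-word match of `word`.

    Hypothetical helper, not part of the original answer.
    """
    match = re.search(r"\b" + re.escape(word) + r"\b", text)
    if match is None:
        return None

    # Position just after the last terminator before the word (0 if none).
    start = 0
    for m in SENTENCE_END.finditer(text, 0, match.start()):
        start = m.end()

    # Position just after the first terminator following the word.
    nxt = SENTENCE_END.search(text, match.end())
    end = nxt.end() if nxt else len(text)

    return text[start:end].strip()


sample = ("1 Jacob lived in the land of his father's sojournings, "
          "in the land of Canaan. 2 These are the generations of Jacob.")
print(context(sample, "sojournings"))
print(context(sample, "generations"))
```

The \b anchors make the search match whole words only, so a short word is not found inside a longer one. Scanning every terminator before the word is simple but slow on a text the size of the Bible; it is kept here for readability.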

Also a shout-out to @lb_so for the help!