Python: find n words before and after a given word

Time: 2016-02-05 22:45:03

Tags: python regex numpy

Let's say I have a text file. I read it, and its content looks like this:

 ... Department of Something is called (DoS) and then more texts and more text...

Then, while reading the text file, I come across an acronym, which here is

DoS 

To find the acronym, I wrote:

import re
import numpy

# open the file? 
test_string = " a lot of text read from file ... Department of Something is called (DoS) and then more texts and more text..."
# A capital letter, any letters or dots, then a final capital letter (optionally followed by a dot).
regex = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

found = re.findall(regex, test_string)
print(found)

The output is:

['DoS']

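What I want to do is:

  1. Read the file and find an acronym (here, DoS).
  2. Count the number of characters in the acronym I found (here, 3 characters for DoS).
  3. Find twice that many words (here 2 x 3 = 6) before and after 'DoS'. That would be:

    3.1 pre=     Department of Something is called
    3.2 acronym= DoS
    3.3 post=    and then more texts and more 
    
  4. Put these three (pre, acronym, post) into an array.

Any help would be appreciated, since I am new to Python.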

1 answer:

Answer 0: (score: 1)

I'm not sure this is the best solution, but it may be enough to help you.

The following should give you what you want:

import re
import numpy

# open the file? 
test_string = " a lot of text read from file ... Department of Something is called (DoS) and then more texts and more text..."
regex_acronym = r'\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?'

ra = re.compile(regex_acronym)
for m in ra.finditer(test_string):
    print(m.start(), m.group(), m.span())
    # Take twice as many words as the acronym has characters.
    n = len(m.group()) * 2
    # Group 1: up to n words before; group 2: the acronym; group 3: up to n words after.
    regex_pre_post = r"((?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,%d})(" % n
    regex_pre_post += regex_acronym 
    regex_pre_post += ")((?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,%d})" % n
    # Note: this searches the whole string again, so found[0] is the first acronym's context.
    found = re.findall(regex_pre_post, test_string)
    print(found)

    found = found[0] # For a single match, just do this.
    pre = found[0]
    acro = found[1]
    post = found[2]
    print(pre, acro, post)
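
If the text contains several acronyms, a token-based variant may be easier to follow than one large regex. The sketch below is only one possible approach, not part of the original answer: the function name `acronym_contexts` and the `factor` parameter are made up for illustration, and "array" is assumed to mean a plain Python list of `(pre, acronym, post)` tuples. It reuses the same acronym pattern and the same definition of a word (letters, apostrophes and hyphens), so punctuation such as `(` or `...` is dropped from the context.

```python
import re

# Same acronym pattern as above; a "word" is a run of letters, apostrophes or hyphens.
ACRONYM = re.compile(r"\b[A-Z][a-zA-Z.]*[A-Z]\b\.?")
WORD = re.compile(r"[A-Za-z'-]+")


def acronym_contexts(text, factor=2):
    """Return a list of (pre, acronym, post) tuples, one per acronym found.

    pre and post each hold up to factor * len(acronym) words.
    """
    # Record where every word starts and ends so it can be compared
    # with the position of each acronym match.
    words = [(w.start(), w.end(), w.group()) for w in WORD.finditer(text)]
    results = []
    for m in ACRONYM.finditer(text):
        n = factor * len(m.group())
        # Words that end before the acronym starts, and words that start after it ends.
        before = [w for start, end, w in words if end <= m.start()]
        after = [w for start, end, w in words if start >= m.end()]
        results.append((" ".join(before[-n:]), m.group(), " ".join(after[:n])))
    return results


test_string = " a lot of text read from file ... Department of Something is called (DoS) and then more texts and more text..."
print(acronym_contexts(test_string))
# [('file Department of Something is called', 'DoS', 'and then more texts and more')]
```

Note that six words before 'DoS' reach back to 'file', one word further than the 'pre' shown in the question, because the count is 2 x 3 = 6 words.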