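Finding the context of a word using regular expressions

Date: 2016-05-13 18:38:41

Tags: python python-3.x

I wrote a function that searches a text for the context of a given word (w). The left and right parameters control how many words are captured on each side of it.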

import re
def get_context (text, w, left, right):
    text.insert (0, "*START*")
    text.append ("*END*")

    all_contexts = []

    for i in range(len(text)):

        if re.match(w,text[i], 0):

            if i < left:
                context_left = text[:i]

            else:
                context_left = text[i-left:i]

            if len(text) < (i+right):
                context_right = text[i:]

            else: 
                context_right = text[i:(i+right+1)]

            context = context_left + context_right

            all_contexts.append(context)
    return all_contexts

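For example, if I have a text in list form like this:

text = ['Python', 'is', 'dynamically', 'typed', 'language',
        'Python', 'functions', 'really', 'care', 'about', 'what',
        'you', 'pass', 'to', 'them', 'but', 'you', 'got', 'it', 'the',
        'wrong', 'way', 'if', 'you', 'want', 'to', 'pass', 'one',
        'thousand', 'arguments', 'to', 'your', 'function', 'then',
        'you', 'can', 'explicitly', 'define', 'every', 'parameter',
        'in', 'your', 'function', 'definition', 'and', 'your',
        'function', 'will', 'be', 'automagically', 'able', 'to', 'handle',
        'all', 'the', 'arguments', 'you', 'pass', 'to', 'them', 'for', 'you']

The function works fine, for example: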

get_context(text, "function",2,2)
[['language', 'python', 'functions', 'really', 'care'], ['to', 'your', 'function', 'then', 'you'], ['in', 'your', 'function', 'definition', 'and'], ['and', 'your', 'function', 'will', 'be']]

Now I am trying to build a dictionary that holds the context of every word in the text, like this:

d = {}
for w in set(text):
    d[w] = get_context(text,w,2,2)

But I get this error:

Traceback (most recent call last):
  File "<pyshell#32>", line 2, in <module>
    d[w] = get_context(text,w,2,2)
  File "<pyshell#20>", line 9, in get_context
    if re.match(w,text[i], 0):
  File "/usr/lib/python3.4/re.py", line 160, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.4/re.py", line 294, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.4/sre_compile.py", line 568, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.4/sre_parse.py", line 760, in parse
    p = _parse_sub(source, pattern, 0)
  File "/usr/lib/python3.4/sre_parse.py", line 370, in _parse_sub
    itemsappend(_parse(source, state))
  File "/usr/lib/python3.4/sre_parse.py", line 579, in _parse
    raise error("nothing to repeat")
sre_constants.error: nothing to repeat

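I don't understand this error. Can anyone help me with it?

3 Answers:

Answer 0: (score: 1)

The problem is that "*START*" and "*END*" are interpreted as regular expressions when they are passed in as w. Also, note that inserting "*START*" and "*END*" into text inside the function, on every call, causes problems: the list grows by two elements each time. You should do it only once.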
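To illustrate the first point: a leading * is a quantifier with nothing before it to repeat, which is what "nothing to repeat" refers to. The following is a minimal sketch, assuming you still want to go through re.match. It uses re.escape from the standard library, which does not appear in the question's code, and the variable names are made up for illustration.

```python
import re

marker = "*START*"

# A leading "*" is a quantifier with nothing in front of it to repeat,
# so compiling the raw marker as a pattern raises re.error.
try:
    re.compile(marker)
    compiled_ok = True
except re.error:
    compiled_ok = False

# re.escape backslash-escapes the special characters, so the marker
# is matched literally instead of being parsed as regex syntax.
escaped_match = re.match(re.escape(marker), marker)

print(compiled_ok)             # False
print(escaped_match.group(0))  # *START*
```

Escaping w before matching would let the original markers stay in the list, at the cost of an extra call per lookup.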

Here is a complete, working version of the code:

import re

def get_context(text, w, left, right):
    all_contexts = []
    for i in range(len(text)):
        if re.match(w,text[i], 0):
            if i < left:
                context_left = text[:i]
            else:
                context_left = text[i-left:i]
            if len(text) < (i+right):
                context_right = text[i:]
            else:
                context_right = text[i:(i+right+1)]
            context = context_left + context_right
            all_contexts.append(context)
    return all_contexts

text = ['Python', 'is', 'dynamically', 'typed', 'language',
        'Python', 'functions', 'really', 'care', 'about', 'what',
        'you', 'pass', 'to', 'them', 'but', 'you', 'got', 'it', 'the',
        'wrong', 'way', 'if', 'you', 'want', 'to', 'pass', 'one',
        'thousand', 'arguments', 'to', 'your', 'function', 'then',
        'you', 'can', 'explicitly', 'define', 'every', 'parameter',
        'in', 'your', 'function', 'definition', 'and', 'your',
        'function', 'will', 'be', 'automagically', 'able', 'to', 'handle',
        'all', 'the', 'arguments', 'you', 'pass', 'to', 'them', 'for', 'you']

text.insert(0, "START")
text.append("END")

d = {}
for w in set(text):
    d[w] = get_context(text,w,2,2)

Maybe you could also replace re.match(w,text[i], 0) with w == text[i].
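As a rough sketch of that last suggestion, here is one way the function might look with plain equality and with the padding applied to a copy, so that repeated calls do not keep growing the caller's list. This is an illustrative variant, not the code from the question: the padding on a copy and the short words list are assumptions made for the example.

```python
def get_context(text, w, left, right):
    """Return the window of words around every exact occurrence of w."""
    # Pad a copy so the caller's list is not modified on each call.
    padded = ["*START*"] + list(text) + ["*END*"]
    all_contexts = []
    for i, token in enumerate(padded):
        if token == w:  # plain equality, so "*" has no special meaning
            start = max(i - left, 0)
            # Slicing past the end of a list is safe in Python.
            all_contexts.append(padded[start:i + right + 1])
    return all_contexts


words = ["to", "your", "function", "then", "you"]
contexts = {w: get_context(words, w, 2, 2) for w in set(words)}
print(contexts["function"])  # [['to', 'your', 'function', 'then', 'you']]
print(contexts["to"])        # [['*START*', 'to', 'your', 'function']]
```

Note that equality changes the behaviour slightly: searching for "function" no longer picks up "functions", which re.match did because it only anchors at the start of the string.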

Answer 1: (score: 1)

The whole thing can be rewritten very concisely, as follows.

Keeping text as a str:

text = 'Python is dynamically typed language Python functions really care about what you pass to them but you got it the wrong way if you want to pass one thousand arguments to your function then you can explicitly define every parameter in your function definition and your function will be automagically able to handle all the arguments you pass to them for you'

and assuming context = 'function',

pat = re.compile(r'(\w+\s\w+\s)functions?(?=(\s\w+\s\w+))')
pat.findall(text)
[('language Python ', ' really care'), ('to your ', ' then you'), ('in your ', ' definition and'), ('and your ', ' will be')]

Note that the regex will need minor customization to allow for words such as functional or functioning, not only function or functions. But the important idea is to do away with indexing and be more functional.

Please comment if this does not work for you when you apply it in bulk.
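One possible way to do that customization is to build the pattern from the word and the window sizes. The sketch below is only an illustration: context_pattern is a hypothetical helper, not part of this answer, and it assumes that any word starting with the search term should count as a match.

```python
import re


def context_pattern(word, left=2, right=2):
    """Build a regex capturing `left` words before and `right` words after."""
    before = r"\b((?:\w+\s){%d})" % left
    # re.escape keeps special characters in `word` literal;
    # \w* also accepts longer forms such as "functional".
    target = "(" + re.escape(word) + r"\w*)"
    # Lookahead, so the right context can be reused by the next match.
    after = r"(?=((?:\s\w+){%d}))" % right
    return re.compile(before + target + after)


sample = "a b function c d functional e f"
matches = context_pattern("function").findall(sample)
print(matches)
# [('a b ', 'function', ' c d'), ('c d ', 'functional', ' e f')]
```

As with the pattern above, an occurrence that has fewer than `left` words before it or fewer than `right` words after it is not matched at all, so words near the edges of the text are skipped.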

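Answer 2: (score: 0)

At least one of the elements in text contains characters that are special in a regular expression. If you are just trying to find whether the string starts with the word, use str.startswith, i.e.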

if text[i].startswith(w):  # instead of re.match(w,text[i], 0):

But I don't understand why you are checking for that rather than for equality.
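The difference between the three tests can be seen on a few sample tokens. This is a small illustrative comparison; the token list is made up for the example.

```python
import re

word = "function"
tokens = ["function", "functions", "functional", "dysfunction"]

# re.match only anchors at the start, so it behaves like a prefix test
# as long as `word` contains no special characters.
prefix_regex = [t for t in tokens if re.match(word, t)]
prefix_str = [t for t in tokens if t.startswith(word)]
exact = [t for t in tokens if t == word]

print(prefix_regex)  # ['function', 'functions', 'functional']
print(prefix_str)    # ['function', 'functions', 'functional']
print(exact)         # ['function']
```

So the prefix tests explain why 'functions' shows up in the question's output for "function", while equality would keep only the exact word.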