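RegEx for capturing scientific citations

Asked: 2019-05-26 20:22:19

Tags: python regex regex-group regex-greedy python-regex

I am trying to capture parenthesized text that contains at least one digit (think citations). This is my regex, which works as expected here: https://regex101.com/r/oOHPvO/5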

\((?=.*\d).+?\)

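So I want it to capture (Author 2000) and (2000), but not (Author).

I am trying to capture all of these parenthesized groups with Python, but in Python the regex also captures parenthesized text that contains no digits.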

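import re

with open('text.txt') as f:
    text = f.read()

# Raw string, so the backslashes reach the regex engine unchanged
pattern = r"\((?=.*\d).*?\)"

citations = re.findall(pattern, text)

# Drop duplicates (a set does not preserve order)
citations = list(set(citations))

for c in citations:
    print(c)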

Any idea what I am doing wrong?

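2 Answers:

Answer 0 (score: 1)

Probably the most reliable way to handle this expression is to add boundaries as it grows. For example, we can build character classes for each piece of data we want to collect: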

(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).

DEMO

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))."

test_str = "some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)"

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
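
Each lookahead-plus-dot pair at the ends of this expression consumes exactly one parenthesis, so the same matches should be obtainable with plain escaped parentheses. The sketch below compares the two forms; the variable names and the sample string are illustrative only. Note that the pattern requires an author name, so a bare (2000) is skipped.

```python
import re

# Simplified form: \( and \) replace the (?=\(). and (?=\)). pairs,
# each of which consumes a single parenthesis.
simplified = re.compile(r"\(([a-z]+)([\s,;]+?)([0-9]+)\)", re.IGNORECASE)
original = re.compile(r"(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\)).", re.IGNORECASE)

# Made-up sample text for illustration
sample = "before (Author) after (Author, 2000) and (Smith; 1999) and (2000)"

# With three capturing groups, findall returns one tuple per match:
# (author, separator, year)
simplified_groups = simplified.findall(sample)
original_groups = original.findall(sample)

print(simplified_groups)
print(simplified_groups == original_groups)
```

If citations without an author also need to be captured, as the question asks, this expression would have to be extended, for example by making the first two groups optional.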

Demo

const regex = /(?=\().([a-z]+)([\s,;]+?)([0-9]+)(?=\))./mgi;
const str = `some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author, 2000) some text we wish before (Author) some text we wish after (Author 2000) some text we wish before (Author) some text we wish after (Author; 2000)`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

RegEx Circuit

jex.im visualizes regular expressions.

Answer 1 (score: 1)

You can use

re.findall(r'\([^()\d]*\d[^()]*\)', s)

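See the regex demo.

Details

  • \( - a ( character
  • [^()\d]* - zero or more characters other than (, ) and digits
  • \d - a digit
  • [^()]* - zero or more characters other than ( and )
  • \) - a ) character.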
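
As a quick check of the breakdown above, the pattern can be compared with the lookahead version from the question on a single line of text. In the question's pattern, (?=.*\d) may scan past the closing parenthesis, so any digit later on the same line satisfies it. The negated character classes here cannot cross a parenthesis. The sample string is made up for illustration.

```python
import re

# Hypothetical sample: several parenthesized groups on one line
line = "Some (Author) and (Author 2000), also (2000) and (see note)"

# Pattern from the question: the lookahead may scan beyond the closing ")"
lookahead = re.compile(r"\((?=.*\d).*?\)")
# Pattern from this answer: every character class stops at "(" or ")"
bounded = re.compile(r"\([^()\d]*\d[^()]*\)")

lookahead_hits = lookahead.findall(line)
bounded_hits = bounded.findall(line)

print(lookahead_hits)  # includes '(Author)' because digits follow later on the line
print(bounded_hits)    # only the groups that contain a digit themselves
```

This would also explain why the question's regex appeared to work on regex101 if each test string there sat on its own line, since . does not match a newline by default.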


Python demo

import re
rx = re.compile(r"\([^()\d]*\d[^()]*\)")
s = "Some (Author) and (Author 2000)"
print(rx.findall(s)) # => ['(Author 2000)']

To get the results without the parentheses, add a capturing group:

rx = re.compile(r"\(([^()\d]*\d[^()]*)\)")
                    ^                ^

See this Python demo.
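
Combined with the de-duplication step from the question, a minimal sketch might look as follows. It uses dict.fromkeys rather than set so that first-seen order is kept. The function name unique_citations and the sample text are hypothetical; in the question the text would come from text.txt.

```python
import re

# Capturing-group variant: with one group, findall returns only its contents
CITATION_RX = re.compile(r"\(([^()\d]*\d[^()]*)\)")


def unique_citations(text):
    """Return citation contents without parentheses, duplicates removed,
    in order of first appearance."""
    return list(dict.fromkeys(CITATION_RX.findall(text)))


# Hypothetical text standing in for the contents of text.txt
text = "First (Author 2000), then (Author), again (Author 2000),\nand finally (2000)."

result = unique_citations(text)
print(result)
```

Because the negated classes also match newlines, a citation that wraps across two lines of the file would still be captured as a single match.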