Question

I'm trying to write a regular expression in Python to extract some information from a string.

Suppose I have:

"Only in Api_git/Api/folder A: new.txt"

I want to print:

Folder Path: Api_git/Api/folder A
Filename: new.txt

After looking at some of the examples on the re manual page, I'm still a bit stuck.

Here is what I have tried so far:

m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")

print m.group('folder_path')
print m.group('filename')

Can anyone point me in the right direction?

Answer 1

Use capturing groups and read the matched text from groups 1 and 2.

^Only in ([^:]*): (.*)$

Sample code:

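import re
# Group 1: everything up to the first colon; group 2: everything after ": "
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"

print(p.findall(test_str))  # [('Api_git/Api/folder A', 'new.txt')]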
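To read the two groups by index rather than as a list of tuples, a minimal sketch (assuming Python 3; the names `pattern`, `folder_path`, and `filename` are only illustrative) could look like this:

```python
import re

# Same pattern as above: group 1 is everything up to the first colon,
# group 2 is everything after ": ".
pattern = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"

m = pattern.match(test_str)
if m:  # match() returns None when the line does not fit the pattern
    folder_path, filename = m.group(1), m.group(2)
    print("Folder Path:", folder_path)
    print("Filename:", filename)
```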

If you want to print the result in the following format, try a substitution instead.

Folder Path: Api_git/Api/folder A 
Filename: new.txt

Sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
# Python refers to groups as \1 and \2 in the replacement, not $1 and $2
subst = r"Folder Path: \1\nFilename: \2"

result = p.sub(subst, test_str)
print(result)

Answer 2

Your pattern, (Only in ?P<folder_path>\w+:?P<filename>\w+), has a few flaws.

The ?P construct is only valid as the first thing inside a parenthesized expression, so we need this:

(Only in (?P<folder_path>\w+):(?P<filename>\w+))

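The \w character class only matches letters, digits, and underscores. It won't match / or ., for example. We need a different character class that fits the requirement better. In fact, we can use ., the class that matches almost any character: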

(Only in (?P<folder_path>.+):(?P<filename>.+))

The colon in the sample text is followed by a space. We need to match it:

(Only in (?P<folder_path>.+): (?P<filename>.+))

The outermost parentheses aren't needed. They aren't wrong, just unnecessary:

Only in (?P<folder_path>.+): (?P<filename>.+)
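To check the effect of each fix, here is a small sketch (assuming Python 3; `sample`, `steps`, and `results` are illustrative names) that runs three of the intermediate patterns against the sample string:

```python
import re

sample = "Only in Api_git/Api/folder A: new.txt"

# Intermediate patterns from the steps above, in order.
steps = [
    r"(Only in (?P<folder_path>\w+):(?P<filename>\w+))",  # \w cannot match "/" or "."
    r"(Only in (?P<folder_path>.+):(?P<filename>.+))",    # matches, but filename keeps the space
    r"Only in (?P<folder_path>.+): (?P<filename>.+)",     # final version
]

results = []
for pattern in steps:
    m = re.match(pattern, sample)
    # groupdict() maps each named group to the text it captured
    results.append(m.groupdict() if m else None)
    print(pattern, "->", results[-1])
```

The first pattern does not match at all, the second captures the filename with a leading space, and only the last one yields the two values as wanted.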

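It is often convenient to keep the regular expression separate from the call to the regex engine. This is easily done by creating a new variable, for example: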

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")

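The above is purely for the programmer's convenience: it neither saves nor wastes time or memory. There is, however, a technique that can save some time with regular expressions: compiling.

Consider this snippet: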

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...

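On every iteration of the loop, the regex engine has to interpret the regular expression and then apply it to the line variable. The re module lets us separate the interpretation from the application; we can interpret once and apply many times: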

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = regex.match(line)
    ...
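The compiled pattern object has its own `match` method, so the loop can call it directly. A small sketch of that loop, assuming Python 3 and using a hypothetical in-memory list `lines` in place of `input_file`:

```python
import re

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')

# Hypothetical stand-in for input_file: a few lines of diff-style output.
lines = [
    "Only in Api_git/Api/folder A: new.txt\n",
    "Files a/x.py and b/x.py differ\n",
    "Only in Api_git/Api: old.cfg\n",
]

found = []
for line in lines:
    m = regex.match(line)  # method on the compiled pattern
    if m:                  # non-matching lines give None and are skipped
        found.append((m.group('folder_path'), m.group('filename')))

print(found)
```

Because `.` does not match a newline by default, the trailing `\n` of each line is not captured in `filename`.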

Your original program should now look like this:

import re

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = regex.match("Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))

However, I like to use comments to explain regular expressions. My version, including some general cleanup, looks like this:

import re
regex = re.compile(r'''(?x)                # Verbose
            Only\ in\             # Literal match
            (?P<folder_path>.+)   # match longest sequence of anything, and put in 'folder_path'
            :\                    # Literal match
            (?P<filename>.+)      # match longest sequence of anything and put in 'filename'
            ''')

with open('diff.out') as input_file:
    for line in input_file:
        m = regex.match(line)
        if m:
            print(m.group('folder_path'))
            print(m.group('filename'))

Answer 3

It really depends on how constrained the input is. If this is the only kind of input you need to handle, this will do the trick:

^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$
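A quick sketch of how this stricter pattern behaves, assuming Python 3 and with the dot before `txt` escaped so that it only matches a literal period (the names `strict`, `ok`, and `rejected` are illustrative):

```python
import re

# Restrictive variant: only letters, "_", "/" and spaces in the path,
# and a lowercase name ending in a literal ".txt".
strict = re.compile(
    r'^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$'
)

ok = strict.match("Only in Api_git/Api/folder A: new.txt")
rejected = strict.match("Only in Api_git/Api/folder A: new.cfg")

print(ok.groupdict())
print(rejected)  # None, because the filename does not end in ".txt"
```

The trade-off is that anything outside those character classes, such as digits in the folder name, makes the whole match fail.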
