Question

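I am trying to extract file names from a document with a specific format and put them into a list. The document contains a lot of information, but the lines I care about look like the following, with "File Name:" always at the start of the line: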

File Name: C:\windows\system32\cmd.exe

I tried the following:

xmlfile = open('my_file.xml', 'r')
filetext = xmlfile.read()
file_list = []
file_list.append(re.findall(r'\bFile Name:\s+.*\\.*(?=\n)', filetext))

This makes file_list look like:

[['File Name: c:\\windows\\system32\\file1.exe',
  'File Name: c:\\windows\\system32\\file2.exe',
  'File Name: c:\\windows\\system32\\file3.exe']]

I would like my output to be just:

(file1.exe, file2.exe, file3.exe)

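I also tried using ntpath.basename on the output above, but it looks like it wants a string as input rather than a list.

I am very new to Python and scripting, so any suggestions are appreciated.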

Answer 1

You can get the expected output with the following regular expression:

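file_list = re.findall(r'\bFile Name:\s+.*\\([^\\\n]*)(?=\n)', filetext)

([^\\\n]*) captures everything after the final path separator, that is, any run of characters other than a backslash or a newline, up to the \n; see the online example. The newline is excluded from the character class so that a capture cannot run on into following lines that contain no backslash. Since findall already returns a list, there is no need to append the return value to an existing list.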
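As a minimal, self-contained sketch of this approach, the snippet below runs a capturing-group pattern against a small in-memory sample instead of my_file.xml. The sample text and the name sample_text are made up for illustration; the character class excludes both backslashes and newlines so that a capture stays on one line.

```python
import re

# Hypothetical sample standing in for the contents of my_file.xml.
sample_text = (
    "Some header\n"
    "File Name: c:\\windows\\system32\\file1.exe\n"
    "Other data\n"
    "File Name: c:\\windows\\system32\\file2.exe\n"
)

# With exactly one capturing group, findall returns only the captured text,
# here whatever follows the last backslash on each matching line.
file_list = re.findall(r'\bFile Name:\s+.*\\([^\\\n]*)(?=\n)', sample_text)
print(file_list)  # ['file1.exe', 'file2.exe']
```

If a tuple is wanted rather than a list, tuple(file_list) converts it.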

Answer 2

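I would change it a little to make it clearer to read and to separate the steps a bit. Obviously it can be done in one step, but I think your code would be hard to manage later on.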

import re

with open('my_file.xml', 'r') as xmlfile:
    filetext = xmlfile.read()   # this way the file handle goes away - you left the file open
file_list = []
my_pattern = re.compile(r'\bFile Name:\s+.*\\.*(?=\n)')
for filename in my_pattern.findall(filetext):
    # split on the backslash explicitly; os.sep is '/' outside Windows
    cleaned_name = filename.split('\\')[-1]
    file_list.append(cleaned_name)

Answer 3

You can do this in a more declarative style. It leaves less room for bugs and is more memory efficient.

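import os.path
import re

pat = re.compile(r'\bFile Name:\s+.*\\.*(?=\n)')
with open('my_file.xml') as f:
    ms = (pat.match(line) for line in f)
    # match() returns None for non-matching lines, so filter those out;
    # os.path.basename splits on backslashes only when run on Windows
    ns = (os.path.basename(m.group()) for m in ms if m)
    # the iterator ns emits names such as 'foo.txt'; it reads lazily,
    # so consume it while the file is still open
    for n in ns:
        print(n)  # do something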

If you change the regular expression slightly, i.e. add a capturing group, you do not even need os.path.
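A rough sketch of what such a change could look like: a capturing group takes the text after the last backslash, so no basename call is needed. The helper name iter_names and the sample data are assumptions for illustration, and io.StringIO stands in for an open file.

```python
import io
import re

# The capturing group grabs the text after the last backslash on the line.
pat = re.compile(r'File Name:\s+.*\\([^\\\n]*)')

def iter_names(lines):
    """Lazily yield the base file name from each matching line."""
    for line in lines:
        m = pat.match(line)  # "File Name:" is always at the start of the line
        if m:
            yield m.group(1)

# io.StringIO stands in for an open file here (hypothetical sample data).
sample = io.StringIO(
    "File Name: c:\\windows\\system32\\file1.exe\n"
    "unrelated line\n"
    "File Name: c:\\windows\\system32\\file2.exe\n"
)
names = list(iter_names(sample))
print(names)  # ['file1.exe', 'file2.exe']
```

Because iter_names is a generator, only one line is held in memory at a time; with a real file, it should be consumed inside the with block, before the file is closed.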

Answer 4

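You are on the right track. The reason basename did not work is that re.findall() returns a list, which was then put inside another list. Here is a fix that iterates over the list it returns and builds another list containing only the base file names: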

import re
import os

with open('my_file.xml', 'r') as xmlfile:  # the 'U' mode flag was removed in Python 3.11; text mode already normalizes newlines
    file_text = xmlfile.read()
    file_list = [os.path.basename(fn)
                    for fn in re.findall(r'\bFile Name:\s+.*\\.*(?=\n)', file_text)]
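One portability caveat: os.path.basename treats the backslash as a separator only when the script runs on Windows. The sketch below uses ntpath.basename (the module mentioned in the question), which applies Windows path rules on any platform. The sample text and the simplified pattern are assumptions for illustration, not the pattern from the answer.

```python
import ntpath
import re

# Hypothetical sample standing in for the contents of my_file.xml.
file_text = (
    "File Name: c:\\windows\\system32\\file1.exe\n"
    "File Name: c:\\windows\\system32\\file2.exe\n"
)

# Capture only the path on each "File Name:" line, then let ntpath
# strip the directory part using Windows rules.
paths = re.findall(r'^File Name:\s+(.*)$', file_text, flags=re.MULTILINE)
file_list = [ntpath.basename(p) for p in paths]
print(file_list)  # ['file1.exe', 'file2.exe']
```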

Extract file names from a list of full paths?
