Question

I can't extract part of the text from a txt file. I'm using Python 3, and the whole text file is formatted like this:

    integer stringOfFilePathandName.cpp string integer
    ...not needed text...
    ...not needed text...
    singleInteger( zero or one)
    ---------------------------------
    integer stringOfFilePathandName2.cpp string integer
    ...not needed text...
    ...not needed text...
    singleInteger( zero or one)
    ---------------------------------

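The number of unneeded text lines varies for each occurrence of the pattern. I need to save stringOfFilePathandName.cpp and the singleInteger value to a dictionary, e.g. {stringOfFilePathandName: (0 or 1)}.

The text also contains other file extensions (like .cpp) that I do not need. In addition, I don't know the file's encoding, so I read it as binary.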

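My problem has a lot in common with the ones addressed in the following links:

Python read through file until match, read until next pattern

https://sopython.com/canon/92/extract-text-from-a-file-between-two-markers/ - I don't really understand it

python - Read file from and to specific lines of text - I tried to copy it, but it only works for one instance. I need to iterate over the whole file.

So far I have tried the following, which only works for a single occurrence: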

import re

fileRegex = re.compile(r".*\.cpp")
dictOfFiles = {}

with open('txfile', "rb") as fin:
    filename = None
    # First loop: stops at the first line that mentions a .cpp file
    for line in fin:
        if re.search(fileRegex, str(line)):
            filename = ((re.search(fileRegex, str(line))).group()).lstrip("b'")
            break
    # Second loop: every later 0/1 line is stored under that single filename
    for line in fin:
        if (str(line).lstrip("b'").rstrip("\\n'")) == "0" or (str(line).lstrip("b'").rstrip("\\n'")) == "1":
            dictOfFiles[filename] = (str(line).lstrip("b'").rstrip("\\n'"))

    del filename

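My thinking is that a similar process is needed to iterate over the whole file. The approach I have followed so far works line by line. It may be better to read the entire text into a variable and then extract from it. Any ideas are welcome; this has been bothering me for a long time...

As requested, here is the text file: https://raw.githubusercontent.com/CGCL-codes/VulDeePecker/master/CWE-119/CGD/cwe119_cgd.txt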

Answer 1

One possibility is to use re.findall with a regex pattern that can match content spanning multiple lines:

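import re

# Named "text" rather than "input" to avoid shadowing the built-in input()
text = """1 file1.cpp blah 3
           not needed
           not needed
           2
           ---------------------------------
           9 file1.cpp blah 5
           not needed
           not needed
           3
           ---------------------------------"""
# re.DOTALL lets .*? run across line breaks up to the last number before the dashes
matches = re.findall(r'(\w+\.cpp).*?(\d+)(?=\s+--------)', text, re.DOTALL)
print(matches)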

This prints:

[('file1.cpp', '2'), ('file1.cpp', '3')]

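This answer assumes you can tolerate reading the whole file into memory and then making a single pass over it with re.findall. If you cannot do that, you will need to keep using your current line-by-line parsing approach.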
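If reading the whole file into memory is not an option, the line-by-line route can be sketched as a small state machine. This is only an illustration under assumptions: the helper name parse_blocks and the HEADER pattern are hypothetical, every block is assumed to start with a header line and end with a dashed separator, and a repeated path overwrites its earlier entry in the dictionary.

```python
import re

# Assumed header shape: "<int> <path>.cpp <string> <int>" (bytes, since the file is read as binary)
HEADER = re.compile(rb"^\d+\s+(\S+\.cpp)\s")

def parse_blocks(lines):
    """Map each .cpp path to the 0/1 label found in its block (hypothetical helper)."""
    result = {}
    filename = None     # path from the current block's header, if it ends in .cpp
    label = None        # last line in the current block that is exactly 0 or 1
    first_line = True   # True while waiting for the header line of a block
    for raw in lines:
        line = raw.strip()  # drops the trailing LF or CRLF
        if not line:
            continue
        if line.startswith(b"-----"):
            # Separator: the block is complete, so store it and reset the state
            if filename is not None and label is not None:
                result[filename.decode("utf-8", errors="replace")] = label
            filename, label, first_line = None, None, True
        elif first_line:
            match = HEADER.match(line)
            filename = match.group(1) if match else None  # other extensions are skipped
            first_line = False
        elif line in (b"0", b"1"):
            label = int(line)
    return result
```

It could be called as `with open('txfile', 'rb') as fin: dictOfFiles = parse_blocks(fin)`, which iterates over the file one line at a time instead of loading it all at once.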

Answer 2

You can use

import re

fileRegex = re.compile(rb"^\d+\s+(\S+\.cpp)\s.*(?:\r?\n(?![01]\r?$).*)*\r?\n([10]+)\r?$", re.M)
with open(r'txfile', 'rb') as fin:
    # findall returns (path, label) byte pairs; decode the path and convert the label to int
    dictOfFiles = [(k.decode('utf-8'), int(v.decode('utf-8'))) for k, v in fileRegex.findall(fin.read())]

Then, print(dictOfFiles) returns

[('stringOfFilePathandName.cpp', 0), ('stringOfFilePathandName2.cpp', 1), ...]

See the regex demo.

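Notes

You need to read the whole file into memory for this multiline regex to work, which is why I am using fin.read()
When you read a file in binary mode, CR characters are not removed, so I added \r? (an optional CR) before each \n
To convert the byte strings to Unicode strings, we need to call .decode('utf-8') on the results.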
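The second and third notes can be checked with a short experiment. This is a minimal sketch that writes a throwaway CRLF file (the name sample.txt is made up for the demo) and reads it back both ways; it assumes the content is valid UTF-8.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical demo file
    with open(path, "wb") as f:
        f.write(b"1 a.cpp x 2\r\n0\r\n")    # Windows-style line endings

    # Binary mode returns the bytes untouched, so the CR characters are still there
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with the default newline handling translates CRLF to LF
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(raw)                  # bytes, CRs kept
print(text == "1 a.cpp x 2\n0\n")
print(raw.decode("utf-8"))  # bytes -> str; the CRs survive decoding
```

If the real encoding is unknown, passing errors="replace" to decode is one way to avoid a UnicodeDecodeError, at the cost of substituting any undecodable bytes.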

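Regex details (in case you need to adjust it later):

^ - start of a line (because of re.M, ^ matches at the start of every line)
\d+ - one or more digits
\s+ - one or more whitespace characters
(\S+\.cpp) - Group 1: one or more non-whitespace characters followed by .cpp
\s - a whitespace character
.* - zero or more characters other than a line feed, as many as possible
(?:\r?\n(?![01]\r?$).*)* - zero or more repetitions of a line ending followed by a line that does not consist solely of 0 or 1
\r?\n - a CRLF or LF line ending
([10]+) - Group 2: one or more 1 or 0 characters (a single digit in this file format)
\r? - an optional CR
$ - end of the line.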
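To make later adjustments easier, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. The sketch below is not part of the original answer: the names FILE_BLOCK and labels_by_file are hypothetical, it assumes the paths decode as UTF-8, and it returns a dictionary (as asked in the question), so a repeated path keeps only its last label.

```python
import re

FILE_BLOCK = re.compile(
    rb"""
    ^\d+\s+            # leading integer, then whitespace
    (\S+\.cpp)         # group 1: path ending in .cpp
    \s.*               # rest of the header line
    (?:                # zero or more following lines...
        \r?\n
        (?![01]\r?$)   # ...that are not a lone 0 or 1
        .*
    )*
    \r?\n
    ([10]+)            # group 2: the label line
    \r?$
    """,
    re.M | re.X,
)

def labels_by_file(data):
    """Return {path: label} for every .cpp block found in the raw bytes."""
    return {
        name.decode("utf-8"): int(label)
        for name, label in FILE_BLOCK.findall(data)
    }
```

A possible call is `with open('txfile', 'rb') as fin: dictOfFiles = labels_by_file(fin.read())`.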

Python: how to extract text between two markers more than once in a whole text file?