Question

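I have a text file in which all columns are merged into a single column, and the "rows" are separated by two long lines of "-". It looks like this: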

Hash: some_hash_id
Author: some_author
Message: Message about the update


Reviewers: jimbo

Reviewed By: jimbo

Test Plan: Auto-generated

@bypass-lint
Commit Date: 2019-06-30 20:12:38
Modified path: path/to/my/file.php
Modified path: some/other/path/to/my/file.php
Modified path: path/to/other/file.php
-------------------------------------------------------
-------------------------------------------------------
Hash: some_other_hash_id
Author: different_author
Message: Auto generated message



Reviewers: broseph

Reviewed By: broseph

Test Plan: Auto-generated by Sam

@bypass-lint
Commit Date: 2019-06-30 18:09:12
Modified path: my/super/file.php
Modified path: totally/awesome/file.php
Modified path: file/path.json
-------------------------------------------------------
-------------------------------------------------------
Hash: hash_id_4
Author: new_author
Message: Auto DB big update



Reviewers: foo

Reviewed By: foo

Test Plan: Auto-generated by Tom

@bypass-lint
Commit Date: 2019-06-30 11:08:59
Modified path: big/scripts/file.json

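The expected output for this example is a dataframe with only 3 rows. Dataframe columns: Hash (str), Author (str), Message (str), Reviewers (str), Reviewed By (str), Test Plan (str), Commit Date (timestamp), Modified path (array(str))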

Answer 1

Load the whole file content into a variable named txt.
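A minimal sketch of that loading step, assuming the file is UTF-8 encoded. The helper name load_text and the file name commits.txt are placeholders for illustration, not something given in the question:

```python
from pathlib import Path

def load_text(path):
    """Read the whole file at `path` into a single string (UTF-8 assumed)."""
    return Path(path).read_text(encoding="utf-8")

# 'commits.txt' is a placeholder name; point it at your own file.
# txt = load_text("commits.txt")
```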

Then, to generate a DataFrame, a single (although quite complex) instruction is enough:

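import collections
import re
import pandas as pd

# One row per chunk: split txt on the dashed lines, then collect "key: value" pairs.
df = pd.DataFrame([ collections.OrderedDict(
    { m.group('key').strip(): re.sub(r'\n', ' ', m.group('val').strip())
        for m in re.finditer(
            r'^(?P<key>[^:\n]+):\s*(?P<val>.+?(?:\n[^:\n]+)*)$', chunk, re.M)})
    for chunk in re.split(r'(?:\n\-+)+\n', txt) ])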

Read the code starting from the last line. It splits txt into chunks at the lines that contain only - characters.
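To see the split step on its own, here is a small sketch that uses the same pattern on a shortened, made-up sample (the dashed lines are shorter than in the real file, which does not matter to the pattern):

```python
import re

# Shortened sample: two records separated by two dashed lines.
sample = (
    "Hash: some_hash_id\n"
    "Author: some_author\n"
    "-------\n"
    "-------\n"
    "Hash: some_other_hash_id\n"
    "Author: different_author\n"
)

# One or more "newline + dashes" runs followed by a newline act as the separator.
chunks = re.split(r'(?:\n\-+)+\n', sample)

for chunk in chunks:
    print(repr(chunk))
```

Because the group is non-capturing, the separator itself is not returned, so the list holds only the record text.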

Then finditer takes over, dividing each chunk into key and val capturing groups.
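A sketch of what finditer returns for one chunk. The chunk below is adapted from the question's third record; the Message is wrapped onto a second line purely for illustration, to show how a line without a colon is folded into the preceding value:

```python
import re

# Hypothetical chunk: the wrapped Message line is an illustrative assumption.
chunk = (
    "Hash: hash_id_4\n"
    "Message: Auto DB\n"
    "big update\n"
    "\n"
    "Commit Date: 2019-06-30 11:08:59"
)

pattern = r'^(?P<key>[^:\n]+):\s*(?P<val>.+?(?:\n[^:\n]+)*)$'

# key: text before the first colon on a line.
# val: the rest of the line plus any directly following lines that have no colon.
pairs = [(m.group('key'), m.group('val'))
         for m in re.finditer(pattern, chunk, re.M)]

for key, val in pairs:
    print(repr(key), repr(val))
```

Note that only the first colon on a line ends the key, so the colons inside the timestamp stay in the value.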

The next step is a dictionary comprehension that strips / substitutes each key and value and creates an OrderedDict (import collections).

All these dictionaries are wrapped in a list comprehension.

The final step is the creation of a DataFrame.

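To avoid multi-line items, newline characters within each value (the piece of text after the colon) are replaced with a space (you are free to change this).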
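One caveat: dictionary keys are unique, so when a key such as Modified path occurs several times in a chunk, the comprehension keeps only the last value. The sketch below is one possible variant, not part of the original answer, that collects those paths into a list and parses Commit Date; the function name parse_records is made up for this example:

```python
import re
from datetime import datetime

FIELD_RE = re.compile(r'^(?P<key>[^:\n]+):\s*(?P<val>.+?(?:\n[^:\n]+)*)$', re.M)

def parse_records(txt):
    """Return one dict per chunk; repeated 'Modified path' keys become a list."""
    records = []
    for chunk in re.split(r'(?:\n\-+)+\n', txt):
        record = {}
        for m in FIELD_RE.finditer(chunk):
            key = m.group('key').strip()
            val = re.sub(r'\n', ' ', m.group('val').strip())
            if key == 'Modified path':
                record.setdefault(key, []).append(val)  # accumulate all paths
            elif key == 'Commit Date':
                record[key] = datetime.strptime(val, '%Y-%m-%d %H:%M:%S')
            else:
                record[key] = val
        if record:  # skip chunks that contain no "key: value" line
            records.append(record)
    return records

# Shortened sample based on the question's data.
sample = (
    "Hash: some_hash_id\n"
    "Commit Date: 2019-06-30 20:12:38\n"
    "Modified path: path/to/my/file.php\n"
    "Modified path: path/to/other/file.php\n"
    "-------\n"
    "-------\n"
    "Hash: hash_id_4\n"
    "Commit Date: 2019-06-30 11:08:59\n"
    "Modified path: big/scripts/file.json\n"
)
records = parse_records(sample)
```

Passing the result to pd.DataFrame(records) should then give one row per commit, with the paths held as a list in a single cell.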

Answer 2

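Here is an implementation. Iterate over each line; when a line contains :, split it into columnname:columnval and store columnval in a temporary dictionary under the key columnname. Use if statements to detect the special keys Hash (the start of a new row), Modified path (append it to an array) and Commit Date (convert it to a datetime).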

import pandas as pd
from datetime import datetime

test_path = '/home/kkawabat/.PyCharmCE2018.1/config/scratches/test.txt'
with open(test_path, 'r') as ofile:
    lines = ofile.readlines()
row_list = []
cur_row_dict = {}
for line in lines:
    line_split = line.split(':', 1)
    if len(line_split) == 2:
        colname, colval = line_split[0].strip(), line_split[1].strip()
        if colname == 'Hash': # assuming Hash is always the first element
            if len(cur_row_dict) != 0:
                row_list.append(cur_row_dict)
                cur_row_dict = {}
            cur_row_dict[colname] = colval # keep the hash itself as a column
        elif colname == 'Commit Date':
            cur_row_dict[colname] = datetime.strptime(colval, '%Y-%m-%d %H:%M:%S')
        elif colname == 'Modified path':
            if colname not in cur_row_dict:
                cur_row_dict[colname] = [colval]
            else:
                cur_row_dict[colname].append(colval)
        else:
            cur_row_dict[colname] = colval
row_list.append(cur_row_dict)

df = pd.DataFrame(row_list)
print(df)
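The loop above relies on Hash always being the first field of a record. If that assumption does not hold for your file, one alternative is to use the dashed lines themselves as the record boundary. A minimal sketch using itertools.groupby; the helper names is_separator and split_records are hypothetical:

```python
from itertools import groupby

def is_separator(line):
    """True for a line that consists only of '-' characters."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"-"}

def split_records(lines):
    """Yield each record as a list of lines, using dashed lines as boundaries."""
    for is_sep, group in groupby(lines, key=is_separator):
        if not is_sep:
            yield list(group)

# Shortened sample in the shape that readlines() would return.
sample = [
    "Hash: a\n", "Author: x\n",
    "-------\n", "-------\n",
    "Hash: b\n", "Author: y\n",
]
records = list(split_records(sample))
```

Each yielded list can then be run through the same key/value handling as in the loop above, without any special case for Hash.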

Parsing a non-CSV text file into a dataframe
