Parsing a txt file into JSON, only getting the last record

Time: 2017-01-12 14:12:35

Tags: python parsing

I have a formatted text file made up of Outlook emails.

A line starting with From: indicates a new email.

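I am trying to parse the "From" and "Subject" fields (the subject gets split into several fields), and then read the rest of the content until the next email, which is indicated by a new From: line.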

For now I am just trying to brute-force it, since this is a proof-of-concept test, but I only ever get the last email in the chain.

l = []
with open(r'transcripts.txt', 'r') as transcripts:
    for line in transcripts:
        is_new_subject = line.lower().startswith('from')
        if is_new_subject:
            record = {}
            record['from'] = line.split(':')[1]
        for line in transcripts:
            if line.lower().startswith('subject'):
                subject = line.split(':')[1]
                record['subject'] = subject
                split_it = subject.split('.')
                record['show'] = split_it[0]
                record['air_date'] = split_it[1]
                record['hour'] = split_it[2]
                record['content'] = ""
                for line in transcripts:
                    record['content'] += line
                    is_new_subject = line.lower().startswith('from')
                    if is_new_subject:
                        l.append(record)
                        break
with open('output.json', 'w') as outfile:
    json.dump(l, outfile, indent=4)

Any ideas? I am willing to rework this from scratch.

2 answers:

Answer 0: (score: 1)

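Your code is a bit hard to read, and I think it would be easier to debug if you broke it up into functions. I would also suggest using Python's re library for this kind of text processing, since it is more flexible than just testing for static strings. For example: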

import json
import re

def parse_emails_from_list(email_list):
    """Return a list of raw email strings, split at each 'From:' marker."""
    return re.compile("From:").split(email_list)

def parse_email_details_from_email(raw_email):
    """Do some more processing here."""
    details = {}
    details['subject'] = None  # parse your email details here
    #...
    #...
    return details

if __name__ == "__main__":
    # main loop
    with open(r'transcripts.txt', 'r') as transcripts:
        # split() needs the file's text, not the file object
        email_list = parse_emails_from_list(transcripts.read())
    parsed_emails = [parse_email_details_from_email(raw_email) for raw_email in email_list]

    with open('output.json', 'w') as outfile:
        json.dump(parsed_emails, outfile, indent=4)

After looking at your code more closely, it is clear that your loop logic is where you are running into trouble.
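As far as can be told from the code in the question, all three `for line in transcripts` loops draw from the same file iterator. The inner loops consume the lines the outer loop would need, so `record = {}` runs at most once and every entry appended to `l` is the same dictionary, overwritten by each later email. The last email is also never appended, because no From: line follows it. A minimal single-pass sketch that avoids the nested loops is shown here. The function name `parse_transcripts`, the sample lines, and the assumed subject layout `show.air_date.hour` are illustrative guesses based on the field names in the question, not on the real file.

```python
import json


def parse_transcripts(lines):
    """Group lines into one record per email, starting a new one at each 'From:' line."""
    records = []
    record = None
    for line in lines:
        lowered = line.lower()
        if lowered.startswith('from:'):
            # A new email starts here: store the previous one, then start fresh.
            if record is not None:
                records.append(record)
            record = {'from': line.split(':', 1)[1].strip(), 'content': ''}
        elif record is None:
            # Skip anything that appears before the first 'From:' line.
            continue
        elif lowered.startswith('subject:') and 'subject' not in record:
            subject = line.split(':', 1)[1].strip()
            record['subject'] = subject
            # Assumed subject layout: show.air_date.hour
            parts = subject.split('.')
            if len(parts) >= 3:
                record['show'], record['air_date'], record['hour'] = parts[:3]
        else:
            record['content'] += line
    # The last email has no following 'From:' line, so append it explicitly.
    if record is not None:
        records.append(record)
    return records


# Made-up sample data standing in for the lines of transcripts.txt.
sample = [
    "From: Alice\n",
    "Subject: News.2017-01-10.9\n",
    "First body line\n",
    "From: Bob\n",
    "Subject: Sports.2017-01-11.10\n",
    "Second body line\n",
]
records = parse_transcripts(sample)
print(json.dumps(records, indent=4))
```

With a real file, the open file object can be passed directly, since the function only needs an iterable of lines. Each email gets its own dictionary, so later emails cannot overwrite earlier ones.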

Answer 1: (score: 1)

You should try Email Parser. It is very easy to use. For some reason, it does not work for multi-part emails, so I used the split function created by @Max Paymar. Thanks, @Max Paymar.

import email
import re


def parse_emails_from_list(email_list):
    """Return a list of raw email strings, split at each 'From:' marker."""
    return re.compile("From:").split(email_list)

# Read the whole file, then split it into one chunk per email.
with open('sampleEmail.txt', 'r') as a:
    email_list = parse_emails_from_list(a.read())

for E_mail in email_list:
    # Re-attach the 'From:' header that the split removed.
    msg = email.message_from_string('From:' + E_mail)
    print(msg['Subject'])
    print(msg['From'])
    print(msg.get_payload())
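To get from these printed values to the JSON output asked for in the question, the parsed messages can be collected into dictionaries. This is a rough sketch, assuming Python 3 and plain-text (non-multipart) messages. The helper name `emails_to_records`, the line-anchored split pattern, and the sample text are all made up for illustration.

```python
import email
import json
import re


def emails_to_records(text):
    """Split text at lines starting with 'From:' and parse each chunk with the email module."""
    records = []
    # (?m) makes ^ match at the start of every line, so a 'From:' in the middle of a line is ignored.
    for chunk in re.split(r'(?m)^From:', text):
        if not chunk.strip():
            continue  # skip the empty piece before the first 'From:'
        msg = email.message_from_string('From:' + chunk)
        records.append({
            'from': msg['From'],
            'subject': msg['Subject'],
            # For multipart messages get_payload() returns a list, which would
            # need extra handling before it can be written as JSON.
            'content': msg.get_payload(),
        })
    return records


# Made-up sample data; headers and body are separated by a blank line.
sample = (
    "From: Alice\n"
    "Subject: News.2017-01-10.9\n"
    "\n"
    "First body line\n"
    "From: Bob\n"
    "Subject: Sports.2017-01-11.10\n"
    "\n"
    "Second body line\n"
)
records = emails_to_records(sample)
print(json.dumps(records, indent=4))
```

Writing the list to a file with `json.dump(records, outfile, indent=4)` would then give the kind of output the question aims for.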