Question

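I have many emails in a single string. I need to split this string into the individual emails. Each email starts with "From:" on a new line. The following works as long as "From:" does not appear anywhere else in the body: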

list_of_email_strings = re.split("From:", my_email_text_string)

I need to ignore any "From:" that does not occur right after a newline. The following (with a caret) doesn't work:

list_of_email_strings = re.split("^From:", my_email_text_string)

Any solutions?

Answer 1

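You can combine \n with a non-consuming lookahead assertion (?=...), which has the advantage of not eating the string you are splitting on (i.e. "From:" stays intact).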

list_of_email_strings = re.split("\n(?=From:)", my_email_text_string)

For example:

>>> s = "From: ...\nFrom: ...\nFrom: ..."
>>> re.split("\n(?=From:)", s)
['From: ...', 'From: ...', 'From: ...']

Compare with:

>>> re.split("\nFrom:", s)
['From: ...', ' ...', ' ...']
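As for why the caret version in the question fails: by default ^ only matches at the very start of the string, so it needs re.MULTILINE to match after each newline as well. Here is a minimal sketch, assuming Python 3.7 or later (earlier versions do not split on an empty match). The sample text and the helper name split_emails are made up for illustration.

```python
import re

def split_emails(text):
    # re.MULTILINE lets ^ match at the start of every line, not just the string.
    # flags must be passed by keyword: the third positional argument is maxsplit.
    pieces = re.split(r"^(?=From:)", text, flags=re.MULTILINE)
    # The empty match at position 0 yields a leading '' piece; drop empty pieces.
    return [p for p in pieces if p]

# Hypothetical sample: the second message mentions "From:" in the middle of a line.
sample = (
    "From: a@example.com\nhello\n"
    "From: b@example.com\nquoted From: a@example.com\n"
)
emails = split_emails(sample)
```

Unlike the \n(?=From:) version, each piece here keeps its trailing newline, since nothing is consumed by the split.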

Answer 2

Similar to wim's answer, but with From: added back to each email as required:

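# Skip the first piece: it is whatever precedes the first "From:" line
# (an empty string when the text itself starts with "From:").
emails = ['From:' + msg for msg in ('\n' + text).split('\nFrom:')[1:]]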

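However, there are some native Python modules that give you better, more reliable control over reading email files like the ones you describe. email and mailbox come to mind.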
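As a rough illustration of the email module, here is a sketch that parses one already-split chunk into a message object. The chunk contents are invented for the example, and it assumes the chunk looks like ordinary headers, a blank line, then a body.

```python
import email

# Hypothetical chunk, as one of the split pieces might look.
chunk = "From: alice@example.com\nSubject: Hi\n\nBody text\n"

msg = email.message_from_string(chunk)
sender = msg["From"]        # header lookup is case-insensitive
subject = msg["Subject"]
body = msg.get_payload()    # a str for a simple, non-multipart message
```

This gives you header access without writing any further regular expressions.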

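Assuming these are standard mbox-style emails, where each message starts with a "From" line (strictly, the mbox separator line is "From " followed by a space rather than "From:"), then some header lines, possibly a digest, and so on, like the files used by sendmail or Postfix, you could do something like this, either by first writing the string to a file or by just using the existing file: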

import mailbox

mbox = mailbox.mbox(path_to_mailbox_file)
mbox.lock()  # only if you're using an active mailbox file
try:
    message_strings = [message.as_string() for message in mbox]
finally:
    mbox.unlock()  # again, only if you're using an active mailbox file
    mbox.close()

To get the number of messages, just use len(mbox).
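If the emails only exist as a string, the "write it to a file first" route could look roughly like the sketch below. It assumes the text really is in mbox format, with "From " separator lines. The helper name read_mbox_subjects and the sample data are hypothetical.

```python
import mailbox
import os
import tempfile

def read_mbox_subjects(mbox_text):
    """Write mbox-formatted text to a temp file and return (count, subjects)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.mbox")
        # Write bytes so that newlines are not translated by the platform.
        with open(path, "wb") as f:
            f.write(mbox_text.encode("ascii"))
        mbox = mailbox.mbox(path)
        try:
            subjects = [message["Subject"] for message in mbox]
            count = len(mbox)
        finally:
            mbox.close()
    return count, subjects

# Hypothetical two-message mailbox; each message opens with a "From " line.
sample_mbox = (
    "From alice@example.com Thu Jan  1 00:00:00 2015\n"
    "From: alice@example.com\nSubject: one\n\nfirst body\n\n"
    "From bob@example.com Thu Jan  1 00:00:01 2015\n"
    "From: bob@example.com\nSubject: two\n\nsecond body\n"
)
count, subjects = read_mbox_subjects(sample_mbox)
```

No locking is done here because the temporary file is private to the script.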

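There are plenty of other useful features. I've written a few scripts with these modules and have been quite happy with the results. (Note that as_string may reformat some headers.)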

Splitting a string with re.split()
