Trying to extract paragraphs from an email using regex

Time: 2018-11-26 18:07:19

Tags: python regex python-3.x

I am trying to use a regular expression to extract paragraphs from text of the following form:

<0.30.1.92.13.39.38.marian+@MARIAN.ADM.CS.CMU.EDU (Marian D'Amico).0>
Type:     cmu.cs.scs
Topic:    LOGIC COLLOQUIUM
Dates:    6-Feb-92
Time:     3:30
Host:     Stephen D. Brookes
PostedBy: marian+ on 30-Jan-92 at 13:39 from MARIAN.ADM.CS.CMU.EDU 
(Marian D'Amico)
Abstract: 



***********************************************************************
          Logic Colloquium
            Thursday February 6
           3:30 Wean 5409
 **********************************************************************
       On The Mathematics of Non-monotonic Reasoning
          Menachem Magidor
       Hebrew University of Jerusalem
          (Joint work with Daniel Lehman)

Non-monotonic reasoning is an attempt to develop reasoning systems
where an inference means that the conclusion holds in the "normal 
case",
in "most cases", but it does not necessarily hold in all cases. It 
seems 
that this type of reasoning is needed if one wants to model everyday
common-sense reasoning. There have been many models suggested for
non-monotonic reasoning (like circumscription, default logic, 
autoepistemic logic, etc). We study all these approaches in a more 
abstract fashion by considering the inference relation of the 
reasoning system, and clarify the role of different inference rules 
and the impact they have on the model theory of the logic. We are 
especially interested in a particular rule called "Rational Monotony" 
and the connection between it and probabilistic models.

 NOTE: Prof. Magidor will also give a Math Department Colloquium on 
Friday
 February 7.

-------------------------
 Host:  Stephen D. Brookes

Appointments can be made through Marian D'Amico, marian@cs, x7665.

I am trying:     paragraph_regex = r'(?<=\n\n)(?:(?:\s*\b.+\b:(?:.|\s)+?)|(\s{0,4}[A-Za-z0-9]+?\s*))(?=\n\n)'

This regex captures the paragraph in some cases, but in others it either fails to capture it or hangs.

Any help would be greatly appreciated.

1 Answer:

Answer 0 (score: 1)

I would try a different approach.

You can split the text on newlines:

texts = text.split('\n')

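From there, test each piece to decide whether it is part of the email body or something else. One option is to look for blocks of text whose preceding or following line is blank. Something like this might work: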

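paragraphs = []

# Skip the first and last lines so texts[i-1] and texts[i+1] always exist.
for i in range(1, len(texts) - 1):
    text = texts[i]
    # Keep a non-empty line that follows a blank line and is followed by more text.
    if text != '' and texts[i-1] == '' and texts[i+1] != '':
        paragraphs.append(text)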
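
The loop above only keeps the first line of each block. If whole paragraphs are wanted, one possible variation is to split on runs of blank lines instead of on single newlines. This is a minimal sketch, not a tested solution for every email layout; the function name `split_paragraphs` and the shortened `sample` string are made up for illustration.

```python
import re


def split_paragraphs(text):
    """Return the blocks of text that are separated by one or more blank lines."""
    # Normalize Windows line endings so "\r\n" does not hide blank lines.
    text = text.replace("\r\n", "\n")
    # "\n\s*\n" matches a newline, any whitespace-only lines, and a final newline.
    blocks = re.split(r"\n\s*\n", text)
    # Drop empty pieces and trim stray newlines at the edges of each block.
    return [block.strip("\n") for block in blocks if block.strip()]


# Shortened, hypothetical sample modeled on the email in the question.
sample = (
    "Abstract: \n\n\n\n"
    "Non-monotonic reasoning is an attempt\n"
    "to develop reasoning systems\n\n"
    " NOTE: Prof. Magidor will also give a talk\n"
)

paragraphs = split_paragraphs(sample)
for paragraph in paragraphs:
    print(repr(paragraph))
```

Each returned block can then be checked separately, for example to discard header blocks such as the `Type:` / `Topic:` lines.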

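As an aside, regex will only get you so far. Most text data sources vary a lot in formatting, and your regex will never catch every edge case. I have only had to do this once, but building a classification model to recognize paragraphs would be more robust (and easier).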

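That is a research project in its own right, but if you go that way, look at term frequency-inverse document frequency (TF-IDF) paired with a support vector classifier (SVC), and don't let anyone talk you into a neural network unless you have a lot of good training data :).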
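
To give a rough idea of what the TF-IDF plus SVC pairing could look like, here is a minimal sketch using scikit-learn. The training blocks and the labels `"header"` and `"paragraph"` are invented for illustration; a real model would need many labeled blocks taken from actual emails, and the character n-gram settings are just one plausible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical, hand-labeled training blocks (far too few for real use).
train_blocks = [
    "Type:     cmu.cs.scs",
    "Topic:    LOGIC COLLOQUIUM",
    "Dates:    6-Feb-92",
    "Time:     3:30",
    "Non-monotonic reasoning is an attempt to develop reasoning systems.",
    "We study all these approaches in a more abstract fashion.",
    "There have been many models suggested for non-monotonic reasoning.",
    "It seems that this type of reasoning is needed for everyday reasoning.",
]
train_labels = [
    "header", "header", "header", "header",
    "paragraph", "paragraph", "paragraph", "paragraph",
]

# Character n-grams pick up cues such as "Word:" followed by padding spaces.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    SVC(kernel="linear"),
)
model.fit(train_blocks, train_labels)

# Classify unseen blocks; with so little data the labels are not reliable.
new_blocks = [
    "Host:     Stephen D. Brookes",
    "We are especially interested in a particular rule.",
]
predicted = model.predict(new_blocks)
print(list(zip(new_blocks, predicted)))
```

The blocks fed to such a model could come from a simple blank-line split of the email, so the classifier only has to decide what kind of block each one is.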