Extracting paragraph text in Python

Time: 2017-10-04 04:04:49

Tags: python extract docx

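How can I use Python to search a Word document and, once a paragraph heading such as "1.2 Summary of Broadspectrum Offer" has been matched, extract the paragraph text under it?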

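That is, given the sample document below, I basically want to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown"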

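Note that the heading numbers change from document to document and I don't want to rely on them, so I would rather rely on the search text in the heading.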

So far I can search the document, but that is only a start.

1.  Executive Summary

1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..

1.2 Summary of Broadspectrum Offer

A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 

1 answer:

Answer 0 (score: 2)

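Here is a preliminary solution (pending an answer to my comment on your post above). It does not yet exclude other paragraphs that come after the Summary of Broadspectrum Offer section. If that is needed, you will most likely want a small regex match to detect when you have reached another heading such as 1.3 (and so on) and stop collecting there. Let me know if that is a requirement.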

Edit: converted the print() call from a list comprehension to a standard for loop, in response to Anton vBR's comment below.

from docx import Document

document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx")

# Find the index of every paragraph containing the heading text and store it
ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]
# Print the text of every paragraph that comes after the first matching heading
if ind:
    for i, para in enumerate(document.paragraphs):
        if i > ind[0]:
            print(para.text)

# Earlier list-comprehension form of the same loop (before the edit)
[print(para.text) for i, para in enumerate(document.paragraphs) if ind and i > ind[0]]

>> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. 
Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown 
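
The regex-based stop mentioned in the answer could look roughly like the sketch below. It is not part of the original answer. The function name `extract_section` and the pattern `HEADING_RE` are hypothetical, and the pattern assumes that heading numbers such as "1.3" appear as literal text at the start of the paragraph (Word's automatic numbering would not show up in `para.text`). Body text that happens to start with something like "2.5 kW" would also be treated as a heading, so the pattern may need tightening for real documents.

```python
import re

# Assumption: a heading starts with a dotted section number, e.g. "1.", "1.3" or "2.4.1"
HEADING_RE = re.compile(r"^\s*\d+\.(?:\d+\.?)*\s+\S")


def extract_section(paragraph_texts, heading_text):
    """Return the non-empty paragraphs that follow the first paragraph
    containing heading_text, stopping at the next numbered heading."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Still looking for the heading we were asked to match
            if heading_text in text:
                inside = True
            continue
        if HEADING_RE.match(text):
            # Reached the next numbered heading: the section is over
            break
        if text.strip():
            collected.append(text)
    return collected
```

With python-docx this would be called as `extract_section([p.text for p in document.paragraphs], 'Summary of Broadspectrum Offer')`, which keeps the search on the heading text rather than on its number.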

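Also, here is another post that may help with an alternative approach, namely detecting headings from the paragraph metadata (the style type): Extracting headings' text from word doc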