Extracting text headers with BeautifulSoup

Date: 2018-02-20 19:02:31

Tags: python beautifulsoup

I am trying to extract the text headers listed in this Army field manual. I first converted it to an HTML file using Adobe Acrobat:

http://usacac.army.mil/sites/default/files/misc/doctrine/CDG/cdg_resources/manuals/fm/fm7_15.pdf

from requests import get
from bs4 import BeautifulSoup
import pandas as pd

url = 'C:/Users/.../fm7_15.html'

with open(url, "r") as ur:
    html = ur.read()

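# Name the parser explicitly; omitting it triggers a warning and can vary by machine
soup = BeautifulSoup(html, "html.parser")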

headers_30 = soup.find_all("p", attrs={"class":
                                "s30"})
headers_33 = soup.find_all("p", attrs={"class":
                                "s33"})
headers_20 = soup.find_all("p", attrs={"class":
                                "s20"})

df30 = pd.DataFrame(headers_30,columns=["column"])
df30.to_csv('headers_30.csv', index=False)

df33 = pd.DataFrame(headers_33,columns=["column"])
df33.to_csv('headers_33.csv', index=False)

df20 = pd.DataFrame(headers_20,columns=["column"])
df20.to_csv('headers_20.csv', index=False)

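Three classes make up the different headers (s30, s33, s20). I managed to save them as CSV, but the output also contains all of the associated HTML tags. What is the best way to extract only the header text?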

1 answer:

Answer 0 (score: 2)

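find_all returns Tag objects, which is why the markup ends up in the CSV. You can use a list comprehension to extract the text from each element: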

headers_30 = [i.text for i in soup.find_all("p", {"class":"s30"})]
headers_33 = [i.text for i in soup.find_all("p", {"class":"s33"})]
headers_20 = [i.text for i in soup.find_all("p", {"class":"s20"})]

instead of:

headers_30 = soup.find_all("p", attrs={"class":"s30"})
headers_33 = soup.find_all("p", attrs={"class":"s33"})
headers_20 = soup.find_all("p", attrs={"class":"s20"})
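
To make the difference concrete, here is a minimal sketch of the same idea run against a small inline HTML string. The sample markup and the helper name header_texts are made up for illustration and are not taken from the converted manual. It uses get_text(" ", strip=True) rather than .text, on the assumption that the Acrobat output may contain nested tags and stray whitespace inside the header paragraphs.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the converted fm7_15.html; the real file's
# markup is not shown in the question, so this structure is an assumption.
sample_html = """
<html><body>
<p class="s30">Chapter 1 <b>Introduction</b></p>
<p class="s33">  Section I  </p>
<p class="s20">ART 1.0</p>
<p class="s1">Body text that is not a header.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def header_texts(soup, css_class):
    """Return the plain text of every <p> element with the given class.

    Illustrative helper (not part of BeautifulSoup). get_text(" ", strip=True)
    trims each text fragment and joins nested fragments with a single space.
    """
    return [
        p.get_text(" ", strip=True)
        for p in soup.find_all("p", class_=css_class)
    ]


# One list of plain strings per header class named in the question
headers = {cls: header_texts(soup, cls) for cls in ("s30", "s33", "s20")}

for cls, texts in headers.items():
    print(cls, texts)
```

Each resulting list holds plain strings, so it can be passed to pd.DataFrame(..., columns=["column"]) and written with to_csv exactly as in the question, without the tags.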