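Parsing only the data tags that contain a specific string pattern

Time: 2019-07-19 19:54:55

Tags: python parsing beautifulsoup

I want to parse the "td" data tags that contain a string matching a regular-expression pattern. A sample td containing such a string is "/Archives/edgar/data/1446194/000144619419000004/0001446194-19-000004-index.htm".

I tried using re.compile and a regex expression together with "td:contains":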

import re

import requests
from bs4 import BeautifulSoup

a=list()

url = "https://www.sec.gov/cgi-bin/browse-edgar?filenum=028-13216&action=getcompany"
r =requests.get(url)
soup = BeautifulSoup(r.text, 'html')
table = soup.find("table",{"class":"tableFile2"})
rows = table.find_all("tr")

# raw string so that \w and \d reach the regex engine unchanged
text_main=r'<[a-z]{2} [a-z]{7}="[a-z]{7}"><[a-z] [a-z]{4}="/\w/\w/\w/\d{7}/\d{18}/\d{10}-\d{2}-\d{6}-\w.[a-z]{3}" [a-z]{2}'

for i in rows:
    a.append(i.find_all(f'td:contains({re.compile(text_main)})'))

a is just a list of empty lists.

1 Answer:

Answer 0: (score: 0)

No regular expression is needed. Try the following code.

from bs4 import BeautifulSoup
import requests

a=[]
url = "https://www.sec.gov/cgi-bin/browse-edgar?filenum=028-13216&action=getcompany"
r =requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.select_one(".tableFile2")


for i in table.select("td[nowrap='nowrap']"):
    # look up the link once and reuse it
    link = i.select_one('a#documentsbutton')
    if link is not None:
        a.append(link['href'])

print(a)

Output:

['/Archives/edgar/data/1446194/000144619419000004/0001446194-19-000004-index.htm', '/Archives/edgar/data/1446194/000144619419000003/0001446194-19-000003-index.htm', '/Archives/edgar/data/1446194/000144619418000008/0001446194-18-000008-index.htm', '/Archives/edgar/data/1446194/000144619418000007/0001446194-18-000007-index.htm', '/Archives/edgar/data/1446194/000144619418000005/0001446194-18-000005-index.htm', '/Archives/edgar/data/1446194/000144619418000002/0001446194-18-000002-index.htm', '/Archives/edgar/data/1446194/000144619417000017/0001446194-17-000017-index.htm', '/Archives/edgar/data/1446194/000144619417000010/0001446194-17-000010-index.htm', '/Archives/edgar/data/1446194/000144619417000008/0001446194-17-000008-index.htm', '/Archives/edgar/data/1446194/000144619417000006/0001446194-17-000006-index.htm', '/Archives/edgar/data/1446194/000144619417000002/0001446194-17-000002-index.htm', '/Archives/edgar/data/1446194/000144619416000016/0001446194-16-000016-index.htm', '/Archives/edgar/data/1446194/000144619416000014/0001446194-16-000014-index.htm', '/Archives/edgar/data/1446194/000144619416000013/0001446194-16-000013-index.htm', '/Archives/edgar/data/1446194/000144619416000012/0001446194-16-000012-index.htm', '/Archives/edgar/data/1446194/000144619416000009/0001446194-16-000009-index.htm', '/Archives/edgar/data/1446194/000144619415000008/0001446194-15-000008-index.htm', '/Archives/edgar/data/1446194/000144619415000006/0001446194-15-000006-index.htm', '/Archives/edgar/data/1446194/000113630515000010/0001136305-15-000010-index.htm', '/Archives/edgar/data/1446194/000144619415000002/0001446194-15-000002-index.htm', '/Archives/edgar/data/1446194/000144619414000013/0001446194-14-000013-index.htm', '/Archives/edgar/data/1446194/000144619414000009/0001446194-14-000009-index.htm', '/Archives/edgar/data/1446194/000144619414000007/0001446194-14-000007-index.htm', '/Archives/edgar/data/1446194/000144619414000001/0001446194-14-000001-index.htm', 
'/Archives/edgar/data/1446194/000144619413000053/0001446194-13-000053-index.htm', '/Archives/edgar/data/1446194/000144619413000050/0001446194-13-000050-index.htm', '/Archives/edgar/data/1446194/000144619413000013/0001446194-13-000013-index.htm', '/Archives/edgar/data/1446194/000144619413000002/0001446194-13-000002-index.htm', '/Archives/edgar/data/1446194/000144619412000034/0001446194-12-000034-index.htm', '/Archives/edgar/data/1446194/000144619412000024/0001446194-12-000024-index.htm', '/Archives/edgar/data/1446194/000144619412000013/0001446194-12-000013-index.htm', '/Archives/edgar/data/1446194/000144619412000002/0001446194-12-000002-index.htm', '/Archives/edgar/data/1446194/000091895011000005/0000918950-11-000005-index.htm', '/Archives/edgar/data/1446194/000144619411000004/0001446194-11-000004-index.htm', '/Archives/edgar/data/1446194/000144619411000003/0001446194-11-000003-index.htm', '/Archives/edgar/data/1446194/000144619411000002/0001446194-11-000002-index.htm', '/Archives/edgar/data/1446194/000144619411000001/0001446194-11-000001-index.htm', '/Archives/edgar/data/1446194/000144619410000014/0001446194-10-000014-index.htm', '/Archives/edgar/data/1446194/000144619410000013/0001446194-10-000013-index.htm', '/Archives/edgar/data/1446194/000144619410000011/0001446194-10-000011-index.htm']

Alternatively, you can use this:

from bs4 import BeautifulSoup
import requests

a=[]
url = "https://www.sec.gov/cgi-bin/browse-edgar?filenum=028-13216&action=getcompany"
r =requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.select_one(".tableFile2")


for i in table.select("td[nowrap='nowrap'] a#documentsbutton"):
    a.append(i['href'])

print(a)
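
If a regex filter is still wanted, note that find_all treats its first argument as a tag name rather than a CSS selector. This is why 'td:contains(...)' in the question matched nothing and a ended up as a list of empty lists. Below is a minimal sketch of one way to apply a pattern to the href attribute instead. SAMPLE_HTML is a made-up, simplified stand-in for the EDGAR table, and INDEX_HREF and index_links are hypothetical names used only for illustration. The pattern is an assumption based on the sample path quoted in the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the EDGAR results table;
# the real page has more columns and attributes.
SAMPLE_HTML = """
<table class="tableFile2">
  <tr><th>Filings</th><th>Format</th></tr>
  <tr>
    <td nowrap="nowrap">13F-HR</td>
    <td nowrap="nowrap"><a href="/Archives/edgar/data/1446194/000144619419000004/0001446194-19-000004-index.htm" id="documentsbutton">Documents</a></td>
  </tr>
  <tr>
    <td nowrap="nowrap">13F-HR</td>
    <td nowrap="nowrap"><a href="/cgi-bin/browse-edgar?action=getcompany">Other link</a></td>
  </tr>
</table>
"""

# Assumed shape of an index-page path, modelled on the sample in the question.
INDEX_HREF = re.compile(
    r"^/Archives/edgar/data/\d+/\d{18}/\d{10}-\d{2}-\d{6}-index\.htm$"
)


def index_links(html):
    """Return the href of every <a> in the table whose href matches INDEX_HREF."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="tableFile2")
    # Passing a compiled regex as an attribute filter keeps only matching tags.
    return [link["href"] for link in table.find_all("a", href=INDEX_HREF)]


links = index_links(SAMPLE_HTML)
print(links)
```

Against the live page, r.text from the requests call above would take the place of SAMPLE_HTML. The CSS-selector versions are probably simpler as long as the documentsbutton id stays in the markup.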