How do I parse a table in a text file?

Time: 2016-03-02 21:48:56

Tags: python parsing web-scraping beautifulsoup text-files

I am trying to scrape the holdings table at the bottom of this page to get the information in each column: https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt

What I have so far is:

from bs4 import BeautifulSoup
import urllib2
import datetime
import sys

def scrape(url):
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    bs = BeautifulSoup(htmltext)
    tables = bs.find_all('table')
    for table in tables:
        print table

if __name__ == '__main__':
    url = 'https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt'
    scrape(url)

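However, this only gets me as far as the table itself, and I can't seem to parse it any further, row by row. Any help with this would be much appreciated, thanks!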

1 answer:

Answer 0 (score: 0)

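The problem is that this is not an HTML table but a set of whitespace-delimited columns, so you have to parse it differently. Here is a very naive but working solution that uses splitlines() to split the table into rows and split() to split each row into columns: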

import urllib2

from bs4 import BeautifulSoup

def scrape(url):
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    bs = BeautifulSoup(htmltext, "html.parser")

    data = bs.find('table').get_text().splitlines()[10:]
    for line in data:
        print(line.split())

if __name__ == '__main__':
    url = 'https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt'
    scrape(url)

Prints:

['ADVENTRX', 'PHARMAMACEUTICALS', 'INC', 'COM', 'NEW', '00764X202', '289', '138,377', 'SH', 'SOLE', 'N/A', '138,377']
['AMGEN', 'INC', 'COM', '31162100', '54,519', '1,020,000', 'SH', 'SOLE', 'N/A', '1,020,000']
...
['SOUTHERN', 'UN', 'CO', 'NEW', 'COM', '844030106', '5,328', '186,154', 'SH', 'SOLE', 'N/A', '186,154']
['TAKE-TWO', 'INTERACTIVE', 'SOFTWAR', 'COM', '874054109', '151,310', '9,844,502', 'SH', 'SOLE', 'N/A', '9,844,502']
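
The output above also shows that split() breaks multi-word issuer names into a varying number of tokens, so column positions counted from the left are not stable. Here is a minimal sketch of one way around that, assuming every data row ends with the same seven tokens as the rows printed above. The function name split_row and the dictionary keys are hypothetical labels chosen for illustration, not names taken from the filing:

```python
def split_row(line):
    """Split one row, taking the fixed trailing columns from the right."""
    tokens = line.split()
    if len(tokens) < 8:
        return None  # not enough tokens to be a data row under this assumption
    # Assumption: the last seven tokens are always present, as in the rows above.
    cusip, value, shares, unit, discretion, managers, voting = tokens[-7:]
    try:
        value = int(value.replace(",", ""))
        shares = int(shares.replace(",", ""))
    except ValueError:
        return None  # header or other non-data line
    return {
        "name_and_class": " ".join(tokens[:-7]),  # issuer name plus share class
        "cusip": cusip,
        "value": value,
        "shares": shares,
        "unit": unit,
        "discretion": discretion,
        "managers": managers,
        "voting": voting,
    }


row = split_row("AMGEN INC COM 31162100 54,519 1,020,000 SH SOLE N/A 1,020,000")
print(row["name_and_class"])
```

The issuer name and the share class are still joined in one field here, because nothing in the token stream alone says where one ends and the other begins. Separating them would probably need the fixed column offsets of the original text.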

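The most fragile part is the [10:] slice, which hard-codes the number of header lines to skip. I'll leave improving that to you.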
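
One possible way to avoid the hard-coded slice is to keep only the lines that look like data rows instead of counting header lines. The sketch below assumes a data row always contains a CUSIP-like token of 8 or 9 uppercase letters or digits followed by two comma-grouped numbers, as in the rows printed above. The names DATA_ROW and data_rows are hypothetical, and the pattern is a guess at the row shape, not something defined by the filing format:

```python
import re

# Assumed shape of a data row: a CUSIP-like token of 8 or 9 letters/digits,
# followed by two comma-grouped numbers (as in the rows printed above).
DATA_ROW = re.compile(r"(?:^|\s)[0-9A-Z]{8,9}\s+[\d,]+\s+[\d,]+(?:\s|$)")


def data_rows(text):
    """Return the split tokens of every line that looks like a data row."""
    return [line.split() for line in text.splitlines() if DATA_ROW.search(line)]
```

In scrape() this could take the place of the slice, for example data_rows(bs.find('table').get_text()). Header lines, divider lines and blank lines are then skipped whatever their number, at the cost of depending on the assumed row pattern.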