How to resolve an IndexError when using XPath

Time: 2019-10-31 13:09:04

Tags: xpath web-scraping

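Sorry, I'm a beginner. I've been trying to get metadata from the SEC website. Here is the link: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001403161&type=10&dateb=&owner=exclude&count=40

For now, let's just get the date. I'm trying XPath, but it throws an IndexError. I checked the fetched HTML, and it does appear to contain the data.

My code: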

from lxml import html
import requests


page = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001403161&type=10&dateb=&owner=exclude&count=40')
tree = html.fromstring(page.content)

date = tree.xpath('//*[@id="seriesDiv"]/table/tbody/tr[2]/td[4]')[0].text
print(date)

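How can I make this work?

Any help would be greatly appreciated.

Thanks!

2 Answers:

Answer 0: (score: 0)

I'm not sure about the XPath, since that's how I would have written it too. But if you don't specifically need XPath, I'd go the pandas route: parse the whole table, then pull out individual cells as needed.

pd.read_html() returns a list of dataframes (one for each <table> tag in the HTML). You just need to pick the table you want, which in this case is at index position 2 (the last of the 3 dataframes):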

import pandas as pd

url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001403161&type=10&dateb=&owner=exclude&count=40'
dfs = pd.read_html(url)
df = dfs[-1]

Output:

print (df.to_string())
   Filings                      Format                                        Description Filing Date    File/Film Number
0     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2019-07-26   001-3397719978181
1     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2019-04-26   001-3397719771802
2     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2019-01-31   001-3397719556097
3     10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2018-11-16  001-33977181189947
4     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2018-07-27   001-3397718974910
5     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2018-04-27   001-3397718783872
6     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2018-02-01   001-3397718567042
7     10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2017-11-17  001-33977171209440
8     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2017-07-20   001-3397717974492
9     10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2017-04-21   001-3397717774258
10    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2017-02-02   001-3397717568413
11    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2016-11-15  001-33977162000223
12    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2016-07-25  001-33977161782265
13    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2016-04-25  001-33977161589237
14    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2016-01-28  001-33977161369122
15    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2015-11-20  001-33977151244628
16    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2015-07-23  001-33977151002526
17    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2015-04-30   001-3397715819049
18    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2015-01-29   001-3397715559143
19    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2014-11-21  001-33977141240400
20    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2014-07-24   001-3397714991576
21    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2014-04-24   001-3397714781985
22    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2014-01-30   001-3397714558846
23    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2013-11-22  001-33977131236561
24    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2013-07-24   001-3397713983884
25    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2013-05-01   001-3397713803519
26    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2013-02-06   001-3397713578037
27    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2012-11-16  001-33977121209935
28    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2012-07-27   001-3397712990778
29    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2012-05-02   001-3397712805918
30    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2012-02-08   001-3397712582250
31    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2011-11-18  001-33977111214519
32    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2011-07-29   001-3397711996223
33    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2011-05-05   001-3397711815087
34    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2011-02-02   001-3397711566916
35    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2010-11-19  001-33977101205707
36    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2010-08-02   001-3397710982428
37    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2010-05-03   001-3397710789509
38    10-Q  Documents Interactive Data  Quarterly report [Sections 13 or 15(d)]Acc-no:...  2010-02-03   001-3397710571090
39    10-K  Documents Interactive Data  Annual report [Section 13 and 15(d), not S-K I...  2009-11-20  001-33977091198831

To print a single cell (one row, one column):

print (df.loc[0,'Filing Date'])
2019-07-26
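
Picking the table by position (dfs[-1]) stops working if the page gains or loses a table. As a hedged alternative, pd.read_html() accepts a `match` argument that keeps only the tables whose text matches a given string or regex. The sketch below runs against a small inline HTML snippet that is made up for illustration (it is not the real SEC markup), so it works offline; it assumes pandas and lxml are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: two tables, only one of
# which holds the filings. The real SEC markup may differ.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Field</th><th>Value</th></tr>
  <tr><td>State</td><td>CA</td></tr>
</table>
<table class="tableFile2">
  <tr><th>Filings</th><th>Format</th><th>Description</th><th>Filing Date</th></tr>
  <tr><td>10-Q</td><td>Documents</td><td>Quarterly report</td><td>2019-07-26</td></tr>
  <tr><td>10-Q</td><td>Documents</td><td>Quarterly report</td><td>2019-04-26</td></tr>
</table>
</body></html>
"""

# Without `match`, every <table> comes back as its own DataFrame.
all_tables = pd.read_html(StringIO(SAMPLE_HTML))

# With `match`, only tables containing the given text are returned.
filing_tables = pd.read_html(StringIO(SAMPLE_HTML), match="Filing Date")
filings = filing_tables[0]

# A single cell, and the whole column as a plain Python list.
first_date = filings.loc[0, "Filing Date"]
all_dates = filings["Filing Date"].tolist()

print(len(all_tables), len(filing_tables))
print(first_date)
print(all_dates)
```

Selecting by content rather than by index keeps the lookup tied to the "Filing Date" header, which is the thing the code actually depends on.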

Answer 1: (score: 0)

This approach returns the entire Filing Date column as a list:

    from lxml import html
    import requests

    page = requests.get('https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001403161&type=10&dateb=&owner=exclude&count=40')
    tree = html.fromstring(page.content)

    # Filing date from the first data row only (tr[2])
    Firstdate = tree.xpath('//table[@class="tableFile2"]//tr[2]/td[4]/text()')
    print(Firstdate)
    # Filing dates from every row of the table
    Alldates = tree.xpath('//table[@class="tableFile2"]//tr/td[4]/text()')
    print(Alldates)

Output: ['2019-07-26', '2019-04-26', '2019-01-31', '2018-11-16', '2018-07-27', '2018-04-27', '2018-02-01', '2017-11-17', '2017-07-20', '2017-04-21', '2017-02-02', '2016-11-15', '2016-07-25', '2016-04-25', '2016-01-28', '2015-11-20', '2015-07-23', '2015-04-30', '2015-01-29', '2014-11-21', '2014-07-24', '2014-04-24', '2014-01-30', '2013-11-22', '2013-07-24', '2013-05-01', '2013-02-06', '2012-11-16', '2012-07-27', '2012-05-02', '2012-02-08', '2011-11-18', '2011-07-29', '2011-05-05', '2011-02-02', '2010-11-19', '2010-08-02', '2010-05-03', '2010-02-03', '2009-11-20']
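
As for why the expression in the question fails, one likely cause (an assumption, since the raw page source is not shown here) is the `tbody` step. Browser developer tools insert `<tbody>` into tables when they display the DOM, so an XPath copied from them can contain a `tbody` that the HTML served to requests, as parsed by lxml, does not have. The XPath then matches nothing, and `[0]` on the empty list raises IndexError. The sketch below reproduces this on a made-up inline snippet rather than the live SEC page; `first_or_none` is a hypothetical helper, not part of lxml.

```python
from lxml import html

# Hypothetical, simplified stand-in for the served page: note that the
# table has no <tbody> element.
SAMPLE_HTML = """
<html><body>
<div id="seriesDiv">
  <table class="tableFile2">
    <tr><th>Filings</th><th>Format</th><th>Description</th><th>Filing Date</th></tr>
    <tr><td>10-Q</td><td>Documents</td><td>Quarterly report</td><td>2019-07-26</td></tr>
    <tr><td>10-Q</td><td>Documents</td><td>Quarterly report</td><td>2019-04-26</td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)


def first_or_none(results):
    """Return the first XPath result, or None when nothing matched."""
    return results[0] if results else None


# Path as copied from the browser: the tbody step matches nothing here.
with_tbody = tree.xpath('//*[@id="seriesDiv"]/table/tbody/tr[2]/td[4]/text()')

# Same path with the tbody step dropped.
without_tbody = tree.xpath('//*[@id="seriesDiv"]/table/tr[2]/td[4]/text()')

print(with_tbody)  # empty list, so indexing it with [0] would raise IndexError
print(first_or_none(with_tbody))
print(first_or_none(without_tbody))
```

Checking whether the result list is empty before indexing turns a crash into a value the caller can handle, which also makes a wrong XPath easier to spot.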
