Beautiful Soup web scraper

Date: 2018-03-04 19:30:29

Tags: python python-2.7 web-scraping beautifulsoup

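I am trying to scrape the web page at the following URL: https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00

I want to scrape the table shown in the HTML below. I have tried a few things, but I cannot get the table I need into a CSV file. The <tr> tags for the data rows are never closed, so splitting the data into separate rows is a problem.

Thanks for your help --j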

<table border='0' width='900' align='center' cellspacing='1' cellpadding='4'>
                <tr>
                    <td class='innertable_header1' rowspan='3'>Category of shareholder</td>
                    <td class='innertable_header1' rowspan='3'>Nos. of shareholders</td>
                    <td class='innertable_header1' rowspan='3'>No. of fully paid up equity shares held</td>
                    <td class='innertable_header1' rowspan='3'>No. of shares underlying Depository Receipts</td>
                    <td class='innertable_header1' rowspan='3'>Total nos. shares held</td>
                    <td class='innertable_header1' rowspan='3'>Shareholding as a % of total no. of shares (calculated as per SCRR, 1957)As a % of (A+B+C2)</td>
                    <td class='innertable_header1' rowspan='3'> Number of equity shares held in dematerialized form</td>
                </tr>
                <tr></tr>
                <tr></tr>
                <tr>
                    <td class='TTRow_left'>(A) Promoter & Promoter Group</td>
                    <td class='TTRow_right'>19</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'></td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <td class='TTRow_right'>12.90</td>
                    <td class='TTRow_right'>28,17,02,889</td>
                    <tr>
                        <td class='TTRow_left'>(B) Public</td>
                        <td class='TTRow_right'>9,16,058</td>
                        <td class='TTRow_right'>1,87,81,45,362</td>
                        <td class='TTRow_right'>1,32,95,642</td>
                        <td class='TTRow_right'>1,89,14,41,004</td>
                        <td class='TTRow_right'>86.61</td>
                        <td class='TTRow_right'>1,88,74,40,959</td>
                        <tr>
                            <td class='TTRow_left'>(C1) Shares underlying DRs</td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'></td>
                            <td class='TTRow_right'>0.00</td>
                            <td class='TTRow_right'></td>
                            <tr>
                                <td class='TTRow_left'>(C2) Shares held by Employee Trust</td>
                                <td class='TTRow_right'>1</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'></td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <td class='TTRow_right'>0.49</td>
                                <td class='TTRow_right'>1,08,05,896</td>
                                <tr>
                                    <td class='TTRow_left'>(C) Non Promoter-Non Public</td>
                                    <td class='TTRow_right'>1</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'></td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <td class='TTRow_right'>0.49</td>
                                    <td class='TTRow_right'>1,08,05,896</td>
                                    <tr>
                                        <td class='TTRow_left'>Grand Total</td>
                                        <td class='TTRow_right'>9,16,078</td>
                                        <td class='TTRow_right'>2,17,06,54,147</td>
                                        <td class='TTRow_right'>1,32,95,642</td>
                                        <td class='TTRow_right'>2,18,39,49,789</td>
                                        <td class='TTRow_right'>100.00</td>
                                        <td class='TTRow_right'>2,17,99,49,744</td>
                                    </tr>
            </table>

1 answer:

Answer 0 (score: 1)

You can try this:

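from bs4 import BeautifulSoup as soup
import urllib
import re

# Python 2.7: fetch the page and parse it with lxml.
url = 'https://www.bseindia.com/corporates/shpSecurities.aspx?scripcd=500209&qtrid=96.00'
s = soup(urllib.urlopen(url).read(), 'lxml')
# Take the text of every data cell, strip line breaks and runs of whitespace, and drop empty cells.
cells = s.find_all('td', {'class': re.compile('TTRow_right|TTRow_left')})
results = filter(None, [re.sub(r'[\n\r]+|\s{2,}', '', i.text) for i in cells])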

Output:

[u'(A) Promoter & Promoter Group', u'19', u'28,17,02,889', u'28,17,02,889', u'12.90', u'28,17,02,889', u'(B) Public', u'9,16,058', u'1,87,81,45,362', u'1,32,95,642', u'1,89,14,41,004', u'86.61', u'1,88,74,40,959', u'(C1) Shares underlying DRs', u'0.00', u'(C2) Shares held by Employee Trust', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'(C) Non Promoter-Non Public', u'1', u'1,08,05,896', u'1,08,05,896', u'0.49', u'1,08,05,896', u'Grand Total', u'9,16,078', u'2,17,06,54,147', u'1,32,95,642', u'2,18,39,49,789', u'100.00', u'2,17,99,49,744']
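
The list above is flat, and because empty cells were filtered out, it cannot be cut back into rows by counting seven cells at a time. One possible way to get CSV rows, sketched below, is to keep the empty cells and start a new row at every cell with the class TTRow_left, which marks the first column in the HTML from the question. This is only a sketch under assumptions: it is written for Python 3 rather than the 2.7 used above, the helper names extract_cells, group_rows and rows_to_csv are made up for illustration, and it assumes the page still uses the same class names.

```python
import csv
import io


def extract_cells(html):
    """Return (css_class, text) pairs for every data cell, empty ones included."""
    # Imported here so the grouping helpers below also work without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    cells = []
    # A list passed to class_ matches a <td> carrying any of the listed classes.
    for td in soup.find_all("td", class_=["TTRow_left", "TTRow_right"]):
        cells.append((td["class"][0], td.get_text(strip=True)))
    return cells


def group_rows(cells, first_cell_class="TTRow_left"):
    """Start a new row at every cell whose class marks the first column."""
    rows = []
    for css_class, text in cells:
        if css_class == first_cell_class or not rows:
            rows.append([])
        rows[-1].append(text)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; values containing commas get quoted."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

With the page source in a variable such as html (fetched as in the answer, or with urllib.request.urlopen on Python 3), rows_to_csv(group_rows(extract_cells(html))) gives the CSV text, which can then be written to a file. Since the rows are split on the cell class and not on the <tr> tags, the unclosed <tr> elements do not matter here.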