Scraping the text after the second <br/>

Time: 2018-10-06 10:56:02

Tags: python beautifulsoup

I have the HTML extract below. Note that every row I need to capture repeats these two td elements.

<table class="ent">
<tbody class=""><tr class="tablestyle">

    <td class="hide_on_mobile">  <a href="../" class="">
        <img class="ProductImage" src="https://.."></a>
    </td>
    <td class="hide_on_mobile" align="center">
        <strong class="">
            <span style="font-size:1.4em;" class="">Scraped okay - col0</span>
                <br>
                <br>Scrape this text - col1</strong><br>
                <br><i><span style="color:indigo;" class="">Scrape this text - col2
                <br class="">
                <br>Next Event: Scrape this text -col3</span></i>
    </td>

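I need to capture four separate pieces of data: col0, col1, col2 and col3.

I can already get col0. I still need to capture col1, col2 and col3.

I am trying to use the BR tags, i.e. the ones after the span:

For col1, get the text after the 2nd BR

For col2, get the text after the 3rd BR

For col3, get the text after the 5th BR

I can't get col1 to work with br > br. Any ideas on how to solve this?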

import sqlite3
import datetime
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http:/*"

r = requests.get(url)
source = r.text
t = datetime.datetime.now().date()
soup = BeautifulSoup(source, "lxml")

row_count=200

row_marker = 0

new_table = pd.DataFrame(columns = ["col0", "col1", "col2","col3", "DateAdded"], index = range(0,row_count)) # I don't know the number of rows

# For col0
column_marker = 0
for layout in soup.select("strong > span"):
            new_table.iat[row_marker,column_marker] = layout.text.strip()
            new_table.iat[row_marker,4] = t
            row_marker +=1

# For col 1

column_marker = 1
row_marker = 0
for layout in soup.select("strong > span > br > br"):
            new_table.iat[row_marker,column_marker] = layout.text.strip()
            row_marker +=1
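
A likely reason the `strong > span > br > br` selector matches nothing: in the snippet above the `<br>` tags are siblings of the `<span>`, not children of it, and `<br>` is a void element, so it can contain neither another `<br>` nor any text. The text you want is the node that follows a `<br>`. Below is a minimal sketch of that idea, assuming the row looks exactly like the posted snippet; the helper name `text_after_brs` is made up for illustration, and `html.parser` is used only so the example has no extra dependency.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the second <td> from the question, wrapped in a closed table.
HTML = """
<table><tr>
<td class="hide_on_mobile" align="center">
    <strong class="">
        <span style="font-size:1.4em;" class="">Scraped okay - col0</span>
            <br>
            <br>Scrape this text - col1</strong><br>
            <br><i><span style="color:indigo;" class="">Scrape this text - col2
            <br class="">
            <br>Next Event: Scrape this text -col3</span></i>
</td>
</tr></table>
"""


def text_after_brs(tag):
    """Collect non-blank text nodes that immediately follow a <br> child of tag.

    Hypothetical helper, not part of Beautiful Soup.
    """
    pieces = []
    for br in tag.find_all("br", recursive=False):
        node = br.next_sibling
        # Keep only real text; skip whitespace-only nodes and other tags.
        if isinstance(node, NavigableString) and node.strip():
            pieces.append(" ".join(node.split()))
    return pieces


soup = BeautifulSoup(HTML, "html.parser")
strong = soup.select_one("td > strong")
inner_span = soup.select_one("td > i > span")

col0 = strong.span.get_text(strip=True)
col1 = text_after_brs(strong)[0]          # text after the 2nd <br> in <strong>
col2 = next(inner_span.stripped_strings)  # first text inside the inner <span>
col3 = text_after_brs(inner_span)[0]      # text after the last <br> in the <span>

print(col0, col1, col2, col3, sep=" | ")
```

Note that in this markup col2 does not directly follow a `<br>`; it is the first text inside the `<i><span>`, so counting BR tags alone would not reach it.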

1 answer:

Answer 0: (score: 0)

# since you said there are multiple trs
# `soup` is the BeautifulSoup object built in the question
trs = soup.find_all('tr')


for tr in trs:
    l = []
    td = tr.find_all('td')
    # the first td never holds data, according to the question above
    if len(td) < 2:
        continue
    for tags in td[1]:
        try:
            if tags.text:
                print(tags.text)
                l.extend(tags.text.split('\n'))
        except AttributeError:
            # some nodes (e.g. plain text nodes in older bs4 releases) have no .text
            pass

# once there are more trs, keep the code below inside the loop,
# then store the data in a df, since each iteration produces a new list
# collapse runs of whitespace and drop blank entries
str_data = [' '.join(s.split()) for s in l if s.strip()]
print(str_data)

Output

['Scraped okay - col0',
 'Scrape this text - col1',
 'Scrape this text - col2',
 'Next Event: Scrape this text -col3']
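
If the number of rows is not known in advance, it may be simpler to collect one dict per row and build the table at the end, rather than preallocating 200 rows. The sketch below is one possible way to do that with `stripped_strings`, assuming every data row has the same two `<td>` cells as the posted snippet and that the second cell always yields exactly four text fragments. The function name `rows_from_table` is hypothetical, the table is closed by hand so the example is complete, and `html.parser` is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# The snippet from the question, with the closing tags added for the example.
HTML = """
<table class="ent">
<tbody class=""><tr class="tablestyle">
    <td class="hide_on_mobile">  <a href="../" class="">
        <img class="ProductImage" src="https://.."></a>
    </td>
    <td class="hide_on_mobile" align="center">
        <strong class="">
            <span style="font-size:1.4em;" class="">Scraped okay - col0</span>
                <br>
                <br>Scrape this text - col1</strong><br>
                <br><i><span style="color:indigo;" class="">Scrape this text - col2
                <br class="">
                <br>Next Event: Scrape this text -col3</span></i>
    </td>
</tr></tbody></table>
"""

COLUMNS = ["col0", "col1", "col2", "col3"]


def rows_from_table(html):
    """Return one dict per <tr>, keyed by COLUMNS (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.ent tr"):
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue  # not a data row
        # stripped_strings yields every non-blank text fragment in document order
        texts = [" ".join(s.split()) for s in tds[1].stripped_strings]
        if len(texts) == len(COLUMNS):
            rows.append(dict(zip(COLUMNS, texts)))
    return rows


rows = rows_from_table(HTML)
print(rows)
```

The resulting list can be passed straight to `pd.DataFrame(rows)`, and a `DateAdded` column can be assigned afterwards. Rows whose second cell does not produce four fragments are skipped here, which is a design choice you may want to change to logging or raising instead.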