Inside a <pre> tag

Time: 2019-01-07 12:53:51

Tags: python beautifulsoup screen-scraping

I wanted to try some basic web scraping, but I ran into a problem because I am used to simple td tags. In this case the webpage has the following pre tag with all of the text inside it, which makes it a bit trickier to scrape.

<pre style="word-wrap: break-word; white-space: pre-wrap;">
11111111
11111112
11111113
11111114
11111115
</pre>

Any suggestions on how to scrape each row?

Thanks

2 answers:

Answer 0 (score: 4)

If that is exactly what you want to parse, you can call splitlines() on the tag's text to get a list of rows, or use split("\n") instead, like this.

from bs4 import BeautifulSoup

content = """
<pre style="word-wrap: break-word; white-space: pre-wrap;">
11111111 
11111112 
11111113
11111114
11111115 
</pre>"""  # This is your content

soup = BeautifulSoup(content, "html.parser")
stuff = soup.find('pre').text
# The text starts and ends with a newline and some rows have trailing spaces,
# so splitting alone leaves blank entries; strip each row and skip the blanks.
# split("\n") can be replaced by stuff.splitlines()
lines = [line.strip() for line in stuff.split("\n") if line.strip()]
# print(lines) gives ['11111111', '11111112', '11111113', '11111114', '11111115']
for line in lines:
    print(line)
# prints each row separately.
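
As a rough illustration of how the two calls differ, here is a small sketch that uses a plain string as a stand-in for soup.find('pre').text (the pre_text value and the variable names are assumptions made for this example, not taken from the real page). It only uses built-in str methods, so it runs without BeautifulSoup.

```python
# Hypothetical stand-in for soup.find('pre').text: it begins and ends with
# a newline, and some rows carry a trailing space.
pre_text = "\n11111111 \n11111112 \n11111113\n"

# split("\n") keeps an empty string on both sides of the text.
by_split = pre_text.split("\n")

# splitlines() drops the empty entry after the final newline,
# but still keeps the leading one.
by_splitlines = pre_text.splitlines()

# Stripping each row and skipping blanks gives clean values either way.
rows = [row.strip() for row in by_splitlines if row.strip()]

print(by_split)
print(by_splitlines)
print(rows)
```

Filtering after the split is probably the simplest way to avoid depending on exactly how the parser handles the newline right after the opening tag.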

Answer 1 (score: 0)

If each line is indeed on a line by itself, why not just split the content into a list?

# soup is a BeautifulSoup object built from the page's HTML
data = soup.find('pre').text
lines = data.splitlines()

You can pass True (the keepends argument) to splitlines() to keep the line endings, if that is what you want.
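
A minimal sketch of what the keepends argument changes, using a made-up string with mixed line endings in place of the scraped text (the data value is an assumption for illustration only):

```python
# Hypothetical sample text with both "\n" and "\r\n" line endings.
data = "11111111\n11111112\r\n11111113"

# Default behaviour: line endings are removed from every row.
without_endings = data.splitlines()

# Passing True keeps whatever line ending each row originally had.
with_endings = data.splitlines(True)

print(without_endings)  # ['11111111', '11111112', '11111113']
print(with_endings)     # ['11111111\n', '11111112\r\n', '11111113']

# With the endings kept, joining the pieces rebuilds the original text.
print("".join(with_endings) == data)  # True
```

Keeping the endings is mostly useful if you plan to write the rows back out unchanged; for plain values, the default is usually enough.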