BeautifulSoup: ignore Unicode errors and print only the text

Date: 2019-05-02 06:06:49

Tags: python beautifulsoup

I'm doing some web scraping to pull text out of tables. Unicode errors keep popping up, and when I encode to UTF-8 my results end up littered with values like b'\xc2\xa0'. Is there a way to avoid having to encode at all and just get the plain text from the tables?

Traceback (most recent call last):
  File "c:\...\...\...", line 15, in <module>
    print(rows)
  File "C:\...\...\...\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2612' in position 3: character maps to <undefined>

When I use replace, I get a type error:

TypeError: a bytes-like object is required, not 'str' 

whether or not I wrap it in str(). I also tried looping through and printing only the items that can be converted to strings, but the Unicode error pops up again.
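For reference, a minimal sketch of where that TypeError comes from (made-up values, not the actual table data): .replace() on a bytes object only accepts bytes arguments, not str.

raw = '\xa0Washington\xa0'.encode('utf-8')   # bytes: b'\xc2\xa0Washington\xc2\xa0'
# raw.replace('\xc2\xa0', '')                # TypeError: a bytes-like object is required, not 'str'
clean = raw.replace(b'\xc2\xa0', b'')        # a bytes pattern and replacement work
print(clean.decode('utf-8'))                 # 'Washington': decode back to str before printing

The code I am currently running: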

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'

import re

import requests
from urllib.request import urlopen


from bs4 import BeautifulSoup

page = urlopen(test).read()
soup = BeautifulSoup(page, 'lxml')

tables = soup.findAll('table')

for table in tables:
  for row in table.findAll('tr'):
    for cel in row.findAll('td'):
      if str(cel.getText().encode('utf-8').strip()) != "b'\\xc2\\xa0'":
        print(str(cel.getText().encode('utf-8').strip()))
        #print(str(cel.getText().encode('utf-8').strip().replace('\\xc2\\xa0', '').replace('b\'', '')))

Actual output:

b'\xe2\x98\x92'
b'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

b'\xe2\x98\x90'
b'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

b'Washington'

b'\xc2\xa0'

b'91-1144442'

b'(State or other jurisdiction of\nincorporation or organization)'
...
...

Expected output:

'QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

'TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934'

'Washington'

'91-1144442'

'(State or other jurisdiction of\nincorporation or organization)'

...
...

1 answer:

Answer 0 (score: 2):

BeautifulSoup already handles the UTF-8 encoded HTML correctly and hands back ordinary Python strings; calling .encode('utf-8') on the cell text turns those strings back into bytes, which is where the b'...' values in your output come from.
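A small illustration of that point (a hypothetical snippet, not part of the original answer): get_text() already returns a str, so the non-breaking space can be compared and stripped directly, with no .encode() call at all.

from bs4 import BeautifulSoup

# Hypothetical one-cell example: get_text() returns str, not bytes.
cell = BeautifulSoup('<td>&nbsp;</td>', 'html.parser').td
print(type(cell.get_text()))              # <class 'str'>
print(cell.get_text() == '\xa0')          # True: the NBSP is an ordinary str character
print(repr(cell.get_text(strip=True)))    # '': strip=True removes the NBSP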

The following produces the desired output:

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

for table in soup.find_all('table'):
    for row in table.findAll('tr'):
        for cel in row.findAll('td'):
            text = cel.get_text(strip=True)

            if text:   # skip blank lines
                print(text)
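The 'charmap' UnicodeEncodeError in the traceback comes from the console using the cp1252 codec, not from BeautifulSoup, so it can still appear when characters such as '\u2612' are printed in some Windows terminals or IDEs. One possible workaround, assuming Python 3.7 or newer (an assumption about the environment, not part of the original answer), is to switch stdout to UTF-8 before printing:

import sys

# Assumes Python 3.7+: send stdout through UTF-8 so printing characters such as
# '\u2612' no longer goes through the cp1252 'charmap' codec.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')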

The HTML tables can also be stored as a list of rows:

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

rows = []

for table in soup.find_all('table'):
    for row in table.findAll('tr'):
        values = [cel.get_text(strip=True) for cel in row.findAll('td')]
        rows.append(values)

print(rows)
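If the blank cells (the NBSP-only ones, which become empty strings with strip=True) should be dropped from the row lists as well, a small filter can be added. This is an extension of the snippet above, not part of the original answer:

from bs4 import BeautifulSoup
import requests

test = 'https://www.sec.gov/Archives/edgar/data/789019/000156459019001392/msft-10q_20181231.htm'
req = requests.get(test)
soup = BeautifulSoup(req.content, "html.parser")

rows = []

for table in soup.find_all('table'):
    for row in table.findAll('tr'):
        values = [cel.get_text(strip=True) for cel in row.findAll('td')]
        # Drop the empty strings left over from NBSP-only cells and skip rows
        # that contain nothing at all.
        values = [v for v in values if v]
        if values:
            rows.append(values)

print(rows)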

Tested with:

Python 3.7.3, BS4 4.7.1
Python 2.7.16, BS4 4.7.1
