Question

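I am trying to read an HTML page and extract some information from it. In one of the rows, the information I need is in the alt attribute of an image, like this: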

<img src='logo.jpg' alt='info i need'>

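The problem is that when this is parsed, BeautifulSoup wraps the contents of alt in double quotes instead of using the single quotes that are already there. So the result looks like this: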

<img alt="\'info" i="" need="" src="\'logo.jpg\'"/>

Currently, my code contains the following:

name = row.find("td", {"class": "logo"}).find("img")["alt"]

This should return "info i need", but it currently returns "\'info". What am I doing wrong? Do I need to change a setting to make BeautifulSoup parse this correctly?

Edit: my code looks like this (I also tried the standard HTML parser, but it made no difference)

import sys
import urllib.request
import time
from html.parser import HTMLParser
from bs4 import BeautifulSoup

def main():     
    url = 'https://myhtml.html'
    with urllib.request.urlopen(url) as page:
        text = str(page.read())
        html = BeautifulSoup(page.read(), "lxml")

        table = html.find("table", {"id": "info_table"})
        rows = table.find_all("tr")

        for row in rows:
            if row.find("th") is not None:
                continue
            info = row.find("td", {"class": "logo"}).find("img")["alt"]
            print(info) 


if __name__ == '__main__':
    main()

And the HTML:

<div class="table_container">
<table class="info_table" id="info_table">
<tr>
   <th class="logo">Important infos</th>
   <th class="useless">Other infos</th>
</tr>
<tr >
   <td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
   <td class="useless">
      <nobr>useless info</nobr>
   </td>
</tr>
<tr >
   <td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
   <td class="useless">
      <nobr>useless info</nobr>
   </td>
</tr>

Answer 1

Sorry, I am not able to add a comment.

I tested your case, and the output looks correct to me.

HTML:

<html>
    <body>
        <td class="logo">
            <img src='logo.jpg' alt='info i need'>
        </td>
    </body>
</html>

Python:

from bs4 import BeautifulSoup

with open("myhtml.html", "r") as html:
    soup = BeautifulSoup(html, 'html.parser')
    name = soup.find("td", {"class": "logo"}).find("img")["alt"]
    print(name)

Returns:

info i need

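I think your problem is an encoding issue that occurs when the file is written back to HTML.

Please provide the complete code and more information:

the HTML
your Python code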

Update:

I tested your code, and it does not work at all :/ After reworking it, I was able to get the desired output.

import urllib.request
from bs4 import BeautifulSoup

def main():
    url = 'https://code.mytesturl.net'
    with urllib.request.urlopen(url) as page:
        # Hand the response object directly to BeautifulSoup
        soup = BeautifulSoup(page, "html.parser")
        # First <td class="logo">, then the alt text of the <img> inside it
        name = soup.find("td", {"class": "logo"}).find("img")["alt"]
        print(name)


if __name__ == '__main__':
    main()
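
The reworked code only reads the first logo cell. To collect the alt text of every data row, the loop from the question can probably be kept as it is. Below is a minimal sketch that runs against an inline copy of the HTML from the question instead of a live URL; the `HTML` constant and the `extract_alts` helper are names I made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the table from the question (an assumption standing in
# for the bytes that urlopen(...).read() would return).
HTML = b"""
<div class="table_container">
<table class="info_table" id="info_table">
<tr>
   <th class="logo">Important infos</th>
   <th class="useless">Other infos</th>
</tr>
<tr >
   <td class="logo"><img src='Logo.jpg' alt='info i need'><br></td>
   <td class="useless"><nobr>useless info</nobr></td>
</tr>
<tr >
   <td class="logo"><img src='Logo2.jpg' alt='info i need too'><br></td>
   <td class="useless"><nobr>useless info</nobr></td>
</tr>
</table>
</div>
"""


def extract_alts(markup):
    """Return the alt text of the logo image in every non-header row."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", {"id": "info_table"})
    alts = []
    for row in table.find_all("tr"):
        # Skip the header row, which contains <th> cells
        if row.find("th") is not None:
            continue
        img = row.find("td", {"class": "logo"}).find("img")
        alts.append(img["alt"])
    return alts


if __name__ == "__main__":
    for alt in extract_alts(HTML):
        print(alt)
```

The markup is passed to BeautifulSoup as bytes, without converting it to a string first, so the single quotes around the attribute values reach the parser unchanged.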

Possible problems:
Maybe your parser should be html.parser
Which Python version / bs4 version are you using?
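
One more thing that may be worth checking is the `text = str(page.read())` line in the question's code. Calling `str()` on a bytes object does not decode it; it returns the printable representation of the bytes, which can contain a leading `b'` and backslash-escaped single quotes. If a string like that ever reaches the parser, an unquoted attribute value such as `\'info` would be a plausible result. The sketch below only uses the standard library; `raw` is a made-up stand-in for what `page.read()` returns, and UTF-8 is an assumed encoding.

```python
# Stand-in for the bytes returned by page.read() (an assumption for illustration).
raw = b'<td class="logo"><img src=\'Logo.jpg\' alt=\'info i need\'></td>'

# str() on bytes gives the repr of the object, escapes included
as_repr = str(raw)

# decode() gives the actual text; UTF-8 is assumed here
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

Comparing the two printed lines shows whether backslashes have crept into the markup before parsing.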
