How to scrape a website from a txt file

Time: 2018-02-18 14:04:16

Tags: python numpy beautifulsoup

Suppose I use an online tool such as an HTML source code viewer: I enter a link, it generates the HTML source, and then I select only the <li> tags I want, something like this:

<li class='item'><a class='list-link' href='https://foo1.com'><img src='https://foo1.com/imgfoo1.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo2.com'><img src='https://foo1.com/imgfoo2.jpg' /></a></li><li class='item'><a class='list-link' href='https://foo3.com'><img src='https://foo1.com/imgfoo3.jpg' /></a></li>

So yes, sometimes it is one long line, and I then put it into a text file named urlcontainer.txt.

So, what should I do? Because when I run the code below in Python from the terminal:
import requests
import numpy as np
from bs4 import BeautifulSoup as soup

page_html = np.genfromtxt('urlcontainer.txt',dtype='str')

page_soup = soup(page_html, "html.parser") #I got the error on this line

This is the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 225, in __init__
    markup, from_encoding, exclude_encodings=exclude_encodings)):
  File "/usr/lib/python2.7/dist-packages/bs4/builder/_htmlparser.py", line 157, in prepare_markup
    exclude_encodings=exclude_encodings)
  File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 352, in __init__
    markup, override_encodings, is_html, exclude_encodings)
  File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 228, in __init__
    self.markup, self.sniffed_encoding = self.strip_byte_order_mark(markup)
  File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 280, in strip_byte_order_mark
    if (len(data) >= 4) and (data[:2] == b'\xfe\xff') \
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

The problem is that when I type page_html in the terminal, this is its value:

array(['<li', "class='item'><a", "class='list-link'",
       "href='https://foo1.com'><img",
       "src='https://foo1.com/imgfoo1.jpg'", '/></a></li><li',
       "class='item'><a", "class='list-link'",
       "href='https://foo2.com'><img",
       "src='https://foo1.com/imgfoo2.jpg'", '/></a></li><li',
       "class='item'><a", "class='list-link'",
       "href='https://foo3.com'><img",
       "src='https://foo1.com/imgfoo3.jpg'", '/></a></li>'], 
      dtype='|S34')

1 Answer:

Answer 0 (score: 1)

Read the file the usual way; there is no need for NumPy here. np.genfromtxt splits the line on whitespace, so page_html becomes an array of separate string tokens rather than a single HTML string, which is why BeautifulSoup's encoding check raises the ambiguous truth value error shown above.

from bs4 import BeautifulSoup

with open("urlcontainer.txt") as f:
    page = f.read()
soup = BeautifulSoup(page, "html.parser")

Then carry on with the parsing, as sketched below.
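For instance, here is a minimal sketch of pulling the link and image URLs out of that soup, assuming the li.item / a.list-link classes from the sample markup in the question:

# each <li class='item'> holds an <a class='list-link'> wrapping an <img>
for a in soup.select("li.item a.list-link"):
    link = a.get("href")                       # e.g. https://foo1.com
    img = a.img.get("src") if a.img else None  # the <img> src, if present
    print(link, img)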
