Question

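I'm using a simple HTMLParser to parse a webpage whose markup is always well-formed (it's autogenerated). It works well until it hits a piece of data with an '&amp;' in it: it seems to think that makes it two separate pieces of data and processes them separately. (That is, it calls "handle_data" twice.) At first I thought unescaping the '&amp;' would solve the problem, but I don't think it does. Does anyone have suggestions for how I might get my parser to treat, for example, "Paradise Bakery and Café" (i.e. "Paradise Bakery &amp; Café") as a single data item rather than two?

Thanks a lot, BSG

P.S. Please don't tell me that I really should be using BeautifulSoup. I know. But in this case the markup is guaranteed to be well-formed every time, and I find HTMLParser easier to work with than BeautifulSoup. Thanks.

I'm adding my code. Thanks!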

#this class, extending HTMLParser, is written to process HTML within a <ul>. 
#There are 6 <a> elements nested within each <li>, and I need the data from the second 
#one. Whenever it encounters an <li> tag, it sets the 'is_li' flag to true and resets 
#the count of a's seen to 0; whenever it encounters an <a> tag, it increments the count
#by 1.   When handle_data is called, it checks to make sure that the data is within
#1)an li element and 2) an a element, and that the a element is the second one in that
#li (num_as == 2). If so, it adds the data to the list. 

from HTMLParser import HTMLParser  # Python 2 module (html.parser in Python 3)

class MyHTMLParser(HTMLParser):
    pages = []
    is_li = 'false'
    num_as = 0

    def __init__(self):
        HTMLParser.__init__(self)
        self.pages = []
        self.is_li = 'false'
        self.num_as = 0
        self.close_a = 'false'
        self.close_li = 'false'
        print "initialized"

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li = 'true'
            self.close_a = 'false'
            self.close_li = 'false'

        if tag == 'a' and self.is_li == 'true':
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = 'false'
            else:
                self.num_as = 0
                self.is_li = 'false'

    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a = 'true'

        if tag == 'li':
            self.close_li = 'true'
            self.num_as = 0

    def handle_data(self, data):
        if self.is_li == 'true':
            if self.num_as == 2 and self.close_li == 'false' and self.close_a == 'false':
                # Called once per text chunk, so text containing an entity
                # ends up in self.pages as several separate items.
                print "found data",  data
                self.pages.append(data)

    def get_pages(self):
        return self.pages

Answer 1

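This is because & is the start of an HTML entity. A displayed & should be represented as &amp; in HTML (browsers will display an & followed by a space as an ampersand, but I believe that is technically invalid).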
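To see the split directly, here is a minimal sketch in Python 3 that logs the parser callbacks. The class name EventLogger and the sample markup are made up for illustration. It passes convert_charrefs=False, which I am assuming reproduces the older behaviour described in the question, where entity references are reported as their own events.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback in the order it fires."""

    def __init__(self):
        # convert_charrefs=False keeps entity references as separate events,
        # which matches the behaviour described in the question.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


logger = EventLogger()
logger.feed("<a>Paradise Bakery &amp; Cafe</a>")
logger.close()
for event in logger.events:
    print(event)
```

The text inside the link arrives as three events: a data chunk before the entity, the entity reference itself, and a data chunk after it. That is why a parser that stores every handle_data() call as a finished item ends up with two items.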

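You simply need to write your handle_data() to accommodate multiple calls, e.g. by using a member variable that is set to [] when you see the start tag, is appended to by each call to handle_data(), and is then joined into a string when you see the end tag.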

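I've taken a whack at it below. The key lines I've added are marked with # ***** comments. I've also taken the liberty of using proper Boolean values for your flags rather than strings, since that makes the code cleaner (hopefully I haven't messed it up). I also changed your __init__() to a reset() method (so that your parser object can be reused) and removed the superfluous class variables. Finally, I added handle_entityref() and handle_charref() methods to handle escaped character entities.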

from HTMLParser import HTMLParser  # Python 2 module (html.parser in Python 3)

class MyHTMLParser(HTMLParser):

    def reset(self):
        # reset() is called by HTMLParser.__init__(), so this also
        # initializes a freshly created parser.
        HTMLParser.reset(self)
        self.pages    = []
        self.text     = []                     # *****
        self.is_li    = False
        self.num_as   = 0
        self.close_a  = False
        self.close_li = False

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.is_li    = True
            self.close_a  = False
            self.close_li = False

        if tag == 'a' and self.is_li:
            if self.num_as < 7:
                self.num_as += 1
                self.close_a = False
            else:
                self.num_as = 0
                self.is_li = False

    def handle_endtag(self, tag):
        if tag == 'a':
            self.close_a  = True
        if tag == 'li':
            self.close_li = True
            self.num_as   = 0
            self.pages.append("".join(self.text))      # *****
            self.text = []                             # *****

    def handle_data(self, data):
        if self.is_li:
            if self.num_as == 2 and not self.close_li and not self.close_a:
                print "found data",  data
                self.text.append(data)              # *****

    def handle_charref(self, ref):
        # Numeric references such as &#233; are routed through the same path.
        self.handle_entityref("#" + ref)

    def handle_entityref(self, ref):
        # HTMLParser.unescape() exists in Python 2; it was removed in
        # Python 3.9, where html.unescape() is the replacement.
        self.handle_data(self.unescape("&%s;" % ref))

    def get_pages(self):
        return self.pages

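The basic idea is that instead of appending to self.pages on each call to handle_data(), you append to self.text. Then you pick some other event that occurs once for each text element (I chose the </li> tag, but it might just as well be the </a> tag; I can't see your data to tell), join those bits of text together there, and append the result to pages.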

Hopefully this gives you an idea of the approach I'm talking about, even if the exact code I posted doesn't work for you.
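For readers on Python 3, the same accumulate-then-join idea might look roughly like the sketch below, which flushes the buffer at </a> instead of </li>. This is an illustration, not a port of the code above: the class name SecondLinkParser, the in_li/in_a flags, and the sample markup are all assumptions, and convert_charrefs=False is passed only so that the entity callbacks fire as they do in the question.

```python
from html import unescape
from html.parser import HTMLParser


class SecondLinkParser(HTMLParser):
    """Hypothetical sketch: collects the text of the second <a> in each <li>."""

    def __init__(self):
        # convert_charrefs=False so entities arrive via the *ref callbacks.
        super().__init__(convert_charrefs=False)
        self.pages = []    # one finished string per matching link
        self.text = []     # chunks of the link currently being read
        self.in_li = False
        self.in_a = False
        self.num_as = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self.num_as = 0
        elif tag == "a" and self.in_li:
            self.num_as += 1
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "a" and self.in_a:
            if self.num_as == 2:
                # Flush once per link: join every chunk seen since <a>.
                self.pages.append("".join(self.text))
                self.text = []
            self.in_a = False
        elif tag == "li":
            self.in_li = False

    def _wanted(self):
        return self.in_a and self.num_as == 2

    def handle_data(self, data):
        if self._wanted():
            self.text.append(data)

    def handle_entityref(self, name):
        if self._wanted():
            self.text.append(unescape("&%s;" % name))

    def handle_charref(self, name):
        if self._wanted():
            self.text.append(unescape("&#%s;" % name))


# Made-up sample input with two list items.
sample = (
    "<ul><li><a href='#'>1</a><a href='#'>Paradise Bakery &amp; Caf&#233;</a>"
    "<a href='#'>3</a></li>"
    "<li><a href='#'>x</a><a href='#'>Tom &amp; Jerry</a></li></ul>"
)
parser = SecondLinkParser()
parser.feed(sample)
parser.close()
print(parser.pages)
```

Flushing at </a> keeps each link's text separate even if a later link in the same <li> also contains data, which is the trade-off mentioned above when choosing between </li> and </a>.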

Answer 2

Unescaping &amp; can cause odd behavior for all of the &'s. I created a class that does not split the data into chunks at & entities. You can find it HERE.
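The class linked above is not reproduced here. As a separate illustration, the standard library parser in Python 3 can do the unsplit decoding itself: with convert_charrefs=True (the default since Python 3.5), character references are converted before handle_data() is called. The sketch below assumes Python 3, and the name TextCollector is made up.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: stores every text chunk passed to handle_data()."""

    def __init__(self):
        # With convert_charrefs=True the parser decodes &amp; and friends
        # itself and hands the surrounding text over in one piece.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<a>Paradise Bakery &amp; Cafe</a>")
collector.close()
print(collector.chunks)
```

When the whole document is fed in one call, as here, the link text arrives as a single chunk with the ampersand already decoded, so no handle_entityref() or handle_charref() overrides are needed.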

Python HTMLParser divides up data at &amp;
