Question

I'm trying to extract the lists of restaurants that sit between the <p> tags.

<p class="openclosemonth" id="May2014">May, 2014</p>

<p>
<strong>CLOSED:</strong><br />
-- Haveli, Cambridge (Inman Square), MA<br />
-- Ma Soba, Boston (Beacon Hill), MA<br />
-- Milestone, Wellesley, MA<br />
-- Scosso, Peabody, MA<br />
-- Sonny Noto's, East Boston, MA<br />
-- Viva Mexican Grill, Wayland, MA<br />
</p>

<p>
<strong>OPEN:</strong><br />
-- The Abbey, Cambridge, MA<br />
-- The Bancroft, Burlington, MA<br />
-- Beantown Pho and Grill, Boston (Back Bay), MA<br />
-- The Briar Rose, Hyde Park, MA<br />
-- Caffe Nero, Boston, MA<br />
-- Cheeburger Cheeburger, Swampscott, MA<br />
</p>

Any suggestions on how to extract the data I need?

Thanks!

Answer 1

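Get all the text nodes from each <p> tag, strip the whitespace, drop the empty strings, then skip the first entry (the <strong> label, which is printed separately):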

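# Assumes soup = BeautifulSoup(page_html, 'html.parser'), with BeautifulSoup
# imported from bs4; page_html is a placeholder for the page source.
for para in soup.find_all('p'):
    if para.strong is not None:
        print(para.strong.get_text())
        # Strip every text node, drop the empty ones, skip the first (the label)
        lines = [t.strip() for t in para.find_all(text=True) if t.strip()][1:]
        print('\n'.join(lines))
        print('')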

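I added a test for a <strong> child tag so that only those specific paragraphs are selected; the month header paragraph has no <strong> child and is skipped.

For your input, this gives: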

>>> for para in soup.find_all('p'):
...     if para.strong is not None:
...         print(para.strong.get_text())
...         lines = [t.strip() for t in para.find_all(text=True) if t.strip()][1:]
...         print('\n'.join(lines))
...         print('')
... 
CLOSED:
-- Haveli, Cambridge (Inman Square), MA
-- Ma Soba, Boston (Beacon Hill), MA
-- Milestone, Wellesley, MA
-- Scosso, Peabody, MA
-- Sonny Noto's, East Boston, MA
-- Viva Mexican Grill, Wayland, MA

OPEN:
-- The Abbey, Cambridge, MA
-- The Bancroft, Burlington, MA
-- Beantown Pho and Grill, Boston (Back Bay), MA
-- The Briar Rose, Hyde Park, MA
-- Caffe Nero, Boston, MA
-- Cheeburger Cheeburger, Swampscott, MA
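
The snippet above assumes `soup` already exists. For anyone who wants to run it end to end, here is a minimal sketch of the same strip/drop-empty/skip-first idea that collects the lines into a dictionary instead of printing them. The sample markup is an abridged copy of the question's HTML, and the names `PAGE_HTML` and `restaurants_by_status` are made up for illustration. It uses `string=True`, the name Beautiful Soup 4.4.0 and later give to the `text=True` argument shown above.

```python
from bs4 import BeautifulSoup

# Abridged copy of the markup from the question.
PAGE_HTML = """
<p class="openclosemonth" id="May2014">May, 2014</p>

<p>
<strong>CLOSED:</strong><br />
-- Haveli, Cambridge (Inman Square), MA<br />
-- Ma Soba, Boston (Beacon Hill), MA<br />
</p>

<p>
<strong>OPEN:</strong><br />
-- The Abbey, Cambridge, MA<br />
-- The Bancroft, Burlington, MA<br />
</p>
"""


def restaurants_by_status(page_html):
    """Map each <strong> label (without the colon) to the lines that follow it."""
    soup = BeautifulSoup(page_html, "html.parser")
    result = {}
    for para in soup.find_all("p"):
        if para.strong is None:
            continue  # e.g. the month header paragraph has no <strong> child
        # Strip every text node and drop the ones that end up empty.
        texts = [t.strip() for t in para.find_all(string=True) if t.strip()]
        label = texts[0].rstrip(":")   # first entry is the <strong> text
        result[label] = texts[1:]      # the rest are the restaurant lines
    return result


listings = restaurants_by_status(PAGE_HTML)
for status, lines in listings.items():
    print(status, len(lines))
```

Keeping the result in a dictionary makes it easy to look up `listings["OPEN"]` or `listings["CLOSED"]` later rather than re-parsing the printed output.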

Answer 2

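A Python 3-compatible version that uses a simpler approach. Note that it prints the <strong> label along with the entries, and prints a blank line for each whitespace-only text node: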

for para in soup.find_all('p'):
    if para.strong is not None:
        for t in para.find_all(text=True):
            print(t.strip())
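
If the blank lines from whitespace-only text nodes are unwanted, one possible variant is Beautiful Soup's `stripped_strings` generator, which yields each string already stripped and leaves out the ones that are empty. This is a sketch against a shortened copy of the question's markup, not a drop-in replacement; the `collected` list is only there so the result can be inspected afterwards.

```python
from bs4 import BeautifulSoup

# Shortened copy of one paragraph from the question.
html = """<p>
<strong>OPEN:</strong><br />
-- The Abbey, Cambridge, MA<br />
-- Caffe Nero, Boston, MA<br />
</p>"""

soup = BeautifulSoup(html, "html.parser")

collected = []
for para in soup.find_all("p"):
    if para.strong is not None:
        # stripped_strings yields whitespace-stripped strings and omits empty ones
        for s in para.stripped_strings:
            collected.append(s)
            print(s)
```

The label ("OPEN:") still comes out as the first string, so slice it off with `[1:]` after converting to a list if only the restaurant lines are needed.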
