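Creating an array of keywords

Time: 2019-02-05 19:59:15

Tags: python beautifulsoup

I am trying to build a set of keywords from a CSV column that contains HTML. The data in the categories div of the CSV is incomplete for some rows.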

categories = []

def find_elms(soup, tag, attribute):
    """Find the block using its tag and attribute values"""
    categories_block = soup.find(tag, attribute)
    if categories_block:
        keywords = [elm.text for elm in categories_block.findAll('a')]
        return keywords
        #return [elm.text for elm in categories_block.findAll('a')]
    return []

def build_cats(categories):
    category = find_elms(soup, 'div', {'id': 'categories'})
    '''returns [x,y]'''
    for cat in category:
        categories.append(category)

build_cats(soup)

I have been changing the code to get a result like this:

[category1,...,category1000]

However, my result is either [[category1,..,category25],[category26,...,category50],... []] or a series of errors that leads down a dark rabbit hole.
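For reference, a nested result of this shape is what `list.append` gives when it is passed a whole list, which is what `categories.append(category)` does inside the loop above. This is one likely cause, not a confirmed diagnosis. A minimal sketch with made-up per-row labels (`rows`, `nested`, and `flat` are illustrative names, not part of the original code), contrasting it with `list.extend`:

```python
# Made-up per-row keyword lists, standing in for what find_elms returns for each row
rows = [['category1', 'category2'], [], ['category3']]

nested = []
flat = []
for keywords in rows:
    nested.append(keywords)  # adds the list itself as a single element
    flat.extend(keywords)    # adds each keyword individually

print(nested)  # [['category1', 'category2'], [], ['category3']]
print(flat)    # ['category1', 'category2', 'category3']
```

With `extend`, a row that has no categories contributes nothing rather than an empty inner list.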

The source data looks like this:

"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryB</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">A.jpg</a>
<br/></div>
, <div id="col1">
<a href="">B.jpg</a>
<br/></div>
, <div id="col1">
<a href="">C.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
</div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">D.jpg</a>
<br/></div>
, <div id="col1">
<a href="">E.jpg</a>
<br/></div>
, <div id="col1">
<a href="">F.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryC</a></li><li><a href="">CategoryD</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">G.jpg</a>
<br/></div>
, <div id="col1">
<a href="">H.jpg</a>
<br/></div>
, <div id="col1">
<a href="">I.jpg</a>
<br/></div>
"
"<div id="categories">
<h3>Categories</h3>
<ul>
<li><a href="">CategoryA</a></li><li><a href="">CategoryE</a></li>
</ul></div>
","<div id="col1"><h3>File</h3></div>, <div id="col1">
<a href="">J.jpg</a>
<br/></div>
, <div id="col1">
<a href="">K.jpg</a>
<br/></div>
, <div id="col1">
<a href="">L.jpg</a>
<br/></div>
"

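Any corrections or suggestions would be helpful. Thanks.

1 Answer:

Answer 0: (score: 0)

I pasted your source data into a text file and saved it as input.csv. I then ran the following lines of code and was able to create a list of all the categories in the sample source data: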

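from bs4 import BeautifulSoup

Categories = []

path = 'input.csv'
# Parse the whole file as HTML; the with block closes the file afterwards
with open(path) as html:
    bs = BeautifulSoup(html, 'html.parser')
# Every categories block in the file, including rows that list no categories
divs = bs.find_all('div', attrs={'id': 'categories'})

for d in divs:
    cats = d.find_all('a')
    for c in cats:
        cat_label = c.text
        # Keep only the first occurrence of each label
        if cat_label not in Categories:
            Categories.append(cat_label)

print(Categories)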

The code above produces the following list of all the categories in the source data:

['CategoryA', 'CategoryB', 'CategoryC', 'CategoryD', 'CategoryE']

Each category appears in the list only once, regardless of whether it occurs multiple times in the source data (e.g. CategoryA).
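The `if cat_label not in Categories` check rescans the list for every label, which is fine for a handful of categories but slows down as the list grows. A minimal sketch of an alternative, assuming Python 3.7+ (where dicts preserve insertion order): flatten the labels first, then deduplicate with `dict.fromkeys`. The `per_row` list is typed in by hand from the sample data in the question, and the variable names are illustrative only.

```python
# Labels per row, copied by hand from the sample data in the question;
# the second row has an empty categories block
per_row = [
    ['CategoryA', 'CategoryB'],
    [],
    ['CategoryC', 'CategoryD'],
    ['CategoryA', 'CategoryE'],
]

# Flatten the per-row lists into a single list of labels
all_labels = [label for row in per_row for label in row]

# dict.fromkeys keeps the first occurrence of each key, in order
unique_labels = list(dict.fromkeys(all_labels))

print(unique_labels)
# ['CategoryA', 'CategoryB', 'CategoryC', 'CategoryD', 'CategoryE']
```

A plain `set` would also remove duplicates, but it does not guarantee the order in which the labels first appeared.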