BeautifulSoup: beautifying an ugly process

Time: 2014-08-22 08:00:26

Tags: python html beautifulsoup

I am using BeautifulSoup to parse the following HTML:

<div id="cpv_codes">
   <span>
        79000000 - Business services: law, marketing, consulting, recruitment, printing and security
        <br/>
        79632000 - Personnel-training services
        <br/>
        80000000 - Education and training services
        <br/>
        80511000 - Staff training services
        <br/>
        80530000 - Vocational training services
    </span>
</div>

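I am trying to turn the contents into a list so that I can write it to a CSV file and normalize it later.

At the moment I am using a very ugly process to hammer the data into shape, and I would love to write something more elegant. I am sure that with better use of BS I could extract the data into a list in a single line. Can anyone help me clean up this code?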

categories = tender_soup.find('div',{"id":"cpv_codes"}).findNext('span')
categories = unicode(categories) # converts tag output to a string
categories = categories.split('<br/>') # converts string to an array
categories = [category.replace('<span>', '') for category in categories] # removes '<span>' from items
categories = [category.replace('</span>', '') for category in categories] # removes '</span>' from items
categories = filter(None, categories) # filters out any empty items in the array

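2 answers:

Answer 0: (score: 2)

The NavigableString class will help here. The text between the <br/> tags is stored as NavigableString children of the span, so you can keep those and skip the tags: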

from bs4 import NavigableString

span = tender_soup.find('div',{"id":"cpv_codes"}).findNext('span')
categories = [c.strip() for c in span.contents if isinstance(c, NavigableString)]

Now you have the list:

[u'79000000 - Business services: law, marketing, consulting, recruitment, printing and security',
 u'79632000 - Personnel-training services',
 u'80000000 - Education and training services',
 u'80511000 - Staff training services',
 u'80530000 - Vocational training services']
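
Since the goal in the question is a CSV file, here is a minimal sketch of the next step. It is an assumption on my part, not something from the question: it is written for Python 3, it treats the first " - " in each entry as the separator between code and description, and the helper name split_category and the column names are hypothetical.

```python
import csv
import io

# Sample entries in the shape produced by the list comprehension above.
categories = [
    "79000000 - Business services: law, marketing, consulting, recruitment, printing and security",
    "79632000 - Personnel-training services",
    "80000000 - Education and training services",
]


def split_category(entry):
    """Split 'code - description' at the first ' - ' separator (assumed format)."""
    code, _, description = entry.partition(" - ")
    return code.strip(), description.strip()


rows = [split_category(c) for c in categories]

# Write to an in-memory buffer here; for a real file, use something like
# open("cpv_codes.csv", "w", newline="") instead (file name is hypothetical).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["code", "description"])  # hypothetical column names
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Splitting on " - " with surrounding spaces, rather than on a bare hyphen, keeps descriptions such as "Personnel-training services" intact. The csv module quotes any description that contains commas, so the first entry stays in a single column.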

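Answer 1: (score: 0)

You may find a regular expression useful. The text of the span separates the entries with runs of whitespace (newlines plus indentation), so you can split on two or more whitespace characters and drop the empty strings: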

import re
categories = tender_soup.find('div',{"id":"cpv_codes"}).findNext('span')
categories = [itm for itm in re.split(r'\s{2,}', categories.text) if itm]

Given your data, categories would look like this:

[u'79000000 - Business services: law, marketing, consulting, recruitment, printing and security',
u'79632000 - Personnel-training services',
u'80000000 - Education and training services',
u'80511000 - Staff training services',
u'80530000 - Vocational training services']
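
One caveat with this pattern, shown as a small standard-library sketch: r'\s{2,}' also splits inside a description that happens to contain two consecutive spaces. The text variable below is a hand-written stand-in for categories.text (an assumption about its whitespace layout, based on the HTML in the question), and splitting on lines is offered only as one possible alternative.

```python
import re

# Hand-written stand-in for categories.text: entries separated by
# newlines and indentation, as in the HTML from the question.
text = (
    "\n        79632000 - Personnel-training services\n        "
    "\n        80000000 - Education and training services\n    "
)

# The approach from this answer: split on runs of 2+ whitespace characters.
items = [itm for itm in re.split(r"\s{2,}", text) if itm]

# Alternative: split on line boundaries only, then trim each line.
by_line = [line.strip() for line in text.splitlines() if line.strip()]

# A description with a double space is broken apart by the regex...
double_space = "80511000 - Staff  training services"
broken = re.split(r"\s{2,}", double_space)

# ...but survives the line-based split unchanged.
kept = [line.strip() for line in double_space.splitlines() if line.strip()]
```

For the data in the question both approaches give the same list, so the difference only matters if the source text is less tidy than the sample.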