Python: Splitting parsed items for a CSV file

Time: 2014-12-07 14:15:41

Tags: python regex csv split beautifulsoup

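On the advice of Jamie Bull and PM 2Ring, I am using the csv module to write the output of my web scraper. I'm almost finished, but I have a problem with some parsed items that are separated by a colon or a hyphen. I would like each of these items to be split into two items in the current list.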

Current output:

GB,16,19,255,1,26:40,19,13,4,2,6-12,0-1,255,57,4.5,80,21,3.8,175,23-33,4.9,3,14,1,4,38.3,8,65,1,0
SEA,36,25,398,1,33:20,25,8,13,4,4-11,1-1,398,66,6.0,207,37,5.6,191,19-28,6.6,1,0,0,2,33.0,4,69,2,1

Desired output (the differences are the fields that were split at a colon or hyphen):

GB,16,19,255,1,26,40,19,13,4,2,6,12,0,1,255,57,4.5,80,21,3.8,175,23,33,4.9,3,14,1,4,38.3,8,65,1,0
SEA,36,25,398,1,33,20,25,8,13,4,4,11,1,1,398,66,6,207,37,5.6,191,19,28,6.6,1,0,0,2,33,4,69,2,1

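I'm not sure where or how to make these changes. I also don't know whether a regular expression is needed. Obviously I could handle this in Notepad or Excel, but my goal is to do all of it in Python.

If you run the program, the results above come from week 1 of the 2014 season.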

import requests
import re
from bs4 import BeautifulSoup
import csv

year_entry = raw_input("Enter year: ")
week_entry = raw_input("Enter week number: ")

week_link = requests.get("http://sports.yahoo.com/nfl/scoreboard/?week=" + week_entry + "&phase=2&season=" + year_entry)
page_content = BeautifulSoup(week_link.content)
a_links = page_content.find_all('tr', {'class': 'game link'})

csvfile = open('NFL_2014.csv', 'a')
writer = csv.writer(csvfile)

for link in a_links:
        r = 'http://www.sports.yahoo.com' + str(link.attrs['data-url'])
        r_get = requests.get(r)
        soup = BeautifulSoup(r_get.content)
        stats = soup.find_all("td", {'class':'stat-value'})
        teams = soup.find_all("th", {'class':'stat-value'})
        scores = soup.find_all('dd', {"class": 'score'})

        try:
                away_game_stats = []
                home_game_stats = []
                statistic = []
                game_score = scores[-1]
                game_score = game_score.text
                x = game_score.split(" ")
                away_score = x[1]
                home_score = x[4]
                home_team = teams[1]
                away_team = teams[0]
                away_team_stats = stats[0::2]
                home_team_stats = stats[1::2]
                away_game_stats.append(away_team.text)
                away_game_stats.append(away_score)
                home_game_stats.append(home_team.text)
                home_game_stats.append(home_score)
                for stats in away_team_stats:
                        text = stats.text.strip("").encode('utf-8')
                        away_game_stats.append(text)


                writer.writerow(away_game_stats)

                for stats in home_team_stats:
                        text = stats.text.strip("").encode('utf-8')
                        home_game_stats.append(text)

                writer.writerow(home_game_stats)

        except:
                pass


csvfile.close()

Any help is greatly appreciated. This is my first program, and searching this board has been a great resource.

Thanks,

JT

2 Answers:

Answer 0: (score: 0)

import re
print re.sub(r"-|:",",",test_string)

See the demo:

https://regex101.com/r/aQ3zJ3/2
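
For illustration, here is a minimal sketch of how that substitution might be used, assuming Python 3 and a comma-joined line of text. The helper name `split_separators` and the sample `test_string` are hypothetical and only stand in for one row of the scraped output. Note that the pattern replaces every hyphen and colon, so a negative number or a hyphenated team name would also be affected.

```python
import re


def split_separators(line):
    """Replace every hyphen or colon in a comma-joined line with a comma."""
    return re.sub(r"-|:", ",", line)


# Hypothetical sample: the start of the GB row from the question.
test_string = "GB,16,19,255,1,26:40,19,13,4,2,6-12,0-1,255"
print(split_separators(test_string))
# GB,16,19,255,1,26,40,19,13,4,2,6,12,0,1,255
```

If the substitution were instead applied to each field before calling `writer.writerow`, the csv module would quote a field such as `26,40` because it contains the delimiter, so this approach fits best when the row is written out as a plain line of text.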

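Answer 1: (score: 0)

You can use a regular expression to split each string and then "flatten" the resulting list, which keeps the csv writer from grouping the split values together in quotes: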

Replace

writer.writerow(away_game_stats)

with

away_game_stats = [re.split(r"-|:",x) for x in away_game_stats]
writer.writerow([x for y in away_game_stats for x in y])

(and do the same for writer.writerow(home_game_stats))
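
The split-and-flatten idea can be sketched as a self-contained example, assuming Python 3 and plain `str` fields (the question's code encodes fields to UTF-8 bytes under Python 2, which would need a bytes pattern in Python 3). The helper name `flatten_row`, the shortened sample row, and the in-memory `io.StringIO` buffer standing in for the CSV file are all assumptions made for illustration.

```python
import csv
import io
import re


def flatten_row(row):
    """Split each field on '-' or ':' and flatten the pieces into one list."""
    return [part for field in row for part in re.split(r"-|:", field)]


# Hypothetical, shortened sample row in the shape of away_game_stats.
away_game_stats = ["GB", "16", "26:40", "6-12", "0-1", "4.5"]

buffer = io.StringIO()  # stands in for the open CSV file
writer = csv.writer(buffer)
writer.writerow(flatten_row(away_game_stats))

print(buffer.getvalue())
# GB,16,26,40,6,12,0,1,4.5
```

Because every piece becomes its own list element, no field contains a comma, so the writer has no reason to quote anything.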