Question

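I'm trying to use urllib to parse a text file from a website and pull in the data. I've been able to handle other files, which are text formatted in columns, but this one is throwing me off because the Southern Illinois-Edwardsville line pushes the second score and the location out of their columns.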

file = urllib.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

for line in file:
    game_month = line[0:1].rstrip()
    game_day   = line[2:4].rstrip()
    game_year  = line[5:9].rstrip()
    team1      = line[11:37].rstrip()
    team1_scr  = line[38:40].rstrip()
    team2      = line[42:68].rstrip()
    team2_scor = line[68:70].rstrip()
    extra_info = line[72:100].rstrip()

The Southern Illinois-Edwardsville line imports 'il' as team2_scor and '4 @Central Arkansas' as extra_info.

Answer 1

Want to see the best solution? http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch gives you a nice CSV file, no black magic needed.
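Once the data is comma-separated, the standard csv module can read it without any slicing. The following is a minimal sketch in Python 3 syntax. The sample line is made up to mirror the text row from the question; the real column order of the CSV output is not shown in this thread, so check it before relying on field positions.

```python
import csv
import io

# Hypothetical sample line: the actual CSV layout served by the site
# is an assumption here, not something confirmed in this thread.
sample_csv = (
    "2/18/2011,Central Arkansas,5,"
    "Southern Illinois-Edwardsville,4,@Central Arkansas\n"
)

# csv.reader accepts any iterable of text lines; StringIO stands in
# for the downloaded file.
rows = list(csv.reader(io.StringIO(sample_csv)))

for fields in rows:
    print(fields)
```

With a comma as the delimiter, the long team name stays in one field no matter how wide it is.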

Answer 2

Suppose s contains one line of the table. Then you can use the split() method of the re (regular expression) library:

import re
rexp = re.compile('  +')  # Match two or more spaces
cols = rexp.split(s)

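...and cols is now a list of strings, each one a column of the table row. This assumes that the table columns are separated by at least two spaces, and by nothing else. If that is not the case, the argument to re.compile() can be edited to allow for other configurations.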
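As a quick check, here is the same split applied to the problem line quoted elsewhere in this thread. This is only a sketch; the name COLUMN_GAP is chosen here for illustration.

```python
import re

# Two or more consecutive spaces mark a column boundary.
COLUMN_GAP = re.compile(r" {2,}")

s = ("2/18/2011  Central Arkansas           5  "
     "Southern Illinois-Edwardsville  4  @Central Arkansas")

# strip() removes the trailing newline a line read from a file would carry.
cols = COLUMN_GAP.split(s.strip())
print(cols)
```

The single spaces inside "Central Arkansas" and "Southern Illinois-Edwardsville" are left alone, so each team name comes back as one column.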

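Recall that Python treats a file as a sequence of lines, separated by newline characters. So all you have to do is run a for loop over the file, applying .split() to each line.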

For an even nicer solution, check out the built-in map() function and try using it instead of the for loop.
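The loop and the map() variant might look like the sketch below. It is written in Python 3 syntax, where map() returns a lazy iterator, hence the list() call. The helper name split_columns and the in-memory list standing in for the downloaded file are assumptions made for illustration.

```python
import re

COLUMN_GAP = re.compile(r" {2,}")

def split_columns(line):
    """Split one score line into columns on runs of two or more spaces."""
    return COLUMN_GAP.split(line.strip())

# Stand-in for the file object: any iterable of lines works the same way.
lines = [
    "2/18/2011  Central Arkansas           5  "
    "Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  "
    "Siena                      1  @Central Florida\n",
]

# Explicit for loop.
rows_loop = []
for line in lines:
    rows_loop.append(split_columns(line))

# The same thing with map().
rows_map = list(map(split_columns, lines))

print(rows_map[1])
```

Both approaches produce the same list of rows; map() simply avoids the explicit accumulator.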

Answer 3

You want something like this:

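def get_row(row):
    # Split on any whitespace; the two integer tokens are the scores.
    row = row.split()
    num_pos = []
    for i in range(len(row)):
        try:
            int(row[i])
            num_pos.append(i)
        except ValueError:
            pass  # not a number: part of a team name or the location
    assert len(num_pos) == 2
    ans = []
    ans.append(row[0])                                   # date
    ans.append("".join(row[1:num_pos[0]]))               # team 1
    ans.append(int(row[num_pos[0]]))                     # team 1 score
    ans.append("".join(row[num_pos[0]+1:num_pos[1]]))    # team 2
    ans.append(int(row[num_pos[1]]))                     # team 2 score
    ans.append("".join(row[num_pos[1]+1:]))              # location
    return ans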


row1="2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas"
row2="2/18/2011  Central Florida           11  Siena                      1  @Central Florida"

print get_row(row1)
print get_row(row2)

Output:

['2/18/2011', 'CentralArkansas', 5, 'SouthernIllinois-Edwardsville', 4, '@CentralArkansas']
['2/18/2011', 'CentralFlorida', 11, 'Siena', 1, '@CentralFlorida']
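Note that joining with an empty string glues the words of each name together ('CentralArkansas'). If the spaces should be kept, a variant along these lines may help. It is a sketch rather than the answerer's code: the helper is_int and the ValueError in place of the assert are choices made here, and it still assumes that no team name contains a standalone number.

```python
def is_int(token):
    """Return True if the token parses as an integer."""
    try:
        int(token)
        return True
    except ValueError:
        return False

def get_row(line):
    """Split a score line into [date, team1, score1, team2, score2, location]."""
    tokens = line.split()
    # The two integer tokens are the scores; everything between them is a name.
    score_pos = [i for i, tok in enumerate(tokens) if is_int(tok)]
    if len(score_pos) != 2:
        raise ValueError("expected exactly two scores in: %r" % line)
    first, second = score_pos
    return [
        tokens[0],
        " ".join(tokens[1:first]),            # team 1, spaces preserved
        int(tokens[first]),
        " ".join(tokens[first + 1:second]),   # team 2, spaces preserved
        int(tokens[second]),
        " ".join(tokens[second + 1:]),        # location
    ]

row1 = ("2/18/2011  Central Arkansas           5  "
        "Southern Illinois-Edwardsville  4  @Central Arkansas")
print(get_row(row1))
```

Because the tokens are rejoined with a single space, any run of spaces inside a name collapses to one, which is usually what you want for team names.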

Answer 4

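Clearly, you just need to split on runs of multiple spaces. Unfortunately, the csv module only allows a single-character delimiter, but re.sub can help. I would recommend something like this: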

import urllib2
import csv
import re

u = urllib2.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

reader = csv.DictReader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t', fieldnames=('date', 'team1', 'team1_score', 'team2', 'team2_score', 'extra_info'))

for i, row in enumerate(reader):
    if i == 5: break  # Only do five (otherwise you don't need ``enumerate()``)
    print row

This produces results like the following:

{'team1': 'Air Force', 'team2': 'Missouri State', 'date': '2/18/2011', 'team2_score': '2', 'team1_score': '7', 'extra_info': '@neutral'}
{'team1': 'Akron', 'team2': 'Lamar', 'date': '2/18/2011', 'team2_score': '1', 'team1_score': '2', 'extra_info': '@neutral'}
{'team1': 'Alabama', 'team2': 'Alcorn State', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '11', 'extra_info': '@Alabama'}
{'team1': 'Alabama State', 'team2': 'Tuskegee', 'date': '2/18/2011', 'team2_score': '5', 'team1_score': '9', 'extra_info': '@Alabama State'}
{'team1': 'Appalachian State', 'team2': 'Maryland-Eastern Shore', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '4', 'extra_info': '@Appalachian State'}

Or, if you prefer, just use csv.reader and get lists instead of dicts:

reader = csv.reader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t')

print reader.next()
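The code above is Python 2. For anyone on Python 3, the same idea can be sketched without the network call as shown below. The function name read_scores and the in-memory sample lines are assumptions for illustration. Two differences are worth knowing: lines read from urllib.request.urlopen() are bytes and would need decoding before being passed in, and reader.next() becomes next(reader).

```python
import csv
import re

FIELDNAMES = ("date", "team1", "team1_score",
              "team2", "team2_score", "extra_info")

def read_scores(lines):
    """Return a DictReader over text lines whose columns are 2+ spaces apart."""
    # Collapse each run of two or more spaces into a single tab so that
    # csv can treat the tab as its one-character delimiter.
    tabbed = (re.sub(r" {2,}", "\t", line.rstrip("\r\n")) for line in lines)
    return csv.DictReader(tabbed, delimiter="\t", fieldnames=FIELDNAMES)

# Stand-in for the decoded lines of the downloaded file.
sample = [
    "2/18/2011  Central Arkansas           5  "
    "Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  "
    "Siena                      1  @Central Florida\n",
]

rows = list(read_scores(sample))
for row in rows:
    print(row["team1"], row["team1_score"], row["team2"], row["team2_score"])
```

The scores come back as strings, as in the output above; convert them with int() if you need numbers.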

Using urllib to import a formatted text file with a line that runs outside its columns
