Getting data from a URL and putting it into a DataFrame

Date: 2019-07-10 07:19:54

Tags: python python-3.x pandas web-scraping

Hi everyone, I am currently trying to fetch some data from URLs and then predict which category the article should belong to. This is what I have so far, but there is an error:

    import re

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i in info.index:
        response = requests.get(info.iloc[i,0])
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>','',
                      str(soup.findAll(['p','h1','\href="/avtorji/'])))])
        category.append(info.iloc[0,i])

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

The error is:

    IndexError: single positional indexer is out-of-bounds

Can someone help me with this?

2 Answers:

Answer 0 (score: 1)

You can avoid the iloc calls and use iterrows instead, and I think you would have to use loc rather than iloc because you are operating on the index, but looping with loc and iloc is usually not efficient anyway. You can try the following code (a wait time is inserted between requests):

    import re
    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)
    html, category = [], []
    for i, row in info.iterrows():
        url = row.iloc[0]
        time.sleep(2.5)  # wait 2.5 seconds between requests
        response = requests.get(url)  # row[column_name] would work here too; iloc is used because the column names are unknown
        soup = BeautifulSoup(response.text, 'html.parser')
        html.append([re.sub(r'<.*?>', '',
                      str(soup.findAll(['p', 'h1', '\href="/avtorji/'])))])
        # the following line probably raised the error, because it accessed the ith column in the first row of the df:
        # category.append(info.iloc[0,i])
        category.append(row.iloc[0])  # not sure which field you wanted here; you could also use row['name'] with a real column name

    data = pd.DataFrame()
    data['html'] = html
    data['category'] = category

If you really only need the URL inside the loop, replace:

    for i, row in info.iterrows():
        url = row.iloc[0]

with something like:

    for url in info[0]:
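
As a side note (this is not in the original answer), the three assignments at the end can be collapsed into a single constructor call:

    data = pd.DataFrame({'html': html, 'category': category})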

Answer 1 (score: 1)

The error is most likely caused by passing index values to iloc: loc expects index values and column names, whereas iloc expects the numeric positions of rows and columns. Besides, you have swapped the row and column positions in category.append(info.iloc[0,i]). So you should at least do:

    for i in range(len(info)):
        response = requests.get(info.iloc[i,0])
        ...
        category.append(info.iloc[i,0])
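
To make the loc versus iloc distinction concrete, here is a minimal sketch with a hypothetical toy frame (not from the question) whose index labels are not simply 0..n-1:

    import pandas as pd

    df = pd.DataFrame({'url': ['u1', 'u2'], 'label': ['a', 'b']}, index=[10, 20])

    df.iloc[0, 0]      # 'u1': numeric positions (first row, first column)
    df.loc[10, 'url']  # 'u1': index label 10 and column name 'url'
    # df.iloc[10, 0]   # would raise IndexError: single positional indexer is out-of-bounds

This is exactly the failure mode in the question: info.iloc[0,i] walks i across column positions, and the frame has far fewer columns than rows.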

But the above code is not Pythonic when all you are trying to do is iterate over the first column of the dataframe. It is better to use the column directly:

    for url in info.loc[:, 0]:
        response = requests.get(url)
        ...
        category.append(url)
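
Putting both answers together, a complete sketch might look as follows. It assumes (the question never shows the file) that column 0 of labeled_urls.tsv holds the URL and column 1 the category label, and it swaps the regex tag-stripping for BeautifulSoup's get_text:

    import time

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    info = pd.read_csv('labeled_urls.tsv', sep='\t', header=None)

    html, category = [], []
    # assumption: column 0 = URL, column 1 = category label
    for url, cat in zip(info[0], info[1]):
        time.sleep(2.5)  # be polite to the server
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        # keep only the visible text of the p and h1 tags
        html.append(' '.join(tag.get_text() for tag in soup.find_all(['p', 'h1'])))
        category.append(cat)

    data = pd.DataFrame({'html': html, 'category': category})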