Pandas - dealing with empty cells

Time: 2017-09-09 14:40:49

Tags: python pandas beautifulsoup

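I am having a hard time using BeautifulSoup to scrape football players' details into a workable pandas table.

The problem is that some of the data I scrape is "extra" and fills my table with garbage. For example: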

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0"}

page = requests.get('https://www.transfermarkt.co.uk/manchester-united/startseite/verein/985', headers=HEADERS)
soup = BeautifulSoup(page.content, 'html.parser')

playerdata = soup.find_all(class_='posrela')
names = [';'.join(pt.findAll(text=True)) for pt in playerdata]

df = pd.DataFrame(names)
df = pd.DataFrame([sub.split(";") for sub in names])

print(df.replace('^$', np.nan, regex=True))

Result:

 python testing5.py
                     0               1                   2                   3
0         David de Gea       D. de Gea              Keeper                None
1        Sergio Romero       S. Romero              Keeper                None
2         Joel Pereira      J. Pereira              Keeper                None
3          Eric Bailly       E. Bailly                             Centre-Back
4      Victor Lindelöf     V. Lindelöf         Centre-Back                None
5          Marcos Rojo         M. Rojo                             Centre-Back
6       Chris Smalling     C. Smalling         Centre-Back                None
7           Phil Jones        P. Jones                             Centre-Back
8          Daley Blind        D. Blind           Left-Back                None
9            Luke Shaw       Luke Shaw           Left-Back                None
10      Matteo Darmian      M. Darmian          Right-Back                None
11    Antonio Valencia     A. Valencia          Right-Back                None
12       Nemanja Matic        N. Matic  Defensive Midfield                None
13     Michael Carrick      M. Carrick                      Defensive Midfield
14          Paul Pogba        P. Pogba    Central Midfield                None
15       Ander Herrera      A. Herrera    Central Midfield                None
16   Marouane Fellaini     M. Fellaini    Central Midfield                None
17        Ashley Young        A. Young       Left Midfield                None
18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield                None
19           Juan Mata       Juan Mata  Attacking Midfield                None
20       Jesse Lingard      J. Lingard           Left Wing                None
21       Romelu Lukaku       R. Lukaku      Centre-Forward                None
22     Anthony Martial      A. Martial                   .      Centre-Forward
23     Marcus Rashford     M. Rashford      Centre-Forward                None
24  Zlatan Ibrahimovic  Z. Ibrahimovic                          Centre-Forward

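As you can see, where I have stripped out the empty data, the values have been pushed into the wrong cell. You may ask why I have a 4th column: I will be inserting more data there, but for now I need to clean up the 3rd column.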

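As you can also see, my first attempt was a regular expression to replace the blank cells with NaN. But whatever I try, I cannot seem to "select" the empty cells. I just can't touch them!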
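A possible reason the replace has no effect: the pattern '^$' only matches a string that is exactly empty, so a cell holding one or more spaces is left alone. The sketch below assumes the blank cells contain ordinary whitespace (the printed table cannot confirm which characters they hold) and uses a made-up three-row column to compare '^$' with '^\s*$'.

```python
import numpy as np
import pandas as pd

# Toy column (assumed data): a real value, an empty string, and two spaces.
df = pd.DataFrame({2: ["Keeper", "", "  "]})

# '^$' matches only the exactly-empty string.
only_empty = df.replace(r"^$", np.nan, regex=True)

# '^\s*$' also matches strings made up entirely of whitespace.
any_blank = df.replace(r"^\s*$", np.nan, regex=True)

print(only_empty[2].isna().tolist())
print(any_blank[2].isna().tolist())
```

If the real cells hold something else, such as a non-breaking space, the pattern may need adjusting.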

When I try to treat 'names' like a list, the interpreter tells me it is not a list but a ResultSet!

I wonder if anyone can help. As a programming noob I have made a lot of progress, but I have hit a wall.

2 answers:

Answer 0: (score: 2)

You can use post-processing: copy the non-NaN values from column 3 into column 2, using loc with notnull:

# where column 3 holds a value, copy it into column 2
df.loc[df[3].notnull(), 2] = df[3]
# remove column 3
df = df.drop(3, axis=1)

Another solution uses mask:

df[2] = df[2].mask(df[3].notnull(), df[3])
df = df.drop(3, axis=1)

Or, very similarly, with numpy.where:

df[2] = np.where(df[3].notnull(), df[3], df[2])
df = df.drop(3, axis=1)
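To try the three variants without scraping anything, here is a minimal sketch on a hand-built frame. The four rows are copied from the question's output, and the single space in the blank cells is an assumption about what the scraper actually returned.

```python
import numpy as np
import pandas as pd

# Toy rows: column 2 is blank (assumed to be a space) where column 3 is filled.
rows = [
    ["David de Gea", "D. de Gea", "Keeper", None],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
    ["Marcos Rojo", "M. Rojo", " ", "Centre-Back"],
    ["Daley Blind", "D. Blind", "Left-Back", None],
]
raw = pd.DataFrame(rows)

# Variant 1: boolean row selection with loc
a = raw.copy()
a.loc[a[3].notnull(), 2] = a[3]
a = a.drop(3, axis=1)

# Variant 2: Series.mask replaces values where the condition is True
b = raw.copy()
b[2] = b[2].mask(b[3].notnull(), b[3])
b = b.drop(3, axis=1)

# Variant 3: numpy.where picks from column 3 or column 2 element by element
c = raw.copy()
c[2] = np.where(c[3].notnull(), c[3], c[2])
c = c.drop(3, axis=1)

print(a)
```

All three leave the same values in column 2, so the choice between them is mostly a matter of taste.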

I tried to improve your solution a little:

playerdata = soup.find_all(class_='posrela')
names = [list(pt.findAll(text=True)) for pt in playerdata]
df = pd.DataFrame(names)
df.loc[df[3].notnull(), 2] = df[3]
df = df.drop(3, axis=1)
print(df)

                     0               1                   2
0         David de Gea       D. de Gea              Keeper
1        Sergio Romero       S. Romero              Keeper
2         Joel Pereira      J. Pereira              Keeper
3          Eric Bailly       E. Bailly         Centre-Back
4      Victor Lindelöf     V. Lindelöf         Centre-Back
5          Marcos Rojo         M. Rojo         Centre-Back
6       Chris Smalling     C. Smalling         Centre-Back
7           Phil Jones        P. Jones         Centre-Back
8          Daley Blind        D. Blind           Left-Back
9            Luke Shaw       Luke Shaw           Left-Back
10      Matteo Darmian      M. Darmian          Right-Back
11    Antonio Valencia     A. Valencia          Right-Back
12       Nemanja Matic        N. Matic  Defensive Midfield
13     Michael Carrick      M. Carrick  Defensive Midfield
14          Paul Pogba        P. Pogba    Central Midfield
15       Ander Herrera      A. Herrera    Central Midfield
16   Marouane Fellaini     M. Fellaini    Central Midfield
17        Ashley Young        A. Young       Left Midfield
18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield
19           Juan Mata       Juan Mata  Attacking Midfield
20       Jesse Lingard      J. Lingard           Left Wing
21       Romelu Lukaku       R. Lukaku      Centre-Forward
22     Anthony Martial      A. Martial      Centre-Forward
23     Marcus Rashford     M. Rashford      Centre-Forward
24  Zlatan Ibrahimovic  Z. Ibrahimovic      Centre-Forward

Another solution:

playerdata = soup.find_all(class_='posrela')

names = []
for pt in playerdata:
    L = list(pt.findAll(text=True))
    # check the length of the list
    if len(L) == 4:
        # move the 4th value into the 3rd position
        L[2] = L[3]
    # append the first 3 values of the list
    names.append(L[:3])

df = pd.DataFrame(names)
print(df)
                     0               1                   2
0         David de Gea       D. de Gea              Keeper
1        Sergio Romero       S. Romero              Keeper
2         Joel Pereira      J. Pereira              Keeper
3          Eric Bailly       E. Bailly         Centre-Back
4      Victor Lindelöf     V. Lindelöf         Centre-Back
5          Marcos Rojo         M. Rojo         Centre-Back
6       Chris Smalling     C. Smalling         Centre-Back
7           Phil Jones        P. Jones         Centre-Back
8          Daley Blind        D. Blind           Left-Back
9            Luke Shaw       Luke Shaw           Left-Back
10      Matteo Darmian      M. Darmian          Right-Back
11    Antonio Valencia     A. Valencia          Right-Back
12       Nemanja Matic        N. Matic  Defensive Midfield
13     Michael Carrick      M. Carrick  Defensive Midfield
14          Paul Pogba        P. Pogba    Central Midfield
15       Ander Herrera      A. Herrera    Central Midfield
16   Marouane Fellaini     M. Fellaini    Central Midfield
17        Ashley Young        A. Young       Left Midfield
18  Henrikh Mkhitaryan   H. Mkhitaryan  Attacking Midfield
19           Juan Mata       Juan Mata  Attacking Midfield
20       Jesse Lingard      J. Lingard           Left Wing
21       Romelu Lukaku       R. Lukaku      Centre-Forward
22     Anthony Martial      A. Martial      Centre-Forward
23     Marcus Rashford     M. Rashford      Centre-Forward
24  Zlatan Ibrahimovic  Z. Ibrahimovic      Centre-Forward
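The loop above relies on each player having exactly three or four text nodes. A slightly more defensive sketch follows, assuming the position is always the last non-blank text node and the two names come first. That is an assumption about the page, and normalise_row is a hypothetical helper name, not part of any library.

```python
def normalise_row(texts):
    """Return [name, short name, position] from one player's text nodes.

    Hypothetical helper: assumes the first two non-blank entries are the
    names and the last non-blank entry is the position.
    """
    cleaned = [t.strip() for t in texts if t is not None]
    non_blank = [t for t in cleaned if t]
    if len(non_blank) < 3:
        raise ValueError(f"expected at least 3 non-blank values, got {texts!r}")
    return [non_blank[0], non_blank[1], non_blank[-1]]


# Two toy rows shaped like the scraped data (the blank cell is assumed to be a space).
rows = [
    ["David de Gea", "D. de Gea", "Keeper"],
    ["Eric Bailly", "E. Bailly", " ", "Centre-Back"],
]
names = [normalise_row(r) for r in rows]
print(names)
```

The resulting list of lists can be passed to pd.DataFrame(names) exactly as in the loop version.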

Answer 1: (score: 1)

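If you are going to extract more data, I suggest extracting everything in an order that fits easily into a DataFrame. Unless the data is extracted in the right shape to begin with, you will have to keep running unnecessary clean-up operations.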

playerdata = soup.find_all(class_='inline-table')

# one row per player: full name, short name, position
# get_text() returns str; renderContents() returns bytes on Python 3
names = [[x.find('img')['title'],
         x.find_all(class_='spielprofil_tooltip')[-1].get_text(strip=True),
         x.find_all('tr')[-1].find('td').get_text(strip=True)] for x in playerdata]

df = pd.DataFrame(names, columns=['Name', 'Short', 'Position'])


                  Name            Short            Position
0         David de Gea        D. de Gea              Keeper
1        Sergio Romero        S. Romero              Keeper
2         Joel Pereira       J. Pereira              Keeper
3          Eric Bailly        E. Bailly         Centre-Back
4      Victor Lindelöf      V. Lindelöf         Centre-Back
5          Marcos Rojo          M. Rojo         Centre-Back
6       Chris Smalling      C. Smalling         Centre-Back
7           Phil Jones         P. Jones         Centre-Back
8          Daley Blind         D. Blind           Left-Back
9            Luke Shaw        Luke Shaw           Left-Back
10      Matteo Darmian       M. Darmian          Right-Back
11    Antonio Valencia      A. Valencia          Right-Back
12       Nemanja Matic         N. Matic  Defensive Midfield
13     Michael Carrick       M. Carrick  Defensive Midfield
14          Paul Pogba         P. Pogba    Central Midfield
15       Ander Herrera       A. Herrera    Central Midfield
16   Marouane Fellaini      M. Fellaini    Central Midfield
17        Ashley Young         A. Young       Left Midfield
18  Henrikh Mkhitaryan    H. Mkhitaryan  Attacking Midfield
19           Juan Mata        Juan Mata  Attacking Midfield
20       Jesse Lingard       J. Lingard           Left Wing
21       Romelu Lukaku        R. Lukaku      Centre-Forward
22     Anthony Martial       A. Martial      Centre-Forward
23     Marcus Rashford      M. Rashford      Centre-Forward
24  Zlatan Ibrahimovic   Z. Ibrahimovic      Centre-Forward
25       Romelu Lukaku    Romelu Lukaku      Centre-Forward
26          Paul Pogba       Paul Pogba    Central Midfield
27     Anthony Martial  Anthony Martial      Centre-Forward
28     Marcus Rashford  Marcus Rashford      Centre-Forward
29         Eric Bailly      Eric Bailly         Centre-Back
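The table above has 30 rows: rows 25 to 29 repeat players already listed, with the full name in the Short column, presumably because other 'inline-table' elements on the page match the same class. One hedged way to discard them afterwards, assuming the first occurrence of each name is the squad-list row, is drop_duplicates. The four toy rows below are taken from the output above.

```python
import pandas as pd

# Toy frame: two squad rows followed by their repeated counterparts.
df = pd.DataFrame(
    [
        ["Romelu Lukaku", "R. Lukaku", "Centre-Forward"],
        ["Paul Pogba", "P. Pogba", "Central Midfield"],
        ["Romelu Lukaku", "Romelu Lukaku", "Centre-Forward"],
        ["Paul Pogba", "Paul Pogba", "Central Midfield"],
    ],
    columns=["Name", "Short", "Position"],
)

# Keep the first row seen for each player name and renumber the index.
deduped = df.drop_duplicates(subset="Name", keep="first").reset_index(drop=True)
print(deduped)
```

Restricting the initial find_all to the squad table would avoid the extra rows in the first place, but that depends on page markup not shown here.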