Reading and merging data from multiple CSV files

Asked: 2013-07-08 13:15:02

Tags: python sorting csv pandas

I have 3 different files: NewRush4.csv, NewRush5.csv, and NewRush6.csv. I am trying to compile an "all-time leaders" list across the seasons (4, 5, and 6).

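I want to read every player's name in each file and, if a name appears more than once, combine those entries. Alternatively, I could read the first file and compare it against the other two in order to combine them.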

Here is my Python code. So far I have only managed to read the first file, and I am not sure how to use DictReader for the rest.

#!/usr/bin/python
import csv
file = open("NewRush4.csv", "rb")

for line in csv.DictReader(file, delimiter=","):
    name = line["Player"].strip()
    yds = line["YDS"].strip()
    car = line["CAR"].strip()
    td = line["TD"].strip()
    fum = line["FUM"].strip()
    ypc = line["YPC"].strip()

    print "%-20s%10s%10s%10s%10s%10s" % (name, car, yds, td, fum, ypc)
file.close()

Output:

49erswag                     3      14.0         0         0       4.7  
A Beast Playa                7      23.0         0         0       3.3  
A Swanky Guy 2              29     154.0         1         2       5.3  
ACIDRUST                     1       4.0         0         0       4.0  
Aj dahitman                227    1898.0        19         2       8.4  
Aldizzl                     10      45.0         0         0       4.5  
Areis21                     13      58.0         0         2       4.5  
at43                        48     214.0         1         1       4.5  
Ayala2012xTCU               57     195.0         0         1       3.4  
B O R Nx 25                 13      31.0         0         1       2.4  
B r e e z yx60               4      13.0         0         0       3.3  
Beardown74                 116     621.0         6         3       5.4  
beatdown54 2010             26     126.0         3         1       4.8  
behe SWAG                    1      -5.0         0         0      -5.0  
Big Murph22                 73     480.0         6         2       6.6  
BigBlack973                 18      57.0         0         1       3.2  
BiGDaDDyNaPSacK            184    1181.0        20         4       6.4  

Season 4 file:

Player,YDS,TD,CAR,FUM,YPC  
49erswag,   14.0,   0,   3,   0,   4.7  
A Beast Playa,   23.0,   0,   7,   0,   3.3  
A Swanky Guy 2,   154.0,   1,   29,   2,   5.3  
ACIDRUST,   4.0,   0,   1,   0,   4.0  
Aj dahitman,   1898.0,   19,   227,   2,   8.4  
Aldizzl,   45.0,   0,   10,   0,   4.5  
Areis21,   58.0,   0,   13,   2,   4.5  
at43,   214.0,   1,   48,   1,   4.5  
Ayala2012xTCU,   195.0,   0,   57,   1,   3.4  
B O R Nx 25,   31.0,   0,   13,   1,   2.4  
B r e e z yx60,   13.0,   0,   4,   0,   3.3  
...  

Season 5 file:

Player,YDS,TD,CAR,FUM,YPC  
a toxic taz,   307.0,   4,   44,   0,   7.0  
AbNL Boss,   509.0,   4,   174,   2,   2.9  
AFFISHAUL,   190.0,   0,   35,   2,   5.4  
AJ DA HITMAN,   1283.0,   19,   228,   6,   5.6  
allen5422,   112.0,   2,   18,   0,   6.2  
Allxdayxapx,   264.0,   1,   76,   2,   3.5  
AlpHaaNike,   51.0,   1,   10,   1,   5.1  
Aura Reflexx,   215.0,   1,   40,   0,   5.4  
AWAKEN DA BEAST,   -5.0,   0,   4,   1,   -1.3  
AxDub24,   -3.0,   0,   2,   1,   -1.5  
Ayala2012xTCU,   568.0,   4,   173,   1,   3.3  
BALLxXHAWKXx,   221.0,   1,   47,   2,   4.7   
BANG FIGHTY007,   983.0,   6,   171,   3,   5.7  
bang z ro,   29.0,   0,   9,   0,   3.2  
BEARDOWN74,   567.0,   6,   104,   2,   5.5  
...  

So, if a player appears in more than one season, add his stats together and print the totals. Otherwise, just print his stats as they are.

2 answers:

Answer 0: (score: 0)

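You could try Python pandas; it looks like the tool you need. For the reading part you can use read_csv, then create three DataFrames (or a single one containing all records) and manipulate them further.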

For duplicates, for example, you can try the duplicated function, e.g. df[ df.duplicated('Player') ]. You will also find many descriptive statistics functions, such as max, which you may need. Take a look.

To give you a taste (based on the Season 4 and Season 5 data from the original post):

import pandas as pd

if __name__ == '__main__':

    # reading in is very convenient here:
    df_4 = pd.read_csv('season4.csv')
    df_5 = pd.read_csv('season5.csv')
    # combine the two DataFrames into one:
    df   = pd.concat([df_4, df_5], ignore_index=True)
    # see how it looks:
    print df.head(50)

             Player   YDS  TD  CAR  FUM  YPC
0          49erswag    14   0    3    0  4.7
1     A Beast Playa    23   0    7    0  3.3
2    A Swanky Guy 2   154   1   29    2  5.3
3          ACIDRUST     4   0    1    0  4.0
4       Aj dahitman  1898  19  227    2  8.4
5           Aldizzl    45   0   10    0  4.5
6           Areis21    58   0   13    2  4.5
7              at43   214   1   48    1  4.5
8     Ayala2012xTCU   195   0   57    1  3.4
9       B O R Nx 25    31   0   13    1  2.4
10   B r e e z yx60    13   0    4    0  3.3
11      a toxic taz   307   4   44    0  7.0
12        AbNL Boss   509   4  174    2  2.9
13        AFFISHAUL   190   0   35    2  5.4
14     AJ DA HITMAN  1283  19  228    6  5.6
15        allen5422   112   2   18    0  6.2
16      Allxdayxapx   264   1   76    2  3.5
17       AlpHaaNike    51   1   10    1  5.1
18     Aura Reflexx   215   1   40    0  5.4
19  AWAKEN DA BEAST    -5   0    4    1 -1.3
20          AxDub24    -3   0    2    1 -1.5
21    Ayala2012xTCU   568   4  173    1  3.3
22     BALLxXHAWKXx   221   1   47    2  4.7
23   BANG FIGHTY007   983   6  171    3  5.7
24        bang z ro    29   0    9    0  3.2
25       BEARDOWN74   567   6  104    2  5.5 

    # check for duplicated entries in the 'Player' column:
    print df[ df.duplicated('Player') ]

           Player  YDS  TD  CAR  FUM  YPC
21  Ayala2012xTCU  568   4  173    1  3.3

    # find the maximum value in the 'YDS' column:
    print 'Max YDS:', df['YDS'].max()

Max YDS: 1898.0

Hope that helps.

Answer 1: (score: 0)

Use collections.defaultdict.

I don't know what each field means, so I simply sum every field. Adjust as needed.

from collections import defaultdict
import csv

class PlayerStat(object):
    def __init__(self, yds=0, car=0, td=0, fum=0, ypc=0, count=0):
        self.yds   = float(yds)
        self.car   = float(car)
        self.td    = float(td)
        self.fum   = float(fum)
        self.ypc   = float(ypc)
        self.count = count
    def __iadd__(self, other):
        self.yds   += other.yds
        self.car   += other.car
        self.td    += other.td
        self.fum   += other.fum
        self.ypc   += other.ypc
        self.count += other.count
        return self

filenames = 'NewRush4.csv', 'NewRush5.csv', 'NewRush6.csv',
stats = defaultdict(PlayerStat)
for filename in filenames:
    with open(filename) as f:
        reader = csv.DictReader(f, delimiter=',')
        for row in reader:
            stat = PlayerStat(row['YDS'], row['CAR'], row['TD'], row['FUM'], row['YPC'], count=1)
            stats[row['Player']] += stat

for player in sorted(stats, key=lambda player: stats[player].yds):
    stat = stats[player]
    if stat.count == 1:
        continue
    print '{0:<20}{1.car:>10}{1.yds:>10}{1.td:>10}{1.fum:>10}{1.ypc:>10}'.format(player, stat)
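
The code above is written for Python 2, sums YPC along with the other fields, and skips players who appear in only one season. As one possible adjustment, here is a Python 3 sketch of the same defaultdict idea that keeps every player and recomputes YPC as total yards divided by total carries. The SEASONS mapping, merge_seasons, and yards_per_carry are hypothetical names introduced for illustration, and the in-memory text stands in for the real files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory excerpts; with real files, use open(name, newline="") instead.
SEASONS = {
    "NewRush4.csv": "Player,YDS,TD,CAR,FUM,YPC\n"
                    "Ayala2012xTCU,   195.0,   0,   57,   1,   3.4\n"
                    "at43,   214.0,   1,   48,   1,   4.5\n",
    "NewRush5.csv": "Player,YDS,TD,CAR,FUM,YPC\n"
                    "Ayala2012xTCU,   568.0,   4,   173,   1,   3.3\n",
}

# Stats that can be added across seasons (YPC is an average, so it is excluded).
COUNTING_FIELDS = ("YDS", "TD", "CAR", "FUM")


def merge_seasons(sources):
    """Sum the counting stats per player across all given CSV texts."""
    totals = defaultdict(lambda: dict.fromkeys(COUNTING_FIELDS, 0.0))
    for text in sources:
        # skipinitialspace drops the blanks that follow each comma.
        reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
        for row in reader:
            player = totals[row["Player"].strip()]
            for field in COUNTING_FIELDS:
                player[field] += float(row[field])
    return totals


def yards_per_carry(stat):
    """Recompute YPC from the totals; 0.0 when there are no carries."""
    return stat["YDS"] / stat["CAR"] if stat["CAR"] else 0.0


totals = merge_seasons(SEASONS.values())
# Print every player, highest total yardage first.
for name in sorted(totals, key=lambda n: totals[n]["YDS"], reverse=True):
    s = totals[name]
    print("%-20s%10.0f%10.1f%10.0f%10.0f%10.1f"
          % (name, s["CAR"], s["YDS"], s["TD"], s["FUM"], yards_per_carry(s)))
```

Unlike the pandas approach, this matches names exactly, so "Aj dahitman" and "AJ DA HITMAN" would remain separate entries unless the key is normalized first, for example with .lower().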