如何阅读pandas中的非结构化csv

时间:2015-12-05 12:54:22

标签: python csv pandas dataframe

我有一个凌乱的csv文件(只是扩展名为csv)。但是,当我以ms {exinited ;开启此文件时,它看起来如下(虚拟样本) -

我调查了这个文件并找到了以下内容:

  1. 某些专栏有名称而其他专栏没有。
  2. 行的长度是可变的,但包含换行符char以触发下一行开始。
  3. 问题:

    如何在pandas中读取此表,而所有现有列(标题)保留,空白列填充连续数字,可以控制行的可变长度。

    事实上,我想一次又一次地获取8个单元格值,直到排出任何一行。从无标题列进行分析。

    N.B。我在usecols中尝试了namesskiprowssepread_csv等,但没有成功

    data

    修改

    添加了示例输入和预期输出(格式设置更糟,但pandas.read_clipboard()应该有效)

    INPUT

    car_id   car_type    entry_gate  entry_time(ms)  exit_gate   exit_time(ms)   traveled_dist(m)    avg_speed(m/s)  trajectory(x[m]    y[m]    speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2] timestamp[ms]   )                                           
    24   Bus    25  4300    26  48520   118.47  2.678999    509552.78   5039855.59  10.074  0.429   0.2012  0   509552.97   5039855.57  10.0821 0.3853  0.2183  20                      
    25   Car    25  20  26  45900   113.91  2.482746    509583.7    5039848.78  4.5344  -0.1649 0.2398  0   509583.77   5039848.71                                      
    26   Car     -   -   -   -  109.68  8.859805    509572.75   5039862.75  4.0734  -0.7164 -0.1066 0   509572.67   5039862.76  4.0593  -0.7021 -0.1141 20  509553.17   5039855.55  10.0886 0.2636  0.2356  40
    27   Car     -   -   -   -  119.84  3.075936    509582.73   5039862.78  1.191   0.5247  0.0005  0   509582.71   5039862.78  1.2015  0.5322                              
    28   Car     -   -   -   -  129.64  4.347466    509591.07   5039862.9   1.6473  0.1987  -0.0033 0   509591.04   5039862.89  1.6513  0.2015  -0.0036 20  
    

    预期的输出(数据帧)

    car_id   car_type    entry_gate  entry_time(ms)  exit_gate   exit_time(ms)   traveled_dist(m)    avg_speed(m/s)  trajectory(x[m]    y[m]    speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2] timestamp[ms]   1   2   3   4   5   6   7   8   9   10  11  12
    24   Bus    25  4300    26  48520   118.47  2.678999    509552.78   5039855.59  10.074  0.429   0.2012  0   509552.97   5039855.57  10.0821 0.3853  0.2183  20                      
    25   Car    25  20  26  45900   113.91  2.482746    509583.7    5039848.78  4.5344  -0.1649 0.2398  0   509583.77   5039848.71                                      
    26   Car     -   -   -   -  109.68  8.859805    509572.75   5039862.75  4.0734  -0.7164 -0.1066 0   509572.67   5039862.76  4.0593  -0.7021 -0.1141 20  509553.17   5039855.55  10.0886 0.2636  0.2356  40
    27   Car     -   -   -   -  119.84  3.075936    509582.73   5039862.78  1.191   0.5247  0.0005  0   509582.71   5039862.78  1.2015  0.5322                              
    28   Car     -   -   -   -  129.64  4.347466    509591.07   5039862.9   1.6473  0.1987  -0.0033 0   509591.04   5039862.89  1.6513  0.2015  -0.0036 20      
    

1 个答案:

答案 0 :(得分:2)

<强>预处理

函数get_names()打开文件,检查分割行的最大长度。 然后我读取第一行并从最大长度添加缺失值。

第一行的最后一个值为),因此我将其firstline[:-1]删除,然后添加 按+1 rng = range(1, m - lenfirstline + 2)排列缺少的列。 +2是因为范围从值1开始。

然后你可以使用函数read_csv,skipp第一行和名称使用get_names()的输出。

import pandas as pd
import csv

#preprocessing
def get_names():
    with open('test/file.txt', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            if i ==0:
                firstline = ''.join(row).split()
                lenfirstline = len(firstline)
                #print firstline, lenfirstline
            num.append(len(''.join(row).split()))
        m = max(num)
        rng = range(1, m - lenfirstline + 2)
        #remove )
        rng = firstline[:-1] + rng
        return rng

#names is list return from function
df = pd.read_csv('test/file.txt', sep="\s+", names=get_names(), index_col=[0], skiprows=1)
 
#temporaly display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
    print df

       car_type entry_gate entry_time(ms) exit_gate exit_time(ms)  \
car_id                                                              
24          Bus         25           4300        26         48520   
25          Car         25             20        26         45900   
26          Car          -              -         -             -   
27          Car          -              -         -             -   
28          Car          -              -         -             -   

        traveled_dist(m)  avg_speed(m/s)  trajectory(x[m]        y[m]  \
car_id                                                                  
24                118.47        2.678999        509552.78  5039855.59   
25                113.91        2.482746        509583.70  5039848.78   
26                109.68        8.859805        509572.75  5039862.75   
27                119.84        3.075936        509582.73  5039862.78   
28                129.64        4.347466        509591.07  5039862.90   

        speed[m/s]  a_tangential[ms-2]  a_lateral[ms-2]  timestamp[ms]  \
car_id                                                                   
24         10.0740              0.4290           0.2012              0   
25          4.5344             -0.1649           0.2398              0   
26          4.0734             -0.7164          -0.1066              0   
27          1.1910              0.5247           0.0005              0   
28          1.6473              0.1987          -0.0033              0   

                1           2        3       4       5   6          7  \
car_id                                                                  
24      509552.97  5039855.57  10.0821  0.3853  0.2183  20        NaN   
25      509583.77  5039848.71      NaN     NaN     NaN NaN        NaN   
26      509572.67  5039862.76   4.0593 -0.7021 -0.1141  20  509553.17   
27      509582.71  5039862.78   1.2015  0.5322     NaN NaN        NaN   
28      509591.04  5039862.89   1.6513  0.2015 -0.0036  20        NaN   

                 8        9      10      11  12  
car_id                                           
24             NaN      NaN     NaN     NaN NaN  
25             NaN      NaN     NaN     NaN NaN  
26      5039855.55  10.0886  0.2636  0.2356  40  
27             NaN      NaN     NaN     NaN NaN  
28             NaN      NaN     NaN     NaN NaN  

<强>后处理

首先,您必须估算最大列数N。我知道他们的真实数字是26,所以我估计为N = 30。参数read_csv的函数name = range(N)返回NaN列,列的估计长度和实际长度之间存在差异。

删除后,您可以选择包含列名称的第一行,其中不是NaN(我在)删除了最后一列[:-1]) - df1.loc[0].dropna()[:-1]。 然后,您可以在第一行中添加范围从1到NaN长度的新系列。 最后一行由df的子集删除。

#set more as estimated number of columns
N = 30

df1 = pd.read_csv('test/file.txt', sep="\s+", names=range(N))

df1 = df1.dropna(axis=1, how='all')  #drop columns with all NaN

df1.columns = df1.loc[0].dropna()[:-1].append(pd.Series(range(1, len(df1.columns) - len(df1.loc[0].dropna()[:-1]) + 1 )))

#remove first line with uncomplete column names
df1 = df1.ix[1:]
print df1.head()
相关问题