Pandas read_csv - rows with a variable number of columns

Asked: 2015-06-25 14:19:00

Tags: python csv pandas

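I have a CSV file whose rows contain a variable number of columns (and no header row). For example, the file may start with some rows of 23 columns, followed by some rows of 83 columns, and so on. When read_csv() starts reading the file, it guesses the number of columns after reading the first few rows (I think), so if the rows at the start are shorter than the later ones, I get the exception below. Is there a way to pass a parameter to the function to set the number of columns to some maximum? Or is there a better way?

Thanks.

CParserError: Error tokenizing data. C error: Expected 23 fields in line 150, saw 83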

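2 answers:

Answer 0: (score: 0)

I thought I'd post this because, even if you pass the names parameter as suggested in the comments, all of the columns are still returned. So for an arbitrary number of columns, you only name (I believe) the last x columns, where x is the number of names you supply. You still get all of the columns. See below: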

arbitrary.csv:

A,100,2001,600,NaN,NaN,NaN,NaN,NaN,ANX,NaN
B,101,2002,601,NaN,NaN,NaN,NaN,NaN,ANX,102.0
C,102,2003,602,88.0,NaN,NaN,NaN,JKR,ANX,103.0
D,103,2004,603,89.0,NaN,NaN,NaN,JKR,ANX,104.0
E,104,2005,604,90.0,ABC,NaN,NaN,JKR,ANX,105.0
F,105,2006,605,91.0,ABC,JKL,NaN,JKR,ANX,106.0
G,106,2007,606,92.0,ABC,JKL,NaN,JKR,ANX,107.0
H,107,2008,607,93.0,ABC,JKL,NaN,JKR,ANX,108.0
I,108,2009,608,94.0,ABC,JKL,NaN,JKR,ANX,109.0
J,109,2010,609,95.0,ABC,JKL,MFG,JKR,ANX,110.0
K,110,2011,610,96.0,ABC,JKL,MFG,NaN,ANX,111.0
L,111,2012,611,97.0,ABC,JKL,MFG,JKR,ANX,112.0
M,112,2013,612,98.0,ABC,JKL,MFG,JKR,ANX,113.0
N,113,2014,613,99.0,ABC,JKL,MFG,JKR,ANX,114.0
O,114,2015,614,100.0,ABC,JKL,MFG,JKR,ANX,115.0
P,115,2016,615,101.0,ABC,JKL,MFG,JKR,ANX,116.0
Q,116,2017,616,102.0,ABC,JKL,MFG,JKR,ANX,117.0
R,117,2018,617,103.0,ABC,JKL,MFG,JKR,ANX,118.0
S,118,2019,618,104.0,ABC,JKL,MFG,JKR,ANX,119.0
T,119,2020,619,105.0,ABC,JKL,MFG,JKR,ANX,120.0
U,120,2021,620,106.0,ABC,JKL,MFG,JKR,ANX,121.0
V,121,2022,621,107.0,ABC,JKL,MFG,JKR,ANX,122.0
W,122,2023,622,108.0,ABC,JKL,MFG,JKR,ANX,123.0
X,123,2024,623,109.0,ABC,JKL,MFG,JKR,ANX,124.0
Y,124,2025,624,110.0,ABC,JKL,MFG,JKR,ANX,125.0
Z,125,2026,625,111.0,ABC,JKL,MFG,JKR,ANX,126.0

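In this example, NaN stands for a blank field in the csv file. In my own work I handle this kind of data with the following approach:

import pandas as pd

def df_from_weird_csv(infile, *argv):
    # Use the same names as both the column headers and the columns to keep.
    df = pd.read_csv(infile, names=argv, usecols=argv)
    return df

df_from_weird_csv('./Desktop/arbitrary.csv', 'Col1', 'Col2', 'Col3', 'Col4')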

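The function above passes a variable number of column names to both names= and usecols= of pd.read_csv, so you can define the column headers and the number of columns to use in a single step:

Output df: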

   Col1  Col2  Col3  Col4
0     A   100  2001   600
1     B   101  2002   601
2     C   102  2003   602
3     D   103  2004   603
4     E   104  2005   604
5     F   105  2006   605
6     G   106  2007   606
7     H   107  2008   607
8     I   108  2009   608
9     J   109  2010   609
10    K   110  2011   610
11    L   111  2012   611
12    M   112  2013   612
13    N   113  2014   613
14    O   114  2015   614
15    P   115  2016   615
16    Q   116  2017   616
17    R   117  2018   617
18    S   118  2019   618
19    T   119  2020   619
20    U   120  2021   620
21    V   121  2022   621
22    W   122  2023   622
23    X   123  2024   623
24    Y   124  2025   624
25    Z   125  2026   625

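So if you use only the names parameter (as in the answer in the linked example), the columns you passed names for are not the only ones kept:

def df2_from_weird_csv(infile, *argv):
    # Only names= is passed here; without usecols= every column in the file is read.
    df = pd.read_csv(infile, names=argv)
    return df
df2_from_weird_csv('./Desktop/arbitrary.csv', 'Col1', 'Col2', 'Col3', 'Col4')

Output df (this behavior really is odd):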

                              Col1 Col2 Col3   Col4
A  100 2001 600 NaN   NaN NaN  NaN  NaN  ANX    NaN
B  101 2002 601 NaN   NaN NaN  NaN  NaN  ANX  102.0
C  102 2003 602 88.0  NaN NaN  NaN  JKR  ANX  103.0
D  103 2004 603 89.0  NaN NaN  NaN  JKR  ANX  104.0
E  104 2005 604 90.0  ABC NaN  NaN  JKR  ANX  105.0
F  105 2006 605 91.0  ABC JKL  NaN  JKR  ANX  106.0
G  106 2007 606 92.0  ABC JKL  NaN  JKR  ANX  107.0
H  107 2008 607 93.0  ABC JKL  NaN  JKR  ANX  108.0
I  108 2009 608 94.0  ABC JKL  NaN  JKR  ANX  109.0
J  109 2010 609 95.0  ABC JKL  MFG  JKR  ANX  110.0
K  110 2011 610 96.0  ABC JKL  MFG  NaN  ANX  111.0
L  111 2012 611 97.0  ABC JKL  MFG  JKR  ANX  112.0
M  112 2013 612 98.0  ABC JKL  MFG  JKR  ANX  113.0
N  113 2014 613 99.0  ABC JKL  MFG  JKR  ANX  114.0
O  114 2015 614 100.0 ABC JKL  MFG  JKR  ANX  115.0
P  115 2016 615 101.0 ABC JKL  MFG  JKR  ANX  116.0
Q  116 2017 616 102.0 ABC JKL  MFG  JKR  ANX  117.0
R  117 2018 617 103.0 ABC JKL  MFG  JKR  ANX  118.0
S  118 2019 618 104.0 ABC JKL  MFG  JKR  ANX  119.0
T  119 2020 619 105.0 ABC JKL  MFG  JKR  ANX  120.0
U  120 2021 620 106.0 ABC JKL  MFG  JKR  ANX  121.0
V  121 2022 621 107.0 ABC JKL  MFG  JKR  ANX  122.0
W  122 2023 622 108.0 ABC JKL  MFG  JKR  ANX  123.0
X  123 2024 623 109.0 ABC JKL  MFG  JKR  ANX  124.0
Y  124 2025 624 110.0 ABC JKL  MFG  JKR  ANX  125.0
Z  125 2026 625 111.0 ABC JKL  MFG  JKR  ANX  126.0
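
For the ragged file described in the question (rows of different lengths, not just blank fields), a commonly used workaround is to give read_csv as many names as the widest row has fields, so that shorter rows are padded with NaN. The following is only a minimal sketch under some assumptions: the input fits in memory and is not empty, and read_ragged_csv and the sample data are hypothetical names made up for illustration, not part of the answer above.

```python
import csv
import io

import pandas as pd


def read_ragged_csv(handle, delimiter=","):
    """Read a header-less CSV whose rows have different field counts."""
    # Hypothetical helper: load the text once so it can be scanned twice.
    text = handle.read()
    # First pass: find the widest row using the standard csv module.
    widths = [len(row) for row in csv.reader(io.StringIO(text), delimiter=delimiter)]
    max_cols = max(widths)
    # Second pass: supplying one name per column of the widest row lets
    # pandas pad the shorter rows with NaN instead of raising an error.
    return pd.read_csv(
        io.StringIO(text), header=None, names=range(max_cols), sep=delimiter
    )


# Made-up sample: rows with 2, 4 and 3 fields.
ragged = io.StringIO("a,1\nb,2,3,4\nc,5,6\n")
df = read_ragged_csv(ragged)
print(df)
```

The columns are labelled 0 to max_cols - 1 here. If you already know an upper bound on the row width, you can skip the first pass and pass names=range(upper_bound) directly.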

Answer 1: (score: -1)

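# coding: utf-8

import pandas as pd


def params(text):
    # Split "key=value|key=value" into a Series indexed by key.
    pairs = text.split("|")
    print(pairs)
    out = {i.split("=")[0]: i.split("=")[1] for i in pairs}
    return pd.Series(out)


params("asd=2|qwe=5")

# Two rows whose 'text' field holds a different number of key=value pairs.
aa = pd.DataFrame({'id': [1, 2], 'text': ["asd=2|qwe=5", "asd=20|qwe=5|qzxc=5"]})
aa

# Expand each 'text' value into its own columns; keys missing from a row become NaN.
aa['text'].apply(params)

# Attach the expanded columns to the original frame.
pd.concat([aa, aa['text'].apply(params)], axis=1)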
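
The same split-then-expand idea as params can be applied to the file from the question: read each line as one string and let pandas spread the pieces over as many columns as the longest line needs. This is a rough sketch, not the answerer's code. split_lines_to_frame and the sample data are hypothetical, and the plain split assumes no quoted fields that contain the separator.

```python
import io

import pandas as pd


def split_lines_to_frame(handle, sep=","):
    """Turn lines of separated text into a frame as wide as the longest line."""
    # Read every non-blank line as one raw string, without its line ending.
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    # expand=True puts each piece in its own column; shorter lines are
    # padded with missing values on the right.
    return pd.Series(lines).str.split(sep, expand=True)


# Made-up sample: rows with 2, 4 and 3 fields.
sample = io.StringIO("a,1\nb,2,3,4\nc,5,6\n")
wide = split_lines_to_frame(sample)
print(wide)
```

Every value stays a string with this approach, so numeric columns would still need converting afterwards, for example with pd.to_numeric.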