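I have a CSV file whose rows contain a variable number of columns (and no column headers). For example, the file may start with some rows of 23 columns, followed by some rows of 83 columns, and so on. When read_csv() starts reading the file, it guesses the number of columns after reading the first few rows (I think), so if the rows at the beginning are shorter than the later ones, I get the exception below. Is there a way to pass a parameter to the function that sets the number of columns to some maximum? Or is there a better way?
Thanks.
CParserError: Error tokenizing data. C error: Expected 23 fields in line 150, saw 83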
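The situation described above can be reproduced on a small in-memory sample. The sketch below shows one possible approach, not necessarily the best one: it scans the text once to find the widest row, then passes that many placeholder names to `read_csv` so that shorter rows are padded with NaN. The helper name `read_ragged_csv` and the sample data are made up for illustration.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first row is shorter than the second one
SAMPLE = "a,1\nb,2,3,4\nc,5,6\n"


def read_ragged_csv(text):
    """Read header-less CSV text whose rows differ in field count (hypothetical helper)."""
    # First pass: count the fields of the widest row
    max_cols = max(len(row) for row in csv.reader(io.StringIO(text)))
    # Second pass: one integer name per column, so short rows are padded with NaN
    return pd.read_csv(io.StringIO(text), header=None, names=list(range(max_cols)))


try:
    # Without names, the column count is inferred from the first row
    pd.read_csv(io.StringIO(SAMPLE), header=None)
    default_read_failed = False
except pd.errors.ParserError:
    default_read_failed = True

df = read_ragged_csv(SAMPLE)
```

For a file on disk the same two passes would apply, at the cost of reading the file twice; if the maximum width is already known (83 in the question), the first pass can be skipped.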
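Answer 0: (score: 0)
I thought I would post this because, even if you pass the names parameter as the answers in the linked comments suggest, every column is still returned. For a file with an arbitrary number of columns, the names you provide are applied (I believe) only to the last x columns, where x is the number of names you supply. You still get all of the columns. See below: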
arbitrary.csv:
A,100,2001,600,NaN,NaN,NaN,NaN,NaN,ANX,NaN
B,101,2002,601,NaN,NaN,NaN,NaN,NaN,ANX,102.0
C,102,2003,602,88.0,NaN,NaN,NaN,JKR,ANX,103.0
D,103,2004,603,89.0,NaN,NaN,NaN,JKR,ANX,104.0
E,104,2005,604,90.0,ABC,NaN,NaN,JKR,ANX,105.0
F,105,2006,605,91.0,ABC,JKL,NaN,JKR,ANX,106.0
G,106,2007,606,92.0,ABC,JKL,NaN,JKR,ANX,107.0
H,107,2008,607,93.0,ABC,JKL,NaN,JKR,ANX,108.0
I,108,2009,608,94.0,ABC,JKL,NaN,JKR,ANX,109.0
J,109,2010,609,95.0,ABC,JKL,MFG,JKR,ANX,110.0
K,110,2011,610,96.0,ABC,JKL,MFG,NaN,ANX,111.0
L,111,2012,611,97.0,ABC,JKL,MFG,JKR,ANX,112.0
M,112,2013,612,98.0,ABC,JKL,MFG,JKR,ANX,113.0
N,113,2014,613,99.0,ABC,JKL,MFG,JKR,ANX,114.0
O,114,2015,614,100.0,ABC,JKL,MFG,JKR,ANX,115.0
P,115,2016,615,101.0,ABC,JKL,MFG,JKR,ANX,116.0
Q,116,2017,616,102.0,ABC,JKL,MFG,JKR,ANX,117.0
R,117,2018,617,103.0,ABC,JKL,MFG,JKR,ANX,118.0
S,118,2019,618,104.0,ABC,JKL,MFG,JKR,ANX,119.0
T,119,2020,619,105.0,ABC,JKL,MFG,JKR,ANX,120.0
U,120,2021,620,106.0,ABC,JKL,MFG,JKR,ANX,121.0
V,121,2022,621,107.0,ABC,JKL,MFG,JKR,ANX,122.0
W,122,2023,622,108.0,ABC,JKL,MFG,JKR,ANX,123.0
X,123,2024,623,109.0,ABC,JKL,MFG,JKR,ANX,124.0
Y,124,2025,624,110.0,ABC,JKL,MFG,JKR,ANX,125.0
Z,125,2026,625,111.0,ABC,JKL,MFG,JKR,ANX,126.0
In the example, NaN stands for a blank field in the CSV file. In my own work I handle this kind of data with the following approach:
import pandas as pd

def df_from_weird_csv(infile, *argv):
    # Name the first len(argv) columns and read only those columns
    df = pd.read_csv(infile, names=argv, usecols=argv)
    return df

df_from_weird_csv('./Desktop/arbitrary.csv', 'Col1', 'Col2', 'Col3', 'Col4')
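The function above passes the same variable-length set of column names to both names= and usecols= of pd.read_csv, so the column headers and the number of columns to read are defined in a single step.
Output df: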
Col1 Col2 Col3 Col4
0 A 100 2001 600
1 B 101 2002 601
2 C 102 2003 602
3 D 103 2004 603
4 E 104 2005 604
5 F 105 2006 605
6 G 106 2007 606
7 H 107 2008 607
8 I 108 2009 608
9 J 109 2010 609
10 K 110 2011 610
11 L 111 2012 611
12 M 112 2013 612
13 N 113 2014 613
14 O 114 2015 614
15 P 115 2016 615
16 Q 116 2017 616
17 R 117 2018 617
18 S 118 2019 618
19 T 119 2020 619
20 U 120 2021 620
21 V 121 2022 621
22 W 122 2023 622
23 X 123 2024 623
24 Y 124 2025 624
25 Z 125 2026 625
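The same call can be tried without a file on disk. This is a minimal sketch on made-up in-memory data (the `SAMPLE` string and the `wanted` list are illustrative, not part of the original file); it assumes, as the output above suggests, that when names and usecols have the same length the names map onto the first columns of the file.

```python
import io

import pandas as pd

# Made-up sample: five fields per row, no header row
SAMPLE = "A,100,2001,600,88.0\nB,101,2002,601,89.0\nC,102,2003,602,90.0\n"

wanted = ["Col1", "Col2"]
# The same list is used to name the columns and to select them
df = pd.read_csv(io.StringIO(SAMPLE), names=wanted, usecols=wanted)
```

The remaining three fields of each row are simply not loaded.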
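So if you use only the names parameter (as in the answers of the linked example), the result is not limited to the columns you passed names for: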
def df2_from_weird_csv(infile, *argv):
    # Only names= is passed, so every column in the file is still read
    df = pd.read_csv(infile, names=argv)
    return df

df2_from_weird_csv('./Desktop/arbitrary.csv', 'Col1', 'Col2', 'Col3', 'Col4')
Output df (this behavior really is odd):
Col1 Col2 Col3 Col4
A 100 2001 600 NaN NaN NaN NaN NaN ANX NaN
B 101 2002 601 NaN NaN NaN NaN NaN ANX 102.0
C 102 2003 602 88.0 NaN NaN NaN JKR ANX 103.0
D 103 2004 603 89.0 NaN NaN NaN JKR ANX 104.0
E 104 2005 604 90.0 ABC NaN NaN JKR ANX 105.0
F 105 2006 605 91.0 ABC JKL NaN JKR ANX 106.0
G 106 2007 606 92.0 ABC JKL NaN JKR ANX 107.0
H 107 2008 607 93.0 ABC JKL NaN JKR ANX 108.0
I 108 2009 608 94.0 ABC JKL NaN JKR ANX 109.0
J 109 2010 609 95.0 ABC JKL MFG JKR ANX 110.0
K 110 2011 610 96.0 ABC JKL MFG NaN ANX 111.0
L 111 2012 611 97.0 ABC JKL MFG JKR ANX 112.0
M 112 2013 612 98.0 ABC JKL MFG JKR ANX 113.0
N 113 2014 613 99.0 ABC JKL MFG JKR ANX 114.0
O 114 2015 614 100.0 ABC JKL MFG JKR ANX 115.0
P 115 2016 615 101.0 ABC JKL MFG JKR ANX 116.0
Q 116 2017 616 102.0 ABC JKL MFG JKR ANX 117.0
R 117 2018 617 103.0 ABC JKL MFG JKR ANX 118.0
S 118 2019 618 104.0 ABC JKL MFG JKR ANX 119.0
T 119 2020 619 105.0 ABC JKL MFG JKR ANX 120.0
U 120 2021 620 106.0 ABC JKL MFG JKR ANX 121.0
V 121 2022 621 107.0 ABC JKL MFG JKR ANX 122.0
W 122 2023 622 108.0 ABC JKL MFG JKR ANX 123.0
X 123 2024 623 109.0 ABC JKL MFG JKR ANX 124.0
Y 124 2025 624 110.0 ABC JKL MFG JKR ANX 125.0
Z 125 2026 625 111.0 ABC JKL MFG JKR ANX 126.0
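What the output above shows, as far as I can tell, is that the names were attached to the last columns and the unnamed leading columns were turned into the row index. The sketch below checks this on made-up in-memory data (the `SAMPLE` string is illustrative); `reset_index` is one way to move the index levels back into ordinary columns.

```python
import io

import pandas as pd

# Made-up sample: five fields per row, no header row
SAMPLE = "A,100,2001,600,88.0\nB,101,2002,601,89.0\nC,102,2003,602,90.0\n"

# Two names for five fields: the names go to the last two columns
df = pd.read_csv(io.StringIO(SAMPLE), names=["Col1", "Col2"])

# The three unnamed leading columns became levels of the row index
n_index_levels = df.index.nlevels

# Moving the index levels back into columns recovers every field
flat = df.reset_index()
```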
Answer 1: (score: -1)
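# coding: utf-8
import pandas as pd


def params(text):
    """Split 'k1=v1|k2=v2' into a Series indexed by key."""
    pairs = text.split("|")
    print(pairs)
    # Split each pair on the first "=" only, so a value may itself contain "="
    out = dict(pair.split("=", 1) for pair in pairs)
    return pd.Series(out)


params("asd=2|qwe=5")

aa = pd.DataFrame({'id': [1, 2], 'text': ["asd=2|qwe=5", "asd=20|qwe=5|qzxc=5"]})

# One column per key; keys missing from a row become NaN
aa['text'].apply(params)

# Attach the expanded columns to the original frame
pd.concat([aa, aa['text'].apply(params)], axis=1)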
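The cells above come without explanation; they appear to expand a delimited text column into one column per key, with a varying number of keys per row. A variant of the same idea, sketched here with a hypothetical helper `parse_pairs`, builds the expanded frame from plain dicts so that it does not depend on `Series.apply` returning a DataFrame.

```python
import pandas as pd


def parse_pairs(text):
    """Turn 'k1=v1|k2=v2' into a dict (hypothetical helper; values stay strings)."""
    return dict(pair.split("=", 1) for pair in text.split("|"))


aa = pd.DataFrame({"id": [1, 2], "text": ["asd=2|qwe=5", "asd=20|qwe=5|qzxc=5"]})

# One dict per row; keys missing from a row become NaN in the resulting frame
expanded = pd.DataFrame([parse_pairs(t) for t in aa["text"]], index=aa.index)

# Join on the shared index to put the new columns next to the original ones
result = aa.join(expanded)
```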