Reading a CSV with extra commas and no quotechar with Pandas?

Asked: 2017-06-27 17:28:17

Tags: python csv pandas

Data:

from io import StringIO
import pandas as pd

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00'''

df = pd.read_csv(StringIO(s))

Error received:

pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 9

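The reason I get this error is fairly obvious: the data contains unquoted text such as "How often? (at home, at work, other)" and "Do you prefer a, b, or c?", so the commas inside the Text field are counted as delimiters.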
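To make the field counts concrete, here is a small check using only the standard library `csv` module (not pandas itself). It assumes the same sample string `s` as above and a default comma delimiter; the variable name `counts` is only for illustration.

```python
import csv
from io import StringIO

s = '''ID,Level,QID,Text,ResponseID,responseText,date_key
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00'''

# Count how many fields a plain comma split produces on each line.
counts = [len(row) for row in csv.reader(StringIO(s))]

for lineno, n in enumerate(counts, start=1):
    print(lineno, n)
# 1 7
# 2 7
# 3 9
# 4 9
```

Line 3 is the first one that splits into 9 fields instead of 7, which matches the line number in the error message.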

How can I read this kind of data into a pandas DataFrame?

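1 Answer:

Answer 0: (score: 1)

Of course, I figured it out while writing this question. Rather than delete it, I'm sharing it with my future self for the day I forget how to do this.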

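Apparently the sep argument of read_csv, which defaults to ',', can also be a regular expression. When it is, pandas uses its Python parsing engine instead of the C one.

The solution is to pass sep=r',(?!\s)' to read_csv, like this: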

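# engine='python' is stated explicitly because a regex separator requires it;
# without it pandas falls back to the Python engine and emits a ParserWarning.
df = pd.read_csv(StringIO(s), sep=r',(?!\s)', engine='python')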

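The (?!\s) part is a negative lookahead: the pattern matches only commas that are not followed by whitespace, so the ", " sequences inside the Text field are left alone.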
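The effect of the lookahead can be checked outside pandas with the standard `re` module. This is just a sketch on one of the sample lines; the names `pattern`, `naive_fields` and `regex_fields` are mine, not anything pandas uses internally.

```python
import re

# Same pattern as the sep argument: a comma NOT followed by whitespace.
pattern = re.compile(r',(?!\s)')

line = ('375280046,S,D3M,How often? (at home, at work, other),'
        'D3M0,Work,2010-03-31 00:00:00')

naive_fields = line.split(',')      # splits on every comma
regex_fields = pattern.split(line)  # skips commas followed by a space

print(len(naive_fields))  # 9
print(len(regex_fields))  # 7
print(regex_fields[3])    # How often? (at home, at work, other)
```

The plain split yields 9 fields, while the regex split yields the expected 7 and keeps the Text field in one piece.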

Result:

          ID Level  QID                                  Text ResponseID  \
0  375280046     S  D3M               Which is your favorite?       D5M0   
1  375280046     S  D3M  How often? (at home, at work, other)       D3M0   
2  375280046     M  A78             Do you prefer a, b, or c?       A78C   

  responseText             date_key  
0     option 1  2012-08-08 00:00:00  
1         Work  2010-03-31 00:00:00  
2            a  2010-03-31 00:00:00
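One caveat: the regex only works as long as every comma inside the text is followed by a space and no real delimiter is. If that does not hold, a possible fallback is to split by position instead. The sketch below assumes that only the fourth column (Text) can contain stray commas, with exactly three fields before it and three after it; `split_row` is a hypothetical helper, not a pandas function, and the sample line is made up for illustration.

```python
def split_row(line, n_left=3, n_right=3):
    """Split a line into n_left leading fields, one free-text field,
    and n_right trailing fields (assumed layout, see the note above)."""
    head = line.split(',', n_left)        # first n_left fields + remainder
    left, rest = head[:n_left], head[n_left]
    tail = rest.rsplit(',', n_right)      # free text + last n_right fields
    return left + tail

header = 'ID,Level,QID,Text,ResponseID,responseText,date_key'.split(',')

# Made-up line whose text contains a comma with NO space after it,
# which the r',(?!\s)' separator would split in the wrong place.
tricky = '375280046,M,A78,Pick one: a,b or c,A78C,a,2010-03-31 00:00:00'

row = split_row(tricky)
print(len(row))  # 7
print(row[3])    # Pick one: a,b or c
```

A list of such rows could then be handed to pd.DataFrame(rows, columns=header) to get the same frame as above.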