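I have a CSV that breaks because of extra commas; I only need one column from the dataset, but it comes after the column with the extra commas

Time: 2017-10-17 18:53:17

Tags: python pandas

If I could parse the CSV in reverse, from the end of each row, I would get the correct values regardless of the errors.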

df1 = pd.read_csv('MyData.csv', error_bad_lines=False)

I can see that all the columns before the column with the extra commas display correctly.

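import pandas as pd
import csv

# Python 3: open both files in text mode; newline='' leaves line endings to the csv module
with open('Myfile', 'r', newline='') as f, \
        open('Newfile', 'w', newline='') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        # split on the first two commas only; everything else stays in the third field
        row = line.rstrip('\r\n').split(',', 2)
        writer.writerow(row)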

I am trying to do this with python pandas.

Sample CSV:

id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate

Excel output:

id  name     place      address     age  type  dob       date      newcolumn9
1   Murtaza  someplace  somestreet  MA   22    B         somedate  somedate
2   Murtaza  someplace  somestreet  45   C     somedate  somedate
3   Murtaza  someplace  somestreet  MA   44    V         somedate  somedate

I want the age column. I cannot post the original CSV or its output, please understand.

2 Answers:

Answer 0: (score: 1)

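This works without pandas, simply with re.split(): split each line on commas and index the fields from the end of the row.

The code below is meant for a .csv file like the four-line sample listed after the code.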

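import re

# read the whole file; the with-block closes it afterwards
with open('your_csv_file.csv', 'r') as f:
    your_csv_file = f.read()

i_column = 2  # index of the desired column, counted from the end of the row
# split into lines and skip empty ones, such as the one after the final newline
lines = [line for line in re.split('\n', your_csv_file) if line]
your_column = []
for line in lines:
    # the negative index counts fields from the end of the row
    your_column.append(re.split(',', line)[-i_column])
print(your_column)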

Sample input file (for which the code prints ['ABC', 'DEF', 'GHI', 'JKL']):

4rth,askj,fpou,ABC,aekert
kjgf,poiuf,pejhh,,oeiu,DEF,akdhg
iuzrit,fslgk,gth,,rhf,,rhe,GHI,ozug
pwiuto,,,,eflgjkhrlguiazg,JKL,rgj
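The same idea of counting from the end can also be sketched with the standard csv module, applied to the sample from the question. This is only a sketch under some assumptions: the helper name column_from_end is made up for illustration, the extra commas only occur before the wanted column, and trailing empty fields (from a comma at the end of a row) are junk rather than real empty values.

```python
import csv
import io


def column_from_end(text, column_name):
    """Collect column_name by its distance from the end of each row."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    # distance of the column from the end of the header (1 = last column)
    offset = len(header) - header.index(column_name)
    values = []
    for row in rows:
        # assumption: empty fields at the end come from trailing commas
        while row and row[-1] == '':
            row.pop()
        if len(row) < offset:
            continue  # row too short to contain the column
        values.append(row[-offset])
    return values


# sample data copied from the question
sample = """id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate
"""

print(column_from_end(sample, 'age'))  # prints ['22', '45', '44']
```

Deriving the offset from the header avoids hard-coding a number like i_column, but it only helps if the header row itself is intact.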

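Answer 1: (score: 0)

I think the best approach is probably to write a separate script that removes the erroneous commas. But if you would rather ignore the bad rows, you can do that by writing each line into a StringIO and skipping the lines with the wrong number of commas. So, if you are expecting 4 columns: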

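from io import StringIO  # Python 3 location of StringIO
import pandas

s = StringIO()
correct_columns = 4
with open('MyData.csv') as file:
    for line in file:
        # keep only the lines with the expected number of comma-separated fields
        if len(line.split(',')) == correct_columns:
            s.write(line)
s.seek(0)  # rewind the buffer so read_csv starts at the beginning
df = pandas.read_csv(s)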
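The first suggestion in this answer, a separate script that removes the erroneous commas, might look roughly like the sketch below. It rests on assumptions that are not stated in the question: all surplus commas belong to one known column (address in the question's sample), trailing empty fields can be dropped, and the helper name repair_rows is hypothetical.

```python
import csv
import io


def repair_rows(text, bad_column):
    """Merge surplus fields back into bad_column so each row matches the header width."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    width = len(header)
    start = header.index(bad_column)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerow(header)
    for row in rows:
        # assumption: empty fields at the end come from trailing commas
        while row and row[-1] == '':
            row.pop()
        extra = len(row) - width
        if extra < 0:
            continue  # too few fields: this rule cannot repair the row
        # glue the surplus pieces back together; the writer quotes the result
        merged = ','.join(row[start:start + extra + 1])
        writer.writerow(row[:start] + [merged] + row[start + extra + 1:])
    out.seek(0)
    return out


# sample data copied from the question
sample = """id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate
"""

print(repair_rows(sample, 'address').getvalue())
```

The returned buffer is rewound, so it can be passed to pandas.read_csv in the same way as s in the snippet above, after which the age column is available by name.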