Reading a CSV with a trailing comma on the last line

Time: 2018-04-10 13:38:13

Tags: python pandas csv

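I am using Python to read a series of CSVs obtained with a web scraper (there are thousands of them, so editing them by hand is not an option). The data looks like this: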

"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","

If you look closely, you will see that the last line ends with a trailing comma followed by a rogue ". The code I am trying to use is here:

import os
import pandas as pd
import csv

def read_temp(file):
    tmp = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=5, quoting=csv.QUOTE_ALL,skipinitialspace=True, skipfooter=1)
    gl = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=1, nrows=1, quoting=csv.QUOTE_ALL,skipinitialspace=True)
    proc_date = pd.read_csv(file, header=None, error_bad_lines=False, quotechar='"', skiprows=2, nrows=1, quoting=csv.QUOTE_ALL,skipinitialspace=True)
    cols = ['NAME', 'DESCRIPTION', 'PAY_TYP', 'AMOUNT', 'TRAN_DATE']
    tmp.columns = cols
    # print(tmp.columns)
    # print(file)
    tmp['G/L_ACCOUNT'] = gl[0][0].split(':')[1]
    tmp['PROCESS_DATE'] = proc_date[0][0].split(':')[1]
    for col in tmp.columns:
        tmp[col] = tmp[col].str.strip('"')
    return tmp
master = "C:\\path\\to\\master\\"
want=[]
flag = 0
for direc in os.listdir(master):
    for file in os.listdir(master+direc):
        temp = read_temp(master+direc+'\\'+file)
        want.append(temp)

df = pd.concat(want)

The error is:

',' expected after '"'

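I was thinking I could use the CSV reader and a regular expression (which I have no experience with) to pre-read each line and find everything enclosed in double quotes. Then I could somehow change it, or remove the trailing comma and double quote. Any ideas would be greatly appreciated!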

1 answer:

Answer 0 (score: 1)

A quick test with the csv module does not raise an error:

import csv

data = """"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","
"""

reader = csv.reader(data.split("\n"), delimiter=',', quotechar='"')
for row in reader:
    print(', '.join(row))

However, it too is "confused" by the final, incomplete element:

Client: Secret Client
G/L Account: (#-#-#) Secret Type of Account
Process Date: MM/DD/YYYY
Export Date: MM/DD/YYYY
Unit Name , Description, Pay. Type , Amount, Tran. Date 
last, first, some note (dates with commas like 17 Aug, 2018 could be here), Credit Card , $AMNT.CHANGE, Date and Timestamp
Total, , , $AMNT.CHANGE, 

You can, however, remove the offending characters from the data, for example using rfind and slicing:

pos = data.rfind(',"', -5)
if pos != -1:
    data = data.strip()[:pos]
print( data[-15:] )

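This should print ,"$AMNT.CHANGE". It searches the last 5 characters of the string for ,". If found, rfind returns the position, which is then used to remove the offending characters (or rather, to return the string without them).

strip() simply removes any trailing newline (introduced here by embedding the data in a triple-quoted """ string literal).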

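Alternatively, if the problem is always those two extra characters, you can slice them off with a negative index, e.g. data[:-2] (after stripping the trailing newline).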
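
A slightly more defensive variant of that slice, as a minimal sketch: it only drops the two characters when the text really does end with ,". The helper name strip_rogue_tail is hypothetical and not part of the original answer.

```python
def strip_rogue_tail(text: str) -> str:
    """Return text without a trailing ',"' (hypothetical helper for illustration)."""
    # Drop trailing whitespace/newlines first so the check sees the real last characters.
    stripped = text.rstrip()
    if stripped.endswith(',"'):
        # Slice off exactly the two offending characters.
        return stripped[:-2]
    return stripped


sample = '"Total","","","$AMNT.CHANGE","\n'
print(strip_rogue_tail(sample))
```

Checking with endswith first means files that happen to be well formed pass through unchanged instead of losing their last two characters.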

A regular expression is not required, but

import re
data = re.sub(r',"?$', "", data, count=1)

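would do the job, and it also works when there is only a trailing , without the quote. You can play with this on regex101.com, which explains what the expression does.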
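
To make the behaviour of that pattern concrete, here is a small sketch that runs it against three possible line endings. The name TAIL and the sample strings are chosen only for illustration.

```python
import re

# A comma, an optional double quote, then the end of the string
# (without re.MULTILINE, $ also matches just before a final newline).
TAIL = re.compile(r',"?$')

for tail in ['"$AMNT.CHANGE","', '"$AMNT.CHANGE",', '"$AMNT.CHANGE"']:
    # The first two lose their dangling ',"' or ','; the third is left untouched.
    print(repr(TAIL.sub("", tail, count=1)))
```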

With that cleanup applied, pandas will have no trouble parsing the data.
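
Putting the pieces together, a rough end-to-end sketch might look like the following. It uses only the standard library so it runs as is; clean_export is a hypothetical helper name, and the sample text is the data from the question.

```python
import csv
import io
import re


def clean_export(text: str) -> str:
    """Remove a trailing ',' or ',"' from the end of the text (hypothetical helper)."""
    return re.sub(r',"?$', "", text.strip(), count=1)


raw = '''"Client: Secret Client"
"G/L Account: (#-#-#) Secret Type of Account"
"Process Date: MM/DD/YYYY"
"Export Date: MM/DD/YYYY"
"Unit Name ","Description","Pay. Type ","Amount","Tran. Date "
"last, first","some note (dates with commas like 17 Aug, 2018 could be here)","Credit Card ","$AMNT.CHANGE","Date and Timestamp"
"Total","","","$AMNT.CHANGE","
'''

cleaned = clean_export(raw)
# io.StringIO makes the cleaned string behave like an open file.
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows[-1])
```

For a real file, the text would first be read with open(path).read(). Since pd.read_csv accepts file-like objects, the same io.StringIO(cleaned) can be passed to it in place of the file path, keeping the other arguments from the question.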