fread() error and strange behaviour when reading a CSV

Time: 2014-06-21 09:46:59

Tags: r data.table

I am using fread() from the data.table package to try to read a 540 MB csv file. It returns an error message:

' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.

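I don't know what causes the error, and I would like to find out whether this is a bug or just a data formatting issue that I can adjust fread() to handle.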

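I managed to read the csv with read.csv() and decided to track down the row that triggers the error above (row 617174, not line 4 as the error message says). I then took that row together with the one row before it and the one row after it, and wrote them out with write.csv() as testout.csv.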

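I was able to read testout.csv back in with read.csv(), which created a data frame with 3 observations, as expected. However, using fread() on testout.csv results in a data table with only 1 observation, which is the last row.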
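A minimal sketch of this kind of round trip in base R follows. The toy data frame, its three columns, and the embedded line breaks in REMARKS are assumptions made for illustration: the truncated error message is consistent with a line break inside that quoted field, but nothing here confirms it. The sketch only shows that, under that assumption, read.csv() still recovers 3 records while the file contains more physical lines than records.

```r
# Toy stand-in for the three rows around the offending record.
# ASSUMPTION: the REMARKS value of the middle row contains embedded line breaks.
storm <- data.frame(
  REFNUM  = c(617129, 617130, 617131),
  EVTYPE  = c("TSTM WIND", "HAIL", "TSTM WIND"),
  REMARKS = c(".", "Dime to nickel sized hail.\n\n", ""),
  stringsAsFactors = FALSE
)

# Write the rows without row names, as in testout.csv.
out_file <- tempfile(fileext = ".csv")
write.csv(storm, out_file, row.names = FALSE)

# Physical lines in the file versus records recovered by read.csv().
physical_lines <- readLines(out_file)
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)

length(physical_lines)  # header + 3 records would be 4; embedded breaks add more
nrow(round_trip)        # 3
```

If the number of physical lines exceeds the number of records plus the header, a reader that treats every line break as a record boundary would miscount rows.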

The four lines in testout.csv are shown below (for readability, I start a new line for each entry).

"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"

20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129

20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail. .",617130

20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131

When I run fread("testout.csv", sep=",", verbose=TRUE), the output is

Input contains no \n. Taking this to be a filename to open
File opened, filesize is  1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)

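Any idea what could be causing the unexpected result, and the error in the first place? Is there any way around it? To be clear, my goal is to be able to use fread() to read the main file, even though read.csv() has worked so far.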

1 Answer:

Answer 0: (score: 4)

Update: this is now fixed in v1.9.3 on GitHub:

  

Windows users have reported success using the latest version from GitHub.
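One way to act on this is to use fread() only when the installed data.table is at least v1.9.3, and to fall back to read.csv() otherwise. The following is a rough sketch, not an official recipe: the helper name read_storm is hypothetical, and the version threshold is taken from the update above.

```r
# Hypothetical helper: prefer fread() when a data.table version with the fix
# (>= 1.9.3, per the update above) is installed; otherwise use read.csv().
read_storm <- function(path) {
  has_fix <- requireNamespace("data.table", quietly = TRUE) &&
    utils::packageVersion("data.table") >= "1.9.3"
  if (has_fix) {
    data.table::fread(path, sep = ",")
  } else {
    utils::read.csv(path, stringsAsFactors = FALSE)
  }
}

# Usage on a small, well-formed file (file name and contents are illustrative).
demo_file <- tempfile(fileext = ".csv")
writeLines(c("REFNUM,EVTYPE", "617129,TSTM WIND", "617131,HAIL"), demo_file)
demo <- read_storm(demo_file)
nrow(demo)
```

Both branches return an object that works with nrow() and column access by name, so downstream code does not need to know which reader was used.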