Fast reading of Unicode files on Windows in R

Asked: 2015-03-30 10:35:59

Tags: r data.table

I always use fread from the data.table package to read large tables. But apparently it does not support reading Unicode files on Windows (Windows 7 Professional, to be precise).

Here is the file I tried:

A,B
ą,ž
ū,į
ų,ė
š,ę
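
To reproduce the problem, the file above can be recreated byte for byte. This is only a minimal sketch: it assumes the file is UTF-8 without a BOM and uses CRLF line endings (the verbose fread output reports CRLF), and the temporary path is a placeholder rather than the original F:/R location.

```r
# The sample table, written with \u escapes so the script itself stays ASCII.
rows <- c("A,B",
          "\u0105,\u017e",   # ą,ž
          "\u016b,\u012f",   # ū,į
          "\u0173,\u0117",   # ų,ė
          "\u0161,\u0119")   # š,ę

# Placeholder location (assumption); any writable path works.
path <- file.path(tempdir(), "unicode_test.csv")

# Open in binary mode and write the UTF-8 bytes untouched, so that the
# Windows native code page (1252 here) cannot re-encode them.
con <- file(path, open = "wb")
writeLines(enc2utf8(rows), con, sep = "\r\n", useBytes = TRUE)
close(con)
```

Writing through a binary connection with useBytes = TRUE keeps R from translating the strings into the native encoding, which cannot represent most of these characters.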

If I read it on Mac OS X, or read it with read.csv and the option encoding="UTF-8", it works fine. Unfortunately, fread does not have this option.

Is there another fast way to read Unicode tables on Windows, or should I use another operating system? Or am I missing something obvious?
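
One thing that might be worth trying is to let fread read the bytes as they are and then declare them as UTF-8 afterwards, which is roughly what encoding="UTF-8" does in read.csv. The sketch below is untested on the setup described here; mark_utf8 is a hypothetical helper name, not part of data.table, and it assumes the file really is UTF-8.

```r
library(data.table)

# Hypothetical helper: mark the column names and all character columns
# of a data.table as UTF-8. No bytes are changed, only the declared encoding.
mark_utf8 <- function(dt) {
  nm <- names(dt)
  Encoding(nm) <- "UTF-8"
  setnames(dt, nm)
  for (col in names(dt)) {
    x <- dt[[col]]
    if (is.character(x)) {
      Encoding(x) <- "UTF-8"
      set(dt, j = col, value = x)   # assign by reference
    }
  }
  dt
}

# Usage with the file from this question:
# aa <- mark_utf8(fread("F:/R/unicode_test2.csv"))
```

Encoding<- only relabels the strings, so this stays fast on large tables. Later releases of data.table also give fread its own encoding argument, so fread(path, encoding = "UTF-8") may be enough if the installed version supports it.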

Here is the output of sessionInfo():

R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.4

loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2 

Update: pasting the output as requested.

> aa<-fread("F:/R/unicode_test2.csv",verbose=TRUE)

Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.000000 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 2 columns. Longest stretch was from line 1 to line 5
Starting data input on line 1 (either column names or first row of data). First 10 characters: Ä„,B
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 5 (including 1 at the end)
Count of sep: 4
nrow = MIN( nsep [4] / ncol [2] -1, neol [5] - nblank [1] ) = 4
Type codes (   first 5 rows): 44
Type codes: 44 (after applying colClasses and integer64)
Type codes: 44 (after applying drop or select (if supplied)
Allocating 2 column slots (2 - 0 dropped)
Read 4 rows. Exactly what was estimated and allocated up front
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 4x2 result (xMB) in RAM
   0.000s (  0%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.001s        Total

> aa
   Ä„  B
1: ą ž
2: ū į
3: ų ė
4: Å¡ Ä™
> aa$A
[1] "ą" "ū" "ų" "š"
> aa$B
[1] "ž" "į" "ė" "ę"

> bb <- read.csv("F:/R/unicode_test.csv",encoding="UTF-8",strings=FALSE)
> bb
  A B
1 a ž
2 u i
3 u e
4 š e
> bb$B
[1] "ž" "į" "ė" "ę"
> bb$A
[1] "ą" "ū" "ų" "š"

0 answers:

No answers