Question

I have a file that contains an XML character header followed by binary data, and I read it in R using readBin:

zz <- file('myfile', 'rb')

# Read header
x <- readBin(zz, 'character')

# Read binary data
...

However, when the header exceeds 10 000 bytes, I get the following:

Warning message:
 In readBin(zz, 'character') :
 null terminator not found: breaking string at 10000 bytes

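I tried looping until the string matched the end of the header and then concatenating the pieces, but the XML does not validate because the ends of some chunks are corrupted (for example, \xa0W\x97^\xff\177 is appended to the end).

How should I handle the readBin character limit? Is there a simple workaround?

Any kind of advice is appreciated. Thanks!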

Update

Here is a reproducible example:

url <- 'http://www.enetpulse.com/wp-content/uploads/sample_xml_feed_enetpulse_icehockey.xml'
x <- paste(readLines(url), collapse = '\n')  # more than 10 000 bytes

f <- tempfile()
zz <- file(f, 'wb')
writeBin(x, zz)  # header
writeBin(1:10000, zz)  # data
close(zz)

# readBin
zz <- file(f, 'rb')
y <- readBin(zz, 'character')
# Warning message:
# In readBin(zz, "character") :
#   null terminator not found: breaking string at 10000 bytes
y
# "... participantFK=\"98707\" [\x97^\xff\177"
close(zz)

# readChar
zz <- file(f, 'rb')
readChar(zz, nchars = 999999)
# Error in readChar(zz, nchars = 999999) : 
#   invalid UTF-8 input in readChar()
close(zz)

# readBin-loop
library(XML)
p <- xmlParse(x)  # it works to parse the original xml
zz <- file(f, 'rb')
fun <- function(x) readBin(zz, 'character')
res <- paste(sapply(1:4, fun), collapse = '')
p2 <- xmlParse(res)  # errors!

Answer 1

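OK. This really is a messy file format. I suggest a more traditional, old-school way of parsing the file: read everything as raw bytes until the null terminator is found. Once it is found, take all the bytes before it, convert them to a character string, and parse that. In this example I then rewind the read position to the start of the binary data, so the data can be read from the same connection.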

Starting right after the test file has been written in your sample code above, I begin with

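block <- 256 * 4    # number of bytes to read per chunk
zz <- file(f, 'rb')
rr <- raw()         # accumulates the header bytes
found <- FALSE
while (!found) {
    r <- readBin(zz, "raw", block)
    # stop instead of looping forever if the file has no null terminator
    if (length(r) == 0) stop("reached end of file without finding a null terminator")
    w <- head(which(r == as.raw(0)), 1)  # first null byte in this chunk, if any
    if (length(w)) {
        rr <- c(rr, r[seq_len(w - 1)])   # keep only the bytes before the terminator
        found <- TRUE
        # rewind to the first byte after the terminator; use length(r), not
        # block, because the last chunk of a file may be shorter than block
        seek(zz, -(length(r) - w), origin = "current")
    } else {
        rr <- c(rr, r)
    }
}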

library(XML)
p <- xmlParse(rawToChar(rr), asText=TRUE)
dd <- readBin(zz, "integer",10000)
close(zz)

This recovers the XML document in p and the vector of integers in dd.
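
The same idea can be wrapped in a small helper. This is only a minimal sketch and not part of the original code: the function name read_cstring and its block argument are hypothetical. The test file is synthetic (a repeated XML-like string rather than the downloaded feed), so the example runs without network access.

```r
# Hypothetical helper: read a null-terminated string of any length from a
# binary connection, leaving the connection positioned just after the terminator.
read_cstring <- function(con, block = 1024L) {
    bytes <- raw()
    repeat {
        r <- readBin(con, "raw", block)
        if (length(r) == 0) stop("no null terminator before end of file")
        w <- which(r == as.raw(0))
        if (length(w)) {
            w <- w[1]
            bytes <- c(bytes, r[seq_len(w - 1)])
            # step back over the bytes that were read past the terminator
            seek(con, -(length(r) - w), origin = "current")
            break
        }
        bytes <- c(bytes, r)
    }
    rawToChar(bytes)
}

# Synthetic test file (an assumption for illustration): a header longer than
# 10 000 bytes followed by binary integers
header <- paste0("<root>", strrep("<item id=\"1\"/>", 1000), "</root>")
f <- tempfile()
zz <- file(f, "wb")
writeBin(header, zz)    # writeBin appends the null terminator to a string
writeBin(1:10000, zz)
close(zz)

zz <- file(f, "rb")
hdr <- read_cstring(zz)               # the complete header, with no warning
dd <- readBin(zz, "integer", 10000)   # the binary data that follows it
close(zz)
```

The helper returns the header as a single string, which can then be passed to xmlParse(hdr, asText=TRUE) as above.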

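This was only possible because you provided a nice reproducible example. Including the code you had already tried was also very helpful. Cheers.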
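
As an alternative sketch, not taken from the code above and assuming the whole file fits comfortably in memory, the file could be read as one raw vector and split at the first null byte, which avoids seek() altogether. The file here is again synthetic, and the names all_bytes, nul, hdr, and dd are only illustrative.

```r
# Synthetic test file (an assumption for illustration): long header + integers
header <- paste0("<root>", strrep("<item id=\"1\"/>", 1000), "</root>")
f <- tempfile()
zz <- file(f, "wb")
writeBin(header, zz)    # string is written with a trailing null byte
writeBin(1:10000, zz)
close(zz)

# Read the whole file into memory as raw bytes
all_bytes <- readBin(f, "raw", n = file.size(f))

# Position of the first null byte, i.e. the end of the header
nul <- which(all_bytes == as.raw(0))[1]

# Everything before the terminator is the header text
hdr <- rawToChar(all_bytes[seq_len(nul - 1)])

# Everything after it is the binary payload; readBin also accepts a raw vector
dd <- readBin(all_bytes[-seq_len(nul)], "integer", n = 10000)
```

This trades memory for simplicity; for very large files the chunked loop is likely the better fit.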

R: readBin workaround to character limit (10 000 bytes)?
