Question

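I developed an R script that correctly extracts selected data from small (<2 MB) XML files. The script reads the entire file into memory. However, I am now trying to apply it to a much larger 624 MB XML file and have run into the following problems:

~ If I try to run it on my laptop, CPU and memory usage climb to 100%. I was nervous about running the job on this machine, so I killed it.

~ I tried running it on the CoCalc cloud computing platform, but I ran into problems with the R XML parser there, so the job never even started.

~ I am not sure whether reading the complete file into memory is still a viable option, or whether I need to modify my code to process much smaller subsets of the full XML file at a time.

I have been looking into options that might let me make simple changes to the code so that the large file is read one line or one chunk at a time, but it is not clear which option is best. Some of the descriptions I have seen, for example of SAX processing, seem to imply that the code would need to be rewritten at a very low level, would not make use of the hierarchical structure of the XML file, and would require writing low-level data-handling functions. I am trying to avoid that.

The most promising option seems to be xml_siblings() and/or other related functions in the R XML package. Ideally, I would like to call one of these functions in a loop and extract a single node on each call, so that I can process one node at a time.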
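A minimal sketch of the node-at-a-time pattern described above, using the `branches` argument of xmlEventParse() from the XML package rather than xml_siblings(). The element names `item` and `name`, the tiny sample file, and the helper collect_item() are assumptions made for illustration, and the sample has no namespace; this has not been tried on the 624 MB file.

```r
library(XML)

# Hypothetical sample data standing in for the real file (no namespace assumed).
sample_file <- tempfile(fileext = ".xml")
writeLines(c(
  "<root>",
  "  <item><name>alpha</name></item>",
  "  <item><name>beta</name></item>",
  "</root>"
), sample_file)

item_names <- character()

# Hypothetical handler: called once per <item> element, with that element
# passed in as a small subtree that can be queried like an ordinary node.
collect_item <- function(node, ...) {
  item_names <<- c(item_names, xmlValue(node[["name"]]))
}

# Stream through the file; a tree is built only for each <item> branch.
invisible(xmlEventParse(sample_file, handlers = list(),
                        branches = list(item = collect_item)))

print(item_names)
```

Growing a vector with c() inside the handler is fine for a sketch; for a very large number of items, collecting results in a preallocated vector or a list may be preferable.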

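However, whenever I call xml_siblings() or any of the related functions, following the syntax given in the documentation for each function I tested, I always get the following error: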

# try to extract Node 1 from the xmldata file:
# library(XML)
# library(xml2)
# filename = "SmallTestFile.xml"
# xmldata = xmlRoot(xmlTreeParse(filename))
> TestSiblings <- xml_siblings(xmldata)
Error in UseMethod("nodeset_apply") : 
  no applicable method for 'nodeset_apply' applied to an object of class "c('XMLNode', 'RXMLAbstractNode', 'XMLAbstractNode', 'oldClass')"

I have searched, but have not yet found any useful resources for troubleshooting the error message above.
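One detail that may be relevant, offered as a guess rather than a diagnosis: xml_siblings() is exported by xml2, while xmlTreeParse() and xmlRoot() come from the XML package, and the two packages use different object classes. A minimal sketch, assuming a small inline document with `item` and `name` elements (made up for illustration), of calling xml_siblings() on a node obtained through xml2 itself:

```r
library(xml2)

# Hypothetical inline document used only for illustration.
doc <- read_xml(paste0(
  "<root>",
  "<item><name>alpha</name></item>",
  "<item><name>beta</name></item>",
  "<item><name>gamma</name></item>",
  "</root>"
))

# A node found via xml2 has an xml2 class, not the XML package's 'XMLNode'.
first_item <- xml_find_first(doc, "./item")
print(class(first_item))

# Siblings of the first <item>, i.e. the remaining <item> elements.
other_items <- xml_siblings(first_item)
sibling_names <- xml_text(xml_find_first(other_items, "./name"))
print(sibling_names)
```

Note that xml_siblings() still operates on a document that read_xml() has already parsed in full, so on its own it may not reduce memory use.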

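I have also been advised that I might want to switch from R to Python, for example using Beautiful Soup. I will do that if necessary, but if possible I would prefer to just adapt my existing R code.

Thanks in advance for any guidance you can provide.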

library(XML)
library(xml2)
library(gdata)

filename = "HugeFile.xml"

# Save the database file as a tree structure
xmldata = xmlRoot(xmlTreeParse(filename))

# Number of nodes in the entire database file
NumNodes <- xmlSize(xmldata)

# read file into variable
MyData <- read_xml(filename)

# strip out the namespace; this can make the data easier to work with
xml_ns_strip(MyData)

# locate all items [i.e. nodes] within the data set
items <- xml_find_all(MyData, './item');

row_count <- 1

TotalNumberOfSubitems <- length(xml_find_all(items, './subitems/subitem'))

# preallocate a one-column character array, one row per subitem
item.name <- array(NA_character_, dim = c(TotalNumberOfSubitems, 1))


# for each item (drug)
for (item_num in seq_along(items)) {

  # the node for the item currently being processed
  current_item <- items[[item_num]]

  # call xml_find_all(), xml_find_first(), and xml_text() functions to extract info;
  # e.g., record the item's name:
  item.name[row_count] <- xml_text(xml_find_first(current_item, './name'))

  …

}

# Create composite matrix that holds all variables being reported
CompositeMatrix = cbind(item.name,value2,value3,value4)

# Specify column names
colnames(CompositeMatrix) <- c("Item Name", "Value 2", "Value 3", "Value 4")

# Write to output file with column headers... BUT these are misaligned with the rest of the columns...
write.fwf(CompositeMatrix, file = "OutputList.txt", sep = "\t", quote = FALSE, rownames = FALSE, colnames = TRUE)
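Regarding the misaligned headers noted in the comment above: write.fwf() produces fixed-width output, so combining it with a tab separator may be part of the problem, though I have not confirmed this. If a plain tab-delimited file is acceptable, base R's write.table() is one alternative. A minimal sketch, in which the small matrix and its values are made up to stand in for CompositeMatrix:

```r
# Hypothetical stand-in for CompositeMatrix, with made-up values.
CompositeMatrix <- cbind(c("alpha", "beta"), c("1", "2"))
colnames(CompositeMatrix) <- c("Item Name", "Value 2")

# Write a tab-delimited file with a header row and no row names.
out_file <- tempfile(fileext = ".txt")
write.table(CompositeMatrix, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)

# Read it back to confirm that headers and columns line up.
check <- read.delim(out_file, check.names = FALSE, stringsAsFactors = FALSE)
print(colnames(check))
```

A tab-delimited file will not look visually aligned in a text editor, but each header stays attached to its column when the file is read back in.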

What is the best way to parse a large XML file in R without reading the entire file into memory?
