Question

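I have some data in JSON format that I want to visualize. The data (about 10MB of JSON) loads very quickly, but reshaping it into a usable form takes a couple of minutes, and that is for fewer than 100,000 rows. I have something that works, but I think it can be done much better.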

It is probably easiest to understand by starting from my sample data.

Assuming you run the following command in /tmp:

curl http://public.west.spy.net/so/time-series.json.gz \
    | gzip -dc - > time-series.json

You should then be able to see my desired output (after a while) with this:

require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows


data <- do.call(rbind,
                lapply(trades,
                       function(row)
                           data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                      price=unlist(row$value)[1],
                                      volume=unlist(row$value)[2])))

someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
smoothScatter(data, colramp=someColors, xaxt="n")

days <- seq(min(data$date), max(data$date), by = 'month')
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)

Answer 1

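You can get a 40x speedup by using plyr. Here is the code and the benchmark comparison. The date conversion can be done once you have the data frame, so I have removed it from the code to allow an apples-to-apples comparison. I am sure a faster solution exists.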

f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[,-c(1, 2)]
f_dustin  = function(n) do.call(rbind, lapply(trades[1:n], 
                function(row) data.frame(
                    date   = unlist(row$key)[2],
                    price  = unlist(row$value)[1],
                    volume = unlist(row$value)[2]))
                )
f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n], 
               function(x){
                   list(date=x$key[2], price=x$value[1], volume=x$value[2])})))

f_mbq = function(n) data.frame(
          t(sapply(trades[1:n],'[[','key')),    
          t(sapply(trades[1:n],'[[','value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
    replications = 50)

test            elapsed   relative 
f_ramnath(100)  0.144       3.692308     
f_dustin(100)   6.244     160.102564     
f_mrflick(100)  0.039       1.000000    
f_mbq(100)      0.074       1.897436

EDIT. MrFlick's solution leads to an additional 3.5x speedup. I have updated my tests.
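The date conversion that was left out of the benchmark can be applied in a single vectorized call once the data frame exists. Here is a minimal sketch; the `raw` frame and its values are made up for illustration, and the explicit format string is an assumption about how the timestamps in the real file look.

```r
# Hypothetical stand-in for the frame produced by any of the functions above;
# the column names and values are invented for this example.
raw <- data.frame(
  date   = c("2011-03-01T09:30:00", "2011-03-01T09:31:00", "2011-03-02T10:00:00"),
  price  = c(10.5, 10.7, 11.0),
  volume = c(100, 250, 75),
  stringsAsFactors = FALSE
)

# Convert the whole column at once instead of once per row.
# The format string is assumed to match the real timestamps.
raw$date <- as.POSIXct(raw$date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

str(raw)
```

Doing the conversion after the frame is built keeps the per-row work to plain extraction, which is what the benchmark measures.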

Answer 2

I received another transformation from MrFlick on IRC that was significantly faster and is worth mentioning here:

data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) {list(date=x$key[2],
                                                   price=x$value[1],
                                                   volume=x$value[2])})))

It appears to be significantly faster because it does not build the inner data frames.
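One thing to watch for with this approach: as far as I can tell, binding plain lists gives a data frame whose columns are themselves lists, which some functions will not accept as-is. Here is a small sketch of flattening them afterwards; the `trades` rows are invented sample data, assuming each key is a two-element character vector and each value is a two-element numeric vector.

```r
# Invented sample rows mimicking the assumed shape of the parsed JSON.
trades <- list(
  list(key = c("ABC", "2011-03-01T09:30:00"), value = c(10.5, 100)),
  list(key = c("ABC", "2011-03-01T09:31:00"), value = c(10.7, 250)),
  list(key = c("ABC", "2011-03-02T10:00:00"), value = c(11.0, 75))
)

# Same transformation as above: one small list per row, bound together once.
data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) list(date   = x$key[2],
                                                      price  = x$value[1],
                                                      volume = x$value[2]))))

# Flatten each column into an ordinary atomic vector.
flat <- as.data.frame(lapply(data, unlist), stringsAsFactors = FALSE)

str(flat)
```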

Answer 3

You are applying vectorized operations to single elements, which is very inefficient. Price and volume can be extracted like this:

t(sapply(trades,'[[','value'))

and the dates like this:

strptime(sapply(trades,'[[','key')[c(F,T)],'%FT%X')

Now just add a bit of sugar, and the complete code looks like this:

data.frame(
 strptime(sapply(trades,'[[','key')[c(F,T)],'%FT%X'),
 t(sapply(trades,'[[','value')))->data
names(data)<-c('date','price','volume')

On my notebook, the whole set converts in about 0.7 s, while the first 10k rows (10%) take about 8 s with the original algorithm.
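To try the idea without downloading the file, here is a self-contained sketch on invented rows. It uses vapply instead of sapply so that a row with an unexpected shape fails loudly, and it spells out the date format; both choices, and the assumed shape of key and value, are mine rather than part of the code above.

```r
# Invented sample rows; the real file is assumed to have the same shape:
# key = c(<something>, <timestamp>), value = c(<price>, <volume>).
trades <- list(
  list(key = c("ABC", "2011-03-01T09:30:00"), value = c(10.5, 100)),
  list(key = c("ABC", "2011-03-01T09:31:00"), value = c(10.7, 250)),
  list(key = c("ABC", "2011-03-02T10:00:00"), value = c(11.0, 75))
)

# Pull out every timestamp in one pass (character vector of length n).
stamps <- vapply(trades, function(x) x$key[2], character(1))

# Pull out every value pair in one pass (2 x n matrix), then transpose to n x 2.
vals <- t(vapply(trades, function(x) x$value, numeric(2)))

# Build the data frame once, from whole columns.
data <- data.frame(
  date   = as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
  price  = vals[, 1],
  volume = vals[, 2]
)

str(data)
```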

Answer 4

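Is batching an option? Process 1000 rows at a time, perhaps depending on how deep your JSON is. Do you really need to convert all the data? I am not sure about R or what exactly you are dealing with, but I am thinking of a generic approach.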
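In R, one possible reading of the batching idea is to split the parsed rows into chunks of 1000, convert each chunk in one go, and bind the handful of resulting frames. This is only a sketch on generated data; `convert_chunk` and `chunk_size` are hypothetical names, and the row shape is assumed.

```r
# Generate 2500 invented rows in the assumed shape of the parsed JSON.
start <- as.POSIXct("2011-03-01", tz = "UTC")
trades <- lapply(seq_len(2500), function(i)
  list(key   = c("ABC", format(start + 60 * i, "%Y-%m-%dT%H:%M:%S")),
       value = c(10 + i / 1000, i)))

# Hypothetical helper: turn one chunk of rows into a data frame.
convert_chunk <- function(rows) {
  data.frame(
    date   = vapply(rows, function(x) x$key[2], character(1)),
    price  = vapply(rows, function(x) x$value[1], numeric(1)),
    volume = vapply(rows, function(x) x$value[2], numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Split the rows into groups of at most chunk_size.
chunk_size <- 1000
chunks <- split(trades, ceiling(seq_along(trades) / chunk_size))

# Bind a few chunk-sized frames instead of thousands of one-row frames.
data <- do.call(rbind, lapply(chunks, convert_chunk))
rownames(data) <- NULL

str(data)
```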

Also take a look at http://jackson.codehaus.org/, a high-performance JSON processor.

Performance problem transforming JSON data
