Converting range data to means in R

Time: 2016-08-06 12:19:38

Tags: r machine-learning

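Data such as age is often given as ranges, and I want to compute the mean of each range. I am able to compute it, but I feel there must be a more elegant, and perhaps faster, way.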

Here is a working example:

age <- c("0-10", "11-20", "21-30", "31-40") # define the age vector in ranges
age_split<-strsplit(age,"-") # gives the list with splits

for(ii in 1:length(age)){
  age[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
print(age)
## [1] "5"    "15.5" "25.5" "35.5"
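
Note that the printed result above is a character vector: the means are assigned back into age, which holds strings, so each number is coerced to text. A minimal sketch of the same loop that keeps the result numeric, assuming a separate preallocated vector (age_mean is a hypothetical name, not part of the original code):

```r
age <- c("0-10", "11-20", "21-30", "31-40")
age_split <- strsplit(age, "-")            # list with one character vector per range

age_mean <- numeric(length(age))           # preallocated numeric result vector
for (ii in seq_along(age)) {
  # [[ii]] extracts the character vector itself, so unlist() is not needed
  age_mean[ii] <- mean(as.numeric(age_split[[ii]]))
}
print(age_mean)
```

Writing into a numeric vector avoids the later as.numeric() conversion when the means are used in further calculations.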

Following the suggestions from lmo and akron, here is code that benchmarks the various methods:

irows = 100000
data1 <- paste0(sample(1:10, irows, replace = TRUE),"-", sample(11:20, irows, replace = TRUE))
data2 <- data1; data3 <- data1; data4 <- data1 # replicated for testing different methods

#--method 1 -- originally proposed
a1<-Sys.time()
age_split<-strsplit(data1,"-")
for(ii in 1:length(data1)){
  data1[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
Sys.time()-a1

# method 2 (lmo suggestion)
a2<-Sys.time()
data2 <- sapply(strsplit(data2, split="-"), function(i) mean(as.numeric(i)))
Sys.time()-a2

# method 3 (cue from akron)
a3<-Sys.time()
age_split_matrix <-do.call(rbind, strsplit(data3,"-"))
class(age_split_matrix) <- "numeric"
data3<-rowMeans(age_split_matrix)
Sys.time()-a3

# method 4 (akron proposed)
a4<-Sys.time()
data4 <-rowMeans(read.table(text=data4, sep = "-"))
Sys.time()-a4

# validating if outputs match
all.equal(as.numeric(data1),data2)
all.equal(as.numeric(data1),data3)
all.equal(as.numeric(data1),data4)

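With irows = 100K, the times for methods 1 to 4 were: (1) 2.5 s, (2) 1.4 s, (3) 0.34 s, (4) 6.3 s. With irows = 1 million, the times were: (1) 23 s, (2) 14 s, (3) 6 s, (4) very long. With irows = 10 million, the times were: (1) 3.9 min, (2) 2.9 min, (3) very long. This leads me to conclude that read.table is really slow on large inputs, and that method 3 uses a lot of memory.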
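
The timings above were taken as differences of Sys.time(). As a hedged sketch, the same kind of comparison can be wrapped in base R's system.time(), which reports elapsed time directly. The helper names (mean_sapply, mean_matrix), the seed, and the smaller input size are assumptions for illustration and not part of the original benchmark; actual timings will vary by machine.

```r
set.seed(1)                      # assumed seed, only for reproducibility
n <- 10000                       # smaller than the original 100K to keep the sketch quick
x <- paste0(sample(1:10, n, replace = TRUE), "-",
            sample(11:20, n, replace = TRUE))

# method 2: split each string, then average each list element
mean_sapply <- function(v) {
  sapply(strsplit(v, split = "-"), function(i) mean(as.numeric(i)))
}

# method 3: bind the splits into a character matrix, convert it, use rowMeans
mean_matrix <- function(v) {
  m <- do.call(rbind, strsplit(v, "-"))
  class(m) <- "numeric"          # same conversion as in the original method 3
  rowMeans(m)
}

t_sapply <- system.time(r_sapply <- mean_sapply(x))
t_matrix <- system.time(r_matrix <- mean_matrix(x))

print(t_sapply["elapsed"])
print(t_matrix["elapsed"])
print(all.equal(r_sapply, r_matrix))   # both methods should agree
```

Putting each method in a function means every run starts from the same input, so the copies data2, data3 and data4 are not needed.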

2 answers:

Answer 0: (score: 1)

We can use rowMeans after reading the data into a data.frame with read.table:

rowMeans(read.table(text=age, sep="-"))
#[1]  5.0 15.5 25.5 35.5
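
To make the intermediate step visible, here is a small sketch that stores the data.frame before averaging. The variable name df is an assumption for illustration; V1 and V2 are the default column names that read.table assigns when no header is given.

```r
age <- c("0-10", "11-20", "21-30", "31-40")

# read.table treats each element of the character vector as one line of input
# and splits it on "-" into columns
df <- read.table(text = age, sep = "-")
str(df)            # inspect the parsed columns (V1 = lower bound, V2 = upper bound)

# rowMeans averages across the columns of each row
rowMeans(df)
```

This relies on every range having the same number of parts; a value such as "40+" would not parse into two numeric columns.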

Answer 1: (score: 0)

Here is a one-liner with sapply:

sapply(strsplit(age, split="-"), function(i) mean(as.numeric(i)))
#[1]  5.0 15.5 25.5 35.5

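strsplit splits each string on "-" and returns a list. That list is fed to sapply, which takes each list element, converts the character vector to numeric, and computes the mean.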
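
The steps described above can be traced one at a time. This is only an illustrative breakdown of the one-liner; the names parts, means and means_strict are assumptions, and the vapply variant is an optional addition that was not part of the original answer.

```r
age <- c("0-10", "11-20", "21-30", "31-40")

# step 1: strsplit returns a list with one character vector per input string
parts <- strsplit(age, split = "-")
print(parts[[1]])

# step 2: convert a single element to numeric and average it
print(mean(as.numeric(parts[[1]])))

# step 3: sapply repeats step 2 for every element and simplifies to a vector
means <- sapply(parts, function(i) mean(as.numeric(i)))
print(means)

# optional variant: vapply declares that each result is one numeric value
means_strict <- vapply(parts, function(i) mean(as.numeric(i)), numeric(1))
```

vapply raises an error if any element does not produce exactly one numeric value, which can help catch malformed ranges early.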