Question

I have a data frame that looks like this:

user1,product1,0
user1,product2,2
user1,product3,1
user1,product4,2
user2,product3,0
user2,product2,2
user3,product4,0
user3,product5,3

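The data frame has millions of rows. I need to go through every row: if the value in the last column is 0, keep that product number; otherwise, append the product number to the previous product number whose value was 0, and write the result to a new data frame.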

For example, the resulting matrix should be

user1,product1
user1,product1product2
user1,product1product3
user1,product1product4
user2,product3
user2,product3product2
user3,product4
user3,product4product5

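I wrote a for loop that goes through every row. It works, but it is very slow. How can I speed it up? I tried to vectorize it, but I'm not sure how, because I need to check the value in the previous row.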

Answer 1

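Note that you don't actually have a matrix. A matrix can only contain one atomic type (numeric, integer, character, etc.). What you really have is a data.frame.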

You can easily do what you want using na.locf from the zoo package together with ifelse.

x <- structure(list(V1 = c("user1", "user1", "user1", "user1", "user2", 
"user2", "user3", "user3"), V2 = c("product1", "product2", "product3", 
"product4", "product3", "product2", "product4", "product5"), 
    V3 = c("0", "2", "1", "2", "0", "2", "0", "3")), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, 8L))

library(zoo)
# First, create a column that contains the value from the 2nd column
# when the 3rd column is zero.
x$V4 <- ifelse(x$V3==0,x$V2,NA)
# Next, replace all the NA with the previous non-NA value
# (na.rm = FALSE keeps any leading NA so the length still matches nrow(x))
x$V4 <- na.locf(x$V4, na.rm = FALSE)
# Finally, create a column that contains the concatenated strings
x$V5 <- ifelse(x$V2==x$V4,x$V2,paste(x$V4,x$V2,sep=""))
# Desired output
x[,c(1,5)]

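Because you are using a data.frame, you need to make sure the "product" column is character and not factor (the code above gives odd results if the "product" column is a factor).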
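
If adding a dependency on zoo is not an option, the same carry-forward idea can be sketched in base R with cumsum. This is only an illustration, not part of the approach above: it assumes the first row has a 0 in the last column, it reuses the column names V1 to V3 from the sample, and the names is_base, base_product and result are made up for the example.

```r
# Sample data with the same column names as in the answer above.
# V3 is stored as numeric here; that is an assumption of this sketch.
x <- data.frame(
  V1 = c("user1", "user1", "user1", "user1", "user2", "user2", "user3", "user3"),
  V2 = c("product1", "product2", "product3", "product4",
         "product3", "product2", "product4", "product5"),
  V3 = c(0, 2, 1, 2, 0, 2, 0, 3),
  stringsAsFactors = FALSE
)

# Guard against a factor column, as noted above.
x$V2 <- as.character(x$V2)

# TRUE on the rows whose product should be kept as-is.
is_base <- x$V3 == 0
# This sketch assumes the data starts with such a row.
stopifnot(is_base[1])

# cumsum(is_base) counts how many zero rows have been seen so far, so
# indexing the zero-row products by it carries the last one forward.
base_product <- x$V2[is_base][cumsum(is_base)]

# Keep the product on zero rows, otherwise prefix it with the carried one.
result <- data.frame(
  V1 = x$V1,
  V2 = ifelse(is_base, x$V2, paste0(base_product, x$V2)),
  stringsAsFactors = FALSE
)
print(result)
```

Like the zoo version, this carries the last zero-row product forward without looking at the user column, so it relies on every user's block starting with a zero row.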
