Generating new columns based on data in multiple columns

Time: 2018-03-21 20:32:03

Tags: r

I have a dataset from a colleague. In the dataset, we record the location of a given skin problem. We can record up to 20 locations.

  

scaloc1 == 2   scaloc2 == 24   scaloc3 == NA   scalocn ......

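means that the skin problem is at locations 2 and 24 and nowhere else.

I want to reorganize the data so that, instead of the above, it looks like this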

  

face 1/0   trunk 1/0   etc.

So, for example, if any of scaloc1 to scalocn contains the value 3, then set face to 1.

I have done this before in Stata using:

foreach var in scaloc1 scaloc2 scaloc3 scaloc4 scaloc5 scaloc6 scaloc7 scaloc8 scaloc9 scal10 scal11 scal12 scal13 scal14 scal15 scal16 scal17 scal18 scal19 scal20{
  replace facescalp=1 if (`var'>=1 & `var'<=6) | (`var'>=21 & `var'<=26)
}

I feel I should be able to do this with a dreaded for loop, or with something from the apply family?

I tried

dataframe$facescalp <-0
#Default to zero
apply(dataframe[,c("scaloc1","scaloc2","scalocn")],2,function(X){
      dataframe$facescalp[X>=1 & X<7] <-1
      })
#I thought this would look at location columns 1 to n and if the value was between 1 and 7 then assign face-scalp to 1

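but it didn't work...

I haven't really used apply before. I had a good look through the examples here and couldn't find one that accurately describes my current problem.

An example dataset is available: https://www.dropbox.com/s/0lkx1tfybelc189/example_data.xls?dl=0

If anything is unclear, or if this has already been explained in a different answer, please let me know.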

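2 answers:

Answer 0: (score: 1)

If I understand your question correctly, the simplest solution is probably the following (this uses the example dataset you provided, stored as df):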

# Add an ID column to identify each patient or skin problem 
df$ID <- row.names(df)

# Gather all columns other than ID into a long-format data frame
library(tidyr)
dfl <- gather(df, locID,  loc, -ID)

# Order by ID
dfl <- dfl[order(dfl$ID), ]

# Keep only the rows where a skin problem location is present
dfl <- dfl[!is.na(dfl$loc), ]

# Set `face` to 1 where `locID` is 'scaloc1' and `loc` is 3
dfl$face <- ifelse(dfl$locID == 'scaloc1' & dfl$loc == 3, 1, 0)

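Because you need to apply many conditions to fill the individual body-part columns, the most efficient route is probably to create a lookup table and use the match function. There are many examples on SO that describe using match in situations like this.

Answer 1: (score: 0)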
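The lookup-table idea can be sketched roughly as follows, in base R only. This is a minimal illustration rather than a definitive implementation: the toy values are made up, the mapping of codes 1 to 6 and 21 to 26 to facescalp is taken from the Stata snippet in the question, and the "trunk" codes are a purely hypothetical placeholder that would need replacing with the real coding scheme.

```r
# Toy data: three location columns, one row per record (values are made up)
df <- data.frame(
  scaloc1 = c(2, 7, NA),
  scaloc2 = c(24, NA, NA),
  scaloc3 = c(NA_real_, NA_real_, NA_real_)
)

# Lookup table: location code -> body region.
# Codes 1-6 and 21-26 -> "facescalp" follows the Stata snippet in the question;
# the "trunk" codes (7-10) are a hypothetical placeholder.
lookup <- data.frame(
  code   = c(1:6, 21:26, 7:10),
  region = c(rep("facescalp", 12), rep("trunk", 4)),
  stringsAsFactors = FALSE
)

loc_cols <- c("scaloc1", "scaloc2", "scaloc3")
codes <- as.matrix(df[loc_cols])

# match() gives the lookup row for each code (NA when the code is missing
# or not in the table); rebuild the matrix shape afterwards
regions <- matrix(lookup$region[match(codes, lookup$code)], nrow = nrow(codes))

# One 1/0 indicator column per region: 1 if any location column falls in it
for (r in unique(lookup$region)) {
  df[[r]] <- as.integer(rowSums(regions == r, na.rm = TRUE) > 0)
}
```

With this layout, adding a body region only means adding rows to the lookup table; the loop that builds the indicator columns stays unchanged.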

Very helpful. I ended up using a variant of this approach:

library(tidyr)
library(dplyr)

# Make a single long data frame
data_loc <- gather(data, "site", "location", c("scaloc1", "scaloc2", "scaloc3", "scaloc4", "scaloc5", "scaloc6", "scaloc7", "scaloc8", "scaloc9", "scal10", "scal11", "scal12", "scal13", "scal14", "scal15", "scal16", "scal17", "scal18", "scal19", "scal20"))

data_loc$facescalp <- 0
data_loc$facescalp[data_loc$location >=1 & data_loc$location <=6] <-1
#These two lines were repeated for each of the eventual categories I wanted
locations <- group_by(data_loc, ID) %>%
  summarise(
    facescalp = max(facescalp), upperarm = max(upperarm),
    lowerarm = max(lowerarm), hand = max(hand),
    buttockgroin = max(buttockgroin), upperleg = max(upperleg),
    lowerleg = max(lowerleg), feet = max(feet)
  )

#Generate per individual the maximum value for each category, hence if in any of locations 1 to 20 they had a value corresponding to face then this ends up giving a 1

data <- inner_join(data,locations, by = "ID")
#This brings the data back together
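
For comparison, the same per-row indicator can probably be computed without reshaping at all, which is closer in spirit to the Stata loop in the question. The sketch below is base R on made-up toy data; the helper name in_face is hypothetical, and the ranges 1 to 6 and 21 to 26 are taken from the Stata snippet. Note that an assignment to dataframe$facescalp made inside a function passed to apply only changes a local copy, so the result has to be computed and then assigned outside the function, as done here.

```r
# Toy data (made-up values); column names follow the question
dataframe <- data.frame(
  ID      = 1:3,
  scaloc1 = c(2, 7, NA),
  scaloc2 = c(24, NA, NA)
)
loc_cols <- c("scaloc1", "scaloc2")

# Hypothetical helper: TRUE where a code falls in the face/scalp ranges
# used by the Stata snippet (1-6 and 21-26); NA stays NA
in_face <- function(x) (x >= 1 & x <= 6) | (x >= 21 & x <= 26)

# Comparing a numeric matrix keeps its shape:
# rows = records, columns = location variables
hits <- in_face(as.matrix(dataframe[loc_cols]))

# 1 if any location in the row is in range, 0 otherwise (NA is ignored)
dataframe$facescalp <- as.integer(rowSums(hits, na.rm = TRUE) > 0)
```

The same pattern would be repeated, or wrapped in a loop over a list of ranges, for each of the other body-region columns.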