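Given a (pre-existing) data frame with columns of various types, what is the simplest way to convert all of its character columns to factors without affecting columns of any other type?
Here is an example data.frame: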
df <- data.frame(A = factor(LETTERS[1:5]),
B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
D = letters[1:5],
E = paste(LETTERS[1:5], letters[1:5]),
stringsAsFactors = FALSE)
df
# A B C D E
# 1 A 1 TRUE a A a
# 2 B 2 TRUE b B b
# 3 C 3 FALSE c C c
# 4 D 4 FALSE d D d
# 5 E 5 TRUE e E e
str(df)
# 'data.frame': 5 obs. of 5 variables:
# $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
# $ B: int 1 2 3 4 5
# $ C: logi TRUE TRUE FALSE FALSE TRUE
# $ D: chr "a" "b" "c" "d" ...
# $ E: chr "A a" "B b" "C c" "D d" ...
I know I can do:
df$D <- as.factor(df$D)
df$E <- as.factor(df$E)
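Is there a way to automate this process a bit more?
Answer 0 (score: 85)
Roland's answer is great for this specific problem, but I thought I would share a more general approach.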
DF <- data.frame(x = letters[1:5], y = 1:5, z = LETTERS[1:5],
stringsAsFactors=FALSE)
str(DF)
# 'data.frame': 5 obs. of 3 variables:
# $ x: chr "a" "b" "c" "d" ...
# $ y: int 1 2 3 4 5
# $ z: chr "A" "B" "C" "D" ...
## The conversion
DF[sapply(DF, is.character)] <- lapply(DF[sapply(DF, is.character)],
as.factor)
str(DF)
# 'data.frame': 5 obs. of 3 variables:
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# $ y: int 1 2 3 4 5
# $ z: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
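For the conversion, the left-hand side of the assignment (DF[sapply(DF, is.character)]) selects the subset of character columns. On the right-hand side, you use lapply on that same subset to perform whatever conversion you need. R is smart enough to replace the original columns with the results.
The handy thing about this is that if you want to go the other way, or do some other conversion, it is as simple as changing what you select on the left and specifying what you want to convert it to on the right.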
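As a minimal sketch of "going the other way" with the same pattern, the example below converts factor columns back to character. The data frame DF2 and its column names are made up for illustration and are not part of the original answer.

```r
# Hypothetical example data; the names are assumptions for illustration only.
DF2 <- data.frame(x = factor(c("a", "b", "c")),
                  y = 1:3,
                  z = c("u", "v", "w"),
                  stringsAsFactors = FALSE)

# Left side: select the factor columns.
# Right side: convert that same subset with lapply.
is_fct <- sapply(DF2, is.factor)
DF2[is_fct] <- lapply(DF2[is_fct], as.character)

# Only x changed type; y (integer) and z (character) are untouched.
sapply(DF2, class)
```

Computing the logical selector once and reusing it on both sides also avoids calling sapply twice.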
Answer 1 (score: 55)
DF <- data.frame(x=letters[1:5], y=1:5, stringsAsFactors=FALSE)
str(DF)
#'data.frame': 5 obs. of 2 variables:
# $ x: chr "a" "b" "c" "d" ...
# $ y: int 1 2 3 4 5
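Before R 4.0.0, the (annoying) default of as.data.frame was to convert all character columns to factor columns. Since R 4.0.0 the default is stringsAsFactors = FALSE, so pass stringsAsFactors = TRUE explicitly to get that behavior here:
DF <- as.data.frame(unclass(DF), stringsAsFactors = TRUE)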
str(DF)
#'data.frame': 5 obs. of 2 variables:
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# $ y: int 1 2 3 4 5
Answer 2 (score: 29)
As @Raf Z commented on this question, dplyr now has mutate_if. It is very useful, simple, and readable.
> str(df)
'data.frame': 5 obs. of 5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int 1 2 3 4 5
$ C: logi TRUE TRUE FALSE FALSE TRUE
$ D: chr "a" "b" "c" "d" ...
$ E: chr "A a" "B b" "C c" "D d" ...
> df <- df %>% mutate_if(is.character,as.factor)
> str(df)
'data.frame': 5 obs. of 5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int 1 2 3 4 5
$ C: logi TRUE TRUE FALSE FALSE TRUE
$ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
$ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5
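Answer 3 (score: 2)
The simplest way is to use the code below. It automates the whole process of converting all variables in a data frame to factors in R, and it worked very well for me. Note that it converts every column, not only the character columns. food_cat here is the dataset I am using; change it to the one you are working with.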
for(i in 1:ncol(food_cat)){
food_cat[,i] <- as.factor(food_cat[,i])
}
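Because the loop above converts every column, a variant that only touches character columns may be closer to what the question asks for. The following is a minimal sketch under that assumption; the contents of food_cat here are invented for illustration.

```r
# Hypothetical stand-in for food_cat; the columns are assumptions.
food_cat <- data.frame(item  = c("apple", "bread", "apple"),
                       price = c(1.5, 2.0, 1.5),
                       stringsAsFactors = FALSE)

# seq_len() is safe even when the data frame has zero columns.
for (i in seq_len(ncol(food_cat))) {
  # Only convert columns that are currently character vectors.
  if (is.character(food_cat[[i]])) {
    food_cat[[i]] <- as.factor(food_cat[[i]])
  }
}

str(food_cat)
```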
Answer 4 (score: 2)
Using dplyr:
library(dplyr)
df <- data.frame(A = factor(LETTERS[1:5]),
B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
D = letters[1:5],
E = paste(LETTERS[1:5], letters[1:5]),
stringsAsFactors = FALSE)
str(df)
We get:
'data.frame': 5 obs. of 5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int 1 2 3 4 5
$ C: logi TRUE TRUE FALSE FALSE TRUE
$ D: chr "a" "b" "c" "d" ...
$ E: chr "A a" "B b" "C c" "D d" ...
Now we can convert all chr columns to factors:
df <- df%>%mutate_if(is.character, as.factor)
str(df)
We get:
'data.frame': 5 obs. of 5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int 1 2 3 4 5
$ C: logi TRUE TRUE FALSE FALSE TRUE
 $ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5
Here are some other solutions:
With base R:
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],
as.factor)
Using dplyr 1.0.0 or later:
df <- df %>% mutate(across(where(is.character), as.factor))
Using the purrr package:
library(purrr)
df <- df %>% modify_if(is.character, as.factor)
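Answer 5 (score: 1)
I used to do this with a simple for loop. As in @A5C1D2H2I1M1N2O1R2T1's answer, lapply is a nice solution. However, if you convert all columns, you need to wrap the result in data.frame, otherwise you end up with a list. The difference in execution time is small.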
mm2N=mm2New[,10:18]
str(mm2N)
'data.frame': 35487 obs. of 9 variables:
$ bb : int 4 6 2 3 3 2 5 2 1 2 ...
$ vabb : int -3 -3 -2 -2 -3 -1 0 0 3 3 ...
$ bb55 : int 7 6 3 4 4 4 9 2 5 4 ...
$ vabb55: int -3 -1 0 -1 -2 -2 -3 0 -1 3 ...
$ zr : num 0 -2 -1 1 -1 -1 -1 1 1 0 ...
$ z55r : num -2 -2 0 1 -2 -2 -2 1 -1 1 ...
$ fechar: num 0 -1 1 0 1 1 0 0 1 0 ...
$ varr : num 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: int 3 0 4 6 6 6 0 6 6 1 ...
# for loop solution
t1=Sys.time()
for(i in 1:ncol(mm2N)) mm2N[,i]=as.factor(mm2N[,i])
Sys.time()-t1
Time difference of 0.2020121 secs
str(mm2N)
'data.frame': 35487 obs. of 9 variables:
$ bb : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
$ vabb : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
$ bb55 : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
$ zr : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
$ z55r : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
$ varr : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...
#lapply solution
mm2N=mm2New[,10:18]
t1=Sys.time()
mm2N <- lapply(mm2N, as.factor)
Sys.time()-t1
Time difference of 0.209012 secs
str(mm2N)
List of 9
$ bb : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
$ vabb : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
$ bb55 : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
$ zr : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
$ z55r : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
$ varr : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...
#data.frame lapply solution
mm2N=mm2New[,10:18]
t1=Sys.time()
mm2N <- data.frame(lapply(mm2N, as.factor))
Sys.time()-t1
Time difference of 0.2010119 secs
str(mm2N)
'data.frame': 35487 obs. of 9 variables:
$ bb : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
$ vabb : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
$ bb55 : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
$ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
$ zr : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
$ z55r : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
$ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
$ varr : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
$ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...
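As an alternative to wrapping the result in data.frame(), assigning into the data frame with empty brackets keeps the data frame structure while replacing its columns. This is a minimal sketch on a small made-up data frame (mm is a hypothetical stand-in for mm2N, not the original data).

```r
# Hypothetical small stand-in for mm2N; the values are invented.
mm <- data.frame(bb = c(4L, 6L, 2L),
                 zr = c(0, -2, -1))

# mm[] <- ... replaces the columns in place, so mm stays a data frame
# instead of becoming a plain list.
mm[] <- lapply(mm, as.factor)

class(mm)
str(mm)
```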
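Answer 6 (score: 0)
I noticed that "[" indexing of columns fails to create the levels when iterating:
for (a_feature in convert.to.factors) {
feature.df[a_feature] <- factor(feature.df[a_feature])}
For the "Status" column, for example, it creates:
 $ Status: Factor w/ 1 level "c(\"Success\", \"Failure\")": NA NA ...
This is remedied by using "[[" indexing:
for (a_feature in convert.to.factors) {
feature.df[[a_feature]] <- factor(feature.df[[a_feature]])}
which gives the desired result:
 $ Status: Factor w/ 2 levels "Success","Failure": 1 1 2 1 ...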
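The difference comes from what each kind of indexing returns: "[" gives a one-column data frame, while "[[" gives the column vector itself, which is what factor() expects. The sketch below illustrates this on a small made-up data frame; the values are assumptions, and only the names feature.df and convert.to.factors come from the answer.

```r
# Hypothetical data reusing the names from the answer.
feature.df <- data.frame(Status = c("Success", "Success", "Failure", "Success"),
                         stringsAsFactors = FALSE)
convert.to.factors <- c("Status")

# "[" keeps the data frame wrapper; "[[" extracts the underlying vector.
cls_single <- class(feature.df["Status"])    # "data.frame"
cls_double <- class(feature.df[["Status"]])  # "character"

# Convert using "[[" so factor() receives a plain character vector.
for (a_feature in convert.to.factors) {
  feature.df[[a_feature]] <- factor(feature.df[[a_feature]])
}

str(feature.df)
```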
Answer 7 (score: 0)
Based on @Roland's answer and @Paul de Barros's comment, I came up with the following:
df <- data.frame(A = factor(LETTERS[1:5]),
B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
D = letters[1:5],
E = paste(LETTERS[1:5], letters[1:5]),
stringsAsFactors = FALSE)
df<-as.data.frame(unclass(df),stringsAsFactors=TRUE)
str(df)
It is practical and looks quite simple:
> str(df)
'data.frame': 5 obs. of 5 variables:
$ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ B: int 1 2 3 4 5
$ C: logi TRUE TRUE FALSE FALSE TRUE
$ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
$ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5