Replace names of rows in a column

Date: 2017-09-28 11:36:47

Tags: r

I have a large data.frame in R with thousands of rows and 4 columns. For example:

   Chromosome    Start      End Count
1 NC_031985.1 16255093 16255094     1
2 NC_031972.1 11505205 11505206     1
3 NC_031971.1 24441227 24441228     1
4 NC_031977.1 29030540 29030541     1
5 NC_031969.1   595867   595868     1
6 NC_031986.1 40147812 40147813     1

I also have this data.frame that maps chromosome names to those identifiers:

LG1     NC_031965.1
LG2     NC_031966.1
LG3a    NC_031967.1
LG3b    NC_031968.1
LG4     NC_031969.1
LG5     NC_031970.1
LG6     NC_031971.1
LG7     NC_031972.1
LG8     NC_031973.1
LG9     NC_031974.1
LG10    NC_031975.1
LG11    NC_031976.1
LG12    NC_031977.1
LG13    NC_031978.1
LG14    NC_031979.1
LG15    NC_031980.1
LG16    NC_031987.1
LG17    NC_031981.1
LG18    NC_031982.1
LG19    NC_031983.1
LG20    NC_031984.1
LG22    NC_031985.1
LG23    NC_031986.1

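I would like to replace every value in the Chromosome column of the large data.frame with the corresponding chromosome name listed above, to get: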

   Chromosome    Start      End Count
1 LG22        16255093 16255094     1
2 LG7         11505205 11505206     1
3 LG6         24441227 24441228     1
4 LG12        29030540 29030541     1
5 LG4           595867   595868     1
6 LG23        40147812 40147813     1

Does anyone know the least painful way to do this? It might be easy (or not), but my experience with R is limited.

Thanks a lot!

1 answer:

Answer 0 (score: 0)

As discussed in the comments, here is the dplyr solution people were looking for:

library(dplyr)
df %>%
  inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
  select(Chromosome = V1, Start, End, Count) 

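This gives a warning that the two join columns are factors with different levels. You can ignore it and work with the resulting character column, or convert the joined column back to a factor like this (on R 4.0.0 and later, read.table returns character columns by default, so the columns are not factors in the first place):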

df %>%
  inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
  select(Chromosome = V1, Start, End, Count) %>%
  mutate(Chromosome = as.factor(Chromosome))

Here is a Base R solution:

merged = merge(df, chromo_names, 
               by.x = "Chromosome", 
               by.y = "V2", 
               sort = FALSE)

# Column 5 is V1 (the LG name): move it to the front and drop the old identifier column
merged = merged[c(5,2:4)]
names(merged)[1] = "Chromosome"

Result:

  Chromosome    Start      End Count
1       LG22 16255093 16255094     1
2        LG7 11505205 11505206     1
3        LG6 24441227 24441228     1
4       LG12 29030540 29030541     1
5        LG4   595867   595868     1
6       LG23 40147812 40147813     1
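
Another base R option is a direct lookup with match(), which leaves the rows of df in their original order and needs no column reshuffling afterwards. The following is only a minimal sketch: it uses small stand-in data frames with the same column names as in this answer (df, chromo_names, V1, V2) and assumes the key columns are character vectors, so nothing about factor levels comes into play.

```r
# Small stand-ins for the data in this answer (same column names).
df <- data.frame(
  Chromosome = c("NC_031985.1", "NC_031972.1", "NC_031969.1"),
  Start = c(16255093, 11505205, 595867),
  End   = c(16255094, 11505206, 595868),
  Count = c(1, 1, 1),
  stringsAsFactors = FALSE
)
chromo_names <- data.frame(
  V1 = c("LG4", "LG7", "LG22"),
  V2 = c("NC_031969.1", "NC_031972.1", "NC_031985.1"),
  stringsAsFactors = FALSE
)

# match() gives, for each row of df, the position of its identifier in V2.
# Indexing V1 with those positions returns the LG name (NA where nothing matches).
idx <- match(df$Chromosome, chromo_names$V2)
df$Chromosome <- chromo_names$V1[idx]

print(df)
```

If the columns were read in as factors, wrapping the right-hand side in as.character() keeps the Chromosome column a plain character vector.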

Data:

df = read.table(text = "   Chromosome    Start  End Count
                1 NC_031985.1 16255093 16255094     1
                2 NC_031972.1 11505205 11505206     1
                3 NC_031971.1 24441227 24441228     1
                4 NC_031977.1 29030540 29030541     1
                5 NC_031969.1   595867   595868     1
                6 NC_031986.1 40147812 40147813     1", header = TRUE)

chromo_names = read.table(text = "LG1     NC_031965.1
                         LG2     NC_031966.1
                         LG3a    NC_031967.1
                         LG3b    NC_031968.1
                         LG4     NC_031969.1
                         LG5     NC_031970.1
                         LG6     NC_031971.1
                         LG7     NC_031972.1
                         LG8     NC_031973.1
                         LG9     NC_031974.1
                         LG10    NC_031975.1
                         LG11    NC_031976.1
                         LG12    NC_031977.1
                         LG13    NC_031978.1
                         LG14    NC_031979.1
                         LG15    NC_031980.1
                         LG16    NC_031987.1
                         LG17    NC_031981.1
                         LG18    NC_031982.1
                         LG19    NC_031983.1
                         LG20    NC_031984.1
                         LG22    NC_031985.1
                         LG23    NC_031986.1", header = FALSE)
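
One thing worth checking before joining: both inner_join() and merge() with its default all = FALSE drop rows whose identifier has no entry in the mapping table. A small sketch of such a check in base R follows; the vectors ids and mapping are hypothetical stand-ins for df$Chromosome and chromo_names$V2, and the identifier NC_099999.1 is made up purely to show a miss.

```r
# Hypothetical stand-ins for df$Chromosome and chromo_names$V2.
ids     <- c("NC_031985.1", "NC_031972.1", "NC_099999.1")  # last one is invented
mapping <- c("NC_031985.1", "NC_031972.1", "NC_031969.1")

# Identifiers present in the data but absent from the mapping table.
missing_ids <- setdiff(ids, mapping)

if (length(missing_ids) > 0) {
  message("No chromosome name for: ", paste(missing_ids, collapse = ", "))
}
```

With the real data, the same check would be setdiff(as.character(df$Chromosome), as.character(chromo_names$V2)); an empty result means no rows are lost in the join.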