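Matching part of a pattern to a string

Time: 2018-10-15 23:13:54

Tags: r merge pattern-matching

I have two data frames that I'd like to match and merge. Initially I used inner_join to merge them, but realised that the matching step was not matching correctly.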

I found an example that seems to go in the right direction: "How to merge two data frame based on partial string match with R?". One answer there suggested this code:

idx2 <- sapply(df_mouse_human$Protein.IDs, grep, df_mouse$Protein.IDs)
idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))
merged <- cbind(df_mouse_human[unlist(idx1),,drop=F], df_mouse[unlist(idx2),,drop=F])

However, it doesn't do what I need. The problem is that the dataset to be used as the pattern has longer strings than the ones I'm matching against, so I get no matches. Let me show a subset of the data:

dput(droplevels(df_mouse))
structure(list(Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0", "A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8",
"A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6", "Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15379-2;P15379-3;P15379-6;P15379-11;P15379-5;P15379-10;P15379-9;P15379-4;P15379-8;P15379-7;P15379;P15379-12;P15379-13",
"A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78", "A2AUR7;Q9D031;Q01730"
), Replicate = c(2L, 2L, 2L, 2L, 2L, 2L), Ratio.H.L.normalized.01 = c(NaN,
NaN, NaN, NaN, NaN, NaN), Ratio.H.L.normalized.02 = c(NaN, NaN,
NaN, NaN, NaN, NaN), Ratio.H.L.normalized.03 = c(NaN, NaN, NaN,
NaN, NaN, NaN)), .Names = c("Protein.IDs", "Replicate", "Ratio.H.L.normalized.01",
"Ratio.H.L.normalized.02", "Ratio.H.L.normalized.03"), row.names = 12:17, class = "data.frame")

dput(droplevels(df_mouse_human))
structure(list(Human = c("Q8WZ42", "Q8NF91", "Q9UPN3", "Q96RW7",
"Q8WXG9", "P20929", "Q5T4S7", "O14686", "Q2LD37", "Q92736"),
Protein.IDs = c("A2ASS6", "Q6ZWR6", "Q9QXZ0", "D3YXG0", "Q8VHN7",
"E9Q1W3", "A2AN08", "Q6PDK2", "A2AAE1", "E9Q401")), .Names = c("Human",
"Protein.IDs"), row.names = c(NA, 10L), class = "data.frame")

So, I'd like to match the Protein.IDs in df_mouse_human to wherever they occur in df_mouse. In the example data, I'm trying to match A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78 to the entry A2ASS6. If I do it the other way around it works well, but is there a way for it to return TRUE when only part of the pattern matches the query?
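A minimal base R sketch of why the direction matters, using two shortened vectors (the names `long_ids` and `short_ids` are illustrative and not part of the original data frames). grepl() treats its first argument as the pattern, so a long semicolon-delimited string can never be found inside a shorter ID. Splitting on ";" and testing membership with %in% avoids that:

```r
# Illustrative vectors; the names are assumptions, the IDs come from the example data
long_ids  <- c("A2ASS6;E9Q8N1;E9Q8K5", "A2AUR7;Q9D031;Q01730")
short_ids <- c("A2ASS6", "Q6ZWR6")

# Long string used as the pattern: it cannot occur inside a shorter ID
long_as_pattern <- grepl(long_ids[1], short_ids, fixed = TRUE)

# Split each long string and ask whether any piece is a known short ID
pieces  <- strsplit(long_ids, ";", fixed = TRUE)
has_hit <- vapply(pieces, function(p) any(p %in% short_ids), logical(1))
has_hit
```

Because %in% compares whole IDs, this also avoids accidental substring hits such as A2ASS6 matching inside A2ASS6-2.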

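My longer-term aim is to match and merge the data so that df_mouse gains a new column with the matched human protein IDs; where there is no match, I'll replace the NA values with the original mouse ID string.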
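One way that goal could look in base R, as a sketch only: `mouse` and `lookup` are small stand-ins for df_mouse and df_mouse_human, `first_human` is a hypothetical helper, and taking only the first matching ID per row is an assumption rather than a requirement stated above.

```r
# Toy stand-ins for df_mouse and df_mouse_human (IDs taken from the example data)
mouse <- data.frame(
  Protein.IDs = c("Q8CBM2;A2AL85;Q8BSY0", "A2ASS6;E9Q8N1;E9Q8K5"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  Human       = c("Q8WZ42", "Q8NF91"),
  Protein.IDs = c("A2ASS6", "Q6ZWR6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the Human ID of the first split ID found in lookup
first_human <- function(ids) {
  hit <- match(strsplit(ids, ";", fixed = TRUE)[[1]], lookup$Protein.IDs)
  hit <- hit[!is.na(hit)]
  if (length(hit) == 0) NA_character_ else lookup$Human[hit[1]]
}
mouse$Human <- vapply(mouse$Protein.IDs, first_human, character(1), USE.NAMES = FALSE)

# Fall back to the original mouse ID string where nothing matched
mouse$Human <- ifelse(is.na(mouse$Human), mouse$Protein.IDs, mouse$Human)
mouse
```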

Thanks

2 answers:

Answer 0: (score: 2)

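One method I typically use for partial matching is to reduce the more complex field to something simpler. Sometimes this only involves removing extra characters (for example, if the rule is "match on only the first four characters", I'd create a new index column from substr(idcol, 1, 4) and join on that), but in this case it involves breaking one string into several.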
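The simpler "first four characters" case might look like the following base R sketch. The frames `left` and `right`, the column `key4`, and all values are made up for illustration:

```r
# Hypothetical frames: `idcol` holds long codes, `key4` holds four-character keys
left  <- data.frame(idcol = c("ABCD-001", "WXYZ-002", "QRST-003"),
                    stringsAsFactors = FALSE)
right <- data.frame(key4  = c("ABCD", "QRST"),
                    label = c("first", "third"),
                    stringsAsFactors = FALSE)

# Derive the simpler key, then do an ordinary exact join on it
left$key4 <- substr(left$idcol, 1, 4)
joined <- merge(left, right, by = "key4", all.x = TRUE)
joined
```

With all.x = TRUE the row without a counterpart in `right` is kept and its `label` is NA, which mirrors a left join.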

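This means associating each semicolon-delimited ID with its full string, which makes the intermediate frame taller (sometimes much taller) than the original data.

(Here df1 stands for the question's df_mouse and df2 for df_mouse_human. For demonstration and aesthetic purposes, I'm modifying df1 to drop the other, unchanging columns and adding a row-number column to stand in for "other data".)

I'm using dplyr and tidyr, so: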

library(dplyr)
library(tidyr)
df1 <- select(df1, Protein.IDs) %>%
  mutate(other = row_number())

First, I break the 6-row frame out into a much taller one:

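# as_tibble() replaces the deprecated tbl_df(); naming the list column
# in unnest() avoids a warning in newer tidyr versions
df1ids <- as_tibble(df1) %>%
  select(Protein.IDs) %>%
  mutate(eachID = strsplit(Protein.IDs, ";")) %>%
  unnest(eachID)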
df1ids
# # A tibble: 46 x 2
#    Protein.IDs                                        eachID  
#    <chr>                                              <chr>   
#  1 Q8CBM2;A2AL85;Q8BSY0                               Q8CBM2  
#  2 Q8CBM2;A2AL85;Q8BSY0                               A2AL85  
#  3 Q8CBM2;A2AL85;Q8BSY0                               Q8BSY0  
#  4 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH3  
#  5 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH5  
#  6 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH4  
#  7 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        Q6X893  
#  8 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        Q6X893-2
#  9 A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8        A2AMH8  
# 10 A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6 A2AMW0  
# # ... with 36 more rows

Notice that the first row, which held three IDs, has become three rows with one ID each. We'll use "eachID" to join on.

left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
#   Human  Protein.IDs                                                  other
#   <chr>  <chr>                                                        <int>
# 1 <NA>   Q8CBM2;A2AL85;Q8BSY0                                             1
# 2 <NA>   A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8                      2
# 3 <NA>   A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6               3
# 4 <NA>   Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~     4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                      5
# 6 <NA>   A2AUR7;Q9D031;Q01730                                             6

If you happen to have more than one Human match for a single Protein.IDs row, things change a little.

df2$Protein.IDs[2] <- "E9Q8K5"
left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 7 x 3
#   Human  Protein.IDs                                                  other
#   <chr>  <chr>                                                        <int>
# 1 <NA>   Q8CBM2;A2AL85;Q8BSY0                                             1
# 2 <NA>   A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8                      2
# 3 <NA>   A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6               3
# 4 <NA>   Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM2;P15~     4
# 5 Q8WZ42 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                      5
# 6 Q8NF91 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                      5
# 7 <NA>   A2AUR7;Q9D031;Q01730                                             6

Notice how you now have two copies of other 5? That is probably not what you want. However, if you intend to stay with the semicolon-delimited theme, then:

left_join(df1ids, df2, by = c("eachID" = "Protein.IDs")) %>%
  filter(complete.cases(.)) %>%
  group_by(Protein.IDs) %>%
  summarize(Human = paste(Human, collapse = ";")) %>%
  select(Human, Protein.IDs) %>%
  right_join(df1)
# Joining, by = "Protein.IDs"
# # A tibble: 6 x 3
#   Human       Protein.IDs                                             other
#   <chr>       <chr>                                                   <int>
# 1 <NA>        Q8CBM2;A2AL85;Q8BSY0                                        1
# 2 <NA>        A2AMH3;A2AMH5;A2AMH4;Q6X893;Q6X893-2;A2AMH8                 2
# 3 <NA>        A2AMW0;P47757-2;A2AMV7;P47757;F6QJN8;F6YHZ8;F7CAZ6          3
# 4 <NA>        Q3U8S1;A2APM5;A2APM3;A2APM4;E9QKM8;Q80X37;A2APM1;A2APM~     4
# 5 Q8WZ42;Q8N~ A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78                 5
# 6 <NA>        A2AUR7;Q9D031;Q01730                                        6

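Answer 1: (score: 1)

@r2evans raises a good question about how to handle multiple matches. I may need to edit this answer once that is settled, but here is a quick solution. First we split the string of possible IDs, then we find which IDs match in the other data frame, and finally we join on the row index of the match.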

library(tidyverse)

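# `row` must index df_mouse_human, so test its IDs against the split mouse IDs;
# `.x %in% df_mouse_human$Protein.IDs` would give positions within .x instead
df_mouse %>% mutate(all_id = str_split(Protein.IDs, ";"),
                    row = map(all_id, ~ which(df_mouse_human$Protein.IDs %in% .x))) %>%
  unnest(row) %>%
  list(., df_mouse_human %>% rownames_to_column("row") %>% mutate(row = as.numeric(row))) %>%
  reduce(left_join, by = "row")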
#>                                 Protein.IDs.x Replicate
#> 1 A2ASS6;E9Q8N1;E9Q8K5;A2ASS6-2;A2AT70;F7CR78         2
#>   Ratio.H.L.normalized.01 Ratio.H.L.normalized.02 Ratio.H.L.normalized.03
#> 1                     NaN                     NaN                     NaN
#>   row  Human Protein.IDs.y
#> 1   1 Q8WZ42        A2ASS6
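A small base R sketch of the index direction that a join on row number relies on. The vectors `ids` and `lookup_ids` are illustrative, with the lookup deliberately ordered so that the two directions give different answers:

```r
ids        <- c("A2ASS6", "E9Q8N1")            # split IDs from one mouse row
lookup_ids <- c("Q6ZWR6", "Q9QXZ0", "A2ASS6")  # hypothetical lookup column

pos_in_ids    <- which(ids %in% lookup_ids)    # position within ids
row_in_lookup <- which(lookup_ids %in% ids)    # row within the lookup table
c(pos_in_ids, row_in_lookup)
```

Only the second form identifies a row of the lookup table. In the example data the two happen to coincide, because A2ASS6 is both the first split ID and the first lookup row.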