How to find the row numbers of similar strings

Time: 2019-02-19 21:43:12

Tags: r

My data is large, but I want to know the row numbers of the similar strings.

df<- structure(list(x = structure(c(5L, 5L, 5L, 5L, 1L, 1L, 3L, 5L, 
5L, 6L, 6L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 3L), .Label = c("AJ5ter2", 
"al-1Tter2", "AY9ter2", "CY-Yter2", "LK2ter2", "YY49ter2"), class = "factor")), class = "data.frame", row.names = c(NA, 
-19L))

The expected output looks like this:

LK2ter2  1:4, 8:9
AJ5ter2  5:6
AY9ter2  7, 19
YY49ter2 10:11
al-1Tter2 12:15
CY-Yter2 16:18

4 answers:

Answer 0 (score: 3)

Another option, using data.table:

library(data.table)
DT <- as.data.table(df)
DT[, .(index = paste(unique(range(.I)), collapse = ":")), by = .(x, rleid(x))
   ][, .(index = toString(index)), by = x]
#           x    index
#1:   LK2ter2 1:4, 8:9
#2:   AJ5ter2      5:6
#3:   AY9ter2    7, 19
#4:  YY49ter2    10:11
#5: al-1Tter2    12:15
#6:  CY-Yter2    16:18
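
The `paste(unique(range(.I)), collapse = ":")` idiom may be easier to follow in isolation. The base R sketch below uses two made-up row-ID vectors (`single` and `run` are illustrative names, not part of the answer) to show how `unique()` drops the duplicate when a run has only one row:

```r
single <- 7L    # a run consisting of one row
run    <- 1:4   # a run covering rows 1 to 4

# range() returns c(min, max); for a single row both values are equal,
# so unique() keeps just one of them and paste() has nothing to join.
single_label <- paste(unique(range(single)), collapse = ":")
run_label    <- paste(unique(range(run)), collapse = ":")

single_label  # "7"
run_label     # "1:4"
```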

Answer 1 (score: 2)

You can try the following:

z <- sapply(levels(df$x), function(x) which(x == df$x))
data.frame(key = names(z), index = sapply(z, paste, collapse = ", "), row.names = NULL)

        key            index
1   AJ5ter2             5, 6
2 al-1Tter2   12, 13, 14, 15
3   AY9ter2            7, 19
4  CY-Yter2       16, 17, 18
5   LK2ter2 1, 2, 3, 4, 8, 9
6  YY49ter2           10, 11
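
One caveat with `sapply()` here: if every key happened to occur the same number of times, it would simplify the result to a matrix instead of a list. A minimal base R sketch that sidesteps this uses `split()`, which always returns a list. The `keys` vector and the name `out` are made up for illustration and are not from the answer:

```r
# Toy data: "a" occurs at rows 1, 2, 5 and "b" at rows 3, 4.
keys <- factor(c("a", "a", "b", "b", "a"))

# split() groups the row positions by key and always returns a list.
z <- split(seq_along(keys), keys)

# vapply() guarantees exactly one character string per key.
out <- data.frame(key = names(z),
                  index = vapply(z, paste, character(1), collapse = ", "),
                  row.names = NULL)
out
```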

Answer 2 (score: 2)

Here is one approach using dplyr. I was not sure whether you want the output as text or as a numeric vector.

library(tidyverse)
df <- structure(list(x = structure(c(5L, 5L, 5L, 5L, 1L, 1L, 3L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 3L), .Label = c("AJ5ter2", "al-1Tter2", "AY9ter2", "CY-Yter2", "LK2ter2", "YY49ter2"), class = "factor")), class = "data.frame", row.names = c(NA, -19L))
df %>%
  mutate(row_number = row_number()) %>%
  group_by(x) %>%
  summarise(row_nums = str_c(row_number, collapse = ","))
#> # A tibble: 6 x 2
#>   x         row_nums   
#>   <fct>     <chr>      
#> 1 AJ5ter2   5,6        
#> 2 al-1Tter2 12,13,14,15
#> 3 AY9ter2   7,19       
#> 4 CY-Yter2  16,17,18   
#> 5 LK2ter2   1,2,3,4,8,9
#> 6 YY49ter2  10,11

Created on 2019-02-19 by the reprex package (v0.2.1)

Answer 3 (score: 2)

Using tidyverse and data.table, you can do the following:

library(tidyverse)
library(data.table)  # provides rleid()

df %>%
 rowid_to_column() %>%
 group_by(x, rleid(x)) %>%
 summarise(res = ifelse(min(rowid) != max(rowid), 
                        paste(min(rowid), max(rowid), sep = ":"), paste(rowid))) %>%
 group_by(x) %>%
 summarise(res = paste(res, collapse = ", "))

  x         res     
  <fct>     <chr>   
1 AJ5ter2   5:6     
2 al-1Tter2 12:15   
3 AY9ter2   7, 19   
4 CY-Yter2  16:18   
5 LK2ter2   1:4, 8:9
6 YY49ter2  10:11

Or the same with just tidyverse:

df %>%
 rowid_to_column() %>%
 group_by(x, x_rleid = {x_rleid = rle(as.numeric(x)); rep(seq_along(x_rleid$lengths), x_rleid$lengths)}) %>%
 summarise(res = ifelse(min(rowid) != max(rowid), 
                        paste(min(rowid), max(rowid), sep = ":"), paste(rowid))) %>%
 group_by(x) %>%
 summarise(res = paste(res, collapse = ", "))

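Both snippets first add a column with the row ID. Second, they group by "x" and by the run-length group ID of "x". Third, they check whether the minimum row ID differs from the maximum row ID. If it does, the minimum and maximum row IDs are combined, separated by ":"; otherwise only the single row ID is used. Finally, they group by "x" alone and join the pieces with ", ".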
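
The steps described above can also be traced in base R without either package. This is only a sketch, assuming the column can be treated as a character vector; `collapse_runs` is a hypothetical helper name introduced here, not a function from dplyr or data.table:

```r
collapse_runs <- function(x) {
  x <- as.character(x)
  r <- rle(x)                      # runs of identical neighbouring values
  end <- cumsum(r$lengths)         # last row ID of each run
  start <- end - r$lengths + 1L    # first row ID of each run
  # A run of length one keeps its single ID; longer runs become "min:max".
  label <- ifelse(start == end, as.character(start),
                  paste(start, end, sep = ":"))
  # Group the run labels by value (in order of first appearance) and join them.
  keys <- unique(r$values)
  res <- vapply(keys, function(k) paste(label[r$values == k], collapse = ", "),
                character(1))
  data.frame(x = keys, res = unname(res))
}

# Toy input: "a" has two runs (rows 1-2 and row 4), "c" has one (rows 5-7).
collapse_runs(c("a", "a", "b", "a", "c", "c", "c"))
```

Calling `collapse_runs(df$x)` on the data from the question should give the same ranges as the outputs above, although the rows come back in order of first appearance rather than in factor-level order.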

Or, if you need all the values and not just the ranges:

df %>%
 rowid_to_column() %>%
 group_by(x, x_rleid = {x_rleid = rle(as.numeric(x)); rep(seq_along(x_rleid$lengths), x_rleid$lengths)}) %>%
 summarise(res = paste(rowid, collapse = ",")) %>%
 group_by(x) %>%
 summarise(res = paste(res, collapse = ","))

  x         res        
  <fct>     <chr>      
1 AJ5ter2   5,6        
2 al-1Tter2 12,13,14,15
3 AY9ter2   7,19       
4 CY-Yter2  16,17,18   
5 LK2ter2   1,2,3,4,8,9
6 YY49ter2  10,11