R efficiency challenge: splitting a long character vector

Time: 2019-03-18 22:35:53

Tags: r regex string performance

The problem is to efficiently parse data in this format:

lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

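into a two-column data frame: one column for the position and one for the player.

The names are baseball players, and each name is preceded by the player's position. The positions follow the exact sequence {C,P,P,OF,3B,SS,1B,OF,2B,OF}. That is, exactly these positions will always occur.

For example, " C James McCann" should become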

data.frame(position = "C", player = "James McCann")

In practice I have thousands of strings like this, and I would like to parse them efficiently. Here is my not-so-efficient solution:

library(stringr)   # str_match_all(), str_split(), str_trim()
library(magrittr)  # %>%

data.frame(
    position = str_match_all(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]] %>% as.character() %>% str_trim(),
    player = str_split(lineup, "\\s[0-9A-Z]{1,2}\\s")[[1]][-1],
    stringsAsFactors = FALSE
)

This tidyverse solution is simple, but I suspect I can do much better. Does anyone have any ideas?

3 answers:

Answer 0: (score: 3)

You can use stringi::stri_match_all_regex with a single pattern that captures both the position and the player name:

library(stringi)
results <- stri_match_all_regex(lineup,
                   pattern = "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)")
results
[[1]]
      [,1]                   [,2] [,3]               
 [1,] "C James McCann"       "C"  "James McCann"     
 [2,] "P Robbie Ray"         "P"  "Robbie Ray"       
 [3,] "P Rafael Montero"     "P"  "Rafael Montero"   
 [4,] "OF Giancarlo Stanton" "OF" "Giancarlo Stanton"
 [5,] "3B Derek Dietrich"    "3B" "Derek Dietrich"   
 [6,] "SS Miguel Rojas"      "SS" "Miguel Rojas"     
 [7,] "1B Tommy Joseph"      "1B" "Tommy Joseph"     
 [8,] "OF Marcell Ozuna"     "OF" "Marcell Ozuna"    
 [9,] "OF Christian Yelich"  "OF" "Christian Yelich" 

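Because my pattern restricts the one or two characters between spaces to combinations that are actual baseball positions, it is stricter than yours. The result is a list with one matrix per input string. You should probably post a more complex example to show what further processing is needed. To get data frames, drop the first column (the full match) from each matrix, where results holds the list returned above: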

lapply( results, function(x){ as.data.frame(x[ , -1]) })
[[1]]
  V1                V2
1  C      James McCann
2  P        Robbie Ray
3  P    Rafael Montero
4 OF Giancarlo Stanton
5 3B    Derek Dietrich
6 SS      Miguel Rojas
7 1B      Tommy Joseph
8 OF     Marcell Ozuna
9 OF  Christian Yelich

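If you need to handle hyphenated names, middle names, or initials, the pattern will probably need to be more complex. Note that the output above has only nine rows: the 2B entry "C?sar Hern?ndez" is not matched, because its name contains characters outside [A-Za-z].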
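Since the question mentions thousands of strings, the following is a minimal sketch of how the per-string matrices might be combined into one data frame. It relies on stri_match_all_regex being vectorized over its input. The vector lineups, the column lineup_id, and the shortened input strings are assumptions made for illustration; they are not part of the original answer.

```r
library(stringi)

# Hypothetical input: two shortened lineup strings
lineups <- c(
  " C James McCann P Robbie Ray",
  " SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna"
)

pattern <- "(C|P|OF|3B|SS|1B|2B) ([A-Z][A-Za-z]+ [A-Z][A-Za-z]+)"

# One match matrix per input string
results <- stri_match_all_regex(lineups, pattern)

# Build one data frame per string, tagged with the index of its source string
parsed <- lapply(seq_along(results), function(i) {
  data.frame(
    lineup_id = i,
    position  = results[[i]][, 2],  # first capture group
    player    = results[[i]][, 3],  # second capture group
    stringsAsFactors = FALSE
  )
})

# Stack everything into a single data frame
combined <- do.call(rbind, parsed)
```

Here combined has one row per matched player, and lineup_id records which input string the row came from.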

Answer 1: (score: 2)

Here is a stringr::str_split option that uses a positive look-behind and a positive look-ahead:

pos <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF")
pat <- sprintf("(%s)", paste(pos, collapse = "|"))

library(stringr)
matrix(unlist(str_split(trimws(lineup), sprintf(
    "((?<=(%s))\\s|\\s(?=(%s)))", pat, pat))), ncol = 2, byrow = TRUE)
#    [,1] [,2]
#[1,] "C"  "James McCann"
#[2,] "P"  "Robbie Ray"
#[3,] "P"  "Rafael Montero"
#[4,] "OF" "Giancarlo Stanton"
#[5,] "3B" "Derek Dietrich"
#[6,] "SS" "Miguel Rojas"
#[7,] "1B" "Tommy Joseph"
#[8,] "OF" "Marcell Ozuna"
#[9,] "2B" "C?sar Hern?ndez"
#[10,] "OF" "Christian Yelich"

I don't know whether this covers all corner cases. A more complex, more representative example string would help with testing.
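One cheap way to catch corner cases is to use the fact, stated in the question, that the positions always occur in the same fixed order. The following is a minimal base R sketch of such a check. The helper positions_ok and the placeholder player names are hypothetical and are not part of the original answer.

```r
# Known, fixed sequence of positions (taken from the question)
pos <- c("C", "P", "P", "OF", "3B", "SS", "1B", "OF", "2B", "OF")

# Hypothetical helper: TRUE only if the first column of `parsed`
# reproduces the expected positions exactly and in order
positions_ok <- function(parsed, expected = pos) {
  nrow(parsed) == length(expected) && all(parsed[, 1] == expected)
}

# A correctly parsed result (player names replaced by placeholders)
good <- cbind(pos, paste("Player", seq_along(pos)))

# A result that lost one row, for example because a name was not matched
bad <- good[-9, ]

positions_ok(good)  # TRUE
positions_ok(bad)   # FALSE
```

Running such a check on every parsed string would flag inputs where a name was split in the wrong place or skipped.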

Answer 2: (score: 2)

Here is a solution that converts lineup into a string in CSV file format, which is then read by fread():

library(magrittr)  # piping used to improve readability
lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C      James McCann
 2:        P        Robbie Ray
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

The "trick" is to insert a line break before each position token and a column separator after it, so that " C " becomes "\nC;".

lineup %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;")

returns

[1] "\nC;James McCann\nP;Robbie Ray\nP;Rafael Montero\nOF;Giancarlo Stanton\n3B;Derek Dietrich\nSS;Miguel Rojas\n1B;Tommy Joseph\nOF;Marcell Ozuna\n2B;C?sar Hern?ndez\nOF;Christian Yelich"

This approach makes few assumptions about the names. It even works with names like James P. McCann or Robbie Ray, Jr.

lineup2 %>% 
  stringr::str_replace_all("\\s(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\1;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P  Rafael D Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich

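Three prerequisites must be met:

  1. The name part must not contain any initial that is also used as a position indicator; for example, the initials C and P must be followed by a dot to avoid confusion.
  2. The column separator ; must not appear anywhere else in lineup.
  3. The string must start with a leading space.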
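To illustrate condition 1, here is a minimal base R sketch. It assumes gsub() with perl = TRUE as a stand-in for stringr::str_replace_all(), and the two shortened input strings and the helper count_records are made up for the demonstration.

```r
pattern <- "\\s(C|P|OF|SS|1B|2B|3B)\\s"

# Initial followed by a dot: "P." is not followed by whitespace, so it is left alone
with_dot <- gsub(pattern, "\n\\1;", " C James P. McCann P Robbie Ray", perl = TRUE)

# Initial without a dot: " P " inside the name is mistaken for a position
without_dot <- gsub(pattern, "\n\\1;", " C James P McCann P Robbie Ray", perl = TRUE)

# Hypothetical helper: count the records by counting the inserted line breaks
count_records <- function(x) lengths(regmatches(x, gregexpr("\n", x, fixed = TRUE)))

count_records(with_dot)     # 2 records, as intended
count_records(without_dot)  # 3 records: "James" and "McCann" were split apart
```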

Condition 3 can be waived by using an improved regular expression, and condition 2 can be checked:

lineup3 %T>% 
  {stopifnot(!stringr::str_detect(., ";"))} %>% 
  stringr::str_replace_all("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s", "\\\n\\2;") %>% 
  data.table::fread(header = FALSE, col.names = c("position", "player"))
    position            player
 1:        C   James P. McCann
 2:        P    Robbie Ray, Jr
 3:        P    Rafael Montero
 4:       OF Giancarlo Stanton
 5:       3B    Derek Dietrich
 6:       SS      Miguel Rojas
 7:       1B      Tommy Joseph
 8:       OF     Marcell Ozuna
 9:       2B   C?sar Hern?ndez
10:       OF  Christian Yelich
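For readers without stringr or data.table, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the helper parse_lineup is hypothetical, gsub() and read.table() are swapped in for str_replace_all() and fread(), and no claim is made about relative speed.

```r
# Hypothetical helper: base R version of the newline/separator trick
parse_lineup <- function(x, sep = ";") {
  # Condition 2: the separator must not already occur in the input
  stopifnot(!grepl(sep, x, fixed = TRUE))
  # Put each position on its own line, followed by the separator;
  # "^\\s?" also accepts a string without a leading space (condition 3)
  csv <- gsub("(^\\s?|\\s)(C|P|OF|SS|1B|2B|3B)\\s",
              paste0("\n\\2", sep), x, perl = TRUE)
  csv <- sub("^\n", "", csv)  # drop the empty first line
  read.table(text = csv, sep = sep, header = FALSE,
             col.names = c("position", "player"),
             colClasses = "character", quote = "", comment.char = "")
}

# Shortened, made-up input in the style of lineup3 (no leading space)
parsed <- parse_lineup("C James P. McCann P Robbie Ray, Jr OF Marcell Ozuna")
```

Setting quote = "" and comment.char = "" keeps apostrophes and # characters in names from being interpreted by read.table(). An input containing ; makes the function stop with an error instead of silently producing extra columns.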

Data

# original
lineup = " C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"

# other use cases
lineup1 = "C James McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2 = " C James P. McCann P Robbie Ray, Jr P Rafael D Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2a = " C James P. McCann P Robbie Ray P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup2b = " C James McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup3 = "C James P. McCann P Robbie Ray, Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"
lineup4 = " C James P. McCann P Robbie Ray; Jr P Rafael Montero OF Giancarlo Stanton 3B Derek Dietrich SS Miguel Rojas 1B Tommy Joseph OF Marcell Ozuna 2B C?sar Hern?ndez OF Christian Yelich"