Proper capitalization of name strings mixed with company names

Time: 2015-05-22 21:31:58

Tags: regex r

I have a list of owner names in all caps that I'd like to convert to proper capitalization:

                   owner1
 1:    DXXXXX JOSEPH V JR
 2:          MIRNA NXXXXX
 3:          ADRIAN TXXXX
 4: CUTLER PXXXXXXXXX LLC
 5:    GVM PXXXXXXXXX LLC
 6:      EARLENA RXXXXXXX
 7:      NATHANIEL TXXXXX
 8:         DXXXXXX DONNA
 9:     LXXXX ELAINE E TR
10:      SXXXXXX KIMBERLY

For copy-paste purposes:

 owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
           "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
           "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
           "LXXXX ELAINE E TR","SXXXXXX KIMBERLY")

Desired output:

                   owner1
 1:   Dxxxxx Joseph V. Jr
 2:          Mirna Nxxxxx
 3:          Adrian Txxxx
 4: Cutler Pxxxxxxxxx LLC
 5:    GVM Pxxxxxxxxx LLC
 6:      Earlena Rxxxxxxx
 7:      Nathaniel Txxxxx
 8:         Dxxxxxx Donna
 9:    Lxxxx Elaine E. TR
10:      Sxxxxxx Kimberly

A big first step is this version of the .simpleCap function mentioned in ?chartr:

.simpleCap <- function(x) {
    s <- strsplit(tolower(x), " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}

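This solves a large part of the problem, but fails on 4, 5 and 9. I could supplement it to handle key phrases (LLC, TR, etc.) separately, but that still leaves cases like observation 5.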
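To make the failure concrete, here is a minimal sketch that runs .simpleCap over three of the rows. Because the function indexes [[1]] after strsplit, it handles one string at a time, so vapply is used here as one possible way to loop over the vector.

```r
# .simpleCap as given above (one string at a time)
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Rows 2, 5 and 9 from the sample data
sample_owners <- c("MIRNA NXXXXX", "GVM PXXXXXXXXX LLC", "LXXXX ELAINE E TR")

# Apply element by element, dropping the names vapply would attach
res <- vapply(sample_owners, .simpleCap, character(1), USE.NAMES = FALSE)
print(res)
# "Mirna Nxxxxx" "Gvm Pxxxxxxxxx Llc" "Lxxxx Elaine E Tr"
```

The plain name comes out right, while GVM, LLC and TR are lower-cased along with everything else.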

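Here is the function I'm using so far (the .simpleCap step was sped up quite cleverly by @eipi10's solution below, which lets the whole function be applied to a vector):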

to.proper <- function(strings) {
  # Vectorized version of .simpleCap;
  #   I've also built in that I know `strings` is all caps
  res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
  # In my data, some Irish/Scottish names separated the MC prefix
  #   Also, re-capitalize following a hyphen
  res <- gsub("\\bMc\\s", "Mc", gsub("(-.)", "\\U\\1", res, perl = TRUE))
  for (init in c("[A-Z]", "Inc", "Assoc", "Co",
                 "Jr", "Sr", "Tr", "Bros")) {
    # Add a period after common abbreviations
    res <- gsub(paste0("\\b(", init, ")\\b"), "\\1.", res)
  }
  for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                 "Pa", "Ii", "Iii", "Iv", "Lp", "Tj",
                 "Xiv", "Ll", "Yml", "Us")) {
    # Re-capitalize any string of >=3 consonants (excluding
    #   Y for such names as LYNN and WYNN), as well as
    #   some other common phrases that need upper-casing
    res <- gsub(paste0("\\b(", abbr, ")\\b"), "\\U\\1", res, perl = TRUE)
  }
  # Re-capitalize post-Mc letters, e.g. in Mcmahon
  gsub("\\bMc([a-z])", "Mc\\U\\1", res, perl = TRUE)
}
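To make the two central steps easier to follow in isolation, here is a minimal sketch of the title-casing pass followed by the consonant-run pass. The sample names are made up for illustration, and the pattern strings are copied from the function above.

```r
# Made-up sample input (all caps, as in the real data)
x <- c("MIRNA LYNN", "GVM HOLDINGS LLC")

# Step 1: upper-case the first letter of each word, lower-case the rest.
# \\U and \\L in the replacement require perl = TRUE.
step1 <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", x, perl = TRUE)
print(step1)
# "Mirna Lynn" "Gvm Holdings Llc"

# Step 2: re-capitalize whole words made of three or more consonants.
# Y is left out of the classes, so names such as Lynn are not touched.
consonants <- "\\b([B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,})\\b"
step2 <- gsub(consonants, "\\U\\1", step1, perl = TRUE)
print(step2)
# "Mirna Lynn" "GVM Holdings LLC"
```

Note that the anonymized names in the sample data (e.g. Nxxxxx) consist of consonants only after the first letter, so this heuristic would upper-case them as well; real surnames containing vowels are not affected.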

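Any ideas for a robust approach, one that leaves potentially unpredictable abbreviations alone in the process (especially uncommon ones like the one in observation 5)?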

1 Answer:

Answer 0: (score: 2)

Here is a function that uses regular expressions to convert strings to title case (adapted from @BenBolker's answer to a question I asked on SO a while back).

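The function is written so that you can pass an argument called exceptions to handle special cases like GVM. I'm not sure this is flexible enough for your needs, since you have to hard-code the exceptions, but I figured I'd post it and see whether anyone can suggest improvements.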

dat = data.frame(owner1 = c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
                            "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
                            "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
                            "LXXXX ELAINE E TR","SXXXXXX KIMBERLY"))

# Convert a string to title case
tc = function(strings, exceptions="\\b(gvm)\\b") {

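  # Convert to title case, excluding terminal LLC, TR, etc.
  # The \\b inside the optional group keeps " TR" from matching the
  # start of a longer word such as TRUMAN
  title.case = gsub("\\b([a-zA-Z])([a-zA-Z]+)*( (?:LLC|TR|FBO|LP)\\b)?",
                    "\\U\\1\\L\\2\\U\\3", strings, perl=TRUE)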

  # Add a period after initials (presumed to be any lone capital letter)
  title.case = gsub(" ([A-Z]) ", " \\1\\. ", title.case)

  # Deal with exceptions
  title.case = gsub(exceptions, "\\U\\1", title.case, perl=TRUE, ignore.case=TRUE)

  return(title.case)
}

dat$title.case = tc(dat$owner1)

                  owner1            title.case
1     DXXXXX JOSEPH V JR   Dxxxxx Joseph V. Jr
2           MIRNA NXXXXX          Mirna Nxxxxx
3           ADRIAN TXXXX          Adrian Txxxx
4  CUTLER PXXXXXXXXX LLC Cutler Pxxxxxxxxx LLC
5     GVM PXXXXXXXXX LLC    GVM Pxxxxxxxxx LLC
6       EARLENA RXXXXXXX      Earlena Rxxxxxxx
7       NATHANIEL TXXXXX      Nathaniel Txxxxx
8          DXXXXXX DONNA         Dxxxxxx Donna
9      LXXXX ELAINE E TR    Lxxxx Elaine E. TR
10      SXXXXXX KIMBERLY      Sxxxxxx Kimberly
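
One possible way to soften the hard-coding, offered as a sketch rather than a tested solution, is to build the exceptions pattern from a plain character vector. The helper name make_exceptions and the sample abbreviations and names are hypothetical; a condensed copy of tc is included only so the snippet runs on its own.

```r
# Condensed copy of tc() from above, so this snippet is self-contained
tc <- function(strings, exceptions = "\\b(gvm)\\b") {
  title.case <- gsub("\\b([a-zA-Z])([a-zA-Z]+)*( LLC| TR| FBO| LP)?",
                     "\\U\\1\\L\\2\\U\\3", strings, perl = TRUE)
  title.case <- gsub(" ([A-Z]) ", " \\1\\. ", title.case)
  gsub(exceptions, "\\U\\1", title.case, perl = TRUE, ignore.case = TRUE)
}

# Hypothetical helper: collapse a vector of abbreviations into a single
# whole-word pattern with one capture group, as tc() expects
make_exceptions <- function(abbrs) {
  paste0("\\b(", paste(abbrs, collapse = "|"), ")\\b")
}

pat <- make_exceptions(c("GVM", "USA", "II"))
out <- tc(c("GVM PXXXXXXXXX LLC", "USA HOLDINGS II"), exceptions = pat)
print(out)
# "GVM Pxxxxxxxxx LLC" "USA Holdings II"
```

The list of abbreviations still has to come from somewhere, but keeping it as a vector makes it easier to extend, or to read from a file, than editing a regular expression by hand.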