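R: Extract only first and last names from a list of names

Time: 2017-03-29 17:08:54

Tags: r

I am using R for data manipulation. I have a long list of names that looks like this: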

"names"

[1] ""                               
[2] "Victoria Marie"                 
[3] "Ori Mann"                     
[4] "Lina Pearl Right"          
[5] "David Berg"                     
[6] "Anthony Lee"                  
[7] "Brian Michael Ingraham"         
[8] "Jay Ling"             

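I want to extract only the first and last name from every entry in the list into new columns, discarding any middle names. How can I do this? I tried the following code: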

mat  = matrix(unlist(names), ncol=2, byrow=TRUE)

This just runs through every name in each entry and spills them all into the columns in order.

Any help is much appreciated.

1 answer:

Answer 0: (score: 1)

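Here is one way to do this in base R; it also handles the possibility of a suffix. If you run into other suffixes (for example, 'II'), you can add them to the vector after %in%.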

# some representative data
names <- list("", "Ed Smith", "Jennifer Jason Leigh", "Ed Begley, Jr.")

# use strsplit to get a list of vectors of each name broken into its parts,
# keying off the space between names
names.split <- strsplit(unlist(names), " ")

# make new vectors with the first and last names, based on their position in
# those vectors. for last names, make the result conditional on whether or
# not a recognized suffix is in the last spot, and get rid of any 
# punctuation attached to the last name if there was a suffix.
name.first <- sapply(names.split, function(x) x[1])
name.last <- sapply(names.split, function(x)

  # this deals with empty name slots in your original list, returning NA
  if(length(x) == 0) {

    NA

  # now check for a suffix; if one is there, use the penultimate item
  # after stripping it of any punctuation
  } else if (x[length(x)] %in% c("Jr.", "Jr", "Sr.", "Sr")) {

    gsub("[[:punct:]]", "", x[length(x) - 1])

  } else {

    x[length(x)]

})

Result:

> name.first
[1] NA         "Ed"       "Jennifer" "Ed"      
> name.last
[1] NA       "Smith"  "Leigh"  "Begley"
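
Since the question asks for the results in new columns, the same logic can be wrapped in a small function that returns a data frame. This is only a sketch: `extract_first_last`, its `suffixes` argument, and the sample entry "John Doe II" are hypothetical names and data introduced here, not part of the original answer. It also adds a guard so that an entry consisting of nothing but a suffix does not index position 0.

```r
# Hypothetical helper (not from the original answer): same strsplit-based
# approach, with the suffix vector exposed as an argument.
extract_first_last <- function(full_names,
                               suffixes = c("Jr.", "Jr", "Sr.", "Sr")) {
  full_names <- unlist(full_names)

  # split each name on single spaces; "" becomes character(0)
  parts <- strsplit(full_names, " ")

  # first piece of each name; x[1] is NA for an empty entry
  first <- vapply(parts, function(x) x[1], character(1))

  last <- vapply(parts, function(x) {
    n <- length(x)
    if (n == 0) {
      # empty entry
      NA_character_
    } else if (n > 1 && x[n] %in% suffixes) {
      # recognized suffix: take the item before it, minus punctuation
      gsub("[[:punct:]]", "", x[n - 1])
    } else {
      x[n]
    }
  }, character(1))

  data.frame(full = full_names, first = first, last = last,
             stringsAsFactors = FALSE)
}

# assumed sample data; 'II' is passed in as an extra suffix
people <- extract_first_last(
  list("", "Ed Smith", "Jennifer Jason Leigh", "Ed Begley, Jr.", "John Doe II"),
  suffixes = c("Jr.", "Jr", "Sr.", "Sr", "II")
)
print(people)
```

Using `vapply` with `character(1)` instead of `sapply` makes the function fail loudly if any entry ever yields something other than a single string, rather than silently returning a list.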