Question

I have some data:

library(tibble)
testData <- tibble(fname = c("Alice", "Bob", "Charlie", "Dan", "Eric"),
                   lname = c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush"))

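Some of the last names have the person's first name concatenated onto the front.

What is an efficient way to detect and fix these entries in the lname column?

I would like it to look like this: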

lname = c("Smith", "West", "Black", "McDowell", "Bush")

I could use a for loop, but I have 500,000 rows of data, so I would like to find a more efficient approach.

Answer 1

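We can use str_remove from stringr. It is vectorised over both the string and the pattern, so each lname is matched against the fname in the same row: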

library(tidyverse)
testData %>%
   mutate(lname = str_remove(lname, fname))
# A tibble: 5 x 2
#  fname   lname   
#  <chr>   <chr>   
#1 Alice   Smith   
#2 Bob     West    
#3 Charlie Black   
#4 Dan     McDowell
#5 Eric    Bush
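
One caveat: str_remove treats fname as a regular expression and removes the first match wherever it occurs in lname, not only at the start. A minimal base R sketch that strips the first name only when it is a literal prefix might look like the following. The helper name strip_prefix is hypothetical and not part of any package.

```r
# Hypothetical helper (not from any package): drop `prefix` from the start of
# `x` only when `x` literally begins with it. Vectorised over both arguments.
strip_prefix <- function(x, prefix) {
  has_prefix <- startsWith(x, prefix)
  # Keep everything after the prefix where it matches, else keep x unchanged.
  ifelse(has_prefix, substring(x, nchar(prefix) + 1L), x)
}

fname <- c("Alice", "Bob", "Charlie", "Dan", "Eric")
lname <- c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush")

cleaned <- strip_prefix(lname, fname)
print(cleaned)
```

Because startsWith, substring and ifelse all operate on whole vectors, this avoids an explicit loop over the rows. Inside a pipeline it could be used as mutate(lname = strip_prefix(lname, fname)).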

Answer 2

We can use gsub within apply, which calls the function once per row:

apply(testData,1,function(x) gsub(x['fname'],"",x['lname']))

Output:

[1] "Smith"    "West"     "Black"    "McDowell" "Bush"
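
Note that apply first converts the data to a character matrix, and gsub interprets each fname as a regular expression. As a rough sketch of the same row-wise idea without those two steps, mapply can walk the two columns in parallel and sub(..., fixed = TRUE) can match the name literally. A plain data.frame stands in for the tibble here, which is an assumption made only to keep the example free of package dependencies.

```r
# A plain data.frame stands in for the tibble from the question (assumption).
testData <- data.frame(
  fname = c("Alice", "Bob", "Charlie", "Dan", "Eric"),
  lname = c("Smith", "West", "CharlieBlack", "DanMcDowell", "Bush"),
  stringsAsFactors = FALSE
)

# Pair each first name with the last name in the same row and remove the
# first literal occurrence of the first name from the last name.
fixed_lname <- mapply(
  function(f, l) sub(f, "", l, fixed = TRUE),
  testData$fname,
  testData$lname,
  USE.NAMES = FALSE
)
print(fixed_lname)
```

This still makes one function call per row, so on 500,000 rows a fully vectorised approach is likely to be faster.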

Answer 3

Try mutate with an ifelse clause to catch the lname entries that are concatenated, e.g.:

library(dplyr)
testData <- testData %>%
  mutate(lname = ifelse(grepl('[[:upper:]][[:lower:]]+[[:upper:]]', lname),
                        gsub('^[[:upper:]][[:lower:]]+', "", lname),
                        lname))

In this example, you are saying: mutate lname if the string contains an uppercase letter, followed by at least one lowercase letter, followed by another uppercase letter. If that condition is met, replace the leading uppercase letter and the lowercase letters that follow it with nothing. If it is not met, keep the original lname text.
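
The two steps described above can be traced on a small vector in base R. This is only an illustrative sketch: the sample values and the variable names is_concat and result are made up for the example. It also shows a limitation of detecting concatenation by capitalisation alone: a surname such as "McDowell" that was never concatenated still matches the pattern and loses its "Mc".

```r
# Sample values chosen for illustration; the last one is NOT concatenated.
sample_lname <- c("Smith", "CharlieBlack", "DanMcDowell", "McDowell")

# Step 1: flag strings containing upper + one or more lower + upper.
is_concat <- grepl("[[:upper:]][[:lower:]]+[[:upper:]]", sample_lname)

# Step 2: for flagged strings, drop the leading capitalised word;
# otherwise keep the original value.
result <- ifelse(
  is_concat,
  sub("^[[:upper:]][[:lower:]]+", "", sample_lname),
  sample_lname
)

print(is_concat)  # FALSE TRUE TRUE TRUE
print(result)     # "Smith" "Black" "McDowell" "Dowell"
```

If such surnames can occur on their own in the data, comparing lname against the fname column is the safer test.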

Using gsub across columns
