How to extract the "domain" from an email address

Time: 2016-10-14 08:37:01

Tags: r regex

My column contains values of the following pattern

xyz@gmail.com
abc@hotmail.com

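Now I want to extract the text after the @ and before the ., i.e. gmail and hotmail. I am able to extract the text after the @ using the following code.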

sub(".*@", "", email)

How can I modify the above to fit my use case?

4 answers:

Answer 0: (score: 6)

You:

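  1. really need to read Section 3 of RFC 3696 (TL;DR: @ can appear in more than one place)
  2. do not seem to have considered that an email address can be "someone@department.example.com" or "someone.else@yet.another.department.example.com" (i.e. naively assuming there is only a single domain level is likely to come back and bite you at some point in this analysis)
  3. should know that if you are really after the email "domain name", you also have to consider what really constitutes a domain name and a proper suffix
  4. So, unless you are certain that you have, and always will have, simple email addresses, may I suggest: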

    library(stringi)
    library(urltools)
    library(dplyr)
    library(purrr)
    
    emails <- c("yz@gmail.com", "abc@hotmail.com",
                "someone@department.example.com",
                "someone.else@yet.another.department.com",
                "some.brit@froodyorg.co.uk")
    
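    # position of the last "@" in each address
    stri_locate_last_fixed(emails, "@")[,"end"] %>%
      map2_df(emails, function(x, y) {
        # keep everything after the last "@", then split it into subdomain/domain/suffix
        substr(y, x+1, nchar(y)) %>%
          suffix_extract()
      })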
    ##                         host    subdomain      domain suffix
    ## 1                  gmail.com         <NA>       gmail    com
    ## 2                hotmail.com         <NA>     hotmail    com
    ## 3     department.example.com   department     example    com
    ## 4 yet.another.department.com  yet.another  department    com
    ## 5            froodyorg.co.uk         <NA>   froodyorg  co.uk
    

    Note the correct split into subdomain, domain and suffix, especially for the last one.
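
To see why a suffix-aware split matters, here is a minimal base R sketch of what a naive rule would do instead. The rule "the second-to-last label is the domain" is an assumption made purely for illustration, and `hosts`, `labels` and `naive_domain` are hypothetical names, not part of the code above.

```r
# Two of the hosts from the example above
hosts <- c("department.example.com", "froodyorg.co.uk")

# Naive rule (illustration only): treat the last label as the suffix
# and the label before it as the domain
labels <- strsplit(hosts, ".", fixed = TRUE)
naive_domain <- vapply(labels, function(p) p[length(p) - 1], character(1))

naive_domain
# [1] "example" "co"
```

The second result is "co" rather than "froodyorg", because "co.uk" is a two-label suffix. That is the case a suffix list lookup is meant to handle.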

    Knowing this, we can then change the code to:

    stri_locate_last_fixed(emails, "@")[,"end"] %>%
      map2_chr(emails, function(x, y) {
        substr(y, x+1, nchar(y)) %>%
          suffix_extract() %>%
          mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
          select(full_domain) %>%
          flatten_chr()
      })
    ## [1] "gmail"                   "hotmail"               
    ## [3] "department.example"      "yet.another.department"
    ## [5] "froodyorg"
    
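As a rough illustration of point 1, the base R sketch below shows why the code above looks for the last @ rather than the first. The address uses a quoted local part of the kind RFC 3696 discusses; the variable names are hypothetical and only serve the example.

```r
# Unusual but syntactically valid address: the quoted local part contains an @
odd <- "\"Abc@def\"@example.com"

# Cutting at the first @ leaves part of the local part behind
after_first_at <- sub("^[^@]*@", "", odd)

# Cutting at the last @ (greedy .*) keeps only the host
after_last_at <- sub(".*@", "", odd)

after_first_at  # the string def"@example.com
after_last_at   # the string example.com
```

For everyday addresses with a single @ both calls give the same result, so the difference only shows up on edge cases like this one.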

Answer 1: (score: 2)

We can use gsub

gsub(".*@|\\..*", "", email)
#[1] "gmail"   "hotmail"
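
One thing to be aware of is that the gsub call returns its input unchanged when there is no @ or dot to strip, and that it keeps only the first label of a multi-level domain. The sketch below compares it with a capture-group variant that returns NA for strings that do not look like an address. `first_domain_label` is a hypothetical helper name, and the pattern is a simplification that assumes ordinary addresses.

```r
# Hypothetical helper: first label after the last @, or NA when the input
# does not have the shape local@label.rest
first_domain_label <- function(email) {
  pattern <- "^.*@([^.@]+)\\..*$"
  ifelse(grepl(pattern, email), sub(pattern, "\\1", email), NA_character_)
}

email <- c("xyz@gmail.com", "someone@department.example.com", "not-an-email")

gsub(".*@|\\..*", "", email)
# [1] "gmail"        "department"   "not-an-email"

first_domain_label(email)
# [1] "gmail"      "department" NA
```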

Answer 2: (score: 2)

You can use:

emails <- c("xyz@gmail.com", "abc@hotmail.com")
emails_new <- gsub(".*@(.+)$", "\\1", emails)
emails_new
# [1] "gmail.com"   "hotmail.com"

See the demo on ideone.com

Answer 3: (score: 1)

Here is a stringr version of @hrbrmstr's function:

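library(magrittr)  # provides %>%, which the calls below rely on

# .[2] is the end position of the "@" when an address contains exactly one "@"
stringr::str_locate_all(email,"@") %>% purrr::map_int(~ .[2]) %>%
  purrr::map2_df(email, ~ {
    stringr::str_sub(.y, .x+1, nchar(.y)) %>%
      urltools::suffix_extract()
  })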