Subset a df based on a nested list in the presence of whitespace

Asked: 2019-07-03 04:35:57

Tags: r list dataframe

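I have a data frame that I want to subset based on specific values. When I try to do this, I run into problems caused by the leading whitespace inside the values of sample_df$mentions.

I use this script to subset the data frame: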

sample_list <- list()
for (i in colnames(sample_name)){
  sample_list <- sapply(sample_df$mentions, function(x)any(x %in% sample_name[[i]]))
  new_sample_df <- sample_df[sample_list,]
}

I have tried using the strsplit function to get rid of the spaces, but it introduces other problems.

sample_df$mentions <- strsplit(as.character(sample_df$mentions), "[[:space:]]")

Thanks for your help.

My expected result should look like this:

                                                            mentions  screen_name
5          islambey1453,  hamzayerlikaya,  tahaayhan,  hidoturkoglu15  ak_Furkan54
10 nurhandnci,  SSSBBL777,  serkanacar007,  Chequevera06,  kubilayy81 tanrica_gaia

Reproducible data for sample_name:

sample_name <- structure(list(Name = structure(2:1, .Label = c("hamzayerlikaya", "SSSBBL777"),
                                               class = "factor")),
                         row.names = c(NA, -2L), class = "data.frame")

Reproducible data for sample_df:

sample_df <- structure(list(
  mentions = list(
    character(0), "srgnsnmz92", character(0), "Berivan_Aslan_",
    c("islambey1453", " hamzayerlikaya", " tahaayhan", " hidoturkoglu15"),
    character(0), "themarginale", character(0), character(0),
    c("nurhandnci", " SSSBBL777", " serkanacar007", " Chequevera06", " kubilayy81")
  ),
  screen_name = c(
    "SaadetYakar", "beraydogru", "EL_Turco_DLC", "hebunagel", "ak_Furkan54",
    "zaferakyol011", "melmitem", "mobbingabla", "BekarKronik", "tanrica_gaia"
  )
), row.names = c(NA, 10L), class = "data.frame")

2 Answers:

Answer 0: (score: 1)

Since mentions is a list, we can use sapply and select only those rows of sample_df where any of the mentions contains a Name.

sample_df[sapply(sample_df$mentions, function(x) any(grepl(pattern, x))), ]
#                                                           mentions  screen_name
#5          islambey1453,  hamzayerlikaya,  tahaayhan,  hidoturkoglu15  ak_Furkan54
#10 nurhandnci,  SSSBBL777,  serkanacar007,  Chequevera06,  kubilayy81 tanrica_gaia

where pattern is a regular expression that matches the values of sample_name$Name.
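The definition of pattern is not shown in this answer. One plausible way to build it, offered here as an assumption rather than the answerer's confirmed code, is to join the names with "|" so that the regular expression matches any one of them. The small mentions list below is illustrative data modelled on the question, not the full sample_df.

```r
# Assumed reconstruction: an alternation regex built from the Name column.
sample_name <- data.frame(Name = c("SSSBBL777", "hamzayerlikaya"),
                          stringsAsFactors = TRUE)

# paste() converts the factor to its labels before joining them.
pattern <- paste(sample_name$Name, collapse = "|")

# Illustrative list column: an empty entry, a match with a leading space,
# and a non-matching entry.
mentions <- list(character(0),
                 c("islambey1453", " hamzayerlikaya"),
                 "themarginale")

# grepl() does substring matching, so the leading space does not prevent a hit.
keep <- sapply(mentions, function(x) any(grepl(pattern, x)))
print(pattern)
print(keep)
```

Because grepl looks for the pattern anywhere in the string, the leading whitespace in the mentions does not need to be removed first.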

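Answer 1: (score: 1)

We can loop over the values of 'Name', pass each one to grepl, Reduce the results to a single logical vector, and use that vector to subset the rows of 'sample_df'.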

sample_df[Reduce(`|`, lapply(as.character(sample_name$Name), 
      grepl, x = sample_df$mentions)),]
#                                                           mentions  screen_name
#5          islambey1453,  hamzayerlikaya,  tahaayhan,  hidoturkoglu15  ak_Furkan54
#10 nurhandnci,  SSSBBL777,  serkanacar007,  Chequevera06,  kubilayy81 tanrica_gaia

Note: this works for a 'Name' column of any length.


Another option is regex_inner_join:
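To make the Reduce step concrete, the following sketch shows the intermediate values on a small illustrative list (an assumption for demonstration, not the full sample_df). Each call to grepl yields one logical vector per name, and Reduce combines them element-wise with `|`.

```r
# Illustrative inputs modelled on the question's data.
name_values <- c("hamzayerlikaya", "SSSBBL777")
mentions <- list(character(0),
                 c("islambey1453", " hamzayerlikaya"),
                 "themarginale",
                 c("nurhandnci", " SSSBBL777"))

# One logical vector per name. grepl() coerces the list with as.character(),
# so each element is searched as a single deparsed string.
hits <- lapply(name_values, grepl, x = mentions)

# Element-wise OR across all per-name vectors.
keep <- Reduce(`|`, hits)
print(hits)
print(keep)
```

Since grepl searches the deparsed text of each list element, a name that happens to be a substring of that text (for example of "character(0)") would also match; with the names in the question this does not occur.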

library(fuzzyjoin)
library(tidyverse)
regex_inner_join(sample_df, sample_name, by = c("mentions" = "Name")) %>% 
      select(mentions, screen_name)
#                                                          mentions  screen_name
#1         islambey1453,  hamzayerlikaya,  tahaayhan,  hidoturkoglu15  ak_Furkan54
#2 nurhandnci,  SSSBBL777,  serkanacar007,  Chequevera06,  kubilayy81 tanrica_gaia
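Both approaches above rely on regular-expression matching, which also accepts partial matches. If exact matching is preferred, a possible base R sketch (not taken from either answer) is to strip the surrounding whitespace with trimws and compare with %in%, which stays close to the loop in the question. The data below is a shortened, illustrative version of the question's data frames.

```r
# Shortened illustrative data; the column names follow the question.
sample_name <- data.frame(Name = c("SSSBBL777", "hamzayerlikaya"))
sample_df <- data.frame(screen_name = c("beraydogru", "ak_Furkan54", "tanrica_gaia"),
                        stringsAsFactors = FALSE)
sample_df$mentions <- list("srgnsnmz92",
                           c("islambey1453", " hamzayerlikaya"),
                           c("nurhandnci", " SSSBBL777"))

# Trim each mention, then test for exact membership in the Name column.
keep <- vapply(sample_df$mentions,
               function(x) any(trimws(x) %in% as.character(sample_name$Name)),
               logical(1))

result <- sample_df[keep, ]
print(result$screen_name)
```

vapply is used instead of sapply so that the result is guaranteed to be a logical vector, even when mentions contains only empty entries.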