How do I search list columns in a data frame?

Asked: 2019-05-16 23:01:56

Tags: r list dataframe

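I am trying to locate a specific term in a data.frame with 7 columns and 1356 rows. The two columns I want to search are of type list. I want to know where the word "hunter" appears in these two columns.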

If I use sapply to check the class of each column, I get the following:

sapply(dataframe, class)

         ID    pdf_name     keyword    page_num    line_num   line_text  token_text 
"integer"    "factor" "character"   "integer"   "integer"      "list"      "list" 

When I try to filter out the rows of the data.frame that do not contain the search term with

filter(dataframe, !grepl("hunt",token_text))

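I get the entire data.frame printed out. Ideally, I would like to print only the rows where the search term appears in one of the lists. Here is the head of what I have so far: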

structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L), pdf_name = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = c("Ames - 1994 - The Northwest Coast Complex Hunter-Gatherers, Eco.pdf", 
"Byers and Broughton - 2004 - Holocene Environmental Change, Artiodactyl Abundan.pdf", 
"Byers et al. - 2005 - Holocene artiodactyl population histories and larg.pdf", 
"Clarkson and Bellas - 2014 - Mapping stone using GIS spatial modelling to pred.pdf", 
"Codding and Jones - 2013 - Environmental productivity predicts migration, dem.pdf", 
"Elston and Zeanah - 2002 - Thinking outside the box a new perspective on die.pdf", 
"Elston et al. - 2014 - Living outside the box An updated perspective on .pdf", 
"FinlaysonBillWa_2017_2ExpandingNotionsOfHu_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_3ConceptualisingSubsi_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_5OkhotskAndSushenHist_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_6ComparativeAnalysisO_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_7LetsStartWithOurAcad_TheDiversityOfHunterG.pdf", 
"FinlaysonBillWa_2017_8ExperimentalEthnoarc_TheDiversityOfHunterG.pdf", 
"Fowler et al. - 2013 - Archaeology in the Great Basin and Southwest Pap.pdf", 
"Fulkerson - 2017 - Engendering the Past The Status of Gender and Fem.pdf", 
"GowdyJohnM_1998_2WhatHuntersDoForALiv_LimitedWantsUnlimited.pdf", 
"GowdyJohnM_1998_3SharingTalkingAndGiv_LimitedWantsUnlimited.pdf", 
"GowdyJohnM_1998_5BeyondTheOriginalAff_LimitedWantsUnlimited.pdf", 
"GowdyJohnM_1998_8TheFutureOfHunterGat_LimitedWantsUnlimited.pdf", 
"Gray - 2011 - The Evolutionary Biology of Education How Our Hun.pdf", 
"Grayson and Woolfenden - 2016 - Giant Sloths and Sabertooth Cats Archaeology of .pdf", 
"GraysonDonaldKW_2016_ClovisCometsAndClimat_GiantSlothsAndSaberto.pdf", 
"GraysonDonaldKW_2016_ExtinctMammalsDangero_GiantSlothsAndSaberto.pdf", 
"Hildebrandt and McGuire - 2003 - Large-Game Hunting, Gender-Differentiated Work Org.pdf", 
"Hockett - 1991 - Toward Distinguishing Human and Raptor Patterning .pdf", 
"Hockett - 2005 - Middle and Late Holocene Hunting in the Great Basi.pdf", 
"Hockett - 2010 - Back to Study Hall Further Reflections on Large G.pdf", 
"Hockett et al. - 2013 - Large-scale trapping features from the Great Basin.pdf", 
"Hockett et al. - 2014 - Identifying Dart and Arrow Points in The Great Bas.pdf", 
"Janz - 2016 - Fragmented Landscapes and Economies of Abundance.pdf", 
"Kintigh - 1997 - Thoughts on Writing in Archaeology With Special Re.pdf", 
"LaBelle and Pelton - 2013 - Communal hunting along the Continental Divide of N.pdf", 
"Lawson and Borgerhoff Mulder - 2016 - The offspring quantity-quality trade-off and human.pdf", 
"Lemke - 2016 - Hunting Architecture and Foraging Lifeways beneath.pdf", 
"Lew-Levy et al. - 2017 - How Do Hunter-Gatherer Children Learn Subsistence .pdf", 
"Louderback et al. - 2011 - Middle-Holocene climates and human population dens.pdf", 
"M. W. Lake - 2014 - Trends in Archaeological Simulation.pdf", 
"Madsen and Simms - 1998 - The Fremont Complex A Behavioral Perspective.pdf", 
"Margaret W. Conkey and Joan M. Gero - 1997 - Programme to Practice Gender and Feminism in Arch.pdf", 
"Ross et al. - 2016 - Evidence for quantity–quality trade-offs, sex-spec.pdf", 
"Silva et al. - 2014 - Historical ethnobotany an overview of selected st.pdf", 
"Smith et al. - 2013 - Paleoindian technological provisioning strategies .pdf", 
"Stirn - 2014 - Modeling site location patterns amongst late-prehi.pdf", 
"Trigger - 1984 - Archaeology at the Crossroads What's New.pdf"
), class = "factor"), keyword = c("table", "table", "table", 
"table", "table", "table"), page_num = c(2L, 2L, 2L, 3L, 3L, 
3L), line_num = c(29L, 38L, 63L, 98L, 102L, 106L), line_text = list(
    "Salmon have advantages for foragers (72, 111); they occur at predictable times, in predictable places, and in once prodigious numbers. ", 
    "Such variation in clumping is not predictable. ", "People inevitably began taking advantage of the rich, predictable resource. ", 
    "Matson reasons that intensification, sedentism, and ownership of resource patches evolved among hunter-gatherers when the resources were sufficiently abundant, reliable, predictable, and limited geographically and temporally. ", 
    "Matson holds that intensification, inequality, and sedentism each flow as inevitable consequences of the stmcture of the resource base, but only intensification and status differentials are causally linked. ", 
    "Matson's view is that Northwest Coast societies would only develop in an environment that was reliably rich and predictable. "), 
    token_text = list(list(c("salmon", "have", "advantages", 
    "for", "foragers", "72", "111", "they", "occur", "at", "predictable", 
    "times", "in", "predictable", "places", "and", "in", "once", 
    "prodigious", "numbers")), list(c("such", "variation", "in", 
    "clumping", "is", "not", "predictable")), list(c("people", 
    "inevitably", "began", "taking", "advantage", "of", "the", 
    "rich", "predictable", "resource")), list(c("matson", "reasons", 
    "that", "intensification", "sedentism", "and", "ownership", 
    "of", "resource", "patches", "evolved", "among", "hunter", 
    "gatherers", "when", "the", "resources", "were", "sufficiently", 
    "abundant", "reliable", "predictable", "and", "limited", 
    "geographically", "and", "temporally")), list(c("matson", 
    "holds", "that", "intensification", "inequality", "and", 
    "sedentism", "each", "flow", "as", "inevitable", "consequences", 
    "of", "the", "stmcture", "of", "the", "resource", "base", 
    "but", "only", "intensification", "and", "status", "differentials", 
    "are", "causally", "linked")), list(c("matson's", "view", 
    "is", "that", "northwest", "coast", "societies", "would", 
    "only", "develop", "in", "an", "environment", "that", "was", 
    "reliably", "rich", "and", "predictable")))), row.names = c(NA, 
6L), class = "data.frame")

2 answers:

Answer 0: (score: 0)

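Here is an example using a fake data frame that I made from the sentences dataset. sentences is a long character vector, but we split each entry on whitespace so that listcol is a list column holding the individual words of each sentence: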

library(tidyverse)

dataframe <- sentences %>%
  enframe(name = "rowid", value = "sentence") %>%
  mutate(listcol = str_split(sentence, "\\s"))
dataframe
#> # A tibble: 720 x 3
#>    rowid sentence                                    listcol  
#>    <int> <chr>                                       <list>   
#>  1     1 The birch canoe slid on the smooth planks.  <chr [8]>
#>  2     2 Glue the sheet to the dark blue background. <chr [8]>
#>  3     3 It's easy to tell the depth of a well.      <chr [9]>
#>  4     4 These days a chicken leg is a rare dish.    <chr [9]>
#>  5     5 Rice is often served in round bowls.        <chr [7]>
#>  6     6 The juice of lemons makes fine punch.       <chr [7]>
#>  7     7 The box was thrown beside the parked truck. <chr [8]>
#>  8     8 The hogs were fed chopped corn and garbage. <chr [8]>
#>  9     9 Four hours of steady work faced us.         <chr [7]>
#> 10    10 Large size in stockings is hard to sell.    <chr [8]>
#> # … with 710 more rows

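So we have a data frame with some non-list columns, such as rowid, and a list column, listcol. We can filter it to keep only the rows whose sentence contains "The". The trick is to use map_lgl (or sapply) to check each element of the list column and see whether any of the words in that element match the pattern, using str_detect (or grepl).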
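The filtering step described above can be sketched as follows. This is a minimal reconstruction under assumptions, not necessarily the code the answer originally showed: it rebuilds the same example data frame from stringr's sentences vector, and the result names with_the, keep and with_the_base are introduced here only for illustration.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Rebuild the example data frame: one row per sentence, with a list
# column holding that sentence's words as a character vector.
dataframe <- sentences %>%
  enframe(name = "rowid", value = "sentence") %>%
  mutate(listcol = str_split(sentence, "\\s"))

# map_lgl() returns one TRUE/FALSE per row: TRUE when any word in that
# row's character vector matches the pattern.
with_the <- dataframe %>%
  filter(map_lgl(listcol, ~ any(str_detect(.x, "The"))))

# Base R equivalent using sapply() and grepl().
keep <- sapply(dataframe$listcol, function(words) any(grepl("The", words)))
with_the_base <- dataframe[keep, ]
```

Both versions test each list element separately and reduce the per-word matches to a single logical with any(), which is what filter() and row indexing need: exactly one TRUE or FALSE per row.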

Created on 2019-05-16 by the reprex package (v0.2.1)

Answer 1: (score: 0)

Here is a tidyverse solution. It is a bit messy because of the way your data is structured. I unlisted your last column into a string. I saved your dput as df.

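First, I unnest the last column (with unlist) and collapse each row's tokens into a single string. Then I select only the columns you are interested in, and finally find which rows contain the word "hunter".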

library(dplyr)
library(stringr)
df %>% 
  dplyr::mutate(token_text = unlist(lapply(lapply(token_text, unlist), paste, collapse = " "))) %>% 
  dplyr::select(line_text, token_text) %>% 
  lapply(function(x) which(stringr::str_detect(x, "hunter")))
$`line_text`
[1] 4

$token_text
[1] 4
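If the goal is to print only the matching rows rather than their indices, the same idea can be sketched in base R. This is a hedged example under assumptions: df here is a hypothetical two-row stand-in for the question's data (line_text as a list of strings, token_text as a list of lists of character vectors), and has_term is a helper name invented for this illustration.

```r
# Hypothetical two-row data frame mimicking the question's structure.
df <- data.frame(ID = 1:2)
df$line_text <- list(
  "Such variation in clumping is not predictable. ",
  "ownership of resource patches evolved among hunter-gatherers "
)
df$token_text <- list(
  list(c("such", "variation", "in", "clumping")),
  list(c("evolved", "among", "hunter", "gatherers"))
)

# has_term() is an illustrative helper: for each cell of a list column,
# flatten any nesting with unlist() and report whether any piece
# contains the (literal) search term.
has_term <- function(col, pattern) {
  vapply(
    col,
    function(cell) any(grepl(pattern, unlist(cell), fixed = TRUE)),
    logical(1)
  )
}

# Keep rows where the term appears in either list column.
hits <- has_term(df$line_text, "hunter") | has_term(df$token_text, "hunter")
df[hits, ]
```

Because unlist() is applied per cell, the helper works for both the flat line_text column and the doubly nested token_text column without collapsing them to strings first.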