Question

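I have a table describing train tracks. Each row is one part of a track, with a from and a to station as well as a trackID and a segment ID. The real station names are completely random, not structured as shown here.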

tracks <- data.frame(
  trackID = c(rep("A",4),rep("B",4)),
  segment = letters[1:8],
  from = paste0("station_1",1:8),
  to = paste0("station_2",1:8)
  )

tracks 

  trackID segment       from         to
1       A       a station_11 station_21
2       A       b station_12 station_22
3       A       c station_13 station_23
4       A       d station_14 station_24
5       B       e station_15 station_25
6       B       f station_16 station_26
7       B       g station_17 station_27
8       B       h station_18 station_28

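I also have a table of sightings on these train tracks, and I would like to know which trackID each sighting corresponds to. The table looks like this: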

sightings <- data.frame(from = c("station_24","station_28","station_14"),
                    to = c("station_14","station_16","station_25"))

sightings 

        from         to
1 station_24 station_14
2 station_28 station_16
3 station_14 station_25

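I can work out the trackID from the from and to information provided in the sightings table. However, from and to in the sightings table do not correspond to from and to in the tracks table: from and to can lie in different segments, and they can be interchanged (from-to may appear as to-from). In some problematic cases, from and to lie on different trackIDs, and then no match should be returned. The desired output for the example is:

        from         to trackID
1 station_24 station_14       A
2 station_28 station_16       B
3 station_14 station_25    <NA> # no match since station_14 and 25 are from two different trackIDs

In my mind, the solution involves collapsing the tracks table by trackID and then doing a two-part partial match on the strings (using grepl()?). The following lines take care of the collapsing, but I don't know where to go from here. Can someone point me in the right direction?

A solution using dplyr / R would be much preferred, but I'll take anything!

library(dplyr)

tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from,collapse = ","),paste(to,collapse = ","),sep = ",")
    )

  trackID from_to
  <fct>   <chr>
1 A       station_11,station_12,station_13,station_14,station_21,station_22,station_23,station_24
2 B       station_15,station_16,station_17,station_18,station_25,station_26,station_27,station_28

Edit: It seems I oversimplified my problem in the minimal example. The main issue is that the stations (from and to) are not unique in the table, not even unique per trackID. Only the combination of from and to is unique per trackID. I have accepted the answer because it solves the problem as stated above, but I will also provide my own solution.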

Answer 1

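A double join can work.

(Note: you don't appear to use segment, so I drop it here, but this can be modified if you need it. Also, I added stringsAsFactors=FALSE to your data, since combining factor vectors can otherwise be problematic.)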

library(dplyr)

tracksmod <- bind_rows(
  select(tracks, trackID, sta=from),
  select(tracks, trackID, sta=to)
)
head(tracksmod)
#   trackID        sta
# 1       A station_11
# 2       A station_12
# 3       A station_13
# 4       A station_14
# 5       B station_15
# 6       B station_16

sightings %>%
  left_join(select(tracksmod, trackID, from=sta), by="from") %>%
  left_join(select(tracksmod, trackID2=trackID, to=sta), by="to") %>%
  mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
  select(-trackID2)
#         from         to trackID
# 1 station_24 station_14       A
# 2 station_28 station_16       B
# 3 station_14 station_25    <NA>

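I don't think directionality matters. That is, I don't think a station listed as from must always appear in the from column. That is why I converted tracks into tracksmod: it lets me identify the stations belonging to each ID regardless of direction.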
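If a station can appear on more than one track (as described in the question's edit), comparing two independently joined trackIDs is no longer enough, because each join may return several rows. Below is a minimal sketch of one way to adapt the long-format idea, assuming the updated data from the edit. It uses base R `merge()` only so that it runs without extra packages; the `row` helper column is my own addition, not part of the original data. The second merge matches on station and trackID together, so only tracks containing both stations survive.

```r
# Updated data from the question's edit: stations repeat across tracks.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  stringsAsFactors = FALSE
)

# Long format: one row per (trackID, station), ignoring direction.
tracksmod <- unique(data.frame(
  trackID = c(tracks$trackID, tracks$trackID),
  sta = c(tracks$from, tracks$to),
  stringsAsFactors = FALSE
))

# Hypothetical helper column to restore the original row order after merging.
sightings$row <- seq_len(nrow(sightings))

# All tracks that contain the 'from' station ...
m_from <- merge(sightings, tracksmod, by.x = "from", by.y = "sta")
# ... restricted to tracks that also contain the 'to' station.
m_both <- merge(m_from, tracksmod,
                by.x = c("to", "trackID"), by.y = c("sta", "trackID"))

# Re-attach to sightings so that unmatched sightings get NA.
result <- merge(sightings, m_both[, c("row", "trackID")],
                by = "row", all.x = TRUE)
result <- result[order(result$row), c("from", "to", "trackID")]
rownames(result) <- NULL
result
```

Note that a sighting whose two stations both lie on several tracks would come back as several rows, one per matching trackID.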

Answer 2

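As I stated in the EDIT of the question, I oversimplified my problem in the minimal example. Here is an updated version of the data that resembles my real data more closely. As @r2evans suggested, I also added stringsAsFactors = F.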

tracks <- data.frame(
  trackID = c(rep("A",4),rep("B",4)),
  segment = letters[1:8],
  from = paste0("station_1",c(1:4,1,2,5,6)),
  to = paste0("station_2",1:8),
  stringsAsFactors = F
  )

sightings <- data.frame(
  from = c("station_24","station_28","station_14"),
  to = c("station_14","station_11","station_25"),
  trackID = c("A","B",NA),
  stringsAsFactors = F
)

I solved this by collapsing the tracks table by trackID and then using the mapping functions from the purrr package in a nested manner.

library(dplyr)

# Collapsing the tracks-dataframe
tracks_collapse <- tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from,collapse = ","),paste(to,collapse = ","),sep = ",")
    # from = list(from),
    # to = list(to),
    # stas = list(c(from,to))
    )

# a helper function to remove NAs when looking for matches
remove_na <- function(x){x[!is.na(x)]}

library(purrr)


pmap_dfr(sightings, function(from,to,trackID){                         # pmap_dfr runs over a data.frame and returns a data.frame
  data.frame(
    from = from,                                                       # recreates the sightings data.frame
    to = to,                                                           # ditto
    trackID = na_if(paste(                                             # collapses the resulting vector; an empty result ("") becomes NA
      remove_na(                                                       # removes the NA values
        pmap_chr(                                                      # matches every row from the sightings-data.frame with the tracks-data.frame
          tracks_collapse,
          function(trackID,from_to){
            ifelse(grepl(from,from_to) & grepl(to,from_to),trackID,NA_character_) # does partial string matching and returns the trackID if both strings match
            }
          )
        ),collapse = ","
      ), "")
    )
  })

Output:

        from         to trackID
1 station_24 station_14       A
2 station_28 station_11       B
3 station_14 station_25    <NA>
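One caveat with `grepl()` here: it does substring (regex) matching, so a station named `station_1` would also match `station_11` in the collapsed string. Below is a minimal sketch of an exact-match variant, in the spirit of the commented-out `stas = list(c(from,to))` line above. It keeps the stations of each track as a character vector and tests membership with `%in%`. The names `stations_by_track` and `find_track` are hypothetical helpers of my own, and the sketch uses base R only.

```r
# Updated data from this answer.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  stringsAsFactors = FALSE
)

# One character vector of stations per trackID (direction ignored).
stations_by_track <- lapply(
  split(tracks[, c("from", "to")], tracks$trackID),
  function(d) unique(c(d$from, d$to))
)

# Hypothetical helper: the trackID(s) containing both stations, or NA if none.
find_track <- function(from, to) {
  hit <- vapply(stations_by_track,
                function(s) from %in% s && to %in% s,
                logical(1))
  if (any(hit)) paste(names(hit)[hit], collapse = ",") else NA_character_
}

sightings$trackID <- mapply(find_track, sightings$from, sightings$to,
                            USE.NAMES = FALSE)
sightings
```

Because membership is tested on whole station names, `find_track("station_1", "station_2")` returns NA here, whereas the substring approach would report a match.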

Joining tables by multiple partial matches
