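Find whether a value in one column falls within the range of several other columns

Time: 2021-06-09 12:48:04

Tags: r string range comparison multiple-columns

I am looking for a simple way to determine whether the value in one column falls within the range of the values in other columns.

My input looks like this: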

ID   "Q1 Comm - 01 Scope Thesis"  "Q1 Comm - 02 Scope Project"  "Q1 Comm - 03 Learn Intern"  "Q1 Comm - 04 Biography"  "Q1 Comm - Overall Plan"
10   NA                            NA                             4                             NA                        4
31   2                             NA                             NA                            NA                        2
225  0                             NA                             NA                            NA                        1
243  NA                            2                              NA                            1                         0
310  NA                            2                              NA                            1                         NA

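For each unique ID, I want to determine whether the Q1 Comm - Overall Plan column is:

1 - Below the min() of all the other columns, or

2 - Above the max() of all the other columns, or

3 - Within the range of all the other columns

The full list of columns (including the overall column) is: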

"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Plan"

The output I need looks like this:

ID   "Q1 Comm - 01 Scope Thesis"  "Q1 Comm - 02 Scope Project"  "Q1 Comm - 03 Learn Intern"  "Q1 Comm - 04 Biography"  "Q1 Comm - Overall Plan"  "Q1_check"
10   NA                            NA                             4                             NA                        4                         "within"
31   2                             NA                             NA                            NA                        2                         "within"
225  0                             NA                             NA                            NA                        1                         "above"
243  NA                            2                              NA                            1                         0                         "below"
310  NA                            2                              NA                            1                         NA                        NA
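
For reference, the three rules and the desired output above can be sketched in base R as follows. This is only a minimal illustration, not any answerer's method: the rows are typed in from the table above, and the short column names (thesis, project, intern, biography, overall) are hypothetical stand-ins for the real "Q1 Comm - ..." names.

```r
# Rows typed in from the input table; the short column names are hypothetical.
df <- data.frame(
  ID        = c(10L, 31L, 225L, 243L, 310L),
  thesis    = c(NA, 2L, 0L, NA, NA),
  project   = c(NA, NA, NA, 2L, 2L),
  intern    = c(4L, NA, NA, NA, NA),
  biography = c(NA, NA, NA, 1L, 1L),
  overall   = c(4L, 2L, 1L, 0L, NA)
)

# Every column except ID and the overall column is an item column.
item_cols <- setdiff(names(df), c("ID", "overall"))
items <- unname(as.list(df[item_cols]))

# Row-wise minimum and maximum of the item columns, ignoring NAs.
lo <- do.call(pmin, c(items, na.rm = TRUE))
hi <- do.call(pmax, c(items, na.rm = TRUE))

# Apply rules 1-3; an NA overall value yields NA because the comparison is NA.
df$Q1_check <- ifelse(df$overall < lo, "below",
               ifelse(df$overall > hi, "above", "within"))

df$Q1_check
# "within" "within" "above" "below" NA
```

pmin() and pmax() work across whole columns at once, so no explicit loop over rows is needed.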

The dput() of my data frame df is below.

dput(df)

structure(list(ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA, 
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L), 
    Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA, 
    NA, NA, 1L), Q1.Comm...Overall.Plan = c(4L, 1L, 2L, 
    NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA, 
    NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))

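Note:

I previously asked this at Finding if a value is within the range of other columns, but the example was oversimplified and none of the solutions worked for me.

That question became too long, so for clarity I am posting this as a new question.

Thank you for taking the time to help with this post.

3 Answers:

Answer 0: (score: 1)

You can try rowwise and c_across: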

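library(dplyr)

# Compare the overall column with every column except ID and itself, row by row.
df %>%
  rowwise() %>%
  summarise(ID = ID,
            # TRUE when the overall value is above the row maximum
            Max = `Q1.Comm...Overall.Plan` > max(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE),
            # TRUE when the overall value is below the row minimum
            Min = `Q1.Comm...Overall.Plan` < min(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE),
            # TRUE when the overall value lies between the row minimum and maximum (inclusive)
            Range = `Q1.Comm...Overall.Plan` >= range(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE)[1] &
                    `Q1.Comm...Overall.Plan` <= range(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE)[2]) %>%
  mutate(Result = case_when(Max ~ "above",
                            Min ~ "below",
                            Range ~ "within",
                            TRUE ~ NA_character_))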
# A tibble: 4 x 5
     ID Max   Min   Range Result
  <int> <lgl> <lgl> <lgl> <chr> 
1    10 FALSE FALSE TRUE  within
2    31 FALSE FALSE TRUE  within
3   225 TRUE  FALSE FALSE above 
4   243 NA    NA    NA    NA    

You can change summarise to mutate to keep the original columns, and/or use select to drop them.

For details, see the dplyr rowwise tutorial.
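
The summarise-to-mutate change mentioned above might look like the sketch below. It is an assumption-laden variant rather than the answerer's exact code: the data are copied from the question's dput() with the empty X columns left out, the helper columns lo and hi and the vector item_cols are illustrative names, and the item columns are selected with all_of() so that the helper columns are not picked up by c_across().

```r
library(dplyr)

# Data from the question's dput(), without the all-NA X columns.
df <- data.frame(
  ID = c(10L, 31L, 225L, 243L),
  Q1.Comm...01.Scope.Thesis  = c(NA, 2L, 0L, NA),
  Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L),
  Q1.Comm...03.Learn.Intern  = c(4L, NA, NA, NA),
  Q1.Comm...04.Biography     = c(NA, NA, NA, 1L),
  Q1.Comm...Overall.Plan     = c(4L, 1L, 2L, NA)
)

# Fix the item columns up front (illustrative name).
item_cols <- setdiff(names(df), c("ID", "Q1.Comm...Overall.Plan"))

res <- df %>%
  rowwise() %>%
  mutate(
    lo = min(c_across(all_of(item_cols)), na.rm = TRUE),  # row minimum
    hi = max(c_across(all_of(item_cols)), na.rm = TRUE),  # row maximum
    Q1_check = case_when(
      is.na(Q1.Comm...Overall.Plan) ~ NA_character_,
      Q1.Comm...Overall.Plan < lo ~ "below",
      Q1.Comm...Overall.Plan > hi ~ "above",
      TRUE ~ "within"
    )
  ) %>%
  ungroup() %>%
  select(-lo, -hi)  # keep the original columns plus Q1_check

res$Q1_check
# "within" "below" "above" NA
```

With the dput() values as posted (overall plan 1 for ID 31 and 2 for ID 225), ID 31 comes out as "below". The tibble printed earlier in this answer shows "within" for ID 31, which corresponds to the values in the question's table (overall plan 2 for ID 31) rather than to the dput().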

Answer 1: (score: 1)

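library(purrr)
library(data.table)

# Item columns: everything except ID and the overall column.
needed_cols <- setdiff(names(df), c("ID", "Q1.Comm...Overall.Plan"))

# Row-wise range of the item columns. data.table::transpose() (not purrr's) turns the
# list of (min, max) pairs into one vector of minima and one vector of maxima.
setDT(df)[, c("min", "max") := data.table::transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = needed_cols]
# Classify the overall column against that range; an NA overall value stays NA.
df[, Q1_check := fcase(
    is.na(`Q1.Comm...Overall.Plan`), NA_character_,
    `Q1.Comm...Overall.Plan` < min, "below",
    `Q1.Comm...Overall.Plan` > max, "above",
    default = "within"
  )
]
# Drop the helper columns.
df[, c("max", "min") := NULL]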

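Answer 2: (score: 0)

I have modified your dput to match the requirements you discussed in the linked question. I think this will help you. I used janitor::clean_names(), and I suggest you apply it before proceeding so that your column names are cleaned up.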
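
As a small illustration of that cleaning step, the sketch below applies janitor::clean_names() to a cut-down data frame. The object names raw and cleaned are hypothetical, and only three of the question's columns are included to keep it short.

```r
library(janitor)

# A cut-down version of the question's data frame (names as produced by read.csv).
raw <- data.frame(
  ID = c(10L, 31L),
  Q1.Comm...01.Scope.Thesis = c(NA, 2L),
  Q1.Comm...Overall.Plan = c(4L, 1L)
)

# clean_names() converts the column names to lower-case snake_case.
cleaned <- clean_names(raw)
names(cleaned)
```

The resulting names should follow the same pattern as the ones used in the modified dput in this answer, for example q1_comm_01_scope_thesis.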

So the modified dput is:

df <- structure(list(id = c(10L, 31L, 225L, 243L), q1_comm_01_scope_thesis = c(NA, 
2L, 0L, NA), q1_comm_02_scope_project = c(NA, NA, NA, 2L), q1_comm_03_learn_intern = c(4L, 
NA, NA, NA), q1_comm_04_biography = c(NA, NA, NA, 1L), q1_comm_overall_plan = c(4L, 
1L, 2L, NA), q2_comm_01_scope_thesis = c(NA, 4, 0, NA), q2_comm_02_scope_project = c(NA, 
NA, NA, 4), q2_comm_03_learn_intern = c(8, NA, NA, NA), q2_comm_04_biography = c(NA, 
NA, NA, 2), q2_comm_overall_plan = c(8, 2, 4, NA)), row.names = c(NA, 
-4L), class = "data.frame")

df
   id q1_comm_01_scope_thesis q1_comm_02_scope_project q1_comm_03_learn_intern q1_comm_04_biography q1_comm_overall_plan q2_comm_01_scope_thesis
1  10                      NA                       NA                       4                   NA                    4                      NA
2  31                       2                       NA                      NA                   NA                    1                       4
3 225                       0                       NA                      NA                   NA                    2                       0
4 243                      NA                        2                      NA                    1                   NA                      NA
  q2_comm_02_scope_project q2_comm_03_learn_intern q2_comm_04_biography q2_comm_overall_plan
1                       NA                       8                   NA                    8
2                       NA                      NA                   NA                    2
3                       NA                      NA                   NA                    4
4                        4                      NA                    2                   NA

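Now proceed as follows. You will need to adjust the [-5] inside cur_data() to fit your data (it depends on the relative position of the overall column; I think it would be 9 in your case).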

library(tidyverse)

split.default(df[-1], gsub('(q\\d*)(.*)', '\\1', names(df[-1]), perl = T)) %>%
  map(., ~ .x %>% bind_cols('id' = df$id) %>%
        group_by(id) %>%
        mutate(across(ends_with('_overall_plan'), ~ case_when(. < min(cur_data()[-5], na.rm = T) ~ 'below',
                                                              . > max(cur_data()[-5], na.rm = T) ~ 'above',
                                                              is.na(.) ~ NA_character_,
                                                              TRUE ~ 'within'),
                      .names = '{str_remove(.col,"_comm_overall_plan")}_check'))
        ) %>%
  reduce(left_join, by = 'id')

# A tibble: 4 x 13
# Groups:   id [4]
  q1_comm_01_scop~ q1_comm_02_scop~ q1_comm_03_lear~ q1_comm_04_biog~ q1_comm_overall~    id q1_check q2_comm_01_scop~ q2_comm_02_scop~ q2_comm_03_lear~ q2_comm_04_biog~
             <int>            <int>            <int>            <int>            <int> <int> <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1               NA               NA                4               NA                4    10 within                 NA               NA                8               NA
2                2               NA               NA               NA                1    31 below                   4               NA               NA               NA
3                0               NA               NA               NA                2   225 above                   0               NA               NA               NA
4               NA                2               NA                1               NA   243 NA                     NA                4               NA                2
# ... with 2 more variables: q2_comm_overall_plan <dbl>, q2_check <chr>
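
Because the positional index [-5] in the code above depends on where the overall column sits, a name-based variant may be easier to adapt. The sketch below is one possible base R alternative, not the answerer's method: the helper check_quarter() is a hypothetical name, and it assumes every quarter's columns follow the q<N>_comm_... pattern with the overall column named q<N>_comm_overall_plan. Separately, note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick(), so the tidyverse code above may emit a deprecation warning on recent versions.

```r
# Data from the modified dput() in this answer.
df <- data.frame(
  id = c(10L, 31L, 225L, 243L),
  q1_comm_01_scope_thesis  = c(NA, 2L, 0L, NA),
  q1_comm_02_scope_project = c(NA, NA, NA, 2L),
  q1_comm_03_learn_intern  = c(4L, NA, NA, NA),
  q1_comm_04_biography     = c(NA, NA, NA, 1L),
  q1_comm_overall_plan     = c(4L, 1L, 2L, NA),
  q2_comm_01_scope_thesis  = c(NA, 4, 0, NA),
  q2_comm_02_scope_project = c(NA, NA, NA, 4),
  q2_comm_03_learn_intern  = c(8, NA, NA, NA),
  q2_comm_04_biography     = c(NA, NA, NA, 2),
  q2_comm_overall_plan     = c(8, 2, 4, NA)
)

# Hypothetical helper: classify one quarter's overall column against its item columns.
check_quarter <- function(data, prefix) {
  overall_col <- paste0(prefix, "_comm_overall_plan")
  quarter_cols <- grep(paste0("^", prefix, "_comm_"), names(data), value = TRUE)
  item_cols <- setdiff(quarter_cols, overall_col)  # selected by name, not position
  items <- unname(as.list(data[item_cols]))
  lo <- do.call(pmin, c(items, na.rm = TRUE))      # row-wise minimum
  hi <- do.call(pmax, c(items, na.rm = TRUE))      # row-wise maximum
  overall <- data[[overall_col]]
  ifelse(overall < lo, "below", ifelse(overall > hi, "above", "within"))
}

# Quarter prefixes ("q1", "q2") taken from the column names, excluding id.
prefixes <- unique(sub("_.*", "", names(df)[-1]))
for (p in prefixes) {
  df[[paste0(p, "_check")]] <- check_quarter(df, p)
}

df[c("id", "q1_check", "q2_check")]
```

Adding a further quarter then only requires columns that follow the same naming pattern; no index needs to be changed.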