Fastest way to populate a data frame with uneven columns

Time: 2021-05-29 14:25:43

Tags: r dataframe matrix

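Similar to this previous question, I'm trying to convert a vector into a data frame in R. I use this trick to turn it into a matrix and then into a data frame, but the problem is that some rows may have a different number of columns, which throws off my data frame. Each row can have any number of values (i.e. not necessarily 3 columns as in the example), so I first check how many columns I need.

For example, given the sample data below, I get a nice, tidy data frame.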

example <- c(
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c")

# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))

data.frame(matrix(example, ncol = ncols[1], byrow = TRUE))

#      X1    X2    X3
# 1 col-a col-b col-c
# 2 col-a col-b col-c
# 3 col-a col-b col-c

This all works well until I get a vector with an extra value in one row (i.e. one that needs an extra column). For example:

example <- c("col-a",
"col-b",
"col-c",
"col-a",
"col-b",
"col-c",
"WATCH OUT!",
"col-a",
"col-b",
"col-c")

# Get the number of values between the repeating start == number of columns
ncols <- diff(grep("col-a", example))

data.frame(matrix(example, ncol = ncols[1], byrow = TRUE))

#           X1    X2    X3
# 1      col-a col-b col-c
# 2      col-a col-b col-c
# 3 WATCH OUT! col-a col-b
# 4      col-c col-a col-b

However, what I actually want is:

#           X1    X2    X3         X4
# 1      col-a col-b col-c         NA
# 2      col-a col-b col-c WATCH OUT!
# 3      col-a col-b col-c         NA

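After checking whether there is an odd number of elements between the first-column elements, I could handle this with a double loop, but that is surely nowhere near optimal.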
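One loop-free way to do the check-and-pad step for the first example is sketched below in base R. It assumes every row starts with the same known marker (`"col-a"` here) and that any extra value comes at the end of its row; the names `block_id`, `blocks`, `width`, `padded` and `result` are made up for illustration.

```r
example <- c("col-a", "col-b", "col-c",
             "col-a", "col-b", "col-c", "WATCH OUT!",
             "col-a", "col-b", "col-c")

# The block id increases by one each time the start marker is seen
block_id <- cumsum(example == "col-a")

# One character vector per block (i.e. per future row)
blocks <- split(example, block_id)

# The widest block determines the number of columns
width <- max(lengths(blocks))

# Extending a vector with `length<-` fills the new slots with NA
padded <- lapply(blocks, `length<-`, width)

# Stack the padded blocks row-wise and name the columns X1, X2, ...
result <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(result) <- paste0("X", seq_len(width))
result
```

This only pads on the right, so the columns line up only when the extra value is the last one in its block. For values that carry a key prefix, matching on the key would be needed instead.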

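An added complication is that the "extra" column could be anywhere, not necessarily last. Edit: the column order is actually arbitrary, so there is no reason the extra column has to sit in the middle; it can be appended at the end. That is one option I have considered: pull it out, then append it afterwards, padded with NA. Text that belongs in the same column is also delimited, so it is clear where each value belongs. The example below has been updated.

Here is some more realistic sample data and the desired output: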

example <- c("name:start",
"date:a",
"value:b",
"name:start",
"date:c",
"desc:WATCH OUT!",
"value:d",
"name:start",
"date:e",
"value:f")

# Desired output
#           X1     X2              X3      X4
# 1 name:start date:a              NA value:b
# 2 name:start date:c desc:WATCH OUT! value:d
# 3 name:start date:e              NA value:f
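Because each value carries its own key, one possible loop-free approach is to compute a row number and a column number for every element and fill a pre-allocated matrix in a single indexed assignment. The sketch below is base R only. It assumes the key is the text before the first colon, that the first key marks the start of a new row, and that no key repeats within a row (a repeated key would silently overwrite the earlier value). The names `keys`, `row_id`, `cols`, `out` and `result` are illustrative.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Key = everything before the first colon
keys <- sub(":.*$", "", example)

# A new row starts whenever the first key reappears
row_id <- cumsum(keys == keys[1])

# Columns in order of first appearance, so late-appearing keys end up last
cols <- unique(keys)

# Pre-fill with NA, then place every value using a two-column index matrix
out <- matrix(NA_character_, nrow = max(row_id), ncol = length(cols),
              dimnames = list(NULL, cols))
out[cbind(row_id, match(keys, cols))] <- example

result <- as.data.frame(out, stringsAsFactors = FALSE)
result
```

The extra `desc` column lands at the end rather than in the middle, which fits the note that column order is arbitrary. Whether this is actually the fastest option would need benchmarking on real data.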

What is the fastest way to handle this?

Thanks in advance!

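Edit: the "chunks" that become rows are well defined, so where a chunk starts and ends is clear and finding the chunk size is not hard, hence my {{ 1}} command (diff(grep(...)) can also be used to similar effect). The WATCH OUT! text can be arbitrary, so it is not as simple as searching for WATCH OUT!.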
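As a small illustration of finding the chunk sizes: `diff(grep(...))` on its own gives the distances between start markers, so it leaves out the size of the last chunk. Appending one position past the end of the vector covers that case. This sketch assumes every chunk begins with an element starting with `"name:"`; `starts` and `sizes` are illustrative names.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# Positions where a new chunk begins
starts <- grep("^name:", example)

# Distance from each start to the next one; the sentinel closes the last chunk
sizes <- diff(c(starts, length(example) + 1L))

sizes       # one size per chunk
max(sizes)  # number of columns needed
```

Taking `max(sizes)` rather than the first element guards against the first chunk being one of the short ones.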

2 answers:

Answer 0: (score: 1)

Does this help?

library(tidyverse)
library(rebus)
#> 
#> Attaching package: 'rebus'
#> The following object is masked from 'package:stringr':
#> 
#>     regex
#> The following object is masked from 'package:ggplot2':
#> 
#>     alpha

example <- c("name:start",
             "date:a",
             "value:b",
             "name:start",
             "date:c",
             "desc:WATCH OUT!",
             "value:d",
             "name:start",
             "date:e",
             "value:f")

example_dirty <- example #i will use it at the end of the script for replacing

custom_pattern <- rebus::or('name:.*', 'date:.', 'value:.') 


alien_text_index <- str_detect(example, pattern = custom_pattern) %>%
    as.character() 
replacement <- which(alien_text_index == 'FALSE') %>%
    `/`(., 3) %>% #in this case every three rows the repetition should start over.
    round() #round for getting an index to modify



example <- str_match(example , pattern = custom_pattern) %>% keep(~!is.na(.))

df <- c('name:.*', 'date:.', 'value:.') %>% 
    map(~example[str_detect(example, .x)])  %>% reduce(bind_cols) %>%
    mutate(..4 = '')
#> New names:
#> * NA -> ...1
#> * NA -> ...2
#> New names:
#> * NA -> ...3





# seq_along() visits every replacement; length() alone would only visit the last one
for (i in seq_along(replacement)) {
    df[replacement[i], 4] <- example_dirty[!as.logical(alien_text_index)][i]
}

df
#> # A tibble: 3 x 4
#>   ...1       ...2   ...3    ..4              
#>   <chr>      <chr>  <chr>   <chr>            
#> 1 name:start date:a value:b ""               
#> 2 name:start date:c value:d "desc:WATCH OUT!"
#> 3 name:start date:e value:f ""

Created on 2021-05-29 by the reprex package (v2.0.0)
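The `/ 3` step in the code above relies on there being exactly three known fields per row, and it can drift once more than one extra value has appeared. A hedged alternative, sketched here in base R, is to take the target row from a running count of start markers. It assumes rows begin with `"name:"` and that the known fields are `name`, `date` and `value`; `is_known` and `row_of` are illustrative names.

```r
example <- c("name:start", "date:a", "value:b",
             "name:start", "date:c", "desc:WATCH OUT!", "value:d",
             "name:start", "date:e", "value:f")

# TRUE for elements that match one of the expected fields
is_known <- grepl("^(name|date|value):", example)

# Row number of every element: count of start markers seen so far
row_of <- cumsum(grepl("^name:", example))

# Rows that the unexpected elements belong to, and the elements themselves
row_of[!is_known]
example[!is_known]
```

These two vectors could stand in for `replacement` and the indexed `example_dirty` values in the loop.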

Answer 1: (score: 1)

I'm not sure whether output in this format is useful

example <- c("name:start",
             "date:a",
             "value:b",
             "name:start",
             "date:c",
             "desc:WATCH OUT!",
             "value:d",
             "name:start",
             "date:e",
             "value:f")
library(tidyverse)

example %>% as.data.frame() %>% setNames('dummy') %>%
  separate(dummy, into=c("name", 'value'), sep = '\\:') %>%
  mutate(rowid = cumsum(name == first(name))) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = value)

#> # A tibble: 3 x 5
#>   rowid name  date  value desc      
#>   <int> <chr> <chr> <chr> <chr>     
#> 1     1 start a     b     <NA>      
#> 2     2 start c     d     WATCH OUT!
#> 3     3 start e     f     <NA>

Or this?


library(tidyverse)

example %>% as.data.frame() %>% setNames('dummy') %>%
  separate(dummy, into=c("name", 'value'), sep = '\\:', remove = FALSE) %>%
  mutate(rowid = cumsum(name == first(name))) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)
#> # A tibble: 3 x 5
#>   rowid name       date   value   desc           
#>   <int> <chr>      <chr>  <chr>   <chr>          
#> 1     1 name:start date:a value:b <NA>           
#> 2     2 name:start date:c value:d desc:WATCH OUT!
#> 3     3 name:start date:e value:f <NA>

Created on 2021-05-30 by the reprex package (v2.0.0)


For your first example, you could do this:

example <- c("col-a",
             "col-b",
             "col-c",
             "col-a",
             "col-b",
             "col-c",
             "WATCH OUT!",
             "col-a",
             "col-b",
             "col-c")
library(tidyverse)

example %>% as.data.frame() %>% setNames('dummy') %>%
  group_by(rowid = cumsum(dummy == first(dummy))) %>%
  mutate(name = paste0('X', row_number())) %>%
  pivot_wider(id_cols = rowid, names_from = name, values_from = dummy)

#> # A tibble: 3 x 5
#> # Groups:   rowid [3]
#>   rowid X1    X2    X3    X4        
#>   <int> <chr> <chr> <chr> <chr>     
#> 1     1 col-a col-b col-c <NA>      
#> 2     2 col-a col-b col-c WATCH OUT!
#> 3     3 col-a col-b col-c <NA>

Created on 2021-05-30 by the reprex package (v2.0.0)
