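R - Convert a character vector to a data frame

Time: 2018-06-05 17:44:02

Tags: r

This seems like it should be a fairly simple problem, but I can't seem to find a simple solution.

I have a character vector that looks like this: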

my_info <- c("Fruits",
             "North America",
             "Apples",
             "Michigan",
             "Europe",
             "Pomegranates",
             "Greece",
             "Oranges",
             "Italy",
             "Vegetables",
             "North America",
             "Potatoes",
             "Idaho",
             "Avocados",
             "California",
             "Europe",
             "Artichokes",
             "Italy",
             "Meats",
             "North America",
             "Beef",
             "Illinois")

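I want to parse this character vector into a data frame that looks like this:

[Image: screenshot of R console showing the desired data frame]

The lists of food types and regions will always stay the same, but the foods and their locations may change.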

food_type <- c("Fruits","Vegetables","Meats")
region <- c("North America","Europe")

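I thought I would need something like str_split, but with food_type and region acting as some kind of delimiters? I'm not sure how to go about it. The character vector does have an order to it.

Thanks.

3 answers:

Answer 0: (score: 1)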

One solution is to first convert the my_info vector into a matrix with ncol = 4. This splits the vector into a matrix/data frame.

Now you can apply the rules for food_type and region, and swap any food_type or region that ends up in one of the other columns.

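Note: I'd ask the OP to check the data once more. It seems that every 4 elements cannot form a complete row under the description the OP has given.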
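The note above can be checked directly. The following is only a sketch in base R using the vectors from the question; the variable name n_headers is made up for illustration. It counts the header entries and shows that the vector length is not a multiple of 4, so a plain 4-column reshape cannot line up with the rows.

```r
my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")
food_type <- c("Fruits", "Vegetables", "Meats")
region <- c("North America", "Europe")

# Header entries appear only once per block, not once per output row
n_headers <- sum(my_info %in% food_type) + sum(my_info %in% region)

length(my_info)        # 22
n_headers              # 8 (3 food types + 5 region headers)
length(my_info) %% 4   # 2, so the vector does not split evenly into rows of 4
```

The remaining 14 entries are the 7 food/location pairs, which is why the other answers carry the headers forward instead of reshaping directly.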

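Answer 1: (score: 0)

I have a rather long solution, but it should work as long as the foods and locations always come in the same order.

First, create a few data.frames with dplyr.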

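library(dplyr)

# data_frame() is deprecated; tibble() is the current equivalent.
# Each lookup table maps a known header to itself so the joins can tag rows.
info <- tibble(my_info = my_info)
region <- tibble(region_id = region, region = region)
food_type <- tibble(food_type_id = food_type, food_type = food_type)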

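Next, build a data.frame that joins all of these together, use tidyr to fill in the missing values, and drop the rows we don't need. The most important trick is the last step: creating a cols column, which relies on the assumption that the order is always the same (food first, then location).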

library(tidyr)

df <- info %>% 
  left_join(food_type, by = c("my_info" = "food_type_id")) %>% 
  left_join(region, by = c("my_info" = "region_id")) %>% 
  fill(food_type) %>% 
  group_by(food_type) %>% 
  fill(region) %>% 
  filter(!is.na(region) & !(my_info == region)) %>% 
  ungroup %>% 
  mutate(cols = rep(c("food", "location"), group_size(.)/2 ))

This returns:

# A tibble: 14 x 4
   my_info      food_type  region        cols    
   <chr>        <chr>      <chr>         <chr>   
 1 Apples       Fruits     North America food    
 2 Michigan     Fruits     North America location
 3 Pomegranates Fruits     Europe        food    
 4 Greece       Fruits     Europe        location
 5 Oranges      Fruits     Europe        food    
 6 Italy        Fruits     Europe        location
 7 Beef         Meats      North America food    
 8 Illinois     Meats      North America location
 9 Potatoes     Vegetables North America food    
10 Idaho        Vegetables North America location
11 Avocados     Vegetables North America food    
12 California   Vegetables North America location
13 Artichokes   Vegetables Europe        food    
14 Italy        Vegetables Europe        location

Next, use tidyr to spread cols into separate food and location columns.

df <- df %>%
  group_by(food_type, region, cols) %>%
  mutate(ind = row_number()) %>% 
  spread(cols, my_info) %>% 
  select(-ind)

# A tibble: 7 x 4
# Groups:   food_type, region [5]
  food_type  region        food         location  
  <chr>      <chr>         <chr>        <chr>     
1 Fruits     Europe        Pomegranates Greece    
2 Fruits     Europe        Oranges      Italy     
3 Fruits     North America Apples       Michigan  
4 Meats      North America Beef         Illinois  
5 Vegetables Europe        Artichokes   Italy     
6 Vegetables North America Potatoes     Idaho     
7 Vegetables North America Avocados     California

All of this can be done in one go by simply dropping the intermediate step that stores the data.frame.
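As a rough sketch of what that single pipeline might look like (not the answerer's exact code): the lookup tables are built inline, tibble() and n() stand in for data_frame() and group_size(.), and an extra ungroup() is added before spread(). It assumes reasonably recent versions of dplyr and tidyr.

```r
library(dplyr)
library(tidyr)

my_info <- c("Fruits", "North America", "Apples", "Michigan",
             "Europe", "Pomegranates", "Greece", "Oranges", "Italy",
             "Vegetables", "North America", "Potatoes", "Idaho",
             "Avocados", "California", "Europe", "Artichokes", "Italy",
             "Meats", "North America", "Beef", "Illinois")
food_type <- c("Fruits", "Vegetables", "Meats")
region <- c("North America", "Europe")

result <- tibble(my_info = my_info) %>%
  # tag header rows by joining against the known food types and regions
  left_join(tibble(food_type_id = food_type, food_type = food_type),
            by = c("my_info" = "food_type_id")) %>%
  left_join(tibble(region_id = region, region = region),
            by = c("my_info" = "region_id")) %>%
  # carry the headers down to the rows that follow them
  fill(food_type) %>%
  group_by(food_type) %>%
  fill(region) %>%
  # drop the header rows themselves
  filter(!is.na(region) & my_info != region) %>%
  ungroup() %>%
  # remaining rows alternate food, location (assumes a fixed order)
  mutate(cols = rep(c("food", "location"), n() / 2)) %>%
  group_by(food_type, region, cols) %>%
  mutate(ind = row_number()) %>%
  ungroup() %>%
  spread(cols, my_info) %>%
  select(-ind)

result
```

The result should have the same seven food/location rows as the two-step version above, although row order may vary between package versions.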

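Answer 2: (score: 0)

Here are three alternatives. All of them use na.locf0 or na.locf from the zoo package, together with the cn vector, which is defined only in the first one.

1) Let cn be a vector of the same length as my_info that gives, for each element of my_info, the number of the output column it belongs to. Let cdef be the output column definition vector 1:4, with the output column names as its names. Then, for each output column, create a vector of the same length as my_info that keeps the elements belonging to that column and sets all other elements to NA. Finally, use na.locf0 to fill in the NA values and take the elements at the positions that correspond to column 4.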

library(zoo)

cn <- (my_info %in% food_type) + 2 * (my_info %in% region)
cn[cn == 0] <- 3:4

cdef <- c(food_type = 1, region = 2, food = 3, location = 4)

m <- sapply(cdef, function(i) na.locf0(ifelse(cn == i, my_info, NA))[cn == 4])

giving:

> m
     food_type    region          food           location    
[1,] "Fruits"     "North America" "Apples"       "Michigan"  
[2,] "Fruits"     "Europe"        "Pomegranates" "Greece"    
[3,] "Fruits"     "Europe"        "Oranges"      "Italy"     
[4,] "Vegetables" "North America" "Potatoes"     "Idaho"     
[5,] "Vegetables" "North America" "Avocados"     "California"
[6,] "Vegetables" "Europe"        "Artichokes"   "Italy"     
[7,] "Meats"      "North America" "Beef"         "Illinois"  

We created a character matrix as the output because the output is entirely character, but if you want a data frame, use:

as.data.frame(m, stringsAsFactors = FALSE)
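For readers who have not met na.locf0 before, here is a small base-R sketch of the "last observation carried forward" idea that alternative 1 relies on. The helper name locf is hypothetical and is not part of zoo; it only illustrates the fill-down behavior on a toy vector.

```r
# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA)
locf <- function(x) {
  # position of the most recent non-NA element, or 0 if none seen yet
  idx <- cummax(seq_along(x) * (!is.na(x)))
  idx[idx == 0] <- NA
  x[idx]
}

x <- c(NA, "Fruits", NA, NA, "Vegetables", NA)
locf(x)
# [1] NA           "Fruits"     "Fruits"     "Fruits"     "Vegetables" "Vegetables"
```

In the answer, this fill-down is applied once per output column, and only the positions where cn == 4 are kept, because by then every column has a value for that row.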

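2) Alternatively, we can create m from cn by putting my_info[i] into position (i, cn[i]) of an n x 4 matrix, using na.locf to fill in the NAs, and taking the rows that correspond to column 4.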

n <- length(my_info)
m2 <- na.locf(replace(matrix(NA, n, 4), cbind(1:n, cn), my_info))[cn == 4, ]
colnames(m2) <- c("food_type", "region", "food", "location")

identical(m2, m) # test
## [1] TRUE
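The key step in alternative 2 is indexing a matrix with a two-column matrix of (row, column) pairs. A toy sketch of just that step, with made-up values and the hypothetical names vals, col_of and demo, might look like this:

```r
vals   <- c("Fruits", "Apples", "Michigan", "Oranges", "Italy")
col_of <- c(1, 2, 3, 2, 3)   # hypothetical target column for each value

n <- length(vals)
demo <- matrix(NA_character_, n, 3)

# Each row of cbind(...) is one (row, column) position; element i of vals
# lands at row i, column col_of[i]. Everything else stays NA.
demo[cbind(seq_len(n), col_of)] <- vals
demo
```

Each row of demo then holds exactly one value in its assigned column, which is the sparse layout that na.locf fills downward in the answer's code.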

3) A third way to create m from cn is to build the matrix column by column, like this:

m3 <- cbind( food_type = na.locf0(ifelse(cn == 1, my_info, NA))[cn == 3], 
        region = na.locf0(ifelse(cn == 2, my_info, NA))[cn == 3], 
        food = my_info[cn == 3], 
        location = my_info[cn == 4])

identical(m, m3) # test
## [1] TRUE