Question

My data frame has two columns, value and article_topics, as shown below:

 str(myData)
Classes ‘tbl_df’ and 'data.frame':  10 obs. of  2 variables:
 $ value         : num  288 253 967 36769 2769 ...
 $ article_topics:List of 10
  ..$ : logi NA
  ..$ : logi NA
  ..$ : chr  "art and entertainment" "music" "style and fashion" "clothing" ...
  ..$ : chr  "hobbies and interests" "guitar" "art and entertainment" "music" ...
  ..$ : logi NA
  ..$ : chr  "pets" "large animals" "sports" "fishing" ...
  ..$ : chr "health and fitness"
  ..$ : chr  "style and fashion" "clothing" "shirts"
  ..$ : logi NA
  ..$ : logi NA

I want to unlist article_topics so that I end up with one observation (row) per article topic.

To take a simpler example, this basically means transforming this:

value        article_topics
10       “Hello” , “This is an example”

into this:

value           article_topics
10                “Hello”
10                “This is an example”

Here is the dataset:

structure(list(value = c(288, 253, 967, 36769, 2769, 541, 17, 
889, 532, 2621), article_topics = list(NA, NA, c("art and entertainment", 
"music", "style and fashion", "clothing", "lingerie", "movies and tv", 
"movies"), c("hobbies and interests", "guitar", "art and entertainment", 
"music", "musical instruments", "guitars", "technology and computing", 
"consumer electronics", "telephones", "mobile phones", "smart phones"
), NA, c("pets", "large animals", "sports", "fishing", "freshwater fishing"
), "health and fitness", c("style and fashion", "clothing", "shirts"
), NA, NA)), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-10L), .Names = c("value", "article_topics"))

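I have been trying melt from reshape2 and gather from tidyr. However, they do not seem to work with this structure, or I could not figure out how to use them.

I found a partial solution: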

library(splitstackshape)
cSplit(myData, 'article_topics', ',', 'long')
   value             article_topics
 1:   288                         NA
 2:   253                         NA
 3:   967  c("art and entertainment"
 4:   967                    "music"
 5:   967        "style and fashion"
 6:   967                 "clothing"
 7:   967                 "lingerie"
 8:   967            "movies and tv"
 9:   967                  "movies")
10: 36769  c("hobbies and interests"
11: 36769                   "guitar"
12: 36769    "art and entertainment"
13: 36769                    "music"
14: 36769      "musical instruments"
15: 36769                  "guitars"
16: 36769 "technology and computing"
17: 36769     "consumer electronics"
18: 36769               "telephones"
19: 36769            "mobile phones"
20: 36769            "smart phones")
21:  2769                         NA
22:   541                   c("pets"
23:   541            "large animals"
24:   541                   "sports"
25:   541                  "fishing"
26:   541      "freshwater fishing")
27:    17         health and fitness
28:   889      c("style and fashion"
29:   889                 "clothing"
30:   889                  "shirts")
31:   532                         NA
32:  2621                         NA

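The next step would be to use something like stringr to strip out the c( and ). However, that does not seem like a good approach to me. Any help is welcome.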

Answer 1

You can use unnest. Try:

library(tidyr)
unnest(myData, article_topics)

Sample output:

> head(unnest(myData, article_topics))
Source: local data frame [6 x 2]

  value        article_topics
1   288                    NA
2   253                    NA
3   967 art and entertainment
4   967                 music
5   967     style and fashion
6   967              clothing
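
To make the behaviour easy to check on a small input, here is a minimal sketch on a toy data frame. The `toy` and `long` names and the values are made up for illustration, and the `cols =` argument assumes tidyr 1.0.0 or later (older versions take the column name directly, as above).

```r
library(tidyr)

# Hypothetical toy data: a numeric column plus a list column whose
# elements are character vectors of different lengths (or a lone NA).
toy <- tibble::tibble(
  value = c(10, 20, 30),
  article_topics = list(c("Hello", "This is an example"), NA, "single topic")
)

# One output row per element of each list entry; `value` is repeated
# alongside each topic.
long <- unnest(toy, cols = article_topics)
print(long)
```

The row whose list entry is a lone NA is kept as a single row with a missing topic, which matches the first two rows of the sample output above.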

Alternatively, you can try listCol_l from my "splitstackshape" package. It is not compatible with tbl_df, however, so you first need to unclass the data.

Try:

library(splitstackshape)
listCol_l(unclass(myData), "article_topics")[]
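
If neither package is at hand, the same reshaping can be sketched in base R. This is not what unnest or listCol_l do internally, just a plain alternative using lengths, rep and unlist; the `toy`, `n` and `long` names and the values are made up for illustration.

```r
# Hypothetical toy data with a list column, built in base R.
toy <- data.frame(value = c(10, 20, 30))
toy$article_topics <- list(c("Hello", "This is an example"), NA, "single topic")

# lengths() gives the number of topics in each row; rep() repeats each
# value that many times so it lines up with the flattened topics.
n <- lengths(toy$article_topics)
long <- data.frame(
  value = rep(toy$value, times = n),
  article_topics = unlist(toy$article_topics),
  stringsAsFactors = FALSE
)
print(long)
```

Note that a row whose list entry has length zero would disappear from the result, since its value is repeated zero times; rows holding a lone NA are kept.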
