Question

I have a data frame in which the genre column contains data like this:

genres:

[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]
[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]

I need the words that follow the string "'name':", laid out like this:

genre1    |   genre2   |  genre3 
Animation |   Comedy   |  Family 
Adventure |   Fantasy  |  Family 
Comedy    |   Drama    |  Romance

I tried the str_split_fixed option, but the results are not what I expected. Any direction would be helpful.

Answer 1

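This looks like improper ndjson, since it uses single quotes. It is workable, though, so we can parse it as JSON (with a fix). Trying to parse JSON with regular expressions is likely to bring you pain, frustration, and regret.

Data: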

vec <- c("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",
"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",
"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]")

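Brute force with regex

I won't even try: Regex for parsing single key: values out of JSON in Javascript. There are other Q/As for other languages, and they all say the same thing: don't try. You can devise a regex that handles "perfect" and "identically structured" json, but as soon as a valid-but-different json structure shows up, you're done.

Avoid this.

(Using any string-splitting function is effectively trying to do the job with a plain/fixed regex. I don't mean to disparage str_split_fixed, strsplit, ... relative to complex regex operations; my point is that there are great json parsers out there, and they are better/faster/more robust than any string splitter we can concoct.)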

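Brute force with `jsonlite::fromJSON`

This first cut of code just deals with the vector as-is. It is a little inefficient, since it calls fromJSON on each element independently. If your vector is short, that may not be a problem. If it takes a while, you may want to move on to stream_in. (Note that this does not treat the data like ndjson.)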

ret <- lapply(gsub("'",'"',vec), jsonlite::fromJSON)
str(ret)
# List of 4
#  $ :'data.frame': 3 obs. of  2 variables:
#   ..$ id  : int [1:3] 16 35 10751
#   ..$ name: chr [1:3] "Animation" "Comedy" "Family"
#  $ :'data.frame': 3 obs. of  2 variables:
#   ..$ id  : int [1:3] 12 14 10751
#   ..$ name: chr [1:3] "Adventure" "Fantasy" "Family"
#  $ :'data.frame': 2 obs. of  2 variables:
#   ..$ id  : int [1:2] 10749 35
#   ..$ name: chr [1:2] "Romance" "Comedy"
#  $ :'data.frame': 3 obs. of  2 variables:
#   ..$ id  : int [1:3] 35 18 10749
#   ..$ name: chr [1:3] "Comedy" "Drama" "Romance"

To extract just the names, this works:

lapply(ret, `[[`, "name")
# [[1]]
# [1] "Animation" "Comedy"    "Family"   
# [[2]]
# [1] "Adventure" "Fantasy"   "Family"   
# [[3]]
# [1] "Romance" "Comedy" 
# [[4]]
# [1] "Comedy"  "Drama"   "Romance"
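
One caveat with the quote swap used above: it assumes that no genre name contains an apostrophe of its own, which holds for the sample data but is my assumption about the full data set. A minimal sketch of a more defensive wrapper follows; `parse_genres` is a hypothetical helper name, not part of jsonlite, and it returns NA for any string that still fails to parse instead of stopping the whole run.

```r
# Hypothetical helper (requires the jsonlite package).
# Swaps single quotes for double quotes, parses the result as JSON,
# and returns the "name" field; unparseable input yields NA.
parse_genres <- function(x) {
  fixed <- gsub("'", '"', x, fixed = TRUE)
  tryCatch(
    jsonlite::fromJSON(fixed)[["name"]],
    error = function(e) NA_character_
  )
}

good <- "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]"
# Made-up record whose name contains an apostrophe, so the swap breaks the JSON.
bad  <- "[{'id': 1, 'name': 'Children's'}]"

parse_genres(good)
parse_genres(bad)
```

Rows that come back as NA can then be inspected by hand rather than silently mis-split.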

Note that the columnar output format from your example won't work unless every vector is extended to the same length (for that, see https://stackoverflow.com/a/34570893/3358272).

ret2 <- lapply(ret, `[[`, "name")
ret2 <- lapply(ret2, `length<-`, max(lengths(ret2)))
do.call(rbind, ret2)
#      [,1]        [,2]      [,3]     
# [1,] "Animation" "Comedy"  "Family" 
# [2,] "Adventure" "Fantasy" "Family" 
# [3,] "Romance"   "Comedy"  NA       
# [4,] "Comedy"    "Drama"   "Romance"

(This is a matrix; you can take it from there.)
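
As one way to take it from there, the sketch below turns the padded matrix into a data frame with the column names from the question. It starts from the per-record name vectors shown earlier so that it runs on its own; the `genre1`..`genreN` naming is simply copied from the question's layout, and the variable names are illustrative.

```r
# Names per record, as produced by lapply(ret, `[[`, "name") earlier.
names_list <- list(
  c("Animation", "Comedy", "Family"),
  c("Adventure", "Fantasy", "Family"),
  c("Romance", "Comedy"),
  c("Comedy", "Drama", "Romance")
)

# Pad every vector with NA up to the longest length, then stack them as rows.
width  <- max(lengths(names_list))
padded <- lapply(names_list, `length<-`, width)
m      <- do.call(rbind, padded)

# Convert to a data frame and label the columns genre1, genre2, ...
genres <- as.data.frame(m, stringsAsFactors = FALSE)
names(genres) <- paste0("genre", seq_len(ncol(genres)))
genres
```

Records with fewer genres than the widest one get NA in the trailing columns, as in the matrix above.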

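Using `jsonlite::stream_in`

A little faster

(This treats your data like NDJSON. That is a nuance and not important here.)

If you need to speed things up, you can use jsonlite::stream_in on the original raw file (preferred) or, if you don't have the original file, we can fake it with textConnection (slightly less efficient than reading the original file, but it should still be faster than jsonlite::fromJSON).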

jsonlite::stream_in(textConnection(paste(gsub("'",'"',vec), collapse="\n")))
#  Imported 4 records. Simplifying...
# Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
#   arguments imply differing number of rows: 3, 2

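I left that error in here to show that jsonlite's default is to try to simplify lists into frames. That doesn't always work: in this case it is trying to bind the records into one frame, and the third record (which has one fewer element) causes the problem. We can mitigate this by turning off data-frame simplification.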

str(
  ret <- jsonlite::stream_in(textConnection(paste(gsub("'",'"',vec), collapse="\n")),
                             simplifyDataFrame=FALSE)
)
#  Imported 4 records. Simplifying...
# List of 4
#  $ :List of 3
#   ..$ :List of 2
#   .. ..$ id  : int 16
#   .. ..$ name: chr "Animation"
#   ..$ :List of 2
#   .. ..$ id  : int 35
#   .. ..$ name: chr "Comedy"
#   ..$ :List of 2
#   .. ..$ id  : int 10751
#   .. ..$ name: chr "Family"
#  $ :List of 3
#   ..$ :List of 2
#   .. ..$ id  : int 12
#   .. ..$ name: chr "Adventure"
#   ..$ :List of 2
#   .. ..$ id  : int 14
#   .. ..$ name: chr "Fantasy"
#   ..$ :List of 2
#   .. ..$ id  : int 10751
#   .. ..$ name: chr "Family"
#  $ :List of 2
#   ..$ :List of 2
#   .. ..$ id  : int 10749
#   .. ..$ name: chr "Romance"
#   ..$ :List of 2
#   .. ..$ id  : int 35
#   .. ..$ name: chr "Comedy"
#  $ :List of 3
#   ..$ :List of 2
#   .. ..$ id  : int 35
#   .. ..$ name: chr "Comedy"
#   ..$ :List of 2
#   .. ..$ id  : int 18
#   .. ..$ name: chr "Drama"
#   ..$ :List of 2
#   .. ..$ id  : int 10749
#   .. ..$ name: chr "Romance"

This is a little different from the first attempt, but the names are not at all hard to extract.

lapply(ret, sapply, `[[`, "name")
# [[1]]
# [1] "Animation" "Comedy"    "Family"   
# [[2]]
# [1] "Adventure" "Fantasy"   "Family"   
# [[3]]
# [1] "Romance" "Comedy" 
# [[4]]
# [1] "Comedy"  "Drama"   "Romance"

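(Use the same steps as above to column-ize the contents.)

This call looks odd because it is two *apply calls in one. It is equivalent to (but shorter than):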

lapply(ret, function(x) sapply(x, `[[`, "name"))
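
For the file-based route recommended above, a minimal sketch might look like the following. It writes the quote-fixed sample records to a temporary file only so that the example is self-contained; with real data you would point `file()` at your own file, assuming it holds one valid JSON array per line. The temporary file and the variable names are mine, not part of the original answer.

```r
# Requires the jsonlite package.
vec <- c(
  "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
  "[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",
  "[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",
  "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]"
)

# Stand-in for the original file: one JSON array per line, double-quoted.
tmp <- tempfile(fileext = ".ndjson")
writeLines(gsub("'", '"', vec, fixed = TRUE), tmp)

# Stream the file in; keep nested lists rather than simplifying to frames.
ret_file <- jsonlite::stream_in(file(tmp), simplifyDataFrame = FALSE, verbose = FALSE)

# Pull the names out of each record, exactly as before.
genre_names <- lapply(ret_file, function(x) sapply(x, `[[`, "name"))
genre_names
```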
