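Read a JSON file into a data.frame without nested lists

Time: 2016-02-16 23:20:56

Tags: json r jsonlite

I am trying to load a JSON file into a data.frame in R. I have had some luck with the fromJSON function from the jsonlite package, but I get nested lists and am not sure how to flatten the input into a two-dimensional data.frame. jsonlite reads the file in as a data.frame, but leaves nested lists inside some of the variables.

Does anyone have any tips for loading a JSON file into a data.frame when it reads in with nested lists?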

#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")

# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 

# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )

# but many of the columns are still lists:
sapply( providers , class)

# Some of these lists have a single level
head( providers$facility_type )

# Some have lot more than two - for example nine
providers[ , 6][[1]]

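I would like one row per npi, with separate columns for each slice of the individual lists, so that the data frame has cols for "plan_id_type", "plan_id", and "network_tier" nine times over, maybe with colnames that run from 0 to 8. I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file into two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would love to load the file into a data.frame with this structure using the fromJSON function.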
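As a rough illustration of the layout being asked for, here is a minimal base R sketch on made-up data. The names providers_toy and widen_plans, and all plan values, are hypothetical and not taken from the real file; the sketch only shows one possible way to spread a list column of small data.frames into zero-indexed columns, one row per npi.

```r
# Toy stand-in for the real data (hypothetical values): two providers,
# each carrying a data.frame of plans in a list column.
providers_toy <- data.frame(npi = c("111", "222"), stringsAsFactors = FALSE)
providers_toy$plans <- list(
  data.frame(plan_id_type = c("HIOS", "HIOS"), plan_id = c("A1", "A2"),
             network_tier = c("PREFERRED", "PREFERRED"),
             stringsAsFactors = FALSE),
  data.frame(plan_id_type = "HIOS", plan_id = "B1",
             network_tier = "PREFERRED", stringsAsFactors = FALSE)
)

# Hypothetical helper: turn one plans data.frame into a single wide row,
# padding each field with NA up to n_slots and suffixing names with 0, 1, ...
widen_plans <- function(plans, n_slots) {
  cols <- lapply(names(plans), function(field) {
    vals <- plans[[field]]
    length(vals) <- n_slots  # pads with NA when there are fewer plans
    setNames(as.list(vals), paste0(field, "_", seq_len(n_slots) - 1L))
  })
  as.data.frame(do.call(c, cols), stringsAsFactors = FALSE)
}

n_slots <- max(vapply(providers_toy$plans, nrow, integer(1)))
wide <- cbind(
  providers_toy["npi"],
  do.call(rbind, lapply(providers_toy$plans, widen_plans, n_slots = n_slots))
)
print(wide)
```

The second provider has only one plan, so its "_1" columns are filled with NA rather than being dropped.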

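Here are some of the things I have tried. I thought of two approaches. First: use a different function to read in the JSON file. I looked at rjson, but that reads in a list:

library( RCurl ) # provides getURL()
library( rjson )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )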

I also tried RJSONIO, following this question: Getting imported json data into a data frame in R

library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )

json_file <- lapply(providers, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})

# but When converting the lists to a data.frame I get an error
a <- do.call("rbind", json_file)

So the second approach I tried is to convert all of the lists into variables in the data.frame:

detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )

library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE ) 
providers <- flatten( providers )

I can pull out one of the lists, but because of the missing values I can't merge it back into my data frame:

a <- data.frame(Reduce(rbind,  providers$facility_type))
length( a ) == nrow( providers )

I also tried the suggestions in "Converting nested list to dataframe", along with a few other things, but had no luck:

a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )

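Any help is greatly appreciated.

4 Answers:

Answer 0: (score: 12)

Update: 21 February 2016

col_fixer has been updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.

In the data.frame you downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns whose items are either NULL or themselves flat vectors. There are list columns whose list elements are data.frames. And there is a column that is itself a data.frame with the same number of rows as the main data.frame.

Here is a sample dataset that recreates those conditions: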

mydf <- data.frame(id = 1:3, type = c("A", "A", "B"), 
                   facility = I(list(c("x", "y"), NULL, "x")),
  address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3), 
                   data.frame(v1 = 1:2, v2 = 3:4, v3 = 5), 
                   data.frame(v1 = 1, v2 = NA, v3 = 3))))

mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))
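
To make the four column kinds concrete, the following base R sketch labels each column of the sample data. column_kind is a hypothetical helper written for this illustration only; it is not part of jsonlite or data.table, and it assumes a list column holds either only vectors or only data.frames.

```r
# Recreate the sample data from the answer so the block is self-contained.
mydf <- data.frame(id = 1:3, type = c("A", "A", "B"),
                   facility = I(list(c("x", "y"), NULL, "x")),
                   address = I(list(data.frame(v1 = 1, v2 = 2, v4 = 3),
                                    data.frame(v1 = 1:2, v2 = 3:4, v3 = 5),
                                    data.frame(v1 = 1, v2 = NA, v3 = 3))))
mydf$person <- data.frame(name = c("AA", "BB", "CC"), age = c(20, 32, 23),
                          preference = c(TRUE, FALSE, TRUE))

# Hypothetical helper: report which kind of column we are looking at.
column_kind <- function(col) {
  if (is.data.frame(col)) return("data.frame column")
  if (!is.list(col)) return("atomic column")
  non_null <- Filter(Negate(is.null), col)
  if (length(non_null) > 0L &&
      all(vapply(non_null, is.data.frame, logical(1)))) {
    "list of data.frames"
  } else {
    "list of vectors"
  }
}

kinds <- vapply(mydf, column_kind, character(1))
print(kinds)
```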

The str of this sample data.frame looks like this:

str(mydf)
## 'data.frame':    3 obs. of  5 variables:
##  $ id      : int  1 2 3
##  $ type    : Factor w/ 2 levels "A","B": 1 1 2
##  $ facility:List of 3
##   ..$ : chr  "x" "y"
##   ..$ : NULL
##   ..$ : chr "x"
##   ..- attr(*, "class")= chr "AsIs"
##  $ address :List of 3
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: num 2
##   .. ..$ v4: num 3
##   ..$ :'data.frame': 2 obs. of  3 variables:
##   .. ..$ v1: int  1 2
##   .. ..$ v2: int  3 4
##   .. ..$ v3: num  5 5
##   ..$ :'data.frame': 1 obs. of  3 variables:
##   .. ..$ v1: num 1
##   .. ..$ v2: logi NA
##   .. ..$ v3: num 3
##   ..- attr(*, "class")= chr "AsIs"
##  $ person  :'data.frame':    3 obs. of  3 variables:
##   ..$ name      : Factor w/ 3 levels "AA","BB","CC": 1 2 3
##   ..$ age       : num  20 32 23
##   ..$ preference: logi  TRUE FALSE TRUE
## NULL

One way you can "flatten" this is to "fix" the list columns. There are three fixes.

  1. flatten (from "jsonlite") will take care of columns like the "person" column.
  2. Columns like the "facility" column can be fixed using toString, which converts each element to a comma-separated item, or they can be converted into multiple columns.
  3. Columns holding data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together.)

We can take care of the second and third points with a function like the following:

col_fixer <- function(x, vec2col = FALSE) {
  if (!is.list(x[[1]])) {
    if (isTRUE(vec2col)) {
      as.data.table(data.table::transpose(x))
    } else {
      vapply(x, toString, character(1L))
    }
  } else {
    temp <- rbindlist(x, use.names = TRUE, fill = TRUE, idcol = TRUE)
    temp[, .time := sequence(.N), by = .id]
    value_vars <- setdiff(names(temp), c(".id", ".time"))
    dcast(temp, .id ~ .time, value.var = value_vars)[, .id := NULL]
  }
}

We'll integrate that and the flatten function in another function that does most of the processing:

Flattener <- function(indf, vec2col = FALSE) {
  require(data.table)
  require(jsonlite)
  indf <- flatten(indf)
  listcolumns <- sapply(indf, is.list)
  newcols <- do.call(cbind, lapply(indf[listcolumns], col_fixer, vec2col))
  indf[listcolumns] <- list(NULL)
  cbind(indf, newcols)
}

Running the function gives us:

Flattener(mydf)
##   id type person.name person.age person.preference facility address.v1_1
## 1  1    A          AA         20              TRUE     x, y            1
## 2  2    A          BB         32             FALSE                     1
## 3  3    B          CC         23              TRUE        x            1
##   address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2 address.v3_1
## 1           NA            2           NA            3           NA           NA
## 2            2            3            4           NA           NA            5
## 3           NA           NA           NA           NA           NA            3
##   address.v3_2
## 1           NA
## 2            5
## 3           NA

Or, with the vectors going into separate columns:

Flattener(mydf, TRUE)
##   id type person.name person.age person.preference facility.V1 facility.V2
## 1  1    A          AA         20              TRUE           x           y
## 2  2    A          BB         32             FALSE        <NA>        <NA>
## 3  3    B          CC         23              TRUE           x        <NA>
##   address.v1_1 address.v1_2 address.v2_1 address.v2_2 address.v4_1 address.v4_2
## 1            1           NA            2           NA            3           NA
## 2            1            2            3            4           NA           NA
## 3            1           NA           NA           NA           NA           NA
##   address.v3_1 address.v3_2
## 1           NA           NA
## 2            5            5
## 3            3           NA

On the "providers" object, this runs very quickly.
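
The two options for a list column of flat vectors (one string per row, or one column per element) can also be sketched in base R alone. This is only an illustration of the idea on a toy vector; it is not the data.table-based col_fixer from the answer, and the variable names are made up.

```r
# A list column of flat vectors, with a NULL entry, as in the sample data.
facility <- list(c("x", "y"), NULL, "x")

# Option 1: collapse every element to a single comma-separated string.
# toString(NULL) yields an empty string.
as_string <- vapply(facility, toString, character(1L))

# Option 2: pad every element with NA to a common width and stack the
# padded vectors as rows, giving one column per position.
width <- max(lengths(facility))
as_columns <- t(vapply(facility, function(v) {
  v <- as.character(v)  # NULL becomes character(0)
  length(v) <- width    # pads with NA
  v
}, character(width)))

print(as_string)
print(as_columns)
```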

Answer 1: (score: 11)

My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:

##--------------------------------------
## libraries
##--------------------------------------
library(rjson);
library(RCurl);

##--------------------------------------
## get data
##--------------------------------------
URL <- 'https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json';
jsonRList <- fromJSON(getURL(URL)); ## recursive list representing the original JSON data

Next, to gain insight into the structure and cleanliness of the data, I wrote a set of helper functions:

##--------------------------------------
## helper functions
##--------------------------------------
## apply a function to a set of nodes at the same depth level in a recursive list structure
levelApply <- function(
    nodes, ## the root node of the list (recursive calls pass deeper nodes as they drill down into the list)
    keyList, ## another list, expected to hold a sequence of keys (component names, integer indexes, or NULL for all) specifying which nodes to select at each depth level
    func=identity, ## a function to run separately on each node once keyList has been exhausted
    ..., ## further arguments passed to func()
    joinFunc=NULL ## optional function for joining the return values of func() at each successive depth, as the stack is unwound. An alternative is calling unlist() on the result, but careful not to lose the top-level index association
) {
    if (length(keyList) == 0L) {
        ret <- if (is.null(nodes)) NULL else func(nodes,...)
    } else if (is.null(keyList[[1L]]) || length(keyList[[1L]]) != 1L) {
        ret <- lapply(if (is.null(keyList[[1L]])) nodes else nodes[keyList[[1L]]],levelApply,keyList[-1L],func,...,joinFunc=joinFunc);
        if (!is.null(joinFunc))
            ret <- do.call(joinFunc,ret);
    } else {
        ret <- levelApply(nodes[[keyList[[1L]]]],keyList[-1L],func,...,joinFunc=joinFunc);
    }; ## end if
    ret;
}; ## end levelApply()
## these two wrappers automatically attempt to simplify the results of func() to a vector or matrix/data.frame, respectively
levelApplyToVec <- function(...) levelApply(...,joinFunc=c);
levelApplyToFrame <- function(...) levelApply(...,joinFunc=rbind); ## can return matrix or data.frame, depending on ret

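The key to understanding the above is the keyList parameter. Let's say you have a list like this:

list(NULL,'addresses',2:3,'city')

That would select all city strings underneath the second and third address elements, underneath the addresses list, underneath all elements of the main list.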
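
To see what such a key list selects, here is the same selection written out by hand with nested lapply() calls on a tiny made-up list. The toy data and variable names are assumptions for illustration; the real levelApply() generalises this to any depth.

```r
# Toy stand-in for the parsed JSON: two providers with three addresses each.
toy <- list(
  list(npi = "1", addresses = list(list(city = "Fairbanks"),
                                   list(city = "Healy"),
                                   list(city = "Nenana"))),
  list(npi = "2", addresses = list(list(city = "Anchorage"),
                                   list(city = "Juneau"),
                                   list(city = "Sitka")))
)

# Hand-written equivalent of list(NULL, 'addresses', 2:3, 'city'):
#   NULL        -> every element of the main list
#   'addresses' -> that component of each element
#   2:3         -> the second and third address
#   'city'      -> the city string of each selected address
cities <- unlist(lapply(toy, function(provider) {
  lapply(provider[["addresses"]][2:3], function(address) address[["city"]])
}))
print(cities)
```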

There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each matching node and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:

unname(levelApplyToVec(jsonRList,list(4L,'addresses',1:2,c('address','city'))));
## [1] "1001 Noble St"  "Fairbanks"      "1650 Cowles St" "Fairbanks"

Here are the remaining helper functions I wrote in the course of working on this problem:

## for the given node selection key union, retrieve a data.frame of logicals representing the unique combinations of keys possessed by the selected nodes, possibly with a count
keyCombos <- function(node,keyList,allKeys) `rownames<-`(setNames(unique(as.data.frame(levelApplyToFrame(node,keyList,function(h) allKeys%in%names(h)))),allKeys),NULL);
keyCombosWithCount <- function(node,keyList,allKeys) { ks <- keyCombos(node,keyList,allKeys); ks$.count <- unname(apply(ks,1,function(combo) sum(levelApplyToVec(node,keyList,function(h) identical(sort(names(ks)[combo]),sort(names(h))))))); ks; };

## return a simple two-component list with type (list, namedlist, or atomic vector type) and len for non-namedlist types; tlStr() returns a nice stringified form of said list
tl <- function(e) { if (is.null(e)) return(NULL); ret <- typeof(e); if (ret == 'list' && !is.null(names(e))) ret <- list(type='namedlist') else ret <- list(type=ret,len=length(e)); ret; };
tlStr <- function(e) { if (is.null(e)) return(NA); ret <- tl(e); if (is.null(ret$len)) ret <- ret$type else ret <- paste0(ret$type,'[',ret$len,']'); ret; };

## stringification functions for display
mkcsv <- function(v) paste0(collapse=',',v);
keyListToStr <- function(keyList) paste0(collapse='','/',sapply(keyList,function(key) if (is.null(key)) '*' else paste0(collapse=',',key)));

## return a data.frame giving a comma-separated list of the unique types possessed by the selected nodes; useful for learning about the structure of the data
keyTypes <- function(node,keyList,allKeys) data.frame(key=allKeys,tl=sapply(allKeys,function(key) mkcsv(unique(na.omit(levelApplyToVec(node,c(keyList,key),tlStr))))),row.names=NULL);

## useful for testing; can call npiToFrame() to show the row with a specified npi value, in a nice vertical form
rowToFrame <- function(dfrow) data.frame(column=names(dfrow),value=c(as.matrix(dfrow)));
getNPIRow <- function(df,npi) which(df$npi == npi);
npiToFrame <- function(df,npi) rowToFrame(df[getNPIRow(df,npi),]);

I tried to capture the sequence of commands I ran against the data as I was first examining it. Below is the result, showing the commands I ran, the command output, leading comments describing my intent, and the conclusions I drew from the output:

##--------------------------------------
## data examination
##--------------------------------------
## type of object -- plain unnamed list => array, length 3256
levelApplyToVec(jsonRList,list(),tlStr);
## [1] "list[3256]"
## unique types of main array elements => all named lists => hashes
unique(levelApplyToVec(jsonRList,list(NULL),tlStr));
## [1] "namedlist"
## get the union of keys among all hashes
allKeys <- unique(levelApplyToVec(jsonRList,list(NULL),names)); allKeys;
##  [1] "npi" "type" "facility_name" "facility_type" "addresses" "plans" "last_updated_on" "name" "speciality" "accepting" "languages" "gender"
## get the unique pattern of keys among all hashes, and how often each occurs => shows there are inconsistent key sets among the top-level hashes
keyCombosWithCount(jsonRList,list(NULL),allKeys);
##    npi type facility_name facility_type addresses plans last_updated_on  name speciality accepting languages gender .count
## 1 TRUE TRUE          TRUE          TRUE      TRUE  TRUE            TRUE FALSE      FALSE     FALSE     FALSE  FALSE    279
## 2 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE   TRUE   2973
## 3 TRUE TRUE         FALSE         FALSE      TRUE  TRUE            TRUE  TRUE       TRUE      TRUE      TRUE  FALSE      4
## for each key, get the unique set of types it takes on among all hashes, ignoring hashes where the key is omitted => some scalar strings, some multi-string, addresses is a variable-length list, plans is length-9 list, and name is a hash
keyTypes(jsonRList,list(NULL),allKeys);
##                key                                                                                        tl
## 1              npi                                                                              character[1]
## 2             type                                                                              character[1]
## 3    facility_name                                                                              character[1]
## 4    facility_type                                                    character[1],character[2],character[3]
## 5        addresses list[1],list[2],list[3],list[6],list[5],list[7],list[4],list[8],list[9],list[13],list[12]
## 6            plans                                                                                   list[9]
## 7  last_updated_on                                                                              character[1]
## 8             name                                                                                 namedlist
## 9       speciality                                       character[1],character[2],character[3],character[4]
## 10       accepting                                                                              character[1]
## 11       languages                          character[2],character[3],character[4],character[6],character[5]
## 12          gender                                                                              character[1]
## must look deeper into addresses array, plans array, and name hash; we'll have to flatten them

## ==== addresses =====
## note: the addresses key is always present under main array elements
## unique types of address elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),tlStr));
## [1] "namedlist"
## union of keys among all address element hashes
allAddressKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'addresses',NULL),names)); allAddressKeys;
## [1] "address" "city" "state" "zip" "phone" "address_2"
## pattern of keys among address elements => only address_2 varies, similar frequency with it as without it
keyCombosWithCount(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
##   address city state  zip phone address_2 .count
## 1    TRUE TRUE  TRUE TRUE  TRUE     FALSE   1898
## 2    TRUE TRUE  TRUE TRUE  TRUE      TRUE   2575
## for each address element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only address_2 in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'addresses',NULL),allAddressKeys);
##         key           tl
## 1   address character[1]
## 2      city character[1]
## 3     state character[1]
## 4       zip character[1]
## 5     phone character[1]
## 6 address_2 character[1]

## ==== plans =====
## note: the plans key is always present under main array elements
## unique types of plan elements across all hashes => all named lists, thus nested hashes
unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),tlStr));
## [1] "namedlist"
## union of keys among all plan element hashes
allPlanKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'plans',NULL),names)); allPlanKeys;
## [1] "plan_id_type" "plan_id" "network_tier"
## pattern of keys among plan elements => good, all plan elements have all 3 keys, perfectly consistent
keyCombosWithCount(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
##   plan_id_type plan_id network_tier .count
## 1         TRUE    TRUE         TRUE  29304
## for each plan element key, get the unique set of types it takes on among all hashes (note: no plan keys are ever omitted, so don't have to worry about that) => all scalar strings
keyTypes(jsonRList,list(NULL,'plans',NULL),allPlanKeys);
##            key           tl
## 1 plan_id_type character[1]
## 2      plan_id character[1]
## 3 network_tier character[1]

## ==== name =====
## note: the name key is *not* always present under main array elements
## union of keys among all name hashes
allNameKeys <- unique(levelApplyToVec(jsonRList,list(NULL,'name'),names)); allNameKeys;
## [1] "first" "middle" "last"
## pattern of keys among name elements => sometimes middle is missing, relatively infrequently
keyCombosWithCount(jsonRList,list(NULL,'name'),allNameKeys);
##   first middle last .count
## 1  TRUE   TRUE TRUE   2679
## 2  TRUE  FALSE TRUE    298
## for each name element key, get the unique set of types it takes on among all hashes, ignoring hashes where the key (only middle in this case) is omitted => all scalar strings
keyTypes(jsonRList,list(NULL,'name'),allNameKeys);
##      key           tl
## 1  first character[1]
## 2 middle character[1]
## 3   last character[1]

Here's a summary of my findings:

  • One top-level main list, length 3256.
  • Each element is a hash with an inconsistent key set. There are 12 unique keys across all main hashes, with 3 patterns of key sets present.
  • 6 of the hash values are scalar strings, 3 are variable-length string vectors, addresses is a variable-length list, plans is a list that is always of length 9, and name is a hash.
  • Each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
  • Each plans list element is a hash with 3 keys to scalar strings, with no inconsistencies.
  • Each name hash has first and last, but not always middle, scalar strings.
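
The "pattern of key sets" idea from the summary can be reproduced on a small scale with base R only. The toy list below is made up, and this sketch is a simplified stand-in for keyCombosWithCount(), not a reimplementation of it.

```r
# Toy records with inconsistent key sets (hypothetical values).
toy <- list(
  list(npi = "1", type = "FACILITY", facility_name = "Clinic A"),
  list(npi = "2", type = "INDIVIDUAL", gender = "F"),
  list(npi = "3", type = "INDIVIDUAL", gender = "M"),
  list(npi = "4", type = "INDIVIDUAL")
)

# Union of keys across all records, in order of first appearance.
all_keys <- unique(unlist(lapply(toy, names)))

# One string per record describing which keys it has, then a frequency table.
key_pattern <- vapply(toy, function(h) {
  paste(all_keys[all_keys %in% names(h)], collapse = ",")
}, character(1))
pattern_counts <- table(key_pattern)
print(pattern_counts)
```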

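The most important observation here is that there are no type inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors without any type coercion concerns. We can flatten all of the data into a two-dimensional structure, provided we associate columns with nodes deep enough that every column corresponds to a single scalar string node in the input list.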
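
As a small illustration of that observation, the sketch below pulls one scalar key out of every record of a made-up list and fills omissions with NA, which is the building block for one flat column. extract_scalar is a hypothetical name used only here.

```r
# Toy records; the second one omits the gender key (hypothetical values).
toy <- list(
  list(npi = "1", gender = "F"),
  list(npi = "2"),
  list(npi = "3", gender = "M")
)

# Hypothetical helper: collect one scalar string per record, NA when absent.
extract_scalar <- function(nodes, key) {
  vapply(nodes, function(h) {
    value <- h[[key]]  # NULL when the key is missing
    if (is.null(value)) NA_character_ else value
  }, character(1))
}

flat <- data.frame(npi = extract_scalar(toy, "npi"),
                   gender = extract_scalar(toy, "gender"),
                   stringsAsFactors = FALSE)
print(flat)
```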

Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() that I defined earlier.

##--------------------------------------
## solution
##--------------------------------------
## recursively traverse the list structure, building up a column at each leaf node
extractLevelColumns <- function(
    nodes, ## current level node selection
    ..., ## additional arguments to data.frame()
    keyList=list(), ## current key path under main list
    sep=NULL, ## optional string separator on which to join multi-element vectors; if NULL, will leave as separate columns
    mkname=function(keyList,maxLen) paste0(collapse='.',if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList) ## name builder from current keyList and character vector max length across node level; default to dot-separated keys, and remove last index component for scalars
) {
    cat(sprintf('extractLevelColumns(): %s\n',keyListToStr(keyList)));
    if (length(nodes) == 0L) return(list()); ## handle corner case of empty main list
    tlList <- lapply(nodes,tl);
    typeList <- do.call(c,lapply(tlList,`[[`,'type'));
    if (length(unique(typeList)) != 1L) stop(sprintf('error: inconsistent types (%s) at %s.',mkcsv(typeList),keyListToStr(keyList)));
    type <- typeList[1L];
    if (type == 'namedlist') { ## hash; recurse
        allKeys <- unique(do.call(c,lapply(nodes,names)));
        ret <- do.call(c,lapply(allKeys,function(key) extractLevelColumns(lapply(nodes,`[[`,key),...,keyList=c(keyList,key),sep=sep,mkname=mkname)));
    } else if (type == 'list') { ## array; recurse
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=T);
        allIndexes <- seq_len(maxLen);
        ret <- do.call(c,lapply(allIndexes,function(index) extractLevelColumns(lapply(nodes,function(node) if (length(node) < index) NULL else node[[index]]),...,keyList=c(keyList,index),sep=sep,mkname=mkname))); ## must be careful to guard out-of-bounds to NULL; happens automatically with string keys, but not with integer indexes
    } else if (type%in%c('raw','logical','integer','double','complex','character')) { ## atomic leaf node; build column
        lenList <- do.call(c,lapply(tlList,`[[`,'len'));
        maxLen <- max(lenList,na.rm=T);
        if (is.null(sep)) {
            ret <- lapply(seq_len(maxLen),function(i) setNames(data.frame(sapply(nodes,function(node) if (length(node) < i) NA else node[[i]]),...),mkname(c(keyList,i),maxLen)));
        } else {
            ## keep original type if maxLen is 1, IOW don't stringify
            ret <- list(setNames(data.frame(sapply(nodes,function(node) if (length(node) == 0L) NA else if (maxLen == 1L) node else paste(collapse=sep,node)),...),mkname(keyList,maxLen)));
        }; ## end if
    } else stop(sprintf('error: unsupported type %s at %s.',type,keyListToStr(keyList)));
    if (is.null(ret)) ret <- list(); ## handle corner case of exclusively empty sublists
    ret;
}; ## end extractLevelColumns()
## simple interface function
flattenList <- function(mainList,...) do.call(cbind,extractLevelColumns(mainList,...));

The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where a value was missing, and then transforming that vector into a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList into the column name. Multiple columns are returned from each recursive call as a list of data.frames, and likewise from the top-level call.

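It also validates that there are no type inconsistencies between parallel nodes. Although I verified the consistency of the data manually earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, and so this validation step is appropriate.

flattenList() is the primary interface function; it simply calls extractLevelColumns() and then calls do.call(cbind,...) to combine the columns into a single data.frame.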
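
The final combination step can be seen in isolation on two hand-made one-column data.frames. The column contents here are invented; only the do.call(cbind, ...) pattern mirrors what flattenList() does with the list returned by extractLevelColumns().

```r
# A list of one-column data.frames, standing in for the list that the
# recursive traversal would return (values are hypothetical).
cols <- list(
  setNames(data.frame(c("1", "2"), stringsAsFactors = FALSE), "npi"),
  setNames(data.frame(c("F", NA), stringsAsFactors = FALSE), "gender")
)

# Bind them side by side into a single data.frame.
flat <- do.call(cbind, cols)
print(flat)
```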

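An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(). So, for example, you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the top-level list component names from being used as row names, if such names exist in the input list.

I've also added a sep parameter, which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements into a single string, and only one column is generated for the node.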
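
The default column-naming rule can be tried on its own. In the answer, mkname() reads sep from the enclosing function; the standalone copy below takes sep as an explicit third argument instead, which is an adaptation made for this illustration, and the key paths are written as plain character vectors.

```r
# Standalone adaptation of the default name builder: dot-separated keys,
# dropping the trailing index when the leaf is a scalar and sep is NULL.
mkname <- function(keyList, maxLen, sep = NULL) {
  paste0(collapse = ".",
         if (is.null(sep) && maxLen == 1L) keyList[-length(keyList)] else keyList)
}

# Scalar leaf: the trailing element index is dropped.
scalar_name <- mkname(c("addresses", "1", "city", "1"), maxLen = 1L)

# Multi-element leaf with sep = NULL: the element index is kept as a suffix.
indexed_name <- mkname(c("languages", "2"), maxLen = 6L)

# Multi-element leaf joined on a separator: one column, no index in the path.
joined_name <- mkname("languages", maxLen = 6L, sep = ",")

print(c(scalar_name, indexed_name, joined_name))
```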

In terms of performance, it's very fast. Here's a demo:

## actually run it
system.time({ df <- flattenList(jsonRList); });
## extractLevelColumns(): /
## extractLevelColumns(): /npi
## extractLevelColumns(): /type
## extractLevelColumns(): /facility_name
## extractLevelColumns(): /facility_type
## extractLevelColumns(): /addresses
## extractLevelColumns(): /addresses/1
## extractLevelColumns(): /addresses/1/address
## extractLevelColumns(): /addresses/1/city
##
## ... snip ...
##
## extractLevelColumns(): /plans/9/network_tier
## extractLevelColumns(): /last_updated_on
## extractLevelColumns(): /name
## extractLevelColumns(): /name/first
## extractLevelColumns(): /name/middle
## extractLevelColumns(): /name/last
## extractLevelColumns(): /speciality
## extractLevelColumns(): /accepting
## extractLevelColumns(): /languages
## extractLevelColumns(): /gender
##    user  system elapsed
##   2.265   0.000   2.268

Result:

class(df); dim(df); names(df);
## [1] "data.frame"
## [1] 3256 126
## [1] "npi" "type" "facility_name" "facility_type.1" "facility_type.2" "facility_type.3" "addresses.1.address" "addresses.1.city" "addresses.1.state"
## [10] "addresses.1.zip" "addresses.1.phone" "addresses.1.address_2" "addresses.2.address" "addresses.2.city" "addresses.2.state" "addresses.2.zip" "addresses.2.phone" "addresses.2.address_2"
## [19] "addresses.3.address" "addresses.3.city" "addresses.3.state" "addresses.3.zip" "addresses.3.phone" "addresses.3.address_2" "addresses.4.address" "addresses.4.city" "addresses.4.state"
## [28] "addresses.4.zip" "addresses.4.phone" "addresses.4.address_2" "addresses.5.address" "addresses.5.address_2" "addresses.5.city" "addresses.5.state" "addresses.5.zip" "addresses.5.phone"
## [37] "addresses.6.address" "addresses.6.address_2" "addresses.6.city" "addresses.6.state" "addresses.6.zip" "addresses.6.phone" "addresses.7.address" "addresses.7.address_2" "addresses.7.city"
## [46] "addresses.7.state" "addresses.7.zip" "addresses.7.phone" "addresses.8.address" "addresses.8.address_2" "addresses.8.city" "addresses.8.state" "addresses.8.zip" "addresses.8.phone"
## [55] "addresses.9.address" "addresses.9.address_2" "addresses.9.city" "addresses.9.state" "addresses.9.zip" "addresses.9.phone" "addresses.10.address" "addresses.10.address_2" "addresses.10.city"
## [64] "addresses.10.state" "addresses.10.zip" "addresses.10.phone" "addresses.11.address" "addresses.11.address_2" "addresses.11.city" "addresses.11.state" "addresses.11.zip" "addresses.11.phone"
## [73] "addresses.12.address" "addresses.12.address_2" "addresses.12.city" "addresses.12.state" "addresses.12.zip" "addresses.12.phone" "addresses.13.address" "addresses.13.city" "addresses.13.state"
## [82] "addresses.13.zip" "addresses.13.phone" "plans.1.plan_id_type" "plans.1.plan_id" "plans.1.network_tier" "plans.2.plan_id_type" "plans.2.plan_id" "plans.2.network_tier" "plans.3.plan_id_type"
## [91] "plans.3.plan_id" "plans.3.network_tier" "plans.4.plan_id_type" "plans.4.plan_id" "plans.4.network_tier" "plans.5.plan_id_type" "plans.5.plan_id" "plans.5.network_tier" "plans.6.plan_id_type"
## [100] "plans.6.plan_id" "plans.6.network_tier" "plans.7.plan_id_type" "plans.7.plan_id" "plans.7.network_tier" "plans.8.plan_id_type" "plans.8.plan_id" "plans.8.network_tier" "plans.9.plan_id_type"
## [109] "plans.9.plan_id" "plans.9.network_tier" "last_updated_on" "name.first" "name.middle" "name.last" "speciality.1" "speciality.2" "speciality.3"
## [118] "speciality.4" "accepting" "languages.1" "languages.2" "languages.3" "languages.4" "languages.5" "languages.6" "gender"

The resulting data.frame is quite wide, but we can use rowToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:

rowToFrame(df[1L,]);

I've tested the result pretty thoroughly via many spot checks on individual records, and it all looks correct. Let me know if you have any questions.

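Answer 2: (score: 3)

This answer is more of a data organization suggestion (and much shorter than the bounty-attracting answers ;)

If you want to keep the semantics of the fields, for example keeping all plan_id values in a single column, you can normalize your data design a bit and then do joins whenever you need the information:

library(jsonlite) # provides fromJSON()
library(dplyr)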

# notice the simplifyVector=F
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json", simplifyVector=F) 

# pick and repeat fields for each element of array
# {field1:val, field2:val2, array:[{af1:av1, af2:av2}, {af1:av3, af2:av4}]}
# gives data.frame 
# field1, field2 array.af1 array.af2
# val     val2  av1        av2
# val     val2  av3        av4
denormalize <- function(data, fields, array) {
  data.frame(
    c(
      data[fields], 
      as.list(
        bind_rows(
          lapply(data[[array]], data.frame)))))
}

plans_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'plans'))
addresses_df <- bind_rows(lapply(providers, denormalize, c('npi'), 'addresses'))
npis <- bind_rows(lapply(providers, function(d, fields) data.frame(d[fields]), 
                         c('npi', 'type', 'last_updated_on')))

Then you can filter the data first, and then join in the other information:

addresses_df %>%
  filter(city == "Healy") %>%
  left_join(plans_df, by="npi") ->
  plans_in_healy
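
The same filter-then-join step can be sketched without dplyr, using base R merge() on two small made-up tables. The rows below are hypothetical; only the column names npi, city, and plan_id follow the answer.

```r
# Toy normalized tables keyed by npi (hypothetical values).
addresses_df <- data.frame(npi = c("1", "1", "2"),
                           city = c("Healy", "Fairbanks", "Healy"),
                           stringsAsFactors = FALSE)
plans_df <- data.frame(npi = c("1", "2", "2"),
                       plan_id = c("A", "B", "C"),
                       stringsAsFactors = FALSE)

# Filter first, then left-join the plans onto the remaining addresses.
in_healy <- addresses_df[addresses_df$city == "Healy", ]
plans_in_healy <- merge(in_healy, plans_df, by = "npi", all.x = TRUE)
print(plans_in_healy)
```

Filtering before the join keeps the intermediate result small, which matters when every provider carries several plans.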

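Answer 3: (score: 2)

So this doesn't really qualify as a solution, as it doesn't directly answer the question, but here is how I would analyze this data.

First, I had to understand your dataset. It appears to be information about healthcare providers.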

 providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=FALSE ) 
 types = sapply(providers,"[[","type")
 table(types)

 # FACILITY INDIVIDUAL 
 #    279       2977 
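  • FACILITY entries have the "ID" fields facility_name and facility_type.
  • INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
  • All entries have the "ID" fields npi and last_updated_on.
  • All entries have two nested fields: addresses and plans. For example, addresses is a list containing city, state, etc.

Since there are multiple addresses per npi, I would prefer to convert them into a data frame with columns for city, state, etc. I would also make a similar data frame for plans. Then I would join addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there would be 4 * 8 = 32 rows in the joined data frame. Finally, I would tack on a similarly denormalized data frame holding the "ID" information, using another merge.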
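
The 4 * 8 = 32 arithmetic corresponds to a cross join, which base R merge() performs when by = NULL. The two tables below are placeholders with invented values, used only to show the row count.

```r
# Four addresses and eight plans for a single provider (hypothetical values).
addresses <- data.frame(city = paste0("city", 1:4), stringsAsFactors = FALSE)
plans <- data.frame(plan_id = paste0("plan", 1:8), stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product of the two tables.
crossed <- merge(addresses, plans, by = NULL)
print(dim(crossed))
```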

library(dplyr)
# note: rbind_all() comes from older dplyr releases; bind_rows() is its replacement in later versions
unfurl_npi_data = function (x) {
  repeat_cols = c("plans","addresses")
  id_cols = setdiff(names(x),repeat_cols)
  repeat_data = x[repeat_cols]
  id_data  = x[id_cols]

  # Denormalized ID data
  id_data_df = Reduce(function(x,y) merge(x,y,by=NULL), id_data, "")[,-1]
  atomic_colnames = names(which(!sapply(id_data, is.list)))
  df_atomic_cols = unlist(sapply(id_data,function(x) if(is.list(x)) rep(FALSE, length(x)) else TRUE))
  colnames(id_data_df)[df_atomic_cols] = atomic_colnames

  # Join the plans and addresses (denormalized)
  repeated_data = lapply(repeat_data, rbind_all)
  repeated_data_crossed = Reduce(merge, repeated_data, repeated_data[[1]])

  merge(id_data_df, repeated_data_crossed)
}

providers2 = split(providers, types)
providers3 = lapply(providers2, function(x) rbind_all(lapply(x, unfurl_npi_data)))

Then do some cleanup.

unique_df = function(x) {
  chr_col_names = names(which(sapply(x, class) == "character"))
  for( col in chr_col_names )
    x[[col]] = toupper(x[[col]])
  unique(x)
}
providers3 = lapply(providers3, unique_df)
facilities = providers3[["FACILITY"]]
individuals = providers3[["INDIVIDUAL"]]
rm(providers, providers2, providers3)

Now you can ask some interesting questions. For example, how many addresses does each healthcare provider have?

 unique_providers = individuals %>% select(first, middle, last, gender, state, city, address) %>% unique()
 num_addresses = unique_providers %>% count(first, middle, last, gender)
 table(num_addresses$n)

 #    1    2    3    4    5    6    7    8    9   12   13 
 # 2258  492  119   33   43   21    6    1    2    1    1 

What is the percentage of male healthcare providers at addresses with more than five individuals?

And so on...
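
The address-count question can also be approximated in base R. The table below is a made-up, already denormalized stand-in for individuals, with fewer identifying columns than the real data, so this is only a sketch of the deduplicate-then-count pattern.

```r
# Toy denormalized rows: one row per provider/address/plan combination
# (hypothetical values).
individuals <- data.frame(
  first = c("ANN", "ANN", "ANN", "BOB"),
  last = c("LEE", "LEE", "LEE", "RAY"),
  address = c("1 MAIN ST", "1 MAIN ST", "2 OAK AVE", "9 ELM RD"),
  plan_id = c("P1", "P2", "P1", "P1"),
  stringsAsFactors = FALSE
)

# Drop the plan dimension so each provider/address pair appears once.
unique_providers <- unique(individuals[c("first", "last", "address")])

# Count the distinct addresses per provider, then tabulate the counts.
num_addresses <- aggregate(address ~ first + last, data = unique_providers,
                           FUN = length)
print(table(num_addresses$address))
```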
