How to parse this data using XML

Time: 2016-11-15 20:32:11

Tags: r xml

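I have a dataset that can be downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/. At the bottom of that page it says "You can get the full dataset".

I then tried to read it with the XML package: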

library(XML)
# useInternalNodes = TRUE keeps the parsed document as internal C-level nodes
doc <- xmlTreeParse("path to/allppis.xml", useInternalNodes = TRUE)
root <- xmlRoot(doc)

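But the result seems to be empty.

What do I want?

When I open the allppis.xml file downloaded from that site, I want to extract specific lines into a txt file: the ones that start with <fullName> and end with </fullName>.

For example, if I open the file, I can see this: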

<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>

and from that I want this:

Proteins                   description 
S100A8;CAGA;MRP8     calgranulin A (migration inhibitory factor-related protein 8)

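1 Answer:

Answer 0: (score: 2)

I think you want something like this (IMO the question isn't very clear). I also think the main problem is the default namespace, which is definitely a royal pain: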

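library(xml2)
library(purrr)
library(dplyr)
library(stringi)

doc <- read_xml("allppis.xml")

# xml2 labels the document's default namespace "d1"; alias it as "x" for use in XPath
ns <- xml_ns_rename(xml_ns(doc), d1="x")

# Split each <fullName> at the first "; " into protein symbols and a description
parts <- xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>%
  xml_text() %>%
  stri_split_fixed("; ", n=2, simplify=TRUE)
colnames(parts) <- c("Proteins", "Description")

# as_tibble() replaces the now-deprecated as_data_frame()
as_tibble(parts) %>%
  mutate(Proteins=trimws(Proteins),
         Description=trimws(Description))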
## # A tibble: 3,628 × 2
##             Proteins                                                    Description
##                <chr>                                                          <chr>
## 1   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 2  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 3  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 4   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 5   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 6  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 7  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 8   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 9               TRP3                                 calcium influx channel protein
## 10            IP3R-3                  inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows
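
To illustrate why a default namespace gets in the way, here is a minimal, self-contained sketch. It uses a tiny inline document rather than the real file, and the namespace URI `urn:example:ppi` is made up for the demo. It only shows the general xml2 behaviour: an unprefixed XPath finds nothing, while a prefixed one finds the node.

```r
library(xml2)

# Tiny stand-in document; the namespace URI is hypothetical, not the one in allppis.xml
doc <- read_xml('<entrySet xmlns="urn:example:ppi">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A</fullName></names>
  </proteinInteractor>
</entrySet>')

# An unprefixed name only matches elements in no namespace, so nothing is found
no_ns <- xml_find_all(doc, ".//fullName")
length(no_ns)

# xml2 exposes the default namespace under the prefix "d1"; rename it to "x"
ns <- xml_ns_rename(xml_ns(doc), d1 = "x")
with_ns <- xml_find_all(doc, ".//x:fullName", ns)
xml_text(with_ns)
```

If the namespace is not needed at all, calling `xml_ns_strip(doc)` before querying is another option that lets plain, unprefixed XPath expressions work.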

You'll need to clean the result up a bit (View() the resulting data frame to see what I mean).
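
As a rough sketch of what that cleanup and the requested txt export might look like: the snippet below assumes that "cleanup" includes dropping repeated rows, which is only a guess based on the duplicates visible in the printed output. The small `ppi` data frame is a hand-built stand-in for the real result, and `out_file` is a hypothetical output path.

```r
library(dplyr)

# Stand-in for the parsed result; rows copied from the printed output above
ppi <- data.frame(
  Proteins = c("S100A8;CAGA;MRP8", "S100A9;CAGB;MRP14",
               "S100A9;CAGB;MRP14", "TRP3"),
  Description = c(
    "calgranulin A (migration inhibitory factor-related protein 8)",
    "calgranulin B (migration inhibitory factor-related protein 14)",
    "calgranulin B (migration inhibitory factor-related protein 14)",
    "calcium influx channel protein"
  ),
  stringsAsFactors = FALSE
)

# Assumption: repeated protein entries are not wanted in the output
cleaned <- ppi %>% distinct()

# Write a tab-separated text file; out_file is a placeholder location
out_file <- tempfile(fileext = ".txt")
write.table(cleaned, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
readLines(out_file)
```

Tab is used as the separator here because the protein symbols already contain semicolons and the descriptions contain commas.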