Question

I have a dataset that can be downloaded from http://mips.helmholtz-muenchen.de/proj/ppi/ (at the bottom of the page, where it says "You can get the full dataset").

I then tried the XML package:

library(XML)
doc <- xmlTreeParse("path to/allppis.xml", useInternalNodes = TRUE)
root <- xmlRoot(doc)

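But the result seems to be empty.

What do I want?

When I open the allppis.xml file downloaded from that site, I want to extract specific lines into a txt file: the ones that start with <fullName> and end with </fullName>.

For example, if I open the file, I can see this: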

<fullName>S100A8;CAGA;MRP8; calgranulin A (migration inhibitory factor-related protein 8)</fullName>

And I want to turn it into this:

Proteins                   description 
S100A8;CAGA;MRP8     calgranulin A (migration inhibitory factor-related protein 8)

Answer 1

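I think you want something like this (IMO the question is not very clear). I also think the main problem is the default namespace, which is definitely a royal pain: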

library(xml2)
library(purrr)
library(dplyr)
library(stringi)

doc <- read_xml("allppis.xml")

ns <- xml_ns_rename(xml_ns(doc), d1="x")

xml_find_all(doc, ".//x:proteinInteractor/x:names/x:fullName", ns) %>% 
  xml_text() %>% 
  stri_split_fixed("; ", n=2, simplify=TRUE) %>% 
  as.data.frame(stringsAsFactors = FALSE) %>%  # as_data_frame() is deprecated in tibble
  as_tibble() %>% 
  setNames(c("Proteins", "Description")) %>% 
  mutate(Proteins=trimws(Proteins),
         Description=trimws(Description))
## # A tibble: 3,628 × 2
##             Proteins                                                    Description
##                <chr>                                                          <chr>
## 1   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 2  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 3  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 4   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 5   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 6  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 7  S100A9;CAGB;MRP14 calgranulin B (migration inhibitory factor-related protein 14)
## 8   S100A8;CAGA;MRP8  calgranulin A (migration inhibitory factor-related protein 8)
## 9               TRP3                                 calcium influx channel protein
## 10            IP3R-3                  inositol 1,4,5-trisphosphate receptor, type 3
## # ... with 3,618 more rows
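
To illustrate why the default namespace gets in the way, here is a minimal sketch on a made-up miniature document. The namespace URI and the `toy` object are assumptions for illustration only; they are not taken from allppis.xml.

```r
library(xml2)

# Hypothetical miniature document; the namespace URI is invented for illustration.
toy <- read_xml('<entrySet xmlns="http://example.org/ns">
  <proteinInteractor>
    <names><fullName>S100A8;CAGA;MRP8; calgranulin A</fullName></names>
  </proteinInteractor>
</entrySet>')

# An unprefixed XPath finds nothing: every element sits in the default namespace.
n_plain <- length(xml_find_all(toy, ".//fullName"))

# xml2 registers the default namespace under the prefix "d1"; renaming it to "x" is optional.
ns <- xml_ns_rename(xml_ns(toy), d1 = "x")
hits <- xml_text(xml_find_all(toy, ".//x:fullName", ns))

print(n_plain)
print(hits)
```

If the prefixes are more trouble than they are worth, xml2 also provides `xml_ns_strip()`, which removes default namespaces from a document so that plain XPath works; whether that is appropriate depends on whether the file mixes several namespaces.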

You'll need to clean the full result up a bit (View() the resulting data frame to see what I mean).
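
Exactly what cleanup is needed is not spelled out here. The printed rows do repeat, so one plausible step is dropping duplicates before writing the txt file the question asks for. A rough sketch, assuming a small hand-made sample in place of the real result (`res`, `res_unique`, and the output path are hypothetical names):

```r
# Hypothetical sample mimicking the two columns of the result.
res <- data.frame(
  Proteins    = c("S100A8;CAGA;MRP8", "S100A9;CAGB;MRP14", "S100A8;CAGA;MRP8"),
  Description = c("calgranulin A (migration inhibitory factor-related protein 8)",
                  "calgranulin B (migration inhibitory factor-related protein 14)",
                  "calgranulin A (migration inhibitory factor-related protein 8)"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows (an assumed cleanup step, not necessarily the only one needed).
res_unique <- unique(res)

# Write a tab-separated text file; the path is a placeholder in the session temp directory.
out_file <- file.path(tempdir(), "proteins.txt")
write.table(res_unique, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

written <- readLines(out_file)
print(written)
```

A tab separator is used because the protein names themselves contain semicolons, and the descriptions may contain commas.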
