Extracting information from XML

Time: 2017-09-06 05:18:31

Tags: r xml dataframe

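Yesterday I scraped a website that requires a login. The page is in XML format, as shown below. I am having trouble parsing it because some teachers belong to two departments. I also don't need the first three lines, which only indicate that I logged in successfully. I need to turn it into a data frame (or a list, or JSON).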

My code:

ID <- xpathApply(xml, "//teacher[@id]")
ID_unlist <- unlist(ID)
matrix <- as.data.frame(matrix(ID_unlist),nrow= 2, byrow=TRUE)

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  : 
  first argument must be atomic

XML:

<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420">
        <name>Math</name>
      </department>
      <department id="421">
        <name>Statistics</name>
      </department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412">
        <name>English</name>
      </department>
    </teacher>
  </teacherList>
</result> 

My expected result would be:

t_id      teacher       d_id   department
 D95         Mary        420         Math
 D95         Mary        421   Statistics
 D73         Adam        412      English

1 answer:

Answer 0: (score: 2)

Probably not the most efficient way, but it works.

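library(XML)
# `content` is assumed to hold the XML shown in the question
content_list <- XML::xmlToList(content)
# Build one block of rows per teacher, then stack the blocks
rows <- lapply(content_list$teacherList, function(teacher) {
  # All <department> children, one row each: columns are name and id
  departments <- do.call(rbind, teacher[names(teacher) == "department"])
  # cbind recycles the teacher id and name across the department rows
  unname(cbind(teacher$.attrs, teacher$name, departments))
})
df <- as.data.frame(do.call(rbind, rows))
colnames(df) <- c("id", "teacher", "department", "did")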


   id teacher department did
1 D95    Mary       Math 420
2 D95    Mary Statistics 421
3 D73    Adam    English 412
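
As a possible alternative, here is a minimal XPath-based sketch that treats each <department> node as one output row and looks up its parent <teacher>. The error in the question likely arises because xpathApply without a function returns node objects rather than strings, so the code below extracts plain character values first. The names xml_text, doc, departments and teachers_df are made up for illustration, and the sample XML is copied from the question; this is one way to do it, not necessarily the best one.

```r
library(XML)

# Sample document copied from the question (assumption: the real page has the same layout).
xml_text <- '<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420"><name>Math</name></department>
      <department id="421"><name>Statistics</name></department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412"><name>English</name></department>
    </teacher>
  </teacherList>
</result>'

doc <- xmlParse(xml_text, asText = TRUE)

# One node per <department>; a teacher with two departments yields two nodes.
# Selecting only these nodes also skips the <code> and <note> header elements.
departments <- getNodeSet(doc, "//teacher/department")

teachers_df <- data.frame(
  # Teacher id and name come from the parent <teacher> of each department
  t_id       = sapply(departments, function(d) xmlGetAttr(xmlParent(d), "id")),
  teacher    = sapply(departments, function(d) xmlValue(xmlParent(d)[["name"]])),
  # Department id and name come from the department node itself
  d_id       = sapply(departments, xmlGetAttr, "id"),
  department = sapply(departments, function(d) xmlValue(d[["name"]])),
  stringsAsFactors = FALSE
)

print(teachers_df)
```

On the sample above this should give three rows, with Mary appearing once per department. All columns are character vectors, so d_id would need as.integer() if numeric ids are wanted.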