Getting attribute values of child nodes

Time: 2017-06-23 16:00:05

Tags: r xml-parsing

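I'm trying to convert the XML files generated by Retrosheet's boxscore tool into data frames that can be inserted into SQL tables. I'm most of the way there, but I can't figure out how to grab the attributes of an intermediate XML node. Below is an example; hopefully I pasted it correctly. What I want to grab is game_id, id (from player), and the complete batting section.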

<boxscores>
<boxscore game_id="CHA191204110" date="1912/04/11" site="CHI10" 
visitor="SLA" visitor_city="St.Louis" visitor_name="Browns" home="CHA" 
home_city="Chicago" home_name="White Sox" start_time="0:00PM" 
day_night="day" temperature="0" wind_direction="unknown" wind_speed="-1" 
field_condition="unknown" precip="unknown" sky="unknown" time_of_game="110" 
attendance="30000" umpire_hp="evanb901" umpire_1b="eganr101" umpire_2b="" 
umpire_3b="" >
<linescore away_runs="2" away_hits="7" away_errors="1" home_runs="6" 
home_hits="10" home_errors="1">
<inning_line_score away="0" home="0" inning="1"/>
<inning_line_score away="0" home="0" inning="2"/>
<inning_line_score away="0" home="1" inning="3"/>
<inning_line_score away="0" home="0" inning="4"/>
<inning_line_score away="2" home="0" inning="5"/>
<inning_line_score away="0" home="1" inning="6"/>
<inning_line_score away="0" home="1" inning="7"/>
<inning_line_score away="0" home="3" inning="8"/>
<inning_line_score away="0" home="x" inning="9"/>
</linescore>
<players team="SLA" lob="5" dp="0" tp="0" risp_ab="0" risp_h="0">

<player id="shotb101" lname="Shotton" fname="Burt" slot="1" seq="1" pos="8">
  <batting ab="4" r="0" h="0" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="3" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
  <fielding pos="8" outs="24" po="1" a="0" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>
<player id="austj101" lname="Austin" fname="Jimmy" slot="2" seq="1" pos="5">
  <batting ab="4" r="0" h="1" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="1" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
  <fielding pos="5" outs="24" po="0" a="3" e="0" dp="0" tp="0" bip="-1" bf="-1" />
  </player>
<player id="stovg101" lname="Stovall" fname="George" slot="3" seq="1" pos="3" >
  <batting ab="4" r="0" h="1" d="0" t="0" hr="0" bi="0" bi2out="-1" bb="0" ibb="-1" so="0" gdp="-1" hp="0" sh="0" sf="-1" sb="0" cs="-1" />
  <fielding pos="3" outs="24" po="11" a="0" e="0" dp="0" tp="0" bip="-1" bf="-1" />
</player>

</players>
</boxscore>
</boxscores>

Here is the code I am using:

library(xml2)   # read_xml(), xml_find_all(), xml_attr(), xml_attrs()
library(dplyr)  # bind_rows()

box <- read_xml("Q:\\Sabermetrics\\Retrosheet\\download.folder\\unzipped\\1912.xml")

# One node per game
atbat <- xml_find_all(box, "//boxscore")

bind_rows(lapply(atbat, function(x) {

  # All batting nodes for this game
  player <- try(xml_find_all(x, "./players/player/batting"), silent=FALSE)

  if (inherits(player, "try-error") ||
    length(player) == 0) return(NULL)

  # One row per batting node, one column per attribute
  bind_rows(lapply(player, function(y) {
    data.frame(t(xml_attrs(y)), stringsAsFactors=FALSE)
  })) -> player_dat

  game_id <- try(xml_attr(x, "game_id"))

  if (inherits(game_id, "try-error") ||
    length(game_id) == 0) return(NULL)

  player_dat$game_id <- game_id

  player_dat

})) -> player

I want to end up with something like this:

game_id        player_id     ab    r    h    d  ....
CHA191204110   shotb101      4     0    0    0  ....
CHA191204110   austj101      4     0    1    0  ....
CHA191204110   stovg101      4     0    1    0  ....

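I tried copying the game_id code to grab 'id' from the player, but it doesn't work. I tried the paths ./players/player[@id] and ./players/player/@id, and neither worked. I also tried using @id on its own and still got NA.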

I'm not sure what I'm doing wrong, and at this point I'm just throwing things at the wall to see what sticks...

1 answer:

Answer 0 (score: 0)

Does this help?

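library(XML)  # xmlParse(), xmlToList()

xml <- xmlParse('Q:\\Sabermetrics\\Retrosheet\\download.folder\\unzipped\\1912.xml')
lxml <- xmlToList(xml)
# Attributes of the first <boxscore>, followed by everything under its <players> node flattened into one row
df <- cbind(t(lxml$boxscore$.attrs),t(data.frame(unlist(lxml$boxscore$players))))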

You can extract other information from the XML by passing more arguments to cbind().
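
If you would rather get one row per player with game_id and the player's id next to the batting attributes, here is a rough sketch. Note that it uses xml2 (as in the question) instead of the XML package, which is my substitution and not part of the code above. The inline document is a trimmed-down copy of the sample in the question, and the names batting, rows and batting_df are just illustrative. The idea is to start from each batting node and walk upward: xml_parent() gives the enclosing player, and the ancestor::boxscore axis gives the game.

```r
library(xml2)

# Trimmed-down copy of the sample document from the question.
doc <- read_xml('<boxscores>
  <boxscore game_id="CHA191204110">
    <players team="SLA">
      <player id="shotb101" lname="Shotton">
        <batting ab="4" r="0" h="0" d="0"/>
      </player>
      <player id="austj101" lname="Austin">
        <batting ab="4" r="0" h="1" d="0"/>
      </player>
    </players>
  </boxscore>
</boxscores>')

# Start from the batting nodes, then look upward for the ids.
batting <- xml_find_all(doc, "//boxscore/players/player/batting")

rows <- lapply(batting, function(b) {
  player <- xml_parent(b)                            # enclosing <player>
  game   <- xml_find_first(b, "ancestor::boxscore")  # enclosing <boxscore>
  data.frame(
    game_id   = xml_attr(game, "game_id"),
    player_id = xml_attr(player, "id"),
    t(xml_attrs(b)),                 # one column per batting attribute
    stringsAsFactors = FALSE
  )
})

batting_df <- do.call(rbind, rows)
print(batting_df)
```

All attribute values come back as character strings, so you would probably want to convert the counting columns with as.integer() before loading them into SQL.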

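I assume you are iterating over multiple XML files, so in principle you could wrap something like this in sapply() and then collect everything into a single df with: library(plyr); do.call(rbind.fill, your_df_list).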
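
Something along these lines might work for the multi-file case. It is only a sketch under a few assumptions: read_batting() and make_file() are hypothetical helpers I made up for illustration, the two temp files stand in for your real yearly files, and the second game id ("EXAMPLE0002") and its sb value are invented test data. I used lapply() rather than sapply() so the result stays a plain list of data frames, and xml2 for the parsing.

```r
library(xml2)
library(plyr)

# Hypothetical helper that writes a tiny stand-in file.
# In practice `files` would be the paths to the real Retrosheet XML files.
make_file <- function(game_id, player_id, extra = "") {
  path <- tempfile(fileext = ".xml")
  writeLines(sprintf(
    '<boxscores><boxscore game_id="%s"><players><player id="%s"><batting ab="4" h="1"%s/></player></players></boxscore></boxscores>',
    game_id, player_id, extra), path)
  path
}
files <- c(make_file("CHA191204110", "shotb101"),
           make_file("EXAMPLE0002", "austj101", ' sb="2"'))  # made-up game

# Hypothetical helper: one data frame of batting rows per file.
read_batting <- function(path) {
  doc <- read_xml(path)
  batting <- xml_find_all(doc, "//boxscore/players/player/batting")
  rbind.fill(lapply(batting, function(b) {
    data.frame(
      game_id   = xml_attr(xml_find_first(b, "ancestor::boxscore"), "game_id"),
      player_id = xml_attr(xml_parent(b), "id"),
      t(xml_attrs(b)),
      stringsAsFactors = FALSE
    )
  }))
}

df_list <- lapply(files, read_batting)

# rbind.fill() pads columns missing from some files with NA.
all_batting <- do.call(rbind.fill, df_list)
print(all_batting)
```

The reason to prefer rbind.fill() over plain rbind() here is that files whose batting nodes carry different attribute sets still combine, with the gaps filled by NA.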