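Parsing the XML of an ancient Greek play into speakers and dialogue

Time: 2017-07-26 19:57:32

Tags: r xml parsing

I am trying to read a Greek play, available online as an XML file, into a data frame with a dialogue column and a speaker column. I run the following commands to download the XML and extract the dialogue and the speakers.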

library(XML)
library(RCurl)
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186"
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
speakersc <- xpathSApply(doc, "//speaker", xmlValue)
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE)
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE)

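However, I have run into a problem. The dialogue yields 300 rows (300 separate speeches in the play), but the speakers yield only 297. The cause is the XML structure reproduced below: the <speaker> tag is not repeated when a continuing speech is interrupted by a stage direction. Because I have to split the dialogue on the <p> tag, this produces two dialogue rows but only one speaker row, and the speaker is not repeated to match.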

  

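<sp>
<speaker>Creon</speaker>
<stage>To the Guard.</stage>
<p>You can take yourself wherever you please,
<milestone n="445" unit="line" ed="p"/>free and clear of a heavy charge.
<stage>Exit Guard.</stage>
</p>
</sp>
<sp>
<stage>To Antigone.</stage>
<p>You, however, tell me - not at length, but briefly - did you know that an edict had forbidden this?</p>
</sp>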
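
The mismatch can be reproduced without downloading the play. The sketch below is only an illustration: the inline fragment and the names fragment, doc_small, n_p, n_speaker and n_orphan are hypothetical and are modelled on the excerpt above, not taken from the Perseus file.

```r
library(XML)

# Hypothetical inline fragment modelled on the excerpt above (not the full play).
fragment <- '<body>
<sp><speaker>Creon</speaker><stage>To the Guard.</stage>
<p>You can take yourself wherever you please, <milestone n="445" unit="line" ed="p"/>free and clear of a heavy charge.<stage>Exit Guard.</stage></p></sp>
<sp><stage>To Antigone.</stage>
<p>You, however, tell me: did you know that an edict had forbidden this?</p></sp>
<sp><speaker>Antigone</speaker><p>I knew it. How could I not? It was public.</p></sp>
</body>'

doc_small <- xmlParse(fragment, asText = TRUE)

# Count the paragraphs, the speakers, and the <sp> nodes that lack a <speaker>.
n_p       <- length(xpathSApply(doc_small, "//p", xmlValue))
n_speaker <- length(xpathSApply(doc_small, "//speaker", xmlValue))
n_orphan  <- length(getNodeSet(doc_small, "//sp[not(speaker)]"))

# The gap between the two counts is exactly the number of speakerless <sp> nodes.
print(c(p = n_p, speaker = n_speaker, sp_without_speaker = n_orphan))
```

The XPath //sp[not(speaker)] may also be useful on the real document to list the speeches that cause the gap.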

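How can I parse the XML so that the number of dialogue rows matches the number of corresponding speaker rows?

For the example above, I would like the resulting data frame to contain either two rows for Creon, one for the dialogue before the stage direction and one for the dialogue after it, or a single row that treats Creon's dialogue as one speech and ignores the split caused by the stage direction.

Thank you very much for your help.

2 answers:

Answer 0: (score: 1)

Consider using XPath's forward-looking following-sibling axis to find the next <p> tag when the speaker is empty, while iterating over the <sp> nodes, which are the parents of <speaker> and <p>: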

# ALL SP NODES
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue)

# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs
dfList <- lapply(seq_along(sp), function(i){
  data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker,'')"), xmlValue),
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker/following-sibling::p[1], ' ',
                                               //body/descendant::sp[position()=",i+1," and not(speaker)]/p[1])"), xmlValue)
  )
})

# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE
finaldf <- subset(do.call(rbind, dfList), speakers!="" & dialogue!="")

# SPECIFIC ROWS IN OP'S HIGHLIGHT
finaldf[85,]
#    speakers
# 85    Creon
#
#    dialogue
# 85 You can take yourself wherever you please,free and clear of a heavy
#    charge.Exit Guard. You, however, tell me—not at length, but 
#    briefly—did you know that an edict had forbidden this?

finaldf[86,]
#    speakers                                      dialogue
# 87 Antigone I knew it.  How could I not?  It was public. 
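
The code above folds the speakerless <sp> into the preceding speech. If one row per <sp> is preferred instead (the first option in the question), the last seen speaker can be carried forward. This is a minimal sketch under assumptions, not part of the original answer: the inline fragment and the names doc_small, sp_nodes, speaker_of, dialogue_of and filled are hypothetical, and it assumes the first <sp> always has a <speaker>.

```r
library(XML)

# Hypothetical inline fragment modelled on the excerpt in the question.
fragment <- '<body>
<sp><speaker>Creon</speaker><p>You can take yourself wherever you please.</p></sp>
<sp><p>You, however, tell me: did you know that an edict had forbidden this?</p></sp>
<sp><speaker>Antigone</speaker><p>I knew it. How could I not? It was public.</p></sp>
</body>'
doc_small <- xmlParse(fragment, asText = TRUE)

sp_nodes <- getNodeSet(doc_small, "//sp")

# Speaker of each <sp>, or NA when the <sp> has no <speaker> child.
speaker_of <- vapply(sp_nodes, function(node) {
  s <- xpathSApply(node, "./speaker", xmlValue)
  if (length(s) == 0) NA_character_ else s[[1]]
}, character(1))

# Carry the last seen speaker forward over the NA entries.
for (i in seq_along(speaker_of)) {
  if (is.na(speaker_of[i]) && i > 1) speaker_of[i] <- speaker_of[i - 1]
}

# Dialogue of each <sp>: its <p> children joined into one string.
dialogue_of <- vapply(sp_nodes, function(node) {
  paste(xpathSApply(node, "./p", xmlValue), collapse = " ")
}, character(1))

filled <- data.frame(speaker = speaker_of, dialogue = dialogue_of,
                     stringsAsFactors = FALSE)
print(filled$speaker)
```

With this layout Creon appears on two consecutive rows, so the speaker and dialogue columns always have the same length.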


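Answer 1: (score: 0)

Another option is to read the URL and make a few changes to the text before parsing the XML: in this case, replace the milestone tags with a space so that words do not run together, remove the stage tags, and then fix the sp nodes that have no speaker.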

x <- readLines(url)
x <- gsub("<milestone[^>]*>", " ", x)  # add space
x <- gsub("<stage>[^>]*stage>", "", x) # no space
x <- paste(x, collapse = "")
x <- gsub("</p></sp><sp><p>", "", x)   # fix sp without speaker
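
To see what these substitutions do, here they are applied to a hypothetical two-line stand-in for the output of readLines(url). The input strings are made up for illustration and are much shorter than the real file. The sketch assumes, as the last pattern does, that no whitespace separates the closing and opening tags once the lines are pasted together.

```r
# Hypothetical two-line stand-in for readLines(url); the real file is far longer.
x <- c(
  '<sp><speaker>Creon</speaker><stage>To the Guard.</stage><p>You can take yourself wherever you please,<milestone n="445" unit="line" ed="p"/>free and clear of a heavy charge.<stage>Exit Guard.</stage></p></sp>',
  '<sp><stage>To Antigone.</stage><p>You, however, tell me briefly.</p></sp>'
)

x <- gsub("<milestone[^>]*>", " ", x)   # milestone becomes a space: "please, free"
x <- gsub("<stage>[^>]*stage>", "", x)  # stage directions are dropped entirely
x <- paste(x, collapse = "")            # one long string, no newlines
x <- gsub("</p></sp><sp><p>", "", x)    # speakerless <sp> is merged into the previous one

print(x)
# <sp><speaker>Creon</speaker><p>You can take yourself wherever you please, free and clear of a heavy charge.You, however, tell me briefly.</p></sp>
```

Note that the merged paragraph reads "charge.You" with no space between the sentences. Replacing the matched tags with " " instead of "" in the last call would keep them apart.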

The XML now has the same number of sp and speaker tags.

doc <- xmlParse(x)
summary(doc)
  p              sp         speaker            div2       placeName 
299             297             297              51              25    ...

Finally, get the sp nodes and extract the speakers and paragraphs.

sp <- getNodeSet(doc, "//sp")
s1 <- sapply( sp, xpathSApply, ".//speaker", xmlValue)
# collapse the 1 node with 2 <p>
p1 <- lapply( sp, xpathSApply, ".//p", xmlValue)
p1 <- trimws(sapply(p1, paste, collapse= " "))
speakers <- data.frame(speaker=s1, dialogue = p1)

    speaker                                                                  dialogue
1  Antigone Ismene, my sister, true child of my own mother, do you know any evil o...
2  Ismene   To me no word of our friends, Antigone, either bringing joy or bringin...
3  Antigone I knew it well, so I was trying to bring you outside the courtyard gat...
4  Ismene   Hear what?  It is clear that you are brooding on some dark news.         
5  Antigone Why not?  Has not Creon destined our brothers, the one to honored buri...
6  Ismene   Poor sister, if things have come to this, what would I profit by loose...
7  Antigone Consider whether you will share the toil and the task.                   
8  Ismene   What are you hazarding?  What do you intend?                             
9  Antigone Will you join your hand to mine in order to lift his corpse?             
10 Ismene   You plan to bury him—when it is forbidden to the city?     
...
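
The same result can probably be reached without editing the raw text, by changing the parsed tree instead. The sketch below is an alternative under assumptions, not the method used above: it removes the <stage> nodes with removeNodes() and groups each speakerless <sp> with the speech before it. The inline fragment and the names doc_small, has_speaker, group, text_of and merged are hypothetical, and it assumes the first <sp> has a <speaker>.

```r
library(XML)

# Hypothetical inline fragment modelled on the excerpt in the question.
fragment <- '<body>
<sp><speaker>Creon</speaker><stage>To the Guard.</stage><p>You can take yourself wherever you please.<stage>Exit Guard.</stage></p></sp>
<sp><stage>To Antigone.</stage><p>You, however, tell me briefly.</p></sp>
<sp><speaker>Antigone</speaker><p>I knew it.</p></sp>
</body>'
doc_small <- xmlParse(fragment, asText = TRUE)

# Drop every <stage> element from the tree instead of using a regular expression.
removeNodes(getNodeSet(doc_small, "//stage"))

sp_nodes <- getNodeSet(doc_small, "//sp")

# TRUE for each <sp> that has its own <speaker> child.
has_speaker <- vapply(sp_nodes, function(node) {
  length(getNodeSet(node, "./speaker")) > 0
}, logical(1))

# An <sp> with a <speaker> starts a new group; a speakerless <sp> joins the group before it.
group <- cumsum(has_speaker)

# Text of each <sp>: its <p> children joined into one string.
text_of <- vapply(sp_nodes, function(node) {
  paste(xpathSApply(node, "./p", xmlValue), collapse = " ")
}, character(1))

merged <- data.frame(
  speaker  = xpathSApply(doc_small, "//sp/speaker", xmlValue),
  dialogue = as.vector(tapply(text_of, group, paste, collapse = " ")),
  stringsAsFactors = FALSE
)
print(merged)
```

Working on the tree avoids depending on the exact whitespace between tags, at the cost of a little more code.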