Can I use XPath or something else like a regex to extract data from XML?

Time: 2015-05-24 21:30:11

Tags: php regex xml xpath

Certainly I could use regular expressions to parse data from an XML document.

<?xml version="1.0"?>
<definitions>
  <message name="notificationInput">
    <part name="body" element="xsd:notificationRequest" />
  </message>
  <message name="notificationOutput">
    <part name="body" element="xsd:notificationResponse" />
  </message>
</definitions>

A pattern like

/<message.*name="(.*)".*part.*name=".*".*element="xsd:(.*)".*<\/message>/sUg

would probably give me the data I want, here shown as a PHP array:

array(
  array("notificationInput", "body", "notificationRequest"),
  array("notificationOutput", "body", "notificationResponse")
)

This is of course extremely cumbersome and error-prone.

I know how to use XPath to query complete nodes, but I don't think I can tell it "I want attributes name and element from node /definitions/message/part and for each result I also want attribute name from its parent".

Now I wonder if there is some language or technique (preferably with an implementation in PHP) that I can use to specify the data I want to extract.

In other words, I am looking for a solution that more or less can be configured instead of programmed, because I have quite a few similar definitions to extract.
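To make the "configured instead of programmed" requirement concrete, here is a minimal sketch of one possible shape for such a configuration. It assumes PHP's DOM extension; the `$config` layout and the `extractRows` helper are hypothetical names invented for this illustration. Each field is an XPath 1.0 expression evaluated relative to a `<part>` node, so `../@name` reaches the parent `<message>`.

```php
<?php
// Sample document from the question.
$xml = <<<'XML'
<?xml version="1.0"?>
<definitions>
  <message name="notificationInput">
    <part name="body" element="xsd:notificationRequest" />
  </message>
  <message name="notificationOutput">
    <part name="body" element="xsd:notificationResponse" />
  </message>
</definitions>
XML;

// Hypothetical configuration: 'rows' selects one node per result row,
// 'fields' maps column names to XPath expressions relative to that node.
$config = [
    'rows'   => '/definitions/message/part',
    'fields' => [
        'message' => 'string(../@name)',
        'part'    => 'string(@name)',
        'element' => 'substring-after(@element, ":")',
    ],
];

// Hypothetical helper: applies the configuration to an XML string.
function extractRows(string $xml, array $config): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $rows = [];
    foreach ($xpath->query($config['rows']) as $node) {
        $row = [];
        foreach ($config['fields'] as $key => $expr) {
            // Evaluate the field expression with the row node as context.
            $row[$key] = $xpath->evaluate($expr, $node);
        }
        $rows[] = $row;
    }
    return $rows;
}

$rows = extractRows($xml, $config);
```

Under these assumptions, extracting from a similar definition would only require a different `$config` array, not new code.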

3 answers:

Answer 0 (score: 2)

You could use the XPath

//message/@name|//message[@name]/part/@name|//message/part/@element

to generate a 1-dimensional sequence of all the desired attributes (sorry, this is in Python):

In [366]: doc.xpath('//message/@name|//message[@name]/part/@name|//message/part/@element')
Out[366]: 
['notificationInput',
 'body',
 'xsd:notificationRequest',
 'notificationOutput',
 'body',
 'xsd:notificationResponse']

and then use array_chunk to rearrange the result in groups of 3. (Note that you would still need a bit of regex or string manipulation to remove the xsd: prefix from the element values, but that would still be much easier and more robust than using regex to parse the XML.)

The XPath will collect all the attributes even if there is more than one <part> per <message>.
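A PHP version of the same idea might look like the sketch below. It assumes the DOM extension (`DOMDocument`, `DOMXPath`) and strips the `xsd:` prefix with plain string functions. Note that grouping with `array_chunk` in threes only lines up when every `<message>` has exactly one `<part>`.

```php
<?php
// Sample document from the question.
$xml = <<<'XML'
<?xml version="1.0"?>
<definitions>
  <message name="notificationInput">
    <part name="body" element="xsd:notificationRequest" />
  </message>
  <message name="notificationOutput">
    <part name="body" element="xsd:notificationResponse" />
  </message>
</definitions>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// The union expression from this answer, unchanged.
$query = '//message/@name|//message[@name]/part/@name|//message/part/@element';

$values = [];
foreach ($xpath->query($query) as $attr) {
    $value = $attr->nodeValue;
    // Drop the "xsd:" prefix, but only on element attributes.
    if ($attr->nodeName === 'element' && strpos($value, 'xsd:') === 0) {
        $value = substr($value, 4);
    }
    $values[] = $value;
}

// Regroup the flat list into rows of three values each.
$rows = array_chunk($values, 3);
```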

Answer 1 (score: 1)

This short XPath 1.0 expression selects all the wanted attribute nodes:

/*//*/@*

Then, for each selected node, you can get its string value using PHP (which I don't know).

If you can use XPath 2.0, all the wanted values are produced by evaluating a similar expression:

/*//*/@*/data(.)

Here is a simple XSLT 2.0 transformation that just evaluates the above expression and outputs the result:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:sequence select="/*//*/@*/data(.)"/>
  </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<definitions>
  <message name="notificationInput">
    <part name="body" element="xsd:notificationRequest" />
  </message>
  <message name="notificationOutput">
    <part name="body" element="xsd:notificationResponse" />
  </message>
</definitions>

the wanted result (the values of all six attributes) is produced.
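For the PHP side that this answer leaves open, a minimal sketch follows. It assumes the DOM extension, whose `DOMXPath` class implements XPath 1.0, so the attribute-selecting expression `/*//*/@*` is evaluated and the string values are read in a loop; the XPath 2.0 `data(.)` step is not available there.

```php
<?php
// Sample document from the question.
$xml = <<<'XML'
<?xml version="1.0"?>
<definitions>
  <message name="notificationInput">
    <part name="body" element="xsd:notificationRequest" />
  </message>
  <message name="notificationOutput">
    <part name="body" element="xsd:notificationResponse" />
  </message>
</definitions>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Select every attribute of every element below the document element.
$values = [];
foreach ($xpath->query('/*//*/@*') as $attr) {
    // nodeValue holds the string value of the attribute node.
    $values[] = $attr->nodeValue;
}
```

Here `$values` holds the raw attribute values, so the `xsd:` prefix is still present on the element values.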

Answer 2 (score: 0)

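I know that parsing HTML with regular expressions is not recommended unless you know exactly which characters are involved, but I am posting this answer because it may be useful to you.

For the sample text you provided, you can use a simple regex like this: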

([a-z]+)"

Working demo

PHP code:

$re = "/([a-z]+)\"/i"; 
$str = "<?xml version=\"1.0\"?>\n<definitions>\n  <message name=\"notificationInput\">\n    <part name=\"body\" element=\"xsd:notificationRequest\" />\n  </message>\n  <message name=\"notificationOutput\">\n    <part name=\"body\" element=\"xsd:notificationResponse\" />\n  </message>\n</definitions>"; 

preg_match_all($re, $str, $matches);

Then you can get the captured content from $matches.

Match information:

MATCH 1
1.  [53-70] `notificationInput`
MATCH 2
1.  [89-93] `body`
MATCH 3
1.  [108-127]   `notificationRequest`
MATCH 4
1.  [162-180]   `notificationOutput`
MATCH 5
1.  [199-203]   `body`
MATCH 6
1.  [218-238]   `notificationResponse`
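As a sketch of how the captures could be turned into the array layout asked for in the question: the first capture group ends up in `$matches[1]`, which can then be regrouped with `array_chunk`. This assumes, as in the sample, exactly three quoted alphabetic values per `<message>`; any other input shape would misalign the rows.

```php
<?php
// Sample document from the question.
$str = <<<'XML'
<?xml version="1.0"?>
<definitions>
  <message name="notificationInput">
    <part name="body" element="xsd:notificationRequest" />
  </message>
  <message name="notificationOutput">
    <part name="body" element="xsd:notificationResponse" />
  </message>
</definitions>
XML;

// Same pattern as above: a run of letters immediately before a double quote.
preg_match_all('/([a-z]+)"/i', $str, $matches);

// $matches[1] is the flat list of captures; regroup it into rows of three.
$rows = array_chunk($matches[1], 3);
```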