Question

I am new to XSLT. In a large text corpus, I need to merge all the values of a node's child nodes:

<?xml version='1.0' encoding='UTF-8'?>
    <Informations>
        <Information lang="de" type="a">
            <title>Product</title>
            <Holder>Big Company</Holder>
            <Code>0101010</Code>
            <content>
                <div>
                    <p class="s4" id="section1">
                        <span class="s2">
                            <span>This is Text</span>
                        </span>
                        <sup class="s3">
                            <span>®</span>
                        </sup>
                    </p>
                    <p class="s6">
                        <span class="s5">
                            <span>Sometimes sentences ar</span>
                        </span>
                        <span class="s5">
                            <span>e split by tags</span>
                        </span>
                    </p>
                 </div>
             </content>
         </Information>
    </Informations>

The resulting document should look like this:

<?xml version='1.0' encoding='UTF-8'?>
    <Informations>
        <Information lang="de" type="a">
            <title>Product</title>
            <Holder>Big Company</Holder>
            <Code>0101010</Code>
            <content>
                <div>
                    <p>This is Text®</p>
                    <p>Sometimes sentences are split by tags</p>
                </div>
            </content>
        </Information>
    </Informations>

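So basically, I have to copy the whole structure, but merge all the values of the child nodes of each p tag while getting rid of those child nodes. I would really appreciate your help. Thanks!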

Update: Unless the whitespace inside the p nodes is preserved, some words are joined directly to each other. For example, in a context such as

        <p class="s8">
           <span class="s9">
              <span>Inform your </span>
           </span>
           <span class="s9">
              <span>dentist.</span>
           </span>
        </p>

the resulting text is "Inform yourdentist." Therefore, I tried to preserve the trailing whitespace in the stylesheet:

<xsl:strip-space elements="*"/>
<xsl:preserve-space elements="p"/>
<xsl:output indent="yes" method="xml" encoding="utf-8" omit-xml-declaration="no"/>
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="p">
    <xsl:copy>
        <xsl:value-of select='normalize-space()'/>
    </xsl:copy>
</xsl:template>

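However, words that are split across two span tags are now no longer joined correctly. In the example above, for instance: "sentences ar e split". I tried to find a way to mask or convert all trailing spaces, but it did not work out.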

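Update 2: The following part gives a wrong result, where empty tags (containing only whitespace) separate two words. Is it possible not to treat them as empty tags?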

        <p class="s3">
           <span class="s4">
              <span>Read</span>
           </span>
           <span class="s4">
              <span> </span>
           </span>
           <span class="s4">
              <span>the</span>
           </span>
           <span class="s4">
              <span> information </span>
           </span>
           <span class="s4">
              <span>sheet</span>
           </span>
           <span class="s4">
              <span> </span>
           </span>
           <span class="s4">
              <span>carefully</span>
           </span>
        </p>

Update 3: With the two-step transformation almost everything works, except that br tags are ignored, so

        <p class="s3">
           <span class="s4">
              <span>Read</span>
           </span>
           <span class="s4">
              <span> </span>
           </span>
           <span class="s4">
              <span>the</span>
           </span>
           <span class="s4">
              <span> information </span>
           </span>
           <span class="s4">
              <span>sheet</span>
           </span>
           <span class="s4">
              <span> </span>
           </span>
           <span class="s4">
              <span>carefully</span>
           </span>
           <span class="s4">
              <br/>
           </span>
           <span class="s3">
              <span>This is important</span>
           </span>
        </p>

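becomes "Read the information sheet carefullyThis is important." I tried to convert each br into a line break, but that did not work. Is it possible to somehow convert each br into a closing p tag followed by a new opening p tag? Converting it to

<span> </span>

would also be fine.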

Answer 1

Include a template for the p tag in your script:

<xsl:template match="p">
  <xsl:copy>
    <xsl:value-of select="."/>
  </xsl:copy>
</xsl:template>

But that is not all. At the beginning of the script, add:

<xsl:strip-space elements="*"/>

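Otherwise the output will contain extra spaces and \n characters.

The script must also contain the identity template.

Edit

My whole script is as follows: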

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml" omit-xml-declaration="yes" encoding="UTF-8" indent="yes" />
  <xsl:strip-space elements="*"/>

  <xsl:template match="p">
    <xsl:copy>
      <xsl:value-of select="."/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
</xsl:transform>

Do not use preserve-space for the p tag, because the output would then contain unnecessary whitespace.

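In the template for p, you can change . to normalize-space(), which:

removes leading and trailing whitespace (from the whole concatenated text),
collapses multiple "middle" spaces into a single space.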
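
As a minimal sketch of that variant (assuming the identity template and xsl:strip-space from the script above stay unchanged), the template for p would become:

<xsl:template match="p">
  <xsl:copy>
    <!-- string value of p, trimmed, with inner whitespace runs collapsed to one space -->
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:copy>
</xsl:template>

With the first sample paragraph from the question this should give <p>This is Text®</p>.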

But note that if your source is:

<span>Inform your</span>
<span>dentist</span>

then you will get Inform yourdentist (there is no space between your and dentist, neither in the source nor in the result).

Answer 2

Re Update 2

The given input:

XML

<p class="s3">
   <span class="s4">
      <span>Read</span>
   </span>
   <span class="s4">
      <span> </span>
   </span>
   <span class="s4">
      <span>the</span>
   </span>
   <span class="s4">
      <span> information </span>
   </span>
   <span class="s4">
      <span>sheet</span>
   </span>
   <span class="s4">
      <span> </span>
   </span>
   <span class="s4">
      <span>carefully</span>
   </span>
</p>

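poses contradictory requirements:

on the one hand, you want to strip the whitespace-only nodes surrounding the inner span elements;
on the other hand, you want to preserve the whitespace contained in the inner span elements.

Since the inner and the outer elements have the same name, and xsl:strip-space and xsl:preserve-space select elements by name, this cannot be done with these two instructions directly.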

You can preprocess the input XML with the following stylesheet:

First stylesheet (XSLT 1.0)

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="span[not(@class)]">
    <new-span>
        <xsl:apply-templates select="@*|node()"/>
    </new-span>
</xsl:template>

</xsl:stylesheet>

which produces:

Intermediate XML

<p class="s3">
   <span class="s4">
      <new-span>Read</new-span>
   </span>
   <span class="s4">
      <new-span> </new-span>
   </span>
   <span class="s4">
      <new-span>the</new-span>
   </span>
   <span class="s4">
      <new-span> information </new-span>
   </span>
   <span class="s4">
      <new-span>sheet</new-span>
   </span>
   <span class="s4">
      <new-span> </new-span>
   </span>
   <span class="s4">
      <new-span>carefully</new-span>
   </span>
</p>

Then process the result with the next stylesheet:

Second stylesheet (XSLT 1.0)

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:preserve-space elements="new-span"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="p">
    <xsl:copy>
        <xsl:value-of select='.'/>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>

which yields the final result:

<?xml version="1.0" encoding="UTF-8"?>
<p>Read the information sheet carefully</p>
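
Re Update 3

A possible extension of the same idea, sketched under the assumptions that the line break is written as a well-formed <br/> element and that a single space is an acceptable replacement, is to add one more template to the first stylesheet:

<!-- assumption: replace each br with a new-span holding one space,
     so that the second stylesheet preserves it like any other new-span -->
<xsl:template match="br">
    <new-span>
        <xsl:text> </xsl:text>
    </new-span>
</xsl:template>

The second stylesheet can then stay as it is, and the text before and after the break should no longer run together.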
