XSLT: split a large single parent node with groups into smaller child nodes

Time: 2013-08-22 22:46:24

Tags: xslt split while-loop grouping xslt-2.0

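I asked this question recently but realized I had not explained it clearly. I have a large .csv file (8,000+ lines) made up of invoices, with multiple lines per invoice. I am parsing it into an XML structure like the following (simplified).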

Input 1 - $XMLInput

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
</root>

Input 2 - $maxBatchSize. Description: once a batch grows larger than this size (a constant), break to the next batch.

Input 3 - $listOfInvoices. Description: a repeating variable holding the unique invoice numbers in the document. For example:

<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
    </row>
</root>

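To improve processing time, I need to group these rows by invoiceNumber into batches of no more than X nodes each (X is a variable to be imported). From there I will send each batch to a sub-processor in parallel rather than processing the whole original document at once. For example, for the sample XML document above, if the batch size cannot be larger than 3, I need the following XML output: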

Output 1 - $XMLOutput

<root>
    <batch>
        <row>
            <invoiceNumber>1</invoiceNumber>
            <invoiceText>invoice 1-1</invoiceText>
            <position>1</position>
            ...
        </row>
        <row>
            <invoiceNumber>1</invoiceNumber>
            <invoiceText>invoice 1-2</invoiceText>
            <position>2</position>
            ...
        </row>
        <row>
            <invoiceNumber>2</invoiceNumber>
            <invoiceText>invoice 2-1</invoiceText>
            <position>3</position>
            ...
        </row>
        <row>
            <invoiceNumber>2</invoiceNumber>
            <invoiceText>invoice 2-2</invoiceText>
            <position>4</position>
            ...
        </row>
    </batch>
    <batch>
        <row>
            <invoiceNumber>3</invoiceNumber>
            <invoiceText>invoice 3-1</invoiceText>
            <position>5</position>
            ...
        </row>
        <row>
            <invoiceNumber>3</invoiceNumber>
            <invoiceText>invoice 3-2</invoiceText>
            <position>6</position>
            ...
        </row>
    </batch>
</root>

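All lines of an invoice must be sent in the same batch. My initial XSLT (2.0) attempt is below. I tried to simulate a while loop by recursively calling a template that keeps appending invoice groups to the current node. When the maximum batch size is reached, I recursively call the batch template to create a new batch. I pass the invoice and batch counters between the recursive calls.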

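Edit: Thanks to Ken's help, I'm getting closer. I actually need to break the batches by number of rows, not by number of distinct invoices. Even if the following works in theory, I'm not sure how to ensure that an invoice number has not already appeared in a preceding sibling.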

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:bpws="http://schemas.xmlsoap.org/ws/2003/03/business-process/" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <xsl:variable name="batch-size" select="40" as="xs:integer"/>
<xsl:variable name="input" select="bpws:getVariableData('sortedInvoicesByBU')"/>
<xsl:key name="invoice-lines-by-invoice-number" match="row" use="invoiceNumber4z"/>

<xsl:template match="/">
    <xsl:element name="batches">
        <!--establish batches from possible non-contiguous invoice numbers-->
        <xsl:for-each-group select="$input/*:UPSData/*:row" group-by="(position() - 1) idiv $batch-size">
            <xsl:for-each select="distinct-values($input/*:UPSData/*:row/*:invoiceNumber4z)[not(.=preceding-sibling::item)]">
                <xsl:element name="UPSData">
                    <xsl:for-each select="current()">
                        <xsl:for-each select="key('invoice-lines-by-invoice-number',.,$input)">
                            <!--copy rows as they are-->
                            <xsl:copy-of select="."/>
                        </xsl:for-each>
                    </xsl:for-each>
                </xsl:element>
            </xsl:for-each>
        </xsl:for-each-group>
    </xsl:element>
</xsl:template>
</xsl:stylesheet>

2 answers:

Answer 0 (score: 4)

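I tell my students that one can torture a stylesheet as much as one likes until it finally works, but that does not make it maintainable or even the right way to do things. I hope you will accept the critique that you are treating XSLT as an imperative programming language, which does the language an injustice and only convinces you that something easier to do in C or Java is difficult, verbose and awkward in XSLT.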

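But if you use XSLT the way it was designed, it is easier than an imperative language, and to boot it is entirely XML-based, so you can show what you want the result to look like. Because it is shorter, it is easier to maintain. Once you understand the declarative instructions in use, you don't have to untangle an imperative algorithm. An XSLT processor can also optimize a declarative approach, whereas it must work slowly if it follows an imperative approach as written, with no opportunity to optimize it.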

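In the solution below, which produces exactly your Output1 result, note how I determine the unique invoice numbers and then filter them by those that are valid. I then batch them by the batch size (which is a parameter). No called templates, no counters of any kind... just a solution using the built-in facilities of XSLT 2.0.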

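Not counting the declarations of global parameters and variables, or the comments, it is only 5 elements long: <root>, <xsl:for-each-group>, <batch>, <xsl:for-each> and <xsl:copy-of>.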

As for why your attempt did not work, I don't know... the approach you took doesn't "feel" like XSLT; it feels like an XSLT rendering of some procedural, imperative approach.
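The batching in this answer's stylesheet relies on the grouping key `(position() - 1) idiv $batch-size`, which maps consecutive positions in the population to the same integer. Here is a minimal sketch of just that step. The three literal invoice numbers and the text output format are assumptions made for illustration and are not part of the original solution; the stylesheet can be run against any XML document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output method="text"/>

<xsl:param name="batch-size" select="2"/>

<xsl:template match="/">
  <!-- hypothetical population: three invoice numbers as strings -->
  <xsl:for-each-group select="('1','2','3')"
                      group-by="(position() - 1) idiv $batch-size">
    <!-- positions 1,2 give key 0; position 3 gives key 1 -->
    <xsl:value-of select="concat('batch ', current-grouping-key(), ': ',
                                 string-join(current-group(), ' '),
                                 '&#xa;')"/>
  </xsl:for-each-group>
</xsl:template>

</xsl:stylesheet>
<!-- expected text output:
batch 0: 1 2
batch 1: 3
-->
```

Because the key depends only on position, each group holds at most `$batch-size` invoice numbers, and the last group takes whatever remains.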


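I have edited this answer to add an alternative: since you state that you have 8 million input records, I think a key lookup table will perform better than my simple variable predicate. It produces the same result with one extra XSLT instruction in the template (it could be done without adding it, but I find this more readable) and drops the variable that is no longer needed.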

t:\ftemp>type numbers.xml 
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
    </row>
</root>

t:\ftemp>type invoices.xml 
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
</root>

t:\ftemp>call xslt2 invoices.xml invoices.xsl 
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <batch>
      <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
      <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
   </batch>
   <batch>
      <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
      <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
   </batch>
</root>

t:\ftemp>type invoices.xsl 
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="2"/>

<xsl:variable name="valid-numbers"
              select="doc('numbers.xml')/root/row/invoiceNumber"/>

<xsl:template match="/">
  <xsl:variable name="invoiceLines" select="root/row"/>
  <root>
    <!--establish batches from possible non-contiguous invoice numbers-->
    <xsl:for-each-group  group-by="(position() - 1) idiv $batch-size" 
      select="distinct-values($invoiceLines/invoiceNumber)[.=$valid-numbers]">
      <!--create a batch using all invoice lines for all numbers in group-->
      <batch>
        <xsl:for-each select="$invoiceLines[invoiceNumber=current-group()]">
          <!--copy rows as they are-->
          <xsl:copy-of select="."/>
        </xsl:for-each>
      </batch>
    </xsl:for-each-group>
  </root>
</xsl:template>

</xsl:stylesheet>
t:\ftemp>rem Done! 
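The listing above selects each batch's rows with a variable predicate (`$invoiceLines[invoiceNumber=current-group()]`). A key-based variant of the kind the answer describes might look like the following. This is a minimal sketch, not the author's own listing: the key name `lines-by-number` and the global `$input` variable are assumptions, and the third argument to `key()` is needed because the context item inside the group is an atomic value rather than a node.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="2"/>

<xsl:variable name="valid-numbers"
              select="doc('numbers.xml')/root/row/invoiceNumber"/>

<!--index every row by its invoice number (assumed key name)-->
<xsl:key name="lines-by-number" match="row" use="invoiceNumber"/>

<!--remember the input document for use as the key() search root-->
<xsl:variable name="input" select="/"/>

<xsl:template match="/">
  <root>
    <!--establish batches from possible non-contiguous invoice numbers-->
    <xsl:for-each-group group-by="(position() - 1) idiv $batch-size"
      select="distinct-values(root/row/invoiceNumber)[. = $valid-numbers]">
      <batch>
        <!--look up all rows for every number in this group via the key-->
        <xsl:copy-of
          select="key('lines-by-number', current-group(), $input)"/>
      </batch>
    </xsl:for-each-group>
  </root>
</xsl:template>

</xsl:stylesheet>
```

With the same `numbers.xml` and `invoices.xml` as above, this should yield the same two batches. `key()` returns the matching rows in document order, so the rows inside each batch keep their original sequence.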

Answer 1 (score: 0)

Please don't mark this as the answer, since my earlier answer addresses the original question.

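The code below answers the secondary question of how to batch by the total number of invoice lines without splitting an invoice across two batches.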

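I couldn't think of a way to do this declaratively, so the answer below is an imperative recursive solution, but it is written so that an XSLT processor implementing tail recursion will not consume stack space. I also take advantage of native XSLT features (key tables and sequences) that are hard to mimic in other languages.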

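The code is quite compact, and there is only one place that actually writes out a batch of invoices... no duplicated blocks of batch-writing code. I'm pleased with how it turned out.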

I welcome any suggestions for improvement, or posts of alternative solutions more compact than this one.

t:\ftemp>type numbers.xml 
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
    </row>
    <row>
        <invoiceNumber>5</invoiceNumber>
    </row>
</root>

t:\ftemp>type invoices.xml 
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
    <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-1</invoiceText>
        <position>7</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-2</invoiceText>
        <position>8</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-3</invoiceText>
        <position>9</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-4</invoiceText>
        <position>10</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-5</invoiceText>
        <position>11</position>
        ...
    </row>
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-6</invoiceText>
        <position>12</position>
        ...
    </row>
    <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-1</invoiceText>
        <position>13</position>
        ...
    </row>
    <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-2</invoiceText>
        <position>14</position>
        ...
    </row>
</root>

t:\ftemp>call xslt2 invoices.xml invoices.xsl 
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <!--Batch max lines: 5-->
  <batch>
    <!--invoice numbers: 1 2-->
    <!--total line count: 4-->
    <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-1</invoiceText>
        <position>1</position>
        ...
    </row>
      <row>
        <invoiceNumber>1</invoiceNumber>
        <invoiceText>invoice 1-2</invoiceText>
        <position>2</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-1</invoiceText>
        <position>3</position>
        ...
    </row>
      <row>
        <invoiceNumber>2</invoiceNumber>
        <invoiceText>invoice 2-2</invoiceText>
        <position>4</position>
        ...
    </row>
   </batch>
   <batch>
    <!--invoice numbers: 3-->
    <!--total line count: 2-->
    <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-1</invoiceText>
        <position>5</position>
        ...
    </row>
      <row>
        <invoiceNumber>3</invoiceNumber>
        <invoiceText>invoice 3-2</invoiceText>
        <position>6</position>
        ...
    </row>
   </batch>
   <batch>
    <!--invoice numbers: 4-->
    <!--total line count: 6-->
    <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-1</invoiceText>
        <position>7</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-2</invoiceText>
        <position>8</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-3</invoiceText>
        <position>9</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-4</invoiceText>
        <position>10</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-5</invoiceText>
        <position>11</position>
        ...
    </row>
      <row>
        <invoiceNumber>4</invoiceNumber>
        <invoiceText>invoice 4-6</invoiceText>
        <position>12</position>
        ...
    </row>
   </batch>
   <batch>
    <!--invoice numbers: 5-->
    <!--total line count: 2-->
    <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-1</invoiceText>
        <position>13</position>
        ...
    </row>
      <row>
        <invoiceNumber>5</invoiceNumber>
        <invoiceText>invoice 5-2</invoiceText>
        <position>14</position>
        ...
    </row>
   </batch>
</root>

t:\ftemp>type invoices.xsl 
<?xml version="1.0" encoding="US-ASCII"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:param name="batch-size" select="5"/>

<xsl:variable name="valid-numbers"
              select="doc('numbers.xml')/root/row/invoiceNumber"/>

<xsl:key name="invoice-lines-by-invoice-number"
         match="row" use="invoiceNumber"/>

<xsl:variable name="input" select="/"/>

<xsl:template match="/">
  <root>
    <xsl:text>&#xa;  </xsl:text>
    <xsl:comment select="'Batch max lines:',$batch-size"/>
    <xsl:text>&#xa;  </xsl:text>
    <xsl:call-template name="next-batch">
      <xsl:with-param name="remaining-numbers" 
        select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]"/>
    </xsl:call-template>
  </root>
</xsl:template>

<xsl:template name="next-batch">
  <xsl:param name="this-batch-lines" select="0"/>
  <xsl:param name="this-batch-numbers" select="()"/>
  <xsl:param name="remaining-numbers" required="yes"/>
  <xsl:variable name="this-invoice" select="$remaining-numbers[1]"/>
  <xsl:variable name="this-invoice-lines"
  select="count(key('invoice-lines-by-invoice-number',$this-invoice,$input))"/>

  <xsl:choose>
    <xsl:when test="not($this-invoice) and not($this-batch-lines)">
      <!--nothing to clean up and nothing more to do-->
    </xsl:when>
    <xsl:when test="not($this-invoice) (:last invoice complete:) or
                    ( $this-batch-lines + $this-invoice-lines > $batch-size )
                      (:this invoice exceeds limit:)">
      <!--clean up previous unfinished batch-->
      <batch>
        <xsl:text>&#xa;    </xsl:text>
        <xsl:comment select="'invoice numbers:',$this-batch-numbers"/>
        <xsl:text>&#xa;    </xsl:text>
        <xsl:comment select="'total line count:',$this-batch-lines"/>
        <xsl:text>&#xa;    </xsl:text>
        <xsl:copy-of select="for $num in $this-batch-numbers return
                         key('invoice-lines-by-invoice-number',$num,$input)"/>
      </batch>
      <xsl:if test="$this-invoice">
        <!--continue with the next batch comprised of this invoice only-->
        <xsl:call-template name="next-batch">
          <xsl:with-param name="this-batch-lines"
                          select="$this-invoice-lines"/>
          <xsl:with-param name="this-batch-numbers"
                          select="$this-invoice"/>
          <xsl:with-param name="remaining-numbers" 
                          select="$remaining-numbers[position()>1]"/>
        </xsl:call-template>
      </xsl:if>
      <!--the cleaned up batch was the last batch, template recursion ends-->
    </xsl:when>
    <xsl:otherwise>
      <!--a batch limit has not been exceeded; add this invoice to batch-->
      <xsl:call-template name="next-batch">
        <xsl:with-param name="this-batch-lines"
                        select="$this-batch-lines + $this-invoice-lines"/>
        <xsl:with-param name="this-batch-numbers"
                        select="($this-batch-numbers,$this-invoice)"/>
        <xsl:with-param name="remaining-numbers"
                          select="$remaining-numbers[position()>1]"/>
      </xsl:call-template>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>