Removing special characters and doing search-and-replace using XSLT

Date: 2017-06-15 10:04:17

Tags: regex xslt

I'm trying to replace certain strings of text and then remove all RTF tags from the same text string.

So the initial value is:

<test>
<data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0     Times New Roman;}{\f2\fcharset0 Segoe UI;}{\f3\fcharset0 arial;}}{\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{{\ltrch Ingredients: roast British chicken breast \'b7 chicken stock mayo and smoked  \'b7 prawns with mayo on malted brown bread \'b7 smoked British ham with mustard mayo on oatmeal bread \'b7 .}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch }{\ltrch }{\ltrch  }\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltrch roast British chicken breast \'b7 chicken stock mayo and smoked  : Chicken Breast (25.89%) \'b7 }{\ltrch {\b Wheatflour}}{\ltrch  contains }{\ltrch {\b Gluten}}{\ltrch  (with Wheatflour \'b7 Calcium Carbonate \'b7 Iron \'b7 Niacin \'b7 Thiamin) \'b7 Water \'b7 Pork (10.32%) \'b7 Malted }{\ltrch {\b Wheatflakes}}{\ltrch  (contain }{\ltrch {\b Gluten}}{\ltrch ) \'b7 Rapeseed Oil \'b7 }{\ltrch {\b Wheat}}\li0\ri0\sa0\sb0\fi0\ql\par}
{{\ltArch }{\ltrch }{\ltrch  }\li0\ri0\sa0\sb0\fi0\ql\par}

}
}
</test>

So what needs to be done:

  1. Values like {\b Wheat} should become <bold>Wheat</bold>, where Wheat can be any text.
  2. \'b7 should become a comma (',')

The result would be:

<test>
<data>Ingredients: roast British chicken breast , chicken stock mayo and smoked  , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .
roast British chicken breast , chicken stock mayo and smoked  : Chicken Breast (25.89%) , <bold> Wheatflour</bold> contains <bold>Gluten</bold>(with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold> Wheatflakes</bold>contain <bold> Gluten</bold>, Rapeseed Oil , <bold> Wheat</bold>
</data>
</test>

Can this be done? If so, how?

1 Answer:

Answer 0 (score: 0)

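If you can use XSLT 2.0 or later (which includes regular-expression functions), this isn't terribly difficult. The key is the replace() function.

Here's a snippet that starts cleaning up your RTF mess: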
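The patterns in the template that follows rely on the lazy quantifier `.*?`, which XPath 2.0 regular expressions support. As a rough illustration of why that matters, here is a minimal sketch in Python rather than XSLT (an assumption made only so the regex can be tried in isolation; Python writes the back-reference as `\1` where replace() uses `$1`):

```python
import re

# Two bold groups on one line, as in the sample data.
text = r"{\b Wheatflour} contains {\b Gluten}"

# Greedy: (.*) runs to the LAST closing brace, so both groups are swallowed.
greedy = re.sub(r"\{\\b (.*)\}", r"<bold>\1</bold>", text)

# Lazy: (.*?) stops at the FIRST closing brace, so each group is wrapped separately.
lazy = re.sub(r"\{\\b (.*?)\}", r"<bold>\1</bold>", text)

print(greedy)
print(lazy)
```

Only the lazy form yields one `<bold>` element per RTF group, which is the behaviour the question asks for.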

<xsl:template match="data">
    <xsl:copy>
        <!-- Note: XSL variables are _immutable_: once created, their values 
            cannot be changed.  I use a chain of variables here simply for 
            purposes of illustration, as a means of showing each regex 
            replacement operation on its own.  These could all be stacked
            into a single statement, but that is somewhat harder for
            humans to read. :) -->
        <xsl:variable name="bolded" select="replace(., '\{\\b (.*?)\}', '&lt;bold&gt;$1&lt;/bold&gt;')"/>
        <xsl:variable name="commas" select="replace($bolded, '\\''b7', ',')"/>
        <xsl:variable name="unfonted" select="replace($commas, '\{\\fonttbl\{.*?\}\}', '')"/>
        <xsl:variable name="uncolored" select="replace($unfonted, '\{\\colortbl\\.*?\}', '')"/>
        <xsl:variable name="no-ltrch" select="replace($uncolored, '\{\\ltrch (.*?)\}', '$1')"/>
        <xsl:value-of select="$no-ltrch" disable-output-escaping="yes"/>
    </xsl:copy>
</xsl:template>

Current output (after adding the closing </data> tag that is missing from the sample input XML):

<?xml version="1.0" encoding="UTF-8"?><test>
    <data>{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs16\f3\cf0 \cf0\ql{\ql{Ingredients: roast British chicken breast , chicken stock mayo and smoked  , prawns with mayo on malted brown bread , smoked British ham with mustard mayo on oatmeal bread , .\li0\ri0\sa0\sb0\fi0\ql\par}
        { \li0\ri0\sa0\sb0\fi0\ql\par}
        {roast British chicken breast , chicken stock mayo and smoked  : Chicken Breast (25.89%) , <bold>Wheatflour</bold> contains <bold>Gluten</bold> (with Wheatflour , Calcium Carbonate , Iron , Niacin , Thiamin) , Water , Pork (10.32%) , Malted <bold>Wheatflakes</bold> (contain <bold>Gluten</bold>) , Rapeseed Oil , <bold>Wheat</bold>\li0\ri0\sa0\sb0\fi0\ql\par}
        {{\ltArch } \li0\ri0\sa0\sb0\fi0\ql\par}

        }
        }</data>
</test>
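For experimenting with the patterns without running an XSLT processor, the five replace() steps from the template can be mirrored in a short script. This is only a sketch in Python: the `STEPS` list and `run_chain` function are names invented here, the sample string is a shortened fragment of the input, and Python's `re` syntax is assumed to behave like the XPath regexes for these particular patterns.

```python
import re

# (pattern, replacement) pairs, in the same order as the XSLT variables.
STEPS = [
    (r"\{\\b (.*?)\}", r"<bold>\1</bold>"),  # bolded
    (r"\\'b7", ","),                          # commas
    (r"\{\\fonttbl\{.*?\}\}", ""),            # unfonted
    (r"\{\\colortbl\\.*?\}", ""),             # uncolored
    (r"\{\\ltrch (.*?)\}", r"\1"),            # no-ltrch
]

def run_chain(text: str) -> str:
    """Apply each replacement in turn, feeding the result to the next step."""
    for pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
    return text

# Shortened, hypothetical fragment of the RTF in the question.
sample = (
    r"{\fonttbl{\f0\fcharset0 Times New Roman;}{\f3\fcharset0 arial;}}"
    r"{\colortbl\red0\green0\blue0;}"
    r"{\ltrch Oil \'b7 }{\ltrch {\b Wheat}}\par"
)
print(run_chain(sample))
```

Keeping the steps in a list makes it easy to add or reorder patterns while testing, before porting them back into the stylesheet.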

At this point, you just need to work out the remaining regular expressions needed to strip the rest of the RTF codes.
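One possible shape for those remaining steps, sketched in Python so it can be tried quickly: remove every leftover control word, remove the leftover group braces, then drop the lines that end up empty. The function name and the sample are made up for illustration. The sketch assumes the text contains no escaped braces and no other `\'hh` escapes, and the patterns have not been checked in an XSLT processor.

```python
import re

def strip_remaining_rtf(text: str) -> str:
    """Remove leftover RTF control words and braces, then tidy blank lines."""
    # Control words such as \rtf1, \pard, \li0 or \fs16, with an optional
    # numeric parameter and the single space that may terminate them.
    text = re.sub(r"\\[a-zA-Z]+-?[0-9]* ?", "", text)
    # Group braces left behind once the control words are gone.
    text = re.sub(r"[{}]", "", text)
    # Trim each line and discard the ones that are now empty.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Hypothetical, shortened version of the intermediate output shown above.
sample = "\n".join([
    r"{\rtf1\ansi\pard\plain{\lang1033\fs16\f3\cf0 \cf0\ql{Ingredients: ham , mayo .\li0\fi0\ql\par}",
    r"{ \li0\par}",
    r"{Rapeseed Oil , <bold>Wheat</bold>\li0\par}",
    r"}",
])
print(strip_remaining_rtf(sample))
```

Because the control-word pattern is generic, it also removes unexpected words such as the `\ltArch` that survives in the output above. In XSLT the same idea would be two more replace() calls added to the chain of variables.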
