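Writing HTML into a user-defined XML using XSL

Time: 2017-02-25 08:17:07

Tags: xml xslt xslt-1.0

We plan to convert Word content to XML. We use Java to convert the Word list to HTML, and we are trying to convert that HTML into a user-defined XML. We are facing many challenges with this, so could someone help me create the XSL for it?

Word input: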

INPUT XML:

[![Word Content][1]][1]

Rules:

<body lang=EN-US style='tab-interval:.5in'>

<p class=MsoListParagraphCxSpFirst style='text-indent:-.25in;mso-list:l2 level1 lfo1'>
<span style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin'>
<span style='mso-list:Ignore'>1.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></span></span>
FirstListItem para1</p>

<p class=MsoListParagraphCxSpMiddle>FirstListItem para2</p>

<p class=MsoListParagraphCxSpMiddle>FirstListItem para3</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.0in;mso-add-space:auto;text-indent:-.25in;mso-list:l1 level1 lfo2'>
<span style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin'>
<span style='mso-list:Ignore'>A.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></span></span>
First Inner ListItem para1</p>

<p class=MsoListParagraphCxSpMiddle><span style='mso-tab-count:1'></span>
First Inner ListItem para2</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.0in;mso-add-space:auto;text-indent:-.25in;mso-list:l1 level1 lfo2'>
<span style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin'>
<span style='mso-list:Ignore'>B.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></span></span>
Second Inner ListItem para1</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;mso-add-space:auto;text-indent:-.25in;mso-list:l0 level1 lfo3'>
<span style='font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol'>
<span style='mso-list:Ignore'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span></span>Outer Bulleted List para1</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;mso-add-space:auto'>Outer Bulleted List Para2</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.75in;mso-add-space:auto;text-indent:-.25in;mso-list:l3 level1 lfo4'>
<span style='font-family:Wingdings;mso-fareast-font-family:Wingdings;mso-bidi-font-family:Wingdings'><span style='mso-list:Ignore'>ü<span style='font:7.0pt "Times New Roman"'>&nbsp;</span></span></span>Inner Bulleted List para</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.75in;mso-add-space:auto;text-indent:-.25in;mso-list:l3 level1 lfo4'>
<span style='font-family:Wingdings;mso-fareast-font-family:Wingdings;mso-bidi-font-family:Wingdings'>
<span style='mso-list:Ignore'>ü<span style='font:7.0pt "Times New Roman"'>&nbsp;</span></span></span>Second Inner bulleted List para1</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.75in;mso-add-space:auto'>Second Inner bulleted List para2</p>

<p class=MsoListParagraphCxSpMiddle style='margin-left:1.25in;mso-add-space:auto'>Outer Bulleted List Para3</p>

<p class=MsoListParagraphCxSpLast style='margin-left:1.25in;mso-add-space:auto;text-indent:-.25in;mso-list:l0 level1 lfo3'>
<span style='font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol'>
<span style='mso-list:Ignore'>·<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span></span>Second Outer bulleted List</p>

<p class=MsoNormal style='margin-left:.5in;text-indent:.5in'>Second Inner List Item para2</p>

<p class=MsoListParagraphCxSpFirst>First List Item para4</p>

<p class=MsoListParagraphCxSpLast style='text-indent:-.25in;mso-list:l2 level1 lfo1'>
<span style='mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin'>
<span style='mso-list:Ignore'>2.<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span></span>SecondListItem para1</p>

<!--EndFragment-->
</body>

LOGIC:

a) There can be seven lists, such as FirstList, SecondList ... SeventhList.
b) Every list should have its own item, such as FirstListItem, SecondListItem ... SeventhListItem.
c) Every item should have content, i.e. Content.
d) All bulleted lists will be considered as BulletedList.
e) Each BulletedList should have Item, and each Item should have Content.
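The naming scheme in a) to e) can be sketched in Python as follows. This is only one reading of the rules, not a confirmed target schema: it assumes that FirstList ... SeventhList correspond to nesting levels 1 to 7, that a bulleted list uses the BulletedList / BulletedListItem pair, and that nested lists sit inside the item that owns them. The `root` wrapper element and the helper names `add_list` and `add_item` are hypothetical.

```python
import xml.etree.ElementTree as ET

ORDINALS = ["First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh"]

def add_list(parent, level, bulleted=False):
    """Append a list element named after its nesting level (assumed mapping)."""
    if not 1 <= level <= len(ORDINALS):
        raise ValueError("only seven list levels are defined")
    prefix = "Bulleted" if bulleted else ORDINALS[level - 1]
    return ET.SubElement(parent, prefix + "List")

def add_item(list_el, paragraphs):
    """Append an item to a list; every paragraph becomes one Content child."""
    item = ET.SubElement(list_el, list_el.tag + "Item")
    for text in paragraphs:
        ET.SubElement(item, "Content").text = text
    return item

# Build the first few paragraphs of the sample by hand.
root = ET.Element("root")  # hypothetical wrapper element
first_list = add_list(root, 1)
first_item = add_item(first_list, ["FirstListItem para1", "FirstListItem para2"])
inner_list = add_list(first_item, 2)
add_item(inner_list, ["First Inner ListItem para1"])
bullets = add_list(first_item, 3, bulleted=True)
add_item(bullets, ["Outer Bulleted List para1"])

print(ET.tostring(root, encoding="unicode"))
```

Deriving the item name from the list name keeps the two in step, so a change to the naming rule only has to be made in `add_list`.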

Expected output:

    1) A paragraph that has an "mso-list" value and whose span font-family is anything other than Wingdings/Symbol is a numbered list (FirstList, SecondList ... SeventhList). The content of the P tag should go inside FirstList --> FirstListItem --> Content.
    2) A paragraph that has an "mso-list" value and whose span has font-family:Symbol or font-family:Wingdings should be treated as a BulletedListItem. The content of the P tag should go inside FirstList --> FirstListItem --> Content.
    3) The class name "MsoListParagraphCxSpFirst" identifies the starting point of the list.
    4) A P tag without a span tag is part of the previous list. In this sample, 8 P tags belong to this category. Each one can be attached to the previous list as its next content.
    5) For a P tag without a span: its margin-left + text-indent = the margin-left of the P + span tag (the list item) it belongs to.
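Rules 1, 2 and 4 amount to a three-way classification of each paragraph, which can be sketched in Python before it is expressed in XSLT. This is a minimal sketch under stated assumptions: the style strings are taken verbatim from the sample, the presence of "mso-list" on the P tag is used as the test for "starts a list item", and the names `parse_style` and `classify` are hypothetical helpers, not part of any library.

```python
BULLET_FONTS = {"symbol", "wingdings"}

def parse_style(style):
    """Split a CSS style string into a dict with lowercase property names."""
    props = {}
    for declaration in (style or "").split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props

def classify(p_style, span_style):
    """Return 'numbered', 'bulleted' or 'continuation' for one <p>.

    p_style is the style of the <p>; span_style is the style of its
    first child <span>, or None when there is no such span.
    """
    if "mso-list" not in parse_style(p_style):
        return "continuation"  # rule 4: belongs to the previous list
    font = parse_style(span_style).get("font-family", "").strip("'\"").lower()
    return "bulleted" if font in BULLET_FONTS else "numbered"

numbered = classify(
    "text-indent:-.25in;mso-list:l2 level1 lfo1",
    "mso-bidi-font-family:Calibri;mso-bidi-theme-font:minor-latin",
)
bulleted = classify(
    "margin-left:1.25in;mso-add-space:auto;text-indent:-.25in;mso-list:l0 level1 lfo3",
    "font-family:Symbol;mso-fareast-font-family:Symbol;mso-bidi-font-family:Symbol",
)
continuation = classify("margin-left:1.25in;mso-add-space:auto", None)
```

Note that the sample HTML uses unquoted attribute values and `&nbsp;`, so it would likely need to be tidied into well-formed XML before an XSLT 1.0 processor can read it.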

    Example of mapping content to a list item:
    15th -> It has no margin-left attribute, or margin-left = 0. It should be content of the first ListItem.
    14th -> margin-left:.5in;text-indent:.5in add up to 1in. The preceding sibling has margin-left:1.0in. Both are equal, so this paragraph belongs to the previous list element whose margin-left is 1in.
    12th -> margin-left:1.25in belongs to the previous list whose margin-left is 1.25in.
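The indent arithmetic in these examples can be checked with a short Python sketch. It assumes all lengths are given in inches as in the sample, that a list item's own indent is its margin-left alone, and that a continuation paragraph belongs to the nearest preceding item whose margin-left equals the paragraph's margin-left + text-indent. The function names and the item labels ("1.", "A.", "bullet 1" and so on) are hypothetical.

```python
import re

def length_in(style, prop):
    """Return the value of `prop` in inches from a style string, or 0.0 if absent."""
    pattern = r"(?:^|;)\s*" + re.escape(prop) + r"\s*:\s*(-?\d*\.?\d+)in"
    match = re.search(pattern, style)
    return float(match.group(1)) if match else 0.0

def effective_indent(style):
    """margin-left + text-indent of a paragraph that has no list span."""
    return length_in(style, "margin-left") + length_in(style, "text-indent")

def find_owner(paragraph_style, preceding_items):
    """Return the label of the nearest preceding item with a matching margin-left."""
    target = effective_indent(paragraph_style)
    for label, item_style in reversed(preceding_items):
        if abs(length_in(item_style, "margin-left") - target) < 1e-6:
            return label
    return None

# List-item paragraphs of the sample, in document order (labels are made up).
items = [
    ("1.", "text-indent:-.25in;mso-list:l2 level1 lfo1"),
    ("A.", "margin-left:1.0in;mso-add-space:auto;text-indent:-.25in;mso-list:l1 level1 lfo2"),
    ("B.", "margin-left:1.0in;mso-add-space:auto;text-indent:-.25in;mso-list:l1 level1 lfo2"),
    ("bullet 1", "margin-left:1.25in;mso-add-space:auto;text-indent:-.25in;mso-list:l0 level1 lfo3"),
    ("check 1", "margin-left:1.75in;mso-add-space:auto;text-indent:-.25in;mso-list:l3 level1 lfo4"),
    ("check 2", "margin-left:1.75in;mso-add-space:auto;text-indent:-.25in;mso-list:l3 level1 lfo4"),
    ("bullet 2", "margin-left:1.25in;mso-add-space:auto;text-indent:-.25in;mso-list:l0 level1 lfo3"),
]

owner_12 = find_owner("margin-left:1.25in;mso-add-space:auto", items[:6])
owner_14 = find_owner("margin-left:.5in;text-indent:.5in", items)
owner_15 = find_owner("", items)
```

Under these assumptions the 12th paragraph attaches to the first outer bullet, the 14th to item B, and the 15th to item 1, which agrees with the mapping described above.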

0 answers:

No answers