Python: removing text between HTML tags (continuation topic)

Time: 2016-12-02 10:10:14

Tags: python html

All, this is a continuation of my previous post, but for a different scenario.

I now have a specific case where I need to extract the text between tags.

    data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->
<DIV CLASS="c10">&nbsp;</DIV>
<A NAME="DOC_ID_0_1"></A><!-- Hide XML section from browser
<DOC NUMBER=2>
<DOCFULL> -->
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 16, 2016 Wednesday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section B; Column 0; Classified; Pg. 16</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; Richard Thornton, 89, of Peoria, IL, died peacefully and surrounded by family on Friday, March 11, 2016. Bob was born October 16, 1926, in Jersey City, New Jersey. He graduated from Regis High School in New York City on June 15, 1945, and immediately thereafter served in the U.S. Navy. He received a B.A. from Georgetown University in 1950 and a J.D. from Columbia University Law School in 1953. He practiced law in New York City for 17 years with the law firms of Dorr Hand and Nixon, Mudge, Rose, Guthrie &amp; Alexander. He joined the legal department of Caterpillar Tractor Co. in 1970 and served as the company's General Counsel and Corporate Secretary from 1983 to 1991. He is survived by his wife, Dorothy (McGuire) of Peoria; and his children, Matthew, Nicholas, Jennifer, and Julia. In lieu of flowers, donations may be made in the name of Robert and Dorothy Thornton to St. Philomena's School in Peoria, IL, Regis High School in New York City, or the National Association for Rare Disorders (www.rare diseases.org). 1/3 </SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); HIGH SCHOOLS (90%); LAWYERS (87%); LAW SCHOOLS (77%); CORPORATE COUNSEL (75%); LEGAL SERVICES (70%); GRADUATE &amp; PROFESSIONAL SCHOOLS (70%); ASSOCIATIONS &amp; ORGANIZATIONS (65%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); NAICS333120 CONSTRUCTION MACHINERY MANUFACTURING (70%); NAICS333111 FARM MACHINERY &amp; EQUIPMENT MANUFACTURING (70%); SIC3531 CONSTRUCTION MACHINERY &amp; EQUIPMENT (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 16, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>

'''

The solution I tried:

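import re

# Raw strings avoid invalid-escape warnings; "<" and ">" need no escaping in a regex.
# The greedy (.*) runs to the last </SPAN></P> on the line, so inner tags are captured too.
publicationnamepattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">(.*)</SPAN></P>'

# [^<]* stops at the first "<", so only the text of the first SPAN is captured.
copyrightpattern = r'<DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">([^<]*)</SPAN>'

publicationnamepatternvalues = [a.strip("*") for a in re.findall(publicationnamepattern, data)]

copyrightpatternvalues = [a.strip("*") for a in re.findall(copyrightpattern, data)]

print(str(publicationnamepatternvalues))

print(str(copyrightpatternvalues))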

Result:

['The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company', 'The </SPAN><SPAN CLASS="c3">New York Times', 'Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company']

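I need only "The New York Times" for publicationnamepatternvalues and "Copyright 2016 The New York Times Company" for copyrightpatternvalues.

I can't provide more static values, because these are the only fields common across the data, i.e. The New York Times.

Some of the data has the span class as c2, some has c4, and so on.

Can anyone help me with how to handle this situation?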

2 answers:

Answer 0 (score: 1)

Use BeautifulSoup:

from bs4 import BeautifulSoup

data = '''... your html ...'''

soup = BeautifulSoup(data, 'html.parser')

for x in soup.select('div.c0 br p.c1'):
    print(x.text)

Result:

The New York Times
Copyright 2016 The New York Times Company
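
The selector above depends on how the parser nests `<P>` relative to `<BR>`, which may differ between parsers and BeautifulSoup versions. As a cross-check that stays close to the regex attempt in the question, here is a minimal standard-library sketch: it captures the whole paragraph regardless of which SPAN classes it contains, then strips the inner tags. The names `sample`, `paragraph_pattern`, `tag_pattern` and `extract_header_texts` are hypothetical, and `sample` is only a three-block excerpt of the question's data.

```python
import html
import re

# Hypothetical excerpt of the question's data: three of the c0 blocks.
sample = '''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''

# Capture everything inside the paragraph that follows <DIV CLASS="c0"><BR>.
# The non-greedy (.*?) stops at the first closing </P>.
paragraph_pattern = re.compile(
    r'<DIV CLASS="c0"><BR><P CLASS="c1">(.*?)</P>', re.DOTALL)

# Matches any single tag such as <SPAN CLASS="c3"> or </SPAN>.
tag_pattern = re.compile(r'<[^>]+>')


def extract_header_texts(markup):
    """Return the tag-free text of each matching paragraph."""
    texts = []
    for inner in paragraph_pattern.findall(markup):
        text = tag_pattern.sub('', inner)           # drop the SPAN tags
        texts.append(html.unescape(text).strip())   # decode entities like &amp;
    return texts


print(extract_header_texts(sample))
```

This is a sketch under the assumption that the markup is as regular as the sample; for less uniform HTML a real parser is the safer choice.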

Answer 1 (score: 1)

from bs4 import BeautifulSoup

a="""
data='''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c4"><P CLASS="c1"><SPAN CLASS="c3">March</SPAN><SPAN CLASS="c2"> 17, 2016 Thursday</SPAN><SPAN CLASS="c2">&nbsp;</SPAN><SPAN CLASS="c2">&nbsp;<BR>Late Edition - Final</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c7">Paid Notice: Deaths THORNTON, ROBERT</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SECTION: </SPAN><SPAN CLASS="c2">Section A; Column 0; Classified; Pg. 19</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LENGTH: </SPAN><SPAN CLASS="c2">176 words</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">THORNTON--Robert. Robert &quot;Bob&quot; 1/3</SPAN><SPAN CLASS="c2">&nbsp;</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">URL: </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LANGUAGE: </SPAN><SPAN CLASS="c2">ENGLISH</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">SUBJECT: </SPAN><SPAN CLASS="c2">DEATHS &amp; OBITUARIES (92%); </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COMPANY: </SPAN><SPAN CLASS="c2">CATERPILLAR INC (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">ORGANIZATION: </SPAN><SPAN CLASS="c2">COLUMBIA UNIVERSITY (57%); GEORGETOWN UNIVERSITY (57%); US NAVY (57%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">TICKER: </SPAN><SPAN CLASS="c2">CATR (PAR) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (SWX) (70%); </SPAN><SPAN CLASS="c3">CAT</SPAN><SPAN CLASS="c2"> (NYSE) (70%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">INDUSTRY: </SPAN><SPAN CLASS="c2">NAICS333131 MINING MACHINERY &amp; </SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">PERSON: </SPAN><SPAN CLASS="c2">RICHARD NIXON (78%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">CITY: </SPAN><SPAN CLASS="c2">NEW YORK, NY, USA (94%); PEORIA, IL, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">STATE: </SPAN><SPAN CLASS="c2">NEW YORK, USA (94%); ILLINOIS, USA (94%); NEW JERSEY, USA (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">COUNTRY: </SPAN><SPAN CLASS="c2">UNITED STATES (94%)</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c6"><SPAN CLASS="c8">LOAD-DATE: </SPAN><SPAN CLASS="c2">March 17, 2016</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''
"""
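soup = BeautifulSoup(a, 'html.parser')  # name the parser explicitly
soup2 = soup.select('div.c0')
# Text of every <div class="c0">, with surrounding whitespace removed
list1 = [b.text.strip() for b in soup2]
print(list1)
# Index 0 is the "1 of 2 DOCUMENTS" counter; 1 and 2 are the wanted fields
var1, var2 = list1[1], list1[2]
print(var1)
print(var2)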

Output:

['1 of 2 DOCUMENTS', 'The New York Times', 'Copyright 2016 The New York Times Company']
The New York Times
Copyright 2016 The New York Times Company
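
Picking `list1[1]` and `list1[2]` by index covers a single document, while the data in the question holds two. The sketch below groups the `c0` texts per document, assuming every document starts with an "N of M DOCUMENTS" counter line as in the question. It uses only the standard library's `html.parser`; `C0TextCollector`, `group_by_document` and `sample` are hypothetical names, and `sample` keeps only the `c0` blocks of the two documents.

```python
import re
from html.parser import HTMLParser

# Abbreviated copy of the question's data: only the c0 blocks are kept.
sample = '''<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">1 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2016 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><P CLASS="c1"><SPAN CLASS="c2">2 of 2 DOCUMENTS</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">The </SPAN><SPAN CLASS="c3">New York Times Company</SPAN></P>
</DIV>
<BR><DIV CLASS="c0"><BR><P CLASS="c1"><SPAN CLASS="c2">Copyright 2015 The </SPAN><SPAN CLASS="c3">New York Times</SPAN><SPAN CLASS="c2"> Company</SPAN></P>
</DIV>'''


class C0TextCollector(HTMLParser):
    """Collect the text of every <div class="c0"> block."""

    def __init__(self):
        super().__init__()   # entities such as &amp; are decoded by default
        self.depth = 0       # > 0 while inside a div.c0
        self.parts = []      # text fragments of the current block
        self.blocks = []     # finished block texts

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names (DIV -> div, CLASS -> class).
        if tag == 'div':
            if self.depth:
                self.depth += 1
            elif dict(attrs).get('class') == 'c0':
                self.depth = 1
                self.parts = []

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append(''.join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def group_by_document(markup):
    """Return one list of c0 texts per document, without the counter line."""
    collector = C0TextCollector()
    collector.feed(markup)
    collector.close()
    documents = []
    for text in collector.blocks:
        if re.fullmatch(r'\d+ of \d+ DOCUMENTS', text):
            documents.append([])          # a counter line starts a new document
        elif documents:
            documents[-1].append(text)
    return documents


for publication, copyright_line in group_by_document(sample):
    print(publication, '|', copyright_line)
```

The unpacking in the final loop assumes exactly two `c0` texts per document, which holds for the sample but would need a check on other data.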