Scrapy: get all tags that follow a certain <a> tag within a tag

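Time: 2015-01-13 21:11:38

Tags: python html scrapy scrapy-spider

I have the following page that I need to scrape with Scrapy: http://www.genecards.org/cgi-bin/carddisp.pl?gene=B2M

My task is to extract the summaries from GeneCards, which look like this in the HTML: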

<td>
    <a name="summaries"></a>
    <br >
    <b>Entrez Gene summary for <a href="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=567" title="See EntrezGene 
    entry for B2M" target="aaa" 
    onClick="doFocus('aaa')">B2M</a> Gene:</b><br >
    <dd> This gene encodes a serum protein found in association with the major histocompatibility complex (MHC) class I
        <br >
    <dd>heavy chain on the surface of nearly all nucleated cells. The protein has a predominantly beta-pleated sheet
        <br >
    <dd>structure that can form amyloid fibrils in some pathological conditions. A mutation in this gene has been shown<br ><dd>to result in hypercatabolic hypoproteinemia.(provided by RefSeq, Sep 2009) </dd><br ><b>GeneCards Summary for B2M Gene:</b><br ><dd> B2M (beta-2-microglobulin) is a protein-coding gene. Diseases associated with B2M include <i><a href="http://www.malacards.org/card/balkan_nephropathy" title="See balkan nephropathy at MalaCards" target="aaa" 
        onClick="doFocus('aaa')">balkan nephropathy</a></i>, and <i><a href="http://www.malacards.org/card/plasmacytoma" title="See plasmacytoma at MalaCards" target="aaa" 
        onClick="doFocus('aaa')">plasmacytoma</a></i>. GO annotations related to this gene include <i>identical protein binding</i>.</dd><br ><Font size=-1><b>UniProtKB/Swiss-Prot: </b></font><a href="http://www.uniprot.org/uniprot/P61769#section_comments" target="aaa" 
                onClick="doFocus('aaa')">B2MG_HUMAN, P61769</a></font><dd><b>Function</b>:  Component of the class I major histocompatibility complex (MHC). Involved in the presentation of peptide<br >
    <dd>antigens to the immune system</dd>

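Now I want Scrapy to grab the text from this. However, I can't figure out how to get Scrapy to select the <td> based on the fact that it contains <a name="summaries">. Does Scrapy have an undocumented selector feature that lets you select a tag based on whether it does (or does not) contain a specific child tag?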

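1 answer:

Answer 0 (score: 0)

Update

You can start with an XPath such as sel.xpath('.//a[@name="summaries"]'). I don't have Scrapy installed on this Mac, so I used lxml instead. In lxml you can use getparent(), itersiblings(), and so on. There are many ways to do this; here is one example: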

from lxml import html

s = '... your very long html page source ...'
tree = html.fromstring(s)

for a in tree.xpath('.//a[@name="summaries"]'):
    td = a.getparent()  # getparent() returns the enclosing <td>
    # iterchildren() yields every direct child element of the <td>
    for node in td.iterchildren():
        # node.text is only the text before the node's first child (may be None)
        print(node.text)

Result:

None


None
Summaries
(According to 
None
None
Entrez Gene summary for 
None
 This gene encodes a serum protein found in association with the major histocompatibility complex (MHC) class I
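
The None entries above appear because node.text holds only the text that precedes a node's first child element. As a minimal sketch, assuming a shortened, hypothetical snippet (SNIPPET) in place of the real page and a helper name (child_texts) that is mine, lxml.html's text_content() can be used to get the full text of each child instead:

```python
from lxml import html

# Hypothetical, shortened stand-in for the real page source.
SNIPPET = (
    '<table><tr><td>'
    '<a name="summaries"></a>'
    '<b>Entrez Gene summary for <a href="#">B2M</a> Gene:</b>'
    '<p>This gene encodes a serum protein.</p>'
    '</td></tr></table>'
)


def child_texts(source):
    """Return (tag, full text) pairs for each child of the <td> holding the anchor."""
    tree = html.fromstring(source)
    pairs = []
    for a in tree.xpath('.//a[@name="summaries"]'):
        td = a.getparent()
        for node in td.iterchildren():
            # text_content() joins the text of the node and all its descendants
            pairs.append((node.tag, node.text_content()))
    return pairs


for tag, text in child_texts(SNIPPET):
    print(tag, repr(text))
```

Note that text_content() does not include a node's tail (text that follows its closing tag), so loose text sitting directly inside the <td> would still need to be collected separately.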

Alternatively, use itersiblings() to grab the sibling nodes that follow the <a>:

for a in tree.xpath('.//a[@name="summaries"]'):
    # iterate over the siblings of the matched <a> itself
    for node in a.itersiblings():
        print(node.text)

...
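
One detail worth knowing: by default itersiblings() yields only the siblings that come after the element, and passing preceding=True yields the ones before it. A small sketch of the difference, again on a hypothetical snippet (SNIPPET and sibling_tags are illustrative names, not part of the page or of lxml):

```python
from lxml import html

# Hypothetical snippet: one sibling before the anchor, two after it.
SNIPPET = (
    '<table><tr><td>'
    '<i>before</i>'
    '<a name="summaries"></a>'
    '<b>first</b>'
    '<p>second</p>'
    '</td></tr></table>'
)


def sibling_tags(source, preceding=False):
    """Return the tag names of the anchor's siblings in the chosen direction."""
    tree = html.fromstring(source)
    a = tree.xpath('.//a[@name="summaries"]')[0]
    return [node.tag for node in a.itersiblings(preceding=preceding)]


print(sibling_tags(SNIPPET))                  # siblings after the <a>
print(sibling_tags(SNIPPET, preceding=True))  # siblings before the <a>
```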

Or, if what you are really after is all the text contained in the parent td, you can use the XPath //text() to grab it:

for a in tree.xpath('.//a[@name="summaries"]'):
    # './..' steps up to the parent <td>; '//text()' collects every text node below it
    print(a.xpath('./..//text()'))

The (very long) result:

['\n\t', '\n', '\n', 'Jump to Section...', '\n', 'Aliases', '\n', 'Databases', '\n', 'Disorders / Diseases', '\n', 'Domains / Families', '\n', 'Drugs / Compounds', '\n', 'Expression', '\n', 'Function', '\n', 'Genomic Views', '\n', 'Intellectual Property', '\n', 'Localization', '\n', 'Orthologs', '\n', 'Paralogs', '\n', 'Pathways / Interactions', '\n', 'Products', '\n', 'Proteins', '\n', 'Publications', '\n', 'Search Box', '\n', 'Summaries', '\n', 'Transcripts', '\n', 'Variants', '\n', 'TOP', '\n', 'BOTTOM', '\n', '\n', '\n', 'Summaries', 'for B2M gene', '(According to ', 'Entrez Gene', ',\n\t\t', 'GeneCards', ',\n\t\t', 'Tocris Bioscience', ',\n\t\t', "Wikipedia's", ' \n\t\t', 'Gene Wiki', ',\n\t\t', 'PharmGKB', ',', '\n\t\t', 'UniProtKB/Swiss-Prot', ',\n\t\tand/or \n\t\t', 'UniProtKB/TrEMBL', ')\n\t\t', 'About This Section', 'Try', 'GeneCards Plus']
['Entrez Gene summary for ', 'B2M', ' Gene:', ' This gene encodes a serum protein found in association with the major histocompatibility complex (MHC) class I', 'heavy chain on the surface of nearly all nucleated cells. The protein has a predominantly beta-pleated sheet', 'structure that can form amyloid fibrils in some pathological conditions. A mutation in this gene has been shown', 'to result in hypercatabolic hypoproteinemia.(provided by RefSeq, Sep 2009) ', 'GeneCards Summary for B2M Gene:', ' B2M (beta-2-microglobulin) is a protein-coding gene. Diseases associated with B2M include ', 'balkan nephropathy', ', and ', 'plasmacytoma', '. GO annotations related to this gene include ', 'identical protein binding', '.', 'UniProtKB/Swiss-Prot: ', 'B2MG_HUMAN, P61769', 'Function', ':  Component of the class I major histocompatibility complex (MHC). Involved in the presentation of peptide', 'antigens to the immune system', 'Gene Wiki entry for ', 'B2M', ' (Beta-2 microglobulin) Gene']
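
As for the original question, selecting a tag by the child it contains needs no special Scrapy feature: a plain XPath 1.0 predicate does it. Here is a minimal sketch with lxml, assuming a hypothetical two-cell snippet (SNIPPET) and helper names of my own (summary_text, cells_without_anchor). Since Scrapy selectors are built on lxml, the same expression should also work with sel.xpath(...), though I have not run it there.

```python
from lxml import html

# Hypothetical snippet: only the second <td> contains the named anchor.
SNIPPET = (
    '<table><tr>'
    '<td><b>Aliases</b></td>'
    '<td><a name="summaries"></a>'
    '<b>Entrez Gene summary for <a href="#">B2M</a> Gene:</b>'
    '<p>This gene encodes a serum protein.</p></td>'
    '</tr></table>'
)


def summary_text(source):
    """Return the joined text of every <td> that has an <a name="summaries"> child."""
    tree = html.fromstring(source)
    texts = []
    # The predicate [a[@name="summaries"]] keeps only <td> elements with that child.
    for td in tree.xpath('.//td[a[@name="summaries"]]'):
        parts = [t.strip() for t in td.xpath('.//text()')]
        texts.append(' '.join(p for p in parts if p))
    return texts


def cells_without_anchor(source):
    """Return the <td> elements that do NOT have the named anchor as a child."""
    tree = html.fromstring(source)
    return tree.xpath('.//td[not(a[@name="summaries"])]')


print(summary_text(SNIPPET))
print(len(cells_without_anchor(SNIPPET)))
```

On the real page the anchor name matched more than once (two lists are printed above), so the predicate would likely return more than one <td> and the results would still need filtering.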