How do I create a custom xpath query?

Time: 2014-02-12 14:18:07

Tags: xpath scrapy

This is my HTML file data:

<article class='course-box'>
<div class='row-fluid'>
    <div class='span2'>
        <div class='course-cover' style='width: 100%'>
            <img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
        </div>
    </div>
    <div class='span10'>
        <h2 class='coursetitle'>
            <a href='https://novoed.com/hc'>Hippocrates Challenge</a>
        </h2>
        <figure class='pricetag'>
            Free
        </figure>
        <div class='timeline independent-text'>
            <div class='timeline inline-block'>
                Starting Spring 2014
            </div>

        </div>
        By Jill Helms
        <div class='university' style='margin-top:0px; font-style:normal;'>
            Stanford University
        </div>
    </div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
    <div class='span2'>
        <div class='course-cover'>
            <img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
        </div>
    </div>
    <div class='span10'>
        <h2 class='coursetitle' style='margin-top: 10px'>
            <a href='https://novoed.com/hc'>
                Hippocrates Challenge
            </a>
        </h2>
        <p class='description' style='width: 70%'>
            Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
        </p>
        <div style='margin-right: 10px'>
            <a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
                Sign Up
            </a>

        </div>
    </div>
</div>

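From the code above I need to get the values of the following tags and classes:

  1. coursetitle
  2. coursetitle href link
  3. pricetag
  4. timeline inline-block
  5. university
  6. description
  7. instructor name

However, coursetitle appears in two places and I only need it once. Likewise, the instructor name has no specific tag that I can use to fetch it.

My xpath queries are: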

        novoedData = HtmlXPathSelector(response)
        courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
        courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
        courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
        coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
        courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
        courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
    

But the number of values in each list variable is different:

    len(courseTitle) = 40 (two times because of repetition)
    len(courseDetailLink) = 40 (two times because of repetition)
    len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
    len(coursePriceType) = 20 (correct count no repetition)
    len(courseShortSummary)= 20 (correct count no repetition)
    len(courseUniversity) = 20 (correct count no repetition)
    

Please modify my xpath queries to solve my problem. Thanks in advance.

1 answer:

Answer 0: (score: 3)

You don't need re:test; just do:

>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']

Also note that once s is set at the right place, you can keep using it as the starting point for further relative queries.
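
To illustrate the idea of anchoring on one node and querying relative to it, here is a minimal sketch. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies, and it works on a trimmed, well-formed copy of the HTML from the question (the img tags are dropped and the description is shortened). The names SAMPLE, clean and parse_course are made up for this example. The exact match on class='row-fluid' skips the 'hovered row-fluid' block, which is why the title and link come back only once. The instructor line has no tag of its own, so the sketch reads it as the text that follows the outer timeline div.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the course box from the question
# (assumption: <img> tags removed, description shortened).
SAMPLE = """
<article class='course-box'>
<div class='row-fluid'>
    <div class='span10'>
        <h2 class='coursetitle'>
            <a href='https://novoed.com/hc'>Hippocrates Challenge</a>
        </h2>
        <figure class='pricetag'>
            Free
        </figure>
        <div class='timeline independent-text'>
            <div class='timeline inline-block'>
                Starting Spring 2014
            </div>
        </div>
        By Jill Helms
        <div class='university' style='margin-top:0px; font-style:normal;'>
            Stanford University
        </div>
    </div>
</div>
<div class='hovered row-fluid'>
    <div class='span10'>
        <h2 class='coursetitle' style='margin-top: 10px'>
            <a href='https://novoed.com/hc'>
                Hippocrates Challenge
            </a>
        </h2>
        <p class='description' style='width: 70%'>
            Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine.
        </p>
    </div>
</div>
</article>
"""


def clean(text):
    """Collapse runs of whitespace; treat a missing text node as empty."""
    return " ".join((text or "").split())


def parse_course(box):
    """Build one record from a single <article class='course-box'> element."""
    # Exact class match: picks 'row-fluid' but not 'hovered row-fluid'.
    main = box.find("div[@class='row-fluid']/div[@class='span10']")
    hover = box.find("div[@class='hovered row-fluid']/div[@class='span10']")
    link = main.find("h2[@class='coursetitle']/a")
    # The instructor text sits right after the outer timeline <div>;
    # ElementTree exposes that trailing text as the element's .tail.
    timeline = main.find("div[@class='timeline independent-text']")
    return {
        "title": clean(link.text),
        "link": link.get("href"),
        "price": clean(main.findtext("figure[@class='pricetag']")),
        "start": clean(main.findtext(".//div[@class='timeline inline-block']")),
        "instructor": clean(timeline.tail),
        "university": clean(main.findtext("div[@class='university']")),
        "description": clean(hover.findtext("p[@class='description']")),
    }


root = ET.fromstring(SAMPLE)
# iter() also yields the root itself, so this covers one box or many.
courses = [parse_course(box) for box in root.iter("article")
           if box.get("class") == "course-box"]
print(courses)
```

Because every field is read from the same course box, each course yields exactly one record, so the lists can no longer drift out of step. With a Scrapy selector the same pattern should carry over by looping over the per-course nodes and calling xpath() on each one with relative paths, as the answer does with s. For the untagged instructor line, a relative query such as text()[normalize-space()] on the span10 node is one option worth trying, though it has not been run against the real page here.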