从Python span标签提取文本

时间:2019-10-15 20:14:39

标签: python selenium web-scraping

我在做硒机器人,我需要在搜索后从页面中提取信息,但是遇到了麻烦。

下图中有HTML

Image-here

我想从“ class ='escapamentoLinhas'”的这些斜体标签中提取文本

    from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
from selenium.webdriver.support.ui import Select

URL = '\x0\x0\x0\x0'

search = '\x0\x0'
print("Running...")

class ScrapingTJ:
    def __init__(self):

        self.browser = webdriver.Firefox()
        self.browser.get(URL)

        sleep(1)
        select = Select(self.browser.find_element_by_id('cbPesquisa'))
        select.select_by_value('NMPARTE')
        sleep(1)
        self.browser.find_element_by_xpath('//*[@id="campo_NMPARTE"]').send_keys(CNPJ_CLARO)
        self.browser.find_element_by_id('pbEnviar').click()
        sleep(2)

        dados = self.browser.find_element_by_id('listagemDeProcessos')
        HTML = dados.get_attribute("innerHTML")

        scraping = BeautifulSoup(HTML, "html.parser")
        # links
        links = scraping.find_all('a')
        for scrap in links:
            print(scrap.get_text())


        textos = scraping.find(class_ = 'espacamentoLinhas')
        subtextos = scraping.find_all('span')
        for ext in subtextos:
            print(ext.get_text())



if __name__ == '__main__':
    ScrapingTJ()

退出:

Exectdo:
Recebido em:




Exectda:
   Recebido em:

:中,我应该得到'2007年4月30日-图片中的下划线是Vara dasExecuçõesFiscais Estaduais'

1 个答案:

答案 0 :(得分:0)

根据您提供的HTML图像,看起来元素文本位于div元素中,而不是span中。您将需要从div中提取文本,而不是span。我将替换此块:

textos = scraping.find(class_ = 'espacamentoLinhas')
subtextos = scraping.find_all('span')
for ext in subtextos:
    print(ext.get_text())

与此:

elements = self.browser.find_elements_by_xpath("//div[@class='espacamentoLinhas']")
for element in elements:
    print(element.text)

span仅包含文本'Recebido em:',而不包含您要查找的文本30/04/2007 - Vara das Execuções Fiscais Estaduais。该文本实际上包含在我所包含的XPath中引用的div中。

如果您不想使用self.browser.find_elements_by_xpath,则可以只删除代码中的scraping.find_all('span')部分:

 textos = scraping.find(class_ = 'espacamentoLinhas')
 for ext in textos:
     print(ext.get_text())
相关问题