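Problem extracting strings from an HTML page using bs4

Time: 2016-03-27 22:34:42

Tags: python regex bs4

I am writing a program that looks up song lyrics. It is almost finished, but I am having some trouble with the bs4 data types. My question is: how do I extract plain text from the lyric variable at the end of the code?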

import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")    
    match = re.search('songmeanings\.com\/[^image].*?\/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # strip the trailing double quote captured by the regex
    else:
        return(1)       

def getText(link):    
    page = requests.get(str(link))          
    soup = bs4.BeautifulSoup(page.content ,"lxml")     
    return(soup)        

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)

This is the output:

[\n\t\t\t\t\t\t\t\tPlease could you stop the noise, I'm trying to get some rest\nFrom all the unborn chicken voices in my head\nWhat's that?\nWhat's that?\n\n\nWhen I am king, you will be first against the wall\nWith your opinion which is of no consequence at all\nWhat's that?\nWhat's that?\n\n\n...]

3 answers:

Answer 0: (score: 0)

First trim the leading and trailing [] by doing stringvar[1:-1], then call linevar.strip() on each line, which will strip off all that whitespace.
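
A minimal sketch of the per-line stripping step, assuming the lyrics are already available as one plain string. The `raw` value is made-up sample data and `strip_lines` is a hypothetical helper name, not part of bs4:

```python
def strip_lines(text):
    """Strip leading/trailing whitespace from every line of text."""
    # str.strip() with no arguments removes spaces, tabs and newlines
    # from both ends of each line.
    stripped = [line.strip() for line in text.splitlines()]
    # Drop empty lines at the very start and end, but keep the blank
    # lines that separate verses.
    return "\n".join(stripped).strip()


# Made-up sample input with the same kind of tab/newline padding.
raw = "\n\t\t\tPlease could you stop the noise,\n I'm trying to get some rest \n\nWhat's that?\n"
print(strip_lines(raw))
```

Note that this only handles whitespace. In the question, `lyric` is a list of bs4 tags rather than a string, so slicing it with `[1:-1]` would drop list elements, not brackets; the tags would have to be converted to text first.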

Answer 1: (score: 0)

Add the following line of code:

lyric = ''.join([tag.text for tag in lyric])

after

lyric = Soup.findAll(attrs={"lyric-box"})

You will get output like this:
                        Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?

When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?

...
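
To try this without hitting the network, here is a small self-contained sketch. The HTML snippet is invented sample markup (the real page's structure is an assumption), it uses Python's built-in `html.parser` instead of `lxml` so no extra parser is needed, and it searches with the `class_` keyword rather than the `attrs` form used above:

```python
import bs4

# Made-up markup standing in for the real lyrics page.
html = """
<div class="holder lyric-box">
\t\t\tPlease could you stop the noise,<br/>
I'm trying to get some rest<br/>
What's that?
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet of Tag objects, which is why
# printing it directly shows brackets and markup.
boxes = soup.find_all(class_="lyric-box")

# get_text() concatenates only the text nodes inside each tag, dropping
# the tags themselves; strip() removes the surrounding whitespace.
lyric = "".join(tag.get_text() for tag in boxes).strip()
print(lyric)
```

The `tag.text` attribute used in the answer is a property that returns the same result as calling `tag.get_text()` with no arguments.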

Answer 2: (score: 0)

For anyone who likes the idea, here is how my code finally turned out, with a few changes :)

import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO


def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    
    pageSTR = buffer.getvalue()
    
    soup = bs4.BeautifulSoup(pageSTR,"lxml")  
    
    tab_content = str(soup.find_all(attrs={"tab-content"}))    
    pattern = r'\"\/\/songmeanings.com\/.+?\"'
    links = re.findall(pattern,tab_content)
    
    # Return the first matched item without the double quotes
    # at the beginning and at the end of the string.
    return("http:"+links[0][1:-1:])

    
def getText(linkToSong):
    
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    
    pageSTR = buffer.getvalue()
    
    soup = bs4.BeautifulSoup(pageSTR,"lxml")  
    
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)
    
    
link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
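
One detail worth checking in `getLink`: `unquote` decodes percent-escapes, so a title containing spaces goes into the URL unchanged. If the intent is to make the title safe inside a query string, encoding is the usual direction. A minimal Python 3 sketch, assuming the same search endpoint and parameters as the code above (`build_search_url` and `BASE_URL` are hypothetical names introduced for illustration):

```python
from urllib.parse import urlencode

# Assumed to be the same search endpoint used in getLink above.
BASE_URL = "https://songmeanings.com/query/"


def build_search_url(title):
    """Return the search URL for a song title, with the title URL-encoded."""
    # urlencode escapes spaces and reserved characters in each value
    # and joins the parameters with "&".
    params = urlencode({"query": title, "type": "songtitles"})
    return BASE_URL + "?" + params


print(build_search_url("Anarchy In The U.K"))
```

Building the query from a dict also keeps the parameter names in one place instead of spreading them across string concatenations.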