Question

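I'm writing a program that looks up song lyrics. It is almost finished, but I'm having trouble with the bs4 data types. My question is: how do I extract plain text from the lyric variable at the end of the code?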

import re
import requests
import bs4
from urllib import unquote

def getLink(fileName):
    webFileName = unquote(fileName)
    page = requests.get("http://songmeanings.com/query/?query="+str(webFileName)+"&type=songtitles")    
    match = re.search('songmeanings\.com\/[^image].*?\/"',page.content)
    if match:
        Mached = str("http://"+match.group())
        return(Mached[:-1:]) # this line used to remove a " at the end of line
    else:
        return(1)       

def getText(link):    
    page = requests.get(str(link))          
    soup = bs4.BeautifulSoup(page.content ,"lxml")     
    return(soup)        

Soup = getText(getLink("paranoid android"))
lyric = Soup.findAll(attrs={"lyric-box"})
print (lyric)

This is the output:

[\n\t\t\t\t\t\t\t\tPlease could you stop the noise, I'm trying to get some rest\nFrom all the unborn chicken voices in my head\nWhat's that?\nWhat's that?\n\n\nWhen I am king, you will be first against the wall\nWith your opinion which is of no consequence at all\nWhat's that?\nWhat's that?\n...]

Answer 1

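First trim the leading and trailing [] with stringvar[1:-1], then call linevar.strip() on each line, which removes the leading and trailing whitespace from that line.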
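A minimal sketch of this approach, assuming stringvar holds the printed list as one string. The sample text is a made-up placeholder, and the names raw, inner, and text are only illustrative. Note that this only cleans up brackets and whitespace; it does not remove any HTML tags that may be in the string.

```python
# Placeholder standing in for str(lyric); the content is invented for illustration.
raw = "[\n\t\t  First line\nSecond line \n]"

# Drop the leading "[" and the trailing "]".
inner = raw[1:-1]

# Strip leading and trailing whitespace from every line.
lines = [line.strip() for line in inner.splitlines()]

# Rejoin, skipping lines that became empty after stripping.
text = "\n".join(line for line in lines if line)
print(text)
```

Skipping the empty lines also drops the blank lines between stanzas; keep them in the join if the stanza breaks matter.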

Answer 2

Add the following line of code:

lyric = ''.join([tag.text for tag in lyric])

after

lyric = Soup.findAll(attrs={"lyric-box"})

You will get output like:

                        Please could you stop the noise,
I'm trying to get some rest
From all the unborn chicken voices in my head
What's that?
What's that?

When I am king, you will be first against the wall
With your opinion which is of no consequence at all
What's that?
What's that?

...
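The same idea can be tried without any network access. The sketch below is an assumption-laden illustration: the HTML snippet is invented, the built-in "html.parser" is used instead of "lxml" to avoid an extra dependency, and class_="lyric-box" is used as the usual way to match on a CSS class. The extra strip step at the end is optional and only tidies the indentation visible in the output above.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for the lyrics page; only the class name comes from the question.
html = """
<div class="lyric-box">
    First line<br/>
Second line<br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet of Tag objects, not strings.
tags = soup.find_all(class_="lyric-box")

# Join the text of every matching tag into one plain string.
text = "".join(tag.get_text() for tag in tags)

# Optional: remove the surrounding whitespace and per-line indentation.
clean = "\n".join(line.strip() for line in text.strip().splitlines())
print(clean)
```

Printing the ResultSet directly shows the list representation with escape sequences, which is why the join over the tag text is needed.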

Answer 3

For anyone who likes the idea, here is what my code looks like in the end, with a few changes :)

import re
import pycurl
import bs4
from urllib import unquote
from StringIO import StringIO


def getLink(fileName):
    fileName = unquote(fileName)
    baseAddres = "https://songmeanings.com/query/?query="
    linkToPage = str(baseAddres)+str(fileName)+str("&type=songtitles")
    
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToPage)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    
    pageSTR = buffer.getvalue()
    
    soup = bs4.BeautifulSoup(pageSTR,"lxml")  
    
    tab_content = str(soup.find_all(attrs={"tab-content"}))    
    pattern = r'\"\/\/songmeanings\.com\/.+?\"'
    links = re.findall(pattern,tab_content)
    
    """returns the first matched item without the double quotes
    at the beginning and at the end of the string"""
    return("http:"+links[0][1:-1:])

    
def getText(linkToSong):
    
    buffer = StringIO()
    page = pycurl.Curl()
    page.setopt(page.URL,linkToSong)
    page.setopt(page.WRITEDATA,buffer)
    page.perform()
    page.close()
    
    pageSTR = buffer.getvalue()
    
    soup = bs4.BeautifulSoup(pageSTR,"lxml")  
    
    lyric_box = soup.find_all(attrs={"lyric-box"})
    lyric_boxSTR = ''.join([tag.text for tag in lyric_box])
    return(lyric_boxSTR)
    
    
link = getLink("Anarchy In The U.K")
text = getText(link)
print(text)
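As a possible variation on getLink, the link could be read from the parsed tree instead of running a regex over str(...) of the search results. This is only a sketch: the HTML and the URL paths below are invented placeholders, and it assumes the results sit in anchor tags inside the "tab-content" element, which the code above does not confirm.

```python
from bs4 import BeautifulSoup

# Invented search-result markup; the URL paths are placeholders, not real pages.
html = """
<div class="tab-content">
  <a href="//songmeanings.com/placeholder-song-1/">First result</a>
  <a href="//songmeanings.com/placeholder-song-2/">Second result</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the results container, then collect the href of every link inside it.
container = soup.find(class_="tab-content")
hrefs = [a["href"] for a in container.find_all("a", href=True)]

# The hrefs are protocol-relative, so prepend a scheme as the code above does.
first_link = "http:" + hrefs[0] if hrefs else None
print(first_link)
```

Reading the href attribute avoids having to trim the surrounding double quotes by hand.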

The question is about extracting a string from an HTML page using bs4.