Replacing many parts of a string in Python 3

Time: 2017-07-27 19:43:17

Tags: html string python-3.x urllib

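Using the urllib module, I am trying to scrape the text content of a web page. I followed a guide by "SentDex", found on YouTube here (https://www.youtube.com/watch?v=GEshegZzt3M), together with the documentation on the official Python website, to piece together a quick solution. The response comes back with a lot of HTML markup and special characters that I want to remove. My end result works, but I feel it is a hard-coded solution that is only useful for this one scenario.

The code is as follows: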

import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = "http://someUrl.com/dir/doc.html"  # target URL

values = {'s': 'basics',
          'submit': 'search'}  # form fields to send with the request

data = urllib.parse.urlencode(values)  # encode the fields as "s=basics&submit=search"
data = data.encode('utf-8')  # POST data must be bytes, so encode it as UTF-8

req = urllib.request.Request(url, data)  # supplying data makes this a POST request
resp = urllib.request.urlopen(req)  # send the request and get the response object
respData = resp.read()  # read the response body into a variable (bytes)

# BS4 method
soup = BeautifulSoup(respData, 'html.parser')
text = soup.find_all("p")
# end BS4

# re method (overwrites the BS4 result above)
# note: str() on bytes yields the b'...' repr, not the decoded text
text = re.findall(r"<p>(.*?)</p>", str(respData))  # get all paragraph tag contents
text = str(text)  # convert the list of matches to a single string
# end re

conds = ["<b>", "</b>", "<i>", "</i>", "\\", "[", "]", "\'"]  # things to remove from text

for case in conds:  # for each of those things
    text = text.replace(case, "")  # remove it, i.e. replace it with nothing
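
A side note on `str(respData)`: calling `str()` on a bytes object returns its printable repr (`b'...'`, with escapes such as `\n` written out literally), which is probably where the backslashes and quote characters in `conds` come from. A minimal sketch of the difference, assuming the page is UTF-8 encoded; the sample bytes are made up for illustration:

```python
# Hypothetical response body standing in for resp.read()
raw = b"<p>one</p>\n<p>two</p>"

as_repr = str(raw)             # the repr: begins with b' and contains a literal backslash + n
as_text = raw.decode("utf-8")  # the actual text, with a real newline character

print(as_repr)
print(as_text)
```

Decoding first means the later cleanup only has to deal with markup, not with repr artifacts.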

Is there a more efficient way to achieve the end goal of removing all "markup" from a string, rather than explicitly defining each condition?
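
One possible direction, as a minimal sketch rather than a definitive answer: let an HTML parser discard the tags instead of listing each one. The example below uses only the standard library's `html.parser.HTMLParser`; the names `TextExtractor` and `strip_markup` and the sample string are hypothetical, introduced here for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags
        self.parts.append(data)


def strip_markup(html_text):
    """Return the text content of html_text with all tags removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Made-up sample input, not taken from the target page
sample = "<p>Some <b>bold</b> and <i>italic</i> text &amp; more</p>"
print(strip_markup(sample))  # prints: Some bold and italic text & more
```

Since the question already builds a BeautifulSoup object, calling `get_text()` on each element returned by `soup.find_all("p")` should give a similar result without a custom class. One caveat for the sketch above: the contents of `<script>` and `<style>` elements are also passed to `handle_data`, so a full page may need those filtered out separately.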

0 answers:

No answers yet