Replacing text tags in a txt file with Python 3

Time: 2017-08-18 15:12:02

Tags: python-3.x web web-scraping beautifulsoup

I'm trying to make a proxy scraper, and this is my code:

import bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import lxml
from contextlib import redirect_stdout

meh=[]

pathf = '/home/user/tests.txt'

url = Request('https://www.path.to/table', headers={'User-Agent': 'Mozilla/5.0'})

page_html = urlopen(url).read()

page_soup = soup(page_html, features="xml")

final = page_soup.tbody

meh.append(final)

with open(pathf, 'w') as f:
    with redirect_stdout(f):
        print(meh[0].text.strip())

Now I'd like the text to be written in a more readable way, because at the moment it looks like this:

  

12.183.20.3615893USUnited StatesSocks5AnonymousYes11 seconds ago220.133.97.7445657TWTaiwanSocks5AnonymousYes11 seconds ago

How can I convert this text into a more readable file? Something like:

  

12.183.20.36 15893 United States Socks5 Anonymous Yes 11 seconds ago (new line) ...

This is the actual output without '.text.strip()', formatted with jsbeautifier in case that helps: https://ghostbin.com/paste/g56qe

1 answer:

Answer 0: (score: 0)

Instead of extracting the complete table body, you can extract all the td elements as a list:

final_list = page_soup.find_all('td')

Then get a list of the text nodes:

list_of_text_nodes = [td.text.strip() for td in final_list]

Output

['182.235.38.81', '40748', 'TW', 'Taiwan', 'Socks5', 'Anonymous'...]
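Because this list is flat, the "one proxy per line" layout the question asks for can be recovered by slicing it into fixed-size chunks. The following is only a sketch: it assumes every table row has exactly eight cells (which matches the sample output in the question), and both `COLUMNS_PER_ROW` and `chunk_rows` are hypothetical names introduced here for illustration. The sample values are taken from the question's own output.

```python
# Stand-in for list_of_text_nodes, using the two rows shown in the question.
list_of_text_nodes = [
    '12.183.20.36', '15893', 'US', 'United States',
    'Socks5', 'Anonymous', 'Yes', '11 seconds ago',
    '220.133.97.74', '45657', 'TW', 'Taiwan',
    'Socks5', 'Anonymous', 'Yes', '11 seconds ago',
]

# Assumption: each <tr> in the table holds exactly eight <td> cells.
COLUMNS_PER_ROW = 8


def chunk_rows(cells, size):
    """Split a flat list of cell texts into consecutive rows of `size` cells."""
    return [cells[i:i + size] for i in range(0, len(cells), size)]


rows = chunk_rows(list_of_text_nodes, COLUMNS_PER_ROW)

# Join the cells of each row with spaces, giving one proxy per line.
lines = [" ".join(row) for row in rows]
print("\n".join(lines))
```

If the number of cells per row varies, fixed-size chunking will misalign the rows, so iterating over the tr elements directly is likely the safer choice.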

Or get all the text nodes as a single string:

complete_text = " ".join([i.text.strip() for i in final_list])

Output

'182.235.38.81 40748 TW Taiwan Socks5 Anonymous Yes 14 seconds ago ...'
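The single string above still has no line breaks between proxies. One possible way to get one proxy per line is to walk the table row by row instead of cell by cell. This is a minimal sketch, not the answer's original method: `sample_html` is a hypothetical stand-in for the downloaded page (the real markup may differ), `table_to_lines` is a name introduced here, and the built-in `html.parser` is used instead of the question's `features="xml"` on the assumption that the page is ordinary HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page_html; values are taken from the question's output.
sample_html = """
<table>
  <tbody>
    <tr><td>12.183.20.36</td><td>15893</td><td>US</td><td>United States</td>
        <td>Socks5</td><td>Anonymous</td><td>Yes</td><td>11 seconds ago</td></tr>
    <tr><td>220.133.97.74</td><td>45657</td><td>TW</td><td>Taiwan</td>
        <td>Socks5</td><td>Anonymous</td><td>Yes</td><td>11 seconds ago</td></tr>
  </tbody>
</table>
"""


def table_to_lines(html):
    """Return one space-separated string per table row that contains <td> cells."""
    page_soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tr in page_soup.find_all("tr"):
        # Strip surrounding whitespace from the text of every cell in this row.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows without <td> cells, such as header rows
            lines.append(" ".join(cells))
    return lines


lines = table_to_lines(sample_html)
print("\n".join(lines))
```

To write the result to the file from the question, the lines could be passed to `f.write("\n".join(lines))` inside the existing `with open(pathf, 'w') as f:` block, which also makes `redirect_stdout` unnecessary.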