Question

I am trying to remove the text inside <> (HTML tags) and write the result to a new file. For example, a line of text might be:

< asdf> Text <here>more text< /asdf >

So the program would write "Text more text" to the output file, leaving out everything inside the HTML tags.

Here is my attempt so far:

import urllib.request
import re

data = urllib.request.urlopen("some website").read()
text1 = data.decode("utf-8")

def asd(text1):
    x = re.compile("<>")
    y = re.sub(x, "", text1)
    file1 = open("textfileoutput.txt", "w")
    file1.write(y)
    return y

asd(text1)

It doesn't seem to write a clean version; the tags are still there. Thanks for your help.

Answer 1

x=re.compile("<>")

I'm not sure why you think this expression would match < asdf> or < /asdf >.
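A quick check with the sample line from the question suggests that "<>" only matches a "<" immediately followed by a ">". This is a minimal sketch; the string "a<>b" is made up purely for illustration:

```python
import re

pattern = re.compile("<>")
line = "< asdf> Text <here>more text< /asdf >"

# None of the tags in the sample line is literally "<>", so nothing matches
no_matches = pattern.findall(line)

# Only a literal "<>" with nothing in between is matched
literal_matches = pattern.findall("a<>b")

print(no_matches, literal_matches)
```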

In any case, using regular expressions on HTML can rarely be justified. Use a more appropriate tool for the task: an HTML parser.

An example using BeautifulSoup and its unwrap() method:

In [1]: from bs4 import BeautifulSoup

In [2]: html = "<asdf>Text more text</asdf>"

In [3]: soup = BeautifulSoup(html, "html.parser")

In [4]: soup.asdf.unwrap()
Out[4]: <asdf></asdf>

In [5]: print(soup)
Text more text
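If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the approach shown above, and the names TextExtractor and strip_tags are hypothetical helpers invented for this example:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<asdf>Text <here>more text</asdf>"))
```

Two caveats with this sketch: text inside script or style elements is kept as well, and something like "< asdf>" (with a space right after the "<") is treated as plain text rather than as a tag.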

Answer 2

Simply replacing re.compile("<>") with re.compile(r"<[^<>]*>") is enough.
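As a rough sketch of that change applied to the sample line from the question (the names TAG_PATTERN and remove_tags are made up for this example):

```python
import re

# "<", then any run of characters other than "<" or ">", then ">"
TAG_PATTERN = re.compile(r"<[^<>]*>")


def remove_tags(text):
    # Replace every match of the tag pattern with an empty string
    return TAG_PATTERN.sub("", text)


line = "< asdf> Text <here>more text< /asdf >"
cleaned = remove_tags(line)
print(repr(cleaned))
```

The whitespace that surrounded the tags is left in place, so calling cleaned.strip() may be useful before writing the result to the output file.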

Python: removing HTML tags from a website doesn't work