Remove HTML tags from a string

Asked: 2014-09-22 21:42:11

Tags: html ios swift

How can I remove HTML tags from a string so that the output is clean text?

let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

9 answers:

Answer 0 (score: 124)

Hmm, I tried your function and it worked on a small example:

var string = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"
let str = string.stringByReplacingOccurrencesOfString("<[^>]+>", withString: "", options: .RegularExpressionSearch, range: nil)
print(str)

//output "   My First Heading My first paragraph.  "

Can you give an example of the problem?
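For readers on newer toolchains, here is a minimal sketch of the same replacement in Swift 3 and later spelling. It assumes the same sample string as above; the trimming step is an addition for illustration and is not part of the original answer.

```swift
import Foundation

let html = "<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html>"

// Swift 3+ spelling of stringByReplacingOccurrencesOfString(...).
let stripped = html.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)

// The spaces that sat between the tags are kept, so trim the ends if needed.
let trimmed = stripped.trimmingCharacters(in: .whitespacesAndNewlines)
print(trimmed) // My First Heading My first paragraph.
```

Only the tags are removed; whitespace between them survives, which is why the untrimmed result starts and ends with spaces.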

Answer 1 (score: 21)

Since HTML is not a regular language (HTML is a context-free language), you cannot parse it with regular expressions. See: Using regular expressions to parse HTML: why not?
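As a small illustration of that limitation, the sketch below runs the pattern from the question over two inputs. The helper name `stripTags` and both sample strings are made up for this example.

```swift
import Foundation

// The pattern from the question: everything from "<" to the next ">".
let tagPattern = "<[^>]+>"

// Hypothetical helper, named only for this sketch.
func stripTags(_ html: String) -> String {
    return html.replacingOccurrences(of: tagPattern, with: "", options: .regularExpression)
}

// Simple markup is handled as expected.
let simple = stripTags("<p>Hello <b>world</b></p>")    // "Hello world"

// A ">" inside an attribute value ends the match too early,
// so part of the tag leaks into the output.
let tricky = stripTags("<a title=\"a > b\">link</a>")  // " b\">link"

// Character entities are not decoded either.
let entity = stripTags("<p>Tom &amp; Jerry</p>")       // "Tom &amp; Jerry"
```

Whether these cases matter depends on where the HTML comes from; for markup you control, the regular expression may be good enough.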

I would consider using NSAttributedString instead.

let htmlString = "LCD Soundsystem was the musical project of producer <a href='http://www.last.fm/music/James+Murphy' class='bbcode_artist'>James Murphy</a>, co-founder of <a href='http://www.last.fm/tag/dance-punk' class='bbcode_tag' rel='tag'>dance-punk</a> label <a href='http://www.last.fm/label/DFA' class='bbcode_label'>DFA</a> Records. Formed in 2001 in New York City, New York, United States, the music of LCD Soundsystem can also be described as a mix of <a href='http://www.last.fm/tag/alternative%20dance' class='bbcode_tag' rel='tag'>alternative dance</a> and <a href='http://www.last.fm/tag/post%20punk' class='bbcode_tag' rel='tag'>post punk</a>, along with elements of <a href='http://www.last.fm/tag/disco' class='bbcode_tag' rel='tag'>disco</a> and other styles. <br />"    
let htmlStringData = htmlString.dataUsingEncoding(NSUTF8StringEncoding)!
let options: [String: AnyObject] = [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding]
let attributedHTMLString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
let string = attributedHTMLString.string

Or, as Irshad Mohamed did in the comments:

let attributed = try NSAttributedString(data: htmlString.data(using: .unicode)!, options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType], documentAttributes: nil)
print(attributed.string)

Answer 2 (score: 7)

I use the following extension to remove specific HTML elements:

extension String {
    func deleteHTMLTag(tag:String) -> String {
        return self.stringByReplacingOccurrencesOfString("(?i)</?\(tag)\\b[^<]*>", withString: "", options: .RegularExpressionSearch, range: nil)
    }

    func deleteHTMLTags(tags:[String]) -> String {
        var mutableString = self
        for tag in tags {
            mutableString = mutableString.deleteHTMLTag(tag)
        }
        return mutableString
    }
}

This makes it possible to remove only the <a> tags from a string, for example:

let string = "my html <a href=\"\">link text</a>"
let withoutHTMLString = string.deleteHTMLTag("a") // Will be "my html link text"

Answer 3 (score: 7)

Mohamed's solution, but as a String extension in Swift 4.

extension String {

    func stripOutHtml() -> String? {
        do {
            guard let data = self.data(using: .utf8) else {
                return nil
            }
            let attributed = try NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
            return attributed.string
        } catch {
            return nil
        }
    }
}

Answer 4 (score: 4)

Swift 4:

extension String {
    func deleteHTMLTag(tag:String) -> String {
        return self.replacingOccurrences(of: "(?i)</?\(tag)\\b[^<]*>", with: "", options: .regularExpression, range: nil)
    }

    func deleteHTMLTags(tags:[String]) -> String {
        var mutableString = self
        for tag in tags {
            mutableString = mutableString.deleteHTMLTag(tag: tag)
        }
        return mutableString
    }
}
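A usage sketch for this extension follows. It repeats the extension in compact form so that the snippet is self-contained (the `reduce` call is just a shorter way to write the loop above), and the sample string is made up for illustration.

```swift
import Foundation

extension String {
    // Same pattern as above: matches <tag ...> and </tag>, case-insensitively.
    func deleteHTMLTag(tag: String) -> String {
        return self.replacingOccurrences(of: "(?i)</?\(tag)\\b[^<]*>", with: "", options: .regularExpression, range: nil)
    }

    // Applies deleteHTMLTag(tag:) once per tag name.
    func deleteHTMLTags(tags: [String]) -> String {
        return tags.reduce(self) { $0.deleteHTMLTag(tag: $1) }
    }
}

let html = "<p>Read the <A HREF=\"#\">docs</A> and <b>enjoy</b>.</p>"
let result = html.deleteHTMLTags(tags: ["a", "b"])
// result == "<p>Read the docs and enjoy.</p>"
```

Only the listed elements are stripped; the <p> wrapper stays, and the `\b` in the pattern keeps a tag name such as "b" from also matching <body>.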

Answer 5 (score: 1)

Updated for Swift 4:

    guard let htmlStringData = htmlString.data(using: .unicode) else { fatalError() }

    let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
        .documentType: NSAttributedString.DocumentType.html,
        .characterEncoding: String.Encoding.unicode.rawValue
    ]

    let attributedHTMLString = try! NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
    let string = attributedHTMLString.string

Answer 6 (score: 1)

Happy coding

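Answer 7 (score: 0)

I prefer using a regular expression over the NSAttributedString HTML conversion. Be advised that the conversion is quite time-consuming and also needs to run on the main thread. More information here: https://developer.apple.com/documentation/foundation/nsattributedstring/1524613-initwithdata

This did the trick for me: first I remove all inline CSS styles, then I remove all HTML tags. It is probably not as reliable as the NSAttributedString option, but it is much faster for my case.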

extension String {
    func withoutHtmlTags() -> String {
        let str = self.replacingOccurrences(of: "<style>[^>]+</style>", with: "", options: .regularExpression, range: nil)
        return str.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression, range: nil)
    }
}
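The first pattern above only matches a bare <style> tag whose content contains no ">" character. If your input has attributes on the tag or CSS child selectors, a more tolerant variant might look like the sketch below. The method name `withoutStyleBlocksAndTags` and the sample string are assumptions made for this example, not part of the original answer.

```swift
import Foundation

extension String {
    // Hypothetical variant of withoutHtmlTags(); the name is made up for this sketch.
    func withoutStyleBlocksAndTags() -> String {
        // Drop whole <style ...>...</style> blocks: attributes allowed,
        // case-insensitive, and non-greedy so each block ends at its own closing tag.
        let noStyles = self.replacingOccurrences(of: "(?i)<style\\b[^>]*>[\\s\\S]*?</style>", with: "", options: .regularExpression)
        // Then drop the remaining tags, as in the answer above.
        return noStyles.replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
    }
}

let page = "<style type=\"text/css\">ul > li { color: red; }</style><p>Hi</p>"
let text = page.withoutStyleBlocksAndTags()   // "Hi"
```

This is still a regular-expression approach, so it shares the general limitations of stripping HTML with patterns.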

Answer 8 (score: 0)

Swift 5

extension String {
    public func trimHTMLTags() -> String? {
        guard let htmlStringData = self.data(using: String.Encoding.utf8) else {
            return nil
        }
    
        let options: [NSAttributedString.DocumentReadingOptionKey : Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]
    
        let attributedString = try? NSAttributedString(data: htmlStringData, options: options, documentAttributes: nil)
        return attributedString?.string
    }
}

Usage:

let str = "my html <a href='https://www.google.com'>link text</a>"

print(str.trimHTMLTags() ?? "--") //"my html link text"