What is the best way to convert sanitized data?

Time: 2015-02-10 22:27:40

Tags: xml database sanitization

I have a very large data set (a Stack Overflow data dump) that is stored entirely in its raw, sanitized (HTML-escaped) form.

For example: &lt;/p&gt;

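Is there an established way to convert the above, and similar content, back to its original form so that it is easier to read and work with? A Python script or function call, perhaps?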

1 Answer:

Answer 0: (score: 0)

Here is the solution I had to use to get everything working. Note that the HTML parser alone did not handle everything in my data set.

    #!/usr/bin/python3

    import html
    import sys

    # Number of lines to collect in the buffer before writing
    BUFFER_SIZE_LINES = 1024

    # A few HTML reserved characters that the unescape pass
    # did not clean up in my data set
    LEFTOVER_ENTITIES = {
        '&quot;': '"',
        '&apos;': "'",
        '&amp;': '&',
        '&lt;': '<',
        '&gt;': '>',
    }

    # Process the file
    def ProcessLargeTextFile(fileIn, fileOut):
        with open(fileIn, "r") as r, open(fileOut, "w") as w:
            buff = []
            for lineIn in r:
                # html.unescape replaces HTMLParser.unescape,
                # which was removed in Python 3.9
                lineOut = html.unescape(lineIn)
                for key, value in LEFTOVER_ENTITIES.items():
                    lineOut = lineOut.replace(key, value)

                # lineIn already ends with a newline, so none is added here
                buff.append(lineOut)

                if len(buff) >= BUFFER_SIZE_LINES:
                    w.writelines(buff)
                    buff = []

            # Write whatever is left in the buffer
            w.writelines(buff)


    # Now run
    if __name__ == "__main__":
        ProcessLargeTextFile(sys.argv[1], sys.argv[2])
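
One possible reason a single unescape pass leaves entities behind is that the text was escaped more than once (for example `&amp;lt;` instead of `&lt;`). That is an assumption about the data, not something confirmed above. Under that assumption, a minimal sketch is to apply the standard library's `html.unescape` repeatedly until the text stops changing. The helper name `unescape_fully` and the `max_rounds` cap are hypothetical, chosen only for illustration.

```python
import html


def unescape_fully(text, max_rounds=5):
    """Apply html.unescape until the text stops changing.

    max_rounds is an arbitrary safety cap (an assumption for this sketch).
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            # Nothing left to decode
            break
        text = decoded
    return text


if __name__ == "__main__":
    # Doubly escaped sample input, made up for this example
    sample = "&amp;lt;p&amp;gt;Hello &amp;amp; goodbye&amp;lt;/p&amp;gt;"
    print(unescape_fully(sample))  # prints: <p>Hello & goodbye</p>
```

Repeated decoding is aggressive: text that was meant to contain a literal `&lt;` (say, a post discussing HTML entities) would be decoded as well, so it may be worth checking a sample of the output before running this over the whole dump.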