Converting a Unicode list to a readable format

Time: 2018-10-23 08:35:47

Tags: python unicode tokenize python-unicode

I am using polyglot to tokenize Burmese text. Here is what I am doing.


    from polyglot.text import Text

    blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
    text = Text(blob)

When I do this:

    print(text.words)

it outputs the following:

    [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']

What kind of output is this? I am not sure why the output looks like this. How can I convert it back to a format I can read?

I also tried the following, but it throws an error:

    text.words[1].decode('unicode-escape')

2 answers:

Answer 0 (score: 2)

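This is how Python 2 prints lists. It is debug output (see repr()) that shows the contents of the list unambiguously. u'' denotes a Unicode string, and \uxxxx denotes the Unicode code point U+xxxx. The output is pure ASCII, so it works on any terminal. If you print the strings in the list directly, they display correctly as long as your terminal supports the characters being printed. Example: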

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print words
for word in words:
    print word

Output:

[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
ထို
င္းေ
ရာ
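
If the goal is one readable line rather than one token per line, an option is to join the tokens into a single string before printing: printing a string shows the characters, while printing a list shows the repr() of each element. A minimal sketch, written for Python 3 (which still accepts the `u` prefix); the separator and the names `readable` and `names` are illustrative choices only, and `unicodedata.name()` is used just to show what each `\uxxxx` escape stands for:

```python
import unicodedata

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# Joining yields one string; printing a string shows characters, not escapes.
readable = u' | '.join(words)
print(readable)

# Each \uxxxx escape is one code point; look up its official Unicode name.
names = [unicodedata.name(ch) for ch in words[0]]
for ch, name in zip(words[0], names):
    print('U+%04X %s' % (ord(ch), name))
```

The terminal still has to be able to render the characters for the first print to look right.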

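To emphasize again: your terminal must be configured with an encoding that supports these Unicode code points (ideally UTF-8) and must use a font that contains the characters. Otherwise, you can write the text to a UTF-8 encoded file and view that file in an editor that supports UTF-8 and has a font covering the characters: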

import io
with io.open('example.txt','w',encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')
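
To find out ahead of time whether the terminal's encoding is the problem, it may help to test whether the tokens can be encoded with whatever encoding Python reports for standard output. A small sketch for Python 3, under the assumption that `sys.stdout.encoding` reflects the terminal; `can_encode` is a hypothetical helper, not a library function, and font support cannot be checked this way:

```python
import sys

def can_encode(word, encoding):
    """Return True if every character of `word` is representable in `encoding`."""
    try:
        word.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # LookupError covers an unknown encoding name.
        return False

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']

# sys.stdout.encoding can be None in some setups, so fall back to ASCII.
terminal_encoding = sys.stdout.encoding or 'ascii'
ok = all(can_encode(w, terminal_encoding) for w in words)
print('stdout encoding: %s, can encode the tokens: %s' % (terminal_encoding, ok))
```

If this reports that the tokens cannot be encoded, writing to a UTF-8 file as above is the safer route.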

Switch to Python 3 and things get simpler. Characters are displayed by default if the terminal supports them, but you can still get the debug output:

words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print(words)
print(ascii(words))

Output:

['ထို', 'င္းေ', 'ရာ']
['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']
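
On the `decode('unicode-escape')` attempt from the question: the strings in `text.words` are already Unicode, so there is nothing to decode. In Python 2, calling `.decode()` on a unicode object first tries to encode it as ASCII, which would explain the error. That codec is only useful when the data contains literal backslash-u sequences as text, for example copied from a log. A sketch in Python 3, where `escaped` is a made-up sample input:

```python
# Made-up sample: 18 ASCII characters (backslash, 'u', four hex digits; three times).
escaped = '\\u1011\\u102d\\u102f'

# str has no .decode() in Python 3, so go through bytes first.
decoded = escaped.encode('ascii').decode('unicode_escape')

print(len(escaped), len(decoded))
print(decoded == '\u1011\u102d\u102f')
```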

Answer 1 (score: 0)

It looks like your terminal cannot handle UTF-8 encoded Unicode. Try saving the output to a file, encoding each token as UTF-8, as shown below.

    # -*- coding: utf-8 -*-

    from __future__ import unicode_literals
    from polyglot.text import Text

    blob = u"""
    ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
    """
    text = Text(blob)


    # Append one token per line; each token is encoded to UTF-8 bytes (Python 2).
    with open('output.txt', 'a') as the_file:
        for word in text.words:
            the_file.write("\n")
            the_file.write(word.encode("utf-8"))
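
On Python 3 the same idea can be written by letting the file object do the encoding. A rough sketch, not tied to polyglot: `tokens` is a stand-in for `text.words`, and the temporary directory is an assumption made only so the example does not leave files behind.

```python
import io
import os
import tempfile

# Stand-in for text.words (first three tokens from the question's output).
tokens = ['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'output.txt')

# Text mode with an explicit encoding means no manual .encode() calls.
with io.open(path, 'w', encoding='utf-8') as the_file:
    for token in tokens:
        the_file.write(token + '\n')

# Read the file back to confirm the tokens survived the round trip.
with io.open(path, encoding='utf-8') as the_file:
    read_back = the_file.read().splitlines()

print(read_back == tokens)
```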