I am using polyglot to tokenize Burmese text. Here is what I am doing.
When I do this:
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
print(text.words)
it outputs in the following format:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c', u'\u1000\u1039\u103b', u'\u1019', u'\u1014\u1039', u'\u1019\u102c', u'\u101c\u102f', u'\u1015\u1039', u'\u101e\u102c\u1038', u'\u1019\u103a\u102c\u1038', u'\u1000\u102d\u102f', u'\u101c\u102f\u1036', u'\u107f', u'\u1001\u1033\u1036\u1031', u'\u101b\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015\u107f', u'\u1015\u102e\u1038', u'\u1011\u102d\u102f', u'\u1004\u1039\u1038', u'\u101b\u1032', u'\u1006', u'\u1000\u1039', u'\u101c', u'\u1000\u1039', u'\u1016', u'\u1019\u1039\u1038', u'\u1006\u102e\u1038', u'\u104a', u'\u1027', u'\u100a\u1037\u1039', u'\u1005\u102c', u'\u101b', u'\u1004\u1039\u1038', u'\u1021\u1031\u107e', u'\u1000\u102c', u'\u1004\u1039\u1038\u103b', u'\u1015', u'\u104d', u'\u1012', u'\u100f\u1039\u1031', u'\u1004\u103c\u1090\u102d\u102f', u'\u1000\u1039']
What kind of output is this? I am not sure why it looks like this. How can I convert it back into a form I can read?
I also tried the following:
text.words[1].decode('unicode-escape')
but it throws an error.
Answer 0 (score: 2)
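This is how Python 2 prints lists. It is debugging output (see repr()) that unambiguously shows the content of the list. u'' indicates a Unicode string, and \uxxxx indicates the Unicode code point U+xxxx. The output is all ASCII, so it works on any terminal. If you print the strings in the list directly, they display correctly, provided your terminal supports the characters being printed. Example: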
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print words
for word in words:
    print word
Output:
[u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
ထို
င္းေ
ရာ
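To see which characters the \uxxxx escapes stand for without relying on the terminal or its font, the standard unicodedata module can name each code point. This is a minimal Python 3 sketch; the helper name describe is made up for illustration, and the word list is the three-item example used above.

```python
import unicodedata

# The same three tokens as in the example above (Python 3 string literals).
words = ['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']

def describe(word):
    """Return (code point, Unicode name) pairs for every character in word.

    `describe` is a hypothetical helper, not part of polyglot or the stdlib.
    """
    return [('U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN')) for ch in word]

for word in words:
    for code_point, name in describe(word):
        # Prints pure ASCII, so it is readable on any terminal.
        print(code_point, name)
```

Because the result is plain ASCII, it can be pasted into a bug report or log even when the terminal cannot render Myanmar script.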
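To emphasize again, your terminal must be configured with an encoding that supports the Unicode code points (ideally UTF-8) and must also use a font that supports the characters. Otherwise, you can write the text to a UTF-8-encoded file and view it in an editor that supports UTF-8 and has a font that supports the characters: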
import io
with io.open('example.txt','w',encoding='utf8') as f:
    for word in words:
        f.write(word + u'\n')
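If no suitable editor is at hand, one way to confirm that the file really holds the original text is to read it back and compare. The following is a Python 3 sketch under the assumption that a temporary directory is an acceptable place for the file; the path is illustrative only.

```python
import io
import os
import tempfile

words = ['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']

# Hypothetical location: a fresh temporary directory, to avoid clobbering files.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Write one token per line, encoded as UTF-8.
with io.open(path, 'w', encoding='utf8') as f:
    for word in words:
        f.write(word + '\n')

# Read the file back with the same encoding and split it into lines again.
with io.open(path, 'r', encoding='utf8') as f:
    read_back = f.read().splitlines()

# True means the text survived the round trip unchanged.
print(read_back == words)
```

Reading with a different encoding than the one used for writing is the usual cause of garbled text at this step, so the encoding argument is given explicitly in both calls.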
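Switch to Python 3 and things get simpler. Characters are displayed by default if the terminal supports them, but you can still get the debugging output: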
words = [u'\u1011\u102d\u102f', u'\u1004\u1039\u1038\u1031', u'\u101b\u102c']
print(words)
print(ascii(words))
Output:
['ထို', 'င္းေ', 'ရာ']
['\u1011\u102d\u102f', '\u1004\u1039\u1038\u1031', '\u101b\u102c']
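On the decode('unicode-escape') attempt from the question: the items in the list are already Unicode strings, so there is nothing left to decode. In Python 2, calling .decode() on a unicode object first encodes it implicitly with ASCII, which is a likely cause of the error (an assumption, since the error message is not shown). The unicode_escape codec only helps when you hold the literal backslash text, for example copied from a log. A small Python 3 sketch of that case:

```python
# A raw string: 18 ASCII characters (backslash, 'u', four hex digits, three times),
# not three Myanmar characters.
escaped = r'\u1011\u102d\u102f'

# str -> ASCII bytes -> str, interpreting the \uXXXX escapes on the way back.
decoded = escaped.encode('ascii').decode('unicode_escape')

print(len(escaped), len(decoded))
print(decoded == '\u1011\u102d\u102f')
```

This conversion assumes the escaped text is pure ASCII; it is not needed for the output of text.words, which only looks escaped because of how the list is displayed.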
Answer 1 (score: 0)
It looks like your terminal cannot handle UTF-8-encoded Unicode. Try saving the output to a file instead, encoding each token as utf-8, as shown below.
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from polyglot.text import Text
blob = u"""
ထိုင္းေရာက္ျမန္မာလုပ္သားမ်ားကို လုံၿခဳံေရး အေၾကာင္းျပၿပီး ထိုင္းရဲဆက္လက္ဖမ္းဆီး၊ ဧည့္စာရင္းအေၾကာင္းျပ၍ ဒဏ္ေငြ႐ိုက္
"""
text = Text(blob)
with open('output.txt', 'a') as the_file:
    for word in text.words:
        the_file.write("\n")
        the_file.write(word.encode("utf-8"))
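Before falling back to a file, it may help to check whether the terminal's encoding can represent the tokens at all. The following Python 3 sketch does that; can_display and printable are hypothetical helper names, and the check covers only the encoding, not whether the terminal font has Myanmar glyphs.

```python
import sys

def can_display(text, encoding=None):
    """Return True if text can be encoded with encoding (default: stdout's)."""
    encoding = encoding or sys.stdout.encoding or 'ascii'
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def printable(text, encoding=None):
    """Return text with unencodable characters replaced by \\uXXXX escapes."""
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

word = '\u1011\u102d\u102f'
print(can_display(word, 'utf-8'))   # UTF-8 can encode any code point
print(can_display(word, 'ascii'))   # ASCII cannot encode Myanmar letters
print(printable(word, 'ascii'))     # falls back to escape sequences
```

If can_display returns False for your actual terminal encoding, writing to a UTF-8 file as shown above is the more reliable route.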