Question

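I'm pulling data out of a Google Doc, processing it, and writing it to a file (which I will eventually paste into a WordPress page).

It has some non-ASCII symbols. How can I convert these safely to symbols that can be used in HTML source?

Currently I'm converting everything to Unicode on the way in, joining it all together in a Python string, then doing: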

import codecs
f = codecs.open('out.txt', mode="w", encoding="iso-8859-1")
f.write(all_html.encode("iso-8859-1", "replace"))

The last line raises an encoding error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 12286: ordinal not in range(128)

Partial solution:

This Python runs without an error:

row = [unicode(x.strip()) if x is not None else u'' for x in row]
all_html = row[0] + "<br/>" + row[1]
f = open('out.txt', 'w')
f.write(all_html.encode("utf-8"))

But if I open the actual text file, I see lots of symbols like:

Qur‚Äôan

Maybe I need to write to something other than a text file?

Answer 1

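Deal exclusively with unicode objects as much as possible: decode things to unicode objects when you first get them, and encode them as necessary on the way out.

If your string is actually a unicode object, you'll need to encode it to a byte string (for example, as UTF-8) before writing it to a file: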

foo = u'Δ, Й, ק, ‎ م, ๗, あ, 叶, 葉, and 말.'
f = open('test', 'w')
f.write(foo.encode('utf8'))
f.close()

When you read that file again, you'll get an encoded byte string that you can decode back to a unicode object:

f = file('test', 'r')
print f.read().decode('utf8')
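
For readers on Python 3, where str is already Unicode and the built-in open() accepts an encoding argument, the same round trip might look like the sketch below. The temporary file path and the sample text are placeholders chosen for illustration, not part of the original answer.

```python
import os
import tempfile

# Sample text; any Unicode string would do.
foo = 'Δ, Й, ק, م, ๗, あ, 叶, 葉, and 말.'

# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# Encode on the way out: the file object encodes str to UTF-8 bytes.
with open(path, 'w', encoding='utf-8') as f:
    f.write(foo)

# Decode on the way in: reading with the same encoding returns the text.
with open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

# The raw bytes on disk are simply the UTF-8 encoding of the text.
with open(path, 'rb') as f:
    raw = f.read()
```

Here round_tripped equals foo, and raw equals foo.encode('utf-8').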

Answer 2

In Python 2.6+, you can use io.open(), which is the default (the built-in open()) on Python 3:

import io

with io.open(filename, 'w', encoding=character_encoding) as file:
    file.write(unicode_text)

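This can be more convenient if you need to write the text incrementally (you don't need to call unicode_text.encode(character_encoding) multiple times). Unlike the codecs module, the io module has proper universal newlines support.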
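
As a rough illustration of the incremental case, the sketch below writes several pieces of text through one io.open() file object. The chunk contents and the temporary path are made up for the example.

```python
import io
import os
import tempfile

# Hypothetical pieces of text produced one at a time.
chunks = [u'Qur\u2019an', u'<br/>', u'caf\u00e9\n']

path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# The file object encodes each piece as it is written, so there is
# no need to call .encode() on every chunk.
with io.open(path, 'w', encoding='utf-8') as f:
    for chunk in chunks:
        f.write(chunk)

# Reading in text mode decodes the bytes and normalizes newlines.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()
```

After this runs, text is the concatenation of the chunks.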

Answer 3

Unicode string handling is standardized in Python 3.

Characters are already stored as Unicode in memory.

You only need to open the file in UTF-8:

out1 = "(嘉南大圳 ㄐㄧㄚ　ㄋㄢˊ　ㄉㄚˋ　ㄗㄨㄣˋ )"
fobj = open("t1.txt", "w", encoding="utf-8")
fobj.write(out1)
fobj.close()

Answer 4

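The file opened by codecs.open is a file that takes unicode data, encodes it in iso-8859-1, and writes it to the file. However, what you try to write isn't unicode; you take unicode and encode it in iso-8859-1 yourself. That's what the unicode.encode method does, and the result of encoding a unicode string is a byte string (a str type).

You should either use a normal open() and encode the unicode yourself, or (usually a better idea) use codecs.open() and not encode the data yourself.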
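
The two options can be sketched side by side. This is written for Python 3 and uses io.open() as the encoding-aware file object in place of codecs.open(), which is an assumption made for the example (both accept text and encode it on write). The sample string and file names are hypothetical.

```python
import io
import os
import tempfile

# Sample text containing a no-break space (0xa0) and a pound sign.
all_html = u'price:\u00a0\u00a35<br/>'
tmpdir = tempfile.mkdtemp()

# Option 1: a plain binary file; encode the text yourself, exactly once.
path1 = os.path.join(tmpdir, 'manual.txt')
with open(path1, 'wb') as f:
    f.write(all_html.encode('iso-8859-1', 'replace'))

# Option 2: an encoding-aware file; pass text and let the file encode it.
path2 = os.path.join(tmpdir, 'auto.txt')
with io.open(path2, 'w', encoding='iso-8859-1', errors='replace') as f:
    f.write(all_html)

# Both routes put the same bytes on disk.
with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
    same_bytes = f1.read() == f2.read()
```

The mistake in the question is mixing the two: encoding by hand and then handing the result to a file object that also wants to encode.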

Answer 5

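Preface: will your viewer work?

Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, with Notepad.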
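
The symptom in the question (Qur‚Äôan) is consistent with a correctly written UTF-8 file being displayed by a viewer that assumes a different single-byte encoding. The sketch below assumes Mac Roman as the wrong encoding; that is a guess that happens to reproduce those characters, and the asker's actual viewer is unknown.

```python
# The sample word with a right single quotation mark (U+2019).
original = u'Qur\u2019an'

# What a correct UTF-8 write puts on disk.
utf8_bytes = original.encode('utf-8')

# A viewer that wrongly assumes Mac Roman turns the three UTF-8 bytes
# of the quotation mark into three unrelated characters.
misread = utf8_bytes.decode('mac_roman')

# Decoding with the right encoding recovers the text unchanged.
recovered = utf8_bytes.decode('utf-8')
```

In other words, the file itself can be fine while the display is wrong, so it is worth checking the viewer's encoding setting before changing the code.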

Writing Unicode text to a text file?

In Python 2, use open from the io module (this is the same as the built-in open in Python 3):

import io

Best practice, in general, is to use UTF-8 for writing to files (we don't even have to worry about byte order with utf-8).

encoding = 'utf-8'

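utf-8 is the most modern and universally usable encoding: it works in all web browsers, most text editors (see your settings if you have issues), and most terminals/shells.

On Windows, you might try utf-16le if you're limited to viewing output in Notepad (or another limited viewer).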

encoding = 'utf-16le' # sorry, Windows users... :(

Then open it with a context manager and write your unicode characters out:

with io.open(filename, 'w', encoding=encoding) as f:
    f.write(unicode_object)
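
The byte-order remark can be made concrete: UTF-8 yields a single byte sequence for a character, while UTF-16 comes in little-endian and big-endian variants. A minimal sketch, using an arbitrary sample character:

```python
text = u'\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

# UTF-8 has one byte sequence, regardless of platform.
utf8 = text.encode('utf-8')

# UTF-16 depends on byte order, so the variant has to be named...
utf16le = text.encode('utf-16le')
utf16be = text.encode('utf-16be')

# ...or a byte order mark (BOM) is written first to record it.
utf16_with_bom = text.encode('utf-16')
```

The same two-byte code unit appears in opposite orders in utf16le and utf16be, which is why a reader has to know (or detect from the BOM) which one was used.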

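Example using many Unicode characters

Here's an example that attempts to map every possible character up to three bytes wide (4 is the max, but that would be going a bit far) from its numeric representation (an integer) to an encoded printable output, along with its name, if possible (put this into a file called uni.py):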

from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter

try: # use these if Python 2
    unicode_chr, range = unichr, xrange
except NameError: # Python 3
    unicode_chr = chr

exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
    for x in range((2**8)**3): 
        try:
            char = unicode_chr(x)
        except ValueError:
            continue # can't map to unicode, try next x
        cat = category(char)
        counts.update((cat,))
        if cat in exclude_categories:
            continue # get rid of noise & greatly shorten result file
        try:
            uname = name(char)
        except ValueError: # probably control character, don't use actual
            uname = control_names.get(x, '')
            f.write(u'{0:>6x} {1}    {2}\n'.format(x, cat, uname))
        else:
            f.write(u'{0:>6x} {1}  {2}  {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
    print('{0} chars of category, {1}'.format(count, cat))

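This should run in about a minute, and you can then view the data file; if your file viewer can display Unicode, you'll see the characters. The category codes are Unicode general categories, as reported by unicodedata.category. Based on the counts, the results are improved by excluding the Cn (unassigned) and Co (private use) categories, which have no symbols associated with them.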

$ python uni.py

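It will display the hexadecimal mapping, category, symbol (unless the name can't be found, in which case it is probably a control character), and the name of the symbol.

I recommend less on Unix or Cygwin (don't print/cat the entire file to your output):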

$ less unidata

It will display lines similar to the following, which I sampled from it using Python 2 (Unicode 5.2):

     0 Cc NUL
    20 Zs     SPACE
    21 Po  !  EXCLAMATION MARK
    b6 So  ¶  PILCROW SIGN
    d0 Lu  Ð  LATIN CAPITAL LETTER ETH
   e59 Nd  ๙  THAI DIGIT NINE
  2887 So  ⢇  BRAILLE PATTERN DOTS-1238
  bc13 Lo  밓  HANGUL SYLLABLE MIH
  ffeb Sm  ￫  HALFWIDTH RIGHTWARDS ARROW

My Python 3.5 from Anaconda has Unicode 8.0; I would presume most Python 3 installations do.

Answer 6

How to print unicode characters into a file:

Save this to a file named foo.py:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys 
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
print(u'e with obfuscation: é')

Run it and redirect the output to a file:

python foo.py > tmp.txt

Open tmp.txt and look inside; you'll see this:

el@apollo:~$ cat tmp.txt 
e with obfuscation: é

Thus you have saved the unicode e with an obfuscation mark (an acute accent) on it to a file.

Answer 7

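That error arises when you try to encode a non-unicode string: it tries to decode it first, assuming it's in plain ASCII. There are two possibilities:

1. You're encoding it to a bytestring, but because you've used codecs.open, the write method expects a unicode object. So you encode it, and it tries to decode it again. Try f.write(all_html) instead.
2. all_html is not, in fact, a unicode object. When you call .encode(...), it first tries to decode it.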
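
The hidden decode step can be made visible by doing it explicitly. The sketch below is written for Python 3, where the implicit conversion no longer exists, so the ASCII decode that Python 2 performs behind the scenes is spelled out; the sample bytes are invented for the example.

```python
# A byte string containing 0xa0, the byte named in the question's error.
data = b'caf\xa0'

error = None
try:
    # Python 2 does this implicitly when .encode() is called on a byte string.
    data.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the bytes are really in works;
# iso-8859-1 is assumed here, where 0xa0 is a no-break space.
text = data.decode('iso-8859-1')
```

The caught exception reports the 'ascii' codec and the offending byte, matching the shape of the traceback in the question.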

Answer 8

When writing in Python 3:

>>> a = u'bats\u00E0'
>>> print(a)
batsà
>>> f = open("/tmp/test", "w")
>>> f.write(a)
>>> f.close()
>>> data = open("/tmp/test").read()
>>> data
'batsà'

When writing in Python 2:

>>> a = u'bats\u00E0'
>>> f = open("/tmp/test", "w")
>>> f.write(a)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To avoid this error, you have to encode it to bytes using the "utf-8" codec, like this:

>>> f.write(a.encode("utf-8"))
>>> f.close()

And when reading the file back, decode the data with the same "utf-8" codec:

>>> data = open("/tmp/test").read()
>>> data.decode("utf-8")
u'bats\xe0'

Also, if you print the unicode object, Python automatically encodes it for output using the terminal's encoding (for example, "utf-8"):

>>> print a
batsà
