Unicode dict key problems

Time: 2016-01-31 19:52:08

Tags: python dictionary unicode python-3.5 python-unicode

I'm using Python 3.5. As far as I know, all strings should be Unicode by default. Why are these Unicode key names being encoded as ASCII?

row_map = {
    'α-Pinene': 7,
    'β-Pinene': 8,
    'Terpinolene': 9,
    'Geraniol': 10,
    'α-Terpinene': 11,
    'γ-Terpinene': 12,
    'Camphene': 13,
    'Linalool': 14,
    'd-Limonene': 15,
    'Citral': 16,
    'Myrcene': 17,
    'α-Terpineol': 18,
    'Citronellol': 19,
    'dl-Menthol': 20,
    '1-Borneol': 21,
    '2-Piperidone': 22,
    'β-Caryophyllene': 23,
    'α-Humulene': 24,
    'Caryophyllene Oxide': 25,
}

# Write the dict's repr to a UTF-8 file, then dump the raw bytes that were written.
with open("log.txt", "w", encoding="utf-8") as f:
    print(row_map, file=f)
print(open("log.txt", "rb").read())

Here is the result of writing those keys to a UTF-8 text file, log.txt:

dict_keys([
    'Terpinolene', 
    'Camphene', 
    'Myrcene', 
    'α-Terpineol', 
    'd-Limonene', 
    '2-Piperidone', 
    'γ-Terpinene', 
    'Geraniol', 
    'Linalool', 
    'α-Humulene', 
    'α-Pinene', 
    'β-Caryophyllene', 
    'β-Pinene', 
    'Caryophyllene Oxide', 
    'Citronellol', 
    '1-Borneol', 
    'Citral', 
    'α-Terpinene', 
    'dl-Menthol'])

Edit: Here is the actual txt file, so you can confirm that it isn't my viewer.

Edit #2: Take a look at this funkiness.

Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> row_map = {
...                 'α-Pinene': 7,
...                 'β-Pinene': 8,
...                 'Terpinolene': 9,
...                 'Geraniol': 10,
...                 'α-Terpinene': 11,
...                 'γ-Terpinene': 12,
...                 'Camphene': 13,
...                 'Linalool': 14,
...                 'd-Limonene': 15,
...                 'Citral': 16,
...                 'Myrcene': 17,
...                 'α-Terpineol': 18,
...                 'Citronellol': 19,
...                 'dl-Menthol': 20,
...                 '1-Borneol': 21,
...                 '2-Piperidone': 22,
...                 'β-Caryophyllene': 23,
...                 'α-Humulene': 24,
...                 'Caryophyllene Oxide': 25,
...             }
>>> row_map
{'Citral': 16, 'd-Limonene': 15, 'Myrcene': 17, 'Camphene': 13, 'ß-Caryophyllene': 23, 'α-Terpinene': 11, 'Linalool': 14, 'α-Humulene': 24, '1-Borneol': 21, 'Citronellol': 19, 'Caryophyllene Oxide': 25, 'α-Terpineol': 18, 'α-Pinene': 7, '2-Piperidone': 22, 'dl-Menthol': 20, 'Terpinolene': 9, 'ß-Pinene': 8, 'Geraniol': 10, '?-Terpinene': 12}
>>> from strains.models import Terpene
>>> Terpene.row_map
{'Citral': 16, 'd-Limonene': 15, 'Myrcene': 17, 'Camphene': 13, 'α-Terpinene': 11, 'Linalool': 14, 'α-Humulene': 24, '1-Borneol': 21, 'Citronellol': 19, 'Caryophyllene Oxide': 25, 'α-Terpineol': 18, 'α-Pinene': 7, '\u03b2-Caryophyllene': 23, '2-Piperidone': 22, 'dl-Menthol': 20, 'Terpinolene': 9, '\u03b3-Terpinene': 12, 'Geraniol': 10, '\u03b2-Pinene': 8}
>>>

I copied this from the question code and pasted it into the shell. Notice how the pasted dict automatically replaces anything that can't be encoded.

Notice how the exact same dict, as an attribute of the Terpene object, has its Unicode escaped!
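
One possible way to reproduce both displays, as a minimal sketch: it assumes the Windows console encoding was cp437 (the question does not say so), a code page that contains 'α' but neither 'β' nor 'γ'. The names CONSOLE_ENCODING and sample are illustrative only. Encoding the dict's repr with the 'replace' error handler turns unencodable characters into '?', while 'backslashreplace', the fallback that sys.displayhook is documented to use when the output encoding cannot represent the text, turns them into \uXXXX escapes.

```python
# Assumption: the console encoding is cp437 (not stated in the question).
CONSOLE_ENCODING = "cp437"

# A three-key excerpt of row_map, one key per Greek letter.
sample = {"α-Pinene": 7, "β-Pinene": 8, "γ-Terpinene": 12}
text = repr(sample)

# 'replace': characters the code page lacks become '?'.
replaced = text.encode(CONSOLE_ENCODING, "replace").decode(CONSOLE_ENCODING)

# 'backslashreplace': characters the code page lacks become \uXXXX escapes.
escaped = text.encode(CONSOLE_ENCODING, "backslashreplace").decode(CONSOLE_ENCODING)

print(replaced)
print(escaped)
```

Under that assumption, the escaped form matches the Terpene.row_map display above: 'α' survives, while 'β' and 'γ' are escaped. The 'ß' shown for 'β' in the pasted dict is not reproduced by Python's codec and is presumably a substitution made by the console during the paste.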

Here is the Terpene object's row_map:

from django.db import models


class Terpene(models.Model):

    name = models.CharField(max_length=50, unique=True)
    short_desc = models.CharField(max_length=250, blank=True, null=True)
    long_desc = models.TextField(blank=True, null=True)
    aroma = models.CharField(max_length=250, blank=True, null=True)
    flavor = models.CharField(max_length=250, blank=True, null=True)
    effects = models.CharField(max_length=250, blank=True, null=True)

    # Class attribute, accessed in the shell session as Terpene.row_map.
    row_map = {
        'α-Pinene': 7,
        'β-Pinene': 8,
        'Terpinolene': 9,
        'Geraniol': 10,
        'α-Terpinene': 11,
        'γ-Terpinene': 12,
        'Camphene': 13,
        'Linalool': 14,
        'd-Limonene': 15,
        'Citral': 16,
        'Myrcene': 17,
        'α-Terpineol': 18,
        'Citronellol': 19,
        'dl-Menthol': 20,
        '1-Borneol': 21,
        '2-Piperidone': 22,
        'β-Caryophyllene': 23,
        'α-Humulene': 24,
        'Caryophyllene Oxide': 25,
    }

Edit #3:

Here are the raw bytes read back from the file written by the question code:

b"{'\xc3\x8e\xc2\xb1-Terpinene': 11, 'Geraniol': 10, '\xc3\x8e\xc2\xb1-Pinene': 7, 'dl-Menthol': 20, 'Myrcene': 17, 'Citral': 16, 'Citronellol': 19, 'Camphene': 13, '\xc3\x8e\xc2\xb3-Terpinene': 12, '\xc3\x8e\xc2\xb1-Terpineol': 18, '1-Borneol': 21, '\xc3\x8e\xc2\xb1-Humulene': 24, '\xc3\x8e\xc2\xb2-Caryophyllene': 23, '\xc3\x8e\xc2\xb2-Pinene': 8, '2-Piperidone': 22, 'Caryophyllene Oxide': 25, 'Linalool': 14, 'Terpinolene': 9, 'd-Limonene': 15}\r\n"

Here are the raw bytes read back from the file written from the Terpene object's row_map:

b"{'Geraniol': 10, '\xce\xb2-Caryophyllene': 23, '\xce\xb1-Pinene': 7, 'Citral': 16, '\xce\xb3-Terpinene': 12, 'Myrcene': 17, 'Camphene': 13, '\xce\xb1-Terpinene': 11, 'dl-Menthol': 20, '1-Borneol': 21, '\xce\xb1-Humulene': 24, '\xce\xb2-Pinene': 8, 'd-Limonene': 15, 'Citronellol': 19, '2-Piperidone': 22, 'Caryophyllene Oxide': 25, '\xce\xb1-Terpineol': 18, 'Linalool': 14, 'Terpinolene': 9}\r\n"

1 answer:

Answer 0 (score: 1)

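"Why are these unicode key names being encoded as ascii?"

They're not. "Encoded as ASCII" isn't even possible here, because encoding them with ASCII doesn't work at all: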

>>> 'α-Terpineol'.encode('ascii')
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    'α-Terpineol'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\u03b1' in position 0: ordinal not in range(128)
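
As a small sketch of the same point, the hypothetical helper can_encode below (not part of the original code) checks whether a key can be represented in a given encoding. It also shows that the UTF-8 bytes for 'α' are \xce\xb1, the same bytes that appear in the Terpene.row_map dump in the question.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (illustrative helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


greek_key = "α-Terpineol"
plain_key = "Geraniol"

# ASCII can only represent code points below 128, so the Greek key fails.
greek_is_ascii = can_encode(greek_key, "ascii")
plain_is_ascii = can_encode(plain_key, "ascii")

# UTF-8 can represent every key; 'α' becomes the two bytes 0xCE 0xB1.
greek_utf8 = greek_key.encode("utf-8")

print(greek_is_ascii, plain_is_ascii, greek_utf8)
```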

They were correctly encoded as UTF-8, and your file viewer then decoded them using ISO-8859-1 or something similar:

>>> 'α-Terpineol'.encode('utf-8').decode('ISO-8859-1')
'Î±-Terpineol'
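
The same mis-decoding step can be checked against the bytes posted in the question, as a hedged sketch. The two byte strings below are copied from the question's dumps for the key 'α-Terpinene', and the variable names are illustrative. Decoding the first dump as UTF-8 gives the 'Î±' mojibake, which is what results if UTF-8 bytes are read as ISO-8859-1 and then saved as UTF-8 again. Reversing that single step recovers the original key.

```python
# Key bytes copied from the two dumps in the question.
first_dump_key = b"\xc3\x8e\xc2\xb1-Terpinene"   # file written by the question code
second_dump_key = b"\xce\xb1-Terpinene"          # file written from Terpene.row_map

# Decoding the first dump as UTF-8 yields mojibake rather than 'α'.
as_written = first_dump_key.decode("utf-8")

# Undo one round of "UTF-8 bytes read as ISO-8859-1".
repaired = as_written.encode("iso-8859-1").decode("utf-8")

# The second dump decodes directly to the intended key.
direct = second_dump_key.decode("utf-8")

print(ascii(as_written), ascii(repaired), repaired == direct)
```

This only demonstrates that the byte patterns are consistent with an ISO-8859-1 style mis-decoding. It does not by itself show which program performed that step.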