Question

What do I need to do in Python to determine the encoding of a string?

Answer 1

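In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code like this: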

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

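This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-textual data.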
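For Python 3, the same type check can be sketched against str and bytes. This is a minimal sketch and not part of the original answer; the function name what_is_this and the returned labels are illustrative assumptions.

```python
def what_is_this(s):
    """Return a label for the Python 3 type of s (the labels are illustrative)."""
    if isinstance(s, str):
        return "text string (str)"
    if isinstance(s, (bytes, bytearray)):
        return "byte string (bytes)"
    return "not a string"


print(what_is_this("abc"))
print(what_is_this(b"abc"))
print(what_is_this(42))
```

As with the Python 2 version, this only reports the Python type; it says nothing about which characters or bytes the object contains.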

Answer 2

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

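In Python 2, str is just a sequence of bytes. Python does not know what its encoding is. The unicode type is the safer way to store text. If you want to understand this better, I recommend http://farmdev.com/talks/unicode/.

In Python 3: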

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

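In Python 3, str is like Python 2's unicode and is used to store text. What was called str in Python 2 is called bytes in Python 3.

How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, the byte string was not valid in that encoding.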

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
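The decode test above can be wrapped in a small helper. This is a minimal Python 3 sketch; the name is_valid_encoding is an assumption made for illustration and is not a standard library function.

```python
def is_valid_encoding(data, encoding):
    """Return True if the bytes object `data` decodes cleanly with `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 representation of the letter 'Ü'
print(is_valid_encoding(u_umlaut, "utf-8"))
print(is_valid_encoding(u_umlaut, "ascii"))
print(is_valid_encoding(b"abc", "ascii"))
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the one the author intended.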

Answer 3

In Python 3.x all strings are sequences of Unicode characters, so an isinstance check for str (which means a unicode string by default) is sufficient.

isinstance(x, str)

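As for Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check whether you have a 'string-like' object with a single statement, you can do the following: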

isinstance(x, basestring)

Answer 4

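Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008; it explains things a lot better than most of the related answers on Stack Overflow.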

Answer 5

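If your code needs to be compatible with both Python 2 and Python 3, you cannot directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3 and bytes is undefined in Python 2 before version 2.6 (from 2.6 on it is just an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. Here is an example: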

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3, 0, 0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Both of these are unpythonic, and most of the time there is probably a better way.
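The try/except option mentioned at the start of this answer can be sketched as follows. This is one possible shape under stated assumptions, not the answerer's code: the names unicode_type and to_native_str are hypothetical, and the default encoding of 'ascii' simply mirrors the examples above.

```python
try:
    unicode_type = unicode  # defined on Python 2 only
except NameError:
    unicode_type = None     # Python 3: there is no separate unicode type


def to_native_str(s, encoding="ascii"):
    """Convert Python 3 bytes or Python 2 unicode to the native str type."""
    if unicode_type is not None and isinstance(s, unicode_type):
        return s.encode(encoding)   # Python 2: unicode -> str
    if not isinstance(s, str) and isinstance(s, bytes):
        return s.decode(encoding)   # Python 3: bytes -> str
    return s                        # already a native str, or not a string


print(to_native_str(b"abc"))
print(to_native_str("abc"))
```

The NameError is caught once at import time, so the function body needs neither a version test nor a comparison of type names.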

Answer 6

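Use:

import six
if isinstance(obj, six.text_type):
    pass  # obj is a text (unicode) string

Inside the six library, the related string_types is defined as follows (text_type is likewise str on Python 3 and unicode on Python 2):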

if PY3:
    string_types = str,
else:
    string_types = basestring,

Answer 7

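Note that on Python 3, it is not really fair to say any of the following:

str is UTFx for any x (e.g. UTF8)
str is Unicode
str is an ordered collection of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.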
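The difference between code points and characters can be seen with a combining accent. This is a small Python 3 illustration added for clarity; the variable names are assumptions.

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

print(len(composed))           # 1 code point
print(len(decomposed))         # 2 code points, still one visible character
print(composed == decomposed)  # False: str compares code points

# NFC normalization composes the pair into the single code point
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings display as the same character, yet len and == operate on code points, which is why "a collection of Unicode characters" is not quite accurate.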

Even on Python 3, answering this question is not as simple as you might imagine.

An obvious way to test for ASCII-compatible strings is to attempt an encode:

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to distinguish them.
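Both encode tests can be combined into one helper. This is a minimal Python 3 sketch; the function name classify_text and the three labels it returns are hypothetical, chosen only for illustration.

```python
def classify_text(s):
    """Classify a str by which codec can encode it (labels are illustrative)."""
    try:
        s.encode("ascii")
        return "ascii"
    except UnicodeEncodeError:
        pass
    try:
        s.encode("utf-8")
        return "non-ascii"
    except UnicodeEncodeError:
        # UTF-8 refuses lone surrogate code points such as '\udcc3'
        return "contains lone surrogates"


print(classify_text("Hello there!"))
print(classify_text("Hello there... \u2603!"))
print(classify_text("\udcc3"))
```

On Python 3.7 and later, str.isascii() answers the first question without an encode attempt.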

Answer 8

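You could use Universal Encoding Detector, but be aware that it will only give you a best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc", for example. You will need to get the encoding information elsewhere; for example, the HTTP protocol uses the Content-Type header for that.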
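Why detection can only ever be a guess is easy to demonstrate with the standard library alone. The following Python 3 sketch is an added illustration (it does not use Universal Encoding Detector itself), and the variable names are assumptions.

```python
data = b"abc"

# The same bytes decode to the same text under many encodings,
# so the bytes alone cannot reveal which one was used.
results = {enc: data.decode(enc) for enc in ("ascii", "utf-8", "latin-1", "cp1252")}
print(results)

# Other byte strings are valid in several encodings but mean different things.
ambiguous = b"\xc3\x9c"
as_utf8 = ambiguous.decode("utf-8")      # one code point: U+00DC
as_latin1 = ambiguous.decode("latin-1")  # two code points: U+00C3, U+009C
print(repr(as_utf8), repr(as_latin1))
```

Only out-of-band information, such as the Content-Type header mentioned above, can settle which interpretation is correct.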

Answer 9

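This may help someone else. I started out testing for the string type of the variable s, but for my application it made more sense to simply return s as utf-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend for it to be Python version agnostic, without a version test or importing six. Please comment with improvements to the sample code below to help other people.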

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8

Answer 10

For py2/py3 compatibility, simply use

sudo chown -R user anaconda3

Answer 11

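One simple approach is to check whether unicode is a builtin function. If it is, you are on Python 2 and your string will be a str. To ensure everything is in unicode, you can do the following: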

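try:
    import builtins                 # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2 names the module __builtin__

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
    i = unicode(i)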

Answer 12

In Python 3, I had to work out whether a string looks like b='\x7f\x00\x00\x01' or b='127.0.0.1'. My solution was like this:

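def get_str(value):
    # Accept bytes or str; Latin-1 maps every byte to exactly one code point
    str_value = value.decode('latin-1') if isinstance(value, bytes) else str(value)

    if str_value.isprintable():
        return str_value

    # Not printable: treat each character as a raw byte value, e.g. '127.0.0.1'
    return '.'.join('%d' % ord(ch) for ch in str_value)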

This worked for me; I hope it works for whoever needs it.
