Question

I am comparing Unicode strings between JSON objects.

They have the same value:

a = '人口じんこうに膾炙かいしゃする'
b = '人口じんこうに膾炙かいしゃする'

But they have different Unicode representations:

String a : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b'
String b : u'\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b'

How can I compare two Unicode strings by value?

Answer 1

Unicode normalization will get you there:

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\uf9fb") == "\u7099"
True

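Apply unicodedata.normalize to both strings and then compare the results with == to check for canonical Unicode equivalence.

The character U+F9FB is a "CJK Compatibility" character. Under normalization, these characters decompose into one or more regular CJK characters.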
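As a minimal sketch of the approach described in this answer, both strings from the question can be normalized to the same form and then compared. The helper name canonically_equal is made up for illustration, and Python 3 is assumed.

```python
import unicodedata


def canonically_equal(s1, s2, form="NFC"):
    """Return True if s1 and s2 are equal after normalizing both to `form`.

    Hypothetical helper for illustration; `form` is passed straight
    through to unicodedata.normalize.
    """
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)


# The two strings from the question, written with escapes so that the
# differing code point (U+7099 vs. U+F9FB) is visible.
a = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\u7099\u304b\u3044\u3057\u3083\u3059\u308b"
b = "\u4eba\u53e3\u3058\u3093\u3053\u3046\u306b\u81be\uf9fb\u304b\u3044\u3057\u3083\u3059\u308b"

print(a == b)                   # plain code point comparison
print(canonically_equal(a, b))  # comparison after NFC normalization
```

Normalizing both sides, rather than only the one suspected of containing compatibility characters, keeps the comparison symmetric.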

Answer 2

字符U+F9FB（）是CJK Compatibility Ideograph。这些字符是常规CJK字符的不同代码点，但在规范化时它们会分解为一个或多个常规CJK字符。

Unicode有一个名为UCA的官方字符串整理算法，专门用于此目的。从3.7，^*开始，Python没有UCA支持，但是有像pyuca这样的第三方库：

>>> from pyuca import Collator
>>> ck = Collator().sort_key
>>> ck(a) == ck(b)
True

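For this case, and for many others though definitely not all, picking an appropriate normalization to apply to both strings before comparing them will work, and it has the advantage of support built into the stdlib.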
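Which normalization form counts as "appropriate" depends on the kind of difference that should be ignored. The sketch below (Python 3 assumed; the helper same is hypothetical) compares half-width and full-width katakana, which differ only by a compatibility mapping, and a pair of strings that differ only in case, which no normalization form unifies.

```python
import unicodedata

half = "ｴﾚﾍﾞｰﾀｰ"       # half-width katakana
full = "エレベーター"   # full-width katakana


def same(form, s1, s2):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, s1) == unicodedata.normalize(form, s2)


print(same("NFC", half, full))    # canonical equivalence only
print(same("NFKC", half, full))   # also folds compatibility variants

# Case differences are outside the scope of normalization altogether.
print(same("NFKC", "hello", "HELLO"))
print("hello".casefold() == "HELLO".casefold())
```

NFKC discards distinctions that may matter elsewhere, so it is better suited to matching than to storing text.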

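* The idea has been accepted in principle since 3.4, but nobody has written an implementation, in part because most of the core devs who care are using pyuca or one of the two ICU bindings, which have the advantage of working in both current and older versions of Python.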

Answer 3

I would use PyICU and its Collator class. But first, you should think about which level of the Unicode collation algorithm you want equality to be decided at.

For example:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from icu import Collator

coll = Collator.createInstance()
coll.setStrength(Collator.IDENTICAL)

# Canonically equivalent strings: b has U+F9FB where a has U+7099.
a = u'人口じんこうに膾炙かいしゃする'
b = u'人口じんこうに膾\uf9fbかいしゃする'
print(repr(a))
print(repr(b))
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

a = u'ｴﾚﾍﾞｰﾀｰ'
b = u'エレベーター'
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

coll.setStrength(Collator.PRIMARY)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

a = u'hello'
b = u'HELLO'
coll.setStrength(Collator.PRIMARY)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))

coll.setStrength(Collator.TERTIARY)
print ('%s == %s : %s' % (a, b, coll.equals(a,b)))
