Question

I have a list of emoji codes in a plain-text file, UTF32.red.codes. The plain-text content of the file is:

\U0001F600
\U0001F601
\U0001F602
\U0001F603 
\U0001F604
\U0001F605
\U0001F606
\U0001F609
\U0001F60A
\U0001F60B

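Based on this question, my idea is to build a regular expression from the file's contents to capture the emojis. Here is my minimal working example: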

import re

with open('UTF32.red.codes','r') as emof:
   codes = [emo.strip() for emo in emof]
   emojis = re.compile(u"(%s)" % "|".join(codes))

string = u'string to check \U0001F601'
found = emojis.findall(string)

print found

found is always empty. What is wrong? I am using Python 2.7.

Answer 1

Your code works fine in Python 3 (just change print found to print(found)). In Python 2.7, however, it won't work, because its re module has a known bug (see this thread and this issue).
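
As a minimal Python 3 sketch of the point above: since Python 3.3, re interprets \UXXXXXXXX escapes inside a pattern, so the literal text read from the file can be joined into a working alternation. The inline codes list is an assumption standing in for the stripped lines of UTF32.red.codes, so the snippet runs without the file.

```python
import re

# Assumption: stand-in for the stripped lines of UTF32.red.codes.
# Each entry is the literal text backslash-U-0001F600, not the emoji itself.
codes = [r"\U0001F600", r"\U0001F601", r"\U0001F602"]

# Python 3.3+ `re` resolves \UXXXXXXXX escapes found in the pattern text.
emojis = re.compile("(%s)" % "|".join(codes))

string = "string to check \U0001F601"
found = emojis.findall(string)

print(found)
```

Here found holds the single emoji that appears in the test string.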

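If you still need a Python 2 version of the code, just use the regex module, which can be installed with pip2 install regex. Then import it with import regex and replace every re. call with regex. (i.e. re.compile becomes regex.compile and re.findall becomes regex.findall). It should work.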

Answer 2

This code works with Python 2.7:

import re
with open('UTF32.red.codes','rb') as emof:
    codes = [emo.decode('unicode-escape').strip() for emo in emof]
    emojis = re.compile(u"(%s)" % "|".join(map(re.escape,codes)))

search = ur'string to check \U0001F601'
found = emojis.findall(search)

print found
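
For comparison, here is a hedged Python 3 sketch of the same idea: decode each escape sequence into the real character before building the pattern, so the regex never has to interpret \U itself. The io.BytesIO object is an assumption standing in for open('UTF32.red.codes', 'rb'), and the three sample lines are only an excerpt of the file.

```python
import io
import re

# Assumption: stand-in for open('UTF32.red.codes', 'rb').
# The bytes are the literal escape text, one code per line.
emof = io.BytesIO(b"\\U0001F600\n\\U0001F601 \n\\U0001F602\n")

# Turn each escape into the actual character, then drop surrounding whitespace.
codes = [line.decode("unicode_escape").strip() for line in emof]

# re.escape is a precaution in case a decoded entry is a regex metacharacter.
emojis = re.compile("(%s)" % "|".join(map(re.escape, codes)))

search = "string to check \U0001F601"
found = emojis.findall(search)

print(found)
```

Decoding first means the pattern contains the emoji characters directly, which is why this route does not depend on the regex engine understanding \U escapes.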
