Question

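I'm currently working on a project that reads log files from an SMTP server and extracts meaningful information about every email that passes through it. I have a table with several columns that will later be used for searching: spam score, from domain, to domain, timestamp, subject, and so on. Everything works fine until some non-ASCII characters show up, usually in the subject field (as you would expect).

I tried decoding the str as ISO-8859-1 (the file's encoding) and saving it, and I also tried encoding it back to UTF-8; to be honest, I'm a bit lost here. I've heard that working with unicode in Python 2.7 is a nightmare, but until now I hadn't experienced it.

Anyway, let me explain. This is how I extract the subject: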

    if 'subject' in realInfo:
        emailDict[keywrd].setSubject(
            realInfo[realInfo.index('subject') + len('subject') + 1:].decode('ISO-8859-1'))

emailDict is a dictionary that holds all the emails currently being processed.

And this is how I insert everything into the database:
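As a minimal sketch of the extraction step above: the helper name `extract_subject` and the sample log line are hypothetical, since the real log format is not shown in the post. It only illustrates that decoding an ISO-8859-1 bytestring yields a unicode string, so the subject itself is already in the form sqlite expects.

```python
def extract_subject(raw_line, marker=b'subject', encoding='ISO-8859-1'):
    """Return the text after `marker` as unicode, or None if it is absent.

    `raw_line` is a bytestring as read from the log file (hypothetical format).
    """
    pos = raw_line.find(marker)
    if pos == -1:
        return None
    # Skip the marker plus one separator character, like the original slice.
    return raw_line[pos + len(marker) + 1:].decode(encoding)


# Hypothetical log line containing the byte 0xFF in the subject.
line = b'id=42 subject=VPXL \xff no more compromises.\n'
subject = extract_subject(line)
```

ISO-8859-1 maps every byte to a code point, so this decode step cannot fail; the 0xFF byte simply becomes U+00FF.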

    info = (e.getID(), str(e.getSpamScore()), str(e.getMCPScore()), " ".join(e.getFrom()), " ".join(e.getTo()), e.getStatus(), e.getTimestamp(), e.getSubject(), dumps(e))
    print repr(e.getSubject())  # DEBUG
    print type(e.getSubject())  # DEBUG
    self.conn.cursor().execute(u"INSERT INTO emails (emailID, SpamScore, MCPScore, FromDomain, ToDomain, status, timestamp, subject, object)"
                      " VALUES (?,?,?,?,?,?,?,?,?)", info)
    self.conn.commit()

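I added the two print statements to help me see where the problem is.

'e' is an Email object that serves as a blueprint for each email. It holds the information previously gathered by the interpreter. After that, I save the most important information in columns that, as mentioned, will be used for searching (the "object" column holds the Email object itself, pickled). But as soon as a special character shows up, an exception is raised: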

    u'VPXL \xffM-^W no more compromises. Better size, better life. \n'
    <type 'unicode'>
    Exception in thread Thread-25:
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
        self.run()
      File "/usr/local/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/ProjMail/projMail_lib.py", line 174, in refresher
        self.interpreter.start()
      File "/ProjMail/projMail_lib.py", line 213, in start
        c.save(self.emailTracker)
      File "/ProjMail/projMail_lib.py", line 56, in save
        self.saveEmails()
      File "/ProjMail/projMail_lib.py", line 62, in saveEmails
        else: self.add(key) # If it's new
      File "/ProjMail/projMail_lib.py", line 82, in add
        " VALUES (?,?,?,?,?,?,?,?,?)", info)
    ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

As far as I can tell, the subject is unicode, so I can't understand why sqlite is complaining. Any idea what I'm doing wrong here? Thanks in advance!

Answer 1

The problem is not inserting the subject itself into the database, but inserting the pickled Email instance.

>>> import pickle, sqlite3
>>> subject = u'VPXL \xffM-^W no more compromises. Better size, better life. \n'
>>> conn = sqlite3.connect(':memory:')
>>> c = conn.cursor()                            
>>> c.execute("""CREATE TABLE foo (bar text, baz text)""")                                   
<sqlite3.Cursor object at 0x7fab5cf280a0>
>>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, 'random text'))
<sqlite3.Cursor object at 0x7fab5cf280a0>

>>> class Email(object):pass
... 
>>> e = Email()
>>> e.subject = subject
>>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, pickle.dumps(e)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

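Pickling the Email instance creates a bytestring with mixed encodings internally, and that triggers the exception (pickling just the subject does the same).

To prevent the exception, you can change the connection's text_factory attribute to str: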
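To see why the pickle is the culprit, the following sketch pickles only the subject and inspects the resulting bytes. It assumes protocol 0, which is the default in Python 2; the variable names are illustrative. The pickle turns out to be a bytestring containing the raw byte 0xFF, which is the kind of 8-bit bytestring the error message refers to.

```python
import pickle

subject = u'VPXL \xff no more compromises. Better size, better life. \n'

# Protocol 0 (the Python 2 default) writes text with the raw-unicode-escape
# codec, so U+00FF lands in the pickle as the single byte 0xFF.
pickled = pickle.dumps(subject, protocol=0)

# True if any byte has its high bit set, i.e. the data is not pure ASCII.
has_8bit_bytes = any(b > 0x7F for b in bytearray(pickled))

# The pickle is not valid UTF-8 either, so it cannot be treated as text.
try:
    pickled.decode('utf-8')
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

The subject on its own is a unicode string and inserts fine; only its pickled form is a non-ASCII bytestring.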

>>> conn.text_factory = str
>>> c.execute("""INSERT INTO foo VALUES (?, ?)""", (subject, pickle.dumps(e)))
<sqlite3.Cursor object at 0x7fab5b3343b0>

If you would rather keep the default unicode text_factory, you can store the pickled instance in a blob column, wrapped in a buffer instance.

>>> conn.text_factory = unicode
>>> c.execute("""CREATE TABLE foo2 (bar text, baz blob)""")
>>> c.execute("""INSERT INTO foo2 VALUES (?, ?)""", (subject, buffer(pickle.dumps(e))))
<sqlite3.Cursor object at 0x7fab5b3343b0>

The pickled instance is restored on retrieval:

>>> c.execute("""SELECT bar, baz FROM foo2""")
<sqlite3.Cursor object at 0x7fab5b3343b0>
>>> res = c.fetchone()
>>> res
(u'VPXL \xffM-^W no more compromises. Better size, better life. \n', <read-write buffer ptr 0x7fab5e9706c8, size 167 at 0x7fab5e970688>)
>>> pickle.loads(res[1])
<__main__.Email object at 0x7fab5b333ad0>
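The blob approach can also be written as a single script. This is a sketch under a few assumptions: a plain dict stands in for the Email instance, the two-column `emails` table is a reduced version of the one in the question, and `sqlite3.Binary` is used in place of calling `buffer` directly so the same code also runs on Python 3.

```python
import pickle
import sqlite3

subject = u'VPXL \xff no more compromises. Better size, better life. \n'
email = {'subject': subject}  # stand-in for the Email instance (assumption)

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
# Reduced, illustrative schema: text for the subject, BLOB for the pickle.
cur.execute("CREATE TABLE emails (subject TEXT, object BLOB)")

# sqlite3.Binary marks the pickled bytes as binary data rather than text,
# so the default unicode text_factory never has to interpret them.
cur.execute("INSERT INTO emails VALUES (?, ?)",
            (subject, sqlite3.Binary(pickle.dumps(email))))
conn.commit()

cur.execute("SELECT subject, object FROM emails")
stored_subject, stored_blob = cur.fetchone()

# bytes() turns the returned buffer (Python 2) or bytes (Python 3) object
# into a plain bytestring that pickle.loads accepts.
restored = pickle.loads(bytes(stored_blob))
conn.close()
```

The subject stays a searchable text column, while the pickled object is kept out of the text path entirely.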
