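Python findall, regex, unicode

Time: 2015-02-06 12:32:35

Tags: python regex unicode findall

I am trying to write a Python script that searches a directory tree, lists all .flac files, derives Artist, Album and Title from the respective dir/subdir/filename, and writes the result to a file. The code works fine until it reaches a Unicode character. Here is the code: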

import os, glob, re

def scandirs(path):
    for currentFile in glob.glob(os.path.join(path, '*')):
        if os.path.isdir(currentFile):
            scandirs(currentFile)
        if os.path.splitext(currentFile)[1] == ".flac":
            rpath = os.path.relpath(currentFile)
            print "**DEBUG** rpath =", rpath
            title = os.path.basename(currentFile)
            title = re.findall(u'\d\d\s(.*).flac', title, re.U)
            title = title[0].decode("utf8")
            print "**DEBUG** title =", title
            fpath = os.path.split(os.path.dirname(currentFile))
            artist = fpath[0][2:]
            print "**DEBUG** artist =", artist
            album = fpath[1]
            print "**DEBUG** album =", album
            out = "%s | %s | %s | %s\n" % (rpath, artist, album, title)
            flist = open('filelist.tmp', 'a')
            flist.write(out)
            flist.close()

scandirs('./')

Output of the code:

**DEBUG** rpath = Thriftworks/Fader/Thriftworks - Fader - 01 180°.flac
**DEBUG** title = 180°
**DEBUG** artist = Thriftworks
**DEBUG** album = Fader
Traceback (most recent call last):
  File "decflac.py", line 25, in <module>
    scandirs('./')
  File "decflac.py", line 7, in scandirs
    scandirs(currentFile)
  File "decflac.py", line 7, in scandirs
    scandirs(currentFile)
  File "decflac.py", line 20, in scandirs
    out = "%s | %s | %s | %s\n" % (rpath, artist, album, title)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 46: ordinal not in range(128)

But when I try it in the Python console, it works fine:

>>> import re
>>> title = "Thriftworks - Fader - 01 180°.flac"
>>> title2 = "dummy"
>>> title = re.findall(u'\d\d\s(.*).flac', title, re.U)
>>> title = title[0].decode("utf8")
>>> out = "%s | %s\n" % (title2, title)
>>> print out
dummy | 180°

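So, my questions: 1) Why does the same code work in the console but not in the script? 2) How do I fix the script?

4 answers:

Answer 0: (score: 0)

The Python console works with your terminal and interprets Unicode encodings according to its locale.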

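Replace that line with the newer str.format on a unicode format string (every argument must then be unicode as well; a byte string such as rpath that contains non-ASCII bytes still triggers the implicit ASCII decode):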

out = u"{} | {} | {} | {}\n".format(rpath, artist, album, title)

Encode as utf8 when writing to the file:

with open('filelist.tmp', 'a') as f:
    f.write(out.encode('utf8'))

Or import codecs and let it encode directly:

with codecs.open('filelist.tmp', 'a', encoding='utf8') as f:
    f.write(out)

Or, on Python 3 with a UTF-8 locale, where utf8 is the default encoding for text files:

with open('filelist.tmp', 'a') as f:
    f.write(out)
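
As a hedged sketch of the "encode on write" advice above: the helper below (append_record is a hypothetical name, not part of the original script) uses io.open, which is available in Python 2.6+ and Python 3 and takes an encoding argument. It assumes all four fields are already text (unicode), not byte strings.

```python
import io
import os
import tempfile


def append_record(list_path, rpath, artist, album, title):
    """Append one ' | '-separated record to list_path, encoded as UTF-8.

    Hypothetical helper: all arguments are assumed to be text, not bytes.
    """
    line = u"{0} | {1} | {2} | {3}\n".format(rpath, artist, album, title)
    # io.open encodes text on write, so no manual .encode() call is needed.
    with io.open(list_path, "a", encoding="utf-8") as handle:
        handle.write(line)
    return line


# Usage: write into a throwaway directory, then read the raw bytes back.
demo_dir = tempfile.mkdtemp()
demo_list = os.path.join(demo_dir, "filelist.tmp")
append_record(demo_list, u"Thriftworks/Fader/01 180\u00b0.flac",
              u"Thriftworks", u"Fader", u"180\u00b0")
with io.open(demo_list, "rb") as handle:
    raw = handle.read()
```

Passing the encoding explicitly keeps the result independent of the locale, which is the main reason to prefer it over relying on a default.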

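Answer 1: (score: 0)

  1. In the console, the terminal settings determine the encoding. Nowadays that is mostly Unicode (typically UTF-8) on Unix-like systems such as Linux, BSD and macOS, and Windows-1252 on Windows. In a script, the interpreter falls back to the encoding of the Python source file, which is usually ascii (unless your code starts with a UTF byte-order mark).

  2. I'm not entirely sure, but prefixing the string "%s | %s | %s | %s\n" with u, so that it becomes a unicode string, might help.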
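
To make the failure in the traceback concrete, here is a minimal sketch written in Python 3 syntax. Python 2 decodes a byte string as ASCII implicitly whenever it is combined with a unicode string in % formatting; the sketch performs that decode explicitly. The helper name first_undecodable_offset is hypothetical, and the path is assumed to be the UTF-8 bytes that glob returned in the question.

```python
def first_undecodable_offset(raw, codec="ascii"):
    """Return the offset of the first byte `codec` cannot decode, or None."""
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The relative path from the question, as UTF-8 encoded bytes.
rpath = u"Thriftworks/Fader/Thriftworks - Fader - 01 180\u00b0.flac".encode("utf-8")

offset = first_undecodable_offset(rpath)
print(offset)  # 46
```

The offset agrees with "position 46" in the traceback, which suggests that the byte string rpath, not the already decoded title, is what the ascii codec stumbled over.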

Answer 2: (score: 0)

Solved by switching to Python3, which handles the unicode case as expected. Replace:

title = title[0].decode("utf8")

with:

title = title[0]

There is not even any need to prefix the value of "out" with "u" or to specify an encoding when writing. I love Python3.
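
A minimal Python 3 sketch of the title extraction, for illustration only. In Python 3 file names are already str, so no decode step appears. The helper name extract_title is hypothetical, and the pattern is slightly stricter than the one in the question (the dot is escaped and the match is anchored at the end); that tightening is an assumption made here, not part of the original code.

```python
import re

# Two digits, whitespace, then the title, then a literal ".flac" at the end.
TITLE_RE = re.compile(r"\d\d\s(.*)\.flac$")


def extract_title(filename):
    """Return the title part of '... NN Title.flac', or None if no match."""
    match = TITLE_RE.search(filename)
    return match.group(1) if match else None


# "\u00b0" is the degree sign from the file name in the question.
title = extract_title("Thriftworks - Fader - 01 180\u00b0.flac")
```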

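Answer 3: (score: 0)

When using glob with file names that contain Unicode characters, use a Unicode string for the pattern. This makes glob return Unicode strings instead of byte strings. On output, printing a Unicode string automatically encodes it in the console's encoding. If your songs have characters that the console encoding does not support, you will still run into problems. In that case, write the data to a UTF-8 encoded file and view it in an editor that supports UTF-8.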

>>> import glob
>>> for f in glob.glob('*'): print f
...
ThriftworksFaderThriftworks - Fader - 01 180░.flac
>>> for f in glob.glob(u'*'): print f
...
ThriftworksFaderThriftworks - Fader - 01 180°.flac
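
The same "type in, type out" rule carries over to Python 3, where a text pattern yields str results and a bytes pattern yields bytes results. The sketch below checks this in a throwaway directory; the file name is an arbitrary placeholder.

```python
import glob
import os
import tempfile

# Create a throwaway directory holding one empty file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "01 track.flac"), "w").close()

# A text pattern returns text; a bytes pattern returns bytes.
text_hits = glob.glob(os.path.join(demo_dir, "*.flac"))
bytes_hits = glob.glob(os.path.join(os.fsencode(demo_dir), b"*.flac"))

print(type(text_hits[0]).__name__, type(bytes_hits[0]).__name__)  # str bytes
```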

Passing a Unicode string also works with os.walk, which is a simpler way to search recursively:

#!python2
import os, fnmatch

def scandirs(path):
    # A unicode start path makes os.walk yield unicode file names.
    for root, dirs, files in os.walk(path):
        for f in files:
            if fnmatch.fnmatch(f, u'*.flac'):
                # File names look like "Artist - Album - NN Title.flac".
                artist, album, tracktitle = f.split(u' - ')
                track, title = tracktitle.split(u' ', 1)
                title = title[:-5]  # strip the ".flac" extension
                print 'Artist:', artist
                print 'Album: ', album
                print 'Track: ', track
                print 'Title: ', title

scandirs(u'.')

Output:

Artist: ThriftworksFaderThriftworks
Album:  Fader
Track:  01
Title:  180°
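
The question derives artist and album from the directory names rather than from the file name. A hedged Python 3 sketch of that variant follows; describe_flac is a hypothetical helper, and the Artist/Album/file layout is assumed from the path shown in the question.

```python
import re
from pathlib import PurePosixPath


def describe_flac(relpath):
    """Split 'Artist/Album/... NN Title.flac' into (artist, album, title).

    Hypothetical helper; returns None when the path does not fit the layout.
    """
    path = PurePosixPath(relpath)
    if path.suffix.lower() != ".flac" or len(path.parts) < 3:
        return None
    artist, album = path.parts[-3], path.parts[-2]
    # The title is whatever follows the two-digit track number in the stem.
    match = re.search(r"\d\d\s(.*)$", path.stem)
    if match is None:
        return None
    return artist, album, match.group(1)


record = describe_flac("Thriftworks/Fader/Thriftworks - Fader - 01 180\u00b0.flac")
```

PurePosixPath is used so that the sketch splits on "/" on every platform; in a real scan the paths would come from os.walk and could be passed through os.path.relpath first.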