Question

I use the following code to download PDF files from the web. It works for most files.

# -*- coding: utf8 -*-

import urllib2
import shutil
import urlparse
import os

def download(url, fileName=None):
    def getFileName(url,openUrl):
        if 'Content-Disposition' in openUrl.info():
            cd = dict(map(
                lambda x: x.strip().split('=') if '=' in x else (x.strip(),''),
                openUrl.info()['Content-Disposition'].split(';')))
            if 'filename' in cd:
                filename = cd['filename'].strip("\"'")
                if filename: return filename
        return os.path.basename(urlparse.urlsplit(openUrl.url)[2])

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName or getFileName(url,r)
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()

But for some files whose address contains special characters, for example:

download(u'http://www.poemhunter.com/i/ebooks/pdf/aogán_ó_rathaille_2012_5.pdf', 'c:\\the_file.pdf')

it raises a Unicode error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 21: ordinal not in range(128)

How can I fix this? Thanks.

Answer 1

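[I suppose this can count as an answer, since it shows another way to deal with the URL-encoding problem. But I mostly wrote it in response to Mark K's comment on dazedconfused's answer.]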

Maybe Acrobat is too strict; try another PDF tool.

I just downloaded that PDF using this code with Python 2.6.4 on Puppy Linux (Lupu 5.25):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
import urlparse

old_URL = u'http://www.poemhunter.com/i/ebooks/pdf/aogán_ó_rathaille_2012_5.pdf'

url_parts = urlparse.urlparse(old_URL)
url_parts = [urllib.quote(s.encode('utf-8')) for s in url_parts]
new_URL = urlparse.urlunparse(url_parts)
print new_URL

urllib.urlretrieve(new_URL, 'test.pdf')

The PDF file looks fine, but my PDF reader, epdfview, complains:

(epdfview:10632): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text()

but it seems to display the file fine.

This is what pdfinfo says:

Title:          AogÃ¡n Ã Rathaille - poems - 
Creator:        PoemHunter.Com
Producer:       PoemHunter.Com
CreationDate:   Wed May 23 00:44:47 2012
Tagged:         no
Pages:          7
Encrypted:      yes (print:yes copy:no change:no addNotes:no)
Page size:      612 x 792 pts (letter)
File size:      50469 bytes
Optimized:      no
PDF version:    1.3

I also downloaded it with my browser (Seamonkey 2.31), and, as expected, it is identical to the file retrieved via Python.
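A possible variation on the script in this answer, in case the URL also carries a query string: quote only the path and query, and leave the scheme and host alone. This is a minimal sketch, not a tested replacement. The helper name quote_url is made up for illustration, the try/except import is only there so the snippet runs on both Python 2 and Python 3, and non-ASCII host names are assumed not to occur.

```python
# -*- coding: utf-8 -*-
try:  # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit
except ImportError:  # Python 2
    from urllib import quote
    from urlparse import urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path and query of a Unicode URL (hypothetical helper)."""
    parts = urlsplit(url)
    # '%' is kept as safe so that already-quoted URLs are not quoted twice.
    path = quote(parts.path.encode('utf-8'), safe="/%")
    # '=' and '&' are kept so that the key=value structure survives.
    query = quote(parts.query.encode('utf-8'), safe="=&%")
    # Scheme, host and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(quote_url(u'http://www.poemhunter.com/i/ebooks/pdf/aogán_ó_rathaille_2012_5.pdf'))
# prints http://www.poemhunter.com/i/ebooks/pdf/aog%C3%A1n_%C3%B3_rathaille_2012_5.pdf
```

The resulting string is plain ASCII, so it can be handed to urllib.urlretrieve or urllib2.urlopen in the same way as new_URL.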

Answer 2

You have to encode on this line:

r = urllib2.urlopen(urllib2.Request(url.encode('utf-8')))

You need to pass a byte string to Request, so you have to call encode().
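To illustrate the difference, here is a small sketch (not part of the original code; the variable names are only for illustration). Encoding the URL from the question with the ASCII codec fails on the accented characters, which is presumably what happens implicitly when a unicode URL is sent unencoded, while an explicit UTF-8 encode succeeds:

```python
# -*- coding: utf-8 -*-
url = u'http://www.poemhunter.com/i/ebooks/pdf/aogán_ó_rathaille_2012_5.pdf'

# Encoding with the ASCII codec fails on the accented characters.
try:
    url.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print(exc)

# Encoding explicitly as UTF-8 yields a byte string with no implicit conversion left to do.
url_bytes = url.encode('utf-8')
print(repr(url_bytes))
```

Note that the byte string still contains raw non-ASCII bytes; whether a given server accepts them unquoted is not guaranteed, so percent-encoding the path may be the more robust option.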

Also, you may want to read Python's Unicode HOWTO and "How to percent-encode url parameters in python?"
