Question

I'm trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or produce blank PDFs.

I know it's a codec problem of some kind, but I can't seem to get it to work.

Answer 1

In this case you should use response.content:

with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

From the documentation:

For non-text requests, you can also access the response body as bytes:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

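That means response.text returns the body as a string (str) object; use it when you are downloading text, such as an HTML file.

response.content returns the body as a bytes object; use it when you are downloading a binary file, such as a PDF, an audio file, or an image.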
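To illustrate why the text route damages a PDF, here is a minimal offline sketch. The sample bytes are made up for the demonstration, and the decode step only approximates what happens when a binary body is treated as text; the exact encoding a real response would be decoded with is an assumption here.

```python
# Simulated body of a binary response: a PDF header followed by raw bytes.
# (Made-up sample data, not taken from a real download.)
pdf_bytes = b"%PDF-1.4\n\xe2\xe3\xcf\xd3\n"

# Roughly what happens when binary data is treated as text:
# byte sequences that are invalid in the chosen encoding get replaced...
as_text = pdf_bytes.decode("utf-8", errors="replace")

# ...so encoding the text again does not give back the original bytes.
round_tripped = as_text.encode("utf-8")

print(round_tripped == pdf_bytes)  # False: the data was altered
```

Writing the bytes object straight to a file opened in 'wb' mode avoids this round trip entirely.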

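For large files you can also stream the body instead of loading it all through response.content, either via response.raw or, as in the basic documentation example below, via iter_content: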

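import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk; any reasonable size works

# stream=True defers downloading the body until it is iterated over
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)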

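chunk_size is the chunk size you want to use. If you set it to 2000, requests downloads the first 2000 bytes of the file, writes them to disk, then does the same for the next 2000 bytes, over and over until the download is finished.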

So this can save you RAM. In this case, though, I'd prefer response.content because your file is small. As you can see, streaming the response is more involved.
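For completeness, here is a rough sketch of the response.raw variant. The helper name save_raw_stream is hypothetical, and the demonstration uses an in-memory stream standing in for response.raw so that it runs without a network connection; the commented lines show the assumed usage with requests.

```python
import io
import os
import shutil
import tempfile


def save_raw_stream(raw, path):
    """Copy a file-like object (e.g. response.raw) to disk in chunks."""
    with open(path, "wb") as f:
        shutil.copyfileobj(raw, f)


# Intended use with requests (not executed here):
#   r = requests.get(url, stream=True)
#   r.raw.decode_content = True  # undo gzip/deflate content encoding, if any
#   save_raw_stream(r.raw, '/tmp/metadata.pdf')

# Offline demonstration: an in-memory stream stands in for response.raw.
fake_raw = io.BytesIO(b"%PDF-1.4 example bytes")
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pdf")
save_raw_stream(fake_raw, demo_path)
```

shutil.copyfileobj reads and writes in fixed-size blocks, so memory use stays small regardless of file size.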

Related:

Here is a decent explanation/solution for finding and downloading all PDF files on a web page:

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48

Answer 4

You can use urllib:

import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
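If you want a bit more control than urlretrieve gives (for example a timeout), a possible alternative is urlopen combined with shutil.copyfileobj. This is only a sketch: the function name fetch_to_file and the 30-second default timeout are assumptions made for illustration.

```python
import shutil
import urllib.request


def fetch_to_file(url, dest, timeout=30):
    """Stream the resource at `url` into the file `dest` as raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, \
            open(dest, "wb") as out:
        # Copies in blocks, so the whole file is never held in memory.
        shutil.copyfileobj(resp, out)
    return dest


# Example call (not executed here):
#   fetch_to_file(url, "filename.pdf")
```

Because the destination is opened in 'wb' mode, the bytes are written exactly as received, with no text decoding step.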

Answer 5

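Please note that I'm a beginner. If my solution is wrong, feel free to correct it and/or let me know. I might learn something new too.

My solution:

Change downloadPath to wherever you want the file to be saved. Feel free to use an absolute path as well.

Save the following as downloadFile.py.

Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension

Remember to add the extension!

Example usage: python downloadFile.py http://www.google.co.uk google.html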

import requests
import sys
import os

def downloadFile(url, fileName):
    # Fetch first, so a failed request does not leave an empty file behind
    response = requests.get(url)
    response.raise_for_status()
    with open(fileName, "wb") as file:
        file.write(response.content)


scriptPath = sys.path[0]  # directory containing this script
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]
print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, os.path.join(downloadPath, fileName))
print('file downloaded...')
print('exiting program...')

Answer 6

Regarding Kevin's answer: to write into a tmp folder, it should look like this:

with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

He forgot the . before the path, and of course your tmp folder should already have been created.
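The requirement that the folder already exists can also be handled in code. Below is a minimal sketch, assuming you are happy for the script to create the folder itself; the helper name write_bytes_creating_dirs is hypothetical.

```python
import os


def write_bytes_creating_dirs(path, data):
    """Write `data` to `path`, creating the parent folder first if needed."""
    folder = os.path.dirname(path)
    if folder:
        # exist_ok=True makes this a no-op when the folder is already there
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


# Intended use (not executed here):
#   write_bytes_creating_dirs('./tmp/metadata.pdf', response.content)
```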

Download and save a PDF file with the Python requests module
