Python requests gets a 404, but wget fetches the correct page

Asked: 2014-01-07 10:02:07

Tags: python http python-requests http-authentication

I am trying to retrieve a web page that has to be accessed from behind a proxy and that also requires HTTP authentication:

$ wget -d --user=atwood --ask-password http://example.com/admin/admin.php

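This works fine; I paste the HTTP headers (request and response) below.

Retrieving the same page with python-requests returns a 404 error.

Here is the Python code; in my script it is preceded by the excellent method for debugging the requests library posted by user Inactivist: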

import requests
from pprint import pformat

url = 'http://example.com/admin/admin.php'

# Proxy to use for each URL scheme
proxy_config = {
    'http': '1.2.3.4',
    'https': '1.2.3.4',
    'ftp': '1.2.3.4'
}

# Headers copied from the wget request
head = {
    'User-Agent': 'Wget/1.13.4 (linux-gnu)',
    'Connection': 'Close',
    'Proxy-Connection': 'Keep-Alive'
}

# Send the request through the proxy with HTTP Basic credentials
response = requests.get(url, auth=('atwood', 'hunter2'), proxies=proxy_config, headers=head)

print("Status code: %s" % (response.status_code, ))
print("URL: %s" % (response.url, ))
print(pformat(response.text))

Below are the wget HTTP headers (request and response) from the run that does return the requested page correctly:

$ export http_proxy=http://1.2.3.4:3128
$ wget -d --user=atwood --ask-password  http://example.com/admin/admin.php
Setting --user (user) to atwood
Setting --ask-password (askpassword) to 1
Password for user `atwood': 
DEBUG output created by Wget 1.13.4 on linux-gnu.

URI encoding = `UTF-8'
URI encoding = `UTF-8'
--2014-01-07 11:15:59--  http://example.com/admin/admin.php
Host `example.com' has not issued a general basic challenge.
Connecting to 1.2.3.4:3128... connected.
Created socket 3.
Releasing 0x000000000159bf20 (new refcount 0).
Deleting unused 0x000000000159bf20.

---request begin---
GET http://example.com/admin/admin.php HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: example.com
Connection: Close
Proxy-Connection: Keep-Alive

---request end---
Proxy request sent, awaiting response... 
---response begin---
HTTP/1.0 401 Unauthorized
Date: Tue, 07 Jan 2014 09:16:00 GMT
Server: Apache/2.2.21 (Linux/SUSE)
X-Powered-By: PHP/5.3.8
WWW-Authenticate: Basic realm="CONTACT-ADMIN"
Content-Length: 43
Content-Type: text/html
X-Cache: MISS from proxyServer
X-Cache-Lookup: MISS from proxyServer:3128
Via: 1.0 proxyServer (squid/3.1.19)
Connection: keep-alive

---response end---
401 Unauthorized
Registered socket 3 for persistent reuse.
Skipping 43 bytes of body: [Login incorrect, please try again: |||BAD|
] done.
Inserted `example.com' into basic_authed_hosts
Reusing existing connection to 1.2.3.4:3128.
Reusing fd 3.

---request begin---
GET http://example.com/admin/admin.php HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: example.com
Connection: Close
Proxy-Connection: Keep-Alive
Authorization: Basic NjY2Njp0cmlwczEyMw==

---request end---
Proxy request sent, awaiting response...
---response begin---
HTTP/1.0 200 OK
Date: Tue, 07 Jan 2014 09:16:00 GMT
Server: Apache/2.2.21 (Linux/SUSE)
X-Powered-By: PHP/5.3.8
Cache-Control: no-cache, must-revalidate
Pragma: no-cache
Content-Type: text/html; charset=utf-8
X-Cache: MISS from proxyServer
X-Cache-Lookup: MISS from proxyServer:3128
Via: 1.0 proxyServer (squid/3.1.19)
Connection: close

---response end---
200 OK
URI content encoding = `utf-8'
Length: unspecified [text/html]
Saving to: `admin.php'

    [ <=>                            ] 14,096      --.-K/s   in 0.1s

2014-01-07 11:16:00 (92.8 KB/s) - `admin.php' saved [14096]
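In the log above, wget first sends the request without credentials, gets a 401 with a Basic challenge, and then retries with an `Authorization: Basic ...` header. As far as I know, passing `auth=(user, password)` to requests attaches that header on the first request instead of waiting for the challenge. The sketch below only shows how such a header value is built and read back with the standard library; the helper names `basic_auth_header` and `decode_basic_auth` are made up for illustration and are not part of requests or wget.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value):
    """Split a Basic Authorization header value back into (user, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header: %r" % header_value)
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first colon separates the user name from the password
    user, _, password = decoded.partition(":")
    return user, password
```

Comparing the header a client actually sends against the one you expect is a quick way to rule out a credentials mismatch before looking at the proxy.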

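You may notice that I have anonymized the URL I am trying to fetch. I have in fact triple-checked that the URL returning the 404 is the same URL I used with wget.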

1 Answer:

Answer 0 (score: 1)

It looks like the proxy port used in the Python code is different from the one used for wget (3128, versus the default of 8080).
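A minimal sketch of that fix, assuming the proxy is the same one wget uses at 1.2.3.4:3128 (taken from the `http_proxy` line in the question) and that HTTPS traffic should go through it as well. The helper names `build_proxies` and `fetch_admin_page` are hypothetical; only `requests.get` and its `auth` and `proxies` parameters come from the requests library.

```python
def build_proxies(host, port, scheme="http"):
    """Return a requests-style proxies mapping with an explicit scheme and port."""
    proxy_url = "%s://%s:%d" % (scheme, host, port)
    # Assumption: the same HTTP proxy handles both http:// and https:// targets
    return {"http": proxy_url, "https": proxy_url}


def fetch_admin_page(password):
    """Fetch the page from the question through the proxy on port 3128."""
    import requests  # third-party; imported here so build_proxies works without it

    proxies = build_proxies("1.2.3.4", 3128)
    return requests.get(
        "http://example.com/admin/admin.php",
        auth=("atwood", password),
        proxies=proxies,
    )
```

Calling `fetch_admin_page(...)` and printing `response.status_code` should then show whether the port was the only difference between the two clients.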