Python script to scan PDF files using an online scanner

Asked: 2014-12-01 22:23:54

Tags: python pdf web-scraping

I am using the script below to scan multiple PDF files in a folder with the online scanner "https://wepawet.iseclab.org/".

import mechanize
import re
import os

def upload_file(uploaded_file):
    url = "https://wepawet.iseclab.org/"
    br = mechanize.Browser()
    br.set_handle_robots(False) # ignore robots
    br.open(url)
    br.select_form(nr=0)
    f = os.path.join("200",uploaded_file)
    br.form.add_file(open(f) ,'text/plain', f)
    br.form.set_all_readonly(False)
    res = br.submit()
    content = res.read()
    with open("200_clean.html", "a") as f:
        f.write(content)

def main():

    for file in os.listdir("200"):
        upload_file(file)

if __name__ == '__main__':
    main()

But after running the code, I get the following error:

Traceback (most recent call last):
  File "test.py", line 56, in <module>
    main()
  File "test.py", line 50, in main
    upload_file(file)
  File "test.py", line 40, in upload_file
    res = br.submit()
  File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 541, in submit
    return self.open(self.click(*args, **kwds))
  File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/home/suleiman/Desktop/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error refresh: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
OK

Can anyone help me solve this problem?

1 Answer:

Answer 0 (score: 0)

I think the problem is the MIME type you set, text/plain. For a PDF, this should be application/pdf. With that change, your code worked for me when I uploaded a sample PDF.

Change the br.form.add_file call to:

br.form.add_file(open(f), 'application/pdf', f)
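If the folder might ever hold something other than PDFs, one option is to derive the MIME type from the file name rather than hard-coding it. The following is only a rough sketch: guess_mime and add_upload are hypothetical helper names, not part of mechanize, and form is assumed to be the br.form object from the question, exposing add_file(file_object, content_type, filename). It also opens the file in binary mode, which seems safer for PDFs than the default text mode.

```python
import mimetypes
import os


def guess_mime(path, default="application/octet-stream"):
    """Return a MIME type for `path` based on its file extension."""
    mime, _encoding = mimetypes.guess_type(path)
    # guess_type returns None for unknown extensions, so fall back to a default.
    return mime or default


def add_upload(form, path):
    """Attach `path` to a mechanize-style form with a guessed MIME type.

    Assumption: `form` has add_file(file_object, content_type, filename),
    like br.form in the question.
    """
    # Binary mode keeps the PDF bytes untouched (no newline translation).
    handle = open(path, "rb")
    form.add_file(handle, guess_mime(path), os.path.basename(path))
    # The caller should close the handle after submitting the form.
    return handle
```

In upload_file from the question, this would stand in for the add_file line, for example handle = add_upload(br.form, f), followed by handle.close() once br.submit() has returned.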