Mechanize does not handle cookies the way a browser does

Date: 2016-03-11 08:57:54

Tags: html perl cookies mechanize

I have the following code:

use WWW::Mechanize;
$url = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
$mech = WWW::Mechanize->new();
$mech->get($url);
$content = $mech->content();
while ($content =~ m/<META HTTP-EQUIV="refresh" CONTENT="(\d+); URL=(.+?)">/) {
    $refresh = $1;
    $link = $2;
    sleep $refresh;
    $mech->get($link);
    $content = $mech->content();
}
$mech->save_content("output.txt");

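When I put the URL assigned to $url into a browser, the end result is a PDF download, but when I run the code above I end up with a different file. I suspect Mechanize is not handling the cookies correctly. How can I make this work?

4 answers:

Answer 0 (score: 2)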

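When you request http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E, you are first redirected to https.

You then get a page with a META REFRESH, which points you to a file in /TMP.

After fetching https://daccess-ods.un.org/TMP/xxx.xxx.html and then following its META REFRESH to https://documents-dds-ny.un.org/doc/UNDOC/GEN/G15/263/87/PDF/G1526387.pdf?OpenElement, the document still does not download; an error message is displayed instead.

Inspecting the headers sent by the browser shows why: the browser has three cookies set, whereas WWW::Mechanize has only one: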

  • citrix_ns_id = XXX
  • citrix_ns_id_.un.org_%2F_wat = XXX
  • LtpaToken = XXX
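To make the same comparison from the Perl side, the cookies a jar currently holds can be listed with HTTP::Cookies' scan method. The following is a minimal sketch: cookie_names is a hypothetical helper, and the jar is filled by hand with a stand-in cookie so the example runs without network access.

```perl
use strict;
use warnings;
use HTTP::Cookies;

# Hypothetical helper: return "domain path name" for every cookie in a jar.
sub cookie_names {
    my ($jar) = @_;
    my @seen;
    # scan() calls the callback once per cookie; $_[1] is the cookie name,
    # $_[3] its path and $_[4] its domain.
    $jar->scan(sub { push @seen, "$_[4] $_[3] $_[1]" });
    my @sorted = sort @seen;
    return @sorted;
}

# Stand-in jar with one hand-made session cookie (assumed values).
my $jar = HTTP::Cookies->new;
$jar->set_cookie(0, 'citrix_ns_id', 'XXX', '/', '.un.org', undef, 1, 0, undef, 1);

print "$_\n" for cookie_names($jar);
```

With WWW::Mechanize, the same helper could be called as cookie_names($mech->cookie_jar) after each get to see which cookies have arrived so far.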

So where do these cookies come from? It turns out that the TMP HTML is more than just a META REFRESH. It also contains this HTML:

<frameset ROWS="0,100%" framespacing="0" FrameBorder="0" Border="0">
  <frame name="footer" scrolling="no" noresize target="main" src="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234" marginwidth="0" marginheight="0">
  <frame name="main" src="" scrolling="auto" target="_top">
  <noframes>
  <body>
  <p>This page uses frames, but your browser doesn't support them.</p>
  </body>
  </noframes>
</frameset>

This URL, https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234, sets these cookies:

Set-Cookie: LtpaToken=xxx; domain=.un.org; path=/
Set-Cookie: citrix_ns_id=xxx; Domain=.un.org; Path=/; HttpOnly
Set-Cookie: citrix_ns_id_.un.org_%2F_wat=xxx; Domain=.un.org; Path=/
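A client that only follows the META REFRESH never requests that frame URL, so it never receives these headers. As an illustration, here is a rough sketch that pulls the non-empty frame sources out of such a page with a regular expression. The helper name frame_sources and the trimmed sample markup are assumptions for the example, and a regex is only a stand-in for a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the non-empty src attributes of <frame> tags.
sub frame_sources {
    my ($html) = @_;
    my @src;
    # \b after "frame" keeps <frameset> from matching.
    while ($html =~ /<frame\b[^>]*?\bsrc\s*=\s*"([^"]*)"/gi) {
        push @src, $1 if length $1;
    }
    return @src;
}

# Trimmed copy of the frameset shown above.
my $sample = <<'HTML';
<frameset ROWS="0,100%" framespacing="0" FrameBorder="0" Border="0">
  <frame name="footer" scrolling="no" noresize target="main" src="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234" marginwidth="0" marginheight="0">
  <frame name="main" src="" scrolling="auto" target="_top">
</frameset>
HTML

print "frame to fetch: $_\n" for frame_sources($sample);
```

Only the "footer" frame is reported, because the "main" frame has an empty src.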

So, changing the code to take this into account:

use strict;
use WWW::Mechanize;

my $url = "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E";
my $mech = WWW::Mechanize->new();
$mech->get($url);
my $more = 1;
while ($more) {
    $more = 0;
    my $follow_link;
    my @links = $mech->links;
    foreach my $link (@links) {
        # Remember the META REFRESH target; it is followed after the frames.
        if ($link->tag eq 'meta') {
            $follow_link = $link;
        }
        # Fetch each frame so that its Set-Cookie headers reach the cookie jar,
        # then go back to the page that contained it.
        if (($link->tag eq 'frame') && ($link->url)) {
            $mech->follow_link( url => $link->url );
            $mech->back;
        }
    }
    # Keep looping for as long as the current page carries a META REFRESH.
    if ($follow_link) {
        $more = 1;
        $mech->follow_link( url => $follow_link->url );
    }
}
$mech->save_content("output.txt");

output.txt now contains the PDF:

$ file output.txt
output.txt: PDF document, version 1.5
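The same check can be approximated inside the script by looking for the PDF signature at the start of the downloaded bytes. This is only a sketch: looks_like_pdf is a hypothetical helper, and the two strings below are made-up stand-ins for real responses.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the bytes start with the "%PDF-" signature.
sub looks_like_pdf {
    my ($bytes) = @_;
    return (defined($bytes) && $bytes =~ /\A%PDF-/) ? 1 : 0;
}

# Made-up sample inputs standing in for downloaded content.
for my $body ("%PDF-1.5\n...", "<html>error page</html>") {
    print looks_like_pdf($body) ? "PDF\n" : "not a PDF\n";
}
```

In the script above, one could call looks_like_pdf($mech->content) before save_content and report an error when the result is false.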

Answer 1 (score: 1)

When I entered that URL in a browser I got a 404, but try this code to get more detailed debugging output.

python -c $'import csv,sys; reader=csv.reader(sys.stdin)\nfor row in reader:\n  print("|".join(row))' < test.csv
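For the Perl script in the question, one possible way to get detailed output is LWP's handler mechanism, which WWW::Mechanize inherits from LWP::UserAgent. The sketch below is an assumption about what such tracing could look like, not the answerer's code; trace_traffic is a hypothetical helper name.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: log every request and response to STDERR.
# Works on any LWP::UserAgent, and WWW::Mechanize is a subclass of it.
sub trace_traffic {
    my ($ua) = @_;
    $ua->add_handler(request_send => sub {
        my ($request) = @_;
        print STDERR '> ', $request->method, ' ', $request->uri, "\n";
        print STDERR ">   Cookie: $_\n" for $request->header('Cookie');
        return;    # returning nothing lets the request go out as usual
    });
    $ua->add_handler(response_done => sub {
        my ($response) = @_;
        print STDERR '< ', $response->status_line, "\n";
        print STDERR "<   Set-Cookie: $_\n" for $response->header('Set-Cookie');
        return;
    });
    return $ua;
}

# No request is sent here; the handlers fire on later get() calls.
my $ua = trace_traffic(LWP::UserAgent->new);
```

Calling trace_traffic($mech) right after WWW::Mechanize->new() would print each hop of the redirect chain along with the cookies sent and received.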

Answer 2 (score: -1)

You could try adding a cookie jar in the constructor, along these lines:

use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies->new(file => $cookie_file, autosave => 1, ignore_discard => 1);
my $mech = WWW::Mechanize->new('ssl_opts'=> {'SSL_verify_mode'=>'SSL_VERIFY_NONE'}, cookie_jar => $cookie_jar, autocheck => 0);

If you want to save the cookies and load them again later to keep your session, do this:

# after the content call
$cookie_jar->save;

To load the cookies:

# before the get call (you may want a conditional that checks whether the cookie file exists)
$mech->cookie_jar->load($cookie_file);
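A minimal sketch of that save-then-load round trip, with the existence check included. The temporary file path and the hand-made LtpaToken cookie are assumptions so that the example runs offline. ignore_discard matters here because cookies without an expiry are otherwise dropped when the jar is written.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTTP::Cookies;

# Hypothetical location for the cookie file; any writable path would do.
my $cookie_file = tempdir(CLEANUP => 1) . '/cookies.txt';

# First run: store a session cookie (no expiry, discard flag set) and save.
my $jar = HTTP::Cookies->new(file => $cookie_file, ignore_discard => 1);
$jar->set_cookie(0, 'LtpaToken', 'xxx', '/', '.un.org', undef, 1, 0, undef, 1);
$jar->save;

# Later run: start from an empty jar and load only if the file is present.
my $restored = HTTP::Cookies->new(ignore_discard => 1);
$restored->load($cookie_file) if -e $cookie_file;

my @names;
$restored->scan(sub { push @names, $_[1] });    # $_[1] is the cookie name
print "restored: @names\n";
```

The restored jar could then be handed to WWW::Mechanize->new(cookie_jar => $restored) before the first get.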

Hope this helps.

Answer 3 (score: -1)

Here is how I automated this in VBA:

Private Declare Function FindWindow Lib "user32" Alias "FindWindowA" _
(ByVal lpClassName As String, ByVal lpWindowName As String) As Long

Private Declare Function FindWindowEx Lib "user32" Alias "FindWindowExA" _
(ByVal hWnd1 As Long, ByVal hWnd2 As Long, ByVal lpsz1 As String, _
ByVal lpsz2 As String) As Long

Private Declare Function SetCursorPos Lib "user32" _
(ByVal X As Integer, ByVal Y As Integer) As Long

Private Declare Function GetWindowRect Lib "user32" _
(ByVal hwnd As Long, lpRect As RECT) As Long

Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)

Private Declare Sub mouse_event Lib "user32.dll" (ByVal dwFlags As Long, _
ByVal dx As Long, ByVal dy As Long, ByVal cButtons As Long, ByVal dwExtraInfo As Long)

'~~> Window handles and coordinates are 32-bit values, so Long is used here
Private Declare Sub SetWindowPos Lib "user32" (ByVal hwnd As Long, ByVal _
    hWndInsertAfter As Long, ByVal X As Long, ByVal Y As Long, ByVal cx As _
    Long, ByVal cy As Long, ByVal wFlags As Long)


'~~> Constants for pressing left button of the mouse
Private Const MOUSEEVENTF_LEFTDOWN As Long = &H2
'~~> Constants for Releasing left button of the mouse
Private Const MOUSEEVENTF_LEFTUP As Long = &H4

Private Type RECT
    Left As Long
    Top As Long
    Right As Long
    Bottom As Long
End Type

Const HWND_TOPMOST = -1
Const HWND_NOTOPMOST = -2
Const SWP_NOSIZE = &H1
Const SWP_NOMOVE = &H2
Const SWP_NOACTIVATE = &H10
Const SWP_SHOWWINDOW = &H40

Dim ie As InternetExplorer

Sub GetFiles()

Set ie = New InternetExplorer

GetFileFromUrl "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E"
GetFileFromUrl "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/31&Lang=F"

End Sub



Sub GetFileFromUrl(url As String)



Dim pos As RECT
Dim SaveAsHwnd As Long
Dim SaveButtonHwnd As Long

ie.Navigate url

ie.Visible = True

While ie.ReadyState <> 4
    DoEvents
Wend

Sleep 7000

ie.ExecWB 4, 1, "c:\test.pdf"

Sleep 5000


SaveAsHwnd = FindWindow(vbNullString, "Save As")

If SaveAsHwnd <> 0 Then
    Debug.Print "Found Save As window"
Else
    Debug.Print "Did not find Save As window"
End If

SaveButtonHwnd = FindWindowEx(SaveAsHwnd, ByVal 0&, "Button", "&Save")

If SaveButtonHwnd <> 0 Then

    Debug.Print "Found Save button"

    ' click button
    'res = SendMessage(SaveButtonHwnd, TCM_SETCURFOCUS, 1, ByVal 0&)
    'res = PostMessage(SaveButtonHwnd, BM_CLICK, ByVal 0&, ByVal 0&)
    'res = SendMessage(SaveButtonHwnd, WM_COMMAND, 0&, 0&)

    GetWindowRect SaveButtonHwnd, pos

    '~~> Move the cursor to the specified screen coordinates.
    SetCursorPos (pos.Left - 10), (pos.Top - 10)
    '~~> Suspends the execution of the current thread for a specified interval.
    '~~> This give ample amount time for the API to position the cursor
    Sleep 100
    SetCursorPos pos.Left, pos.Top
    Sleep 100
    SetCursorPos (pos.Left + pos.Right) / 2, (pos.Top + pos.Bottom) / 2

    '~~> Set the size, position, and Z order of "File Download" Window
    SetWindowPos Ret, HWND_TOPMOST, 0, 0, 0, 0, SWP_NOACTIVATE Or SWP_SHOWWINDOW Or SWP_NOMOVE Or SWP_NOSIZE
    Sleep 100

    '~~> Simulate mouse motion and click the button
    '~~> Simulate LEFT CLICK
    mouse_event MOUSEEVENTF_LEFTDOWN, (pos.Left + pos.Right) / 2, (pos.Top + pos.Bottom) / 2, 0, 0
    Sleep 700
    '~~> Simulate Release of LEFT CLICK
    mouse_event MOUSEEVENTF_LEFTUP, (pos.Left + pos.Right) / 2, (pos.Top + pos.Bottom) / 2, 0, 0

Else

    Debug.Print "Did not find Save button"

End If


Sleep 5000


End Sub

Alternatively, the UIAutomation COM object can be used:

Sub GetFilesAutomation()

Dim o As IUIAutomation
Dim e As IUIAutomationElement

Dim SaveAsHwnd As LongPtr
Dim ie As New InternetExplorer
Set o = New CUIAutomation

ie.Navigate "http://daccess-ods.un.org/access.nsf/Get?Open&DS=A/HRC/WGAD/2015/28&Lang=E"

ie.Visible = True

Sleep 10000

ie.ExecWB 4, 1

Sleep 5000

SaveAsHwnd = FindWindow(vbNullString, "Save As")

Set e = o.ElementFromHandle(ByVal SaveAsHwnd)
Dim iCnd As IUIAutomationCondition
Set iCnd = o.CreatePropertyCondition(UIA_NamePropertyId, "Save")

Dim Button As IUIAutomationElement
Set Button = e.FindFirst(TreeScope_Subtree, iCnd)
Dim InvokePattern As IUIAutomationInvokePattern
Set InvokePattern = Button.GetCurrentPattern(UIA_InvokePatternId)
InvokePattern.Invoke

End Sub