Jsoup: how to get an email address that a script generates in the page

Time: 2014-02-20 23:58:00

Tags: java javascript web-scraping jsoup screen-scraping

I have a Document object:

Document secDoc = Jsoup.connect(a.attr("abs:href")).timeout(30*1000).get();
String txt = secDoc.text();

When I debug the code above and inspect the value of secDoc, I get the normal page source, which contains this element:

For questions about your order, including anything shipping or billing related, please email <script type="text/javascript">write_email('oatmealsupport','gmail.com')</script>.

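If you open the page in a browser yourself, you see the following line: For questions about your order, including anything shipping or billing related, please email oatmealsupport@gmail.com. We only do email support at this time. Interestingly, it is the script that generates the email address on the page. When I use Inspect Element, I get: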

<p>
                For questions about your order, including anything shipping or billing related, please email <a href="mailto:oatmealsupport@gmail.com">oatmealsupport@gmail.com</a><script type="text/javascript">write_email('oatmealsupport','gmail.com')</script>.
                We only do email support at this time.<br><br>
                Hours of operation: <strong>Monday-Friday 8am - 6pm PT.</strong>
                <br>
                <strong>Shipping Times</strong>:
                We strive to fulfill the orders within 3-5 working days. When we are really busy we may take a day or two longer. 
              We ship orders Monday - Friday, so if your order is placed Friday evening we may not be able to process it until the following Monday. 
                If we are behind, it may be a few days before we respond.  The Oatmeal is an extremely small operation so please be patient. 
                <br>
                <a href="http://shop.theoatmeal.com/pages/shipping">More Shipping Info</a><br><br>
                Questions about shirt sizes? <a href="http://shop.theoatmeal.com/pages/shipping#shirts">Shirt Sizing Info</a>
            </p>

So the anchor <a href="mailto:oatmealsupport@gmail.com">oatmealsupport@gmail.com</a> is generated by the script.

Is there any way I can get this anchor using Jsoup (or any other method)?

1 answer:

Answer 0 (score: 1)

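For this particular site, the user and domain parts of the address are both in the script tag, so select the script tag, get its text, parse that text with a regular expression, and join the user and domain with an @ in between. Your selector could probably just be script:contains(write_email), assuming write_email is not used elsewhere on the page. This only works because the address is exposed in the page text, even though it is split into two parts.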
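A minimal sketch of the parsing step described above, assuming the script body always has the form write_email('user','domain') with single-quoted arguments, as in the question. The class name WriteEmailParser and the method extractEmail are hypothetical names chosen for illustration. To keep the example dependent only on the standard library, it starts from the script text as a plain String; with Jsoup that string would come from the selected script element (for a script tag, Element.data() is the usual way to read its contents).

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WriteEmailParser {

    // Matches write_email('user','domain') and captures both single-quoted arguments.
    // Assumption: the site always uses single quotes and exactly two arguments.
    private static final Pattern WRITE_EMAIL = Pattern.compile(
            "write_email\\(\\s*'([^']+)'\\s*,\\s*'([^']+)'\\s*\\)");

    // Returns user@domain if the script text contains a write_email call, otherwise empty.
    public static Optional<String> extractEmail(String scriptText) {
        Matcher m = WRITE_EMAIL.matcher(scriptText);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "@" + m.group(2));
    }

    public static void main(String[] args) {
        // Script text copied from the page source shown in the question.
        String scriptText = "write_email('oatmealsupport','gmail.com')";
        System.out.println(extractEmail(scriptText).orElse("no write_email call found"));
    }
}

Returning an Optional keeps the "no match" case explicit, which matters if the page ever changes and the script call is no longer there.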

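In general, Jsoup is not a JavaScript engine. If you want to see the same page that a person using a web browser sees, you could try a browser automation tool such as Selenium.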