Question

I am trying to scrape a website that may require authentication. When I run the code below, I get this error:

org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/json; charset=utf-8, URL=https://sso.mims.com/Account/Signin
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:547)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
    at com.aiingov.proc.MedScraper.main(MedScraper.java:49)

public static void main(String[] args) throws IOException {

    String url = "https://sso.mims.com/Account/Signin";
    String userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36";

    // Load the sign-in page first to pick up the session cookies.
    Connection.Response response = Jsoup.connect(url).userAgent(userAgent)
            .method(Connection.Method.GET)
            .execute();

    // Submit the login form with those cookies.
    response = Jsoup.connect(url)
            .cookies(response.cookies())
            .data("action", "login")
            .data("login", "xxxxx")
            .data("password", "xxxxx")
            .data("auto_login", "1")
            .userAgent(userAgent)
            .method(Connection.Method.POST)
            .followRedirects(true)
            .execute();

    // Fetch the target page using the cookies returned by the login request.
    Document document = Jsoup.connect("https://www.mims.com/india/drug/info/abacavir/abacavir?type=full&mtype=generic")
            .cookies(response.cookies())
            .userAgent(userAgent)
            .get();

    System.out.println(document);

    Elements elements = document.body().select("*");

    for (Element element : elements) {
        System.out.println(element.ownText());
    }
}

Without the login code, I get the following output:

You will be redirected to your destination shortly.

How can I fix this?

Answer 1

Try using the ignoreContentType method.

 Jsoup.connect(url).ignoreContentType(true); // chain any other methods

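Method description from the jsoup docs:

Ignore the document's Content-Type when parsing the response. By default this is false, and an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.) Set to true to force a parse attempt regardless of content type.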
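
To see why the sign-in request fails, here is a rough, stdlib-only sketch of the kind of check the error message describes. It is not jsoup's actual implementation: the class name ContentTypeCheck and the method isParsedByDefault are hypothetical, and treating a missing header as acceptable is an assumption. It only mirrors the three content types listed in the exception text.

```java
import java.util.Locale;

public class ContentTypeCheck {

    // Hypothetical helper (not part of jsoup): returns true if the
    // Content-Type header is one of the types named in the error message
    // (text/*, application/xml, application/xhtml+xml).
    public static boolean isParsedByDefault(String contentType) {
        if (contentType == null) {
            return true; // assumption: no header means nothing to reject
        }
        // Drop parameters such as "; charset=utf-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mime.startsWith("text/")
                || mime.equals("application/xml")
                || mime.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        // The sign-in endpoint in the question answers with JSON.
        System.out.println(isParsedByDefault("application/json; charset=utf-8")); // false
        // An ordinary HTML page passes the check.
        System.out.println(isParsedByDefault("text/html; charset=utf-8"));        // true
    }
}
```

The sign-in URL returns application/json, which falls outside that list, so the POST in the question throws before the cookies can be read. Adding ignoreContentType(true) to that request should let execute() return the response so that response.cookies() can be passed on to the next request.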
