Question

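I have just started working on a networking assignment and I am already stuck. The assignment asks me to check website links supplied by the user and determine whether they are active or inactive by reading the header information. After some Googling, all I have so far is this code, which retrieves the website. I don't know how to go through this information and find the HTML links. Here is the code: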

import java.net.*; 
import java.io.*; 

public class url_checker { 
    public static void main(String[] args) throws Exception { 
        URL yahoo = new URL("http://yahoo.com"); 
        URLConnection yc = yahoo.openConnection(); 
        BufferedReader in = new BufferedReader( 
                                new InputStreamReader( 
                                yc.getInputStream())); 
        String inputLine; 
        int count = 0; 
        while ((inputLine = in.readLine()) != null) { 
            System.out.println (inputLine);                
            }      
        in.close(); 
    } 
}

Please help. Thank you!

Answer 1

You can also try jsoup, an HTML retriever and parser.

Document doc = Jsoup.parse(new URL("<url>"), 2000);

Elements resultLinks = doc.select("div.post-title > a");
for (Element link : resultLinks) {
    String href = link.attr("href");
    System.out.println("title: " + link.text());
    System.out.println("href: " + href);
}

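With this code you can list and analyze every link (a element) that is a direct child of a div with the class "post-title" on the page at the given URL.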

Answer 2

You can try this:

URL url = new URL(link);
Reader reader= new InputStreamReader((InputStream) url.getContent());
new ParserDelegator().parse(reader, new Page(), true);

Then create a class named Page:

class Page extends HTMLEditorKit.ParserCallback {

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // Look up href directly: the first attribute of an anchor is not
            // always href, and an anchor may have no attributes at all.
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                String link = href.toString();
                // save link somewhere
            }
        }
    }
}
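For completeness, here is a minimal sketch that puts the two pieces above together and collects the links into a list. The class name LinkCollector and its helper methods are made up for this illustration, and the sketch assumes you only care about the raw href values (relative links are not resolved).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: a parser callback that remembers every href it sees.
class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> links = new ArrayList<>();

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            // getAttribute returns null when the anchor has no href
            Object href = a.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    List<String> getLinks() {
        return links;
    }

    // Parses HTML from any Reader (a StringReader, or a reader over a URL stream).
    static List<String> collect(Reader reader) throws IOException {
        LinkCollector collector = new LinkCollector();
        new ParserDelegator().parse(reader, collector, true);
        return collector.getLinks();
    }

    // Convenience overload for HTML that is already in memory.
    static List<String> collectFromString(String html) {
        try {
            return collect(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To run it against a live page, pass a reader over the connection's stream instead, for example `LinkCollector.collect(new InputStreamReader(url.openStream()))`.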

Answer 3

"I don't know how to go through this information and find the HTML links"

"I can't use any external libraries for my assignment"

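You have a couple of options:

1) You can read the web page into an HTMLDocument. You can then get an iterator from the document to find all HTML.Tag.A tags. Once you have found an anchor tag, you can get HTML.Attribute.HREF from the tag's attribute set.

2) You can extend HTMLEditorKit.ParserCallback and implement the handleStartTag(...) method. Then, whenever an A tag is found, you can get the href attribute, which again contains the link. The basic code to invoke the parser callback is: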

MyParserCallback parser = new MyParserCallback();

// simple test
String file = "<html><head><here>abc<div>def</div></here></head></html>";
StringReader reader = new StringReader(file);

// read a page from the internet
//URLConnection conn = new URL("http://yahoo.com").openConnection();
//Reader reader = new InputStreamReader(conn.getInputStream());

try
{
    new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
    System.out.println(e);
}
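Option 1 could look roughly like the sketch below. The class name DocumentLinks and the method extractLinks are invented for this example. It assumes the page can be handled by the Swing HTML parser, and it sets the "IgnoreCharsetDirective" property so that a charset meta tag in the page does not abort the read.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper: loads HTML into an HTMLDocument and walks its A tags.
class DocumentLinks {

    static List<String> extractLinks(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Keep reading even if the page declares its own charset in a meta tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            // For a live page, pass a reader over the connection's stream here.
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("could not parse HTML", e);
        }

        List<String> links = new ArrayList<>();
        // The iterator visits each anchor in document order.
        for (HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
                it.isValid(); it.next()) {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
        return links;
    }
}
```

This uses only classes that ship with the JDK, so it should fit the "no external libraries" restriction.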

Answer 4

HtmlParser is what you need. You can do a lot of things with it.

Answer 5

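You need to get the HTTP status code that the server returns with the response. If the page does not exist, the server will send back a 404.

Have a look at this: http://download.oracle.com/javase/1.4.2/docs/api/java/net/HttpURLConnection.html

In particular, look at the getResponseCode method.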
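A minimal sketch of that idea follows. The class name LinkStatus, the 5 second timeouts, and the rule that 2xx and 3xx codes count as "active" are all assumptions made for this example; your assignment may define "active" differently.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for checking whether a link is alive.
class LinkStatus {

    // Assumption: any 2xx (success) or 3xx (redirect) status means "active".
    static boolean isActive(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    // Sends a HEAD request, so only the headers are transferred, not the body.
    static int fetchStatus(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // milliseconds
            conn.setReadTimeout(5000);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Treats unreachable hosts and other I/O failures as inactive.
    static boolean check(String address) {
        try {
            return isActive(fetchStatus(address));
        } catch (IOException e) {
            return false;
        }
    }
}
```

Calling `LinkStatus.check("http://yahoo.com")` would then give a single true/false answer per link. Some servers reject HEAD requests, in which case falling back to GET may be necessary.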

Answer 6

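I would parse the HTML with a tool like NekoHTML. It basically fixes malformed HTML for you and lets you access it like XML. You can then process the link elements and try to follow them, just as you did for the original page.

You can check out some sample code that does this.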

How to get HTML links from a URL
