How to retrieve specific HTML information from a given website

Time: 2019-06-13 17:40:25

Tags: c# html search html-agility-pack converters

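I'm trying to program for a Discord API, and I need to retrieve two pieces of information from the HTML of the web page https://myanimelist.net/character/214 (and other similar pages with URLs of the form https://myanimelist.net/character/N for integers N): specifically, the URL of the character picture (in this case https://cdn.myanimelist.net/images/characters/14/54554.jpg) and the character's name (in this case Youji Kudou). Afterwards, I need to save these two pieces of information to JSON.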

I'm using HtmlAgilityPack for this, but I can't quite figure it out. Here is my first attempt:

public static void Main()
{ 
    var html = "https://myanimelist.net/character/214";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");

    foreach (var node in htmlNodes.Descendants("tr/td/div/a/img"))
    {
        Console.WriteLine(node.InnerHtml);
    }
}

Unfortunately, this produces no output. If I followed the path correctly (which may be my first mistake), it should be "tr/td/div/a/img". I get no errors and the program runs, but nothing is printed.

My second attempt was:

public static void Main()
{
    var html = "https://myanimelist.net/character/214";
    HtmlWeb web = new HtmlWeb();
    var htmlDoc = web.Load(html);
    var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body");
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    var script = htmlDoc.DocumentNode.Descendants()
                     .Where(n => n.Name == "tr/td/a/img")
                     .First().InnerText;

    // Return the data of spect and stringify it into a proper JSON object
    var engine = new Jurassic.ScriptEngine();
    var result = engine.Evaluate("(function() { " + script + " return src; })()");
    var json = JSONObject.Stringify(engine, result);

    Console.WriteLine(json);
    Console.ReadKey();
}

But this doesn't work either.

How can I extract the information I need?

Edit:

I've now got a bit further and found a way to get the image link, which turned out to be quite simple. But I'm still stuck on finding the character's name. The site's structure is the same for every other link (only the final number changes), so I want to go through many different links with a for loop. This is what I'm trying to do: in the first foreach I try to find the name, which is always in the same position (e.g. http://prntscr.com/o1uo3c and http://prntscr.com/o1uo91, and specifically http://prntscr.com/o1xzbk), but I haven't worked out how, since the HTML structure there doesn't have any kind of body type I can follow. The second foreach loop searches for the URL, which works now; n should give me the name so that I can work it out for each different character.

2 Answers:

Answer 0: (score: 1)

I was able to extract the character name and image from https://myanimelist.net/character/214 using the following method:

public static CharacterData ExtractCharacterNameAndImage(string url)
{
    //Use the following if you are OK with hardcoding the structure of <div> elements.
    //var tableXpath             = "/html/body/div[1]/div[3]/div[3]/div[2]/table"; 
    //Use the following if you are OK with hardcoding the fact that the relevant table comes first.
    var tableXpath             = "/html/body//table"; 
    var nameXpath              = "tr/td[2]/div[4]";
    var imageXpath             = "tr/td[1]/div[1]/a/img";

    var htmlDoc = new HtmlWeb().Load(url);

    var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();

    var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
    var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();

    return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
}

CharacterData is defined as follows:

public class CharacterData
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
    public string Url { get; set; }
}

The character data can then be serialized to JSON using any of the tools from "How to write a JSON file in C#?", for example:

var url = "https://myanimelist.net/character/214";

var data = ExtractCharacterNameAndImage(url);
var json = JsonConvert.SerializeObject(data, Formatting.Indented);

Console.WriteLine(json);

Which outputs:

{
  "Name": "Youji Kudou",
  "ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
  "Url": "https://myanimelist.net/character/214"
}

If you would like Name to include the Japanese in parentheses, replace GetDirectInnerText() with InnerText, which results in:

{
  "Name": "Youji Kudou (工藤耀爾)",
  "ImageUrl": "https://cdn.myanimelist.net/images/characters/14/54554.jpg",
  "Url": "https://myanimelist.net/character/214"
}
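To make the difference between the two properties concrete, here is a minimal sketch that runs against inline markup rather than the live site. The HTML string and the names InnerTextDemo, SampleHtml and ReadName are assumptions made up for illustration; the real page's markup may be nested differently.

```csharp
using System;
using HtmlAgilityPack;

public static class InnerTextDemo
{
    // Hypothetical markup: direct text followed by a nested element,
    // loosely modelled on a name cell. Not copied from the real page.
    public const string SampleHtml =
        "<div id=\"name\">Youji Kudou <span><small>(工藤耀爾)</small></span></div>";

    // Returns the text of the div's own text nodes and the text of the whole subtree.
    public static (string Direct, string Full) ReadName(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var div = doc.DocumentNode.SelectSingleNode("//div[@id='name']");

        // GetDirectInnerText() only concatenates the node's direct text children;
        // InnerText also includes the text of nested elements.
        return (div.GetDirectInnerText().Trim(), div.InnerText.Trim());
    }

    public static void Main()
    {
        var (direct, full) = ReadName(SampleHtml);
        Console.WriteLine("Direct: " + direct);
        Console.WriteLine("Full:   " + full);
    }
}
```

With this sample, the first value should contain only the romanized name and the second should also contain the parenthesized text.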

Alternatively, if you prefer, you can pull the character name from the document title:

var title = string.Concat(htmlDoc.DocumentNode.SelectNodes("/html/head/title").Select(n => n.InnerText.Trim()));
var index = title.IndexOf("- MyAnimeList.net");
if (index >= 0)
    title = title.Substring(0, index).Trim();
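The title-trimming logic can also be pulled out into a small helper that is easy to try without any network access. This is only a sketch: the class name TitleParser and the sample title string are hypothetical, and the suffix is simply the one used in the snippet above, so it is an assumption about the site's title format.

```csharp
using System;

public static class TitleParser
{
    // Suffix taken from the snippet above; an assumption about how the site formats its titles.
    private const string Suffix = "- MyAnimeList.net";

    // Returns the part of the title before the site suffix, trimmed.
    // If the suffix is absent, the whole trimmed title is returned.
    public static string NameFromTitle(string title)
    {
        if (title == null)
            return null;

        var index = title.IndexOf(Suffix, StringComparison.Ordinal);
        return (index >= 0 ? title.Substring(0, index) : title).Trim();
    }

    public static void Main()
    {
        // Hypothetical title text, used only to exercise the helper.
        Console.WriteLine(NameFromTitle("  Youji Kudou - MyAnimeList.net "));
    }
}
```

Falling back to the full title when the suffix is missing mirrors the original snippet, which leaves title unchanged in that case.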

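How did I determine the correct XPath strings?

First, using Firefox 66, I opened the debugger and loaded https://myanimelist.net/character/214 in the window with the debugging tools visible.

Next, following the instructions from "How to find xpath of an element in firefox inspector", I selected the Youji Kudou (工藤耀爾) node and copied its XPath, which turned out to be: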

/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]

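I then tried to select this node using SelectNodes() ... and got a null result. But why? To find out, I wrote a debugging routine that breaks the path into successively longer portions and determines where the query starts to fail: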

static void TestSelect(HtmlDocument htmlDoc, string xpath)
{
    Console.WriteLine("\nInput path: " + xpath);
    var splitPath = xpath.Split('/');
    for (int i = 2; i <= splitPath.Length; i++)
    {
        if (splitPath[i-1] == "")
            continue;
        var thisPath = string.Join("/", splitPath, 0, i);
        Console.Write("Testing \"{0}\": ", thisPath);
        var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
        Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
    }
}

This outputs the following:

Input path: /html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]
Testing "/html": result count = 1
Testing "/html/body": result count = 1
Testing "/html/body/div[1]": result count = 1
Testing "/html/body/div[1]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table": result count = 1
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]": result count = null
Testing "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]": result count = null
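The prefix-building step of the routine above can be looked at on its own, without HtmlAgilityPack. The sketch below uses the same split-and-join logic; the class name XPathPrefixes and the method name Enumerate are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class XPathPrefixes
{
    // Yields successively longer prefixes of an absolute XPath,
    // using the same Split/Join approach as the TestSelect routine.
    public static IEnumerable<string> Enumerate(string xpath)
    {
        var parts = xpath.Split('/');
        for (int i = 2; i <= parts.Length; i++)
        {
            // An empty segment comes from "//"; a prefix ending there
            // would be an incomplete path, so it is skipped.
            if (parts[i - 1] == "")
                continue;
            yield return string.Join("/", parts, 0, i);
        }
    }

    public static void Main()
    {
        foreach (var prefix in Enumerate("/html/body//table/tr"))
            Console.WriteLine(prefix);
    }
}
```

Each yielded prefix can then be passed to SelectNodes() in turn; the first prefix that returns null marks the step where the document and the path disagree.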

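As you can see, something goes wrong when selecting the <tbody> path element. Manually inspecting the InnerHtml returned by selecting /html/body/div[1]/div[3]/div[3]/div[2]/table showed that the HTML received by the HtmlWeb object does not include a <tbody> tag. This might be due to some difference between the request headers sent by Firefox and by HtmlWeb. Another common cause is that browsers insert an implicit <tbody> element into the DOM when the markup places <tr> rows directly inside <table>, so an XPath copied from the inspector can contain a step that does not exist in the raw HTML that HtmlAgilityPack parses. After omitting the tbody path element, I was able to query the character name successfully with: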

/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]

A similar process yielded the following working path for the image:

/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img
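Removing the tbody step by hand works for one or two paths. If several paths copied from a browser need the same treatment, the step can be stripped programmatically. The following is a minimal sketch using only the standard regex library; XPathCleanup and StripTbody are hypothetical names, and the approach is only appropriate when the raw markup really has no <tbody> element.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleanup
{
    // Matches a "/tbody" step, optionally with a positional predicate such as "[1]",
    // but only when it is a complete step (followed by "/" or the end of the path).
    private static readonly Regex TbodyStep = new Regex(@"/tbody(\[\d+\])?(?=/|$)");

    // Returns the XPath with every tbody step removed.
    public static string StripTbody(string xpath)
    {
        return TbodyStep.Replace(xpath, "");
    }

    public static void Main()
    {
        var fromBrowser = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
        Console.WriteLine(StripTbody(fromBrowser));
    }
}
```

The lookahead keeps the pattern from touching element names that merely start with "tbody".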

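Since both queries look for content in the same <table>, in my final code I select the table only once, in a separate step, and I removed some of the hardcoding related to the specific nesting of <div> elements.

Demo fiddle here.

Answer 1: (score: 0)

OK, to round this off: with dbc's help I polished the code and have almost completely finished the project. In case someone has the same problem later, here it is. It outputs the character name, link and image for every number in the defined range, writes them to a JSON file, and can be adapted to other websites.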

using System;
using System.Linq;
using Newtonsoft.Json;
using HtmlAgilityPack;
using System.IO;

namespace SearchingHTML
{
    public class CharacterData
    {
        public string Name { get; set; }
        public string ImageUrl { get; set; }
        public string Url { get; set; }
    }
    public class Program
    {
        public static CharacterData ExtractCharacterNameAndImage(string url)
        {
            var tableXpath = "/html/body//table";
            var nameXpath = "tr/td[2]/div[4]";
            var imageXpath = "tr/td[1]/div[1]/a/img";

            var htmlDoc = new HtmlWeb().Load(url);
            var table = htmlDoc.DocumentNode.SelectNodes(tableXpath).First();
            var name = table.SelectNodes(nameXpath).Select(n => n.GetDirectInnerText().Trim()).SingleOrDefault();
            var imageUrl = table.SelectNodes(imageXpath).Select(n => n.GetAttributeValue("src", "")).SingleOrDefault();

            return new CharacterData { Name = name, ImageUrl = imageUrl, Url = url };
        }
        public static void Main()
        {
            int max = 10000;
            string fileName = @"C:\Users\path of your file.json";

            Console.WriteLine("Environment version: " + Environment.Version);
            Console.WriteLine("Json.NET version: " + typeof(JsonSerializer).Assembly.FullName);
            Console.WriteLine("HtmlAgilityPack version: " + typeof(HtmlDocument).Assembly.FullName);
            Console.WriteLine();

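            for (int i = 6; i <= max; i++)
            {
                var url = "https://myanimelist.net/character/" + i;
                try
                {
                    // ExtractCharacterNameAndImage loads the page itself, so no separate Load call is needed here.
                    var data = ExtractCharacterNameAndImage(url);
                    var json = JsonConvert.SerializeObject(data, Formatting.Indented);
                    Console.WriteLine(json);
                    // Append the record; 'using' closes the writer even if the write fails.
                    using (TextWriter tsw = new StreamWriter(fileName, true))
                    {
                        tsw.WriteLine(json);
                    }
                }
                catch (Exception ex)
                {
                    // Missing pages or pages with a different layout end up here; report them instead of failing silently.
                    Console.WriteLine("Skipping " + url + ": " + ex.Message);
                }
            }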

        }
    }
}

/*******************************************************************************************************************************
 ****************************************************IF TESTING IS REQUIRED*****************************************************
 *******************************************************************************************************************************
 * 
 * static void TestSelect(HtmlDocument htmlDoc, string xpath)
    {
        Console.WriteLine("\nInput path: " + xpath);
        var splitPath = xpath.Split('/');
        for (int i = 2; i <= splitPath.Length; i++)
        {
            if (splitPath[i - 1] == "")
                continue;
            var thisPath = string.Join("/", splitPath, 0, i);
            Console.Write("Testing \"{0}\": ", thisPath);
            var result = htmlDoc.DocumentNode.SelectNodes(thisPath);
            Console.WriteLine("result count = {0}", result == null ? "null" : result.Count.ToString());
        }
    }

  *******************************************************************************************************************************
  *********************************************FOR TESTING ENTER THIS INTO MAIN CLASS********************************************
  *******************************************************************************************************************************
  * 
  *     var url2 = "https://myanimelist.net/character/256";
        var data2 = ExtractCharacterNameAndImage(url2);
        var json2 = JsonConvert.SerializeObject(data2, Formatting.Indented);

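        Console.WriteLine(json2);

        // TestSelect needs a loaded document; load the same page used above.
        var htmlDoc = new HtmlWeb().Load(url2);
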
        var nameXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[2]/div[4]";
        var imageXpathFromFirefox = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tbody/tr/td[1]/div[1]/a/img";
        TestSelect(htmlDoc, nameXpathFromFirefox);
        TestSelect(htmlDoc, imageXpathFromFirefox);
        var nameXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[2]/div[4]";
        var imageXpathFromFirefoxFixed = "/html/body/div[1]/div[3]/div[3]/div[2]/table/tr/td[1]/div[1]/a/img";
        TestSelect(htmlDoc, nameXpathFromFirefoxFixed);
        TestSelect(htmlDoc, imageXpathFromFirefoxFixed);

  *******************************************************************************************************************************
  *******************************************************************************************************************************
  *******************************************************************************************************************************
  */
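One detail of the program above is worth noting: because each record is appended to the file separately, the file ends up holding a sequence of JSON objects rather than a single JSON document, which some parsers will reject. A possible alternative is to collect the records in a list and write one JSON array at the end. The sketch below illustrates that idea under stated assumptions: it uses System.Text.Json instead of Json.NET so that it needs no extra package, CharacterRecord is a hypothetical stand-in for CharacterData, and the sample values and temp-file path are made up. With Json.NET, passing the list to JsonConvert.SerializeObject would likewise produce an array.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical stand-in that mirrors the shape of CharacterData.
public class CharacterRecord
{
    public string Name { get; set; }
    public string ImageUrl { get; set; }
    public string Url { get; set; }
}

public static class JsonArrayWriter
{
    // Serializes all records as one indented JSON array.
    public static string ToJsonArray(IEnumerable<CharacterRecord> records)
    {
        var options = new JsonSerializerOptions { WriteIndented = true };
        return JsonSerializer.Serialize(new List<CharacterRecord>(records), options);
    }

    public static void Main()
    {
        // Sample values for illustration only; a real run would add one record per scraped page.
        var records = new List<CharacterRecord>
        {
            new CharacterRecord { Name = "Example One", ImageUrl = "https://example.com/1.jpg", Url = "https://example.com/character/1" },
            new CharacterRecord { Name = "Example Two", ImageUrl = "https://example.com/2.jpg", Url = "https://example.com/character/2" }
        };

        // Write the whole array in one call so the file is a single valid JSON document.
        var path = Path.Combine(Path.GetTempPath(), "characters.json");
        File.WriteAllText(path, ToJsonArray(records));
        Console.WriteLine(File.ReadAllText(path));
    }
}
```

The trade-off is that nothing is written until the loop finishes, so a long run that crashes midway loses its results unless the list is flushed periodically.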