Question

I want to extract URLs from HTML to store in a CLOB data type in an Oracle database.

Part of the HTML file looks like this:

<a class="href_class" href="/download/file.zip"></a>

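I only need to get this part from the HTML: /download/file.zip, and then put all the download links in the database. In a regexp, how do I specify that the class name must match a particular value such as href_class?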

What is the best way to solve this, using regexp or some other method?

Answer 1

Since HTML is a structured document, you can load it in Oracle as an XMLType and apply an appropriate XPath expression to get the information you need:

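declare
    -- Sample input. XMLType only accepts HTML that is also well-formed XML.
    html CLOB := '<html><a class="href_class" href="/download/file.zip"></a><a class="href_class" href="/download/file2.zip"></a></html>';
    xml  XMLType;
    idx  NUMBER := 1;  -- XPath positions are 1-based
begin
    xml := XMLType(html);
    -- Step through the matching <a> elements until none is left.
    -- Note: [idx] is evaluated per parent element, so this assumes the links are siblings.
    WHILE xml.existsNode('//a[@class=''href_class''][' || idx || ']/@href') = 1 LOOP
        -- Run "set serveroutput on" first to see this output in SQL*Plus.
        dbms_output.put_line(xml.extract('//a[@class=''href_class''][' || idx || ']/@href').getStringVal());
        idx := idx + 1;
    END LOOP;
end;
/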
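
If you would rather get the links back as rows instead of looping, a set-based variant of the same XPath idea is possible with XMLTABLE. This is only a sketch: it reuses the sample HTML from the block above, and the column alias href and its VARCHAR2(4000) size are my own choices, not something from the question.

```sql
-- Sketch: one row per <a class="href_class"> element, exposing its href attribute.
-- Same well-formedness requirement as XMLType in the block above.
SELECT x.href
FROM   XMLTABLE(
         '//a[@class="href_class"]'
         PASSING XMLTYPE('<html><a class="href_class" href="/download/file.zip"></a><a class="href_class" href="/download/file2.zip"></a></html>')
         COLUMNS href VARCHAR2(4000) PATH '@href'
       ) x;
```

Because the result is an ordinary row source, it could be fed into an INSERT ... SELECT against whatever table is meant to hold the links.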

https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/t_xml.htm#BABHCHHJ
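
On the regexp part of the question: if the HTML is not well-formed XML, XMLType will refuse to parse it, and a regular expression may be the fallback. The query below is a rough sketch rather than a robust HTML parser. It assumes Oracle 11g or later (for REGEXP_COUNT and the sub-expression argument of REGEXP_SUBSTR), double-quoted attribute values, and that class appears before href inside the tag, as in the sample.

```sql
-- Sketch: pull href values out of <a ... class="href_class" ... href="..."> with a regexp.
-- Assumes class comes before href in the tag and both values are double-quoted.
SELECT REGEXP_SUBSTR(
         t.html,
         '<a[[:space:]][^>]*class="href_class"[^>]*href="([^"]*)"',
         1,       -- start at the first character
         LEVEL,   -- take the n-th match
         'i',     -- case-insensitive
         1        -- return only the first capture group (the URL)
       ) AS href
FROM   (SELECT '<html><a class="href_class" href="/download/file.zip"></a><a class="href_class" href="/download/file2.zip"></a></html>' AS html
        FROM   dual) t
CONNECT BY LEVEL <= REGEXP_COUNT(
         t.html,
         '<a[[:space:]][^>]*class="href_class"[^>]*href="([^"]*)"',
         1, 'i');
```

The class check is just the literal class="href_class" inside the pattern. Tags that put href before class, or use single quotes, would need a wider pattern.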
