Question

I have a database of 20,000 domain names, including top-level domains (TLDs), second-level domains, and lower-level domains. For example:

.biz
stackoverflow.com
ru.wikipedia.com

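I want to perform fast lookups to see whether an input URL matches any of these 20,000 entries. I could use Dictionary keys or HashSet.Contains, but those only work for exact matches. Because the database also contains TLD names, I would like acmeycompany.biz to be returned as a match because of the .biz TLD. On the other hand, fr.wikipedia.com should not match, because the subdomain is different.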

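Simply looping over the list and doing string comparisons is not an option either. If I have 1,000 URLs to check, that is too slow. So it has to be a key-based index lookup.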

I was thinking of building a tree structure like the one below and then doing key-based lookups, for example:

.com
    .wikipedia
        .ru
    .stackoverflow
.biz

Then I could split the input URL (sampledomain.com) into parts and do a lookup like this: .com -> .sampledomain

Can anybody point me to how to do this? Or are there other alternatives? Any sample would be appreciated.

Thanks!

This is how I started... It is VB.NET code, but you get the idea.

Public Class TreeNode

    Sub New()
        ChildNodes = New Dictionary(Of String, TreeNode)
    End Sub

    Public Property Key As String
    Public Property ChildNodes As Dictionary(Of String, TreeNode)

End Class

Private Tree As New Dictionary(Of String, TreeNode)

Sub BuildTree()

    For Each Link In Links

        If Uri.IsWellFormedUriString(Link, UriKind.Absolute) Then

            Dim Url As New Uri(Link)
            Dim Domain As String

            If Url.HostNameType = UriHostNameType.Dns Then

                Domain = Url.Host.ToLower.Replace("www.", "")

                Dim DomainParts() As String = Domain.Split(CChar("."))

                'let's start from the TLD
                For Each Key In DomainParts.Reverse

                    'don't know how to populate the tree

                Next

            End If

        End If

    Next

End Sub

Function TreeLookup(Link As String) As Boolean

    Dim Url As New Uri(Link)
    Dim Domain As String
    Dim IsMatch As Boolean = False

    If Url.HostNameType = UriHostNameType.Dns Then

        Domain = Url.Host.ToLower.Replace("www.", "")

        Dim DomainParts() As String = Domain.Split(CChar("."))
        Dim DomainPartsCount As Integer = DomainParts.Length
        Dim Index As Integer = 0


        For Each Key In DomainParts

            Index += 1

            'use recursive loop to see if 

            'returns true if directory contains key and we have reached to the last part of the domain name
            If Index = DomainPartsCount Then

                IsMatch = True
                Exit For

            End If

        Next

    End If

    Return IsMatch

End Function

Answer 1

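You may want to create a dictionary of dictionaries of hash maps. The first dictionary would contain every TLD, each paired with a dictionary of all the second-level domains under that TLD. Each second-level domain would in turn contain a hash map of all the lower-level domains under it. Each entry would also carry a flag indicating whether that entry is actually in the database or is only a placeholder for holding lower-level entries. With the short list you gave, .com is not actually in the list but would still be an entry at the TLD level, so that stackoverflow.com and wikipedia.com (itself a placeholder for ru.wikipedia.com) can be reached. A lookup would then start with the URL's TLD, move on to the second level, and finally check the lower level if it needs to go that deep.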

I hope I have understood your dilemma correctly and explained my idea adequately.

Edit: the lowest level only needs to be a hash map.

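You need to add an indicator to the tree node that tells whether the node is a matching key or merely a stepping stone to a second- or lower-level domain.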

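To add a domain, you could do something like the following (you can make it recursive if you like, but with only three levels it is not really important):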

// TLD, SLD and LLD are the three levels of the current domain you are adding into the tree
if Tree does not contain the TLD
    Add TLD to the Tree with a new TreeNode

if SLD does not exist for the current domain
    Mark the Tree at TLD as a match
else   
    if Tree[TLD] does not contain the SLD
        Add SLD to the Tree[TLD] Node

    if LLD does not exist for the current domain
        Mark the Tree[TLD] at SLD as a match
    else   
        if Tree[TLD][SLD] does not contain the LLD
            Add LLD to the Tree[TLD][SLD] Node
            // Don't really need to mark the node
            // as every LLD would be a match
            // Probably would need to mark if made recursive

Looking up a domain (again, this could be recursive):

// TLD, SLD and LLD are for the domain you are looking for
if Tree does not contain TLD
    you are done, no match
else
    if Tree[TLD] is marked
        done, match
    else
        if Tree[TLD] does not contain SLD
            done, no match
        else
            if Tree[TLD][SLD] is marked
                done, match
            else
                if Tree[TLD][SLD] contains LLD
                    done, match
                    // would need to check if the node
                    // is marked if made recursive
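
The lookup could be sketched along the same lines. This is again Python for brevity and only one possible reading of the pseudocode. The block repeats a small node type and insert helper so it can run on its own, and `is_listed` is a made-up name.

```python
class TreeNode:
    def __init__(self):
        self.children = {}     # label -> TreeNode
        self.is_match = False  # True only for entries that are in the database


def add_domain(tree, domain):
    """Insert an entry, marking only its final (deepest) node."""
    children, node = tree, None
    for label in reversed([p for p in domain.lower().split(".") if p]):
        node = children.setdefault(label, TreeNode())
        children = node.children
    if node is not None:
        node.is_match = True


def is_listed(tree, host):
    """Return True if host, or any parent domain of it, is a marked entry."""
    children = tree
    for label in reversed(host.lower().split(".")):
        node = children.get(label)
        if node is None:
            return False   # walked off the tree: no match
        if node.is_match:
            return True    # a marked entry covers everything below it
        children = node.children
    return False           # only placeholder nodes were visited


tree = {}
for entry in [".biz", "stackoverflow.com", "ru.wikipedia.com"]:
    add_domain(tree, entry)

for host in ["acmeycompany.biz", "ru.wikipedia.com", "fr.wikipedia.com"]:
    print(host, is_listed(tree, host))
```

Each lookup costs one dictionary probe per label of the input host, regardless of how many entries the tree holds. Stripping the scheme, path and any `www.` prefix from the URL is assumed to happen before the call.
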

Answer 2

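When storing the items in the database, give each part of the URL its own column. So the TLD, the domain, and the subdomain each get their own column.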

create table MyList
(
    [TLD] nvarchar(10) not null,
    [Domain] nvarchar(50) not null,
    [Subdomain] nvarchar(50) not null,

    unique ([TLD], [Domain], [Subdomain]) 
    --this means that you can't add the same data twice
)

Now use SQL to get the data you need.

select *
from MyList
where [TLD] = '.com'

This is the most efficient way to solve the problem, because the data never leaves your database before it is filtered.
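
The query above filters on the TLD only. To apply the question's matching rules in SQL, one option is to treat an empty Domain or Subdomain as "covers everything below it". The answer does not say how missing parts are stored, so the empty-string convention here is an assumption, as is the helper name `is_listed_sql`. The sketch uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table MyList (
        TLD       text not null,
        Domain    text not null,
        Subdomain text not null,
        unique (TLD, Domain, Subdomain)
    )
    """
)
# Assumption: a part that is absent from the entry is stored as ''.
conn.executemany(
    "insert into MyList values (?, ?, ?)",
    [
        (".biz", "", ""),               # whole TLD is listed
        (".com", "stackoverflow", ""),  # whole second-level domain is listed
        (".com", "wikipedia", "ru"),    # only this subdomain is listed
    ],
)


def is_listed_sql(tld, domain, subdomain=""):
    """True if a row covers the given host parts under the '' convention."""
    row = conn.execute(
        """
        select 1 from MyList
        where TLD = ?
          and (Domain = ''
               or (Domain = ? and (Subdomain = '' or Subdomain = ?)))
        limit 1
        """,
        (tld, domain, subdomain),
    ).fetchone()
    return row is not None


print(is_listed_sql(".biz", "acmeycompany"))     # covered by the .biz row
print(is_listed_sql(".com", "wikipedia", "fr"))  # no row covers fr.wikipedia.com
```
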

For the reasoning behind the table layout, read up on Database Normalization.

If you store the URL in just one column, you will have to do some data transformation.
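
As an illustration of that transformation, a host name could be split into the three columns roughly as follows. This is a deliberately naive Python sketch: `split_host` is a hypothetical helper, it treats the last label as the TLD (so multi-label suffixes such as co.uk are not handled), and it folds everything left of the domain into one subdomain string.

```python
def split_host(host):
    """Split "ru.wikipedia.com" into (".com", "wikipedia", "ru")."""
    labels = host.lower().strip(".").split(".")
    tld = "." + labels[-1]
    domain = labels[-2] if len(labels) >= 2 else ""
    subdomain = ".".join(labels[:-2])  # '' when there is no subdomain
    return tld, domain, subdomain


for host in [".biz", "stackoverflow.com", "ru.wikipedia.com"]:
    print(split_host(host))
```
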

Dictionary/tree fast key lookup
