模糊字符串匹配SQL - 不同顺序的单词

时间:2018-01-22 11:27:56

标签: sql-server algorithm tsql pattern-matching fuzzy-search

如何匹配不精确且具有不同单词顺序的字符串。通常字符串具有相似的数字模式,但单词可能的顺序不同。

例如,我会考虑匹配良好的字符串:

目标字符串Apple 10mg/51L Tail

测试字符串Tail 10mg/51L Apple(只是随机翻译,拼写正确)

我还会考虑以下字符串之间的良好匹配:

测试字符串51L MissleadingLENWord ObfuscateTail 10mg Apple(如果我们用LIKE子句逐个检查每个单词,即#34; Tail&#,可以在测试字符串中找到目标字符串的所有单词34;目标字符串可以在单词" ObfuscateTail")中的测试字符串中找到。

我希望在函数中看到这个问题的解决方案返回百分比数字,这意味着字符串有多相似 - 零 - 字符串不同,100%两个字符串都相同。

我应该使用哪种algorytm?如果可以使用SQL Server实现它是最好的。

我可以在这里找到一些算法:Fuzzy matching using T-SQL。 领先答案中提到的 Levenshtein 距离算法是否适合于单词的混合顺序?

3 个答案:

答案 0 :(得分:4)

只要单词被分开(空白,/或任何其他分隔符),这可以通过字符串拆分器和命中计数来完成,但是你找不到" Tail& #34; in" ObfuscateTail"。您还需要一些 CamelCase解析 ......

一个相当简单的解决方法是使用所有碎片进行LIKE搜索,但这可能会带来很多 - 而且(肯定!)这不会很快......

尝试这样的事情:

DECLARE @mockupTable TABLE(ID INT IDENTITY, YourTarget VARCHAR(100));
INSERT INTO @mockupTable VALUES('51L MissleadingLENWord ObfuscateTail 10mg Apple')
                              ,('Some other 51L with differing words');

DECLARE @search VARCHAR(100)='Apple 10mg/51L Tail';

WITH Parted AS
(
    SELECT CAST('<x>' + REPLACE(REPLACE(@search,' ','/'),'/','</x><x>') + '</x>' AS XML) AS SearchFragmentsXML
)
,AllSearchWords AS
(
    SELECT frgmnt.value(N'.',N'nvarchar(max)') AS Frg 
    FROM Parted 
    CROSS APPLY SearchFragmentsXML.nodes(N'/x') AS A(frgmnt)
)
SELECT ID
      ,COUNT(*) AS CountHits
      ,(SELECT COUNT(*) FROM AllSearchWords) AS CountFragments
FROM @mockupTable AS t
INNER JOIN AllSearchWords AS Frgs ON t.YourTarget LIKE '%' + Frgs.Frg + '%'
GROUP BY ID;

结果

ID  CountHits   CountFragments
1   4           4
2   1           4

命中次数越接近&#34;对于片段的数量&#34;越好。

更新:功能(不推荐)

DROP FUNCTION dbo.YourSearch;
GO
CREATE FUNCTION dbo.YourSearch(@SearchIn VARCHAR(MAX), @SearchFor VARCHAR(100)='Apple 10mg/51L Tail')
RETURNS FLOAT
AS
BEGIN
DECLARE @rslt DECIMAL(10,4) =
(
    SELECT CAST(COUNT(*) AS FLOAT) / MAX(SearchFragmentsXML.value('count(/x[text()])','float'))
    FROM 
    (
        SELECT CAST('<x>' + REPLACE(REPLACE(@SearchFor,' ','/'),'/','</x><x>') + '</x>' AS XML) AS SearchFragmentsXML
    ) AS Parted
    CROSS APPLY SearchFragmentsXML.nodes(N'/x') AS A(frgmnt)
    WHERE @SearchIn LIKE '%' + frgmnt.value(N'text()[1]',N'nvarchar(max)') + '%'
);

RETURN @rslt;
END
GO

DECLARE @mockupTable TABLE(ID INT IDENTITY, YourTarget VARCHAR(100));
INSERT INTO @mockupTable VALUES('51L MissleadingLENWord ObfuscateTail 10mg Apple')
                              ,('Some other 51L with differing words');

SELECT t.*
      ,dbo.YourSearch(t.YourTarget,'Apple 10mg/51L Tail') AS HitCoeff
FROM @mockupTable AS t;

结果

ID  YourTarget                                          HitCoeff
1   51L MissleadingLENWord ObfuscateTail 10mg Apple     1
2   Some other 51L with differing words                 0,25

提示:如果您使用带有SessionID的物理表格,那么它会有很大帮助,您可以在其中填写搜索字符串的片段。然后将SessionID传递给函数并从那里抓取碎片。这至少会避免重复拆分,并且可以使用结果缓存。

答案 1 :(得分:1)

您正在寻找通常称为词组匹配的内容。

模糊的单词和单词变得凌乱。

所有方法都从拆分单词开始。

您可以使用Levenshtein distance距离,但可以使用单词中不包含字符的单词。你可以拿出这个词的哈希值。不完美但基于散列会更快。

这里常见的最佳做法是tf–idf。这是Lucene使用的。您可能认为它有点激烈但我在一个包含100万个文档的库中使用了最多100,000个单词并且它在不到1秒的时间内找到了排名匹配。再一次,你不会在这个词中变得模糊。

Cosine similarity是另一种选择。

模糊的话,你可以对Levenshtein反对每个单词并取最小的然后做一些总和。我不推荐这条路线。

答案 2 :(得分:0)

我还没有发现任何可以衡量字符串中单词混排的东西。为了换个字母,我最后使用了这个答案:https://stackoverflow.com/a/26389197/1903793

CREATE ASSEMBLY [FuzzyString]
FROM 0x4D5A90000300000004000000FFFF0000B800000000000000400000000000000000000000000000000000000000000000000000000000000000000000800000000E1FBA0E00B409CD21B8014CCD21546869732070726F6772616D2063616E6E6F742062652072756E20696E20444F53206D6F64652E0D0D0A2400000000000000504500004C010300BBB08A5A0000000000000000E00022200B013000000C000000060000000000007A2B0000002000000040000000000010002000000002000004000000000000000600000000000000008000000002000000000000030060850000100000100000000010000010000000000000100000000000000000000000282B00004F000000004000009003000000000000000000000000000000000000006000000C000000F02900001C0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000080000000000000000000000082000004800000000000000000000002E74657874000000800B000000200000000C000000020000000000000000000000000000200000602E72737263000000900300000040000000040000000E0000000000000000000000000000400000402E72656C6F6300000C00000000600000000200000012000000000000000000000000000040000042000000000000000000000000000000005C2B00000000000048000000020005006022000090070000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000013300800F901000001000011000F00281000000A130711072C0C0F007201000070281100000A0F01281000000A130811082C0C0F017201000070281100000A0F00281200000A6F1300000A0A0F01281200000A6F1300000A0B066F1400000A0C076F1400000A0D081758091758731500000A130416130508095816FE01130911092C1600230000000000005940281600000A130A38690100000816FE01130B110B2C1600230000000000000000281600000A130A38490100000916FE01130C110C2C1600230000000000000000281600000A130A382901000016130D2B121104110D16110D281700000A110D1758130D110D08FE0216FE01130E110E2DE016130F2B12110416110F110F281700000A110F1758130F110F09FE0216FE01131011102DE0171311388C000000001713122B710006111117596F1800000A07111217596F1800000AFE01131311132C051613052B031713051104111111121104111117591112281900000A17581104111111121759281900000A1758281A00000A11041111175911121759281900000A110558281A00000A281700000A00111217581312111209FE0216FE01131411142D8100111117581311111108FE0216FE01131511153A63FFFFFF23000000000000F03F11040809281900000A6C0809281B00000A6C5B592300000000000059405A18281C00000A13061106281600000A130A2B00110A2A2202281D00000A002A000042534A4201000100000000000C00000076342E302E33303331390000000005006C0000005C020000237E0000C80200001C03000023537472696E677300000000E40500000400000023555300E8050000100000002347554944000000F80500009801000023426C6F620000000000000002000001471502080900000000FA01330016000001000000150000000200000002000000020000001D0000000F000000010000000100000001000000020000000000F101010000000000060028019E02060095019E02060047006C020F00BE02000006006F00270206000B0127020600D700270206007C01270206004801270206006101270206008600270206005B007F02060039007F020600BA0027020600A100BD010600FC020C020A00F6004B020A002500CD020A00D701CD020600DA010C020600E1010C02000000000100000000000100010001001000E202000041000100010050200000000096001702810001005522000000008618660206000300000001002F00000002003902090066020100110066020600190066020A00290066021000310066021000390066021000410066021000490066021000510066021000590066021000610066021500690066021000710066021000790066021000890066020600990001023A009900660210009900B3013E00A10043023E00A100E60142000C0066024E0091000B0354000C0007035A00A100F20261000C0003036600A90013026C00A90017036C00A9001F00720081006602060020007B0071012E000B008A002E00130093002E001B00B2002E002300BB002E002B00CE002E003300CE002E003B00CE002E004300BB002E004B00D4002E005300CE002E005B00CE002E006300EC002E006B0016012E00730023011A0046000480000001000000000000000000000000001B020000040000000000000000000000780016000000000004000000000000000000000078000A00000000000000003C4D6F64756C653E0053797374656D2E44617461006D73636F726C696200526F756E640053716C446F75626C6500737472696E674F6E6500477569644174747269627574650044656275676761626C6541747472696275746500436F6D56697369626C6541747472696275746500417373656D626C795469746C6541747472696275746500417373656D626C7954726164656D61726B417474726962757465005461726765744672616D65776F726B41747472696275746500417373656D626C7946696C6556657273696F6E41747472696275746500417373656D626C79436F6E66696775726174696F6E4174747269627574650053716C46756E6374696F6E41747472696275746500417373656D626C794465736372697074696F6E41747472696275746500436F6D70696C6174696F6E52656C61786174696F6E7341747472696275746500417373656D626C7950726F6475637441747472696275746500417373656D626C79436F7079726967687441747472696275746500417373656D626C79436F6D70616E794174747269627574650052756E74696D65436F6D7061746962696C697479417474726962757465006765745F56616C75650053797374656D2E52756E74696D652E56657273696F6E696E670053716C537472696E67004D617468006765745F4C656E677468004C6576656E73687465696E2E646C6C006765745F49734E756C6C0053797374656D004D696E004861426F4C6576656E73687465696E0053797374656D2E5265666C656374696F6E00737472696E6754776F00546F5570706572004D6963726F736F66742E53716C5365727665722E536572766572002E63746F720053797374656D2E446961676E6F73746963730053797374656D2E52756E74696D652E496E7465726F7053657276696365730053797374656D2E52756E74696D652E436F6D70696C6572536572766963657300446562756767696E674D6F6465730053797374656D2E446174612E53716C54797065730053746F72656446756E6374696F6E73006765745F4368617273004F626A6563740047657400536574006F705F496D706C69636974004D61780000000100008C1A28518DAA994B969F8C9B2C0CD20400042001010803200001052001011111042001010E04200101021F07160E0E080814080200020000080D02020211490202080208020808020202032000020320000E03200008071408020002000005200201080805000111490D0620030108080804200103080520020808080500020808080500020D0D0808B77A5C561934E0890800021149114D114D0801000800000000001E01000100540216577261704E6F6E457863657074696F6E5468726F7773010801000701000000001201000D436C6173734C69627261727931000005010000000017010012436F7079726967687420C2A920203230313800002901002465356266373439622D363661392D343637332D396233332D39616639656462383961663100000C010007312E302E302E3000004D01001C2E4E45544672616D65776F726B2C56657273696F6E3D76342E362E310100540E144672616D65776F726B446973706C61794E616D65142E4E4554204672616D65776F726B20342E362E31240100020054020F497344657465726D696E69737469630154020949735072656369736500000000000000BBB08A5A00000000020000001C0100000C2A00000C0C0000525344536AF89DEC4586C4488693EFBD73C73D1E01000000433A5C315C53514C5C646C6C5C436C6173734C696272617279315C436C6173734C696272617279315C6F626A5C44656275675C4C6576656E73687465696E2E7064620000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000502B000000000000000000006A2B00000020000000000000000000000000000000000000000000005C2B0000000000000000000000005F436F72446C6C4D61696E006D73636F7265652E646C6C0000000000FF2500200010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100100000001800008000000000000000000000000000000100010000003000008000000000000000000000000000000100000000004800000058400000340300000000000000000000340334000000560053005F00560045005200530049004F004E005F0049004E0046004F0000000000BD04EFFE00000100000001000000000000000100000000003F000000000000000400000002000000000000000000000000000000440000000100560061007200460069006C00650049006E0066006F00000000002400040000005400720061006E0073006C006100740069006F006E00000000000000B00494020000010053007400720069006E006700460069006C00650049006E0066006F0000007002000001003000300030003000300034006200300000001A000100010043006F006D006D0065006E007400730000000000000022000100010043006F006D00700061006E0079004E0061006D006500000000000000000044000E000100460069006C0065004400650073006300720069007000740069006F006E000000000043006C006100730073004C0069006200720061007200790031000000300008000100460069006C006500560065007200730069006F006E000000000031002E0030002E0030002E003000000040001000010049006E007400650072006E0061006C004E0061006D00650000004C006500760065006E00730068007400650069006E002E0064006C006C0000004800120001004C006500670061006C0043006F007000790072006900670068007400000043006F0070007900720069006700680074002000A90020002000320030003100380000002A00010001004C006500670061006C00540072006100640065006D00610072006B00730000000000000000004800100001004F0072006900670069006E0061006C00460069006C0065006E0061006D00650000004C006500760065006E00730068007400650069006E002E0064006C006C0000003C000E000100500072006F0064007500630074004E0061006D0065000000000043006C006100730073004C0069006200720061007200790031000000340008000100500072006F006400750063007400560065007200730069006F006E00000031002E0030002E0030002E003000000038000800010041007300730065006D0062006C0079002000560065007200730069006F006E00000031002E0030002E0030002E00300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000C0000007C3B00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
WITH PERMISSION_SET = SAFE
GO

CREATE FUNCTION [dbo].[Levenshtein](@S1 [nvarchar](200), @S2 [nvarchar](200))
RETURNS [float] WITH EXECUTE AS CALLER
AS 
EXTERNAL NAME [FuzzyString].[StoredFunctions].[HaBoLevenshtein]
GO

示例用法:

select [dbo].[Levenshtein] ('Apple', 'Appleee')
相关问题