Noise words in SQL Server 2005 full-text search

Date: 2009-06-02 06:31:52

Tags: sql-server full-text-search

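I am trying to use full-text search over a list of names in my database. This is my first attempt at using full-text search. Currently I take the search string that was entered and put a NEAR condition between each term (i.e. an entered phrase of "Kings of Leon" becomes "Kings NEAR of NEAR Leon").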

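Unfortunately, I have found that this tactic produces false negatives, because SQL Server drops the word "of" when it builds the index, since it is a noise word. Thus "Kings Leon" will match correctly, but "Kings of Leon" will not.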

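My co-worker suggests taking all the noise words defined in MSSQL\FTData\noiseENG.txt and putting them in the .NET code, so that the noise words can be stripped before the full-text search is executed.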
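To make that suggestion concrete, here is a minimal sketch of what it might look like: split the input into terms, drop the noise words, and join what is left with NEAR. The class name `NearQueryBuilder`, the method name `Build`, and the shortened word list are all hypothetical; a real list would be taken from the noise word file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NearQueryBuilder
{
    // Hypothetical, shortened list for illustration only; the full list
    // would be taken from the noise word file (e.g. noiseENG.txt).
    private static readonly HashSet<string> NoiseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "and", "of", "the" };

    // Splits the input on whitespace, drops noise words (case-insensitively),
    // and joins the remaining terms with NEAR.
    public static string Build(string input)
    {
        IEnumerable<string> terms = input
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(term => !NoiseWords.Contains(term));
        return string.Join(" NEAR ", terms);
    }
}
```

With this sketch, `NearQueryBuilder.Build("Kings of Leon")` returns `Kings NEAR Leon`, which is the form that matches.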

Is this the best solution? Is there some auto-magic setting I can change in SQL Server to do this for me? Or perhaps just a better solution that doesn't feel as hacky?

2 Answers:

Answer 0: (score: 4)

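Full-text search works off the search condition you give it. You can remove the noise words from the file, but doing so really does risk bloating the index. Robert Cain has a lot of good information about this on his blog: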

http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/

To save some time, you can look at how this method strips them and copy the code and the word list:

    public string PrepSearchString(string sOriginalQuery)
    {
        string strNoiseWords = @" 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | $ | ! | @ | # | $ | % | ^ | & | * | ( | ) | - | _ | + | = | [ | ] | { | } | about | after | all | also | an | and | another | any | are | as | at | be | because | been | before | being | between | both | but | by | came | can | come | could | did | do | does | each | else | for | from | get | got | has | had | he | have | her | here | him | himself | his | how | if | in | into | is | it | its | just | like | make | many | me | might | more | most | much | must | my | never | now | of | on | only | or | other | our | out | over | re | said | same | see | should | since | so | some | still | such | take | than | that | the | their | them | then | there | these | they | this | those | through | to | too | under | up | use | very | want | was | way | we | well | were | what | when | where | which | while | who | will | with | would | you | your | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z ";

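        // Each entry keeps its surrounding spaces (e.g. " of "), so Replace
        // only hits whole words and not fragments of longer words.
        string[] arrNoiseWord = strNoiseWords.Split("|".ToCharArray());

        // Pad the query so noise words at the very start or end also match.
        sOriginalQuery = " " + sOriginalQuery + " ";

        // Note: String.Replace is case-sensitive, so "Of" is left alone.
        foreach (string noiseword in arrNoiseWord)
        {
            sOriginalQuery = sOriginalQuery.Replace(noiseword, " ");
        }

        // Collapse any runs of spaces left behind by the replacements.
        while (sOriginalQuery.Contains("  "))
        {
            sOriginalQuery = sOriginalQuery.Replace("  ", " ");
        }
        return sOriginalQuery.Trim();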
    }

However, I would probably use Regex.Replace for this, which should be much faster than looping. I just don't have a quick example to post.
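For illustration, a Regex.Replace version might look roughly like the sketch below. It is not taken from the blog post above: the class name `NoiseWordFilter`, the method name `Strip`, and the shortened word list are placeholders, and no performance comparison with the loop has been measured here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NoiseWordFilter
{
    // Hypothetical, shortened list for illustration only; swap in the
    // full word list from the noise word file.
    private static readonly string[] NoiseWords = { "a", "an", "and", "of", "the", "to" };

    // One alternation wrapped in \b word boundaries, so only whole words match.
    // Regex.Escape keeps any entry containing regex metacharacters literal.
    private static readonly Regex NoiseRegex = new Regex(
        @"\b(?:" + string.Join("|", NoiseWords.Select(w => Regex.Escape(w))) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string Strip(string query)
    {
        // Replace every noise word with a space, then collapse whitespace runs.
        string withoutNoise = NoiseRegex.Replace(query, " ");
        return Regex.Replace(withoutNoise, @"\s+", " ").Trim();
    }
}
```

Here `NoiseWordFilter.Strip("Kings of Leon")` returns `Kings Leon`. Unlike the loop, this matches regardless of case and also handles noise words at the start or end of the string.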

Answer 1: (score: 0)

Here is a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

    ' Requires: Imports System.Text.RegularExpressions
    ' ReadFile is a helper that is not shown here; it is expected to return the file's contents as a string.
    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result.Trim() ' drop the space left where a leading or trailing noise word was removed
    End Function