Identifying the most frequently used words in a string

Time: 2019-01-30 15:30:28

Tags: sql-server tsql

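Today's challenge: how do I find the three most frequently used words in a string field? I know how to count the occurrences of one specific word (see below), but how do I identify the three most common words? Any help is appreciated.

Regards, 槟榔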

declare @string varchar(max)
set @string = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion'

select (DATALENGTH(@string) - DATALENGTH(REPLACE(@string, 'cat', '')))/DATALENGTH('cat')

5 answers:

Answer 0: (score: 1)

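You can do this with a string-splitting function. Depending on your SQL Server version, you can use your own function (one is included below; note that it accepts at most 4,000 characters) or the built-in STRING_SPLIT, available since SQL Server 2016: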

declare @string varchar(max);
set @string = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion';

-- via user defined TVF
select item as word
    ,count(1) as frequency
from dbo.fn_stringsplit4k(@string,' ',null) as s
group by item
order by frequency desc;

-- via built in STRING_SPLIT function
select s.value as word
    ,count(1) as frequency
from string_split(@string,' ') as s
group by s.value
order by frequency desc;

Output

+----------+-----------+
|   word   | frequency |
+----------+-----------+
| cat      |        13 |
| dog      |        12 |
| elephant |         9 |
| lion     |         3 |
| chicken  |         2 |
| mouse    |         1 |
+----------+-----------+
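
The queries above return every word with its count. To get only the three most frequent words the question asks for, one option is to add TOP (3) to the same query. The sketch below is an assumption about what is wanted rather than part of the original answer: it uses WITH TIES so that words tied with third place are not dropped arbitrarily, and it filters out the empty strings that consecutive spaces would produce.

```sql
DECLARE @string varchar(max) = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion';

-- Requires STRING_SPLIT (SQL Server 2016 or later).
SELECT TOP (3) WITH TIES          -- WITH TIES keeps any words tied with third place
       s.value  AS word,
       COUNT(*) AS frequency
FROM STRING_SPLIT(@string, ' ') AS s
WHERE s.value <> ''               -- skip empty items caused by repeated spaces
GROUP BY s.value
ORDER BY frequency DESC;
```

For the sample string this returns cat (13), dog (12) and elephant (9). Drop WITH TIES if exactly three rows are required regardless of ties.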

Table-valued function

CREATE function [dbo].[fn_StringSplit4k]
(
     @str nvarchar(4000) = ' '              -- String to split.
    ,@delimiter as nvarchar(1) = ','        -- Delimiting value to split on.
    ,@num as int = null                     -- Which value to return.
)
returns table
as
return
                    -- Start tally table with 10 rows.
    with n(n)   as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)

                    -- Select the same number of rows as characters in @str as incremental row numbers.
                    -- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
        ,t(t)   as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)

                    -- Return the position of every value that follows the specified delimiter.
        ,s(s)   as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)

                    -- Return the start and length of every value, to use in the SUBSTRING function.
                    -- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
        ,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)

    select rn
          ,item
    from(select row_number() over(order by s) as rn
                ,substring(@str,s,l) as item
        from l
        ) a
    where rn = @num
        or @num is null;

Answer 1: (score: 1)

Easy :)

DECLARE @String VARCHAR(255)
DECLARE @strngLen int
DECLARE @split TABLE(w_id INT IDENTITY(1,1),w_word VARCHAR(100))
set @string = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion'
SET @strngLen = CHARINDEX(' ', @String)

WHILE CHARINDEX(' ', @String) > 0
BEGIN
    SET @strngLen = CHARINDEX(' ', @String);

    INSERT INTO @split
    SELECT SUBSTRING(@String,1,@strngLen - 1);

    SET @String = SUBSTRING(@String, @strngLen + 1, LEN(@String));
END

INSERT INTO @split
SELECT @String

SELECT w_word, COUNT(1) FROM @split
GROUP BY w_word
ORDER BY COUNT(1) desc

Answer 2: (score: 1)

Starting with SQL Server 2016 (database compatibility level 130 or higher), you can use STRING_SPLIT for this kind of work:

SELECT
  value      AS word
  , COUNT(*) AS occurrence 
FROM STRING_SPLIT(@string, ' ')
GROUP BY value
ORDER BY occurrence DESC;

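Answer 3: (score: 0)

There are already good answers, so this is more about showing how many different approaches are available. Here is an XQuery approach: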

DECLARE @string VARCHAR(MAX)
SET @string = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion'


SELECT CAST('<x>' + REPLACE(@string,' ','</x><x>') + '</x>' AS XML)
      .query('
            for $word in distinct-values(/x)
            return <word value="{$word}" count="{count(/x[text()=$word])}"/>
           ');

Result

  <word value="mouse" count="1" />
  <word value="cat" count="13" />
  <word value="dog" count="12" />
  <word value="chicken" count="2" />
  <word value="lion" count="3" />
  <word value="elephant" count="9" />

Of course, it is easy to get this result in tabular form...
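
One way to get a tabular result, sketched here as an illustration and not taken from the original answer, is to shred the `<x>` elements with the XML `nodes()` and `value()` methods and aggregate in plain T-SQL. The derived table is there because SQL Server does not allow XML methods directly in a GROUP BY clause. The column name `word`, the aliases `w`, `x` and `t`, and the `varchar(100)` length are assumptions.

```sql
DECLARE @string VARCHAR(MAX) = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion';

-- Turn the space-separated string into <x>word</x> elements.
-- Caveat: characters such as & or < in the input would make this cast fail.
DECLARE @xml XML = CAST('<x>' + REPLACE(@string, ' ', '</x><x>') + '</x>' AS XML);

SELECT TOP (3)
       t.word,
       COUNT(*) AS frequency
FROM (
        -- One row per <x> element, with its text extracted as varchar.
        SELECT w.x.value('(./text())[1]', 'varchar(100)') AS word
        FROM @xml.nodes('/x') AS w(x)
     ) AS t
GROUP BY t.word
ORDER BY frequency DESC;
```

For the sample string this returns cat, dog and elephant with counts 13, 12 and 9.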

Some enhancements, just for fun

This returns only the three highest counts:

SELECT CAST('<x>' + REPLACE(@string,' ','</x><x>') + '</x>' AS XML)
    .query('
            for $word in distinct-values(/x)
            let $wCount:=count(/x[text()=$word])
            order by $wCount descending
            return <word value="{$word}" count="{$wCount}"/>
        ')
    .query('for $i in(1,2,3) return /word[$i]')

Result

<word value="cat" count="13" />
<word value="dog" count="12" />
<word value="elephant" count="9" />

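Answer 4: (score: 0)

This is a great opportunity to share an excellent solution I found for this exact problem. It uses Jeff Moden's excellent DelimitedSplit8K function.

First, the function: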

CREATE FUNCTION [dbo].[DelimitedSplit8K]
--===== Define I/O parameters
        (@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE!  IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
 RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
     -- enough to cover VARCHAR(8000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                ),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                 SELECT 1 UNION ALL
                 SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
                ),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                 SELECT s.N1,
                        ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
                   FROM cteStart s
                )
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.

SELECT a.ItemNumber,
vn = ROW_NUMBER() OVER (PARTITION BY a.Item ORDER BY a.ItemNumber asc),
a.Item 
FROM 
( SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
        Item       = SUBSTRING(@pString, l.N1, l.L1)
   FROM cteLen l) a
GO

The function uses a "numbers" or "tally" table to split the string very quickly, as a set-based operation.

Now the usage:
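
To make the tally idea more concrete, here is a reduced sketch of the step that finds where each word starts. It is an illustration only, not the function itself: the names `E1`, `Tally` and `start_position` and the short sample string are assumptions. Every number N from 1 to the string length is tested, and wherever the character at position N is a space, a new word begins at N + 1.

```sql
DECLARE @s varchar(8000) = 'cat dog lion';

WITH E1(N) AS (   -- 10 seed rows
        SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS v(n)
     ),
     Tally(N) AS ( -- numbers 1..DATALENGTH(@s), drawn from up to 100 rows
        SELECT TOP (DATALENGTH(@s)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM E1 a, E1 b
     )
SELECT 1 AS start_position        -- the first word always starts at position 1
UNION ALL
SELECT N + 1                      -- every other word starts right after a space
FROM Tally
WHERE SUBSTRING(@s, N, 1) = ' '
ORDER BY start_position;
```

For 'cat dog lion' this yields the start positions 1, 5 and 9. The full function then pairs each start with the position of the next delimiter to get the length passed to SUBSTRING.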

DECLARE @string VARCHAR(8000)

SET @string = 'mouse cat dog elephant chicken cat dog elephant cat dog elephant cat dog cat elephant cat lion dog elephant cat dog elephant lion cat dog elephant cat dog elephant cat dog elephant cat dog cat dog cat dog chicken lion'

SELECT TOP 3
    Item
FROM dbo.DelimitedSplit8K(@string, ' ')
GROUP BY Item
ORDER BY COUNT(*) DESC

Output:

Item
----
cat
dog
elephant