Postgres full-text search: ranking by position

Time: 2016-02-26 04:51:32

Tags: postgresql full-text-search

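I have a table of movies, and I want to search the titles and return the closest matches.

I thought full-text search might help, but it doesn't seem to be able to rank by the position of the words, even though Postgres knows the positions. Is this possible in Postgres?

Here is my query: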

SELECT collectibles.id, collectibles.title, ts_rank_cd(to_tsvector('english', collectibles.title), plainto_tsquery('old school')) AS score
FROM collectibles WHERE to_tsvector('english', collectibles.title) @@ plainto_tsquery('old school')
ORDER BY score DESC;

Here are some of the results (this is the best formatting I could manage, sorry!):

id | title | score

 - 277568 | Wilson Meadows: Live At The 15th Old School & Blues Festival | 0.1
 - 3545 | 5 Film Collection: Will Ferrell: Campaign / Old School (Unrtated Version) / Blades Of Glory / Roxbury / Semi-Pro | 0.1
 - 10366 | Alice Cooper: Old School: 1964-1974 (DVD/CD Combo) | 0.1
 - 13004 | American Classics: Old School (3-Disc Set) | 0.1
 - 13005 | American Classics: Old School: Classic Chevrolets | 0.1
 - 13006 | American Classics: Old School: Classic Travel Trailers | 0.1
 - 13007 | American Classics: Old School: Kings Of Kustomizing | 0.1
 - 14592 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Back-To-Back) | 0.1
 - 14593 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Side-By-Side) | 0.1
 - 20242 | Audiovisualize: Mixed By Addictive TV: Snake Worship Island / Corp. Inc. / Old School Futures / These Melodies / Robot War / ... | 0.1
 - 192057 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) | 0.1
 - 192058 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Back-To-Back) | 0.1
 - 192059 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Side-By-Side) | 0.1
 - 192060 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Back-To-Back) | 0.1
 - 192061 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Side-By-Side) | 0.1
 - 192062 | Old School (Warner Brothers/ R-Rated Version) | 0.1
 - 192063 | Old School (Warner Brothers/ R-Rated Version/ Blu-ray) | 0.1
 - 192064 | Old School (Warner Brothers/ Unrated Version) | 0.1
 - 192065 | Old School (Warner Brothers/ Unrated Version/ Blu-ray) | 0.1
 - 192066 | Old School Comedy (4-Pack): Atoll K / Jack And The Beanstalk / The Flying Deuces / Africa Screams | 0.1
 - 192067 | Old School Hip Hop Dance #1: Beginner | 0.1
 - 192068 | Old School Hip Hop Greatest | 0.1
 - 192069 | Old School Hip Hop: Run DMC & Flava Flav (2-Disc) | 0.1
 - 192070 | Old School Hits Movie Marathon Collection (3-Disc) | 0.1
 - 192071 | Old School Returns | 0.1

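All of these score 0.1, but in many of the titles the search words appear closer to the front of the string. Is there a way to rank those higher? Unfortunately, neither the length of the string nor the id is a good qualifier for ranking.

4 answers:

Answer 0: (score: 1)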

Here you need to use normalization with the ts_rank(tsvector, tsquery, normalization factor) function. In the snippet below I used 1 as the normalization (it divides the rank by 1 + the logarithm of the document length), but you can tune it to what you really need. Here is an example:

WITH s(id,tsv) AS ( VALUES
  (1,to_tsvector('english','Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
  (2,to_tsvector('english','American Classics: Old School: Kings Of Kustomizing')),
  (3,to_tsvector('english','Old School Hip Hop Greatest')),
  (4,to_tsvector('english','Old School Returns'))
)
SELECT id,ts_rank(tsv,tsq,1) AS rank
FROM s,to_tsquery('english','old & school') tsq
ORDER BY rank DESC;
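
To see how the normalization argument changes the scores, the same sample rows can be ranked with several flag values side by side. This is only a sketch: the column aliases are made up for illustration, and the flag meanings (0 ignores document length, 1 divides by 1 + the logarithm of the length, 2 divides by the length) are taken from the PostgreSQL text search documentation.

```sql
-- Sketch: compare normalization flags 0, 1 and 2 on a few titles from the question.
-- The aliases no_norm, by_log_length and by_length are illustrative names only.
WITH s(id, tsv) AS ( VALUES
  (1, to_tsvector('english', 'Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
  (2, to_tsvector('english', 'American Classics: Old School: Kings Of Kustomizing')),
  (3, to_tsvector('english', 'Old School Hip Hop Greatest')),
  (4, to_tsvector('english', 'Old School Returns'))
)
SELECT id,
       ts_rank(tsv, tsq, 0) AS no_norm,        -- document length ignored
       ts_rank(tsv, tsq, 1) AS by_log_length,  -- rank / (1 + log(length))
       ts_rank(tsv, tsq, 2) AS by_length       -- rank / length
FROM s, to_tsquery('english', 'old & school') AS tsq
WHERE tsv @@ tsq
ORDER BY by_log_length DESC;
```

Note that these flags favor shorter titles rather than earlier matches, so they only approximate ranking by position.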

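Answer 1: (score: 1)

A very old question, but: you can use ts_rank_cd() to take the distance between lexemes (keywords) into account. (I don't know how this is done.)

You can also set the flag value 4 in the normalization integer (it is a bitmask) to divide the rank by the mean harmonic distance between extents (this works only with ts_rank_cd).

I haven't looked into this much, but hopefully it is a starting point.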
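
As a minimal sketch of the normalization flag described in this answer, the query below computes ts_rank_cd with and without the value 4 for two titles taken from the question. The aliases plain_score and normalized_score are hypothetical. The flag relates to how close the matched terms are to each other, not to how early they appear in the title, so whether it changes the ordering for this data would need checking against the real table.

```sql
-- Sketch: ts_rank_cd with and without normalization flag 4.
-- Alias names are illustrative; sample titles come from the question.
WITH s(id, title) AS ( VALUES
  (10366,  'Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)'),
  (192071, 'Old School Returns')
)
SELECT id,
       title,
       ts_rank_cd(to_tsvector('english', title), tsq)    AS plain_score,
       ts_rank_cd(to_tsvector('english', title), tsq, 4) AS normalized_score
FROM s, plainto_tsquery('english', 'old school') AS tsq
WHERE to_tsvector('english', title) @@ tsq
ORDER BY normalized_score DESC;
```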

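Answer 2: (score: 0)

The documentation says:

  Also, * can be attached to a lexeme to specify prefix matching

  to_tsquery can also accept single-quoted phrases

You can do it like this: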

SELECT to_tsquery('''old school'':*');
      to_tsquery      
----------------------
 'old':* & 'school':*
(1 row)

So in your case it would look like this:

SELECT 
  collectibles.id,
  collectibles.title,
  ts_rank_cd(
    to_tsvector('english', collectibles.title),
    to_tsquery('''old school'':*')
  ) AS score
FROM collectibles
WHERE to_tsvector('english', collectibles.title) @@ to_tsquery('''old school'':*')
ORDER BY score DESC;

Answer 3: (score: 0)

I managed to do this by splitting the title into parts, taking the first word, and giving it a higher weight (A):

setweight(to_tsvector('english', split_part(coalesce("title", ''), ' ', 1) ), 'A') ||
setweight(to_tsvector('english', coalesce("title", '')), 'B')
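
The answer gives only the weighted tsvector expression. One way it might be plugged into a ranking query is sketched below, assuming the question's column names and using a small inline sample in place of the real collectibles table. ts_rank takes lexeme weights into account, so a title whose first word matches the query should score higher than one where the match appears later.

```sql
-- Sketch: rank with the first word of the title weighted 'A', the full title 'B'.
-- The inline VALUES list stands in for the real collectibles table.
WITH collectibles(id, title) AS ( VALUES
  (10366,  'Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)'),
  (192071, 'Old School Returns')
)
SELECT c.id,
       c.title,
       ts_rank(
         setweight(to_tsvector('english', split_part(coalesce(c.title, ''), ' ', 1)), 'A') ||
         setweight(to_tsvector('english', coalesce(c.title, '')), 'B'),
         plainto_tsquery('english', 'old school')
       ) AS score
FROM collectibles AS c
WHERE to_tsvector('english', coalesce(c.title, '')) @@ plainto_tsquery('english', 'old school')
ORDER BY score DESC;
```

This boosts only the first word of the title, so it separates titles that start with a query term from the rest but does not grade matches further into the string.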