Question

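I have a table of movies, and I want to search the titles and return the closest matches.

I thought full-text search might help, but it doesn't seem able to rank by the position of the words, even though Postgres knows the positions. Is this possible in Postgres?

Here is my query: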

SELECT collectibles.id, collectibles.title, ts_rank_cd(to_tsvector('english', collectibles.title), plainto_tsquery('old school')) AS score
FROM collectibles WHERE to_tsvector('english', collectibles.title) @@ plainto_tsquery('old school')
ORDER BY score DESC;

Here are some of the results (this is the best formatting I could manage, sorry!):

id | title | score

 - 277568 | Wilson Meadows: Live At The 15th Old School & Blues Festival | 0.1
 - 3545 | 5 Film Collection: Will Ferrell: Campaign / Old School (Unrtated Version) / Blades Of Glory / Roxbury / Semi-Pro | 0.1
 - 10366 | Alice Cooper: Old School: 1964-1974 (DVD/CD Combo) | 0.1
 - 13004 | American Classics: Old School (3-Disc Set) | 0.1
 - 13005 | American Classics: Old School: Classic Chevrolets | 0.1
 - 13006 | American Classics: Old School: Classic Travel Trailers | 0.1
 - 13007 | American Classics: Old School: Kings Of Kustomizing | 0.1
 - 14592 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Back-To-Back) | 0.1
 - 14593 | Anchorman: The Legend Of Ron Burgundy (Widescreen/ Extended Edition) / Old School (R-Rated Version) (Side-By-Side) | 0.1
 - 20242 | Audiovisualize: Mixed By Addictive TV: Snake Worship Island / Corp. Inc. / Old School Futures / These Melodies / Robot War / ... | 0.1
 - 192057 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) | 0.1
 - 192058 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Back-To-Back) | 0.1
 - 192059 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (R-Rated) (Side-By-Side) | 0.1
 - 192060 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Back-To-Back) | 0.1
 - 192061 | Old School (DreamWorks/ Widescreen/ Unrated Version/ Special Edition) / Road Trip (Unrated) (Side-By-Side) | 0.1
 - 192062 | Old School (Warner Brothers/ R-Rated Version) | 0.1
 - 192063 | Old School (Warner Brothers/ R-Rated Version/ Blu-ray) | 0.1
 - 192064 | Old School (Warner Brothers/ Unrated Version) | 0.1
 - 192065 | Old School (Warner Brothers/ Unrated Version/ Blu-ray) | 0.1
 - 192066 | Old School Comedy (4-Pack): Atoll K / Jack And The Beanstalk / The Flying Deuces / Africa Screams | 0.1
 - 192067 | Old School Hip Hop Dance #1: Beginner | 0.1
 - 192068 | Old School Hip Hop Greatest | 0.1
 - 192069 | Old School Hip Hop: Run DMC & Flava Flav (2-Disc) | 0.1
 - 192070 | Old School Hits Movie Marathon Collection (3-Disc) | 0.1
 - 192071 | Old School Returns | 0.1

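All of these score 0.1, but in many of the titles the matching words sit closer to the front of the string. Is there a way to rank those higher? Unfortunately, neither the length of the string nor the id is a good qualifier for ranking.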

Answer 1

Here you need to use normalization with the ts_rank(tsvector, tsquery, normalization) function. In the snippet below I use normalization = 1 (which divides the rank by 1 + the logarithm of the document length), but you can tune it to what you really need. Here is an example:

WITH s(id,tsv) AS ( VALUES
  (1,to_tsvector('english','Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
  (2,to_tsvector('english','American Classics: Old School: Kings Of Kustomizing')),
  (3,to_tsvector('english','Old School Hip Hop Greatest')),
  (4,to_tsvector('english','Old School Returns'))
)
SELECT id,ts_rank(tsv,tsq,1) AS rank
FROM s,to_tsquery('english','old & school') tsq
ORDER BY rank DESC;
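
As a rough sketch, the same normalization flag could be applied to the query from the question like this. The table and column names come from the question; the aliases c and q are only introduced here for illustration.

```sql
-- Sketch only: ts_rank with normalization flag 1
-- (rank divided by 1 + log(document length)).
SELECT c.id,
       c.title,
       ts_rank(to_tsvector('english', c.title), q, 1) AS score
FROM collectibles AS c,
     plainto_tsquery('english', 'old school') AS q
WHERE to_tsvector('english', c.title) @@ q
ORDER BY score DESC;
```

Keep in mind that flag 1 normalizes by document length, so it tends to favor shorter titles rather than titles where the match appears earlier.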

Answer 2

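Very old question, but: you can use ts_rank_cd() to take the distance between the lexemes (keywords) into account. (I don't know how this is done internally.)

You can also pass 4 as the normalization integer (it is a bitmask) to divide the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd).

I haven't looked into this much, but hopefully it's a starting point.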

Postgres Documentation--Ranking
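
A minimal sketch of what that might look like, using two of the titles from the question as inline test data. The CTE name s and the alias tsq are illustrative. Because the normalization argument is a bitmask, flags can be combined with |, for example 4|1.

```sql
-- Sketch only: ts_rank_cd with normalization flag 4
-- (rank divided by the mean harmonic distance between extents).
WITH s(id, tsv) AS ( VALUES
  (1, to_tsvector('english', 'Alice Cooper: Old School: 1964-1974 (DVD/CD Combo)')),
  (2, to_tsvector('english', 'Old School Returns'))
)
SELECT id, ts_rank_cd(tsv, tsq, 4) AS rank
FROM s, to_tsquery('english', 'old & school') AS tsq
WHERE tsv @@ tsq
ORDER BY rank DESC;
```

As far as I can tell, this accounts for how close the matched words are to each other, not for how close they are to the start of the title, so it may need to be combined with another approach.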

Answer 3

The documentation says:

Also, * can be attached to a lexeme to specify prefix matching

and

to_tsquery can also accept single-quoted phrases

So you can do this:

SELECT to_tsquery('''old school'':*');
      to_tsquery      
----------------------
 'old':* & 'school':*
(1 row)

So in your case it would look like this:

SELECT 
  collectibles.id,
  collectibles.title,
  ts_rank_cd(
    to_tsvector('english', collectibles.title),
    to_tsquery('''old school'':*')
  ) AS score
FROM collectibles
WHERE to_tsvector('english', collectibles.title) @@ to_tsquery('''old school'':*')
ORDER BY score DESC;

Answer 4

I managed to get this working by splitting the title into parts, taking the first word, and giving it a higher priority (weight A):

setweight(to_tsvector('english', split_part(coalesce("title", ''), ' ', 1) ), 'A') ||
setweight(to_tsvector('english', coalesce("title", '')), 'B')
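
For context, here is one way the weighted vector above might be plugged into the query from the question. This is a sketch under assumptions: it uses ts_rank with its default weights and the alias c, neither of which the answer specifies.

```sql
-- Sketch only: the first word of the title gets weight A, the full
-- title gets weight B, so a match on the first word ranks higher.
SELECT c.id,
       c.title,
       ts_rank(
         setweight(to_tsvector('english', split_part(coalesce(c.title, ''), ' ', 1)), 'A') ||
         setweight(to_tsvector('english', coalesce(c.title, '')), 'B'),
         plainto_tsquery('english', 'old school')
       ) AS score
FROM collectibles AS c
WHERE to_tsvector('english', coalesce(c.title, '')) @@ plainto_tsquery('english', 'old school')
ORDER BY score DESC;
```

Only the first word is boosted here, so a title starting with "Old School" gets the extra weight for "old" but not for "school".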

Postgres full-text search: ranking by position
