Question

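I upgraded a table from MyISAM to InnoDB, but it no longer behaves the same way. InnoDB returns a score of 0 where there should be a match. The MyISAM table returns a match for the same term (I kept a copy of the old table, so I can still run the same query).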

SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table_myisam
where id = 1;

returns:

+-------+
| score |
+-------+
|     1 |
+-------+

but:

SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table
where id = 1;

returns:

+-------+
| score |
+-------+
|     0 |
+-------+

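I thought ex might not be indexed because innodb_ft_min_token_size was set to 3. I lowered it to 1 and optimized the table, but that had no effect. The column content is 99 characters long, so I presumed the whole column was not being indexed because of innodb_ft_max_token_size. I increased that to 150 as well and ran the optimize again, but the result was still the same.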

The only differences between these tables are the engine and the character set. This table uses utf8; the MyISAM table uses latin1.

Has anyone seen this behavior, or does anyone have advice on how to resolve it?

Update: I added ft_stopword_file="" to my.cnf and ran OPTIMIZE TABLE table again. This time I got

optimize | note | Table does not support optimize, doing recreate + analyze instead

After this change the query worked. Ex is not a stop word, though, so I am not sure why it made a difference.

A new query that fails is:

SELECT MATCH (Columns) AGAINST ('+Term +Ex +in' IN BOOLEAN MODE) as score FROM Table where id = 1;

+-------+
| score |
+-------+
|     0 |
+-------+

The in causes this to fail, even though it is the next word in my table.

SELECT MATCH (Columns) AGAINST ('+Term +Ex' IN BOOLEAN MODE) as score FROM Table where id = 1;

+--------------------+
| score              |
+--------------------+
| 219.30206298828125 |
+--------------------+

I also tried CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB; and then updated my.cnf with innodb_ft_server_stopword_table='db/my_stopwords'. I restarted and ran:

show variables like 'innodb_ft_server_stopword_table';

which returned

+---------------------------------+---------------------------+
| Variable_name                   | Value                     |
+---------------------------------+---------------------------+
| innodb_ft_server_stopword_table | 'db/my_stopwords'; |
+---------------------------------+---------------------------+

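So I would expect in to no longer make the query fail, but it still does. I also tried OPTIMIZE TABLE table again, and even ALTER TABLE table DROP INDEX ... followed by ALTER TABLE table ADD FULLTEXT KEY ..., none of which had any effect.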

Second update: the issue is with the stop words.

$userinput = preg_replace('/\b(a|about|an|are|as|at|be|by|com|de|en|for|from|how|i|in|is|it|la|of|on|or|that|the|this|to|was|what|when|where|who|will|with|und|the|www)\b/', '', $userinput);

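resolves the issue, but that does not seem like a good solution to me. I would like a solution within MySQL that keeps the stop words from breaking the query.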

Stopword table data:

CREATE TABLE `my_stopwords` (
  `value` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

and

           Name: my_stopwords
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 0
 Avg_row_length: 0
    Data_length: 16384
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2019-04-09 17:39:55
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: 
        Comment:

Answer 1

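There are several differences between MyISAM's FULLTEXT and InnoDB's. I think you were caught by the handling of "short" words and/or stop words. MyISAM will show the rows, but InnoDB will fail to.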

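What I have done when using FT (and after switching to InnoDB) is to filter the user's input so that short words are not required. It takes extra effort, but it gets me the rows I want. My case is slightly different, since the resulting query looks like the ones below. Note that I add + to require a word, but not on words shorter than 3 characters (my ft_min_token_size is 3). These searches were for build a table and build the table: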

WHERE match(description) AGAINST('+build* a +table*' IN BOOLEAN MODE)
WHERE match(description) AGAINST('+build* +the* +table*' IN BOOLEAN MODE)

(The trailing * may be redundant; I have not investigated that.)
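The filtering described above could be sketched roughly as follows in Python. This is only an illustration, not the author's actual code: the function name `build_boolean_query` and the `min_token_size` parameter are made up for this example, and the sketch assumes that plain word characters are all that should survive from the user's input.

```python
import re


def build_boolean_query(user_input: str, min_token_size: int = 3) -> str:
    """Build a BOOLEAN MODE search string from free-form user input.

    Words at least `min_token_size` characters long are required (+) and
    given a trailing wildcard (*); shorter words are passed through
    without an operator so they cannot make the whole match fail.
    """
    # \w+ keeps letters, digits and underscores; this also drops any
    # boolean-mode operators (+ - * " etc.) the user may have typed.
    words = re.findall(r"\w+", user_input.lower())
    parts = []
    for word in words:
        if len(word) >= min_token_size:
            parts.append(f"+{word}*")
        else:
            parts.append(word)
    return " ".join(parts)


print(build_boolean_query("build a table"))    # +build* a +table*
print(build_boolean_query("build the table"))  # +build* +the* +table*
```

The resulting string would then be passed to AGAINST (... IN BOOLEAN MODE) as a bound parameter rather than concatenated into the SQL text. Note that this sketch only handles short words; a stop word such as "the" is still required here, exactly as in the second query above.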

Another approach

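Since FT is very efficient with words that are neither short nor stop words, do the search in two phases, each of which is optional. To search for "a long word", do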

WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
  AND d REGEXP '[[:<:]]a[[:>:]]'

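The first part rapidly whittles down the candidate rows by looking for "long" and "word" (as words). The second part makes sure the string also contains the word a. REGEXP is costly, but it is applied only to the rows that pass the first test.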

To search for just "long word":

WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)

To search for just the word "a":

WHERE d REGEXP '[[:<:]]a[[:>:]]'

Caveat: this case will be slow.

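Note: my examples allow the words to appear in any order and at any position in the string. That is, this string will match in all of my examples: "She waited a long time for a word from him."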
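The two-phase idea above can be sketched as a small query builder in Python. Everything here is an assumption made for illustration: the function name `build_two_phase_where`, the `%s` placeholder style (the one used by common Python MySQL drivers), and the restriction of input to ASCII letters and digits so that nothing needs escaping inside the REGEXP pattern. The column name is interpolated directly and must therefore come from trusted code, never from user input.

```python
import re


def build_two_phase_where(user_input, column="d", min_token_size=3):
    """Return (where_clause, params) for the two-phase search.

    Phase 1: words long enough to be indexed go into one MATCH ... AGAINST.
    Phase 2: each shorter word gets its own word-boundary REGEXP test.
    Either phase is omitted when it has no words.
    """
    # ASCII letters and digits only, so a word can be embedded in a
    # REGEXP pattern without escaping (an assumption of this sketch).
    words = re.findall(r"[A-Za-z0-9]+", user_input.lower())
    long_words = [w for w in words if len(w) >= min_token_size]
    short_words = [w for w in words if len(w) < min_token_size]

    clauses, params = [], []
    if long_words:
        clauses.append(f"MATCH({column}) AGAINST (%s IN BOOLEAN MODE)")
        params.append(" ".join(f"+{w}" for w in long_words))
    for w in short_words:
        clauses.append(f"{column} REGEXP %s")
        # [[:<:]] and [[:>:]] are the word-boundary markers used above.
        params.append(f"[[:<:]]{w}[[:>:]]")

    if not clauses:
        return "", []
    return "WHERE " + "\n  AND ".join(clauses), params


where, params = build_two_phase_where("a long word")
print(where)
print(params)
```

The `[[:<:]]` and `[[:>:]]` markers work on the MySQL 5.6 generation discussed in this thread. MySQL 8.0 switched its regular expression engine to ICU, where those markers are not supported and `\\b` is used for word boundaries instead.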

Answer 2

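Here is a step-by-step procedure that should reproduce your problem. (This is actually how you should have written your question.) The environment is a freshly installed VM with Debian 9.8 and Percona Server Ver 5.6.43-84.3.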

1. Create an InnoDB table with a fulltext index and some dummy data:

create table test.ft_innodb (
    txt text,
    fulltext index (txt)
) engine=innodb charset=utf8 collate=utf8_unicode_ci;

insert into test.ft_innodb (txt) values
    ('Some dummy text'),
    ('Text with a long and short stop words in it ex');

2. Execute a test query to verify that it does not yet work the way we need:

select txt
    , match(t.txt) against ('+some' in boolean mode) as score0
    , match(t.txt) against ('+with' in boolean mode) as score1
    , match(t.txt) against ('+in'   in boolean mode) as score2
    , match(t.txt) against ('+ex'   in boolean mode) as score3
from test.ft_innodb t;

Result (rounded):

txt                                            | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text                                | 0.0906 | 0      | 0      | 0
Text with a long and short stop words in it ex | 0      | 0      | 0      | 0

As you can see, it works neither with a stop word ("+with") nor with a short word ("+ex").

3. Create an empty InnoDB table for custom stop words:

create table test.my_stopwords (value varchar(30)) engine=innodb;

4. Edit /etc/mysql/my.cnf and add the following two lines to the [mysqld] block:

[mysqld]
# other settings
innodb_ft_server_stopword_table = "test/my_stopwords"
innodb_ft_min_token_size = 1

5. Restart MySQL with service mysql restart.
6. Run the query from (2.) again (the result should be the same).
7. Rebuild the fulltext index using
```
optimize table test.ft_innodb;
```
This will actually rebuild the entire table, including all indexes.

8. Execute the test query from (2.) again. Now the result is:

txt                                            | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text                                | 0.0906 | 0      | 0      | 0
Text with a long and short stop words in it ex | 0      | 0.0906 | 0.0906 | 0.0906

You see that it works just fine for me, and it is quite simple to reproduce. (Again: this is how you should have written your question.)

Since your procedure is chaotic rather than detailed, it is difficult to say what might have gone wrong for you. For example:

CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;

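This does not tell us in which database you defined that table. Note that I have prefixed all my tables with the corresponding database. Now consider the following: I change my.cnf and set innodb_ft_server_stopword_table = "db/my_stopwords". Note that there is no such table on my server (not even a schema named db exists). I restart the MySQL server and check the new setting with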

show variables like 'innodb_ft_server_stopword_table';

This returns:

    Variable_name                   | Value
    --------------------------------|----------------
    innodb_ft_server_stopword_table | db/my_stopwords

After optimize table test.ft_innodb; the test query returns the following:

    txt                                            | score0 | score1 | score2 | score3
    -----------------------------------------------|--------|--------|--------|-------
    Some dummy text                                | 0.0906 | 0      | 0      | 0
    Text with a long and short stop words in it ex | 0      | 0      | 0      | 0.0906

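Do you see? It no longer works with stop words. But it still works with short non-stop words such as "+ex". So make sure that the table you set in innodb_ft_server_stopword_table actually exists.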
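One detail worth comparing: the SHOW VARIABLES output in the question displays the value as 'db/my_stopwords'; (with quotes and a semicolon), whereas the output above shows plain db/my_stopwords. That may mean stray characters from my.cnf ended up inside the value, in which case it cannot name an existing table. A minimal Python sketch of a shape check follows; the helper name `looks_like_stopword_table` is hypothetical, and it only tests that the value has the db_name/table_name form, not that the table exists on the server.

```python
import re


def looks_like_stopword_table(value: str) -> bool:
    """Check that a setting value has the plain db_name/table_name shape.

    Only unquoted identifier characters are accepted, so quotes,
    semicolons and whitespace that leaked into the value are rejected.
    """
    pattern = r"[A-Za-z0-9_$]+/[A-Za-z0-9_$]+"
    return re.fullmatch(pattern, value) is not None


# Value as displayed in this answer versus the one shown in the question.
print(looks_like_stopword_table("db/my_stopwords"))
print(looks_like_stopword_table("'db/my_stopwords';"))
```

The first call accepts the value and the second rejects it. Whether the table really exists still has to be checked on the server itself.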

Answer 3

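A common search technique is to add an extra column that holds a "sanitized" copy of the string to search in. The FULLTEXT index is then added to that column instead of the original one.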

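In your case, removing the stop words is the main difference. But there may also be punctuation that could (or should) be removed. Hyphenated words, contractions, part numbers, and model numbers sometimes cause trouble. They can be modified, by changing the punctuation or spacing, to better suit FT's requirements and/or the way users type their input. Another option is to add words to the search column that are common misspellings of the words already in it.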

Sure, this is extra work for you to do. But I think it provides a viable solution.
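As a rough illustration of what filling such a sanitized column might look like, here is a minimal Python sketch. The function name `sanitize_for_fulltext` is made up, the stop word set is simply the word list from the preg_replace in the question, and joining hyphenated tokens (so that x-100 becomes x100) is just one possible choice for part and model numbers.

```python
import re

# Word list taken from the preg_replace call in the question.
STOPWORDS = frozenset(
    "a about an are as at be by com de en for from how i in is it la of on "
    "or that the this to was what when where who will with und www".split()
)


def sanitize_for_fulltext(text, stopwords=STOPWORDS):
    """Return a normalized copy of `text` for a FULLTEXT-indexed column."""
    text = text.lower()
    # Join hyphenated tokens: "x-100" -> "x100" (a design choice, see above).
    text = re.sub(r"(?<=\w)-(?=\w)", "", text)
    # Replace any remaining run of punctuation with a single space.
    text = re.sub(r"[^a-z0-9\s]+", " ", text)
    # Drop stop words and collapse whitespace.
    return " ".join(w for w in text.split() if w not in stopwords)


print(sanitize_for_fulltext("Text with a long and short stop words in it ex"))
print(sanitize_for_fulltext("Part-No. X-100, the model!"))
```

The same function would have to be applied to the user's search terms before they are sent to MATCH ... AGAINST, so that the query and the column are normalized in the same way.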

Full-text search fails in InnoDB, MyISAM returns results
