Question

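I'm putting UTF-8 encoded data into a database table configured to use the utf8 character set, but when I run a full-text search, it does not match the word immediately before a non-breaking space.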

For example, for formatting reasons we have a non-breaking space in "hepatitis B". When searching for hepatitis, this string is not matched.

CREATE TABLE `search` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` text COLLATE utf8_unicode_ci,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title` (`title`)
) ENGINE=MyISAM AUTO_INCREMENT=202337 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

This query returns nothing:

SELECT 
  title, 
  MATCH(title) AGAINST ('hepatitis') AS `titleScore` 
FROM 
  `search` 
WHERE 
  MATCH(title) AGAINST ("hepatitis")
ORDER BY 
  `titleScore` DESC LIMIT 10;

But this query returns the following:

SELECT
  title
FROM
  search
WHERE
  title LIKE "%hepatitis%";

+-------------------------------------------------------------------------+
| title                                                                   |
+-------------------------------------------------------------------------+
| Comparison of drugs for chronic HBeAg-positive hepatitisÂ B             |
| Antivirals in chronic hepatitisÂ C                                      |
| Chronic hepatitisÂ C                                                    |
| Antivirals for hepatitisÂ C                                             |
| Antivirals for hepatitisÂ B                                             |
| Other antivirals for hepatitisÂ C                                       |
| Chronic hepatitisÂ B                                                    |
| HepatitisÂ A vaccine                                                    |
| HepatitisÂ B vaccine                                                    |
| HepatitisÂ B immunoglobulin                                             |
| HepatitisÂ C virus protease inhibitors, see  HCV-protease inhibitors    |
+-------------------------------------------------------------------------+

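According to "The Full-Text Stuff That We Didn't Put in the Manual" (http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/full-text-revealed.html#breaking), full-text search should treat only alphanumeric characters as parts of a word, and should therefore break on a non-breaking space (although the article does not explicitly mention the non-breaking space character itself).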
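As a rough illustration of the alphanumeric-only word splitting described above, the following Python sketch uses the regular expression `\w+` as a stand-in for the word breaker. This is an assumption made for illustration only, not MySQL's actual parser, and the helper name `split_words` is hypothetical. It suggests that a non-breaking space on its own would separate words, whereas a letter such as Â sitting directly before it stays attached to the preceding word.

```python
import re

NBSP = "\u00a0"  # non-breaking space

def split_words(text):
    """Approximate word splitting: keep runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)

# Title with a plain non-breaking space between "hepatitis" and "B".
clean = "Chronic hepatitis" + NBSP + "B"
# Title as it is displayed in the LIKE result table, with a stray Â (U+00C2).
garbled = "Chronic hepatitis\u00c2" + NBSP + "B"

print(split_words(clean))    # ['Chronic', 'hepatitis', 'B']
print(split_words(garbled))  # ['Chronic', 'hepatitisÂ', 'B']
```

Under this approximation, a search for the word hepatitis would find the first title but not the second, because the second title contains the word "hepatitisÂ" instead.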

I did find a comment on the MySQL manual page - http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

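"To make FULLTEXT MATCH work with Japanese UTF-8 text, be careful that the words in your Japanese text are separated by the ASCII space character, not by Japanese UTF-8 (or other) spacing characters. (When using phpMyAdmin to manage data or write an SQL query, you must switch away from your Japanese IME to insert a space char ...)"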

Following the MySQL manual, I created a new collation with the following rules:

<charset name="utf8">
  ...
  <collation name="utf8_custom" id="1001">
    <rules>
      <reset>\u0020</reset> <!-- ascii space character -->
      <i>\u00A0</i>         <!-- non-breaking space -->
      <reset>A</reset>      <!-- test -->
      <i>B</i>
    </rules>
  </collation>
</charset>

I restarted the server and then confirmed that the collation was available with show collation like 'utf8_custom';

Then I altered the table to use the new collation, and rebuilt the index with a repair table for good measure.

SELECT title FROM search WHERE "Hepatitis A vaccine"; still returns no results.

SELECT title FROM search WHERE "HepatitisÂ A vaccine"; does return results - two of them, in fact:

 +------------------------+
 | title                  |
 +------------------------+
 | HepatitisÂ A vaccine   |
 | HepatitisÂ B vaccine   |
 +------------------------+

This shows that the collation rule making B equal to A is respected, but the rule for the non-breaking space is not.

The Â bothers me - my table is utf8, my client is utf8, and the source data is utf8. I don't think I should be seeing this character.

Answer 1

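The problem was in the step that writes the search data to the database - I had to issue SET NAMES "utf8" (or the Zend / PDO equivalent) to ensure that utf8 strings sent to a utf8 table are also transported as utf8.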
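A minimal Python sketch of what probably went wrong, assuming the connection had been treating the incoming bytes as latin1 (an assumption; the previous connection charset is not stated here). The UTF-8 encoding of a non-breaking space is the two bytes C2 A0, and reading those bytes as latin1 yields Â followed by a non-breaking space, which would account for the Â in the result tables.

```python
NBSP = "\u00a0"  # non-breaking space

title = "Hepatitis" + NBSP + "A vaccine"

# Bytes the client sends when the source data is UTF-8.
utf8_bytes = title.encode("utf-8")

# What a connection assumed to be latin1 believes it received.
misread = utf8_bytes.decode("latin-1")

print(NBSP.encode("utf-8"))  # b'\xc2\xa0'
print(repr(misread))         # 'HepatitisÂ\xa0A vaccine'

# Reversing the misinterpretation recovers the original text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == title)     # True
```

With the connection charset set to utf8, the bytes are interpreted as intended and no extra Â is stored.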

Adding a charset = 'utf8' parameter to the database configuration in my Zend application.ini solved the problem.

MySQL full text search, collation and non-breaking space
