MySQL5.7から実装されるInnoDBの全文検索 in CJKモードの挙動について。
Natural Language Full-Text Searchsについて。
12.9.1 Natural Language Full-Text Searches
http://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
By default or with the IN NATURAL LANGUAGE MODE modifier,
the MATCH() function performs a natural language search for a string against
a text collection. A collection is a set of one or more columns included in a FULLTEXT index.
The search string is given as the argument to AGAINST().
検証用DB、テーブル、データの作成。
5.7のngramはDefaultのpluginになっているのでそのままインデックスまで作成して確認。
mysql> CREATE DATABASE `ngram` /*!40100 DEFAULT CHARACTER SET utf8mb4 */; Query OK, 1 row affected (0.00 sec) mysql> CREATE TABLE N_DEMO -> ( -> FTS_N_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY, -> title VARCHAR(100), -> FULLTEXT INDEX ngram_idx(title) WITH PARSER ngram -> ) Engine=InnoDB CHARACTER SET utf8mb4; Query OK, 0 rows affected (0.10 sec) mysql> INSERT INTO N_DEMO (title) VALUES ('mysql'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO N_DEMO (title) VALUES ('MYSQL'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO N_DEMO (title) VALUES ('MySQL'); Query OK, 1 row affected (0.01 sec) mysql> INSERT INTO N_DEMO (title) VALUES ('マイエスキューエル'); Query OK, 1 row affected (0.01 sec) mysql> INSERT INTO N_DEMO (title) VALUES ('マイエスキューエル'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO N_DEMO (title) VALUES ('まいえすきゅーえる'); Query OK, 1 row affected (0.00 sec) mysql> show variables like 'ngram_token_size'; +------------------+-------+ | Variable_name | Value | +------------------+-------+ | ngram_token_size | 2 | +------------------+-------+ 1 row in set (0.00 sec) mysql> show variables like '%ft_min%'; +--------------------------+-------+ | Variable_name | Value | +--------------------------+-------+ | ft_min_word_len | 4 | | innodb_ft_min_token_size | 3 | +--------------------------+-------+ 2 rows in set (0.01 sec) mysql> SET GLOBAL innodb_ft_aux_table="ngram/N_DEMO"; Query OK, 0 rows affected (0.00 sec) mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE; +--------+--------------+-------------+-----------+--------+----------+ | WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION | +--------+--------------+-------------+-----------+--------+----------+ | my | 2 | 4 | 3 | 2 | 0 | | my | 2 | 4 | 3 | 3 | 0 | | my | 2 | 4 | 3 | 4 | 0 | | ql | 2 | 4 | 3 | 2 | 3 | | ql | 2 | 4 | 3 | 3 | 3 | | ql | 2 | 4 | 3 | 4 | 3 | | sq | 2 | 4 | 3 | 2 | 2 | | sq | 2 | 4 | 3 | 3 | 2 | | sq | 2 | 4 | 3 | 4 | 2 | | ys | 2 | 4 | 3 | 2 | 1 | | ys | 2 | 4 | 3 | 3 | 1 | | ys | 2 | 4 | 3 | 4 | 1 | | いえ | 7 | 7 | 1 | 7 | 3 | | えす | 7 | 7 | 1 | 7 | 6 | | える | 7 | 7 | 1 | 7 | 21 | | きゅ | 7 | 7 | 1 | 7 | 12 | | すき | 7 | 7 | 1 | 7 | 9 | | まい | 7 | 7 | 1 | 7 | 0 | | ゅー | 7 | 7 | 1 | 7 | 15 | | イエ | 5 | 5 | 1 | 5 | 3 | | エス | 5 | 5 | 1 | 5 | 6 | | エル | 5 | 5 | 1 | 5 | 21 | | キュ | 5 | 5 | 1 | 5 | 12 | | スキ | 5 | 5 | 1 | 5 | 9 | | マイ | 5 | 5 | 1 | 5 | 0 | | ュー | 5 | 5 | 1 | 5 | 15 | | ーえ | 7 | 7 | 1 | 7 | 18 | | ーエ | 5 | 5 | 1 | 5 | 18 | | ュー | 6 | 6 | 1 | 6 | 15 | | ーエ | 6 | 6 | 1 | 6 | 18 | | イエ | 6 | 6 | 1 | 6 | 3 | | エス | 6 | 6 | 1 | 6 | 6 | | エル | 6 | 6 | 1 | 6 | 21 | | キュ | 6 | 6 | 1 | 6 | 12 | | スキ | 6 | 6 | 1 | 6 | 9 | | マイ | 6 | 6 | 1 | 6 | 0 | +--------+--------------+-------------+-----------+--------+----------+ 36 rows in set (0.00 sec) mysql>
実際に全文検索(ngram)の動作を確認してみる。
mysql> select * from N_DEMO; +----------+-----------------------------+ | FTS_N_ID | title | +----------+-----------------------------+ | 1 | mysql | | 2 | MYSQL | | 3 | MySQL | | 4 | マイエスキューエル | | 5 | マイエスキューエル | | 6 | まいえすきゅーえる | +----------+-----------------------------+ 6 rows in set (0.00 sec) mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE); +----------+-------+ | FTS_N_ID | title | +----------+-------+ | 1 | mysql | | 2 | MYSQL | | 3 | MySQL | +----------+-------+ 3 rows in set (0.00 sec) mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('my' IN NATURAL LANGUAGE MODE); +----------+-------+ | FTS_N_ID | title | +----------+-------+ | 1 | mysql | | 2 | MYSQL | | 3 | MySQL | +----------+-------+ 3 rows in set (0.00 sec) mysql> SELECT FTS_N_ID,MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE) AS score FROM N_DEMO; +----------+--------------------+ | FTS_N_ID | score | +----------+--------------------+ | 1 | 0.1812381148338318 | | 2 | 0.1812381148338318 | | 3 | 0.1812381148338318 | | 4 | 0 | | 5 | 0 | | 6 | 0 | +----------+--------------------+ 6 rows in set (0.00 sec) mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('マイ' IN NATURAL LANGUAGE MODE); +----------+-----------------------------+ | FTS_N_ID | title | +----------+-----------------------------+ | 5 | マイエスキューエル | +----------+-----------------------------+ 1 row in set (0.00 sec) mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('マイ' IN NATURAL LANGUAGE MODE); +----------+-----------------------------+ | FTS_N_ID | title | +----------+-----------------------------+ | 4 | マイエスキューエル | +----------+-----------------------------+ 1 row in set (0.00 sec) mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('まい' IN NATURAL LANGUAGE MODE); +----------+-----------------------------+ | FTS_N_ID | title | +----------+-----------------------------+ | 6 | まいえすきゅーえる | +----------+-----------------------------+ 1 row in set (0.00 sec) mysql>
ngram_token_sizeがDefaultで2になっている時の挙動
2文字で設定されているので、検索結果も検索対象と同じ2文字を含むデータが検出されいる。
http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_ngram_token_size
mysql> INSERT INTO N_DEMO (title) VALUES ('まい'), ('えす'), ('sq'), ('sl'), ('ql'); Query OK, 5 rows affected (0.00 sec) Records: 5 Duplicates: 0 Warnings: 0 mysql> select * from N_DEMO; +----------+-----------------------------+ | FTS_N_ID | title | +----------+-----------------------------+ | 1 | mysql | | 2 | MYSQL | | 3 | MySQL | | 4 | マイエスキューエル | | 5 | マイエスキューエル | | 6 | まいえすきゅーえる | | 7 | まい | | 8 | えす | | 9 | sq | | 10 | sl | | 11 | ql | +----------+-----------------------------+ 11 rows in set (0.00 sec) mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE); +----------+-------+ | FTS_N_ID | title | +----------+-------+ | 1 | mysql | | 2 | MYSQL | | 3 | MySQL | | 9 | sq | | 11 | ql | +----------+-------+ 5 rows in set (0.00 sec) mysql> SELECT FTS_N_ID,MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE) AS score FROM N_DEMO; +----------+---------------------+ | FTS_N_ID | score | +----------+---------------------+ | 1 | 0.38602644205093384 | | 2 | 0.38602644205093384 | | 3 | 0.38602644205093384 | | 4 | 0 | | 5 | 0 | | 6 | 0 | | 7 | 0 | | 8 | 0 | | 9 | 0.19301322102546692 | | 10 | 0 | | 11 | 0.19301322102546692 | +----------+---------------------+ 11 rows in set (0.01 sec) mysql>
ngram_token_sizeを2から3へ変更して再度検索した結果
– この値の変更には、SQLインスタンスの再起動が必要です。
– インデックスの再作成も必要になります。
先程と同じ検索をしても、結果が異なる事が確認出来る。
root@localhost [ngram]> system cat /etc/my.cnf | grep ngram_token_size ngram_token_size=3 root@localhost [ngram]> show variables like 'ngram_token_size'; +------------------+-------+ | Variable_name | Value | +------------------+-------+ | ngram_token_size | 3 | +------------------+-------+ 1 row in set (0.00 sec) root@localhost [ngram]> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE); Empty set (0.00 sec) root@localhost [ngram]> alter table N_DEMO drop index ngram_idx; Query OK, 0 rows affected (0.14 sec) Records: 0 Duplicates: 0 Warnings: 0 root@localhost [ngram]> alter table N_DEMO ADD FULLTEXT INDEX ngram_idx (title) WITH PARSER ngram; Query OK, 0 rows affected (0.12 sec) Records: 0 Duplicates: 0 Warnings: 0 root@localhost [ngram]> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE); +----------+-------+ | FTS_N_ID | title | +----------+-------+ | 1 | mysql | | 2 | MYSQL | | 3 | MySQL | +----------+-------+ 3 rows in set (0.00 sec) root@localhost [ngram]> SELECT FTS_N_ID,MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE) AS score FROM N_DEMO; +----------+--------------------+ | FTS_N_ID | score | +----------+--------------------+ | 1 | 0.3184022605419159 | | 2 | 0.3184022605419159 | | 3 | 0.3184022605419159 | | 4 | 0 | | 5 | 0 | | 6 | 0 | | 7 | 0 | | 8 | 0 | | 9 | 0 | | 10 | 0 | | 11 | 0 | +----------+--------------------+ 11 rows in set (0.00 sec) root@localhost [ngram]>