MySQL5.7から実装されるInnoDBの全文検索 in CJKモードの挙動について。
Natural Language Full-Text Searchsについて。

12.9.1 Natural Language Full-Text Searches
http://dev.mysql.com/doc/refman/5.7/en/fulltext-natural-language.html
By default or with the IN NATURAL LANGUAGE MODE modifier,
the MATCH() function performs a natural language search for a string against
a text collection. A collection is a set of one or more columns included in a FULLTEXT index.
The search string is given as the argument to AGAINST().

テストバージョン
version

検証用DB、テーブル、データの作成。
5.7のngramはDefaultのpluginになっているのでそのままインデックスまで作成して確認。

mysql>  CREATE DATABASE `ngram` /*!40100 DEFAULT CHARACTER SET utf8mb4 */;
Query OK, 1 row affected (0.00 sec)


mysql> CREATE TABLE N_DEMO
    -> (
    ->         FTS_N_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->         title VARCHAR(100),
    ->         FULLTEXT INDEX ngram_idx(title) WITH PARSER ngram
    -> ) Engine=InnoDB CHARACTER SET utf8mb4;
Query OK, 0 rows affected (0.10 sec)

mysql> INSERT INTO N_DEMO (title) VALUES ('mysql');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO N_DEMO (title) VALUES ('MYSQL');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO N_DEMO (title) VALUES ('MySQL');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO N_DEMO (title) VALUES ('マイエスキューエル');
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO N_DEMO (title) VALUES ('マイエスキューエル');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO N_DEMO (title) VALUES ('まいえすきゅーえる');
Query OK, 1 row affected (0.00 sec)

mysql> show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 2     |
+------------------+-------+
1 row in set (0.00 sec)

mysql> show variables like '%ft_min%';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| ft_min_word_len          | 4     |
| innodb_ft_min_token_size | 3     |
+--------------------------+-------+
2 rows in set (0.01 sec)

mysql> SET GLOBAL innodb_ft_aux_table="ngram/N_DEMO";
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| my     |            2 |           4 |         3 |      2 |        0 |
| my     |            2 |           4 |         3 |      3 |        0 |
| my     |            2 |           4 |         3 |      4 |        0 |
| ql     |            2 |           4 |         3 |      2 |        3 |
| ql     |            2 |           4 |         3 |      3 |        3 |
| ql     |            2 |           4 |         3 |      4 |        3 |
| sq     |            2 |           4 |         3 |      2 |        2 |
| sq     |            2 |           4 |         3 |      3 |        2 |
| sq     |            2 |           4 |         3 |      4 |        2 |
| ys     |            2 |           4 |         3 |      2 |        1 |
| ys     |            2 |           4 |         3 |      3 |        1 |
| ys     |            2 |           4 |         3 |      4 |        1 |
| いえ   |            7 |           7 |         1 |      7 |        3 |
| えす   |            7 |           7 |         1 |      7 |        6 |
| える   |            7 |           7 |         1 |      7 |       21 |
| きゅ   |            7 |           7 |         1 |      7 |       12 |
| すき   |            7 |           7 |         1 |      7 |        9 |
| まい   |            7 |           7 |         1 |      7 |        0 |
| ゅー   |            7 |           7 |         1 |      7 |       15 |
| イエ   |            5 |           5 |         1 |      5 |        3 |
| エス   |            5 |           5 |         1 |      5 |        6 |
| エル   |            5 |           5 |         1 |      5 |       21 |
| キュ   |            5 |           5 |         1 |      5 |       12 |
| スキ   |            5 |           5 |         1 |      5 |        9 |
| マイ   |            5 |           5 |         1 |      5 |        0 |
| ュー   |            5 |           5 |         1 |      5 |       15 |
| ーえ   |            7 |           7 |         1 |      7 |       18 |
| ーエ   |            5 |           5 |         1 |      5 |       18 |
| ュー     |            6 |           6 |         1 |      6 |       15 |
| ーエ     |            6 |           6 |         1 |      6 |       18 |
| イエ     |            6 |           6 |         1 |      6 |        3 |
| エス     |            6 |           6 |         1 |      6 |        6 |
| エル     |            6 |           6 |         1 |      6 |       21 |
| キュ     |            6 |           6 |         1 |      6 |       12 |
| スキ     |            6 |           6 |         1 |      6 |        9 |
| マイ     |            6 |           6 |         1 |      6 |        0 |
+--------+--------------+-------------+-----------+--------+----------+
36 rows in set (0.00 sec)

mysql> 

実際に全文検索(ngram)の動作を確認してみる。

mysql> select * from N_DEMO;
+----------+-----------------------------+
| FTS_N_ID | title                       |
+----------+-----------------------------+
|        1 | mysql                       |
|        2 | MYSQL                       |
|        3 | MySQL                       |
|        4 | マイエスキューエル          |
|        5 | マイエスキューエル                   |
|        6 | まいえすきゅーえる          |
+----------+-----------------------------+
6 rows in set (0.00 sec)

mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE);
+----------+-------+
| FTS_N_ID | title |
+----------+-------+
|        1 | mysql |
|        2 | MYSQL |
|        3 | MySQL |
+----------+-------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('my' IN NATURAL LANGUAGE MODE);
+----------+-------+
| FTS_N_ID | title |
+----------+-------+
|        1 | mysql |
|        2 | MYSQL |
|        3 | MySQL |
+----------+-------+
3 rows in set (0.00 sec)

mysql> SELECT FTS_N_ID,MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE) AS score FROM N_DEMO;
+----------+--------------------+
| FTS_N_ID | score              |
+----------+--------------------+
|        1 | 0.1812381148338318 |
|        2 | 0.1812381148338318 |
|        3 | 0.1812381148338318 |
|        4 |                  0 |
|        5 |                  0 |
|        6 |                  0 |
+----------+--------------------+
6 rows in set (0.00 sec)

mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('マイ' IN NATURAL LANGUAGE MODE);
+----------+-----------------------------+
| FTS_N_ID | title                       |
+----------+-----------------------------+
|        5 | マイエスキューエル                   |
+----------+-----------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('マイ' IN NATURAL LANGUAGE MODE);
+----------+-----------------------------+
| FTS_N_ID | title                       |
+----------+-----------------------------+
|        4 | マイエスキューエル          |
+----------+-----------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('まい' IN NATURAL LANGUAGE MODE);
+----------+-----------------------------+
| FTS_N_ID | title                       |
+----------+-----------------------------+
|        6 | まいえすきゅーえる          |
+----------+-----------------------------+
1 row in set (0.00 sec)

mysql> 

NGRAM

ngram_token_sizeがDefaultで2になっている時の挙動
2文字で設定されているので、検索結果も検索対象と同じ2文字を含むデータが検出されいる。
http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_ngram_token_size

ngram-token-def

mysql> INSERT INTO N_DEMO (title) VALUES ('まい'), ('えす'), ('sq'), ('sl'), ('ql');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from N_DEMO;
+----------+-----------------------------+
| FTS_N_ID | title                       |
+----------+-----------------------------+
|        1 | mysql                       |
|        2 | MYSQL                       |
|        3 | MySQL                       |
|        4 | マイエスキューエル          |
|        5 | マイエスキューエル                   |
|        6 | まいえすきゅーえる          |
|        7 | まい                        |
|        8 | えす                        |
|        9 | sq                          |
|       10 | sl                          |
|       11 | ql                          |
+----------+-----------------------------+
11 rows in set (0.00 sec)

mysql> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE);
+----------+-------+
| FTS_N_ID | title |
+----------+-------+
|        1 | mysql |
|        2 | MYSQL |
|        3 | MySQL |
|        9 | sq    |
|       11 | ql    |
+----------+-------+
5 rows in set (0.00 sec)

mysql> SELECT FTS_N_ID,MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE) AS score FROM N_DEMO;
+----------+---------------------+
| FTS_N_ID | score               |
+----------+---------------------+
|        1 | 0.38602644205093384 |
|        2 | 0.38602644205093384 |
|        3 | 0.38602644205093384 |
|        4 |                   0 |
|        5 |                   0 |
|        6 |                   0 |
|        7 |                   0 |
|        8 |                   0 |
|        9 | 0.19301322102546692 |
|       10 |                   0 |
|       11 | 0.19301322102546692 |
+----------+---------------------+
11 rows in set (0.01 sec)

mysql> 

ngram_token_sizeを2から3へ変更して再度検索した結果
– この値の変更には、SQLインスタンスの再起動が必要です。
– インデックスの再作成も必要になります。
先程と同じ検索をしても、結果が異なる事が確認出来る。

root@localhost [ngram]> system cat /etc/my.cnf | grep ngram_token_size
ngram_token_size=3
root@localhost [ngram]> show variables like 'ngram_token_size';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| ngram_token_size | 3     |
+------------------+-------+
1 row in set (0.00 sec)

root@localhost [ngram]> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)

root@localhost [ngram]> alter table N_DEMO drop index ngram_idx;
Query OK, 0 rows affected (0.14 sec)
Records: 0  Duplicates: 0  Warnings: 0

root@localhost [ngram]> alter table N_DEMO ADD FULLTEXT INDEX ngram_idx (title) WITH PARSER ngram;
Query OK, 0 rows affected (0.12 sec)
Records: 0  Duplicates: 0  Warnings: 0

root@localhost [ngram]> SELECT * FROM N_DEMO WHERE MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE);
+----------+-------+
| FTS_N_ID | title |
+----------+-------+
|        1 | mysql |
|        2 | MYSQL |
|        3 | MySQL |
+----------+-------+
3 rows in set (0.00 sec)

root@localhost [ngram]> SELECT FTS_N_ID,MATCH (title) AGAINST ('sql' IN NATURAL LANGUAGE MODE) AS score FROM N_DEMO;
+----------+--------------------+
| FTS_N_ID | score              |
+----------+--------------------+
|        1 | 0.3184022605419159 |
|        2 | 0.3184022605419159 |
|        3 | 0.3184022605419159 |
|        4 |                  0 |
|        5 |                  0 |
|        6 |                  0 |
|        7 |                  0 |
|        8 |                  0 |
|        9 |                  0 |
|       10 |                  0 |
|       11 |                  0 |
+----------+--------------------+
11 rows in set (0.00 sec)

root@localhost [ngram]> 

alter_table

Comments are closed.

Post Navigation