Fulltext Search
In order to use fulltext search on one or more columns, a fulltext index with an analyzer has to be defined while creating the column: either with CREATE TABLE
or ALTER TABLE ADD COLUMN
. For more information see Fulltext Indices.
Table of Contents
MATCH
Predicate
Synopsis
MATCH (
{ column_or_idx_ident | ( column_or_idx_ident [boost] [, ...] ) }
, query_term
) [ using match_type [ with ( match_parameter [= value] [, ... ] ) ] ]
The MATCH predicate performs a fulltext search on one or more indexed columns or indices and supports different matching techniques. It can also be used to perform geographical searches on geo_shape indices.
The actual applicability of the MATCH predicate depends on the index’s type. In fact, the availability of certain match_types
and match_parameters
depend on the index. This section however, only covers the usage of the MATCH predicate on fulltext
indices on string columns. To use MATCH on geo_shape indices, see Geo Search.
In order to use fulltext searches on a column, a fulltext index with an analyzer must be created for this column. See Fulltext Indices for details. There are different types of Fulltext Indices with different goals, however it’s not possible to query multiple index columns with different index types within the same MATCH predicate.
To get the relevance of a matching row, a specific system column _score can be selected. It contains a numeric score relative to the other rows: The higher, the more relevant the row:
cr> select name, _score from locations
... where match(name_description_ft, 'time') order by _score desc;
+-----------+-----------+
| name | _score |
+-----------+-----------+
| Bartledan | 1.6672127 |
| Altair | 1.3862944 |
+-----------+-----------+
SELECT 2 rows in set (... sec)
The MATCH predicate in its simplest form performs a fulltext search against a single column. It takes the query_term
and, if no analyzer
was provided, analyzes the term with the analyzer configured on column_or_idx_ident
. The resulting tokens are then matched against the index at column_or_idx_ident
and if one of them matches, MATCH returns TRUE
.
The MATCH predicate can be also used to perform a fulltext search on multiple columns with a single query_term
and to add weight to specific columns it’s possible to add a boost
argument to each column_or_idx_ident
. Matches on columns with a higher boost result in a higher _score value for that document.
The match_type
argument determines how a single query_term
is applied and and how the resulting _score is computed. For more information see Match Types.
Results are ordered by _score by default, but can be overridden by adding an ORDER BY
clause.
Arguments
column_or_idx_ident: | A reference to a column or an index. If the column has an implicit index (e.g. created with something like If the column has an explicit index (e.g. created with something like By default every column is indexed but only the raw data is stored, so matching against a |
---|---|
boost: | A column ident can have a boost attached. That is a weight factor that increases the relevance of a column in respect to the other columns. The default boost is 1. |
query_term: | This string is analyzed (using the explicitly given |
match_type: | Optional. Defaults to |
Note
The MATCH
predicate can only be used in the WHERE Clause and on user-created tables. Using the MATCH
predicate on system tables is not supported.
One MATCH
predicate cannot combine columns of both relations of a join.
Additionally, MATCH
predicates cannot be used on columns of both relations of a join if they cannot be logically applied to each of them separately. For example:
This is allowed:
FROM t1, t2 WHERE match(t1.txt, 'foo') AND match(t2.txt, 'bar');``
But this is not:
FROM t1, t2 WHERE match(t1.txt, 'foo') OR match(t2.txt, 'bar');
Match Types
The match type determines how the query_term
is applied and the _score is created, thereby influencing which documents are considered more relevant. The default match_type
for fulltext indices is best_fields
.
best_fields: | Use the _score of the column that matched best. For example if a column contains all the tokens of the This type is the default, if omitted. |
---|---|
most_fields: | This match type takes the _score of every matching column and averages their scores. |
cross_fields: | This match type analyzes the |
phrase: | This match type differs from |
phrase_prefix: | This match type is roughly the same than |
Options
The match options further distinguish the way the matching process using a certain match type works. Not all options are applicable to all match types. See the options below for details.
analyzer: | The analyzer used to convert the |
---|---|
boost: | This numeric value is multiplied with the resulting _score of this If this |
cutoff_frequency: | The token frequency is the number of occurrences of a token in a column. This option specifies a minimum token frequency that excludes matching tokens with a higher frequency from the overall _score. Their _score is only included if another token with a lower frequency also matches. This can be used to suppress results where only high frequency terms like |
fuzziness: | Can be used to perform fuzzy full text search. On numeric columns use a numeric, on timestamp columns a long indicating milliseconds, on strings use a number indicating the maximum allowed Levenshtein Edit Distance. Use |
fuzzy_rewrite: | The same than |
max_expansions: | When using |
minimum_should_match: | The number of tokens from the |
operator: | Can be |
prefix_length: | When used with |
rewrite: | When using |
slop: | When matching for phrases this option controls how exact the phrase match should be (proximity search). If set to |
tie_breaker: | When using Defaults to Not applicable to match type |
zero_terms_query: | If no tokens are generated analyzing the |
Usage
A fulltext search is done using the MATCH Predicate predicate:
cr> select name from locations where match(name_description_ft, 'time') order by _score desc;
+-----------+
| name |
+-----------+
| Bartledan |
| Altair |
+-----------+
SELECT 2 rows in set (... sec)
It returns TRUE
for rows which match the search string. To get more detailed information about the quality of a match, the relevance of the row, the _score can be selected:
cr> select name, _score
... from locations where match(name_description_ft, 'time') order by _score desc;
+-----------+-----------+
| name | _score |
+-----------+-----------+
| Bartledan | 1.6672127 |
| Altair | 1.3862944 |
+-----------+-----------+
SELECT 2 rows in set (... sec)
Note
The _score
is not an absolute value. It just sets a row in relation to the other ones.
Searching On Multiple Columns
There are two possibilities if a search should span the contents of multiple columns:
- use a composite index column on your table. See Defining a Composite Index.
- use the MATCH Predicate predicate on multiple columns.
When querying multiple columns, there are many ways how the relevance a.k.a. _score can be computed. These different techniques are called Match Types.
To increase the relevance of rows where one column matches extremely well, use best_fields
(the default).
If rows with good matches spread over all included columns should be more relevant, use most_fields
. If searching multiple columns as if they were one, use cross_fields
.
For searching of matching phrases (tokens are in the exact same order) use phrase
or phrase_prefix
:
cr> select name, _score from locations
... where match(
... (name_description_ft, race['name'] 1.5, kind 0.75),
... 'end of the galaxy'
... ) order by _score desc;
+-------------------+------------+
| name | _score |
+-------------------+------------+
| NULL | 3.435174 |
| Altair | 1.3862944 |
| Aldebaran | 1.2243153 |
| Outer Eastern Rim | 0.8561404 |
| North West Ripple | 0.83460975 |
+-------------------+------------+
SELECT 5 rows in set (... sec)
cr> select name, description, _score from locations
... where match(
... (name_description_ft), 'end of the galaxy'
... ) using phrase with (analyzer='english', slop=4);
+------+-------------------------+-----------+
| name | description | _score |
+------+-------------------------+-----------+
| NULL | The end of the Galaxy.% | 3.4351742 |
+------+-------------------------+-----------+
SELECT 1 row in set (... sec)
A vast amount of options exist to fine-tune your fulltext search. A detailed reference can be found here MATCH Predicate.
Negative Search
A negative fulltext search can be done using a NOT
clause:
cr> select name, _score from locations
... where not match(name_description_ft, 'time')
... order by _score, name asc;
+------------------------------------+--------+
| name | _score |
+------------------------------------+--------+
| | 1.0 |
| Aldebaran | 1.0 |
| Algol | 1.0 |
| Allosimanius Syneca | 1.0 |
| Alpha Centauri | 1.0 |
| Argabuthon | 1.0 |
| Arkintoofle Minor | 1.0 |
| Galactic Sector QQ7 Active J Gamma | 1.0 |
| North West Ripple | 1.0 |
| Outer Eastern Rim | 1.0 |
| NULL | 1.0 |
+------------------------------------+--------+
SELECT 11 rows in set (... sec)
Filter By _score
It is possible to filter results by the _score column but as its value is a computed value relative to the highest score of all results and consequently never absolute or comparable across searches the usefulness outside of sorting is very limited.
Although possible, filtering by the greater-than-or-equals(‘>=’) operator on the _score column would not make much sense and can lead to unpredictable result sets.
Anyway let’s do it here for demonstration purpose:
cr> select name, _score
... from locations where match(name_description_ft, 'time')
... and _score >= 0.8 order by _score;
+-----------+-----------+
| name | _score |
+-----------+-----------+
| Altair | 2.3862944 |
| Bartledan | 2.6672127 |
+-----------+-----------+
SELECT 2 rows in set (... sec)
As you might have noticed, the _score value has changed for the same query text and document because it’s a ratio relative to all results, and by filtering on _score, ‘all results’ has already changed.
Caution
As noted above _score is a relative number and not comparable across searches. Filtering is therefore greatly discouraged.