Inverted Index

SinceVersion 2.0.0

From version 2.0.0, Doris implemented inverted index to support fulltext search on text field, normal eq and range filter on text, numeric, datetime field. This doc introduce inverted index usage, including create, drop and query.

Glossary

  • inverted index is a index techlogy used in information retirval commonly. It split text into word terms and construct a term to doc index. This index is called inverted index and can be used to find the docs where a specific term appears.

Basic Principles

Doris use CLucene as its underlying lib for inverted index. CLucene is a high performance and robust implementation of the famous Lucene inverted index library. Doris optimize CLucene to be more simple, fast and suitable for a database.

In the inverted index of Doris, a row in a table corresponds to a doc in CLucene, a column corresponds to a field in doc. So using inverted index, doris can get the rows that meet the filter of SQL WHERE clause, and then get the rows quickly without reading other unrelated rows.

Doris use a seperate file to store inverted index. It’s related to segment file in logic, but iosolated with each other. The advantange is that, create and drop inverted index does not need to rewrite tablet and segment file, which is very heavy work.

Features

The features for inverted index is as follows:

  • add fulltext search on text(string, varchar, char) field
    • MATCH_ALL matches all keywords, MATCH_ANY matches any keywords
    • support fulltext on array of text field
    • support english and chinese word parser
  • accelerate normal equal, range query, replacing bitmap index in the future
    • suport =, !=, >, >=, <, <= on text, numeric, datetime types
    • suport =, !=, >, >=, <, <= on array of text, numeric, datetime types
  • complete suport for logic combination
    • add index filter push down for OR, NOT
    • support combination of AND, OR, NOT
  • flexiable and fast index management
    • support inverted index definition on table creation
    • support add inverted index on existed table, without rewrite data
    • support delete inverted index on existed table, without rewrite data

Syntax

  • The inverted index definition syntax on table creation is as follows
    • USING INVERTED is mandatory, it specify index type to be inverted index
    • PROPERTIES is optional, it allows user to specify additional properties for index, “parser” is for type of word tokenizor/parser
      • missing stands for no parser, the whole field is considered to be a term
      • “english” stands for english parser
      • “chinese” stands for chinese parser
    • COMMENT is optional
  1. CREATE TABLE table_name
  2. (
  3. columns_difinition,
  4. INDEX idx_name1(column_name1) USING INVERTED [PROPERTIES("parser" = "english|chinese")] [COMMENT 'your comment']
  5. INDEX idx_name2(column_name2) USING INVERTED [PROPERTIES("parser" = "english|chinese")] [COMMENT 'your comment']
  6. )
  7. table_properties;
  • add an inverted index to existed table
  1. -- syntax 1
  2. CREATE INDEX idx_name ON table_name(column_name) USING INVERTED [PROPERTIES("parser" = "english|chinese")] [COMMENT 'your comment'];
  3. -- syntax 2
  4. ALTER TABLE table_name ADD INDEX idx_name(column_name) USING INVERTED [PROPERTIES("parser" = "english|chinese")] [COMMENT 'your comment'];
  • drop an inverted index
  1. -- syntax 1
  2. DROP INDEX idx_name ON table_name;
  3. -- syntax 2
  4. ALTER TABLE table_name DROP INDEX idx_name;
  • speed up query using inverted index
  1. -- 1. fulltext search using MATCH_ANY OR MATCH_ALL
  2. SELECT * FROM table_name WHERE column_name MATCH_ANY | MATCH_ALL 'keyword1 ...';
  3. -- 1.1 find rows that logmsg contains keyword1
  4. SELECT * FROM table_name WHERE logmsg MATCH_ANY 'keyword1';
  5. -- 1.2 find rows that logmsg contains keyword1 or keyword2 or more keywords
  6. SELECT * FROM table_name WHERE logmsg MATCH_ANY 'keyword2 keyword2';
  7. -- 1.3 find rows that logmsg contains both keyword1 and keyword2 and more keywords
  8. SELECT * FROM table_name WHERE logmsg MATCH_ALL 'keyword2 keyword2';
  9. -- 2. normal equal, range query
  10. SELECT * FROM table_name WHERE id = 123;
  11. SELECT * FROM table_name WHERE ts > '2023-01-01 00:00:00';
  12. SELECT * FROM table_name WHERE op_type IN ('add', 'delete');

Example

This example will demostrate inverted index creation, fulltext query, normal query using a hackernews dataset with 1 million rows. The performanc comparation between using and without inverted index will also be showed.

Create table

  1. CREATE DATABASE test_inverted_index;
  2. USE test_inverted_index;
  3. -- define inverted index idx_comment for comment column on table creation
  4. -- USING INVERTED specify using inverted index
  5. -- PROPERTIES("parser" = "english") specify english word parser
  6. CREATE TABLE hackernews_1m
  7. (
  8. `id` BIGINT,
  9. `deleted` TINYINT,
  10. `type` String,
  11. `author` String,
  12. `timestamp` DateTimeV2,
  13. `comment` String,
  14. `dead` TINYINT,
  15. `parent` BIGINT,
  16. `poll` BIGINT,
  17. `children` Array<BIGINT>,
  18. `url` String,
  19. `score` INT,
  20. `title` String,
  21. `parts` Array<INT>,
  22. `descendants` INT,
  23. INDEX idx_comment (`comment`) USING INVERTED PROPERTIES("parser" = "english") COMMENT 'inverted index for comment'
  24. )
  25. DUPLICATE KEY(`id`)
  26. DISTRIBUTED BY HASH(`id`) BUCKETS 10
  27. PROPERTIES ("replication_num" = "1");

Load data

  • load data by stream load
  1. wget https://doris-build-1308700295.cos.ap-beijing.myqcloud.com/regression/index/hacknernews_1m.csv.gz
  2. curl --location-trusted -u root: -H "compress_type:gz" -T hacknernews_1m.csv.gz http://127.0.0.1:8030/api/test_inverted_index/hackernews_1m/_stream_load
  3. {
  4. "TxnId": 2,
  5. "Label": "a8a3e802-2329-49e8-912b-04c800a461a6",
  6. "TwoPhaseCommit": "false",
  7. "Status": "Success",
  8. "Message": "OK",
  9. "NumberTotalRows": 1000000,
  10. "NumberLoadedRows": 1000000,
  11. "NumberFilteredRows": 0,
  12. "NumberUnselectedRows": 0,
  13. "LoadBytes": 130618406,
  14. "LoadTimeMs": 8988,
  15. "BeginTxnTimeMs": 23,
  16. "StreamLoadPutTimeMs": 113,
  17. "ReadDataTimeMs": 4788,
  18. "WriteDataTimeMs": 8811,
  19. "CommitAndPublishTimeMs": 38
  20. }
  • check loaded data by SQL count()
  1. mysql> SELECT count() FROM hackernews_1m;
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 1000000 |
  6. +---------+
  7. 1 row in set (0.02 sec)

Query

Fulltext search query

  • count the rows that comment contains ‘OLAP’ using LIKE, cost 0.18s
  1. mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 34 |
  6. +---------+
  7. 1 row in set (0.18 sec)
  • count the rows that comment contains ‘OLAP’ using MATCH_ANY fulltext search based on inverted index , cost 0.02s and 9x speedup, the speedup will be even larger on larger dataset
    • the difference of count is due to feature of fulltext. Word parser will not only split text to words, but also do some normalization such transform to lower case letters. So the result of MATCH_ANY will be a little more.
  1. mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 35 |
  6. +---------+
  7. 1 row in set (0.02 sec)
  • Semilarly, count on ‘OLTP’ shows 0.07s vs 0.01s. Due to the cache in Doris, both LIKE and MATCH_ANY is faster, but there is still 7x speedup.
  1. mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLTP%';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 48 |
  6. +---------+
  7. 1 row in set (0.07 sec)
  8. mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLTP';
  9. +---------+
  10. | count() |
  11. +---------+
  12. | 51 |
  13. +---------+
  14. 1 row in set (0.01 sec)
  • search for both ‘OLAP’ and ‘OLTP’, 0.13s vs 0.01s,13x speedup
    • using MATCH_ALL if you need the keywords all appears
  1. mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' AND comment LIKE '%OLTP%';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 14 |
  6. +---------+
  7. 1 row in set (0.13 sec)
  8. mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ALL 'OLAP OLTP';
  9. +---------+
  10. | count() |
  11. +---------+
  12. | 15 |
  13. +---------+
  14. 1 row in set (0.01 sec)
  • search for at least one of ‘OLAP’ or ‘OLTP’, 0.12s vs 0.01s,12x speedup
    • using MATCH_ALL if you only need at least one of the keywords appears
  1. mysql> SELECT count() FROM hackernews_1m WHERE comment LIKE '%OLAP%' OR comment LIKE '%OLTP%';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 68 |
  6. +---------+
  7. 1 row in set (0.12 sec)
  8. mysql> SELECT count() FROM hackernews_1m WHERE comment MATCH_ANY 'OLAP OLTP';
  9. +---------+
  10. | count() |
  11. +---------+
  12. | 71 |
  13. +---------+
  14. 1 row in set (0.01 sec)

normal equal, range query

  • range query on DateTime column
  1. mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 999081 |
  6. +---------+
  7. 1 row in set (0.03 sec)
  • add inverted index for timestamp column
  1. -- for timestamp column, there is no need for word parser, so just USING INVERTED without PROPERTIES
  2. -- this is the first syntax for CREATE INDEX
  3. mysql> CREATE INDEX idx_timestamp ON hackernews_1m(timestamp) USING INVERTED;
  4. Query OK, 0 rows affected (0.03 sec)
  • progress of building index can be view by SQL. It just costs 1s (compare FinishTime and CreateTime) to build index for timestamp column with 1 million rows.
  1. mysql> SHOW ALTER TABLE COLUMN;
  2. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  3. | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout |
  4. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  5. | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 |
  6. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  7. 1 row in set (0.00 sec)
  • after the index is build, Doris will automaticaly use index for range query, but the performance is almost the same since it’s already fast on the small dataset
  1. mysql> SELECT count() FROM hackernews_1m WHERE timestamp > '2007-08-23 04:17:00';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 999081 |
  6. +---------+
  7. 1 row in set (0.01 sec)
  • similary test for parent column with numeric type, using equal query
  1. mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189;
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 2 |
  6. +---------+
  7. 1 row in set (0.01 sec)
  8. -- do not use word parser for numeric type USING INVERTED
  9. -- use the second syntax ALTER TABLE
  10. mysql> ALTER TABLE hackernews_1m ADD INDEX idx_parent(parent) USING INVERTED;
  11. Query OK, 0 rows affected (0.01 sec)
  12. mysql> SHOW ALTER TABLE COLUMN;
  13. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  14. | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout |
  15. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  16. | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 |
  17. | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 |
  18. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  19. mysql> SELECT count() FROM hackernews_1m WHERE parent = 11189;
  20. +---------+
  21. | count() |
  22. +---------+
  23. | 2 |
  24. +---------+
  25. 1 row in set (0.01 sec)
  • for text column author, inverted index can also be used to speedup equal query
  1. mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster';
  2. +---------+
  3. | count() |
  4. +---------+
  5. | 20 |
  6. +---------+
  7. 1 row in set (0.03 sec)
  8. -- do not use any word parser for author to treat it as a whole
  9. mysql> ALTER TABLE hackernews_1m ADD INDEX idx_author(author) USING INVERTED;
  10. Query OK, 0 rows affected (0.01 sec)
  11. -- costs 1.5s to build index for author column with 1 million rows.
  12. mysql> SHOW ALTER TABLE COLUMN;
  13. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  14. | JobId | TableName | CreateTime | FinishTime | IndexName | IndexId | OriginIndexId | SchemaVersion | TransactionId | State | Msg | Progress | Timeout |
  15. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  16. | 10030 | hackernews_1m | 2023-02-10 19:44:12.929 | 2023-02-10 19:44:13.938 | hackernews_1m | 10031 | 10008 | 1:1994690496 | 3 | FINISHED | | NULL | 2592000 |
  17. | 10053 | hackernews_1m | 2023-02-10 19:49:32.893 | 2023-02-10 19:49:33.982 | hackernews_1m | 10054 | 10008 | 1:378856428 | 4 | FINISHED | | NULL | 2592000 |
  18. | 10076 | hackernews_1m | 2023-02-10 19:54:20.046 | 2023-02-10 19:54:21.521 | hackernews_1m | 10077 | 10008 | 1:1335127701 | 5 | FINISHED | | NULL | 2592000 |
  19. +-------+---------------+-------------------------+-------------------------+---------------+---------+---------------+---------------+---------------+----------+------+----------+---------+
  20. -- equal qury on text field autor get 3x speedup
  21. mysql> SELECT count() FROM hackernews_1m WHERE author = 'faster';
  22. +---------+
  23. | count() |
  24. +---------+
  25. | 20 |
  26. +---------+
  27. 1 row in set (0.01 sec)