
Full-text search using text indexes

Experimental feature. Learn more.
Not supported in ClickHouse Cloud

Text indexes in ClickHouse (also known as "inverted indexes") provide fast full-text search capabilities on string data. The index maps each token in a column to the rows that contain it. Tokens are generated by a process called tokenization. For example, ClickHouse tokenizes the English sentence "All cat like mice." by default as ["All", "cat", "like", "mice"] (note that the trailing dot is ignored). More advanced tokenizers are available, for example for log data.

Creating a Text Index

To create a text index, first enable the corresponding experimental setting:

SET allow_experimental_full_text_index = true;

A text index can be defined on String and FixedString columns using the following syntax:

CREATE TABLE tab
(
    `key` UInt64,
    `str` String,
    INDEX text_idx(str) TYPE text(
                                -- Mandatory parameters:
                                tokenizer = 'default|ngram|split|no_op'
                                -- Optional parameters:
                                [, ngram_size = N]
                                [, separators = []]
                                [, dictionary_block_size = D]
                                [, max_cardinality_for_embedded_postings = M]
                                [, bloom_filter_false_positive_rate = R]
                            ) [GRANULARITY 64]
)
ENGINE = MergeTree
ORDER BY key

The tokenizer argument specifies the tokenizer:

  • default splits strings along non-alphanumeric ASCII characters.
  • ngram splits strings into equally large n-grams.
  • split splits strings along certain user-defined separator strings.
  • no_op performs no tokenization, i.e. every row value is a token.

To test how the tokenizers split the input string, you can use ClickHouse's tokens function:

As an example,

SELECT tokens('abc def', 'ngram', 3) AS tokens;

returns

+-tokens--------------------------+
| ['abc','bc ','c d',' de','def'] |
+---------------------------------+
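For comparison, the default tokenizer splits along non-alphanumeric characters. A quick check (the example string is our own; the expected result is ['New', 'York', 'NY']):

SELECT tokens('New York, NY') AS tokens;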

If you chose the ngram tokenizer, you can set the ngram length using the (optional) parameter ngram_size. If ngram_size is not specified, the default ngram size is 3. The smallest and largest possible ngram sizes are 2 and 8.
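For example, a minimal index definition using the ngram tokenizer with a non-default ngram size of 4 (the table and column names are illustrative only):

CREATE TABLE tab_ngram
(
    `key` UInt64,
    `str` String,
    INDEX text_idx(str) TYPE text(tokenizer = 'ngram', ngram_size = 4)
)
ENGINE = MergeTree
ORDER BY key;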

If you chose the split tokenizer, you can set the separators using the (optional) parameter separators. The parameter expects a list of strings, for example, separators = [', ', '; ', '\n', '\\']. Note that each string can consist of multiple characters (', ' in the example). If separators is not specified, a list with a single whitespace [' '] is used by default.

Note

The split tokenizer applies the separators left-to-right. This can create ambiguities. For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching the order of the separators to ['%', '%21'] will output ['21abc']. In most cases, you want matching to prefer longer separators first. This can generally be achieved by passing the separator strings in order of descending length. If the separator strings happen to form a prefix code, they can be passed in arbitrary order.
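As an illustrative sketch, an index using the split tokenizer with the separators from the note above passed in descending length (table and column names are just examples):

CREATE TABLE tab_split
(
    `key` UInt64,
    `str` String,
    -- the longer separator '%21' is listed before the shorter '%'
    INDEX text_idx(str) TYPE text(tokenizer = 'split', separators = ['%21', '%'])
)
ENGINE = MergeTree
ORDER BY key;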

Text indexes in ClickHouse are implemented as secondary indexes. However, unlike other skipping indexes, text indexes have a default index GRANULARITY of 64. This value has been chosen empirically and it provides a good trade-off between speed and index size for most use cases. Advanced users can specify a different index granularity (we do not recommend this).

Advanced parameters

The default values of the following advanced parameters will work well in virtually all situations. We do not recommend changing them.

Optional parameter dictionary_block_size (default: 128) specifies the size of dictionary blocks in rows.

Optional parameter max_cardinality_for_embedded_postings (default: 16) specifies the cardinality threshold below which posting lists should be embedded into dictionary blocks.

Optional parameter bloom_filter_false_positive_rate (default: 0.1) specifies the false-positive rate of the dictionary bloom filter.
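For reference, a sketch of an index definition that spells out these advanced parameters explicitly; the values shown are simply the defaults (table and column names are illustrative):

CREATE TABLE tab_advanced
(
    `key` UInt64,
    `str` String,
    INDEX text_idx(str) TYPE text(
                                tokenizer = 'default',
                                dictionary_block_size = 128,
                                max_cardinality_for_embedded_postings = 16,
                                bloom_filter_false_positive_rate = 0.1
                            )
)
ENGINE = MergeTree
ORDER BY key;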

Text indexes can be added to or removed from a column after the table has been created:

ALTER TABLE tab DROP INDEX text_idx;
ALTER TABLE tab ADD INDEX text_idx(str) TYPE text(tokenizer = 'default');

Using a Text Index

Using a text index in SELECT queries is straightforward as common string search functions will leverage the index automatically.

Supported functions

The text index can be used if text functions are used in the WHERE clause of a SELECT query:

SELECT [...]
FROM [...]
WHERE string_search_function(column_with_text_index)

= and !=

= (equals) and != (notEquals) match the entire given search term.

Example:

SELECT * from tab WHERE str = 'Hello';

The text index supports = and !=, yet equality and inequality search only make sense with the no_op tokenizer (which causes the index to store entire row values).
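A sketch of such a setup, with a no_op-tokenized index used for exact matches (names are illustrative):

CREATE TABLE tab_exact
(
    `key` UInt64,
    `str` String,
    -- no_op stores each entire row value as a single token, so = and != can use the index
    INDEX text_idx(str) TYPE text(tokenizer = 'no_op')
)
ENGINE = MergeTree
ORDER BY key;

SELECT * FROM tab_exact WHERE str = 'Hello';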

IN and NOT IN

IN (in) and NOT IN (notIn) are similar to functions equals and notEquals but they match all (IN) or none (NOT IN) of the search terms.

Example:

SELECT * from tab WHERE str IN ('Hello', 'World');

The same restrictions as for = and != apply, i.e. IN and NOT IN only make sense in conjunction with the no_op tokenizer.

LIKE, NOT LIKE and match

Note

These functions currently use the text index for filtering only if the index tokenizer is either default or ngram.

In order to use LIKE (like), NOT LIKE (notLike), and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term.

Example:

SELECT count() FROM tab WHERE comment LIKE 'support%';

The search term support in the example could match support, supports, supporting, etc. This kind of query is a substring query and it cannot be sped up by a text index.

To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:

SELECT count() FROM tab WHERE comment LIKE ' support %'; -- or `% support %`

The spaces left and right of support make sure that the term can be extracted as a token.

startsWith and endsWith

Similar to LIKE, functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term.

Example:

SELECT count() FROM tab WHERE startsWith(comment, 'clickhouse support');

In the example, only clickhouse is considered a token. support is not a complete token because it could also match supports, supporting, etc.

To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:

startsWith(comment, 'clickhouse supports ')

Similarly, endsWith should be used with a leading space:

SELECT count() FROM tab WHERE endsWith(comment, ' olap engine');

hasToken and hasTokenOrNull

Functions hasToken and hasTokenOrNull match against a single given token.

Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).

Example:

SELECT count() FROM tab WHERE hasToken(comment, 'clickhouse');

Functions hasToken and hasTokenOrNull are the most performant functions to use with the text index.

searchAny and searchAll

Functions searchAny and searchAll match against one or all of the given tokens.

Like hasToken, no tokenization of the search terms takes place.

Example:

SELECT count() FROM tab WHERE searchAny(comment, ['clickhouse', 'olap']);

SELECT count() FROM tab WHERE searchAll(comment, ['clickhouse', 'olap']);

Implementation

Index layout

Each text index consists of two (abstract) data structures:

  • a dictionary which maps each token to a postings list, and
  • a set of postings lists, each representing a set of row numbers.

Since a text index is a skip index, these data structures exist logically per index granule.

During index creation, three files are created (per part):

Dictionary blocks file (.dct)

The tokens in an index granule are sorted and stored in dictionary blocks of 128 tokens each (the block size is configurable via parameter dictionary_block_size). The dictionary blocks file (.dct) contains all dictionary blocks of all index granules in a part.

Index granules file (.idx)

The index granules file contains, for each dictionary block, the block's first token, its relative offset in the dictionary blocks file, and a bloom filter for all tokens in the block. This sparse index structure is similar to ClickHouse's sparse primary key index. The bloom filter allows skipping dictionary blocks early if the searched token is not contained in the block.

Postings lists file (.pst)

The postings lists for all tokens are laid out sequentially in the postings lists file. To save space while still allowing fast intersection and union operations, the postings lists are stored as roaring bitmaps. If the cardinality of a postings list is less than 16 (configurable by parameter max_cardinality_for_embedded_postings), it is embedded into the dictionary.
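To get a rough idea of the on-disk size of a text index, you can query the system.data_skipping_indices table (a minimal sketch, assuming the tab table and text_idx index from the examples above):

SELECT
    name,
    type,
    data_compressed_bytes,
    data_uncompressed_bytes
FROM system.data_skipping_indices
WHERE table = 'tab' AND name = 'text_idx';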

Direct read

Certain types of text queries can be sped up significantly by an optimization called "direct read". More specifically, if the SELECT query does not project from the text column, the optimization can be applied.

Example:

SELECT column_a, column_b, ... -- not: column_with_text_index
FROM [...]
WHERE string_search_function(column_with_text_index)
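As a concrete sketch using the example table from above: the first query shape can benefit from direct read because the indexed column is not selected, while the second one reads str itself and therefore cannot (whether the optimization actually kicks in also depends on the search function used):

-- eligible for direct read: the text column is not selected
SELECT key FROM tab WHERE hasToken(str, 'clickhouse');

-- not eligible: the text column str is selected
SELECT key, str FROM tab WHERE hasToken(str, 'clickhouse');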


Example: Hackernews dataset

Let's look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. Here is the table without a text index:

CREATE TABLE hackernews (
    id UInt64,
    deleted UInt8,
    type String,
    author String,
    timestamp DateTime,
    comment String,
    dead UInt8,
    parent UInt64,
    poll UInt64,
    children Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32
)
ENGINE = MergeTree
ORDER BY (type, author);

The 28.7M rows are in a Parquet file in S3 - let's insert them into the hackernews table:

INSERT INTO hackernews
    SELECT * FROM s3Cluster(
        'default',
        'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet',
        'Parquet',
        '
    id UInt64,
    deleted UInt8,
    type String,
    by String,
    time DateTime,
    text String,
    dead UInt8,
    parent UInt64,
    poll UInt64,
    kids Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32');

Consider the following simple search for the term ClickHouse (in its various upper- and lowercase spellings) in the comment column:

SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse');

Notice it takes 3 seconds to execute the query:

┌─count()─┐
│    1145 │
└─────────┘

1 row in set. Elapsed: 3.001 sec. Processed 28.74 million rows, 9.75 GB (9.58 million rows/s., 3.25 GB/s.)

We will use ALTER TABLE to add a text index on the lowercased comment column, then materialize it (materialization can take a while; wait for it to finish):

ALTER TABLE hackernews
     ADD INDEX comment_lowercase(lower(comment)) TYPE text(tokenizer = 'default');

ALTER TABLE hackernews MATERIALIZE INDEX comment_lowercase;

We run the same query...

SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse')

...and notice the query executes 4x faster:

┌─count()─┐
│    1145 │
└─────────┘

1 row in set. Elapsed: 0.747 sec. Processed 4.49 million rows, 1.77 GB (6.01 million rows/s., 2.37 GB/s.)
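To confirm that the text index is actually used for filtering, you can inspect the query plan with EXPLAIN (a sketch; the exact output depends on your version and data layout):

EXPLAIN indexes = 1
SELECT count()
FROM hackernews
WHERE hasToken(lower(comment), 'clickhouse');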

We can also search for one or all of multiple terms, i.e., disjunctions or conjunctions:

-- multiple OR'ed terms
SELECT count(*)
FROM hackernews
WHERE hasToken(lower(comment), 'avx') OR hasToken(lower(comment), 'sve');

-- multiple AND'ed terms
SELECT count(*)
FROM hackernews
WHERE hasToken(lower(comment), 'avx') AND hasToken(lower(comment), 'sve');
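The same disjunction and conjunction can also be expressed with searchAny and searchAll (a sketch, assuming the comment_lowercase index created above):

-- at least one of the terms (like the OR'ed query)
SELECT count(*) FROM hackernews WHERE searchAny(lower(comment), ['avx', 'sve']);

-- all of the terms (like the AND'ed query)
SELECT count(*) FROM hackernews WHERE searchAll(lower(comment), ['avx', 'sve']);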