Skip to main content

Collations

Collations provide rules for how text should be sorted or compared in the execution engine. Collations are useful for localization, as the rules for how text should be ordered are different for different languages or for different countries. These orderings are often incompatible with one another. For example, in English the letter y comes between x and z. However, in Lithuanian the letter y comes between the i and j. For that reason, different collations are supported. The user must choose which collation they want to use when performing sorting and comparison operations.

By default, the BINARY collation is used. That means that strings are ordered and compared based only on their binary contents. This makes sense for standard ASCII characters (i.e., the letters A-Z and numbers 0-9), but generally does not make much sense for special unicode characters. It is, however, by far the fastest method of performing ordering and comparisons. Hence it is recommended to stick with the BINARY collation unless required otherwise.

Using Collations

SereneDB ships with three built-in, region-independent collations: NOCASE, NOACCENT and NFC. The NOCASE collation compares characters as equal regardless of their casing. The NOACCENT collation compares characters as equal regardless of their accents. The NFC collation performs NFC-normalized comparisons, see Unicode normalization for more information. In addition to these three built-ins, SereneDB also includes the region- and language-specific (ICU) collations described in ICU Collations below.

The available collations can be listed with PRAGMA collations. The query below filters to a stable subset to show both the built-ins and a few ICU locales:

Query
SELECT collnameFROM pragma_collations()WHERE collname IN ('nocase', 'noaccent', 'nfc', 'de', 'fr', 'ja')ORDER BY collname;
Result
 collname---------- de fr ja nfc noaccent nocase
Query
SELECT 'hello' = 'hElLO';
Result
 ?column?---------- f
Query
SELECT 'hello' COLLATE NOCASE = 'hElLO';
Result
 ?column?---------- t
Query
SELECT 'hello' = 'hëllo';
Result
 ?column?---------- f
Query
SELECT 'hello' COLLATE NOACCENT = 'hëllo';
Result
 ?column?---------- t

Collations can be combined by chaining them using the dot operator. Note, however, that not all collations can be combined together. In general, the NOCASE collation can be combined with any other collator, but most other collations cannot be combined.

Query
SELECT 'hello' COLLATE NOCASE = 'hElLÖ';
Result
 ?column?---------- f
Query
SELECT 'hello' COLLATE NOACCENT = 'hElLÖ';
Result
 ?column?---------- f
Query
SELECT 'hello' COLLATE NOCASE.NOACCENT = 'hElLÖ';
Result
 ?column?---------- t

Default Collations

The collations we have seen so far have all been specified per expression. It is also possible to specify a default collator, either on the global database level or on a base table column. The PRAGMA default_collation can be used to specify the global default collator. This is the collator that will be used if no other one is specified.

Query
SET default_collation = NOCASE;
SELECT 'hello' = 'HeLlo';
Result
 ?column?---------- t

Collations can also be specified per-column when creating a table. When that column is then used in a comparison, the per-column collation is used to perform that comparison.

Query
CREATE TABLE names (name VARCHAR COLLATE NOACCENT);
INSERT INTO names VALUES ('hännes');
Query
SELECT nameFROM namesWHERE name = 'hannes';
Result
 name-------- hännes

Be careful here, however, as different collations cannot be combined. This can be problematic when you want to compare columns that have a different collation specified.

Query
SELECT nameFROM namesWHERE name = 'hannes' COLLATE NOCASE;
Result
db error: ERROR: Cannot combine types with different collation!
Query
CREATE TABLE other_names (name VARCHAR COLLATE NOCASE);
INSERT INTO other_names VALUES ('HÄNNES');
Query
SELECT names.name AS name, other_names.name AS other_nameFROM names, other_namesWHERE names.name = other_names.name;
Result
db error: ERROR: Cannot combine types with different collation!

We need to manually overwrite the collation:

Query
SELECT names.name AS name, other_names.name AS other_nameFROM names, other_namesWHERE names.name COLLATE NOACCENT.NOCASE = other_names.name COLLATE NOACCENT.NOCASE;
Result
 name   | other_name--------+------------ hännes | HÄNNES

ICU Collations

The collations we have seen so far are not region-dependent and do not follow any specific regional rules. SereneDB also includes region- and language-specific collations powered by ICU. These follow the ordering and comparison rules of a specific language or region.

ICU collations are named by their locale, for example de (German), fr (French) and ja (Japanese). Region-qualified locales are also available, such as de_at (German as used in Austria). The full set of available locales can be inspected with PRAGMA collations as shown above.

For example, the German collation orders the umlaut ä next to a rather than after z:

Query
CREATE TABLE strings (s VARCHAR COLLATE DE);
INSERT INTO strings VALUES ('z'), ('ä'), ('a');
SELECT sFROM stringsORDER BY s;
Result
 s--- a ä z

Like the built-in collations, ICU collations can be applied per expression with the COLLATE operator:

Query
SELECT sFROM (VALUES ('z'), ('ä'), ('a')) AS t(s)ORDER BY s COLLATE DE;
Result
 s--- a ä z

They can equally be used per column when creating a table (s VARCHAR COLLATE DE) and as the global default_collation.