Edit this page

Collations

Collations provide rules for how text should be sorted or compared in the execution engine. Collations are useful for localization, as the rules for how text should be ordered are different for different languages or for different countries. These orderings are often incompatible with one another. For example, in English the letter y comes between x and z. However, in Lithuanian the letter y comes between the i and j. For that reason, different collations are supported. The user must choose which collation they want to use when performing sorting and comparison operations.

By default, the BINARY collation is used. That means that strings are ordered and compared based only on their binary contents. This makes sense for standard ASCII characters (i.e., the letters A-Z and numbers 0-9), but generally does not make much sense for special unicode characters. It is, however, by far the fastest method of performing ordering and comparisons. Hence it is recommended to stick with the BINARY collation unless required otherwise.

Using Collations

SereneDB ships with three built-in, region-independent collations: NOCASE, NOACCENT and NFC. The NOCASE collation compares characters as equal regardless of their casing. The NOACCENT collation compares characters as equal regardless of their accents. The NFC collation performs NFC-normalized comparisons, see Unicode normalization for more information. In addition to these three built-ins, SereneDB also includes the region- and language-specific (ICU) collations described in ICU Collations below.

The available collations can be listed with PRAGMA collations. The query below filters to a stable subset to show both the built-ins and a few ICU locales:

Query

SELECT collnameFROM pragma_collations()WHERE collname IN ('nocase', 'noaccent', 'nfc', 'de', 'fr', 'ja')ORDER BY collname;

Result

 collname---------- de fr ja nfc noaccent nocase

Query

SELECT 'hello' = 'hElLO';

Result

 ?column?---------- f

Query

SELECT 'hello' COLLATE NOCASE = 'hElLO';

Result

 ?column?---------- t

Query

SELECT 'hello' = 'hëllo';

Result

 ?column?---------- f

Query

SELECT 'hello' COLLATE NOACCENT = 'hëllo';

Result

 ?column?---------- t

Collations can be combined by chaining them using the dot operator. Note, however, that not all collations can be combined together. In general, the NOCASE collation can be combined with any other collator, but most other collations cannot be combined.

Query

SELECT 'hello' COLLATE NOCASE = 'hElLÖ';

Result

 ?column?---------- f

Query

SELECT 'hello' COLLATE NOACCENT = 'hElLÖ';

Result

 ?column?---------- f

Query

SELECT 'hello' COLLATE NOCASE.NOACCENT = 'hElLÖ';

Result

 ?column?---------- t

Default Collations

The collations we have seen so far have all been specified per expression. It is also possible to specify a default collator, either on the global database level or on a base table column. The PRAGMA default_collation can be used to specify the global default collator. This is the collator that will be used if no other one is specified.

Query

SET default_collation = NOCASE;
SELECT 'hello' = 'HeLlo';

Result

 ?column?---------- t

Collations can also be specified per-column when creating a table. When that column is then used in a comparison, the per-column collation is used to perform that comparison.

Query

CREATE TABLE names (name VARCHAR COLLATE NOACCENT);
INSERT INTO names VALUES ('hännes');

Query

SELECT nameFROM namesWHERE name = 'hannes';

Result

 name-------- hännes

Be careful here, however, as different collations cannot be combined. This can be problematic when you want to compare columns that have a different collation specified.

Query

SELECT nameFROM namesWHERE name = 'hannes' COLLATE NOCASE;

Result

db error: ERROR: Cannot combine types with different collation!

Query

CREATE TABLE other_names (name VARCHAR COLLATE NOCASE);
INSERT INTO other_names VALUES ('HÄNNES');

Query

SELECT names.name AS name, other_names.name AS other_nameFROM names, other_namesWHERE names.name = other_names.name;

Result

db error: ERROR: Cannot combine types with different collation!

We need to manually overwrite the collation:

Query

SELECT names.name AS name, other_names.name AS other_nameFROM names, other_namesWHERE names.name COLLATE NOACCENT.NOCASE = other_names.name COLLATE NOACCENT.NOCASE;

Result

 name   | other_name--------+------------ hännes | HÄNNES

ICU Collations

The collations we have seen so far are not region-dependent and do not follow any specific regional rules. SereneDB also includes region- and language-specific collations powered by ICU. These follow the ordering and comparison rules of a specific language or region.

ICU collations are named by their locale, for example de (German), fr (French) and ja (Japanese). Region-qualified locales are also available, such as de_at (German as used in Austria). The full set of available locales can be inspected with PRAGMA collations as shown above.

For example, the German collation orders the umlaut ä next to a rather than after z:

Query

CREATE TABLE strings (s VARCHAR COLLATE DE);
INSERT INTO strings VALUES ('z'), ('ä'), ('a');
SELECT sFROM stringsORDER BY s;

Result

 s--- a ä z

Like the built-in collations, ICU collations can be applied per expression with the COLLATE operator:

Query

SELECT sFROM (VALUES ('z'), ('ä'), ('a')) AS t(s)ORDER BY s COLLATE DE;

Result

 s--- a ä z

They can equally be used per column when creating a table (s VARCHAR COLLATE DE) and as the global default_collation.

Using Collations​

Default Collations​

ICU Collations​

Using Collations

Default Collations

ICU Collations