Collations
Collations provide rules for how text should be sorted or compared in the execution engine. Collations are useful for localization, as the rules for how text should be ordered are different for different languages or for different countries. These orderings are often incompatible with one another. For example, in English the letter y comes between x and z. However, in Lithuanian the letter y comes between the i and j. For that reason, different collations are supported. The user must choose which collation they want to use when performing sorting and comparison operations.
By default, the BINARY collation is used. That means that strings are ordered and compared based only on their binary contents. This makes sense for standard ASCII characters (i.e., the letters A-Z and numbers 0-9), but generally does not make much sense for special unicode characters. It is, however, by far the fastest method of performing ordering and comparisons. Hence it is recommended to stick with the BINARY collation unless required otherwise.
Using Collations
SereneDB ships with three built-in, region-independent collations: NOCASE, NOACCENT and NFC. The NOCASE collation compares characters as equal regardless of their casing. The NOACCENT collation compares characters as equal regardless of their accents. The NFC collation performs NFC-normalized comparisons, see Unicode normalization for more information. In addition to these three built-ins, SereneDB also includes the region- and language-specific (ICU) collations described in ICU Collations below.
The available collations can be listed with PRAGMA collations. The query below filters to a stable subset to show both the built-ins and a few ICU locales:
SELECT collnameFROM pragma_collations()WHERE collname IN ('nocase', 'noaccent', 'nfc', 'de', 'fr', 'ja')ORDER BY collname; collname---------- de fr ja nfc noaccent nocaseSELECT 'hello' = 'hElLO'; ?column?---------- fSELECT 'hello' COLLATE NOCASE = 'hElLO'; ?column?---------- tSELECT 'hello' = 'hëllo'; ?column?---------- fSELECT 'hello' COLLATE NOACCENT = 'hëllo'; ?column?---------- tCollations can be combined by chaining them using the dot operator. Note, however, that not all collations can be combined together. In general, the NOCASE collation can be combined with any other collator, but most other collations cannot be combined.
SELECT 'hello' COLLATE NOCASE = 'hElLÖ'; ?column?---------- fSELECT 'hello' COLLATE NOACCENT = 'hElLÖ'; ?column?---------- fSELECT 'hello' COLLATE NOCASE.NOACCENT = 'hElLÖ'; ?column?---------- tDefault Collations
The collations we have seen so far have all been specified per expression. It is also possible to specify a default collator, either on the global database level or on a base table column. The PRAGMA default_collation can be used to specify the global default collator. This is the collator that will be used if no other one is specified.
SET default_collation = NOCASE;
SELECT 'hello' = 'HeLlo'; ?column?---------- tCollations can also be specified per-column when creating a table. When that column is then used in a comparison, the per-column collation is used to perform that comparison.
CREATE TABLE names (name VARCHAR COLLATE NOACCENT);
INSERT INTO names VALUES ('hännes');SELECT nameFROM namesWHERE name = 'hannes'; name-------- hännesBe careful here, however, as different collations cannot be combined. This can be problematic when you want to compare columns that have a different collation specified.
SELECT nameFROM namesWHERE name = 'hannes' COLLATE NOCASE;db error: ERROR: Cannot combine types with different collation!CREATE TABLE other_names (name VARCHAR COLLATE NOCASE);
INSERT INTO other_names VALUES ('HÄNNES');SELECT names.name AS name, other_names.name AS other_nameFROM names, other_namesWHERE names.name = other_names.name;db error: ERROR: Cannot combine types with different collation!We need to manually overwrite the collation:
SELECT names.name AS name, other_names.name AS other_nameFROM names, other_namesWHERE names.name COLLATE NOACCENT.NOCASE = other_names.name COLLATE NOACCENT.NOCASE; name | other_name--------+------------ hännes | HÄNNESICU Collations
The collations we have seen so far are not region-dependent and do not follow any specific regional rules. SereneDB also includes region- and language-specific collations powered by ICU. These follow the ordering and comparison rules of a specific language or region.
ICU collations are named by their locale, for example de (German), fr (French) and ja (Japanese). Region-qualified locales are also available, such as de_at (German as used in Austria). The full set of available locales can be inspected with PRAGMA collations as shown above.
For example, the German collation orders the umlaut ä next to a rather than after z:
CREATE TABLE strings (s VARCHAR COLLATE DE);
INSERT INTO strings VALUES ('z'), ('ä'), ('a');
SELECT sFROM stringsORDER BY s; s--- a ä zLike the built-in collations, ICU collations can be applied per expression with the COLLATE operator:
SELECT sFROM (VALUES ('z'), ('ä'), ('a')) AS t(s)ORDER BY s COLLATE DE; s--- a ä zThey can equally be used per column when creating a table (s VARCHAR COLLATE DE) and as the global default_collation.