minhash

Generates MinHash signatures for approximate deduplication and similarity detection. Wraps a nested analyzer whose options are prefixed with ANALYZER_.
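To make the idea of a MinHash signature concrete, the following is a minimal Python sketch of the general technique, not this template's implementation. The names `tokenize`, `minhash_signature`, and `estimate_similarity` are hypothetical, the comma tokenizer is only a stand-in for a nested analyzer, and the BLAKE2b-based seeded hash is an assumption chosen for illustration. The hash family the template actually uses is not documented here.

```python
import hashlib


def tokenize(text, delimiter=","):
    """Hypothetical stand-in for a nested analyzer: split and drop empty tokens."""
    return {tok.strip() for tok in text.split(delimiter) if tok.strip()}


def _seeded_hash(token, seed):
    # Assumption: a 64-bit BLAKE2b digest, with the seed passed as the salt
    # so that each seed value acts as a different hash function.
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes):
    """For each hash function, keep the smallest hash value over all tokens."""
    if not tokens:
        raise ValueError("at least one token is required")
    if num_hashes < 1:
        raise ValueError("num_hashes must be >= 1")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature(tokenize("red,green,blue"), num_hashes=64)
    b = minhash_signature(tokenize("blue,green,red"), num_hashes=64)
    c = minhash_signature(tokenize("red,green,yellow"), num_hashes=64)
    print(estimate_similarity(a, b))  # same token set, so every position matches
    print(estimate_similarity(a, c))  # partial overlap, so only some positions match
```

Two inputs with the same token set always produce identical signatures, regardless of token order, which is what makes signatures useful for deduplication. Partially overlapping inputs agree on only some positions.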

Syntax

Options

| Option | Type | Default | Description |
|---|---|---|---|
| ANALYZER_TEMPLATE | string | required | Template for the nested analyzer |
| ANALYZER_* | | | Options for the nested analyzer, prefixed with ANALYZER_ |
| NUMHASHES | integer | 1 | Number of hash functions |
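As a rough guide to choosing NUMHASHES, the sketch below computes the standard error of a similarity estimate under the classic MinHash scheme, where each of k independent hash functions agrees with probability equal to the true Jaccard similarity J. Whether this template follows that exact scheme is an assumption, and `estimate_stderr` is a hypothetical helper, not part of the product.

```python
import math


def estimate_stderr(jaccard, num_hashes):
    """Standard error sqrt(J * (1 - J) / k) of a k-hash MinHash estimate.

    Assumes k independent hash functions (classic MinHash); this is an
    illustration, not a statement about the template's internals.
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be in [0, 1]")
    if num_hashes < 1:
        raise ValueError("num_hashes must be >= 1")
    return math.sqrt(jaccard * (1.0 - jaccard) / num_hashes)


if __name__ == "__main__":
    # Quadrupling the number of hash functions halves the standard error.
    for k in (1, 4, 16, 64):
        print(k, estimate_stderr(0.5, k))
```

Under this assumption, larger NUMHASHES values give more stable estimates at the cost of larger signatures.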

Examples

MinHash with delimiter analyzer

CREATE TEXT SEARCH DICTIONARY minhash_delim (
    TEMPLATE = 'minhash',
    ANALYZER_TEMPLATE = 'delimiter',
    ANALYZER_DELIMITER = ',',
    NUMHASHES = 2
);

MinHash with text analyzer

CREATE TEXT SEARCH DICTIONARY minhash_text (
    TEMPLATE = 'minhash',
    ANALYZER_TEMPLATE = 'text',
    ANALYZER_LOCALE = 'en_US.UTF-8',
    ANALYZER_CASE = 'lower',
    ANALYZER_STEMMING = true,
    NUMHASHES = 4
);

See also