minhash
Generates MinHash signatures from the output of a nested analyzer, for approximate deduplication and similarity detection. Options for the nested analyzer are passed with an ANALYZER_ prefix.
Syntax
Options
| Option | Type | Default | Description |
|---|---|---|---|
| ANALYZER_TEMPLATE | string | required | Template for the nested analyzer |
| ANALYZER_* | - | - | Options for the nested analyzer, prefixed with ANALYZER_ |
| NUMHASHES | integer | 1 | Number of hash functions |
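
To illustrate what NUMHASHES controls, here is a minimal conceptual sketch in Python. It is not this template's implementation: it assumes the common MinHash variant that keeps one minimum per hash function, and the names `minhash_signature` and `estimate_jaccard`, as well as the use of salted BLAKE2b as the hash family, are hypothetical choices made for the example. The template itself may hash and combine tokens differently.

```python
import hashlib


def minhash_signature(tokens, num_hashes):
    """Return num_hashes minimum hash values over the set of tokens.

    Assumption: one salted hash function per signature slot. This is an
    illustrative scheme, not necessarily what the minhash template does.
    """
    unique = set(tokens)
    if not unique:
        raise ValueError("need at least one token")
    signature = []
    for seed in range(num_hashes):
        # A different salt per slot simulates an independent hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in unique
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Tokens as a comma delimiter analyzer might produce them (order is irrelevant).
a = minhash_signature("red,green,blue".split(","), num_hashes=64)
b = minhash_signature("blue,green,red".split(","), num_hashes=64)
print(estimate_jaccard(a, b))  # identical token sets give 1.0
```

Under this scheme, each additional hash function adds one slot to the signature, so a larger value gives a finer-grained similarity estimate at the cost of a longer signature. With a single hash function, the estimate from comparing two signatures can only be 0 or 1.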
Examples
MinHash with delimiter analyzer
CREATE TEXT SEARCH DICTIONARY minhash_delim (
    TEMPLATE = 'minhash',
    ANALYZER_TEMPLATE = 'delimiter',
    ANALYZER_DELIMITER = ',',
    NUMHASHES = 2
);
MinHash with text analyzer
CREATE TEXT SEARCH DICTIONARY minhash_text (
    TEMPLATE = 'minhash',
    ANALYZER_TEMPLATE = 'text',
    ANALYZER_LOCALE = 'en_US.UTF-8',
    ANALYZER_CASE = 'lower',
    ANALYZER_STEMMING = true,
    NUMHASHES = 4
);
See also
- CREATE TEXT SEARCH DICTIONARY
- pipeline: another composition template