Database Administrators Asked on November 21, 2021
I try to figure out how to improve Postgres 10.6 pg_trgm queries with German umlauts (äöü). In german ‘ö’ can be written as ‘oe’. But beware: not every ‘oe’ can be written as ‘ö’.
CREATE TABLE public.names
(name text COLLATE pg_catalog."default");
CREATE INDEX names_idx
ON public.names USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);
SHOW LC_COLLATE; -- de_DE.UTF-8
When I use the similarity()
function to query the similarity for ‘Schoenstraße’.
SELECT
name,
similarity (name, 'Schoenstraße') AS similarity,
show_trgm (name)
FROM
names
WHERE
name % 'Schoenstraße'
ORDER BY
similarity DESC;
I get the following result:
Name similarity show_trgm
Schyrenstraße 0.588235 {0x9a07c3,0xde3801,"" s"","" sc"",chy,ens,hyr,nst,ren,sch,str,tra,0x76a40e,yre}
Schönstraße 0.5625 {0x9a07c3,0xde3801,0xf00320,0x095f29,"" s"","" sc"",0x6deea5,nst,sch,str,tra,0x76a40e}
db<>fiddle here
Is there anything I can do to improve that or do I need to replace all umlauts in the DB?
For starters: A German poet is still a Poet. 'Poet' <> 'Pöt'; 'oe' <> 'ö'. The convention to replace 'ö' with 'oe' is largely out of use nowadays. See:
I have had related problems many times. The additional module unaccent
is instrumental to cover spelling variants and typos on both sides. Install it once per database:
CREATE EXTENSION unaccent;
See:
The tricky part is to get index support with it. Details may be hard for beginners.
Then create the wrapper function as instructed here (same link again):
Create a new index with it:
CREATE INDEX names_trgm_idx ON names USING gin (f_unaccent(name) gin_trgm_ops);
Query demo:
SELECT name
, similarity(f_unaccent(name), f_unaccent('Schoenstraße')) AS sim_unaccent
, similarity(name, 'Schoenstraße') AS sim
FROM names
WHERE f_unaccent(name) % f_unaccent('Schoenstraße')
ORDER BY f_unaccent(name) <-> f_unaccent('Schoenstraße')
, name <-> 'Schoenstraße'; -- best match first
Sorting by name <-> 'Schoenstraße'
additionally serves as tiebreaker to get best matches first (after filtering cheaply with the index).
Consider a GiST index instead of GIN for "nearest neighbor"-type queries with LIMIT
. See:
Consider full text search to just match complete words (after stemming), or with prefix matching / phrase search ... See:
The unaccent
module provides a dictionary for integration in a TEXT SEARCH CONFIGURATION
:
CREATE TEXT SEARCH CONFIGURATION de (COPY = german);
ALTER TEXT SEARCH CONFIGURATION de
ALTER MAPPING FOR hword, hword_part, word WITH unaccent, german_stem;
Index based on it:
CREATE INDEX names_fts_idx ON names USING GIN (to_tsvector('de', name));
Alternatively, since you are dealing with names, you might base it on the simple
text search configuration instead of german
. No stop words, no stemming. Names don't necessarily follow language-specific rules.
Query demo:
SELECT name
, ts_lexize('unaccent', name)
, to_tsvector('de', name)
, to_tsquery('de', 'Schoenstraße')
, to_tsvector('de', name) @@ to_tsquery('de', 'Schoenstraße') AS match
, similarity(name, 'Schoenstraße')
FROM names
WHERE to_tsvector('de', name) @@ to_tsquery('de', 'Schoenstraße')
ORDER BY name <-> 'Schoenstraße'; -- exact matches first
Again, sorting by name <-> 'Schoenstraße'
serves as tiebreaker, maybe additionally to other sort criteria.
Note, however, that looking for 'Schoenstraße' with FTS does not find 'Schönstraße', as unaccent()
does not change 'oe', and maps 'ö' to 'o'. ('oe' <> 'ö').
Related:
db<>fiddle here -- looking for 'Schoenstraße'
db<>fiddle here -- looking for 'Schönstraße'
Answered by Erwin Brandstetter on November 21, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP