Trigram similarity (pg_trgm) with German umlauts

Question

I try to figure out how to improve Postgres 10.6 pg_trgm queries with German umlauts (äöü). In german 'ö' can be written as 'oe'. But beware: not every 'oe' can be written as 'ö'.
CREATE TABLE public.names
  (name text COLLATE pg_catalog."default");

CREATE INDEX names_idx
    ON public.names USING gin (name COLLATE pg_catalog."default" gin_trgm_ops);

SHOW LC_COLLATE; -- de_DE.UTF-8

When I use the similarity()
function to query the similarity for 'Schoenstraße'.
SELECT
    name,
    similarity (name, 'Schoenstraße') AS similarity,
    show_trgm (name)
FROM
    names
WHERE
    name % 'Schoenstraße'
ORDER BY
    similarity DESC;

I get the following result:
Name            similarity  show_trgm

Schyrenstraße   0.588235    {0x9a07c3,0xde3801,""  s"","" sc"",chy,ens,hyr,nst,ren,sch,str,tra,0x76a40e,yre}
Schönstraße     0.5625      {0x9a07c3,0xde3801,0xf00320,0x095f29,""  s"","" sc"",0x6deea5,nst,sch,str,tra,0x76a40e}

db<>fiddle here
Is there anything I can do to improve that or do I need to replace all umlauts in the DB?

Erwin Brandstetter · Answer

For starters: A German poet is still a Poet. 'Poet' <> 'Pöt'; 'oe' <> 'ö'. The convention to replace 'ö' with 'oe' is largely out of use nowadays. See: https://de.wikipedia.org/wiki/Umlaut#Darstellung_von_Umlauten I have had related problems many times. The additional module unaccent is instrumental to cover spelling variants and typos on both sides. Install it once per database: CREATE EXTENSION unaccent; See: Does PostgreSQL support “accent insensitive” collations? The tricky part is to get index support with it. Details may be hard for beginners. Trigram similarity Then create the wrapper function as instructed here (same link again): Does PostgreSQL support “accent insensitive” collations? Create a new index with it: CREATE INDEX names_trgm_idx ON names USING gin (f_unaccent(name) gin_trgm_ops); Query demo: SELECT name , similarity(f_unaccent(name), f_unaccent('Schoenstraße')) AS sim_unaccent , similarity(name, 'Schoenstraße') AS sim FROM names WHERE f_unaccent(name) % f_unaccent('Schoenstraße') ORDER BY f_unaccent(name) <-> f_unaccent('Schoenstraße') , name <-> 'Schoenstraße'; -- best match first Sorting by name <-> 'Schoenstraße' additionally serves as tiebreaker to get best matches first (after filtering cheaply with the index). Consider a GiST index instead of GIN for "nearest neighbor"-type queries with LIMIT. See: Best index for similarity function Full text search (FTS or just TS) Consider full text search to just match complete words (after stemming), or with prefix matching / phrase search ... See: Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL The unaccent module provides a dictionary for integration in a TEXT SEARCH CONFIGURATION: CREATE TEXT SEARCH CONFIGURATION de (COPY = german); ALTER TEXT SEARCH CONFIGURATION de ALTER MAPPING FOR hword, hword_part, word WITH unaccent, german_stem; Index based on it: CREATE INDEX names_fts_idx ON names USING GIN (to_tsvector('de', name)); Alternatively, since you are dealing with names, you might base it on the simple text search configuration instead of german. No stop words, no stemming. Names don't necessarily follow language-specific rules. Query demo: SELECT name , ts_lexize('unaccent', name) , to_tsvector('de', name) , to_tsquery('de', 'Schoenstraße') , to_tsvector('de', name) @@ to_tsquery('de', 'Schoenstraße') AS match , similarity(name, 'Schoenstraße') FROM names WHERE to_tsvector('de', name) @@ to_tsquery('de', 'Schoenstraße') ORDER BY name <-> 'Schoenstraße'; -- exact matches first Again, sorting by name <-> 'Schoenstraße' serves as tiebreaker, maybe additionally to other sort criteria. Note, however, that looking for 'Schoenstraße' with FTS does not find 'Schönstraße', as unaccent() does not change 'oe', and maps 'ö' to 'o'. ('oe' <> 'ö'). Related: Creating a case-insensitive and accent/diacritics insensitive search on a field Finding the most commonly used non-stop words in a column db<>fiddle here -- looking for 'Schoenstraße' db<>fiddle here -- looking for 'Schönstraße'

Trigram similarity (pg_trgm) with German umlauts

One Answer

Trigram similarity

Full text search (FTS or just TS)

Add your own answers!

Ask a Question