So, you may or may not be new to this term. Fuzzy search. But, let me explain it to you so we can clear any doubts.
What’s a fuzzy search?
A fuzzy search is a type of search where the items are returned even when they are not an exact match. A fuzzy search is performed by a
fuzzy algorithm that evaluates the likeliness between the search query and the values, even when the search query is misspelled or the order of the words changed.
For example, let’s assume that we’re using Google to find information about
washin machines. As you can see, the word
washing is misspelled. But even when Google knows that it’ll return results that are somewhat similar to what you typed. Sometimes it even tries to correct you!
Well, that’s a
fuzzy search. Returning elements that are just
similar one to another.
How’s searching occasionally implemented?
Let’s take a typical requirement: allow a user to search other users by its name. To accomplish that, you may end up building a query like this one:
Assuming you have a user named
Jonh Doe in your database, looking for
Jonh Doe and so on will return the value you’re looking for. But, what if the user types
Well, for sure your app won’t crash with that input. But I’m pretty sure it won’t return anything! Why? Because the order of the words is different than what’s stored on the database, and
ilike won’t do that rearrange of the words to match. One way to overcome this is to query also for the words in a different order, but that’s not elegant.
Apart from that, what happens if now the user (yes, always the user, ugh) types
john doe? There are some ways to handle this with
ilikes but, why doing that amount of work when you can let PostgreSQL do that for you?
pg_trgm. A PostgreSQL extension based on something called a
trigram that will be very useful for this! But, first of all, what’s a
The basic definition of trigram is: “A group of 3 consecutive written units, such as words, letters, etc”. The
pg_trgm extension allows us to divide a word in such.
To see how a trigram looks like, let’s enable the PostgreSQL extension first!
After that, run:
As you see, it’ll divide the word into groups of 3 characters. Why there are spaces in the first and last trigrams? Well, that’s because PostgreSQL assumes that every word to be converted into a trigram has 2 spaces before and one space after. Why? This allows PostgreSQL to create trigrams for words with less than 3 characters (
we, for example). Also, this makes the similarity to take into account the first and last characters, which otherwise would have less weight when the trigrams are generated.
If we select the trigrams for 2 words, we should be able to find how similar those strings are. How? By counting the number of trigrams they share!
How we find those similar trigrams between the 2 sets? Correct, an intersection!
As you can see, there are 7 trigrams common between the 2 sets. Now, how do we know how similar they are? We can do that by calculating a similarity coefficient between the 2 words. How?
Why we use the first word? Well, because we’re comparing the first against the second word. We don’t care about the excedent trigrams from the second word because we know they won’t correlate to the first one.
Now, let’s apply the formula. We have
7.00 common trigrams, and we have
8.00 total trigrams in the first word. Let’s use PostgreSQL to determine the coefficient:
Here, we use decimals because otherwise, PostgreSQL will ignore the decimal part of the operation, thus returning us 0.
Now, we know that our words have a similarity of
0.875. That means they are very similar! But, having to do all of this manually may take time… there should be a better way.
Well, there’s a better way! Alongside the trigram extension, we enabled a set of functions that came in the same packages. For example, the whole process we’ve done for calculating the similarity is abstracted away into a single function:
word_similarity. By using the function, we can get all of this process into a single line:
And even better, there’s a set of useful operators that are built into the same package we’re using! For example, we may let PostgreSQL decide if two words are similar or not by using the
That operator returns if 2 words are similar or not by using a
word_similarity_threshold, which we can query too:
show_limit() function is deprecated, but still works.
That limit is, the minimum coefficient a pair of words can have before marking them as different. You can modify that threshold using the
set_limit() function and provide new value, but for most use cases, the default one should work!
Using trigrams, we can also query for the “distance” between 2 words, which is calculated by taking the
similarity of the words from
<-> operator. The lower the number the more similar (or less distance) they have.
We can use this operator to build something like Google’s
did you mean functionality, which is probably done with n-grams.
First, let’s create a table and populate it with data. For this example, I’ll be using the departments of my country, El Salvador.
And now, some data:
Now, what happens if the user is looking for
San Salvador but it misspells it and writes
sn salvador? Our system should recommend the user to write it properly! How? Let’s see:
And 🎉. By returning the first result, we can tell the user that we know what he was trying to write, and still, return results related to its search, as Google does!
Well, our initial problem was the user, right? Jk, our initial problem was to handle misspellings when users type a name, but they still want a result.
It’s now trivial to do this with our new knowledge, by querying either the
word similarity or the
distance between the words.
Ignore Mark Twain…
I hope you liked the article, and if you are curious and want to delve more, you can read the whole documentation for trigrams and all the operators you can. In my next blog post, I’ll share how to index this kind of data so the queries are faster!
About the author
Graduated from Computer Science Technician from Universidad Pedagógica de El Salvador. Kevin has over 2 years of experience as a Software Developer. He’s currently a Backend Developer at Applaudo Studios.