By Applaudo Studios

2020-05-14

Tags

Development

Fuzzy Searching with PostgreSQL

It’s a fact that everyone makes typos or use alternate spelling on a frequent basis. On this blog, Kevin will guide you on a solution to this problem with PostgreSQL.

So, you may or may not be new to this term. Fuzzy search. But, let me explain it to you so we can clear any doubts.

What’s a fuzzy search?

A fuzzy search is a type of search where the items are returned even when they are not an exact match. A fuzzy search is performed by a fuzzy algorithm that evaluates the likeliness between the search query and the values, even when the search query is misspelled or the order of the words changed.

For example, let’s assume that we’re using Google to find information about washin machines. As you can see, the word washing is misspelled. But even when Google knows that it’ll return results that are somewhat similar to what you typed. Sometimes it even tries to correct you!

Well, that’s a fuzzy search. Returning elements that are just similar one to another.

How’s searching occasionally implemented?

Let’s take a typical requirement: allow a user to search other users by its name. To accomplish that, you may end up building a query like this one:

Assuming you have a user named Jonh Doe in your database, looking for JonhJonHjOnhjonh doeJonh Doe and so on will return the value you’re looking for. But, what if the user types Doe Jonh?

Well, for sure your app won’t crash with that input. But I’m pretty sure it won’t return anything! Why? Because the order of the words is different than what’s stored on the database, and ilike won’t do that rearrange of the words to match. One way to overcome this is to query also for the words in a different order, but that’s not elegant.

Apart from that, what happens if now the user (yes, always the user, ugh) types john doe? There are some ways to handle this with ilikes but, why doing that amount of work when you can let PostgreSQL do that for you?

Let’s introduce pg_trgm. A PostgreSQL extension based on something called a trigram that will be very useful for this! But, first of all, what’s a trigram?

PostgreSQL trigrams. What are they?

The basic definition of trigram is: “A group of 3 consecutive written units, such as words, letters, etc”. The pg_trgm extension allows us to divide a word in such.

How does a trigram look like?

To see how a trigram looks like, let’s enable the PostgreSQL extension first!

After that, run:

As you see, it’ll divide the word into groups of 3 characters. Why there are spaces in the first and last trigrams? Well, that’s because PostgreSQL assumes that every word to be converted into a trigram has 2 spaces before and one space after. Why? This allows PostgreSQL to create trigrams for words with less than 3 characters (we, for example). Also, this makes the similarity to take into account the first and last characters, which otherwise would have less weight when the trigrams are generated.

How to use them to find similar words?

If we select the trigrams for 2 words, we should be able to find how similar those strings are. How? By counting the number of trigrams they share!

How we find those similar trigrams between the 2 sets? Correct, an intersection!

As you can see, there are 7 trigrams common between the 2 sets. Now, how do we know how similar they are? We can do that by calculating a similarity coefficient between the 2 words. How?

Why we use the first word? Well, because we’re comparing the first against the second word. We don’t care about the excedent trigrams from the second word because we know they won’t correlate to the first one.

Now, let’s apply the formula. We have 7.00 common trigrams, and we have 8.00 total trigrams in the first word. Let’s use PostgreSQL to determine the coefficient:

Here, we use decimals because otherwise, PostgreSQL will ignore the decimal part of the operation, thus returning us 0.

Now, we know that our words have a similarity of 0.875. That means they are very similar! But, having to do all of this manually may take time… there should be a better way.

Well, there’s a better way! Alongside the trigram extension, we enabled a set of functions that came in the same packages. For example, the whole process we’ve done for calculating the similarity is abstracted away into a single function: word_similarity. By using the function, we can get all of this process into a single line:

And even better, there’s a set of useful operators that are built into the same package we’re using! For example, we may let PostgreSQL decide if two words are similar or not by using the <% operator:

That operator returns if 2 words are similar or not by using a word_similarity_threshold, which we can query too:

Note that show_limit() function is deprecated, but still works.

That limit is, the minimum coefficient a pair of words can have before marking them as different. You can modify that threshold using the set_limit() function and provide new value, but for most use cases, the default one should work!

Using trigrams, we can also query for the “distance” between 2 words, which is calculated by taking the similarity of the words from 1.

With the <-> operator. The lower the number the more similar (or less distance) they have.

We can use this operator to build something like Google’s did you mean functionality, which is probably done with n-grams.

Building a Did You Mean engine

First, let’s create a table and populate it with data. For this example, I’ll be using the departments of my country, El Salvador.

And now, some data:

Now, what happens if the user is looking for San Salvador but it misspells it and writes sn salvador? Our system should recommend the user to write it properly! How? Let’s see:

And ?. By returning the first result, we can tell the user that we know what he was trying to write, and still, return results related to its search, as Google does!

Solving our initial problem

Well, our initial problem was the user, right? Jk, our initial problem was to handle misspellings when users type a name, but they still want a result.
It’s now trivial to do this with our new knowledge, by querying either the word similarity or the distance between the words.

Ignore Mark Twain…

I hope you liked the article, and if you are curious and want to delve more, you can read the whole documentation for trigrams and all the operators you can. In my next blog post, I’ll share how to index this kind of data so the queries are faster!

About the author

Kevin Alemán

Graduated from Computer Science Technician from Universidad Pedag├│gica de El Salvador. Kevin has over 2 years of experience as a Software Developer. He’s currently a Backend Developer at Applaudo Studios.


Share
Current Time

sal

Calle La Reforma, C.C Plaza San Benito, local 1-3 San Salvador, SV

Current Time

tx

C701 Brazos Street. Suite #1600 Austin, TX 78701, US

Current Time

nc

128 S Tryon St, 21st Floor Charlotte, NC 28202

Current Time

chi

Cerro Colorado 5030, of. 309 Las Condes, Santiago de Chile, CL