fuzzyset.js - A fuzzy string set for javascript

What it does

Fuzzyset.js creates a data structure to efficiently compute similarity scores in order to find likely misspellings in user input.

For example, if someone types "mossisippi" in a U.S. state input box, we would use this library to infer that they probably meant "Mississippi". Here's a simple interactive example:

Dictionary


Code

fs = FuzzySet(['Alabama',
'Alaska',
'Arizona',
'Arkansas',
'California',
'Colorado',
'Connecticut',
'Delaware',
'Florida',
'Georgia',
'Hawaii',
'Idaho',
'Illinois',
'Indiana',
'Iowa',
'Kansas',
'Kentucky',
'Louisiana',
'Maine',
'Maryland',
'Massachusetts',
'Michigan',
'Minnesota',
'Mississippi',
'Missouri',
'Montana',
'Nebraska',
'Nevada',
'New Hampshire',
'New Jersey',
'New Mexico',
'New York',
'North Carolina',
'North Dakota',
'Ohio',
'Oklahoma',
'Oregon',
'Pennsylvania',
'Rhode Island',
'South Carolina',
'South Dakota',
'Tennessee',
'Texas',
'Utah',
'Vermont',
'Virginia',
'Washington',
'West Virginia',
'Wisconsin',
'Wyoming'], false)

Look up

Code

fs.get('mossisippi')

Result

Dictionary Term    Similarity Score
Mississippi        0.572

Code

[[0.5715476066494082, 'Mississippi']]
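
For reference, here are the two calls used throughout this document with their arguments labelled. The labels use the parameter names discussed in "Looking Up Matches" below; treat this as a reading aid rather than full API documentation.

// Assumes fuzzyset.js has been loaded, so FuzzySet is available.
fuzzyset = FuzzySet(['Mississippi'], false, 3, 3)
// arguments: (list of strings, useLevenshtein, gramSizeLower, gramSizeUpper)
fuzzyset.get('mossisippi', null, 0.33)
// arguments: (lookup string, value to return when nothing matches, minMatchScore)
// => [[0.5715476066494082, 'Mississippi']]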

How it works

The basic idea is to compare a search against potential matches in the dictionary and order the matches by a similarity score.

The library's functionality can be broken down into three parts:

  1. Calculating Similarity Scores: how do we figure out how "similar" two strings are?
  2. Storing the Dictionary: how do we use the similarity score to store the dictionary efficiently?
  3. Looking up Matches: how do we look up potential matches in that dictionary?

Calculating Similarity Scores

First, to find misspellings we want to have a way to compare strings. How do we know how similar two strings are? In this case, we can compute a "similarity score" from 0 (not similar at all) to 1 (completely similar, the same string).

The formula for cosine similarity gives us a score from 0 to 1, but to use it we have to turn strings into numeric vectors. How do we do that?

To turn a string into a numeric vector, we use the _gramCounter function, which counts the character substrings ("grams") in a given string:

Turning a string into a vector

Example

function call: _gramCounter('Mississippi', 3)

vector output:

gram    count
-mi     1
mis     1
iss     2    (the gram 'iss' occurs 2 times in 'Mississippi')
ssi     2
sis     1
sip     1
ipp     1
ppi     1
pi-     1
(We use -dashes- to mark the beginning and end of words because these are more likely to be correct in misspellings. Also, we normalize the string by converting to lowercase and removing non-alphanumeric characters.)

Code

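The original page embeds the library's gram-counting code here. As a stand-in, here is a minimal sketch of a gram counter along the lines of _gramCounter; the name gramCounter, the regular expression, and other details below are illustrative rather than the library's verbatim source.

function gramCounter(value, gramSize) {
  gramSize = gramSize || 2;
  // Normalize: lowercase, strip non-alphanumeric characters, and pad with
  // '-' to mark the beginning and end of the word.
  var normalized = '-' + value.toLowerCase().replace(/[^a-z0-9]+/g, '') + '-';
  var counts = {};
  for (var i = 0; i + gramSize <= normalized.length; i++) {
    var gram = normalized.slice(i, i + gramSize);
    counts[gram] = (counts[gram] || 0) + 1;
  }
  return counts;
}

gramCounter('Mississippi', 3)
// => { '-mi': 1, mis: 1, iss: 2, ssi: 2, sis: 1, sip: 1, ipp: 1, ppi: 1, 'pi-': 1 }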

Now that we have a way to turn strings into vectors, we can use cosine similarity to get a score between 0 and 1.

The formula for cosine similarity is:

(A ⋅ B) / (‖A‖ ‖B‖)

where A and B are vectors (as shown above),

A ⋅ B is the dot product of those vectors,

and ‖A‖ is the vector magnitude of A.

To calculate the dot product, A ⋅ B, we multiply the corresponding entries of each vector together and add them up:

Calculating the dot product of two vectors (A ⋅ B)
(Here A is the gram vector for 'mossisippi' and B is the gram vector for 'Mississippi'.)

gram    count in A        count in B
-mi     0          ×      1          =   0
-mo     1          ×      0          =   0
mis     0          ×      1          =   0
mos     1          ×      0          =   0
iss     0          ×      2          =   0
oss     1          ×      0          =   0
ssi     1          ×      2          =   2
sis     1          ×      1          =   1
sip     1          ×      1          =   1
isi     1          ×      0          =   0
ipp     1          ×      1          =   1
ppi     1          ×      1          =   1
pi-     1          ×      1          =   1
                                 sum =   7
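
As a sketch in code (reusing the gramCounter function from above; the name dotProduct is illustrative):

function dotProduct(countsA, countsB) {
  var sum = 0;
  for (var gram in countsA) {
    if (countsB.hasOwnProperty(gram)) {
      sum += countsA[gram] * countsB[gram];  // only shared grams contribute
    }
  }
  return sum;
}

dotProduct(gramCounter('mossisippi', 3), gramCounter('Mississippi', 3))  // => 7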

To calculate the magnitudes (‖A‖ and ‖B‖), we take the square root of the sum of squares of each vector's entries:

Calculating the vector magnitudes (‖A‖ and ‖B‖)

A ('mossisippi')

gram    count   count²
-mo     1       1
mos     1       1
oss     1       1
ssi     1       1
sis     1       1
isi     1       1
sip     1       1
ipp     1       1
ppi     1       1
pi-     1       1
sum:    10
‖A‖ = √10 = 3.162

B ('Mississippi')

gram    count   count²
-mi     1       1
mis     1       1
iss     2       4
ssi     2       4
sis     1       1
sip     1       1
ipp     1       1
ppi     1       1
pi-     1       1
sum:    15
‖B‖ = √15 = 3.873
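
The same calculation as a sketch (again reusing gramCounter from above; the name magnitude is illustrative):

function magnitude(counts) {
  var sumOfSquares = 0;
  for (var gram in counts) {
    sumOfSquares += counts[gram] * counts[gram];
  }
  return Math.sqrt(sumOfSquares);
}

magnitude(gramCounter('mossisippi', 3))   // => 3.162...  (‖A‖ = √10)
magnitude(gramCounter('Mississippi', 3))  // => 3.872...  (‖B‖ = √15)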

Putting all that together we get:

Calculating the cosine similarity score

(A ⋅ B) / (‖A‖ ‖B‖) = 7 / (3.162 × 3.873) = 0.572

And that is how Fuzzyset.js calculates a similarity score for two strings. Perfect matches have a score of 1, and completely dissimilar matches have a score of 0. Try it out!
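
Combining the three sketches above gives a small end-to-end similarity function (a sketch, not the library's verbatim code):

function cosineSimilarity(a, b, gramSize) {
  var countsA = gramCounter(a, gramSize);
  var countsB = gramCounter(b, gramSize);
  return dotProduct(countsA, countsB) / (magnitude(countsA) * magnitude(countsB));
}

cosineSimilarity('mossisippi', 'Mississippi', 3)   // => 0.5715... (the score shown above)
cosineSimilarity('Mississippi', 'Mississippi', 3)  // => 1 (a perfect match)
cosineSimilarity('Mississippi', 'Wyoming', 3)      // => 0 (no grams in common)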

Storing the Dictionary

We want to use the formula described above to match user input against a dictionary of strings. To store those strings efficiently so they can later be compared using cosine similarity, we use the following strategy:

[Annotated cosine similarity equation, showing that the vector parts are stored in fuzzyset.matchDict and the vector magnitude in fuzzyset.items. A ⋅ B and ‖A‖ are calculated on the fly when looking up matches.]

Here's what that looks like in practice:

How Fuzzyset stores strings

Example

Function call: fuzzyset = FuzzySet(['Mississippi'], false, 3, 3)

Code
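The original page shows the library's storage code here. Below is a simplified sketch of how a single entry could be stored in the three structures, reusing gramCounter from earlier; the name addToSet is illustrative, and the real library does more bookkeeping (for example, handling a whole range of gram sizes per entry).

function addToSet(fuzzyset, value, gramSize) {
  var normalized = value.toLowerCase();
  fuzzyset.exactSet[normalized] = value;              // normalized -> original string

  var items = fuzzyset.items[gramSize] = fuzzyset.items[gramSize] || [];
  var index = items.length;                           // this entry's position in items

  var counts = gramCounter(normalized, gramSize);     // B as a gram-count vector
  var sumOfSquares = 0;
  for (var gram in counts) {
    sumOfSquares += counts[gram] * counts[gram];
    // matchDict: gram -> array of (index into items, count of that gram in B)
    fuzzyset.matchDict[gram] = fuzzyset.matchDict[gram] || [];
    fuzzyset.matchDict[gram].push([index, counts[gram]]);
  }
  items.push([Math.sqrt(sumOfSquares), normalized]);  // (‖B‖, key into exactSet)
}

var fuzzyset = { exactSet: {}, matchDict: {}, items: {} };
addToSet(fuzzyset, 'Mississippi', 3);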

Result

fuzzyset.matchDict

key (gram)    value: array of (index into fuzzyset.items, count of that gram in B)
-mi           [(0, 1)]
mis           [(0, 1)]
iss           [(0, 2)]
ssi           [(0, 2)]
sis           [(0, 1)]
sip           [(0, 1)]
ipp           [(0, 1)]
ppi           [(0, 1)]
pi-           [(0, 1)]

(When multiple dictionary words share a gram, that gram's array gets one (index, count) entry per word; for example, 'Mississippi' and 'Missouri' would both appear under '-mi' and 'mis'.)

fuzzyset.items

key (gram size)    value: array of (‖B‖, key into fuzzyset.exactSet)
3                  [(3.873, 'mississippi')]

This table is keyed by gram size so we can make a tradeoff between precision and lookup time. More on that in the next section.

fuzzyset.exactSet

key (normalized string)    value (original string)
mississippi                Mississippi

Looking Up Matches

The algorithm for looking up matches is as follows:

  1. Starting with gramSizeUpper, look up potential matches that have at least one gram in common with the lookup string (a dictionary lookup in fuzzyset.matchDict), calculate the cosine similarity between the lookup string and each match, and sort the results by that score. Return all results with scores above the minMatchScore.
  2. If useLevenshtein is true, compute the Levenshtein distance for the top 50 matching strings and return those results sorted by Levenshtein distance, again keeping only scores above the minMatchScore. (Levenshtein distance is useful for finding misspellings where letters are out of order.)
  3. If steps 1 and 2 found no results, try again with the next smaller gram size. (Smaller gram sizes will generally produce more matches of lower similarity.) Keep trying smaller gram sizes until results are found or gramSizeLower is reached.
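
To make step 1 concrete, here is a simplified sketch of the cosine-similarity lookup for a single gram size, using the structures and helper sketches from earlier. (The real fuzzyset.get() also walks gram sizes from gramSizeUpper down to gramSizeLower and can re-rank with Levenshtein distance; the name lookup is illustrative.)

function lookup(fuzzyset, query, gramSize, minMatchScore) {
  var counts = gramCounter(query, gramSize);          // A as a gram-count vector
  var magnitudeA = magnitude(counts);

  // Accumulate A·B for every stored item sharing at least one gram with the query.
  var dotProducts = {};
  for (var gram in counts) {
    var entries = fuzzyset.matchDict[gram] || [];
    for (var i = 0; i < entries.length; i++) {
      var index = entries[i][0], countInB = entries[i][1];
      dotProducts[index] = (dotProducts[index] || 0) + counts[gram] * countInB;
    }
  }

  // Score each candidate with cosine similarity and keep those above the threshold.
  var results = [];
  for (var idx in dotProducts) {
    var item = fuzzyset.items[gramSize][idx];         // [‖B‖, normalized string]
    var score = dotProducts[idx] / (magnitudeA * item[0]);
    if (score >= minMatchScore) {
      results.push([score, fuzzyset.exactSet[item[1]]]);
    }
  }
  return results.sort(function (a, b) { return b[0] - a[0]; });  // best match first
}

// Using the fuzzyset built in the storage sketch above:
lookup(fuzzyset, 'mossisippi', 3, 0.33)  // => [[0.5715..., 'Mississippi']]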

Here's how this looks in practice:

How Fuzzyset looks up strings


Result

Match          Score
Mississippi    0.572

Code (fuzzyset.get() is the top-level lookup function)

Calling fuzzyset.get("mossisippi", null, 0.33)

Shared grams are found in fuzzyset.matchDict, ‖B‖ comes from fuzzyset.items, and the score is (A ⋅ B) / (‖A‖ ‖B‖).

3-gram matches:

Lookup (A)    Match (B)      Shared grams                    A⋅B   ‖A‖     ‖B‖     Score
mossisippi    mississippi    ssi, sis, sip, ipp, ppi, pi-    7     3.162   3.873   0.572
mossisippi    montana        -mo                             1     3.162   2.646   0.120
mossisippi    louisiana      isi                             1     3.162   3.000   0.105

Try playing around with the different parameters to see how they affect the quality of matches. For example, the useLevenshtein option does not produce good matches if a user hasn't typed out a whole word yet.

And that's basically the whole library! We store strings so they can be looked up quickly and compared using cosine similarity.

Future Directions

Programming is hard. It's hard for beginners to create and understand programs, which makes programming as an expressive medium inaccessible. It's hard for experts: even I (the library's maintainer!) had difficulty understanding Fuzzyset, fixing bugs, and adding features. Programming is not just hard; in my opinion it's needlessly hard, creating suffering where none should exist.

I believe much of the reason programming is hard is because so far it hasn't been crafted with an eye toward human understanding, the main bottleneck for both beginners and experts. What would it look like to create programming artifacts and systems that are optimized to be understood?

My strategy is to start small. Can we create even one example of a well-explained system? How about a couple? What can we start to learn from those examples, and how can we bring those insights into the programming environment itself, so authors are assisted as they create their programs?

I don't know the answers to those questions, but I'm excited to find out.


Thanks to Geoffrey Litt for helpful comments on drafts of this document.