fuzzyset.js - A fuzzy string set for javascript

What it does

Fuzzyset.js creates a data structure to efficiently compute similarity scores in order to find likely misspellings in user input.

For example, if someone types "mossisippi" in a U.S. state input box, we would use this library to infer that they probably meant "Mississippi". Here's a simple interactive example:

Dictionary


Code

fs = FuzzySet(['Alabama',
'Alaska',
'Arizona',
'Arkansas',
'California',
'Colorado',
'Connecticut',
'Delaware',
'Florida',
'Georgia',
'Hawaii',
'Idaho',
'Illinois',
'Indiana',
'Iowa',
'Kansas',
'Kentucky',
'Louisiana',
'Maine',
'Maryland',
'Massachusetts',
'Michigan',
'Minnesota',
'Mississippi',
'Missouri',
'Montana',
'Nebraska',
'Nevada',
'New Hampshire',
'New Jersey',
'New Mexico',
'New York',
'North Carolina',
'North Dakota',
'Ohio',
'Oklahoma',
'Oregon',
'Pennsylvania',
'Rhode Island',
'South Carolina',
'South Dakota',
'Tennessee',
'Texas',
'Utah',
'Vermont',
'Virginia',
'Washington',
'West Virginia',
'Wisconsin',
'Wyoming'], false)

Look up

Code

fs.get('mossisippi')

Result

Dictionary Term    Similarity Score
Mississippi        0.572

Code

[[0.5715476066494082, 'Mississippi']]
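
For reference, here are the two calls used throughout this document with their arguments labelled. The labels use the parameter names discussed in "Looking Up Matches" below; treat this as a reading aid rather than full API documentation.

// Assumes fuzzyset.js has been loaded, so FuzzySet is available.
fuzzyset = FuzzySet(['Mississippi'], false, 3, 3)
// arguments: (list of strings, useLevenshtein, gramSizeLower, gramSizeUpper)
fuzzyset.get('mossisippi', null, 0.33)
// arguments: (lookup string, value to return when nothing matches, minMatchScore)
// => [[0.5715476066494082, 'Mississippi']]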

How it works

The basic idea is to compare a search against potential matches in the dictionary and order the matches by a similarity score.

The library's functionality can be broken down into three parts:

  1. Calculating Similarity Scores: how do we figure out how "similar" two strings are?
  2. Storing the Dictionary: how do we use the similarity score to store the dictionary efficiently?
  3. Looking up Matches: how do we look up potential matches in that dictionary?

Calculating Similarity Scores

First, to find misspellings we want to have a way to compare strings. How do we know how similar two strings are? In this case, we can compute a "similarity score" from 0 (not similar at all) to 1 (completely similar, the same string).

The formula for cosine similarity gives us a score from 0 to 1, but to use it we have to turn strings into numeric vectors. How do we do that?

To turn a string into a numeric vector, we use the _gramCounter function, which counts the character substrings ("grams") in a given string:

Turning a string into a vector

Example

function call: _gramCounter('Mississippi', 3)

vector output:

gram    count
-mi     1
mis     1
iss     2    (the gram 'iss' occurs 2 times in 'Mississippi')
ssi     2
sis     1
sip     1
ipp     1
ppi     1
pi-     1
(We use -dashes- to mark the beginning and end of words because these are more likely to be correct in misspellings. Also, we normalize the string by converting to lowercase and removing non-alphanumeric characters.)

Code

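The original page embeds the library's gram-counting code here. As a stand-in, here is a minimal sketch of a gram counter along the lines of _gramCounter; the name gramCounter, the regular expression, and other details below are illustrative rather than the library's verbatim source.

function gramCounter(value, gramSize) {
  gramSize = gramSize || 2;
  // Normalize: lowercase, strip non-alphanumeric characters, and pad with
  // '-' to mark the beginning and end of the word.
  var normalized = '-' + value.toLowerCase().replace(/[^a-z0-9]+/g, '') + '-';
  var counts = {};
  for (var i = 0; i + gramSize <= normalized.length; i++) {
    var gram = normalized.slice(i, i + gramSize);
    counts[gram] = (counts[gram] || 0) + 1;
  }
  return counts;
}

gramCounter('Mississippi', 3)
// => { '-mi': 1, mis: 1, iss: 2, ssi: 2, sis: 1, sip: 1, ipp: 1, ppi: 1, 'pi-': 1 }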

Now that we have a way to turn strings into vectors, we can use cosine similarity to get a score between 0 and 1.

The formula for cosine similarity is:

(A ⋅ B) / (‖A‖ ‖B‖)

where A and B are vectors (as shown above),

A ⋅ B is the dot product of those vectors,

and ‖A‖ is the vector magnitude of A.

To calculate the dot product, A ⋅ B, we multiply the corresponding entries of each vector together and add them up:

Calculating the dot product of two vectors (A ⋅ B)
(Here A is the gram vector for 'mossisippi' and B is the gram vector for 'Mississippi'.)

gram    count in A        count in B
-mi     0          ×      1          =   0
-mo     1          ×      0          =   0
mis     0          ×      1          =   0
mos     1          ×      0          =   0
iss     0          ×      2          =   0
oss     1          ×      0          =   0
ssi     1          ×      2          =   2
sis     1          ×      1          =   1
sip     1          ×      1          =   1
isi     1          ×      0          =   0
ipp     1          ×      1          =   1
ppi     1          ×      1          =   1
pi-     1          ×      1          =   1
                                 sum =   7
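
As a sketch in code (reusing the gramCounter function from above; the name dotProduct is illustrative):

function dotProduct(countsA, countsB) {
  var sum = 0;
  for (var gram in countsA) {
    if (countsB.hasOwnProperty(gram)) {
      sum += countsA[gram] * countsB[gram];  // only shared grams contribute
    }
  }
  return sum;
}

dotProduct(gramCounter('mossisippi', 3), gramCounter('Mississippi', 3))  // => 7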

To calculate the magnitudes (‖A‖ and ‖B‖), we take the square root of the sum of squares of each vector's entries:

Calculating the vector magnitudes (‖A‖ and ‖B‖)

A ('mossisippi')

gram    count   count²
-mo     1       1
mos     1       1
oss     1       1
ssi     1       1
sis     1       1
isi     1       1
sip     1       1
ipp     1       1
ppi     1       1
pi-     1       1
sum:    10
‖A‖ = √10 = 3.162

B ('Mississippi')

gram    count   count²
-mi     1       1
mis     1       1
iss     2       4
ssi     2       4
sis     1       1
sip     1       1
ipp     1       1
ppi     1       1
pi-     1       1
sum:    15
‖B‖ = √15 = 3.873
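
The same calculation as a sketch (again reusing gramCounter from above; the name magnitude is illustrative):

function magnitude(counts) {
  var sumOfSquares = 0;
  for (var gram in counts) {
    sumOfSquares += counts[gram] * counts[gram];
  }
  return Math.sqrt(sumOfSquares);
}

magnitude(gramCounter('mossisippi', 3))   // => 3.162...  (‖A‖ = √10)
magnitude(gramCounter('Mississippi', 3))  // => 3.872...  (‖B‖ = √15)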

Putting all that together we get:

Calculating the cosine similarity score

(A ⋅ B) / (‖A‖ ‖B‖) = 7 / (3.162 × 3.873) = 0.572

And that is how Fuzzyset.js calculates a similarity score for two strings. Perfect matches have a score of 1, and completely dissimilar matches have a score of 0. Try it out!
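
Combining the three sketches above gives a small end-to-end similarity function (a sketch, not the library's verbatim code):

function cosineSimilarity(a, b, gramSize) {
  var countsA = gramCounter(a, gramSize);
  var countsB = gramCounter(b, gramSize);
  return dotProduct(countsA, countsB) / (magnitude(countsA) * magnitude(countsB));
}

cosineSimilarity('mossisippi', 'Mississippi', 3)   // => 0.5715... (the score shown above)
cosineSimilarity('Mississippi', 'Mississippi', 3)  // => 1 (a perfect match)
cosineSimilarity('Mississippi', 'Wyoming', 3)      // => 0 (no grams in common)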

Storing the Dictionary

We want to use the formula described above to match user input against a dictionary of strings. To store those strings efficiently so they can later be compared using cosine similarity, we use the following strategy:

[Annotated cosine similarity equation, showing that the vector parts are stored in fuzzyset.matchDict and the vector magnitude in fuzzyset.items. A ⋅ B and ‖A‖ are calculated on the fly when looking up matches.]

Here's what that looks like in practice:

How Fuzzyset stores strings

Example

Function call: fuzzyset = FuzzySet(['Mississippi'], false, 3, 3)

Code
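The original page shows the library's storage code here. Below is a simplified sketch of how a single entry could be stored in the three structures, reusing gramCounter from earlier; the name addToSet is illustrative, and the real library does more bookkeeping (for example, handling a whole range of gram sizes per entry).

function addToSet(fuzzyset, value, gramSize) {
  var normalized = value.toLowerCase();
  fuzzyset.exactSet[normalized] = value;              // normalized -> original string

  var items = fuzzyset.items[gramSize] = fuzzyset.items[gramSize] || [];
  var index = items.length;                           // this entry's position in items

  var counts = gramCounter(normalized, gramSize);     // B as a gram-count vector
  var sumOfSquares = 0;
  for (var gram in counts) {
    sumOfSquares += counts[gram] * counts[gram];
    // matchDict: gram -> array of (index into items, count of that gram in B)
    fuzzyset.matchDict[gram] = fuzzyset.matchDict[gram] || [];
    fuzzyset.matchDict[gram].push([index, counts[gram]]);
  }
  items.push([Math.sqrt(sumOfSquares), normalized]);  // (‖B‖, key into exactSet)
}

var fuzzyset = { exactSet: {}, matchDict: {}, items: {} };
addToSet(fuzzyset, 'Mississippi', 3);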

Result

fuzzyset.matchDict

key (gram)    value: array of (index into fuzzyset.items, count of that gram in B)
-mi           [(0, 1)]
mis           [(0, 1)]
iss           [(0, 2)]
ssi           [(0, 2)]
sis           [(0, 1)]
sip           [(0, 1)]
ipp           [(0, 1)]
ppi           [(0, 1)]
pi-           [(0, 1)]

(When multiple dictionary words share a gram, that gram's array gets one (index, count) entry per word; for example, 'Mississippi' and 'Missouri' would both appear under '-mi' and 'mis'.)

fuzzyset.items

key (gram size)    value: array of (‖B‖, key into fuzzyset.exactSet)
3                  [(3.873, 'mississippi')]

This table is keyed by gram size so we can make a tradeoff between precision and lookup time. More on that in the next section.

fuzzyset.exactSet

key (normalized string)    value (original string)
mississippi                Mississippi

Looking Up Matches

The algorithm for looking up matches is as follows:

  1. Starting with gramSizeUpper, look up potential matches that have at least one gram in common with the lookup string (a dictionary lookup in fuzzyset.matchDict), calculate the cosine similarity between the lookup string and each match, and sort the results by that score. Return all results with scores above the minMatchScore.
  2. If useLevenshtein is true, compute the Levenshtein distance for the top 50 matching strings and return those results sorted by Levenshtein distance, again keeping only scores above the minMatchScore. (Levenshtein distance is useful for finding misspellings where letters are out of order.)
  3. If steps 1 and 2 found no results, try again with the next smaller gram size. (Smaller gram sizes will generally produce more matches of lower similarity.) Keep trying smaller gram sizes until results are found or gramSizeLower is reached.
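
To make step 1 concrete, here is a simplified sketch of the cosine-similarity lookup for a single gram size, using the structures and helper sketches from earlier. (The real fuzzyset.get() also walks gram sizes from gramSizeUpper down to gramSizeLower and can re-rank with Levenshtein distance; the name lookup is illustrative.)

function lookup(fuzzyset, query, gramSize, minMatchScore) {
  var counts = gramCounter(query, gramSize);          // A as a gram-count vector
  var magnitudeA = magnitude(counts);

  // Accumulate A·B for every stored item sharing at least one gram with the query.
  var dotProducts = {};
  for (var gram in counts) {
    var entries = fuzzyset.matchDict[gram] || [];
    for (var i = 0; i < entries.length; i++) {
      var index = entries[i][0], countInB = entries[i][1];
      dotProducts[index] = (dotProducts[index] || 0) + counts[gram] * countInB;
    }
  }

  // Score each candidate with cosine similarity and keep those above the threshold.
  var results = [];
  for (var idx in dotProducts) {
    var item = fuzzyset.items[gramSize][idx];         // [‖B‖, normalized string]
    var score = dotProducts[idx] / (magnitudeA * item[0]);
    if (score >= minMatchScore) {
      results.push([score, fuzzyset.exactSet[item[1]]]);
    }
  }
  return results.sort(function (a, b) { return b[0] - a[0]; });  // best match first
}

// Using the fuzzyset built in the storage sketch above:
lookup(fuzzyset, 'mossisippi', 3, 0.33)  // => [[0.5715..., 'Mississippi']]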

Here's how this looks in practice:

How Fuzzyset looks up strings


Result

Match          Score
Mississippi    0.572

Code (fuzzyset.get() is the top-level lookup function)

Calling fuzzyset.get("mossisippi", null, 0.33)

Shared grams are found in fuzzyset.matchDict, ‖B‖ comes from fuzzyset.items, and the score is (A ⋅ B) / (‖A‖ ‖B‖).

3-gram matches:

Lookup (A)    Match (B)      Shared grams                    A⋅B   ‖A‖     ‖B‖     Score
mossisippi    mississippi    ssi, sis, sip, ipp, ppi, pi-    7     3.162   3.873   0.572
mossisippi    montana        -mo                             1     3.162   2.646   0.120
mossisippi    louisiana      isi                             1     3.162   3.000   0.105

Try playing around with the different parameters to see how they affect the quality of matches. For example, the useLevenshtein option does not produce good matches if a user hasn't typed out a whole word yet.

And that's basically the whole library! We store strings so they can be looked up quickly and compared using cosine similarity.

Future Directions

Programming is hard. It's hard for beginners to create and understand programs, which makes programming as an expressive medium inaccessible. It's hard for experts: even I (the library's maintainer!) had difficulty understanding Fuzzyset, fixing bugs, and adding features. Programming is not just hard; in my opinion it's needlessly hard, creating suffering where none should exist.

I believe much of the reason programming is hard is because so far it hasn't been crafted with an eye toward human understanding, the main bottleneck for both beginners and experts. What would it look like to create programming artifacts and systems that are optimized to be understood?

My strategy is to start small. Can we create even one example of a well-explained system? How about a couple? What can we start to learn from those examples, and how can we bring those insights into the programming environment itself, so authors are assisted as they create their programs?

I don't know the answers to those questions, but I'm excited to find out.


Thanks to Geoffrey Litt for helpful comments on drafts of this document.