What it does
Fuzzyset.js creates a data structure to efficiently compute similarity scores in order to find likely misspellings in user input.
For example, if someone types "mossisippi" in a U.S. state input box, we would use this library to infer that they probably meant "Mississippi". Here's a simple interactive example:
Dictionary
Try replacing with your own text, separated by newlinesCode
fs = FuzzySet(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'], false)
Result
Dictionary Term | Similarity Score |
---|---|
Mississippi | 0.572 |
Code
[[0.5715476066494082, 'Mississippi']]
How it works
The basic idea is to compare a search against potential matches in the dictionary and order the matches by a similarity score.
The library's functionality can be broken down into three parts:
- Calculating Similarity Scores β how do we figure out how "similar" two strings are?
- Storing the Dictionary β how do we use the similarity score to store the dictionary efficiently?
- Looking up Matches β how do we look up potential matches in that dictionary?
Calculating Similarity Scores
First, to find misspellings we want to have a way to compare strings. How do we know how similar two strings are? In this case, we can compute a "similarity score" from 0 (not similar at all) to 1 (completely similar, the same string).
The formula for cosine similarity gives us a score from 0 to 1, but to use it we have to turn strings in numeric vectors. How do we do that?
To turn a string into numeric vector, we use the _gramCounter
function which counts the character substrings ("grams") in a given string:
Turning a string into a vector
Example
value: | <- try changing this! | |||||||||||||||||||||
gramSize: | ||||||||||||||||||||||
function call: | _gramCounter('Mississippi', 3) | |||||||||||||||||||||
vector output: |
| |||||||||||||||||||||
(We use -dashes- to mark the beginning and end of words because these are more likely to be correct in misspellings. Also, we normalize the string by converting to lowercase and removing non-alphanumeric characters.) |
Code
(try editing! maybe add some console.log() statements)_nonWordRe = /[^a-zA-Z0-9\u00C0-\u00FF\u0621-\u064A\u0660-\u0669, ]+/g;
_iterateGrams = function(value, gramSize) {
gramSize = gramSize || 2;
var simplified = '-' + value.toLowerCase().replace(_nonWordRe, '') + '-',
lenDiff = gramSize - simplified.length,
results = [];
if (lenDiff > 0) {
for (var i = 0; i < lenDiff; ++i) {
value += '-';
}
}
for (var i = 0; i < simplified.length - gramSize + 1; ++i) {
results.push(simplified.slice(i, i + gramSize));
}
return results;
};
β
_gramCounter = function(value, gramSize) {
// return an object where key=gram, value=number of occurrences
gramSize = gramSize || 2;
var result = {},
grams = _iterateGrams(value, gramSize),
i = 0;
for (i; i < grams.length; ++i) {
if (grams[i] in result) {
result[grams[i]] += 1;
} else {
result[grams[i]] = 1;
}
}
return result;
};
Now that we have a way to turn strings into vectors, we can use cosine similarity to get a score between 0 and 1.
The formula for cosine similarity is:
Aβ BβAββBβ
where A and B are vectors (as shown above),
Aβ B is the dot product of those vectors,
and βAβ is the vector magnitude of A.
To calculate the dot product, Aβ B, we just multiply the parts of each vector together and add them up:
Calculating the dot product of two vectors (Aβ B)
A | β | B | ||||
---|---|---|---|---|---|---|
gram | count | gram | count | |||
-mi | 0 | Γ | -mi | 1 | = | 0 |
-mo | 1 | Γ | -mo | 0 | = | 0 |
mis | 0 | Γ | mis | 1 | = | 0 |
mos | 1 | Γ | mos | 0 | = | 0 |
iss | 0 | Γ | iss | 2 | = | 0 |
oss | 1 | Γ | oss | 0 | = | 0 |
ssi | 1 | Γ | ssi | 2 | = | 2 |
sis | 1 | Γ | sis | 1 | = | 1 |
sip | 1 | Γ | sip | 1 | = | 1 |
isi | 1 | Γ | isi | 0 | = | 0 |
ipp | 1 | Γ | ipp | 1 | = | 1 |
ppi | 1 | Γ | ppi | 1 | = | 1 |
pi- | 1 | Γ | pi- | 1 | = | 1 |
sum: | 7 |
To calculate the magnitudes ( βAβ and βBβ) we take the root of the sum of squares of each vector:
Calculating the vector magnitudes (βAβ and βBβ)
A | ||
---|---|---|
gram | count | countΒ² |
-mo | 1 | 1 |
mos | 1 | 1 |
oss | 1 | 1 |
ssi | 1 | 1 |
sis | 1 | 1 |
isi | 1 | 1 |
sip | 1 | 1 |
ipp | 1 | 1 |
ppi | 1 | 1 |
pi- | 1 | 1 |
sum: | 10 | |
βAβ=βsum = | 3.162 |
B | ||
---|---|---|
gram | count | countΒ² |
-mi | 1 | 1 |
mis | 1 | 1 |
iss | 2 | 4 |
ssi | 2 | 4 |
sis | 1 | 1 |
sip | 1 | 1 |
ipp | 1 | 1 |
ppi | 1 | 1 |
pi- | 1 | 1 |
sum: | 15 | |
βBβ=βsum = | 3.873 |
Putting all that together we get:
Calculating the cosine similarity score
And that is how Fuzzyset.js calculates a similarity score for two strings. Perfect matches have a score of 1, and completely dissimilar matches have a score of 0. Try it out!
Storing the Dictionary
We want to use the formula described above to match a dictionary of strings. To store strings efficiently so they can be later compared using cosine similarity, we use the following strategy:
(Aβ
B and βAβ will be calculated on-the-fly when looking up matches.)
Here's what that looks like in practice:
How Fuzzyset stores strings
Example
Examples: | Or try your own below |
List of strings: (separated by newlines) | |
Function call: | fuzzyset = FuzzySet(['Mississippi'], false, 3,3) |
Code Try editing!
fuzzyset.add = function(value) {
var normalizedValue = this._normalizeStr(value);
if (normalizedValue in this.exactSet) {
return false;
}
β
var i = this.gramSizeLower;
for (i; i < this.gramSizeUpper + 1; ++i) {
this._add(value, i);
}
};
β
fuzzyset._add = function(value, gramSize) {
var normalizedValue = this._normalizeStr(value),
items = this.items[gramSize] || [],
index = items.length;
β
items.push(0);
var gramCounts = _gramCounter(normalizedValue, gramSize),
sumOfSquareGramCounts = 0,
gram, gramCount;
for (gram in gramCounts) {
gramCount = gramCounts[gram];
sumOfSquareGramCounts += Math.pow(gramCount, 2);
if (gram in this.matchDict) {
this.matchDict[gram].push([index, gramCount]);
} else {
this.matchDict[gram] = [[index, gramCount]];
}
}
var vectorNormal = Math.sqrt(sumOfSquareGramCounts);
items[index] = [vectorNormal, normalizedValue];
this.items[gramSize] = items;
this.exactSet[normalizedValue] = value;
};
β
fuzzyset._normalizeStr = function(str) {
if (Object.prototype.toString.call(str) !== '[object String]') {
throw 'Must use a string as argument to FuzzySet functions';
}
return str.toLowerCase();
};
Result
fuzzyset.matchDict
key | value |
---|---|
gram | array of: |
(index to fuzzyset.items, count of B) | |
-mi | [( 0 , 1 )]β¬ β¬ Hover over each number for a detailed description |
mis | [( 0 , 1 )] |
iss | [( 0 , 2 )] |
ssi | [( 0 , 2 )] |
sis | [( 0 , 1 )] |
sip | [( 0 , 1 )] |
ipp | [( 0 , 1 )] |
ppi | [( 0 , 1 )] |
pi- | [( 0 , 1 )] |
See the "Mississippi / Missouri" example for more detail when words have shared grams.
fuzzyset.items
key | value |
---|---|
gram size | array of: |
(βBβ, index to fuzzyset.exactSet) | |
3 | [(3.873, mississippi)] |
This table is keyed by gram size so we can make a tradeoff between precision and lookup time. More on that in the next section.
fuzzyset.exactSet
key | value |
---|---|
normalized string | original string |
mississippi | Mississippi |
Looking Up Matches
The algorithm for looking up matches is as follows:
- Starting with
gramSizeUpper
, look up potential matches with at least one gram in common (a dictionary lookup infuzzyset.matchDict
), calculate the cosine similarity between the lookup string and the match, and sort the results by that score. Return all results with scores above theminMatchScore
. - If
useLevenshtein
istrue
, compute the Levenshtein distance for the top 50 matching strings and return those results sorted by Levenshtein distance with scores above theminMatchScore
. (Levenshtein distance is useful for finding misspellings where letters are out of order.) - If no results were found for steps 2 or 3, try again with the next largest gram size. (Lower gram sizes will generally create more matches of lower similarity.) Keep trying with lower gram sizes until results are found or
gramSizeLower
is reached.
Here's how this looks in practice:
How Fuzzyset looks up strings
Dictionary
Lookup
Examples: north, nroth kadota, nroth kadota with levenshtein score
Result
Match | Score |
---|---|
Mississippi | 0.572 |
Code fuzzyset.get() is the top-level function
// helper functions
var levenshtein = function(str1, str2) {
var current = [], prev, value;
β
for (var i = 0; i <= str2.length; i++)
for (var j = 0; j <= str1.length; j++) {
if (i && j)
if (str1.charAt(j - 1) === str2.charAt(i - 1))
value = prev;
else
value = Math.min(current[j], current[j - 1], prev) + 1;
else
value = i + j;
β
prev = current[j];
current[j] = value;
}
β
return current.pop();
};
β
// return an edit distance from 0 to 1
var _distance = function(str1, str2) {
if (str1 === null && str2 === null) throw 'Trying to compare two null values';
if (str1 === null || str2 === null) return 0;
str1 = String(str1); str2 = String(str2);
β
var distance = levenshtein(str1, str2);
if (str1.length > str2.length) {
return 1 - distance / str1.length;
} else {
return 1 - distance / str2.length;
}
};
β
// main lookup function
fuzzyset.get = function(value, defaultValue, minMatchScore) {
// check for value in set, returning defaultValue or null if none found
if (minMatchScore === undefined) {
minMatchScore = .33
}
var result = this._get(value, minMatchScore);
if (!result && typeof defaultValue !== 'undefined') {
return defaultValue;
}
return result;
};
β
fuzzyset._get = function(value, minMatchScore) {
var results = [];
// start with high gram size and if there are no results, go to lower gram sizes
for (var gramSize = this.gramSizeUpper; gramSize >= this.gramSizeLower; --gramSize) {
results = this.__get(value, gramSize, minMatchScore);
if (results && results.length > 0) {
return results;
}
}
return null;
};
β
fuzzyset.__get = function(value, gramSize, minMatchScore) {
var normalizedValue = this._normalizeStr(value),
matches = {},
gramCounts = _gramCounter(normalizedValue, gramSize),
items = this.items[gramSize],
sumOfSquareGramCounts = 0,
gram,
gramCount,
i,
index,
otherGramCount;
β
for (gram in gramCounts) {
gramCount = gramCounts[gram];
sumOfSquareGramCounts += Math.pow(gramCount, 2);
if (gram in this.matchDict) {
for (i = 0; i < this.matchDict[gram].length; ++i) {
index = this.matchDict[gram][i][0];
otherGramCount = this.matchDict[gram][i][1];
if (index in matches) {
matches[index] += gramCount * otherGramCount;
} else {
matches[index] = gramCount * otherGramCount;
}
}
}
}
β
function isEmptyObject(obj) {
for(var prop in obj) {
if(obj.hasOwnProperty(prop))
return false;
}
return true;
}
β
if (isEmptyObject(matches)) {
return null;
}
β
var vectorNormal = Math.sqrt(sumOfSquareGramCounts),
results = [],
matchScore;
// build a results list of [score, str]
for (var matchIndex in matches) {
matchScore = matches[matchIndex];
results.push([matchScore / (vectorNormal * items[matchIndex][0]), items[matchIndex][1]]);
}
var sortDescending = function(a, b) {
if (a[0] < b[0]) {
return 1;
} else if (a[0] > b[0]) {
return -1;
} else {
return 0;
}
};
results.sort(sortDescending);
if (this.useLevenshtein) {
var newResults = [],
endIndex = Math.min(50, results.length);
// truncate somewhat arbitrarily to 50
for (var i = 0; i < endIndex; ++i) {
newResults.push([_distance(results[i][1], normalizedValue), results[i][1]]);
}
results = newResults;
results.sort(sortDescending);
}
var newResults = [];
results.forEach(function(scoreWordPair) {
if (scoreWordPair[0] >= minMatchScore) {
newResults.push([scoreWordPair[0], this.exactSet[scoreWordPair[1]]]);
}
}.bind(this))
return newResults;
};
Calling fuzzyset.get("mossisippi", null, 0.33)
Lookup (A) | Match (B) | Shared Grams (found in fuzzyset.matchDict) | Aβ B | βAβ | βBβ (from fuzzyset.items) | Aβ BβAββBβ Score |
---|---|---|---|---|---|---|
3-gram matches | ||||||
mossisippi | mississippi | ssi, sis, sip, ipp, ppi, pi- | 7 | 3.162 | 3.873 | 0.572 |
mossisippi | montana | -mo | 1 | 3.162 | 2.646 | 0.120 |
mossisippi | louisiana | isi | 1 | 3.162 | 3 | 0.105 |
Try playing around with the different parameters to see how it affects the quality of matches. For example, the useLevenshtein
parameter does not provide good matches if a user hasn't typed out a whole word yet.
And that's basically the whole library! We store strings so they can be looked up quickly and compared using cosine similarity.