I found a Levenshtein Distance function, implemented as SQL Stored Function for MySQL and MariaDB, written by Arjen Lentz. That post also contains a short but interesting discussion about the advantages and disadvantages of Stored Functions over UDFs.
For those who don’t know, the Levenshtein Distance (or Edit Distance) reads two strings and returns the minimum number of atomic edits that must be done to the first string to make it equal to the second string (or vice versa, as you like). An atomic edit can be an addition, a deletion or a substitution of a single character.
String 1: abcde_1
String 2: bcdef_2
The required atomic edits are: delete a, add f, change 1 to 2.
I tried to improve the function. Well, my changes are trivial (because Arjen Lentz is a genius, as you can see by taking a look to
OQGRAPH code, and I’m not), but I wish to share them, as Lentz asked in his post. I also wrote a
levenshtein_ratio Function, and you can find it below.
Here is my “fork”:
DROP FUNCTION IF EXISTS `levenshtein`; CREATE FUNCTION `levenshtein`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8) RETURNS TINYINT UNSIGNED NO SQL DETERMINISTIC BEGIN DECLARE s1_len, s2_len, i, j, c, c_temp TINYINT UNSIGNED; -- max strlen=255 for this function DECLARE cv0, cv1 VARBINARY(256); -- if any param is NULL return NULL -- (consistent with builtin functions) IF (s1 + s2) IS NULL THEN RETURN NULL; END IF; SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; -- if any string is empty, -- distance is the length of the other one IF (s1 = s2) THEN RETURN 0; ELSEIF (s1_len = 0) THEN RETURN s2_len; ELSEIF (s2_len = 0) THEN RETURN s1_len; END IF; WHILE (j <= s2_len) DO SET cv1 = CONCAT(cv1, CHAR(j)), j = j + 1; END WHILE; WHILE (i <= s1_len) DO SET c = i, cv0 = CHAR(i), j = 1; WHILE (j c_temp) THEN SET c = c_temp; END IF; SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1; IF (c > c_temp) THEN SET c = c_temp; END IF; SET cv0 = CONCAT(cv0, CHAR(c)), j = j + 1; END WHILE; SET cv1 = cv0, i = i + 1; END WHILE; RETURN c; END;
What are the differences? Well, excluding the trivial ones:
- If one param is
NULL, my Function returns
NULL. This is the behavior of My and Maria’s builtin functions, and is logically correct: if you don’t know one of the values, you don’t know the Levenshtein Distance.
- Since params have a limit of 255 characters, I turned all numeric variable types and Function return type to
TINYINT UNSIGNED. From my tests, this seems to me correct; please, let me know if I’m wrong.
- I dropped 2 vars:
cost. They were there just to add readability, but I think that the code is still readable as before.
Usually you don’t care about the Levenshtein Distance as an absolute value. If the first string is ‘ace’ and the second string is ‘see’, it’s very unlikely that one of them is misspelled. But if the first string is ‘no woman no cry’ and the second string is ‘no womn no cr’, the second string is probably misspelled. But in both cases, the Levenshtein Distance is 2. What can help us is the ratio between the length of the longest string and the Levenshtein Distance. In the first case the ratio is 67, while in the second case it’s 13. The first value means that 2/3 of the characters are different, the second value tells us that the difference is little.
That’s why I wrote the
levenshtein_ratio() Stored Function:
DROP FUNCTION IF EXISTS `levenshtein_ratio`; CREATE FUNCTION `levenshtein_ratio`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8) RETURNS TINYINT UNSIGNED DETERMINISTIC NO SQL COMMENT 'Levenshtein ratio between strings' BEGIN DECLARE s1_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s1); DECLARE s2_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s2); RETURN ((levenshtein(s1, s2) / IF(s1_len > s2_len, s1_len, s2_len)) * 100); END;
Like Arjen Lentz, if you optimize one of my Functions, I hope you’ll lets me know.