### !!!WARNING!!!

**Once again, WordPress managed to silently change my code and make it unusable. And this time, it does it too well: I can’t fix it. This (code killing) is the only WP feature which works good, but I must admit that it is definitely perfect.**

**For this reason, please ignore the code examples below, and download this archive, which contains my Stored Functions, a Test Case for my Functions, and Arjen Lentz’s Stored Function.**

**—**

I found a Levenshtein Distance function, implemented as SQL Stored Function for MySQL and MariaDB, written by Arjen Lentz. That post also contains a short but interesting discussion about the advantages and disadvantages of Stored Functions over UDFs.

For those who don’t know, the **Levenshtein Distance** (or **Edit Distance**) reads two strings and returns the minimum number of *atomic edits* that must be done to the first string to make it equal to the second string (or vice versa, as you like). An atomic edit can be an *addition*, a *deletion* or a *substitution* of a single character.

Example:

`String 1: abcde_1`

String 2: bcdef_2

Dinstance: 3.

The required atomic edits are: delete a, add f, change 1 to 2.

I tried to improve the function. Well, my changes are trivial (because Arjen Lentz is a genius, as you can see by taking a look to `OQGRAPH`

code, and I’m not), but I wish to share them, as Lentz asked in his post. I also wrote a `levenshtein_ratio`

Function, and you can find it below.

### My Levenshtein

Here is my “fork”:

```
DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
RETURNS TINYINT UNSIGNED
NO SQL
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp TINYINT UNSIGNED;
-- max strlen=255 for this function
DECLARE cv0, cv1 VARBINARY(256);
-- if any param is NULL return NULL
-- (consistent with builtin functions)
IF (s1 + s2) IS NULL THEN
RETURN NULL;
END IF;
SET s1_len = CHAR_LENGTH(s1),
s2_len = CHAR_LENGTH(s2),
cv1 = 0x00,
j = 1,
i = 1,
c = 0;
-- if any string is empty,
-- distance is the length of the other one
IF (s1 = s2) THEN
RETURN 0;
ELSEIF (s1_len = 0) THEN
RETURN s2_len;
ELSEIF (s2_len = 0) THEN
RETURN s1_len;
END IF;
WHILE (j <= s2_len) DO
SET cv1 = CONCAT(cv1, CHAR(j)),
j = j + 1;
END WHILE;
WHILE (i <= s1_len) DO
SET c = i,
cv0 = CHAR(i),
j = 1;
WHILE (j <= c_temp) THEN
SET c = c_temp;
END IF;
SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
IF (c > c_temp) THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, CHAR(c)),
j = j + 1;
END WHILE;
SET cv1 = cv0,
i = i + 1;
END WHILE;
RETURN c;
END;
```

What are the differences? Well, excluding the trivial ones:

- If one param is
`NULL`

, my Function returns`NULL`

. This is the behavior of My and Maria’s builtin functions, and is logically correct: if you don’t know one of the values, you don’t know the Levenshtein Distance. - Since params have a limit of 255 characters, I turned all numeric variable types and Function return type to
`TINYINT UNSIGNED`

. From my tests, this seems to me correct;**please, let me know if I’m wrong**. - I dropped 2 vars:
`s1_chars`

and`cost`

. They were there just to add readability, but I think that the code is still readable as before.

### Levenshtein Ratio

Usually you don’t care about the Levenshtein Distance as an absolute value. If the first string is ‘ace’ and the second string is ‘see’, it’s very unlikely that one of them is misspelled. But if the first string is ‘no woman no cry’ and the second string is ‘no womn no cr’, the second string is probably misspelled. But in both cases, the Levenshtein Distance is 2. What can help us is the ratio between the length of the longest string and the Levenshtein Distance. In the first case the ratio is 67, while in the second case it’s 13. The first value means that 2/3 of the characters are different, the second value tells us that the difference is little.

That’s why I wrote the `levenshtein_ratio()`

Stored Function:

```
DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
RETURNS TINYINT UNSIGNED
DETERMINISTIC
NO SQL
COMMENT 'Levenshtein ratio between strings'
BEGIN
DECLARE s1_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s1);
DECLARE s2_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s2);
RETURN ((levenshtein(s1, s2) / IF(s1_len > s2_len, s1_len, s2_len)) * 100);
END;
```

Like Arjen Lentz, if you optimize one of my Functions, I hope you’ll lets me know.

*Enjoy!*

I had to add DELIMITER $$ before the querys.

Really fantastic! So glad i found this. Only slight issue may be with new versions of mysql but this:

IF (s1 + s2) IS NULL THEN

Generates mysql warnings because of the + on the strings. This was my slight change:

IF s1 IS NULL or s2 IS NULL THEN

You are right. Thank you for your contribute! I will update the code soon.

Just to be sure, this is O(nm)?

I think it is. But benchmark it before using it in the “real world”.

This function is running slow. What’s other solution?

I’m afraid that no one can do much better than this using pure SQL. But you can calculate the Levenshtein dinstance on the cliant side. Or create a User Defined Function in C language and install it in SQL using CREATE FUNCTION. See the manual for info.

On which version of MySQL you got this working ? When i copy-paste the code above and try to run it, it says:

Error

SQL query:

CREATE FUNCTION `levenshtein` (

`s1` VARCHAR( 255 ) CHARACTER SET utf8,

`s2` VARCHAR( 255 ) CHARACTER SET utf8

) RETURNS TINYINT UNSIGNEDNO SQL DETERMINISTIC BEGIN DECLARE s1_len,

s2_len,

i,

j,

c,

c_temp TINYINT UNSIGNED;

MySQL said: Documentation

#1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ” at line 6

it points to row that begins RETURNS TINYINT…

I have a working alternative @http://openquery.com/blog/levenshtein-mysql-stored-function which works, but i’m interested at using your fork. Any help is much preciated.

Hi Jan!

Thank you for your report. Sorry for the inconvenience. The problem is that WordPress made my code useless by dropping some characters (which were properly escaped in HTML). See the “Warning” I added at the beginning of this post.

If you are interested in trying my functions, you can download a zip archive from here: http://federico-razzoli.com/stuff/string.zip

Any feedback is welcome!

Federico

good implementation. i will try to compare with other SP implementation of levenshtein

I’ve only made small changes to Arjen Lentz’s Levenshtein. But I’m curios to see your tests results.

Hi java,

It was reported to me (see Jan’s comment) that the code on this page was not working. The reason is that WordPress deleted some chars (which were properly escaped) without an apparent reason.

If you are still interested, you can download the code from here: http://federico-razzoli.com/stuff/string.zip

Federico