levenshtein and levenshtein_ratio Functions for MySQL

!!!WARNING!!!

Once again, WordPress managed to silently change my code and make it unusable. And this time, it does it too well: I can’t fix it. This (code killing) is the only WP feature which works good, but I must admit that it is definitely perfect.

For this reason, please ignore the code examples below, and download this archive, which contains my Stored Functions, a Test Case for my Functions, and Arjen Lentz’s Stored Function.

I found a Levenshtein Distance function, implemented as SQL Stored Function for MySQL and MariaDB, written by Arjen Lentz. That post also contains a short but interesting discussion about the advantages and disadvantages of Stored Functions over UDFs.

For those who don’t know, the Levenshtein Distance (or Edit Distance) reads two strings and returns the minimum number of atomic edits that must be done to the first string to make it equal to the second string (or vice versa, as you like). An atomic edit can be an addition, a deletion or a substitution of a single character.

Example:

String 1: abcde_1
String 2: bcdef_2
Dinstance: 3.
The required atomic edits are: delete a, add f, change 1 to 2.

I tried to improve the function. Well, my changes are trivial (because Arjen Lentz is a genius, as you can see by taking a look to OQGRAPH code, and I’m not), but I wish to share them, as Lentz asked in his post. I also wrote a levenshtein_ratio Function, and you can find it below.

My Levenshtein

Here is my “fork”:

DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
    RETURNS TINYINT UNSIGNED
    NO SQL
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp TINYINT UNSIGNED;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);
    
    -- if any param is NULL return NULL
    -- (consistent with builtin functions)
    IF (s1 + s2) IS NULL THEN
        RETURN NULL;
    END IF;
    
    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;
    
    -- if any string is empty,
    -- distance is the length of the other one
    IF (s1 = s2) THEN
        RETURN 0;
    ELSEIF (s1_len = 0) THEN
        RETURN s2_len;
    ELSEIF (s2_len = 0) THEN
        RETURN s1_len;
    END IF;
    
    WHILE (j <= s2_len) DO
        SET cv1 = CONCAT(cv1, CHAR(j)),
        j = j + 1;
    END WHILE;
    
    WHILE (i <= s1_len) DO
        SET c = i,
            cv0 = CHAR(i),
            j = 1;
        
        WHILE (j <= c_temp) THEN
                SET c = c_temp;
            END IF;
            
            SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
            IF (c > c_temp) THEN
                SET c = c_temp;
            END IF;
            
            SET cv0 = CONCAT(cv0, CHAR(c)),
                j = j + 1;
        END WHILE;
        
        SET cv1 = cv0,
            i = i + 1;
    END WHILE;
    
    RETURN c;
END;

What are the differences? Well, excluding the trivial ones:

  • If one param is NULL, my Function returns NULL. This is the behavior of My and Maria’s builtin functions, and is logically correct: if you don’t know one of the values, you don’t know the Levenshtein Distance.
  • Since params have a limit of 255 characters, I turned all numeric variable types and Function return type to TINYINT UNSIGNED. From my tests, this seems to me correct; please, let me know if I’m wrong.
  • I dropped 2 vars: s1_chars and cost. They were there just to add readability, but I think that the code is still readable as before.

Levenshtein Ratio

Usually you don’t care about the Levenshtein Distance as an absolute value. If the first string is ‘ace’ and the second string is ‘see’, it’s very unlikely that one of them is misspelled. But if the first string is ‘no woman no cry’ and the second string is ‘no womn no cr’, the second string is probably misspelled. But in both cases, the Levenshtein Distance is 2. What can help us is the ratio between the length of the longest string and the Levenshtein Distance. In the first case the ratio is 67, while in the second case it’s 13. The first value means that 2/3 of the characters are different, the second value tells us that the difference is little.

That’s why I wrote the levenshtein_ratio() Stored Function:

DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
    RETURNS TINYINT UNSIGNED
    DETERMINISTIC
    NO SQL
    COMMENT 'Levenshtein ratio between strings'
BEGIN
    DECLARE s1_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s1);
    DECLARE s2_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s2);
    RETURN ((levenshtein(s1, s2) / IF(s1_len > s2_len, s1_len, s2_len)) * 100);
END;

Like Arjen Lentz, if you optimize one of my Functions, I hope you’ll lets me know.

Enjoy!

About these ads

7 thoughts on “levenshtein and levenshtein_ratio Functions for MySQL

    • I’m afraid that no one can do much better than this using pure SQL. But you can calculate the Levenshtein dinstance on the cliant side. Or create a User Defined Function in C language and install it in SQL using CREATE FUNCTION. See the manual for info.

  1. On which version of MySQL you got this working ? When i copy-paste the code above and try to run it, it says:

    Error

    SQL query:

    CREATE FUNCTION `levenshtein` (
    `s1` VARCHAR( 255 ) CHARACTER SET utf8,
    `s2` VARCHAR( 255 ) CHARACTER SET utf8
    ) RETURNS TINYINT UNSIGNEDNO SQL DETERMINISTIC BEGIN DECLARE s1_len,
    s2_len,
    i,
    j,
    c,
    c_temp TINYINT UNSIGNED;

    MySQL said: Documentation
    #1064 – You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ” at line 6

    it points to row that begins RETURNS TINYINT…
    I have a working alternative @http://openquery.com/blog/levenshtein-mysql-stored-function which works, but i’m interested at using your fork. Any help is much preciated.

    • Hi Jan!
      Thank you for your report. Sorry for the inconvenience. The problem is that WordPress made my code useless by dropping some characters (which were properly escaped in HTML). See the “Warning” I added at the beginning of this post.
      If you are interested in trying my functions, you can download a zip archive from here: http://federico-razzoli.com/stuff/string.zip
      Any feedback is welcome!
      Federico

Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s