levenshtein and levenshtein_ratio Functions for MySQL

!!!WARNING!!!

Once again, WordPress managed to silently change my code and make it unusable. And this time, it does it too well: I can’t fix it. This (code killing) is the only WP feature which works good, but I must admit that it is definitely perfect.

For this reason, please ignore the code examples below, and download this archive, which contains my Stored Functions, a Test Case for my Functions, and Arjen Lentz’s Stored Function.

I found a Levenshtein Distance function, implemented as SQL Stored Function for MySQL and MariaDB, written by Arjen Lentz. That post also contains a short but interesting discussion about the advantages and disadvantages of Stored Functions over UDFs.

For those who don’t know, the Levenshtein Distance (or Edit Distance) reads two strings and returns the minimum number of atomic edits that must be done to the first string to make it equal to the second string (or vice versa, as you like). An atomic edit can be an addition, a deletion or a substitution of a single character.

Example:

String 1: abcde_1
String 2: bcdef_2
Dinstance: 3.
The required atomic edits are: delete a, add f, change 1 to 2.

I tried to improve the function. Well, my changes are trivial (because Arjen Lentz is a genius, as you can see by taking a look to OQGRAPH code, and I’m not), but I wish to share them, as Lentz asked in his post. I also wrote a levenshtein_ratio Function, and you can find it below.

My Levenshtein

Here is my “fork”:

DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
    RETURNS TINYINT UNSIGNED
    NO SQL
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp TINYINT UNSIGNED;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);
    
    -- if any param is NULL return NULL
    -- (consistent with builtin functions)
    IF (s1 + s2) IS NULL THEN
        RETURN NULL;
    END IF;
    
    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;
    
    -- if any string is empty,
    -- distance is the length of the other one
    IF (s1 = s2) THEN
        RETURN 0;
    ELSEIF (s1_len = 0) THEN
        RETURN s2_len;
    ELSEIF (s2_len = 0) THEN
        RETURN s1_len;
    END IF;
    
    WHILE (j <= s2_len) DO
        SET cv1 = CONCAT(cv1, CHAR(j)),
        j = j + 1;
    END WHILE;
    
    WHILE (i <= s1_len) DO
        SET c = i,
            cv0 = CHAR(i),
            j = 1;
        
        WHILE (j <= c_temp) THEN
                SET c = c_temp;
            END IF;
            
            SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
            IF (c > c_temp) THEN
                SET c = c_temp;
            END IF;
            
            SET cv0 = CONCAT(cv0, CHAR(c)),
                j = j + 1;
        END WHILE;
        
        SET cv1 = cv0,
            i = i + 1;
    END WHILE;
    
    RETURN c;
END;

What are the differences? Well, excluding the trivial ones:

  • If one param is NULL, my Function returns NULL. This is the behavior of My and Maria’s builtin functions, and is logically correct: if you don’t know one of the values, you don’t know the Levenshtein Distance.
  • Since params have a limit of 255 characters, I turned all numeric variable types and Function return type to TINYINT UNSIGNED. From my tests, this seems to me correct; please, let me know if I’m wrong.
  • I dropped 2 vars: s1_chars and cost. They were there just to add readability, but I think that the code is still readable as before.

Levenshtein Ratio

Usually you don’t care about the Levenshtein Distance as an absolute value. If the first string is ‘ace’ and the second string is ‘see’, it’s very unlikely that one of them is misspelled. But if the first string is ‘no woman no cry’ and the second string is ‘no womn no cr’, the second string is probably misspelled. But in both cases, the Levenshtein Distance is 2. What can help us is the ratio between the length of the longest string and the Levenshtein Distance. In the first case the ratio is 67, while in the second case it’s 13. The first value means that 2/3 of the characters are different, the second value tells us that the difference is little.

That’s why I wrote the levenshtein_ratio() Stored Function:

DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
    RETURNS TINYINT UNSIGNED
    DETERMINISTIC
    NO SQL
    COMMENT 'Levenshtein ratio between strings'
BEGIN
    DECLARE s1_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s1);
    DECLARE s2_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s2);
    RETURN ((levenshtein(s1, s2) / IF(s1_len > s2_len, s1_len, s2_len)) * 100);
END;

Like Arjen Lentz, if you optimize one of my Functions, I hope you’ll lets me know.

Enjoy!

Advertisements

MariaDB/MySQL: Performances of COUNT()

Versione italiana

How fast is COUNT() execution? Well, it depends from the Storage Engine.

Try to create an Aria or MyISAM table, INSERT some data, and execute an EXPLAIN similar to the following:

MariaDB [(none)]> EXPLAIN SELECT COUNT(*) FROM test.t1;
+------+-------------+-------+------+---------------+------+--------+------+------+------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+------+-------------+-------+------+---------------+------+--------+------+------+------------------------------+
|    1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+------+-------------+-------+------+---------------+------+--------+------+------+------------------------------+
1 row in set (0.10 sec)

Optimized away in this case means that there is no need to access the table data, because the number of rows is stored in table’s metadata. So, the results are always retrieved instantly.

For COUNT(col_name), the table is not optimized away – unless the column is NOT NULL and you are not using DISTINCT, but that syntax is equivalent to COUNT(*).

InnoDB cannot optimize away the tables from these queries, because it has a more complex way to handle data. But of course, it can use an index to get the number of values in a column.

The Storage Engines which optimize away the COUNT() operations are Aria, MyISAM, MEMORY, and the ugly thing called MRG_MyISAM. The Storage Engine which have to access the tables are InnoDB, XtraDB, Archive and CSV. (I only made a quick test on MariaDB 5.5 default engines)

Is this a problem for you? Before switching to Aria/MyISAM, consider an alternative: keeping the row count in a very small MEMORY table without changing the original table’s engine.

Enjoy!