levenshtein and levenshtein_ratio Functions for MySQL

I found a Levenshtein Distance function, implemented as SQL Stored Function for MySQL and MariaDB, written by Arjen Lentz. That post also contains a short but interesting discussion about the advantages and disadvantages of Stored Functions over UDFs.

For those who don’t know, the Levenshtein Distance (or Edit Distance) reads two strings and returns the minimum number of atomic edits that must be done to the first string to make it equal to the second string (or vice versa, as you like). An atomic edit can be an addition, a deletion or a substitution of a single character.

Example:

String 1: abcde_1
String 2: bcdef_2
Dinstance: 3.
The required atomic edits are: delete a, add f, change 1 to 2.

I tried to improve the function. Well, my changes are trivial (because Arjen Lentz is a genius, as you can see by taking a look to OQGRAPH code, and I’m not), but I wish to share them, as Lentz asked in his post. I also wrote a levenshtein_ratio Function, and you can find it below.

My Levenshtein

Here is my “fork”:

DROP FUNCTION IF EXISTS `levenshtein`;
CREATE FUNCTION `levenshtein`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
    RETURNS TINYINT UNSIGNED
    NO SQL
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp TINYINT UNSIGNED;
    -- max strlen=255 for this function
    DECLARE cv0, cv1 VARBINARY(256);
    
    -- if any param is NULL return NULL
    -- (consistent with builtin functions)
    IF (s1 + s2) IS NULL THEN
        RETURN NULL;
    END IF;
    
    SET s1_len = CHAR_LENGTH(s1),
        s2_len = CHAR_LENGTH(s2),
        cv1 = 0x00,
        j = 1,
        i = 1,
        c = 0;
    
    -- if any string is empty,
    -- distance is the length of the other one
    IF (s1 = s2) THEN
        RETURN 0;
    ELSEIF (s1_len = 0) THEN
        RETURN s2_len;
    ELSEIF (s2_len = 0) THEN
        RETURN s1_len;
    END IF;
    
    WHILE (j <= s2_len) DO
        SET cv1 = CONCAT(cv1, CHAR(j)),
        j = j + 1;
    END WHILE;
    
    WHILE (i <= s1_len) DO
        SET c = i,
            cv0 = CHAR(i),
            j = 1;
        
        WHILE (j  c_temp) THEN
                SET c = c_temp;
            END IF;
            
            SET c_temp = ORD(SUBSTRING(cv1, j+1, 1)) + 1;
            IF (c > c_temp) THEN
                SET c = c_temp;
            END IF;
            
            SET cv0 = CONCAT(cv0, CHAR(c)),
                j = j + 1;
        END WHILE;
        
        SET cv1 = cv0,
            i = i + 1;
    END WHILE;
    
    RETURN c;
END;

What are the differences? Well, excluding the trivial ones:

  • If one param is NULL, my Function returns NULL. This is the behavior of My and Maria’s builtin functions, and is logically correct: if you don’t know one of the values, you don’t know the Levenshtein Distance.
  • Since params have a limit of 255 characters, I turned all numeric variable types and Function return type to TINYINT UNSIGNED. From my tests, this seems to me correct; please, let me know if I’m wrong.
  • I dropped 2 vars: s1_chars and cost. They were there just to add readability, but I think that the code is still readable as before.

Levenshtein Ratio

Usually you don’t care about the Levenshtein Distance as an absolute value. If the first string is ‘ace’ and the second string is ‘see’, it’s very unlikely that one of them is misspelled. But if the first string is ‘no woman no cry’ and the second string is ‘no womn no cr’, the second string is probably misspelled. But in both cases, the Levenshtein Distance is 2. What can help us is the ratio between the length of the longest string and the Levenshtein Distance. In the first case the ratio is 67, while in the second case it’s 13. The first value means that 2/3 of the characters are different, the second value tells us that the difference is little.

That’s why I wrote the levenshtein_ratio() Stored Function:

DROP FUNCTION IF EXISTS `levenshtein_ratio`;
CREATE FUNCTION `levenshtein_ratio`(`s1` VARCHAR(255) CHARACTER SET utf8, `s2` VARCHAR(255) CHARACTER SET utf8)
    RETURNS TINYINT UNSIGNED
    DETERMINISTIC
    NO SQL
    COMMENT 'Levenshtein ratio between strings'
BEGIN
    DECLARE s1_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s1);
    DECLARE s2_len TINYINT UNSIGNED DEFAULT CHAR_LENGTH(s2);
    RETURN ((levenshtein(s1, s2) / IF(s1_len > s2_len, s1_len, s2_len)) * 100);
END;

Like Arjen Lentz, if you optimize one of my Functions, I hope you’ll lets me know.

Enjoy!

MariaDB/MySQL: Performances of COUNT()

Versione italiana

How fast is COUNT() execution? Well, it depends from the Storage Engine.

Try to create an Aria or MyISAM table, INSERT some data, and execute an EXPLAIN similar to the following:

MariaDB [(none)]> EXPLAIN SELECT COUNT(*) FROM test.t1;
+------+-------------+-------+------+---------------+------+--------+------+------+------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+------+-------------+-------+------+---------------+------+--------+------+------+------------------------------+
|    1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+------+-------------+-------+------+---------------+------+--------+------+------+------------------------------+
1 row in set (0.10 sec)

Optimized away in this case means that there is no need to access the table data, because the number of rows is stored in table’s metadata. So, the results are always retrieved instantly.

For COUNT(col_name), the table is not optimized away – unless the column is NOT NULL and you are not using DISTINCT, but that syntax is equivalent to COUNT(*).

InnoDB cannot optimize away the tables from these queries, because it has a more complex way to handle data. But of course, it can use an index to get the number of values in a column.

The Storage Engines which optimize away the COUNT() operations are Aria, MyISAM, MEMORY, and the ugly thing called MRG_MyISAM. The Storage Engine which have to access the tables are InnoDB, XtraDB, Archive and CSV. (I only made a quick test on MariaDB 5.5 default engines)

Is this a problem for you? Before switching to Aria/MyISAM, consider an alternative: keeping the row count in a very small MEMORY table without changing the original table’s engine.

Enjoy!

MariaDB/MySQL: GET_LOCK, RELEASE_LOCK, etc

InnoDB and some 3rd parties Storage Engines support transactions. But there are many places where concurrency can cause conflicts:

  • MEMORY, Aria and MyISAM only support table-level locks, which prevent all write statements (and maybe even reads) on a whole table. When concurrency is too high, this is a problem.
  • You may want to lock a VIEW, which is very different from locking a table: a VIEW can be a subset of a table, or a JOIN between table subsets.
  • DDL statements are not affected by locks. This is even true on tables which support transactions.

An alternative is using some SQL functions which acquire, check and release global named “locks”. The reason why I quoted “locks” is that they don’t lock anything. All sessions are expected to acquire and release locks with a proper name. So, applications and routines must be written by someone who knows the locking logic used for that database.

In other words: there is no way to really lock some rows, or a table, or a view using this set of functions; but if applications and Stored Routines acquire a logical lock having the same name before accessing the same table, no conflict is possible.

Just some notes, before discussing the set of functions that handle logical locks:

  • Lock names are case-insensitive. So 'A' is equal to 'a'.
  • Sometimes you don’t want to check the result of GET_LOCK() or RELEASE_LOCK(). In that case, you can use them with DO (see examples below).

GET_LOCK(name, timeout)
Acquires a lock with the given name. If the lock is already taken by someone else, waits until the other session releases the lock. The wait will not be eternal: with the timeout parameter, you specify how many seconds you want to wait, as a maximum. If you formerly acquired a lock with another name, this function automatically releases it. Currently, you can’t hold more than one lock.
When the current connection is closed (because you close it, because it timeouts, or because the administrator KILLs it), the lock will be free.
GET_LOCK() should return 1. If it timeouts, returns 0. If another error occurs, returns NULL.

RELEASE_LOCK(name)
Releases a lock with the given name. This function should be called when you don’t need a lock anymore but you are not about to close the connection.
Despite the fact that you can’t currently hold more than one lock, you must provide a lock name. If you want to acquire another lock, you don’t need to call RELEASE_LOCK().
RELEASE_LOCK() should return 1. If it returns 0, that lock has been acquired by someone else, and can’t be release by you. If the specified lock does not currently exist, returns NULL. In both cases, your code is likely to be buggy.

IS_FREE_LOCK(name)
Returns 1 is specified lock is free and can be acquired, 0 if it’s in use, NULL if an error occurs. Normally you don’t need to call this function because, as stated above, GET_LOCK() waits until specified lock is free.

IS_USED_LOCK(name)
This function is useful if you need to know which thread is holding a certain lock. It returns the connection id of that thread, or NULL if the specified lock is free.
If you need to know your connection id, in SQL you can use CONNECTION_ID(). In other languages, check your API documentation to find the API function which efficiently does this.

Example

Let’s try two example. The following examples are tested on MariaDB 5.5, but should work with the same versions of Percona Server and Oracle MySQL.

Ex. 1: Using GET_LOCK and RELEASE_LOCK

This example shows how you can simulate a lock on 1 row using its id. The following stored function copies a row from table a to table b, and then increments a value in the original row. But if we don’t handle concurrency, this could happen:

  • thread 1 copies a row into b
  • thread 2 copies the same row into b, generating a duplicate
  • thread 1 increments the row in a
  • thread 2 doesn’t copy the new value of that row, and increments it again.

To avoid this conflict, we’ll use GET_LOCK(). At the end of the Routine, we’ll use RELEASE_LOCK().

CREATE PROCEDURE `my_db`.`copy_row`(IN `row_id` SMALLINT UNSIGNED)
    MODIFIES SQL DATA
BEGIN
    -- name of the lock we'll acquire: it's a row-level lock
    DECLARE `lock_name` TEXT DEFAULT CONCAT('row_id_', `row_id`);
    
    -- try to acquire lock, or return an error
    IF (GET_LOCK(`lock_name`, 10)  1) THEN
        SIGNAL SQLSTATE VALUE '45000' SET
            MESSAGE_TEXT  = '[my_db.copy_row] Cant acquire lock';
    END IF;
    
    -- copy the row
    INSERT INTO `b` (qty)
        SELECT `qty`
            FROM `a`
            WHERE `id` = `row_id`;
    -- increment
    UPDATE `a`
        SET `qty` = `qty` + 1
        WHERE `id` = `row_id`;
    
    -- free lock
    DO RELEASE_LOCK(`lock_name`);
END;

Some notes about what we do here.
First, we create a lock_name dynamically, because we want to simulate a row-level lock. If two threads try to copy the same row (same id), they will try to acquire the same lock, so one of them will be queued. But if the id’s are different, the name of the locks will be different, and the threads can run this procedure at the same time.
GET_LOCK() is invoked within an IF condition, because we want to check the result. If it’s not 1, the lock is not acquired, so we throw an error.
At the end, we call RELEASE_LOCK() to avoid blocking the row when it’s no more necessary. We do it with a DO statement, because we are not checking the result of RELEASE_LOCK().

Ex. 2: Using IS_USED_LOCK

In the following example, we want to do something. Well, we won’t do anything, in fact we’ll just sleep for 5 seconds… but hey, when one sleeps, someone must be awake, in case the enimy arrives!

CREATE PROCEDURE `my_db`.`do_something`()
    MODIFIES SQL DATA
`do_some`:
BEGIN
    -- result of IS_USED_LOCK()
    DECLARE `locker` TINYINT UNSIGNED DEFAULT IS_USED_LOCK('my_lock');
    
    IF `locker` IS NOT NULL THEN
        SELECT CONCAT('Sorry, my_lock is used by thread ', `locker`);
        LEAVE `do_some`;
    END IF;
    
    DO GET_LOCK('my_lock', 1);
    
    SELECT 'I\'m doing something for 5 seconds...';
    DO SLEEP(5);
    
    DO RELEASE_LOCK('my_lock');
END;

This time, we store the result of IS_USED_LOCK(), so we know the id of the thread which taken the lock. If it’s NOT NULL, we report it to the user and exit the Procedure. Otherwise, we go ahead and invoke GET_LOCK().

It’s easy, isn’t it?

STK/Unit 1.0 Release Candidate 3 released

2013-04-05

STK/Unit 1.0 Release Candidate 3 is out!

STK/Unit 1.0 Release Candidate 3 should be a stable version. STK/Unit main author uses the software since 2010, but heavily recoded when he decided to make it public. All old tests he wrote still work.

Since the first public release, rc1, we only received come private feedback. That has been valuable, because helped us to identify bugs and flaws. We really wish to receive some public feedback before declaring STK/Unit production-ready.

This RC was not planned. However, we were able to expand the list of supported MySQL and MariaDB versions with a very low delta from the previous release. This should help us to get some user feedbacks, and we believe that we are not introducing any bugs.

Now, supported platforms are:

  • MariaDB 10.0, 5.5
  • MariaDB 5.1, 5.2, 5.3 (with some documented issues: a few DDL-specific assert functions could generate obscure exceptions)
  • MySQL 5.6, 5.5
  • MySQL 5.1 (with same problems MariaDB 5.1 has)

Here is a list of changes in 1.0 Release Candidate 3:

  • Fix bug #1162515 (test_stk_unit aborts on MySQL 5.5)
  • Modifying SQL files for installation on Oracle MySQL is no longer necessary
  • Commented BTs in test_stk_unit_assertions contained errors
  • Minor code cleaning

Documentation and Downloads for STK/Unit and others STK tools to come, are avaible from here:
http://stk.wikidot.com/

The public Mailing List can be found here:
https://launchpad.net/~stk-discuss

The STK team encourage you to try STK/Unit in your databases, report any bugs you may find, ask for help in the list when needed, and let us know any comments. Your feedback is valuable for us!

The STK Team

Quoting MySQL & MariaDB identifiers

Versione italiana

MySQL/MariaDB identifiers are names for databases, tables, columns, etc. They can be quoted with `backticks` (AKA backquotes), and in that case they can contain characters which are normally illegal for identifiers (even spaces or the backtick itself), or they can be reserved words. Both quoting and not quoting cause some problems.

If you don’t quote names you will need to avoid illegal chars and reserved words – which is a good practice, anyway. But when you upgrade MariaDB, the new version could add some reserved words.

If you quote names, you should be sure to do it everywhere. It is a good practice, but if a developer doesn’t use backticks (or forgets to do it), he may see strange errors.

If you use keywords as identifiers and don’t quote them, you will probably receive an error 1064 (SQLSTATE 42000):

mysql> SELECT * FROM insert;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'insert' at line 1

The bad identifier is usually the first word in the reported string (see example above), but can also be the last word before the beginning of the reported code part:

mysql> SELECT high_priority FROM t;
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual thatcorresponds to your MySQL server version for the right syntax to use near 'FROM t' at line 1

In theory, a query involving unquoted identifiers which are reserved words, could even do something different from what you think and report no errors.

Another caveat: even when you quote all identifiers, avoid using the same name for a column and for a local variable in a Stored Routine. While MariaDB understands clauses like: WHERE `id` = `id`, this can cause a lot of confusion.

SQL tools

Here are 2 Stored Routines that can probably be useful. For both the routines, I’ll include a Base Test for STK/Unit. The Stored Routines will go into a DB called lib, so the BTs will go into a Test Case called test_lib.

The first Function quotes an identifier for inclusion in a SQL string (which can be PREPAREd and EXECUTEd). It’s similar to the QUOTE() built-in function, that works for strings.

CREATE FUNCTION `lib`.`quote_name`(`id` TEXT)
    RETURNS TEXT
    DETERMINISTIC
    NO SQL
    LANGUAGE SQL
    COMMENT 'Return a quoted identifier (if NULL, id is empty)'
BEGIN
    IF `id` IS NULL THEN
        RETURN '``';
    ELSE
        RETURN CONCAT('`', REPLACE(`id`, '`', '``'), '`');
    END IF;
END;
 
CREATE PROCEDURE `test_lib`.`test_quote_name`()
    LANGUAGE SQL
    COMMENT 'Test quote_name()'
BEGIN
    CALL `stk_unit`.`assert_equals`(`lib`.`quote_name`('x'), '`x`', 'Incorrect quoting');
    CALL `stk_unit`.`assert_equals`(`lib`.`quote_name`('x`y'), '`x``y`', 'Incorrect escape');
 
    CALL `stk_unit`.`assert_equals`(`lib`.`quote_name`(''), '``', 'Empty name expected');
    CALL `stk_unit`.`assert_equals`(`lib`.`quote_name`(NULL), '``', 'For NULL value, empty name should be returned');
END;

If you need to verify a single name on the fly, the easiest way is probably:
SELECT 0 AS [name];
The following Procedure, however, is meant to be used inside Stored Programs. It can be used to know if a string is a valid name. It executes the above query (as a Prepared Statement, because it must be composed dynamically) and checks wether an error is generated.

CREATE PROCEDURE `lib`.`is_valid_name`(IN `id` TEXT, OUT `ret` BOOL)
    NO SQL
    LANGUAGE SQL
    COMMENT 'set `ret` to TRUE if id is valid name, else FALSE'
BEGIN
    -- error in query
    DECLARE EXIT HANDLER
        FOR 1064
    BEGIN
        SET `ret` = FALSE;
    END;
 
    SET @sql_query = CONCAT('DO (SELECT 0 AS ', `id`, ');');
    PREPARE stmt FROM @sql_query;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
 
    SET `ret` = TRUE;
END;
 
CREATE PROCEDURE `test_lib`.`test_is_valid_name`()
	LANGUAGE SQL
	COMMENT 'Test quote_name()'
BEGIN
    CALL `lib`.`is_valid_name`('valid_name', @r);
    CALL `stk_unit`.`assert_true`(@r, 'Specified name is valid');
 
    CALL`lib`.`is_valid_name`('SELECT', @r);
    CALL `stk_unit`.`assert_false`(@r, 'SELECT is a keyword');
    CALL `lib`.`is_valid_name`('a a', @r);
    CALL `stk_unit`.`assert_false`(@r, 'Valid names cant contain spaces');
    CALL `lib`.`is_valid_name`('', @r);
    CALL `stk_unit`.`assert_false`(@r, 'Empty name is not valid');
END;

STK/Unit 1.0 Release Candidate 1 released

Annuncio italiano

STK/Unit 1.0 Release Candidate 1 is out!

STK stands for SQL ToolKit. It’s a family of proects for MariaDB, MySQL and Percona Server. STK/Unit is the first STK project that has been publicly release; more tools will come in the next future. The long-term purpose of STK is making SQL programming much easier and reliable on MariaDB and her sisters.

STK/Unit is a Unit Test framework for MariaDB, entirely written in SQL and inspired by SimpleTest and JUnit. Test Cases and Test Suites written by the user can set a test environment and check that all operations work as expected. The results can be retrieved as a human-readable string, in HTML format, or examined in the tables they are stored in. Both developers and database administrators can benefit from such tests.

Errors in applications can be originated by errors in databases. STK/Unit is designed to mainly test active structures: Stored Routines, Triggers, integrity constraints and Views. But also Tables must use the correct datatypes, column sizes and character sets to be able to contain data from the Real World. And DBMS updates, new plugins or even configuration changes can break the complex, delicate logics of a relational database. But a good set of tests can show any problem as soon as it raises!

STK/Unit is still under development and is expanding the list of supported platform; currently, the folloing are supported:
* MariaDB 5.5 and 10.0 – work good
* MariaDB 5.3, 5.2, 5.1 – with documented minor problems
* MySQL 5.1 – using MyISAM place of Aria, with minor problems (undocumented)

Documentation and Downloads for STK/Unit and others STK tools to come, are avaible from here:
http://stk.wikidot.com/

The public Mailing List can be found here:
https://launchpad.net/~stk-discuss

The STK team encourage you to try STK/Unit in your databases, report any bugs you may find, ask for help in the list when needed, and let us know any comments. Your feedback is valuable for us!

The STK Team

XML and HTML entities

Versione italiana

As everyone should know, some characters should be replaced with entities, if you want to use them in a text, within an XML or (X)HTML document. For example, since tags are usually wrapped by < and > characters , you can’t write and expect that it is simply considered as a text and printed in the browser’s window.

Usually, entities are used to escape meaningful chars. XML has five predefined entities, which are also avaible in HTML:

& &amp;
< &lt;
> &gt;
” &quot;
‘ &apos;

To be sure that the result is what you want, & characters must be replaced before others.

These are the only character that can confuse the client and invalidate a document. Within a parameter, you can even leave < and > unchanged. In other contexts, ” and ‘ can appear.

You can avoid to replace all other “weird” characters, if you use the UTF-8 character set.

In XML documents, you can also avoid using the predefined entities, as long as all matching characters only appear in a CDATA section. The syntax for CDATA sections is:

<![CDATA[ blah blah blah ]]>

CDATA sections can not contain the ]]> sequence of characters; there is no workaround for this limitation.

Enjoy!