The Levenshtein distance is a standard tool in approximate string matching. Despite its popularity, DB2 does not provide a build-in function for computing Levenshtein distances. What seems to be most similar in spirit is DB2’s DIFFERENCE function, which measures the similarity between two strings using the Soundex algorithm.
However, it’s not difficult to create a user-defined function (UDF) that computes Levenshtein distances. One simply can integrate the Java implementation provided by the Apache Commons Lang library into DB2. The method we need is getLevenshteinDistance, located in the class org.apache.commons.lang.StringUtils. Here is a short guide:
- Download the binary package of Apache Commons Lang and extract it. Copy the contained JAR file (commons-lang-2.4.jar) somewhere onto your DB2 server.
- Use DB2’s command-line tool to connect to your database:
db2 CONNECT TO yourdb
- Deploy the JAR file into DB2:
db2 "CALL sqlj.install_jar('file:/path/to/the/jarfile/commons-lang-2.4.jar', 'commons-lang-2_4')"
- Create a corresponding user-defined function:
db2 "CREATE FUNCTION yourschema.levenshtein(s VARCHAR(255), t VARCHAR(255)) RETURNS INTEGER EXTERNAL NAME 'commons-lang-2_4:org.apache.commons.lang.StringUtils.getLevenshteinDistance' LANGUAGE JAVA PARAMETER STYLE JAVA NO SQL NO EXTERNAL ACTION ALLOW PARALLEL"
- Test it:
db2 "VALUES yourschema.levenshtein('test', 'taste')"
This should return the value 2.
db2 DISCONNECT yourdb