Implementing a Soundex function

<< Creating UDFs in Delphi | Database technology articles | Structure of a header page >>

Implementing a Soundex function

By John Midwinter

No, this is not grunting and grinding sound effects for all you Multimedia-heads - though it's a thought for later maybe? This is about making our database lightening fast at finding names that sound the same as the name we have in mind.

How about this scenario:

I have a 300 Mb database holding 300,000 contacts, I can identify that there are 3,439 records for surnames that sound like "SmiTh" in 0.77 seconds and list the 82 different names that sound alike ranging from "SMEETH" to "Synott" in 1.51 seconds; and note that this is not case sensitive. Compare this with a time on the same database of 2.95 secs to find an exact match for "SMITH" on an indexed field of Upper_Surname even though there were fewer matching records (3,161). i.e. The sounding like method is twice as fast as an exact match! That surely cannot be right? But it is! These are real figures on a real database.

So, now I've got your interest huh?

This is all achieved by utilising the Soundex method of indexing, encapsulating it in a DLL, using that as an InterBase® UDF and modifying our InterBase® table to use this key. That may sound hefty, but really is so simple and straightforward you should be able to modify your application to perform this miracle of lookup within just "a few short moments".

back to top of page

How it works

The Soundex method of indexing was developed for use with the U.S. Federal census records since 1880 as a card index system for grouping similar sounding names. The implementation given here goes a step further to make it really compact and faster.

The Soundex method works by taking each word to be indexed (surnames in this example) and reducing it to a short code based on the letters in the word according to a simple set of RULES:

  • Take the first letter.
  • Then, for each of the subsequent letters, score a code for the group it belongs to. There are 6 groups:
      Group	      Letters
      1               BFPV
      2               CGJKQSXZ
      3               DT
      4               L
      5               MN
      6               R
  • Ignore ALL other characters (vowels, h,y,w, spaces and punctuation)
  • If the scoring code is the same as the last one, ignore it, otherwise add it to the code
  • Do this until you get up to 4 characters. If you do not get to 4 characters, then add 0's up to 4 characters.

Thus :

 SMITH becomes S5300
 Smitt becomes S5330
 sMythe becomes S5300

and hey! they all sound the same!

That's how it works. It is that simple. Though it sometimes it gets quite strange matches, it generally gets realistic results.

Thus all we need to do to use this is to have a new column in our database into which we generate the Soundex code for the name and keep it up to date. Then, to find matches we just look for those names that have the same Soundex code. As this new column will be indexed, this search will be fast.

back to top of page

How to implement it

We need to do the following :

  1. Produce a function in a DLL that we can call from InterBase® (and also call from our search routines).
  2. Declare the DLL and its function to InterBase® as a user-defined function (UDF).
  3. Add a column (via a domain) to our master table.
  4. Initialise the data in the master table to create the Soundex index codes.
  5. Put in some code to keep our database up to date.

back to top of page

The function

The function is just a few lines of Delphi®. The FULL source code for the DLL and its function SoundBytes (so called as it returns bytes not characters) is attached. I hope it makes easy reading. Notice that rather than use the Soundex 4 character result I have modified this to generate just 2 bytes (as a SMALLINT) to give more compactness. This halves the size of the index used to look up our values thus providing faster performance. I did this on realising that, though the natural Soundex Code has 4 characters, it only has 26 * 6 * 6 * 6 (=5,616) possible values which can be held by a SMALLINT. I experimented with improving the "accuracy" of the match by increasing the size of the Soundex code to 5 characters (which could still be held within 2 bytes (33,696) as a word). This resulted in fewer matches being returned but, looking at the results, it seemed better to have a wider net for general use. However, it could be worth bearing in mind if your application has a lot of similar sounding words and you need to have finer selection.

For those not familiar with DLL writing, the "tricks" to pick up in the code are:

  • The heading as a library rather than a project.
  • Only USE SysUtils - we do not need anything else. By default, Delphi® will put other stuff in there - knock it out, you do not need it and removing it will reduce the size of the DLL.
  • The declaration of the function with the MAGIC word cdecl at the end. That tells Delphi® which way round to handle its parameters and how to compile it for Windows use (by pretending it's a "C" thingy - it's more complex than that - but this is the declaration we need).
  • The exports clause at the end.
  • The use of pointer type variables for the parameter and return (do not use strings).

Notes in the code describe why some things have been done the way they have. The aim has been for minimal processing so that the function (which can be used in general queries) will operate as quickly as possible. Get it tighter and I'd be delighted to hear - it's always a challenge.

Compiling produces a 27k DLL which is a tiny thing for Windows but works.

back to top of page

Making it available to InterBase®

For guaranteed results, put the SoundBytes.DLL into the BIN directory of InterBase® (normally Program Files\Borland\Interbase\bin) The necessity is that it is visible to InterBase®. The bin directory ensures this but you are not restricted to this location if you need to put it elsewhere.

Then tell InterBase® about it. Using WISQL, connect to your database, then give the command:

 Declare external function SoundBytes Char(40) \What we will call it in InterBase®
 Returns Integer by VALUE \What it returns. 
 Entry_Point "SoundBts" \The name of the function in the DLL
 Module_Name "Soundbytes.DLL" \The name of the DLL

Then commit. That's it.

You now have your own user-defined function that you can call like any other function (e.g. Upper()) in InterBase®.

A word of caution when using UDF's. Be aware that when a database has a UDF declared, then the DLL it calls MUST be visible to the SERVER when it attempts to connect to the database. If the DLL is not visible the database will not connect. This can be painful when setting up on another machine if you forget to bring the DLL with you! Do not underestimate this one. It can be a real Gotcha, you cannot even get into your database in order to remove the reference to it! (You're right, that's personal experience talking!).

back to top of page

Making InterBase® use it

First we need somewhere to hold this index so give your table an extra column.

In my case the table was called Main and had the field [Surname] that I wanted to index. So from WISQL, I first created a domain (OK, you don't need to do this before adding a column, but it's the right way to do it!):

 Create Domain SoundBytesDomain 
 SmallInt 
 Default 0 
 Not Null

Then use this domain to add a column to the table:

 Alter Table Main
 Add SoundBytes SoundBytesDomain

Now generate data into it:

 Update Main
 Set SoundBytes = GetSoundBytes(Surname)

That took about 10 mins on my 300,000 record database.

Now, put an index on this column.

NB. DO NOT put the index on the column before doing the data update. If you do the data update on a table of this size it would take approximately 1 hour, 17 mins and 44.4 secs (guess how I know this!).

 Create Index SoundByteIndex on Main (SoundBytes)

At this point the database is ready to try out!

If you have not committed as you go along now would be a good time to commit and save all this hard (?) work.

Try:

 Select Distinct Surname from Main //use distinct to get the different names
 where SoundBytes = GetSoundBytes("sMitH") //I use mixed case just to show it does not matter!

What you get depends on what you have got on your database. I got 82 records from my 300,000 record database in 1.54 seconds and was mightily impressed.

Now all we need to do is to make our database keep this new field up to date. We want the Server to do this and not bother us with it. This is an obvious task for a trigger.

We need this field to be updated whenever the Surname is changed. So trap on UPDATE:

 Create Trigger SoundBytes_update for Main
 BEFORE UPDATE
 as begin
   New. SoundBytes = GetSoundBytes(New.Surname);
 end;

Note the use of the New.Fieldname syntax. We want the new index value to be set to the new value of the Surname. We could put in code to check to see if the Surname has changed, but the operation is so fast that I let it reset the index regardless of change to the Surname. You might already have a trigger on the UPDATE event (it is probably the most common trigger used) so you could just add code to that trigger.

Depending on how your database operates, you might also want to trap on INSERT for cases where the whole record gets banged in one shot. Note that you can only trap for one event for each trigger, so an INSERT trigger would have to be separate.

That's it. All done. You now have a SoundBytes index maintained in your table. All you need to do is use it.

back to top of page

Using it

Though it's so impressive to use this new function from WISQL that you could probably play with it all day, we need to get down to using it from the real world of our applications. This is easy, you probably do not need me to tell you how to do this, but this is how I did it from Delphi®.

Most commonly this index would be used to produce a list of records that have surnames sounding like our sample. Taking this as my example, the simple pseudo query is:

 select surname from mytable where 
 soundbytes = the soundbyte of my sample.

Three approaches came to mind:

1. Pass the query as text to InterBase® and get the server to do the conversion of the sample and return the result

 'Select ID,Surname,FirstName  from Main where 
 SoundBytes = GetSoundBytes( "'+ SearchPattern + '")'

Now this should have worked exactly the same as if the command had been given from WISQL with the search pattern filled in (as it did above). But I found that I got GPs on this. I guessed that InterBase® (remote) when addressed from a client was having problems with the path of the DLL? Anyway, I gave up on this as there was an easier solution.

2. To overcome the path thing, use a stored procedure that would first call the UDF to get the search pattern as a SMALLINT and then use this in a SELECT command. This worked but I did not like calling a stored procedure to call a UDF to return a SELECT command. It did not "feel right".

3. Use a normal SELECT query, but first get the SoundByte of the search pattern and pass that.

 'Select Surname from Main where 
 SoundBytes = '+ SearchIndex + ')'

This worked best of all (like it worked!) and had the neatness that it could be held by a TQuery and be pre-prepared, thus just needing to change the parameter each time the query was called.

To get the SoundByte index, I called the same DLL from my application. To do this the function must be declared in the interface section of the unit from which you want to call the DLL:

 function SountBts(Name : PChar) : SmallInt ; cdecl; external 'SoundBytes.DLL';

Note the use of cdecl again and the declaration of the DLL file in which it is to be found. In this example Delphi® will need to find the DLL within its default paths. So:

  • a copy should be it in the same directory as your executable
  • or it should be in a common directory that Delphi® and InterBase® can see
  • or you could put the Interbase/Bin directory in your default path
  • or you could give the full path in the declaration - but this is messy - and we don't like hard coded paths in OUR applications do we?

Then it can be used it just like any other function. Thus:

 SoundIndex := SoundBts(SearchPattern);

works to put the Number into SoundIndex (which is a SMALLINT) for the SearchPattern variable which is a PChar.

In my test application, I had a TQuery, attached to a database component which was set to my InterBase® database. I set the query strings:

 Select Surname,Contact_ID
 from Main
 where Soundbytes = :SXCode

By defining my strings thus, I automatically got a parameter of SXCode.

Using the Params editor I set this to a type of SMALLINT.

In the FormCreate event I prepared the query:

 SoundBytesQuery.Prepare 

so that it would go faster when called.

Then to activate the query, I just needed to put in the SoundBytes value into the Parameter and activate the query:

 With SoundByteQuery do
   begin
     Active := False;
     ParamByName('SXCode').AsSmallInt := SoundBts(PChar(SearchString.Text));
     Active := True;
   end;

Note that this calls the external function and that the string passed to it is first TypeCast as a PChar. That's it. All that was left was to play and be impressed!

back to top of page

Conclusion

Well that was all pretty easy wasn't it? Now you can have that snappy response to what to the observer may seem a daunting task and bask in the glory of InterBase® at it's best:

  • Perplexed User: "Let me get this straight, you are searching 300,000 surnames of my database and comparing each with your pattern to see if it sounds the same and then you show all of those?"
  • Happy Developer: "Yup." (well it's near enough, why break the spell?)
  • Imressed User: "In that time?"
  • Happier Developer: "Nothing up my sleeve, I promise."
  • Very impressed User: "It's amazing!"
  • Happiest Developer: "Just something I knocked up! By the way, here is my latest invoice."

Now don't you agree that is how to make a database sound better!

back to top of page

Source code of SoundBytes.DPR to create SoundBytes.DLL

 {This DLL is for use as a UDF for InterBase® but may be called by any 
 application as a normal DLL it is based on the Soundex method of compressing
 names released to the public domain by Natural Systems for free 
 distribution This version compresses the sound of the name into a 2 byte
 word rather than the 4 characters of the original version in order for 
 improved performance. 

 JDM 19 Nov 97}

 library SoundBytes;

 uses
   SysUtils; 

 function SoundBts(Name : PChar) : SmallInt; cdecl;
 Const
 {This table gives the Soundex SCORE for all characters Upper and Lower Case
 hence no need to convert. This is faster than doing an UpCase on the whole input string
 The 5 NON Chars in middle are just given 0} 

 SoundExTable : Array[65..122] Of Byte
 //A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ / ] ^ _ '
 =(0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2,0,0,0,0,0,0,
 //a b c d e f g h i j k l m n o p q r s t u v w x y z
   0,1,2,3,0,1,2,0,0,2,2,4,5,5,0,1,2,6,2,3,0,1,0,2,0,2);

 Var
   i, l, s, SO, x : Byte;
   Multiple : Word;

 begin
   If Name >  ''                           //do nothing if nothing is passedthen begin
     Result := Ord(UpCase(Name[0]));       //initialise to first character
     SO := 0;                              //initialise last char as 0
     Multiple := 26;                       //initialise to 26 char of alphabet
     l := Pred(StrLen(Name));              //get into var to save repeating function
     For i := 1 to l do                    //for each char of input str
     begin
       s := Ord(name[i]);                  //*
       If (s > 64) and (s < 123)           //see notes * below
       then begin
         x := SoundExTable[s];             //get SoundEx value
         If (x > 0)                        //it is a scoring char
         AND (x <> SO)                     //is different from previous char
         then begin
           Result := Result + (x * Multiple); //avoid use of POW as it needs maths unit
           If (Multiple = 26 * 6 * 6)      //we have done enough (NB compiles to a const
   	   then break;  			 //We have done, so leave loop
           Multiple := Multiple * 6;
           SO := x;                        //save for next round
         end;                              // of if a scoring char
       end;                                //of if in range of SoundEx table
     end;                                  //of for loop
   end else result := 0;
 end;                                      //of function SoundBts

 exports                                    
   SoundBts;
 begin
 end.

 {*Notes  I originally used a Try/except block for getting the index code 
 but found that although SoundexTable[45] will error as out of range, 
 SoundExTable[Ord(Name[i])] where OrdName[i] was 45 returned 0, or what 
 ever happened to be at that ILLEGAL address. It did NOT error and fall 
 out of the block! So, I put in checks that letter was in range - I could 
 have kept my try/except by using using SoundExTable[s], but went for this
 way instead}

This paper was written by John Midwinter and is copyright IBPhoenix Inc.

See also:
User-defined function UDF
Stored Procedure
Trigger

back to top of page
<< Creating UDFs in Delphi | Database technology articles | Structure of a header page >>