Understanding Full Text Searching Compared with LIKE

One of the most common questions that comes up on the newsgroups is around searching for terms containing punctuation. The questions are normally something like

"I'm trying to use Full Text to search my product catalog, when someone enters a product code the results I get back are often wrong. Why is that?"

This is normally folllwed by a response

"What format are your product codes?"

With an answer of,

"Something like A-BXDRF/1234"

And that is when the problem is highlighted. the "A-BXDRF/1234" in full text world is more like "A BXDRF 1234" let me explain 

Full text works with terms (words), it has an index of these terms that have been extracted from the data in the column indexed. Full text uses a word breaker to achieve this, these have rules built in to them that define how a block of text is broken down into terms. The word breaker is language specific as they are different rules for breaking up a block of text for different languages. Some languages don't have word breakers and use a neutral word breaker whilst others inherit word breakers from a master language i.e. French Canadian inherits from French. Word breaking applies to both indexing of documents as well your search text.

At a simple level words are broken up at neutral characters like spaces and punctuation. This should explain the breaking up of the product code into the individual terms. In the example the killer aspect is that one of the terms is a single letter. By default full text treats individual letters as noise words. This means they are not stored in the index. To resolve this you can remove all the single letters (and anything else that you might want to search on) from the noise word file and then repopulate your index.

Whilst this will result in a better set of results it may still not be 100% accurate and thats because you will be searching for a broken down set of terms. and not a precise term.

They other word of warning is that in removing words from the noise list will result in your full text index being much larger.

Many people will also reccommend the use of LIKE to solve this problem. Whilst this is a solution, be aware that SQL will have to scan every record to perform the search rather than use an index, if you text column is large i.e. a product description of 4000 characters then this will be a slow process. There is performance trade off between having a large full text index or using LIKE, generally full text will perform better.

If you must have 100% accurate searches then I would suggest holding your product code in a seperate field and performing a search on that field, in that situation a LIKE search may perform adequately.

Which ever option you choose make sure that you thoroughly test your solution and also test for concurrency.



-
Published 02 March 2007 10:15 by simonsabin

Comments

No Comments