SQL Server 2008 – iFTS Introduction

This is the first of a series of posts on the new Integrated Full Text feature in SQL Server 2008. This is one of my favourite features which I’ve had to keep stum about  for quite a while.

Firstly an introduction to full text.

Full text is the feature in SQL Server that allows you to search text content (varchar, char,blobs etc) in a way that performs well ad also do stuff that you can’t do simply in the relational world with SQL. The simplest example is a description of a product. If you wanted to search for ice cream containing cherries your query might look like this.

select ProductId, ProductDescription

from dbo.product      p

join dbo.productType  pt on pt.ProductTypeId = p.ProductTypeId

where pt.productType = 'Ice Cream'

and p.description  like '%cherries%'

This is using the like operator to find products that contain the word cherries in the product description.  If you obtained the query plan for this, even though you had an index on productDescription and you only had 1 product which contained cherries, it is more than likely you would end up with a table scan of the products table. If you had an index on the productTypeId column you may end up with a range scan of that index and a bookmark lookup to get the product description.

The reason for this is that a normal index on a text based column only helps predicate based searches i.e. where the column starts with a certain set of characters, i.e.

and p.description  like 'cherries%’

Note: There is no % at the beginning only at the end.

You will also find that you are unable to index a column that is more than 900 bytes. So if your column is a unicode column you are limited to 450 characters (unicode requires 2 bytes per character)

So how do you get optimal querying. This is where full text comes in. Full Text indexes each word in the column not the whole column. This means you can convert the query above into

select ProductId, ProductDescription

from dbo.product      p

join dbo.productType  pt on pt.ProductTypeId = p.ProductTypeId

where pt.ProductType = 'Ice Cream'

and contains (p.ProductDescription,'cherries')

This enables the query engine to obtain the set of products that have cherries and merge the results with those that are ice cream to produce the results. What this means in terms of query performance I will discuss later.

So how does Full Text work.

It consists of a number of components, these are documented well in BOL and in a set of white papers. They are,

1.       iFilter – This takes the value in the column and returns a lump of text containing a set of words, this might sound odd but when you realise that there are iFilters for pdfs, jpegs, CAD drawings you can see that irrespective of the content in the file they are able to return the text from that content to the Full Text engine.

These iFilters are used by all Microsoft Search applications including sharepoint, index server and desktop search.

2.       Word breaker – This takes the text returned from the iFIlter and breaks the paragraphs down into individual words based on the rules for the relevant language, i.e. spaces, commas, etc. Some languages are more complex than others because they contain compound words.

3.       Indexer – The output from the word breaker is then stored in the full text index to provide optimal retrieval, its not just the words put also the position that is stored, which enables you to search where two words are near each other.

4.       Parser – Does a number of things. It removes words that are to be ignored i.e. the, at, in and a. It applies the thesaurus if one has been configured, applies word breaking and converts the result into a set of criteria to be used by the full text service to query the index.

5.       OLEDB interface – In SQL 2005 and before Full Text is a windows service that runs outside of the sqlserver.exe process. This means the sql server engine needs a way of querying the data from the service. This is done using OLEDB, in an execution plan you will see a “Remote Scan” operator. This is the engine talking to the full text service using OLEDB.

Prior to SQL Server 2008 most of the items above where black boxes. If something didn’t work as expected you had no idea what was wrong. Further more the performance of having to use OLEDB to go to an external service was not good.

In the rest of the series I will be going over what the issues are with SQL Server 2005 and how the improvements in SQL Server 2008 truely make the Full Text index first class.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx



-
Published 19 February 2008 15:06 by simonsabin

Comments

20 February 2008 10:10 by Fredrik Svard

# re: SQL Server 2008 – iFTS Introdction

Hi,

Do you know if Sql 2008 Full text support "contains (p.ProductDescription,'erries')" findingfor example  'cherries'.

So when you search for parts of a word on not only from the beginning?

Thanks

Fredrik

20 February 2008 11:13 by simonsabin

# re: SQL Server 2008 – iFTS Introdction

Suffix searching i.e. "%erries" is still not supported. only prefix searching is "cherr%".

You can get round this by having two columns one that is the revers of the other and then do a search for

contains(reversecolumn,'"seirre*"')

If you want to search for parts of the word you are stuck. I have a post coming soon on some other features that you may be able to leverage though

25 February 2008 11:07 by Alun

# re: SQL Server 2008 – iFTS Introduction

An excellent introduction and a good lead into the 2008 features.

28 February 2008 10:43 by Stan

# re: SQL Server 2008 – iFTS Introduction

I agree with Alun, great introduction to the new features. I'm looking forward to get started with SQL Server 2008!

# SQL Server 2008 ??? iFTS Introduction - SimonS Blog on SQL Server Stuff « Catnippet

Pingback from  SQL Server 2008 ??? iFTS Introduction - SimonS Blog on SQL Server Stuff « Catnippet

# Best Practices for Integrated Full Text Search (iFTS) in SQL 2008

This blog is a result of lessons learned while working with the new Integrated Full Text Search in SQL

# Best Practices for Integrated Full Text Search (iFTS) in SQL 2008 | MS Tech News

Pingback from  Best Practices for Integrated Full Text Search (iFTS) in SQL 2008 | MS Tech News

10 November 2008 08:21 by Marco Scheel aka GeekDotNet

# Microsoft SQL Server 2008 – Interated Fulltext Service ist cool

In vielen unserer Projekte ist ein SQL Server Teil der Lösung. In einigen Projekten muss über die Daten in den Tabellen gesucht werden. Für kleinere Projekte mit kleinen Datenbeständen kommt man meist mit dem T-SQL Bordmittel LIKE aus. Ein Projekt hat

08 June 2009 01:32 by SQL Server Blog by Jason Massie

# SQL Server Blog by Jason Massie

Pingback from  SQL Server Blog by Jason Massie