SQL Server 2008 – iFTS Introduction
This is the
first of a series of posts on the new Integrated Full Text feature in SQL Server
2008. This is one of my favourite features which I’ve had to keep stum
about for quite a while.
Firstly an
introduction to full text.
Full text is
the feature in SQL Server that allows you to search text content (varchar,
char,blobs etc) in a way that performs well ad also do stuff that you can’t do
simply in the relational world with SQL. The simplest example is a description
of a product. If you wanted to search for ice cream containing cherries your
query might look like this.
select
ProductId,
ProductDescription
from dbo.product
p
join dbo.productType
pt on pt.ProductTypeId =
p.ProductTypeId
where pt.productType = 'Ice Cream'
and p.description
like '%cherries%'
This is using
the like operator to find products that contain the word cherries in the product
description. If you obtained the
query plan for this, even though you had an index on productDescription and you
only had 1 product which contained cherries, it is more than likely you would
end up with a table scan of the products table. If you had an index on the
productTypeId column you may end up with a range scan of that index and a
bookmark lookup to get the product description.
The reason
for this is that a normal index on a text based column only helps predicate
based searches i.e. where the column starts with a certain set of characters,
i.e.
and p.description
like 'cherries%’
Note: There
is no % at the beginning only at the end.
You will also
find that you are unable to index a column that is more than 900 bytes. So if
your column is a unicode column you are limited to 450 characters (unicode
requires 2 bytes per character)
So how do you
get optimal querying. This is where full text comes in. Full Text indexes each
word in the column not the whole column. This means you can convert the query
above into
select
ProductId,
ProductDescription
from dbo.product
p
join dbo.productType
pt on pt.ProductTypeId =
p.ProductTypeId
where pt.ProductType = 'Ice Cream'
and contains (p.ProductDescription,'cherries')
This enables
the query engine to obtain the set of products that have cherries and merge the
results with those that are ice cream to produce the results. What this means in
terms of query performance I will discuss later.
So how does
Full Text work.
It consists
of a number of components, these are documented well in BOL and in a set of
white papers. They are,
1.
iFilter – This takes the value in the
column and returns a lump of text containing a set of words, this might sound
odd but when you realise that there are iFilters for pdfs, jpegs, CAD drawings
you can see that irrespective of the content in the file they are able to return
the text from that content to the Full Text engine.
These iFilters are used by all Microsoft Search applications
including sharepoint, index server and desktop search.
2.
Word breaker – This takes the text
returned from the iFIlter and breaks the paragraphs down into individual words
based on the rules for the relevant language, i.e. spaces, commas, etc. Some
languages are more complex than others because they contain compound
words.
3.
Indexer – The output from the word
breaker is then stored in the full text index to provide optimal retrieval, its
not just the words put also the position that is stored, which enables you to
search where two words are near each other.
4.
Parser – Does a number of things. It
removes words that are to be ignored i.e. the, at, in and a. It applies the
thesaurus if one has been configured, applies word breaking and converts the
result into a set of criteria to be used by the full text service to query the
index.
5.
OLEDB interface – In SQL 2005 and before
Full Text is a windows service that runs outside of the sqlserver.exe process.
This means the sql server engine needs a way of querying the data from the
service. This is done using OLEDB, in an execution plan you will see a “Remote
Scan” operator. This is the engine talking to the full text service using
OLEDB.
Prior to SQL
Server 2008 most of the items above where black boxes. If something didn’t work
as expected you had no idea what was wrong. Further more the performance of
having to use OLEDB to go to an external service was not good.
In the rest
of the series I will be going over what the issues are with SQL Server 2005 and
how the improvements in SQL Server 2008 truely make the Full Text index first
class.
The following are the other posts in the series
If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx
-