SQL Server 2008 – iFTS Transparency – Viewing the words in the index

This is the next in the series of posts about the new iFTS (Integrated Full Text Search) in SQL Server 2008.

One of the huge frustrations with using Full Text prior to SQL Server 2008 was that you had no idea what content the iFilter has extracted from the document you have indexed. This leads to questions like “Why hasn’t this document been returned when this one has?”. We had a situation where a process was run that looked for discrimination, this included looking for the word black. At one time one companies  content kept getting flagged as including discrimination but on looking at the content we couldn’t figure out why.  After doing some digging we realised that the content was HTML, the viewer showed rendered HTML and not the raw HTML, and the HTML contained the font elements specifying the colour Black. We were had a Full Text index that wasn’t type and so the generic text iFilter was used and so Black was included in the index. We changed to use a type column and specified “HTML” which resulted in the HTML iFilter being used and so the markup elements were not returned and stored in the index.

If we had access to the words that had been indexed, it would have been clear what had happened. Well in SQL Server 2008 there are Two dynamic management views that return the keywords of the index, and the keywords of each document in the index. These dmvs are, sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document respectively.

select *
  from sys.dm_fts_index_keywords (db_id(),object_id('Table_1'))

 select *
  from sys.dm_fts_index_keywords_By_Document (db_id(),object_id('Table_1'))

Note: The second dmv whilst is by document does not take a document id as a parameter. This means that this returns all the keywords for all the documents in an index. That will be a lot of rows. As an example if you store the definition from all_sql_modules in master in a table you will have ~ 1780 rows. When index using the default stop lists that will result in ~153000 rows being returned from sys.dm_fts_index_keywords_by_document.

Because of this it is not a dmv to be called from your application. If you want to use the output from his dmv I suggest you have a batch process that persists the output to a table which can be indexed and so usable in an application. This would be an ideal candidate for using the MERGE statement.

Further uses for this information?

If you are interested in looking at other query models for text content then this information provides you a fantastic starting point. Effectively you have text mining functionality in the database. The ranking algorithms are fixed in SQL Server (something that is being looked at for future releases) if you don’t like their ranking algorithm you could build your own.

Unfortunately the position information of keywords in a document is not returned by these dmvs so you can’t build an algorithm that involves position. That also means you can’t use this information to rebuild snippets based on search criteria in a way that live.com or google.com do.

What it is great at doing is giving you analysis features or your data. If you have a set of search terms you can easily use the contents of this index to find how many matches each term makes. This is useful if you want to display counts of items on your web site.

Another option that I thought of in response to a post on my introduction post (SQL Server 2008 – iFTS Introdction) Is the use for partial matches.

Currently it is only possible to do prefix wild card searching (Wildcard-search-in-SQL-Server-Full-Text) i.e. "Mirco*" You can't look for "*soft", the only way of achieving he latter is by indexing a reverse version of the content, which is fine for text content (give the extra space) but not for other content type i.e. word or pdf. Having the contents index available would allow you to do build a query based on the terms that match your criteria.The terms in index will be a fraction of the total index size and so querying using LIKE would not be too bad. i.e. select term from MyIndexTerms where term LIKE '%<partialword>%'. You can then use these terms to build a CONTAINS statement, i..e looking for "compute" CONTAINS (*, 'computedcolname OR computername OR computetime OR ....'). Essentially you are treating the index as a dictionary.

You could also do matching based on sounds like (not SOUNDEX), but thats beyond the scope of this post.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx



-
Published Wednesday, February 20, 2008 11:32 AM by simonsabin

Comments

Tuesday, October 7, 2008 8:50 AM by SQL Server, BizTalk Server, le 64 bits et au-delà !...

# SQL Server 2008 - Integrated Full Text Search (iFTS) ...

Voici une suite de 6 articles tr&#232;s int&#233;ressants sur les nouveaut&#233;s de la recherche en

Wednesday, February 10, 2010 8:58 AM by Simon Sabin UK SQL Consultant's Blog

# Microsoft ditches Unix in the search market

Wednesday, February 10, 2010 9:43 AM by SimonS Blog on SQL Server Stuff

# Microsoft ditches Unix in the search market

Search is one of my interests, where ever you go, whatever business you are in search is the like a holy