February 2008 - Posts

Can anyone that saw the launch event in LA explain what on earth that rant/speech was at the start before Steve took over.

We sat here in the UK just wondering if the AV guy had got the wrong feed.

MS marketing have some odd thoughts sometimes.



-
Posted by simonsabin | 2 comment(s)

There have been some posts on cleaning text with TSQL, Remove all non-numeric characters from a string and extract-only-numbers-from-a-string

Here is my take on it. The nice thing is that you just specify the valid characters in the like pattern to select those that you want.return.

declare @string varchar(200)

set @string = 'this $%^^&is%^& s2342om23&&({}e c76l232e+_+a#n/ c][#o''d#e'

 

select cast(cast((select substring(@string,n,1)

from num

where n <= len(@string)

and substring(@string,n,1) like '[a-z ]' for xml path('')) as xml)as varchar(max))

 



-
Posted by simonsabin | 4 comment(s)
Filed under:

In the user group meeting today I mentioned an AddIn for Management Studio that did reflector on the assemblies in the database. I was either confused as I don't seem to be able to find the management studio add in.

What I did find was a plugin for Reflector that can read assemblies from a SQL Server database. This can be found here http://www.denisbauer.com/NETTools/SQL2005Browser.aspx

This is probably the preferred option anyway as add-ins for management studio aren't supported.



-
Posted by simonsabin | with no comments
Filed under: ,

Thanks for all those that turned up tonight it was a good turnout, if only for the beer and pizza (which was very good for those who didn't come).

We will be following up on the questions that everyone raised over the next few days. If you thought of anything else feel free to contact me.

As I mentioned during the meeting We are looking for people to help out at user group meetings and to speak. Many of you I spoke to said you had something that you would like to present. Please drop me a line and we can arrange something for the forthcomin meetings. The next one will be in TVP and the one after in London again. Your session only has to be 5 or 10 minutes (although I bet once you get started you won't be able to stop) so don't even worry about a slide deck just bring some demos.

Once again thanks for turning up and I hope to see most of you at sqlbits next weekend.



-
Posted by simonsabin | with no comments
Filed under: ,
The next post in the series on iFTS (Integrated Full Text Search) covers one of the major improvements in SQL Server 2008 for Full Text and is related to the Integrated part or iFTS.

SQL Server 2005 introduced a huge performance improvement by moving the location of the docid map.

So what is a docid, you may ask?

Well we all know that to get best performance with indexes our keys should be as small as possible. When you create a Full Text index you specify a unique index to be used as the key for the row. However this might be anything, a uniqueidentifier, a large varchar, anything, so to ensure optimal performance uses a surrogate key to identify a row in a table. This is the docid (Document Id).

Note: In full text terms a document is a row in a table.

There is obviously a need to translate between the docid and the key of the table. Prior to SQL 2005 that was done in the full text engine, this meant two things. Large keys had to be passed through OLEDB to SQL Server and also unnecessary mapping may have undertaken in the Full Text service.

In SQL 2005 this mapping was moved into the database which meant the transfer over OLEDB was quicker and the engine could do more optimisations in terms of doing the mapping.

So what’s new in SQL 2008?

Firstly the index is stored in the engine and so there is no OLEDB stuff going on, but more importantly if you have an integer key for your Full Text index, Full Text will use that key as the docid and so you won’t have to map between two values.

This has huge benefit when you have full text and normal relational filters because the optimiser can push the relational filter into the querying of the full text index and thus read less data and so perform better.

The following query plans highlight the docid mapping. The first is using a uniqueidentifier as a key,

And this shows the query plan with an integer key,

The subtree cost of the first is 0.01 (which isn't high but I only have a 174 rows in this table. The subtree cost of the second is 0.000459, thats a 20 fold decrease. Just for using an integer key. All is not lost, if you have a non-integer key you still benefit from the performance changes and the ability to use all the relational join operators in the engine, its just you will have a few extra steps to perform and so performance will not be as good as with an integer key.

What about upgrade?

If you are upgrading from SQL 2005 you have a number of options. The first is to maintain the index as is (but held within SQL Server). This has the benefit of reduced downtime, but you still have a docidmap.

The second option is to rebuild the index. This will rebuild the index using the new structure AND remove the docidmap. So will give you the best performance but requires a full rebuild of the index (which should be quicker in 2008 as well).

The third option takes the index offline and can’t be used. This allows you to delay the rebuild until a time that is convenient for you and your server. (If you upgrade 10 databases it is probably not wie to have them all rebuilding their full text indexes at the same time).

What about new databases ?

One of the other aspects is that because the full text is in the database you no longer have to enable the database for full text. You simply need to create your full text catalog and away you go.

Be conscious of the docid issue when creating new tables. You will get best performance if the unique index you use for the full text index has an integer key.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx



-

The next post in the series on iFTS (Integrated Full Text Search) covers a thesaurus files. A feature introduced in SQL 2005.

The loading of thesaurus files is a bit of an uncontrolled beast in SQL Server 2005 (there weren’t supported or documented prior to SQL 2005). The only way you could force a thesaurus files to be reloaded and used was to restart the Full Text service. (I believe you may have had to restart the SQL Service as well but that isn’t documented). This is obviously a slight pain.

In SQL Server 2008 thesaurus files are still files on the file system residing in the FTDATA folder, but the difference is that you can reload them when you want using the sys.sp_fulltext_load_thesaurus_file procedure and specifying which language to load i.e.

EXEC sys.sp_fulltext_load_thesaurus_file 1033;

The books online has(will have by RTM) much better FTS documentation that in SQL Server 2005 around thesaurus files.

One area to be wary of a term can only be included in one pat of a replacement or sub of an expansion. If you have a term repeated it appears that expansions take precedence over replacements and the order in the file the takes precedence. If you find that words are not being expanded or replaced as expected check whether you have included the term more than once. There are some trace flags that support can use to help you if you run into difficulties.

If you are running a search function on a website, thesaurus files are a great way of allowing or mis spelling of terms. i.e. Mircosoft, using expansion elements will allow the user to spell something wrong and the person writing the content to have got it wrong. This would be the case if users could post their own content. If you know your content is rock solid you can just use replacement to auto correct spelling mistakes into the correct spelling, i.e.

        <replacement>
            <pat>mircosoft</pat>
            <pat>microsft</pat>
            <pat>micrsoft</pat>
            <sub>microsoft</sub>
        </replacement>

If you are not capturing what people are searching on and the results they get back then you should be, the information whilst often overwhelming can be like gold dust. If you’re running a music website how many people know how to spell Anastacia. If you reviewed misspelt searches you could return results based on a corrected spelling or like Live does append the results to those of the badly spelt word as often a misspelt word can still be a valid word even in the context of the search. In the case of Anastacia if I entered Anastasia this is still valid as there are film scores for this however based on popularity you could assume people wanted Anastacia and so included those results by use of a thesaurus.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx



-

This is the next in the series of posts about the new iFTS (Integrated Full Text Search) in SQL Server 2008.

One of the huge frustrations with using Full Text prior to SQL Server 2008 was that you had no idea what content the iFilter has extracted from the document you have indexed. This leads to questions like “Why hasn’t this document been returned when this one has?”. We had a situation where a process was run that looked for discrimination, this included looking for the word black. At one time one companies  content kept getting flagged as including discrimination but on looking at the content we couldn’t figure out why.  After doing some digging we realised that the content was HTML, the viewer showed rendered HTML and not the raw HTML, and the HTML contained the font elements specifying the colour Black. We were had a Full Text index that wasn’t type and so the generic text iFilter was used and so Black was included in the index. We changed to use a type column and specified “HTML” which resulted in the HTML iFilter being used and so the markup elements were not returned and stored in the index.

If we had access to the words that had been indexed, it would have been clear what had happened. Well in SQL Server 2008 there are Two dynamic management views that return the keywords of the index, and the keywords of each document in the index. These dmvs are, sys.dm_fts_index_keywords and sys.dm_fts_index_keywords_by_document respectively.

select *
  from sys.dm_fts_index_keywords (db_id(),object_id('Table_1'))

 select *
  from sys.dm_fts_index_keywords_By_Document (db_id(),object_id('Table_1'))

Note: The second dmv whilst is by document does not take a document id as a parameter. This means that this returns all the keywords for all the documents in an index. That will be a lot of rows. As an example if you store the definition from all_sql_modules in master in a table you will have ~ 1780 rows. When index using the default stop lists that will result in ~153000 rows being returned from sys.dm_fts_index_keywords_by_document.

Because of this it is not a dmv to be called from your application. If you want to use the output from his dmv I suggest you have a batch process that persists the output to a table which can be indexed and so usable in an application. This would be an ideal candidate for using the MERGE statement.

Further uses for this information?

If you are interested in looking at other query models for text content then this information provides you a fantastic starting point. Effectively you have text mining functionality in the database. The ranking algorithms are fixed in SQL Server (something that is being looked at for future releases) if you don’t like their ranking algorithm you could build your own.

Unfortunately the position information of keywords in a document is not returned by these dmvs so you can’t build an algorithm that involves position. That also means you can’t use this information to rebuild snippets based on search criteria in a way that live.com or google.com do.

What it is great at doing is giving you analysis features or your data. If you have a set of search terms you can easily use the contents of this index to find how many matches each term makes. This is useful if you want to display counts of items on your web site.

Another option that I thought of in response to a post on my introduction post (SQL Server 2008 – iFTS Introdction) Is the use for partial matches.

Currently it is only possible to do prefix wild card searching (Wildcard-search-in-SQL-Server-Full-Text) i.e. "Mirco*" You can't look for "*soft", the only way of achieving he latter is by indexing a reverse version of the content, which is fine for text content (give the extra space) but not for other content type i.e. word or pdf. Having the contents index available would allow you to do build a query based on the terms that match your criteria.The terms in index will be a fraction of the total index size and so querying using LIKE would not be too bad. i.e. select term from MyIndexTerms where term LIKE '%<partialword>%'. You can then use these terms to build a CONTAINS statement, i..e looking for "compute" CONTAINS (*, 'computedcolname OR computername OR computetime OR ....'). Essentially you are treating the index as a dictionary.

You could also do matching based on sounds like (not SOUNDEX), but thats beyond the scope of this post.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx



-
SQL Server 2008 – iFTS Transparency – dm_fts_parser

In the next in this series of posts on Integrated Full Text Search (iFTS) in SQL Server 2008, we look at the new dmv dm_fts_parser.

Wow thats a cool function name what does it do Simon?

Well in my first post I talked about the processes involved in the full text process which until now have been black boxes. This function makes some of these more transparent from a querying perspective.

dm_fts_parser takes a full text query and breaks it up using the word breaker rules, applies stop lists (more on them later), and any configured thesaurus. This is essential in the first step of diagnosing when users are complaining because there queries aren’t working. Often this is due to, a word not breaking as expected, use of noise words that exist in the stop list or thesaurus replacing  or substituting words.

You call the function using the same query string as you would use normally with a CONTAINS statement, along with a language, a stop list and where the search should be accent sensitive.

SELECT *
FROM
sys.dm_fts_parser ('FORMSOF( THESAURUS, "Internet Explorer")', 2057, 0, 0)

This  returns the following,

You can see that in my thesaurus I have added substitution elements for Internet Explorer or firefox and netscape.

The following query ,

SELECT *
FROM
sys.dm_fts_parser ('multi-million', 2057, 0, 0)

Returns the following showing how the word breaking as broken the word up but also maintained the combined word.

Finally

SELECT *
FROM
sys.dm_fts_parser ('SQL OR Server OR 2008 OR is OR the OR best', 2057, 0, 0)

Returns the following which nicely indicates which words are noise words but also that numbers are searched as numbers and text. Note the nn prefix.

And finally finally, the query about c++, c# etc.

SELECT *
FROM sys.dm_fts_parser ('C or c or C++ or c++ or C# or c#', 2057, 0, 0)

Returns the following, which shows what you need to put in to get an exact search on c++, or c#. Capitalise the C. What’s also interesting is that C, C++ both relate to C as well but C# doesn’t, which means it C is removed from the noise word then C++ would return any document containing the word C.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx



-

In the previous post I talked about Full Text in general. I will now talk about whats new in SQL Server 2008.

I would categorise the new features into Performance, Manageability and Transparency. So what’s new in SQL Server 2008, well we have,

1.       Full Text Indexes are stored in the database

2.       Full Text Query engine is part of the main SQL Server Query Engine

3.       Words in a Full Text Index are visible.

4.       Words in a document are

5.       Noise words are no longer in a text file but stored in the database and are called Stop Lists

6.       Contents of a Stop List can be amended on the fly (pre indexed content will not be amended) without restarting SQL Server

7.       Thesaurus Files can be amended on the fly without restarting SQL Server

8.       Indexes are maintained in SQL Server memory rather than the OS.

9.       A new function exists to show how a query gets translated by word breaking, stop lists and thesaurus’s

10.   A new external process is used to index content. (This is for security and stability reasons as iFilters can be written by anybody)

11.   Support for filestream data, which means you don’t have to blobs in your database and means managing file growth is easier.

If you are wondering iFTS means Integrated Full Text Search which is what it is.

The following are the other posts in the series

If you want to try iFTS you can download the SQL Server 2008 from here http://www.microsoft.com/sql/2008/prodinfo/download.mspx

 



-
Posted by simonsabin | 6 comment(s)

The next CTP of SQL Serve 2008 is now available for download.

Whats different about this CTP from the others, apart from the new features which there are a lot, is there is a bug bash contest that is opening.

Simply put the people that raise the most nonduplicate reproducible bugs on CTP6 in the next month will win XBoxes. The rules are available here

https://connect.microsoft.com/SQLServer/content/content.aspx?ContentID=7734

I was suprised at how many XBoxes they are giving away, which shows the commitment to getting bugs ironed out now and not later in the day.

CTP6 is nearest to RTM with very few additional features being added from here on in. So if you fancy an XBox get installing and see if you can find a bug.

 



-
More Posts Next page »