[Image: a 10base2 T-piece connector]

My career started at a company where we hardly had email; the network was a 10base2 affair with cables running all around the office. You used floppy disks, and the thought of a GB of data was absurd. You had to look after every byte and only keep what you really needed.

Whilst the cost of spinning disks falls gradually, the cost of flash storage continues to plummet and its capacity keeps growing.

The new Crucial SSD is £380 for 1TB.

I can now keep 128GB of data on an SD card the size of my finger, and it costs only $50 a month to store 1TB of data in Azure ($61.29 on Amazon S3 Europe).

This brings along with it a whole host of problems.

 

It’s too cheap

Before, you had to manage your mailbox, your desktop and your database because you ran out of space.

Now it’s cheaper to keep data than it is to get rid of it.

So what?

Keeping data around leads to many problems, and the cost of the media is only a small part of the cost of maintaining that data:

1. You have to manage it, i.e. Backups

2. You have to secure it.

3. The more you have of it, the more risk there is of losing some of it.

4. When it is in a database, having more data affects your query performance.

5. Can your employees find the data they need, or is it a game of finding the needle in the haystack?

6. Most data protection acts state you should only keep data for as long as you need it.

7. You are also liable to give customers copies of the data you hold on them. Do you know all the data you have on someone? Even those notes a sales person wrote about them in a OneNote notebook?

It’s just like The Blob

The reason this is a problem for many organisations is that it is never a priority when a company or a project starts; there is always enough space at the beginning. However, as time goes by it grows and grows until it becomes a real problem.

When it does become a problem, it’s then such a big thing to deal with (doing the work, changing attitudes, implementing policy) that no one really has the appetite.

What are you doing about it?

Do you think about what you are storing and decide if you really need to store it?

Do you have a data retention policy?

Do you have a data deletion policy?

What about a data archive policy?

Are you actively reviewing what data you are holding?

Do your IT guys really push back when a team says we need 1TB of storage for project X?

Do your developers have data retention in their definition of done?

 

Deal with it sooner rather than later or it will be just too big to digest.

Posted by simonsabin | with no comments

Data is a huge part of what we do and I need someone that has a passion for data to lead our SQL team.

Have you got experience with SQL? Do you want to lead a team working in an agile environment with aggressive CI processes?

If you have a passion for data and want to use technology to solve problems, then you are just the person I am looking for.

The role is based in London, working for one of the top tech companies in Europe.

Contact me through my blog or LinkedIn (http://uk.linkedin.com/in/simonsabin) if you are interested.

Posted by simonsabin | with no comments

Then contact me; we need your services for a project.

Use the contact form on this blog or contact me through LinkedIn: http://www.linkedin.com/profile/view?id=2895653

Posted by simonsabin | with no comments

When you go shopping for a new gadget do you just go to your nearest electronics store (Currys, Best Buy, etc.) and read the marketing ticket next to the product?

No, you probably don’t. You more than likely ask your friends if they’ve got one and what they think, or you go and look for reviews that compare the different options.

The same applies to recruitment.

Employers don’t just rely on your CV (your marketing ticket); who knows what rubbish you’ve put in it? Employers want to validate that information.

I find LinkedIn key to that validation process.

It takes 2 minutes to complete your profile and keep it up to date. Whilst it is similar to your CV, it has a few benefits.

1. It’s a very small world. I can see if you are connected to anyone I know or have worked with someone I know.

2. It validates what you have in your CV. Because it’s public, you aren’t likely to exaggerate as much on LinkedIn as you might in your CV.

3. You can get endorsements and recommendations from previous colleagues.

You wouldn’t include snippets of recommendations in your CV, but you can have them in your LinkedIn profile. What’s more, given the blandness of most references, these provide some great information on which to base a decision.

 

So make sure that you complete your profile.

Posted by simonsabin | 4 comment(s)

When I started looking at SQL conferences I heard about one held in Denmark, in the middle of nowhere, with everyone staying in huts, and I was always curious how it worked: confining all the delegates in one place with no opportunity for escape. Later on, when I became an MVP, I started hearing stories about a conference in Denmark, the same one. It is the extreme of the work hard, play hard conference ethic, the play hard part being the stuff of legend. The huts are holiday homes and the conference venue is on the site of Legoland. On one of the nights they have a BBQ, and on the last night a dinner followed by the pool party. This isn’t any old pool party: they take over a water park, complete with a bar. You can imagine the scene: a conference of Oracle and SQL guys (and a few women), drinking and then going down water slides in groups of four on huge rubber rings.

Almost two years ago I was invited to speak at said event and unsurprisingly decided it was a must. The event was Miracle Open World 2011.

Now whilst the event and the pool party were something to behold (you try holding a beer whilst shooting down a vertical slide), my overriding memory was of a talk I attended.

“You’ve got to go and see this guy. Last year he dissected the internals of a PDF document.”

This was a comment by one of the attendees about a session on dissecting MDF files.

Being a fairly hardcore SQL guy I thought this was going to be a poor show; MDF files are very complex. But I was intrigued, as I was looking at how to explain to developers how SQL worked, and that’s what the session promised.

Well, I wasn’t disappointed. The session was by Mark S. Rasmussen on his OrcaMDF project. He has written code that you can point at an MDF file and it will reconstruct the rows and columns from the data; essentially he has written the storage engine’s parser, turning pages into the relevant information.

Even at that stage Mark had made huge progress: he could handle different types of pages, follow the linked lists of pages and the allocation maps, and decode page values into column values. Not simple stuff by any stretch.

Mark is a seriously clever and nice guy. If you ever see him at a conference say hi and have a chat. He’s also got a fascinating past, with more examples of just how stupidly clever he is.

Since that talk Mark has been speaking around the world (no idea how he finds the time, or the money, to do it), and I’m really pleased today as I found out that Mark has been awarded an MVP for SQL Server. Fantastic and well deserved.

 

You can follow Mark on Twitter at http://twitter.com/improvedk or read his blog at http://improve.dk

He has made the OrcaMDF project available on GitHub at https://github.com/improvedk/OrcaMDF

This is where it started http://improve.dk/archive/2011/04/17/miracle-open-world-2011-follow-up.aspx

Posted by simonsabin | 1 comment(s)

How many times have you copied data into a table and got one of the following less-than-helpful errors?

 

String or binary data would be truncated.

 

Arithmetic overflow error converting numeric to data type numeric.

 

Arithmetic overflow error for data type tinyint, value = 256.

 

Violation of UNIQUE KEY constraint '??????'. Cannot insert duplicate key in object '??????'. The duplicate key value is (3).

 

Cannot insert the value NULL into column 'parentId', table 'tempdb.dbo.child'; column does not allow nulls. INSERT fails.

 

The INSERT statement conflicted with the FOREIGN KEY constraint "?????". The conflict occurred in database "????", table "?????", column '?????'.

So they improved some of these messages to include the offending value. However, that doesn’t help much when you’ve got a multi-row insert: the error is only raised for the first offending value. For instance, in the unique violation above you can resolve the problem by excluding the rows with a value of 3, but you might have others. In the example below we do have others, so you have to write some generic code to verify the data before you insert it.
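As a tiny illustration of that first point, here is a minimal sketch; the temporary table and values are made up for this post and are not part of the script at the end.

-- a multi-row insert only ever reports the first duplicate value it hits
create table #uniqueDemo (unqCol int unique)

insert into #uniqueDemo (unqCol)
values (1),(3),(3),(5),(5)
-- Fails with "Violation of UNIQUE KEY constraint ... The duplicate key value is (3)."
-- Nothing tells you that 5 is duplicated too, so excluding the 3s and re-running
-- simply produces the next error.

drop table #uniqueDemo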

What this means is that if you have a source of data to insert into a table you have to go through the following process.

1. Perform insert

2. Get failure message

3. Identify offending rows, correct or exclude them

4. Go back to step 1 until you get no errors.

5. Once the insert works, implement a process to identify bad rows, either before the insert so they can be corrected, or after it so they can be verified and reprocessed.

 

I’ve worked through a very simple example below involving two tables, “parent” and “child”. Each step resolves an error with the insert caused by a data type domain constraint or an explicit constraint. You can see that it’s a painful process.

The end result is a hugely complex query, with many subqueries and use of TRY_CONVERT. Note that prior to SQL Server 2012 this code would have been amazingly complex, as there was no easy way to verify the data type values. The uniqueness check is the most complex, because it has to check the combination of values from both the source and the target tables.
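As a flavour of what TRY_CONVERT buys you, here is a small standalone sketch (the inline values are made up, not taken from the script below): it returns NULL where a conversion would fail rather than raising an error, so bad values can be filtered with a simple predicate.

-- try_convert (SQL Server 2012+) returns NULL instead of raising an error when a
-- value will not fit the target type
select v.intCol,
       try_convert(tinyint, v.intCol) as asTinyint -- 100 converts; 256 and 300 give NULL
from (values (100),(256),(300)) v(intCol)

-- the same expression used as a predicate keeps only the convertible rows
select v.intCol
from (values (100),(256),(300)) v(intCol)
where try_convert(tinyint, v.intCol) is not null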

The plan goes from a simple insert to a much more complex plan involving two additional scans of the source table and one of the parent table. Whilst those could be indexed to improve the plan, it highlights additional work and indexes that aren’t required for the simple insert.

What’s immensely frustrating is that all these verification predicates are being done by the engine anyway, that’s how it generates the errors.

KISS


This is far from keeping it simple. On a recent project we’ve had to put in months of work to implement this level of validation. What’s worse is that you have to put all this code in up front; if you don’t, your application will just fail and fall apart when you hit a problem. Worse still, resolving the errors is hugely complex and time consuming.

I’d accepted this as something you just have to live with. That was until now.

I was recently notified of a discussion on the TSQL forum and an associated Connect item.

Forum thread - http://social.technet.microsoft.com/Forums/en-US/transactsql/thread/3e17f8dc-9685-412b-8e76-94ad41536d5d 

Connect item https://connect.microsoft.com/SQLServer/feedback/details/774754/new-virtual-table-errors-it-would-analogous-to-the-deleted-and-inserted-tables#details

The idea here is that the failing rows get dumped into an errors table, just as the OUTPUT clause captures rows today. You can then examine the contents of the errors table and do something with it. This is very similar to the process seen in most ETL frameworks, where rows that fail a lookup or a data type validation are put into an errors table for later processing.
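No such syntax exists today, but modelled on the existing OUTPUT ... INTO clause the proposal might look something like the sketch below; the errors virtual table and the #failedRows destination are entirely imagined.

-- HYPOTHETICAL syntax only: a sketch of the Connect item idea, not real T-SQL
insert into child (parentId, strcol, unqcol, intCol, numCol, tinyintCol)
output errors.* into #failedRows -- imagined "errors" virtual table, like inserted/deleted
select parentId, strcol, unqcol, intCol, numCol, tinyintCol
from SourceTable
-- good rows would land in child, failing rows in #failedRows along with the reason
-- they failed, and the statement would not abort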

What a fantastic idea

 

Implementation

I would like the statement not to fail by default (with maybe an OPTION clause to configure this behaviour).

The biggest question is what error information is included. One column could have multiple validation errors: not null, check, foreign key, etc. Ideally I would like to know all the validation errors for each column, but I can’t see that being workable unless the result is a pivoted table, i.e. one row per column and validation rule, and that would require a unique set of columns on the source to be defined somehow and would make it complex.

You would also have the situation that a value which fails one check is likely to fail lots more, so knowing that a column value has at least one error is good enough. That would need an error code column in the errors table for each column.

The simplest option would be a single error per row; that’s what you get with the SSIS implementation and it would probably be the simplest to implement: as soon as a row fails it is marked as bad and, rather than being passed to the rest of the plan, is directed to the errors output.

I would settle for one error per row. At least you know the row has failed, you have caught it, and you can reprocess the data easily.
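With that one-error-per-row model the follow-up becomes trivial; a sketch, assuming the failed rows were captured in the imagined #failedRows table with an errorMessage column:

-- HYPOTHETICAL follow-up, assuming #failedRows and its errorMessage column exist
select errorMessage, count(*) as failedRowCount
from #failedRows
group by errorMessage
-- from here the offending rows can be corrected and re-inserted, or reported,
-- without re-running the original insert and deciphering its error messages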

Side note: what I found amusing while building the script below was that I was getting errors but couldn’t find which source rows were causing the problem, due to the vagueness of the messages. Exactly the problem that would be solved if we had an errors table.

 

VOTE


I think this idea is fantastic and urge you to vote on it if you agree.

https://connect.microsoft.com/SQLServer/feedback/details/774754/new-virtual-table-errors-it-would-analogous-to-the-deleted-and-inserted-tables#details

 

 

 

Sample script showing the ugly process of validating source data.

use tempdb

go

drop table child

drop table parent

drop table SourceTable

go

create table parent (parentId int not null primary key)

create table child (childid int identity(1,1), parentId int not null, strCol varchar(10), unqCol int unique, intCol int, numCol numeric(2,1), tinyintCol tinyint)

alter table child add constraint fk_child foreign key (parentId) references parent(parentId)

 

go

 

insert into parent (parentId)

values(1)

go

declare @bigint as bigint = POWER(cast(2 as bigint),32)

select *

 into SourceTable

 from (

values

 (1,replicate('A',11),1,100,10.22,256)

,(1,replicate('A',10),1,100,10.22,256)

,(1,replicate('A',10),2,100,10.22,256)

,(1,replicate('A',10),3,100,1.2,256)

,(1,replicate('A',10),3,100,1.2,256)

,(1,replicate('A',10),3,@bigint,1.2,255)

,(1,replicate('A',10),3,@bigint,1.2,255)

,(1,replicate('A',10),3,@bigint,1.2,255)

,(1,replicate('A',10),4,100,1.2,255)

,(1,replicate('A',10),4,100,1.2,255)

,(null,replicate('A',10),9,100,1.2,255)

,(2,replicate('A',10),10,@bigint,1.2,255)

,(2,replicate('A',10),5,100,1.2,255)

,(2,replicate('A',10),6,100,1.2,255)

,(1,replicate('A',10),11,@bigint,1.2,255)

,(1,replicate('A',10),7,100,1.2,255)

,(1,replicate('A',10),8,100,1.2,255)

)  v(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

 

go

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

go

print ' '

print 'Exclude varchar values that will be truncated'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

 

print ' '

print ' Exclude numeric values that are too big'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

 

print ' '

print 'Exclude values too big for tinyint'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

and (tinyintCol is null or try_convert(tinyint,tinyintCol) is not null)

 

print ' '

print 'Exclude values too big for int'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

and (tinyintCol is null or try_convert(tinyint,tinyintCol) is not null)

and (intCol is null or try_convert(int,intCol ) is not null)

 

print ' '

print 'Exclude specific value that breaks unique constraint'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

and (tinyintCol is null or try_convert(tinyint,tinyintCol) is not null)

and (intCol is null or try_convert(int,intCol ) is not null)

and unqcol <> 3

 

print ' '

print 'Exclude values that break unique constraint'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

and (tinyintCol is null or try_convert(tinyint,tinyintCol) is not null)

and (intCol is null or try_convert(int,intCol ) is not null)

and not exists (select 1 from child  c where c.unqCol = SourceTable.unqcol)

and (select COUNT(1) from sourcetable s where s.unqCol = SourceTable.unqcol) <=1

go

print ' '

print 'Exclude values that break foreign key constraint'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

and (tinyintCol is null or try_convert(tinyint,tinyintCol) is not null)

and (intCol is null or try_convert(int,intCol ) is not null)

and not exists (select 1 from child  c where c.unqCol = SourceTable.unqcol)

and (select COUNT(1) from sourcetable s where s.unqCol = SourceTable.unqcol) <=1

and (parentId is null or

  exists (select 1 from parent p where p.parentId = SourceTable.parentId))

go

print ' '

print 'Exclude values that breaks not null constraint'

insert into child(parentId, strcol,unqcol, intCol, numCol,tinyintCol)

select parentId, strcol,unqcol, intCol, numCol,tinyintCol

from SourceTable

where len(strcol) <= 10

and (numCol is null or try_convert(numeric(2,1),numCol) is not null)

and (tinyintCol is null or try_convert(tinyint,tinyintCol) is not null)

and (intCol is null or try_convert(int,intCol ) is not null)

and not exists (select 1 from child  c where c.unqCol = SourceTable.unqcol)

and (select COUNT(1) from sourcetable s where s.unqCol = SourceTable.unqcol) <=1

and (parentId is not  null and

  exists (select 1 from parent p where p.parentId = SourceTable.parentId))

Posted by simonsabin | 3 comment(s)

Bob Duffy is doing an afternoon seminar on Azure.

It will provide an introduction to the Azure platform, and in particular SQL Azure; show tools and methodologies for migrating on-premise databases into the cloud, using a sample application and database; and finally detail some of the Azure-specific features, such as federations, that enable massive-scale OLTP solutions.

http://www.prodata.ie/Events/2012/SQL-Azure_and_the_Cloud.aspx

Posted by simonsabin | 1 comment(s)

It’s taken 30 days and I have managed, to my surprise, to grow a ginger Mo in support of Movember.

http://uk.movember.com/mospace/6154809

I didn’t quite match my supposed lookalike, but I don’t think it was a bad effort. Next time I’ll leave my hair to grow for a few months beforehand.


If you fancy donating then you can do so at https://www.movember.com/uk/donate/payment/member_id/6154809/ (I only need £3 to reach £400, which would be great). The team has just passed £2000, which is awesome.

Posted by simonsabin | 1 comment(s)

If you are in Dublin then we have a meeting tonight at the Microsoft office.

The subject of the talk is

“Almost all queries have to do two things, get data and join it together. In this session we will look at the aspects of these that most people think they know but in reality don’t. “

If you think you know SQL then you should come along and we'll see if you are right.

http://www.mtug.ie/UserGroups/SQLServer/tabid/82/ctl/Details/Mid/413/ItemID/110/Default.aspx?ContainerSrc=%5BG%5DContainers%2F_default%2FNo+Container

Hope to see you there.

PS The Mo is growing nicely. If you fancy supporting Movember then you can do so here: https://www.movember.com/uk/donate/payment/member_id/6154809/


Posted by simonsabin | with no comments

I'll be over in Dublin doing a usergroup meeting on Thursday evening at the Microsoft office.

The subject of the talk is

“Almost all queries have to do two things, get data and join it together. In this session we will look at the aspects of these that most people think they know but in reality don’t. “

If you think you know SQL then you should come along and we'll see if you are right.

http://www.mtug.ie/UserGroups/SQLServer/tabid/82/ctl/Details/Mid/413/ItemID/110/Default.aspx?ContainerSrc=%5BG%5DContainers%2F_default%2FNo+Container

Hope to see you there.

Posted by simonsabin | 1 comment(s)