Row compression – internal structure
The CTP6 release of SQL Server 2008 includes row and page compression. It’s a feature that will only be in the Developer and Enterprise edition of SQL Server, so I wasn’t sure whether I should (or could!) add it in to SQL Internals Viewer. I had a look into it and thought it would be worth putting it in as it’s one of those things where it’ll be very useful to understand how it works.
So far I’ve only got row compression covered in SQL Internals Viewer, but I thought I’d better get down what I’ve found out so far.
There’s more on the new SQL Server 2008 compression on the SQL Server Storage Engine Blog here.
Row compression has a completely different row format. The size of a field is determined by the minimum amount of space needed to store it. The Books Online topic ‘Row Compression Implementation’ has a good run down of how space is saved with different data types.
It’s very easy to add row compression to a table. The syntax is:
ALTER TABLE table REBUILD WITH (DATA_COMPRESSION = ROW)
The standard row structure is covered in Kalen Delaney’s Inside Microsoft SQL Server 2005: The Storage Engine, and there’s also an overview here (Storage Engine Blog).
Compressed row structure
(There may be mistakes in this as it’s currently undocumented, please let me know if anything needs correcting)
Status Bits A
This looks the same as a normal row, although there may be differences.
Number of columns 1 or 2 byte integer
This is the first instance where space can be saved. If the number of columns can fit into one byte (0-254) one byte will be used, if not two bytes are used. If the first (high-order) bit is 1 on the first byte this indicates a second byte is used.
Next (Number of columns/2) + (Number of columns%2) bytes
I’m guessing CD stands for Column Description or Compression Description. It’s an array of 4-bit (nibble) integers, stored 2 per byte. Every column in the row has a CD Array entry that determines if it is null, empty, stored ‘short’ (and if so the size) or stored ‘long’.
Short and long are the equivalents to the difference between fixed and variable length storage in the standard row format. Short CD Array entries represent fixed length storage (defined by the CD Array entry), but they use the optimal amount or storage. Long fields are similar to variable length fields, they have an entry in a row offset array, and these too use optimized storage.
Possible values for the CD array are:
0 – Null
1 – Empty
2 – 1 byte short
3 – 2 bytes short
4 – 3 bytes short
5 – 4 bytes short
6 – 5 bytes short
7 – 6 bytes short
8 – 7 bytes short
9 – 8 bytes short
10 - Long
If a field is a BIT data type the value of the CD Array is used as the value.
Row compression essentially turns every compressible field into a variable length field. It seems that the distinction between long and short columns is used so the extra overhead (column offset array entry) is only used when necessary. Below 9 bytes the CD array can be used to store the length. Above 8 bytes and an extra two bytes are used for the offset array entry.
Short Column Data
Next ∑ (short bytes in CD Array)
Number of variable length columns
Next 2 bytes
Column offset array
Next (2 * Number of variable length columns) bytes
Each 2-byte integer defines the end offset of the variable length field
Long Column Data
The long/variable length fields with the offsets defined in the offset array.
Here’s an example in Internals Viewer:
This only covers data records. I've still got to look into indexes and after that page compression (which when used also uses row compression). I'll also try to blog on how the data is actually stored and how to decode it.
Hopefully everything will be covered in version 1.0 of SQL Internals Viewer.