Page compression - internals and examples

First of all, apologies for the delay in writing this. I meant to blog about page compression as soon as I added it in to Internals Viewer but I didn’t get round to it.

Going over it again for this blog post and testing things out with RC0 I found a few bugs, so along with this comes a new release, version 1.02. It can be downloaded here.

Syntax:  ALTER TABLE table REBUILD WITH (DATA_COMPRESSION = PAGE)

Page compression uses the following techniques to save space:

  • Row Compression (covered here)
  • Prefix compression
  • Dictionary compression

Applying Page compression to a table or index doesn’t necessarily mean that all of the techniques are used. Row compression will always be used if page compression is used, but prefix and dictionary compression are only used if SQL Server determines that space can be saved by using them.

For this post I’ve created an example script that can be used with SQL Server 2008 (RC0) and SQL Internals Viewer. The script will create and populate two tables, one called FirstNames_PrefixOnly and one called FirstNames_PrefixAndDictionary. It can be downloaded here.

Both page compression techniques work by cutting down on repetition in the page. Prefix compression identifies common prefixes on a per-column basis. SQL Server identifies a common prefix and then it assumes that all values start with that prefix. Records only need to store the differences from the prefix.

Here’s an example. All of the following names start with Antoni.  The prefix is stored as Antoni so only the suffixes need to be stored, saving space.

Antonia
Antoni
e
Antoni
etta
Antoni
na
Antonio

Dictionary compression is applied after prefix compression and identifies common patterns in the data. If a particular pattern occurs more that once space can be saved by adding the pattern to the dictionary and in its place adding a reference back to the dictionary entry. Prefixes are specific to each column, but the dictionary can be used across columns.

Following on from the example of first names beginning with Antoni, there may be repetition in the table:

Antonietta
Antonietta
Antonietta
Antonietta
Antonina
Antonina
Antonina
Antonina

Prefix compression (in red) has already identified common prefixes. Dictionary compression indentifies the common pattern (etta and na), adds them to the dictionary and replaces them with a reference to the entry.

So this...

Data:       Antonietta
            Antonietta
            Antonina
            Antonina

Becomes this...

Prefix:      Antoni
Dictionary:  [0]etta, [1]na
Data:        [0]
             [0]
             [1]
             [1]

There’s more information on this on the Storage Engine Blog and also in Books Online.

Compression Info structure

If a table or index uses Page compression and it uses Prefix or Dictionary compression it will have something called a Compression Info (CI) structure just after the page header starting at byte 96.

The CI structure contains:

  • A header describing the CI structure and what it contains
  • The Anchor Record (for Prefix compression)
  • The Dictionary (if used)

In SQL Internals Viewer if a page has a CI structure an additional set of options will appear underneath the Offset table. Selecting one of the three items will colourise and decode it in the Page Viewer.


 
Header

The CI header is made of the following parts:

  • Status Bits – 1 byte
    • If bit 1 is 1 the CI has an anchor record
    • If bit 2 is 1 the CI has a dictionary
  • Page Mod Count – 2 byte short (I’m not sure what this does)
  • Length – Length of the anchor record (Only present if there is a dictionary) – 2 byte short
  • Size – Size of the CI structure – 2 byte short

Prefix Compression - Anchor Record

The anchor record is a record that uses the new row compression record format. The anchor record defines the prefix, if it exists, for each column.

For more on the new row compression record format see the previous post about row compression.

The screenshot above shows the anchor record for the first page of the example FirstNames_PrefixOnly table.

0x416E6E6162656C6C61 decodes to Annabella – this is the prefix for column 1.

 
 
In the example table, NameId 461 is Annemarie. The first name field is made up of two things. The first is the data offset. This is 1 or 2 bytes (depending on if the first bit is set) that determines at what offset the subsequent data starts. 03 decodes to 3 and 65 6D 61 72 69 65 is emarie. This means the prefix is used for the first three bytes, and then the rest of the data is taken from the field.

Annabella (Prefix from the anchor record)
123
emarie (Data from the row)
Annemarie

The record for Annabella (NameId 441) doesn’t use any space and the prefix provides all of the data:


 
When the prefix isn’t used at all the data offset is set to 0, which means the entire anchor prefix is ignored:

 
Dictionary Compression

The Dictionary is stored in the Compression Info structure.


 
The dictionary structure is as follows:

  • Entry count – 2 bytes
    • This defines the number of entries in the dictionary
  • Dictionary entry offset array – 2 * Entry count bytes
    • They array defines the end offset of each dictionary entry in a very similar way to the column offset array in a standard record with variable length fields.
  • Dictionary entries
    • Defined by the offset array

The example script creates a table called FirstNames_PrefixAndDictionary. The table has repetition in it which makes it suitable for dictionary compression.

How does a field refer back to the dictionary? The row structure discussed in the previous row compression post has a CD Array that contains information about each column in a record stored as 4-bit integers, two to a byte.  Row compression covers values 0-10 for the CD array, above 10 and is used to mark that the value in the field is a symbol, a reference to the dictionary item. The size of the symbol is determined by the CD array value minus 11.

Here’s an example:

This record has two columns, so the CD Array is one byte (0xC3) that is split out into 3 (2 bytes short) and 12 (1 byte symbol). The 1 byte symbol value is 4 so we need to replace this with the data in dictionary entry 4.

Because this column has a prefix the first byte is the offset of the data, and the subsequent bytes need to be appended at the offset. This column has a prefix of 41 64 65 6C 61 69 64 61  (Adelaida).

Dictionary 4 has the value 04 65 which means 0x65 (e) is appended to the prefix at byte 4, this is 41 64 65 6C 65. This decodes to Adele. It's as simple as that!

Published 19 June 2008 08:57 PM by Danny

Comments

# SQL Server, BizTalk Server, le 64 bits et au-delà !... said on 16 December, 2008 05:13 PM

Ce petit outil très sympathique vient de sortir en version 2 beta 1 qui intègre notamment un

# Page Compression Internals « Blog do Luan.Moreno [SQL Soul] said on 12 August, 2011 03:56 AM

Pingback from  Page Compression Internals « Blog do Luan.Moreno [SQL Soul]