Monday, 12 May 2008

Collation and CHECKSUM do not play together nicely.

If you work with data warehouses, like yours truly, most likely you’ve had to write code that checks your dimension tables for changes. Our usual approach is to pump the most recent copy of the table into a staging table, and then compare it to the latest version of the data.

Roughly, the comparison would be something like:

SELECT
P.Column1
, P.Coulmn2
FROM
ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE

Followed by a piece of code where the individual columns are compared. There’s two methods that we use: A column by column comparison (WHERE P.Column1 != S.Column1 OR P.Column2 != S.Column2), or a CHECKSUM method, where a CHECKSUM is done over all the columns of the staging table, and compared against the CHECKSUM over all the columns of the production table (WHERE CHECKSUM(S.Column1, S.coulmn2) != CHECKSUM(P.Column1, P.Column2).


Today, when testing some new code, I found that even though the values in the two tables were identical (except for one column I excluded), and yet, SQL Server reported the checksums to be different. A bit more investigation revealed that the reason behind this was because of collation differences. For those of you who never heard of it (I didn’t until about a year ago, despite having worked with databases for 5+ years in numerous international organizations), here’s an excerpt from BOL (“Selecting collations”):


Collations specify the rules for how strings of character data are sorted and compared, based on the norms of particular languages and locales. For example, in an ORDER BY clause, an English speaker would expect the character string 'Chiapas' to come before 'Colima' in ascending order. But a Spanish speaker in Mexico might expect words beginning with 'Ch' to appear at the end of a list of words starting with 'C'. Collations dictate these kinds of sorting and comparison rules. The Latin_1 General collation will sort 'Chiapas' before 'Colima' in an ORDER BY ASC clause, while the Traditional_Spanish collation will sort 'Chiapas' after 'Colima'.


Now the source system we pull our data from has a default collation of US-English, and our warehouse system uses Norwegian/Danish. Essentially, this means whenever we try and run comparison checks or updates between staging and production, we have to add a COLLATE DATABASE_DEFAULT clause in order to “convert” one set of values to another. That this also was an issue with CHECKSUM never occurred to me.

Here’s some sample code to prove my point. We create a test table with two different collations, and insert identical values into them. First, we’ll run a SELECT to see the CHECKSUMs are different, a second select which does a collation conversion on the US-column, and finally a statement that boxes the “other” collation column to the proper one for the database, which will result in an identical CHECKSUM. The code was run on a database with a US collation, by the way:

CREATE TABLE dbo.CHECKSUM_TEST
(
US nchar(10) NULL,
DK nchar(10) COLLATE Danish_Norwegian_BIN NULL
) ON [PRIMARY]

INSERT INTO CHECKSUM_TEST
(US, DK)
VALUES ('foo', 'foo')

INSERT INTO CHECKSUM_TEST
(US, DK)
VALUES
('bar', 'bar')

SELECT US, DK, CHECKSUM(US), CHECKSUM(DK)
FROM CHECKSUM_TEST

SELECT US, DK, CHECKSUM(US), CHECKSUM(DK COLLATE DATABASE_DEFAULT)
FROM CHECKSUM_TEST

DROP TABLE CHECKSUM_TEST

I’ve deliberately used Unicode rather than non-Unicode for the columns, to prove that doesn’t really help to deal with this particular issue. Below are the results from the three statements:

US DK
---------- ---------- ----------- -----------
foo foo 59294611 24735
bar bar -2088615046 25698

(2 row(s) affected)

US DK
---------- ---------- ----------- -----------
foo foo 59294611 24735
bar bar -2088615046 25698

(2 row(s) affected)

US DK
---------- ---------- ----------- -----------
foo foo 59294611 59294611
bar bar -2088615046 -2088615046


The lesson learnt here is to be careful when using CHECKSUM in combination with mixed collations, and when you do, it’s important to use the COLLATE statement on the proper column ;)

1 comment:

Customer Loyalty Program said...

Data warehousing is used to house the necessary data for specific analysis. it requires requires different capacities that are more static and less easily manipulated than those used for data warehousing.