
Monday, 12 May 2008

Collation and CHECKSUM do not play together nicely.

If you work with data warehouses, like yours truly, most likely you’ve had to write code that checks your dimension tables for changes. Our usual approach is to pump the most recent copy of the table into a staging table, and then compare it to the latest version of the data.

Roughly, the comparison would be something like:

SELECT
P.Column1
, P.Column2
FROM
ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE
-- comparison goes here

Followed by a piece of code where the individual columns are compared. There are two methods we use: a column-by-column comparison (WHERE P.Column1 != S.Column1 OR P.Column2 != S.Column2), or a CHECKSUM method, where a CHECKSUM over all the columns of the staging table is compared against a CHECKSUM over all the columns of the production table (WHERE CHECKSUM(S.Column1, S.Column2) != CHECKSUM(P.Column1, P.Column2)). Both are sketched below.
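To make that concrete, here is a minimal sketch of both variants, reusing the made-up table and column names from above:

-- Method 1: compare column by column
SELECT P.PKID
FROM ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE P.Column1 != S.Column1
OR P.Column2 != S.Column2

-- Method 2: compare a CHECKSUM over all the columns
SELECT P.PKID
FROM ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE CHECKSUM(P.Column1, P.Column2) != CHECKSUM(S.Column1, S.Column2)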


Today, when testing some new code, I found that even though the values in the two tables were identical (except for one column I excluded), SQL Server reported the checksums to be different. A bit more investigation revealed that the reason behind this was collation differences. For those of you who have never heard of collations (I hadn’t until about a year ago, despite having worked with databases for 5+ years in numerous international organizations), here’s an excerpt from BOL (“Selecting collations”):


Collations specify the rules for how strings of character data are sorted and compared, based on the norms of particular languages and locales. For example, in an ORDER BY clause, an English speaker would expect the character string 'Chiapas' to come before 'Colima' in ascending order. But a Spanish speaker in Mexico might expect words beginning with 'Ch' to appear at the end of a list of words starting with 'C'. Collations dictate these kinds of sorting and comparison rules. The Latin_1 General collation will sort 'Chiapas' before 'Colima' in an ORDER BY ASC clause, while the Traditional_Spanish collation will sort 'Chiapas' after 'Colima'.
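You can see the quoted behaviour for yourself with two one-off queries (the specific collation names here are my choice; any Latin1 General versus Traditional Spanish pair should do):

SELECT W.Word
FROM (SELECT 'Chiapas' AS Word UNION ALL SELECT 'Colima') AS W
ORDER BY W.Word COLLATE Latin1_General_CI_AS -- Chiapas, Colima

SELECT W.Word
FROM (SELECT 'Chiapas' AS Word UNION ALL SELECT 'Colima') AS W
ORDER BY W.Word COLLATE Traditional_Spanish_CI_AS -- Colima, Chiapas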


Now the source system we pull our data from has a default collation of US-English, and our warehouse system uses Norwegian/Danish. Essentially, this means that whenever we run comparison checks or updates between staging and production, we have to add a COLLATE DATABASE_DEFAULT clause in order to “convert” one set of values to the other. That this is also an issue with CHECKSUM had never occurred to me.
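In a plain column comparison, that conversion looks something like this (a sketch, reusing the invented names from earlier):

-- StagingTable holds the US-English source data; the warehouse database
-- default is Danish/Norwegian, so convert before comparing
SELECT P.PKID
FROM ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE P.Column1 != S.Column1 COLLATE DATABASE_DEFAULT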

Here’s some sample code to prove my point. We create a test table with two columns in different collations, and insert identical values into them. First, we run a plain SELECT to see that the CHECKSUMs are different; a second SELECT does a collation conversion on the US column (which changes nothing, since the database default already is US); and a final statement converts the “other” collation column to the proper one for the database, which results in identical CHECKSUMs. The code was run on a database with a US collation, by the way:

CREATE TABLE dbo.CHECKSUM_TEST
(
US nchar(10) NULL, -- no COLLATE clause, so this picks up the database default (US)
DK nchar(10) COLLATE Danish_Norwegian_BIN NULL
) ON [PRIMARY]

INSERT INTO CHECKSUM_TEST
(US, DK)
VALUES ('foo', 'foo')

INSERT INTO CHECKSUM_TEST
(US, DK)
VALUES
('bar', 'bar')

SELECT US, DK, CHECKSUM(US), CHECKSUM(DK)
FROM CHECKSUM_TEST

-- Converting the US column is a no-op: the database default already is US
SELECT US, DK, CHECKSUM(US COLLATE DATABASE_DEFAULT), CHECKSUM(DK)
FROM CHECKSUM_TEST

SELECT US, DK, CHECKSUM(US), CHECKSUM(DK COLLATE DATABASE_DEFAULT)
FROM CHECKSUM_TEST

DROP TABLE CHECKSUM_TEST

I’ve deliberately used Unicode rather than non-Unicode for the columns, to show that using Unicode doesn’t help with this particular issue. Below are the results of the three statements:

US DK
---------- ---------- ----------- -----------
foo foo 59294611 24735
bar bar -2088615046 25698

(2 row(s) affected)

US DK
---------- ---------- ----------- -----------
foo foo 59294611 24735
bar bar -2088615046 25698

(2 row(s) affected)

US DK
---------- ---------- ----------- -----------
foo foo 59294611 59294611
bar bar -2088615046 -2088615046


The lesson learnt here is to be careful when using CHECKSUM in combination with mixed collations, and when you do, make sure you apply the COLLATE clause to the proper column ;)



Friday, 4 January 2008

SQL Server 2005 varchar(max) in C#

An interesting question at work today: how do you properly define a parameter that's varchar(max) in SQL Server 2005 from within C#?

Sure, it should be possible to use the signature without adding the integer value. But what if a custom DataAdapter is being used that does require a value if you supply a string? We could use 4000 as a value, but that kind of defeats the purpose of using the new datatype. 8000 then? Same difference; you still possibly end up with truncated strings. Should we make it 2bn? That's 2GB of memory. Not really an option in my book.
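The truncation risk, by the way, is easy to demonstrate on the SQL side. A quick sketch (the variable names are mine); REPLICATE caps its result at 8,000 bytes unless it is fed a varchar(max):

DECLARE @capped varchar(8000)
DECLARE @uncapped varchar(max)

SET @capped = REPLICATE('x', 10000) -- silently truncated to 8000 characters
SET @uncapped = REPLICATE(CONVERT(varchar(max), 'x'), 10000)

SELECT LEN(@capped) AS CappedLen, LEN(@uncapped) AS UncappedLen -- 8000, 10000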

After some Googling, we found the answer: the size of the parameter should be set to -1. There's an example here, where the -1 value is not explained in detail, but just shown in the code. We tested it, and it works like a charm.



Tuesday, 28 August 2007

Hierarchical data with Common Table Expressions

Today, I finally managed to crack a problem that's been bothering me for some time. Some weeks ago I started working on the database schema that the site will be using, and the Entropia auction system gave me some trouble.

You see, the auction in the game is built up hierarchically, and I eventually want to show individual items in the database under the auction section they belong to (as with a lot of items, it's unclear exactly where to find them).

That in itself was easy enough, but it gets tricky if you want to store the data in third normal form and at the same time present some form of cookie-trail, so the user will know exactly where to go. (By cookie-trail I mean something along the lines of "Items \ Tools \ Misc Tools".) I chose to store my data in a table with an AuctionCategoryID, an AuctionCategoryName, and an AuctionCategoryParentID, the latter of which represents a self-join.

[Figure: an example category tree]

(The figure above doesn't actually represent my own table. I chose to give the root nodes of the tree a ParentID of -1, and did not define -1 as an AuctionCategoryID. Hence, creating a foreign key is just not possible. The figure is just for clarification.)
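For reference, the table behind all this looks roughly like the following (a sketch; the exact column types are assumptions on my part):

CREATE TABLE dbo.AuctionCategories
(
AuctionCategoryID int NOT NULL PRIMARY KEY
, AuctionCategoryName varchar(255) NOT NULL
, AuctionCategoryParentID int NOT NULL -- -1 marks a root node; no foreign key possible
)

INSERT INTO AuctionCategories VALUES (1, 'Items', -1)
INSERT INTO AuctionCategories VALUES (2, 'Tools', 1)
INSERT INTO AuctionCategories VALUES (3, 'Misc Tools', 2)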

I first tried meddling with code a bit myself, but found that my code just didn't do the trick. Sure, I could hard-code a number of self-joins, but what if suddenly more categories were added? I'd have to alter my function. That was definitely not an option.

I tried a few more things, but found my code was not robust enough, and that there were sorting troubles (among other things).

I Googled a bit, and initially ended up with a solution like this (Joe Celko describes a similar method in his "SQL for Smarties"). Hardly elegant, and hardly dynamic. It would be possible, of course, to leave a gap of a few numbers between each of the individual "nodes", but that still potentially means I could run out of space, and would then have to manually alter stuff again. Not an option if you ask me.

Enter the recursive Common Table Expression. BOL doesn't seem very helpful, but hopefully the code below will make things easier to see.

The first query within the Common Table Expression defines the root level. As you see, the -1 value I mentioned above is hard-coded. The UNION ALL is the key to making the self-join work for an indefinite number of child nodes. The second query just fills in those child nodes.

Note that it is possible to create an infinite loop, though CTEs also offer a way to cap this: the MAXRECURSION hint at the bottom of the query below.

The code:

WITH AuctionPath -- The name of the CTE
(
[Name]
, [AuctionCategoryID]
, [Path]
, [Level]
)
AS
(
SELECT
AuctionCategoryName -- Name
, AuctionCategoryID -- AuctionCategoryID
, CONVERT(varchar(255), AuctionCategoryName) -- Path
, 1 -- Level
FROM
AuctionCategories
WHERE
AuctionCategoryParentID = -1 -- Only the root nodes here
UNION ALL
SELECT
AuctionCategoryName -- Name
, AuctionCategories.AuctionCategoryID -- AuctionCategoryID
, CONVERT(varchar(255), Path + ' \ ' + AuctionCategoryName) -- Path
, [Level] + 1 -- Level
FROM
AuctionCategories
INNER JOIN AuctionPath ON AuctionCategories.AuctionCategoryParentID = AuctionPath.[AuctionCategoryID]
)
SELECT
[Name]
, [AuctionCategoryID]
, [Path]
, [Level]
FROM
AuctionPath
ORDER BY AuctionCategoryID
OPTION (MAXRECURSION 5); -- Raise an error if recursion goes deeper than 5 levels

Some things of note. There are certain conditions to recursive CTEs. From BOL:

  • The number of columns in the anchor and recursive members must be the same.
  • The data type of a column in the recursive member must be the same as the
    data type of the corresponding column in the anchor member.

In the quoted text, anchor is what I have referred to as the root node, and recursive member is a child node.

I hope this helps a little if you're facing a situation where you need to do hierarchies. Once you get the trick, there's really nothing to it :)

Many thanks to both Kalman Toth and russellb over at the sqlmag forums.



Wednesday, 15 August 2007

SSIS - DataReader issue

Just a quick note, unrelated to anything to do with the actual website.

Today, at work, I was working on a task that involves moving data from a production database to a historic database. The production environment has been misused at times in the past, and as a result, there are numerous backups of tables residing in the database.

Due to regulations, just dropping the tables or wiping the data is not an option; hence, we chose to move it all to the historic database. To accomplish this, I selected SSIS as our preferred tool. Stupidly enough, I chose to use ADO.NET as the data connection. The fun started when I was creating the actual "Data Flow" objects. I must have created a dozen of them when suddenly I was confronted with an error saying SSIS was unable to translate from Unicode to non-Unicode data.

Huh?


I checked my CREATE TABLE scripts, and sure enough, the data types in both source and destination were set to char. I tried altering the Source's External columns, which seemed to work from the column's properties, but actually didn't when I checked the data type coming into the Source. I tried fiddling with the Source's Output column, but got an error message saying I was not allowed to. I tried explicitly casting the column as char in the Source's SELECT statement. No luck.

A colleague came over and suggested using a Derived Column object. Again, that didn't work. I ended up Googling, and came to an MSDN link. There, about halfway down the page, it states:

"One issue when using ADO.NET is that there is no ADO.NET data type corresponding to VARCHAR (non-Unicode character) in relational databases. Therefore, ADO.NET data sources output all character data as Unicode. This can cause problems with simple components that use ADO.NET sources and destinations that expect single-byte character types. You can work around this by using a Data Conversion transformation."

I tested this, and using a Data Conversion did indeed work. But what a lot of extra work to do something relatively simple. I decided to just switch all connections to OLEDB.

Conclusion: when dealing with non-Unicode char columns, OLEDB might be the better choice.



Monday, 30 July 2007

Welcome!

Hi there,

We'd like to take the time to welcome you to the official developer blog of http://www.entropiaonline.com/. This blog will mostly be used to describe pieces of code that we found tremendously helpful while developing the site, to discuss programming techniques, and to gather feedback from our users as to what they would like us to work on next.

As some initial information, the technologies we will be using include ASP.NET (2.0), SQL Server 2005, and C#. Once we feel the content of the site is to our liking, we will most likely also look into AJAX to enhance the user experience.

Thanks for stopping by!

