
Monday, 12 May 2008

Collation and CHECKSUM do not play together nicely.

If you work with data warehouses, like yours truly, most likely you’ve had to write code that checks your dimension tables for changes. Our usual approach is to pump the most recent copy of the source table into a staging table, and then compare it against the current version of the data in the warehouse.

Roughly, the comparison would be something like:

SELECT
P.Column1
, P.Column2
FROM
ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE
-- the column comparisons go here (see below)
Followed by a piece of code where the individual columns are compared. There are two methods we use: a column-by-column comparison (WHERE P.Column1 != S.Column1 OR P.Column2 != S.Column2), or a CHECKSUM method, where a CHECKSUM over all the columns of the staging table is compared against a CHECKSUM over all the columns of the production table (WHERE CHECKSUM(S.Column1, S.Column2) != CHECKSUM(P.Column1, P.Column2)).
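Spelled out in full, the two variants look roughly like this (the table and column names are just the placeholders from the sketch above):

-- Variant 1: column-by-column comparison
SELECT
P.Column1
, P.Column2
FROM
ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE
P.Column1 != S.Column1
OR P.Column2 != S.Column2

-- Variant 2: CHECKSUM comparison
SELECT
P.Column1
, P.Column2
FROM
ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE
CHECKSUM(P.Column1, P.Column2) != CHECKSUM(S.Column1, S.Column2)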


Today, when testing some new code, I found that even though the values in the two tables were identical (except for one column, which I excluded), SQL Server reported the checksums to be different. A bit more investigation revealed that the reason was collation differences. For those of you who have never heard of collations (I hadn’t until about a year ago, despite having worked with databases for 5+ years in numerous international organizations), here’s an excerpt from BOL (“Selecting collations”):


Collations specify the rules for how strings of character data are sorted and compared, based on the norms of particular languages and locales. For example, in an ORDER BY clause, an English speaker would expect the character string 'Chiapas' to come before 'Colima' in ascending order. But a Spanish speaker in Mexico might expect words beginning with 'Ch' to appear at the end of a list of words starting with 'C'. Collations dictate these kinds of sorting and comparison rules. The Latin1_General collation will sort 'Chiapas' before 'Colima' in an ORDER BY ASC clause, while the Traditional_Spanish collation will sort 'Chiapas' after 'Colima'.


Now, the source system we pull our data from has a default collation of US English, while our warehouse system uses Norwegian/Danish. Essentially, this means that whenever we run comparison checks or updates between staging and production, we have to add a COLLATE DATABASE_DEFAULT clause in order to “convert” one set of values to the other. It had simply never occurred to me that this would also be an issue with CHECKSUM.
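For an ordinary string comparison between the two systems, that looks something like this (again using the placeholder names from the earlier sketch):

SELECT
P.Column1
FROM
ProductionTable P
INNER JOIN StagingTable S ON P.PKID = S.PKID
WHERE
P.Column1 != S.Column1 COLLATE DATABASE_DEFAULT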

Here’s some sample code to prove my point. We create a test table with two columns of different collations, and insert identical values into both. First, we run a SELECT to show that the CHECKSUMs are different; then a second SELECT that applies a collation conversion to the US column, which changes nothing, since that column already uses the database default; and finally a statement that casts the “other” collation column to the proper one for the database, which does result in an identical CHECKSUM. The code was run on a database with a US collation, by the way:

CREATE TABLE dbo.CHECKSUM_TEST
(
US nchar(10) NULL,
DK nchar(10) COLLATE Danish_Norwegian_BIN NULL
) ON [PRIMARY]

INSERT INTO CHECKSUM_TEST
(US, DK)
VALUES ('foo', 'foo')

INSERT INTO CHECKSUM_TEST
(US, DK)
VALUES
('bar', 'bar')

-- 1: plain checksums; the values are the same, but the checksums differ
SELECT US, DK, CHECKSUM(US), CHECKSUM(DK)
FROM CHECKSUM_TEST

-- 2: converting the US column, which already has the database default collation: no change
SELECT US, DK, CHECKSUM(US COLLATE DATABASE_DEFAULT), CHECKSUM(DK)
FROM CHECKSUM_TEST

-- 3: converting the DK column to the database default: the checksums now match
SELECT US, DK, CHECKSUM(US), CHECKSUM(DK COLLATE DATABASE_DEFAULT)
FROM CHECKSUM_TEST

DROP TABLE CHECKSUM_TEST

I’ve deliberately used Unicode rather than non-Unicode columns, to show that Unicode doesn’t help with this particular issue. Below are the results of the three statements:

US DK
---------- ---------- ----------- -----------
foo foo 59294611 24735
bar bar -2088615046 25698

(2 row(s) affected)

US DK
---------- ---------- ----------- -----------
foo foo 59294611 24735
bar bar -2088615046 25698

(2 row(s) affected)

US DK
---------- ---------- ----------- -----------
foo foo 59294611 59294611
bar bar -2088615046 -2088615046


The lesson learnt here is to be careful when using CHECKSUM in combination with mixed collations, and when you do, to apply the COLLATE clause to the proper column ;)



Wednesday, 29 August 2007

A free script to create CRUD procedures.

Recently, both at my job and while working on the site, I found myself having to write countless CRUD stored procedures to facilitate retrieving and storing data in my tables. The majority of these were an exact reflection of the table definition (i.e. updating all columns), and writing them was a real burden. It was boring, repetitive and error-prone work, especially the UPDATE procedures, where one has to repeat the columns three times: once as input parameters (including their data types), once as column names, and once again as parameters (this time without the column names).

Being the lazy guy I am, I decided I would automate the matter, and write a script to automatically create the procedures for me. Before I started, I decided I would make a few starting assumptions:

1) I wanted the SELECT procedure to accept either an ID number or a Name (assuming that, like me, you called the column something involving '%Name') for an individual row, or neither, in which case the whole table should be returned (see the sketch after this list for the general shape).
2) I wanted the DELETE procedures to be optional. I achieved this by using a bit flag and IF statements to determine whether they ought to be created or not.
3) I wanted the user to be able to specify certain columns to exclude. In the database I'm working on for the site, there are numerous audit columns that keep track of when data is entered or modified, and by whom. Naturally, this is not required in all of the procedures, so I wanted a mechanism for the end user to include or exclude these columns.
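As an illustration of point 1, for a hypothetical dbo.Product table the generated SELECT procedure would take roughly this shape (the table, column and procedure names here are my own placeholders, not the script's literal output):

CREATE PROCEDURE dbo.Product_Select
@ProductID INT = NULL,
@ProductName NVARCHAR(100) = NULL
AS
BEGIN
-- When neither parameter is supplied, both conditions fall away and the whole table is returned
SELECT ProductID, ProductName
FROM dbo.Product
WHERE (@ProductID IS NULL OR ProductID = @ProductID)
AND (@ProductName IS NULL OR ProductName = @ProductName)
END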


The result of my work is the script below. I've briefly tested the generated procedures against my own database, and there they work like a charm. Perhaps others who run more specialised or customised databases will run into unexpected errors; if so, please do let me know.

Of course, the script will not cover 100% of anybody's specific wishes. In fact, I'm quite sure that some of the procedures will need to be customised to suit your needs, but the script will at least give you a foundation to start from.

I'm releasing this script under the Creative Commons Attribution Non-Commercial No Derivatives license, which means you are free to download and share the script, provided you mention and link back to me. However, you are not allowed to change or commercially use the script. I might change this to a less restrictive license in the future, however for now, I want to be able to evaluate comments etc first. This is the first time I ever released anything under a license, so any comments regarding that are more than welcome.

A list of things I have figured I could enhance the script with:

1) Add error handling to the stored procedures.
2) Add support for schemas in SQL Server 2005.
3) Add paging support to the select procedures, by adding two additional parameters.

Also, I might later on decide to automatically generate .NET custom classes based on the exact database tables.

Click here to download the script.

One last comment: should you decide to alter the generated stored procedures to suit your specific needs, do remember to rename the new procedure. The script automatically drops its standard procedures, in order to always reflect the latest table schemas.


Creative Commons License


This work is licensed under a
Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.



Sunday, 26 August 2007

Roman numerals

If you play Entropia Universe (formerly Project Entropia), you're probably aware that mining deposit sizes have two distinct identifiers. The first is the name of the claim size (i.e. tiny, very poor, poor, all the way up to Immense). However, those terms do not always clearly indicate which of two sizes is larger.

In the case of very poor vs. poor it's obvious, but is Significant bigger or smaller than Substantial? In order to avoid that kind of confusion, MindArk also decided to give claims a Roman numeral indication.

In order to have the database reflect these numerals, I decided to create a function that translates our number system into Roman numerals. It works in a surprisingly simple way, actually. The function takes an input parameter, which is the number to be translated.


It then determines the single digits, the tens, the hundreds and the thousands, and translates each into its Roman counterpart using a CASE statement. When it encounters a 4 or a 9 (which, obviously, are the special cases, because they are not a simple repetitive sequence of characters), it translates those as well.

CREATE FUNCTION [dbo].[fn_ToRoman]
(
@InputNumber INT
)
RETURNS VARCHAR(16)
AS
BEGIN
DECLARE
@Singles INT
, @Tens INT
, @Hundreds INT
, @Thousands INT

SELECT
@Singles = @InputNumber % 10
, @Tens = @InputNumber % 100 / 10
, @Hundreds = @InputNumber % 1000 / 100
, @Thousands = @InputNumber % 10000 / 1000

RETURN
(
SELECT
REPLICATE('M', @Thousands) +
CASE
WHEN @Hundreds = 4 THEN 'CD'
WHEN @Hundreds = 9 THEN 'CM'
WHEN @Hundreds >= 5 THEN 'D' + REPLICATE('C', @Hundreds - 5)
ELSE REPLICATE('C', @Hundreds) -- 0 to 3 plain C's
END +
CASE
WHEN @Tens = 4 THEN 'XL'
WHEN @Tens = 9 THEN 'XC'
WHEN @Tens >= 5 THEN 'L' + REPLICATE('X', @Tens - 5)
ELSE REPLICATE('X', @Tens) -- 0 to 3 plain X's
END +
CASE
WHEN @Singles = 4 THEN 'IV'
WHEN @Singles = 9 THEN 'IX'
WHEN @Singles >= 5 THEN 'V' + REPLICATE('I', @Singles - 5)
ELSE REPLICATE('I', @Singles) -- 0 to 3 plain I's
END
)
END
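A quick sanity check of the function, with the expected output in a comment:

SELECT dbo.fn_ToRoman(1987) AS RomanNumeral
-- Returns: MCMLXXXVII (M = 1000, CM = 900, LXXX = 80, VII = 7)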



Sunday, 19 August 2007

Yes! Code regions in SQL :)

Currently I'm working on a long long long SQL script that automatically writes CRUD (Create, Read, Update, Delete) procedures based on a table's definition. If you are like me, you will find that you group certain code together.

In Visual Studio, you can conveniently wrap #region tags around the code, so you can just collapse the region when you know certain stuff works, and doesn't need to be edited anymore. Not so in SQL Query Analyzer (or SQL Server Management Studio for that matter).

SQL Visualizer does offer that option, though. You start a region with a -- REGION comment, and end it with -- END REGION.
Because the regions are defined in ordinary comment blocks, your code will still work anywhere; it's just that within this tool, you will now be able to collapse them.
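For example, a grouped block would look something like this (the statements themselves are just an illustration):

-- REGION Declare script variables
DECLARE @TableName SYSNAME
DECLARE @SQL NVARCHAR(MAX)
-- END REGION

-- REGION Build the dynamic statement
SET @SQL = N'SELECT * FROM ' + QUOTENAME(@TableName)
-- END REGION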



I've used other SQL tools before, and quickly found myself abandoning them due to being, well frankly, crap. I'll have to work with this one a bit more before I can actually judge it properly, but regions in SQL code are definitely something I miss in the Microsoft environment.

Update 08/27/2007: I've decided not to use SQL Visualizer much after all, unless it's for developing large scripts like the CRUD script I mentioned before. The reason is that it's not possible to copy/paste output messages (which, if you're building dynamic SQL scripts, is kind of necessary for testing).



Friday, 10 August 2007

Preventing accidental data deletion in SQL Server

After a few hours of working out the proper table definition for your base table and filling it with data, you sit back and decide to do some minor maintenance on a few other tables. You truncate a table, and seconds later realise to your horror that you just purged the very table you spent all that time filling.

Sound familiar?

It recently happened to me, and I decided to put some mechanisms in place to prevent accidental deletion of data. Of course, the best prevention is making sure you are always 100% awake and simply never make errors. Unfortunately, that is unrealistic, so I have a few tips to help out.


As a very basic and rather obvious tip, I'll advise you to make a database backup every time you have spent a lot of time filling base tables. Now for some added technical tricks:

1) To prevent accidental deletion of data, I decided to use INSTEAD OF triggers. BOL again:


INSTEAD OF
Specifies that the DML trigger is executed instead of the triggering SQL statement, therefore, overriding the actions of the triggering statements. INSTEAD OF cannot be specified for DDL triggers.

At most, one INSTEAD OF trigger per INSERT, UPDATE, or DELETE statement can be defined on a table or view. However, you can define views on views where each view has its own INSTEAD OF trigger.

INSTEAD OF triggers are not allowed on updatable views that use WITH CHECK OPTION. SQL Server raises an error when an INSTEAD OF trigger is added to an updatable view WITH CHECK OPTION specified. The user must remove that option by using ALTER VIEW before defining the INSTEAD OF trigger.

{ [ DELETE ] [ , ] [ INSERT ] [ , ] [ UPDATE ] }
Specifies the data modification statements that activate the DML trigger when it is tried against this table or view. At least one option must be specified. Any combination of these options in any order is allowed in the trigger definition.

For INSTEAD OF triggers, the DELETE option is not allowed on tables that have a referential relationship specifying a cascade action ON DELETE. Similarly, the UPDATE option is not allowed on tables that have a referential relationship specifying a cascade action ON UPDATE.


The definition of my trigger is similar to this:


CREATE TRIGGER DeleteTable_1 ON Table_1
INSTEAD OF DELETE
AS
BEGIN
IF @@ROWCOUNT > 0
BEGIN
-- Refuse the delete: raise an error and roll back the statement
RAISERROR('Rows in Table_1 cannot be deleted!', 16, 2)
ROLLBACK
END
END
GO


When I now try to delete rows (using DELETE FROM Table_1), I see the following errors in my query results:


Msg 50000, Level 16, State 2, Procedure DeleteTable_1, Line 7
Rows in Table_1 cannot be deleted!
Msg 3609, Level 16, State 1, Line 1
The transaction ended in the trigger. The batch has been aborted.



Unfortunately, that does NOT prevent the table from being truncated.

2) To avoid accidentally dropping a table, I create views over my tables using the WITH SCHEMABINDING option. From BOL again (my emphasis):


SCHEMABINDING
Binds the view to the schema of the underlying table or tables. When SCHEMABINDING is specified, the base table or tables cannot be modified in a way that would affect the view definition. The view definition itself must first be modified or dropped to remove dependencies on the table that is to be modified. When you use SCHEMABINDING, the select_statement must include the two-part names (schema.object) of tables, views, or user-defined functions that are referenced. All referenced objects must be in the same database.

Views or tables that participate in a view created with the SCHEMABINDING clause cannot be dropped unless that view is dropped or changed so that it no longer has schema binding. Otherwise, the Microsoft SQL Server 2005 Database Engine raises an error. Also, executing ALTER TABLE statements on tables that participate in views that have schema binding fail when these statements affect the view definition.

SCHEMABINDING cannot be specified if the view contains alias data type columns.


The first bit I emphasised clearly shows that using views in this way is a neat little feature to prevent accidental drops of a table.

The second quote is IMHO also important, as it clearly shows that the table should be “final”. If you later decide that the table in fact needs extra (or fewer) columns, the VIEW will actually make your life harder.

As an extra tip: when you create a view WITH SCHEMABINDING, always use the fully qualified name of the table (i.e. dbo.Salaries instead of just Salaries). Also, you cannot use *; instead, you will have to list all the columns in the query statement.
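A minimal schema-bound view over the Table_1 from the trigger example might look like this (the view name is made up, and I'm assuming Table_1 has a TableID column, as in the foreign key example below):

CREATE VIEW dbo.vw_Table_1
WITH SCHEMABINDING
AS
SELECT TableID
FROM dbo.Table_1

GO

With this view in place, a DROP TABLE dbo.Table_1 will fail until the view itself is dropped.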

Nice, but still no protection against a TRUNCATE.

3) In fact, the only way to prevent TRUNCATE from executing against your table is to ensure there is a foreign key relationship referencing it. If you are (like me) still working on the tables that will eventually link to the base table, you can always create a dummy table that references your base table:


CREATE TABLE [dbo].[Table_2]
(
[TableID] [int] NULL,
[Table_1ID] [int] NULL
)
ON [PRIMARY]

GO
ALTER TABLE [dbo].[Table_2]
WITH CHECK
ADD CONSTRAINT
[FK_Table_2_Table_1] FOREIGN KEY([Table_1ID])
REFERENCES [dbo].[Table_1] ([TableID])
GO
ALTER TABLE [dbo].[Table_2] CHECK CONSTRAINT [FK_Table_2_Table_1]



Now when you try to TRUNCATE Table_1, you will receive the following error:


Msg 4712, Level 16, State 1, Line 1
Cannot truncate table 'TABLE_1' because it is being referenced by a FOREIGN KEY constraint.



Well, that concludes this post. I hope it helps prevent you from accidentally destroying some of your hard work. Of course, all comments are welcome :)

