Monday, February 28, 2005 - Posts

Data warehousing, data quality, and data vs. information

An interesting day in all of the e-newsletters I subscribe to.

SQLServerCentral's "Database Daily" newsletter included a link to this article, which states that 50% of all data warehousing projects are doomed to failure! Why? "The main problems ... centre on a lack of attention to data quality issues."

This is interesting, in that one of the first things I learned when studying data warehousing is what the second letter in "ETL" stands for -- TRANSFORM. Part of that transformation is data cleansing! The warehouse itself should be the final arbiter of what data gets in, and what data gets out -- and "garbage in/garbage out" certainly applies. It's humorous to me that SQL databases have had features like CHECK constraints for so long, and yet they're so rarely used in most shops. How many constraints do you have on the tables in your current project? Are you really constraining to make sure no trash gets in, or are you letting the application handle some of the burden?
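For anyone who hasn't seen a CHECK constraint do its job, here is a minimal sketch. It is only an illustration under some assumptions: the `orders` table, its columns, and the `try_insert` helper are hypothetical, and it uses SQLite through Python's built-in `sqlite3` module rather than SQL Server so that it runs anywhere. The point is just that the database itself refuses the bad rows, no matter what the application does.

```python
import sqlite3

# Hypothetical table for illustration; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity > 0),
        status   TEXT    NOT NULL
                 CHECK (status IN ('open', 'shipped', 'cancelled'))
    )
""")

def try_insert(quantity, status):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (quantity, status) VALUES (?, ?)",
                (quantity, status),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by SQLite when a CHECK (or NOT NULL) constraint fails.
        return False

good = try_insert(5, "open")           # valid row
bad_qty = try_insert(-3, "open")       # violates quantity > 0
bad_status = try_insert(1, "unknown")  # violates the status list
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Only the first row should make it into the table. The CHECK syntax is much the same in SQL Server, though the error you get back will look different.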

On to the SQL Server Worldwide Users Group newsletter, which included three interesting links dealing with data vs. information. What is data, and when does it become information? Some very good reading here:

Interesting that both newsletters posted such related topics. Why is it that we have all of these data warehouses being developed with no conception of what a data warehouse is? In my opinion, a data warehouse is nothing more -- or less -- than a tool by which data becomes information. Of what use is dirty, flawed information? And how will it help business owners make the kinds of sound decisions that will drive up your paycheck? Data quality is of utmost importance in any database project -- not just a data warehouse -- yet many database professionals I encounter never implement CHECK constraints, and many seem to fear even basic declarative referential integrity (foreign key constraints? natural primary keys?).
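Declarative referential integrity is nothing to fear, either. The sketch below is an assumption-laden illustration, not anyone's production schema: the `customers` and `invoices` tables, the natural key `customer_code`, and the `add_invoice` helper are all made up, and it again runs on SQLite via Python's `sqlite3` module instead of SQL Server. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, a quirk SQL Server does not share.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement is off unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: a natural primary key and a foreign key that points at it.
conn.execute("""
    CREATE TABLE customers (
        customer_code TEXT PRIMARY KEY,
        name          TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE invoices (
        invoice_id    INTEGER PRIMARY KEY,
        customer_code TEXT NOT NULL REFERENCES customers (customer_code),
        amount        NUMERIC NOT NULL CHECK (amount >= 0)
    )
""")
conn.execute("INSERT INTO customers VALUES ('ACME', 'Acme Corp')")
conn.commit()

def add_invoice(customer_code, amount):
    """Return True if the invoice is stored, False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO invoices (customer_code, amount) VALUES (?, ?)",
                (customer_code, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = add_invoice("ACME", 100)        # customer exists
orphan = add_invoice("NOSUCH", 100)  # no such customer, so the FK rejects it
invoice_count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

The orphaned invoice never lands in the table, which is exactly the kind of trash that otherwise ends up in the warehouse.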

I'm saddened by what I see as a trend toward lower data quality rather than higher. As more software engineers jump into database work without paying attention to the foundations, without thinking about data as more than just a collection of bytes, this problem will continue to intensify. So if you have some time today, read some of these articles and spend a bit of time thinking about data quality within your organization. There is no reason that we as database professionals shouldn't be doing all we can to combat bad data, at every possible level.