August 2005 - Posts

The wonderful processing options of Analysis Services 2005

Today I deeply tested processing options of Analysis Services 2005. A complete discussion of it is available on this paper by T. K. Anand, so I will not describe what is already well written but I'd like to share my today experience.

What is more interesting to me is the handling of changing dimensions. Imagine you have a Customer dimension with a changing attribute of type I (new values of attribute overwrite the previous in the same record); this attribute can have a key and a name, let's say they are CategoryCode and CategoryName. I'll use the format [CustomerName CategoryCode CategoryName] to make samples of customer records.

First scenario: the CategoryName for a particular CategoryCode changes (for example, [Mark PRG Programmer] becomes [Mark PRG Developer]), so all Customer records change according to this. In this case, the Process Update of Customer dimension will change the Category in the dimension and you don't need to update cube to see correct values (only the displayed name changed).

Second scenario: the CategoryCode of a Category changes for a particular Customer and related CategoryName changes accordingly to it (for example, [Mark PRG Programmer] becomes [Mark MUS Musician]). In this case, the Process Update of Customer dimension will change the Category for Mark dimension member and data in cubes are updated even if you don't process the cube! If you don't have aggregation in the cube, it is not a surprise, but if you have aggregations on Category attribute than It's a kind of magic (especially if you are used to Analysis Services 2000) because it works in both cases! What's happening? Well, if there is an aggregation on an attribute that is changed during a Process Update, that aggregation is temporarily removed from the cube, while it's still present in the cube schema. At this point cube queries give correct results and aggregation not affected by dimension process update are still in use. If you want to rebuild the lost aggregation you don't have to reprocess the whole cube/measure group/partition but you only need to do a Process Index on it. Only the necessary missing indexes are generated, already existing indexes are maintained. If you are not sure what kind of Process you need on a cube, you can use the Process Default, that use the less expensive process type to give you the right data at the minimum cost.

I stressed this scenario even with related dimensions and it works very well. I really like this approach, even if I'd like to know in advance what is the necessary process step for a "process default" (but may be I missed the right documentaton page... comments are welcome!).

Cascading many-to-many dimensions: it will work

I'm waiting for the next CTP (September 05?) because it should work with cascading many-to-many dimensions. In a previous post I talked about the lack of functionality when you have more many-to-many dimensions chained together (I named this scenario "cascading many-to-many dimension"); I received feedback on this issue and it was a supported scenario, so what I described 3 months ago was a bug. What is important now is to understand performance in real-worlds scenario, because this kind of thing will tell us what dimensional modeling decisions will be best practice in different scenario.

This is a key feature to adopt Analysis Services 2005 in certain scenarios where today a SQL query could be better than a MDX one. The big challenge is that often many-to-many dimensions are paired with large dimensions (at least one). This bug prevented me to test some real-world scenario... If someone already has some "ideas" of numbers (we can't perf-test a Beta, I know...) let me know!

Dimensional modeling with Analysis Services 2005

First of all, a praise to Ralph Kimball: in the last days I read for the second time his The Data Warehouse Toolkit and this second read has been very inspiring to me. This time I've had the perspective of Analysis Services 2005 (SSAS) as OLAP target for data marts, and I have to say that many of the features of SSAS are probably inspired by this same book; when I first read the book, I had too much AS2000 limitations in my head and almost immediately forgot some design pattern. I strongly suggest reading this book, because if you interpret it in the right way you can clearly figure out the right way to use the so many new dimensional features of AS2005.

One thing that surprised me is that I always thought that hierarchy-modeling and dimensional-modeling was nearly the same concept, but I clearly was wrong and this book never sustained such a thesis! I was wrong because AS2000 distorted my perception of right modeling!

This book is definitely a guide to attribute-based dimensional modeling (even if the author never uses this definition). And this is the key to really leverage the potential of SSAS.

Now, the real intention for this post was to make some consideration (and to pose some question to the blog readers) about the relation between some relational models and the corresponding right design in SSAS 2005.

 

JUNK DIMENSION

I currently use junk dimension in the relational model even with AS2000. A junk dimension is a single table with all the significative combination (until the Cartesian product of all members) of different and not correlated indicators of flag. With AS2000 I used to place each flag/indicator in a different dimension (because, you know, these were independent hierarchies).

With SSAS 2005 I have the doubt if it would be better to place a junk dimension table in a single OLAP dimension: each attribute is naturally independent from each other to the user. I can imagine that this also improves aggregation design schema. But what about the user presentation? How to name such a dimension? I instinctively refuse to define a "flags" dimension, that has a meaning only for a geek like us but probably it's not so meaningful for the user.

MINIDIMENSION

A minidimension table usually contains some attributes of a larger dimension tables just to improve accessibility to data in fact table. Rows are less than rows in large dimension table because only the existing attribute permutations in the real dimension defines the rows for the minidimension table. In other words, it's a way to avoid the large-dimension pitfall of AS2000.
Now, with SSAS 2005 I was tempted to put any attribute of a large dimension (imagine one table with 2 millions of customers) into the dimension itself; but the fact that this new release has not anymore the scalability problem that affected large dimensions in AS2000, it could be still useful to use minidimensions to improve query performance. I still have not faithful numbers to define a "best practice" way (or to define a tradeoff to help choosing between minidimensions and regular "single" dimensions). Is there someone with more experience in this area? And, again, how to name a minidimension in a meaningful way?

SLOWLY CHANGING DIMENSION

We all already know what a slowly changing dimension (SCD) is, right? If not, it's well explained in the books on line (BOL).
My concern here is in part related to the use of minidimensions. Ralph Kimball suggests to not use SCD with large dimension, placing attributes interested for SCD in one or more minidimensions. So, if it would be true that minidimensions are beneficial to SSAS 2005 performance, it should be very important to avoid SCD on large dimensions to improve performance.
In the next weeks I will have to test this scenario (SCD on large dimension) on the field and I'm worried to see performance problems (but happy to already know a possible solution). But it would be great to share some experience with some other tester....

VSTS and SSAS 2005: another example of bad design?

Recently I've seen the multidimensional database that Visual Studio Team System use to produce statistics and reports about projects managed. There are several different cubes, each one with only one measure group, and many cubes use several dimensions shared with other cubes. I can't understand if there is a reason for this: may be yes, or may be that someone used the "AS2000 approach" to the problem, howsoever without worrying about virtual cubes...

Previously I already had bad words for BizTalk 2004 and Analysis Services 2000 integration. I think that this could be another missed opportunity for Microsoft to leverage on his own technology to build a more powerful product and to show some best practice to follow. I hope to receive some comment from MS-guys...

UPDATE: Rob Caron announced new features for RTM that seems to addess all of my concern. Well Done!