July 2004 - Posts

Measure Groups, Fact Tables, and Benevolent Bugs


In Yukon, there's a new concept called Measure Groups. A Measure group consists of the group of measures that come from a single fact table (get it? measure group...group of measures?).

Actually, measure groups are pretty cool. They let you put easily measures with different grain into the same cube. Standard example is Actual Sales stored at the day level and Budget Sales stored at the month level.

There is, however, a slight, er, inconvenience in UI in Beta 2. You can only make one measure group from a single fact table. Once you “use up” a fact table for a measure group, it no longer appears in the list of available tables. The problem is that sometimes you need to make two measure groups from the same fact table. One reason is if you have, say, budget and actual values in different fact table rows and want them to be in the cube as different measures. A bigger reason is if you have a Distinct Count measure. A critical performance key for Distinct Count measures is that each one should be in a measure group all by itself. So if you have other measures in the same fact table as a distinct count item, you need two measure groups. Oops.

They're going to fix the UI before the product ships, but in the meantime, there's a cool way to get around the problem. You open the XML representation of the cube, copy the chunk for Measure Group A, and modify the copy to create Measure Group B. The UI's list of available tables is none the wiser.

Bottom line is that the UI bug helps you learn how to take advantage of the XML definition, which, in a backwards kind of way, is a good thing.


DTS Loops


Anybody who has ever written--or tried to write--a loop in DTS 2000 would prefer to do the beer-with-Drano-chaser routine. The whole process is an incredible spaghetti hack of lying to steps about whether they've already been executed.

Not surprising that Yukon DTS has an easy-to-use Loop task. Very surprising not to have to use it. To demonstrate the Loop task, I built a package that loops through a bunch of text files in a folder, unioning the results into a table--simple, but real-world, blah, blah. I was a bit annoyed to have to deal with the results of each file in a separate iteration, but not bad.

Then, in a completely unrelated package, I happened to create a data source--a “connection manager” in Yukon-speak--against text files, and I happened to use a wild-card string for the file name. And the data flow happened to automatically grab the data from all the matching files. No Loop task. No separate iterations. Totally unnerving. Totally cool. Totally wipes out my Loop task example package.


New family member


My daughter just gave birth to my second grandchild--first grandson. I put a picture of them in my Personal gallery. (I'm not proud about it or anything.)

Have a good weekend,
Reed


Using a DSV to make a sandbox


I'm a firm believer in working in a very small dataset while exploring, experimenting, and testing alternatives. Having a small sandbox for doing an experiment cuts the turnaround and simplifies controlling variables. After building a nice DTS package to create a sandbox database, I just realized how easy it is to do it directly in AS. For two reasons.

First, AS has a new layer called a Data Source View (DSV) that basically buffers the logical metadata. I just added WHERE (ProductKey % 50 = 1) to the DSV table definition for the Product dimension and instantly got a small subset. Of course, you could always do this by creating a view--but that would then require making changes within the source database and also switching the table source in the AS designer. By doing it in the DSV, it's isolated from the SQL database and doesn't require any changes to anything.

Second, AS has a new “Unknown Member” concept. In 2000, discarding dimension members would have either discarded all the orphaned fact rows, or just failed with “dimension member not found”, depending on how you built it. In 2005, you can lump all the fact rows with invalid dimension members into an Unknown bucket for each dimension. This means a) that the totals tie out to the original data source, and b) subsetting multiple dimensions still leaves useful combinations--often with the Unknown members of other dimensions. Even a relatively large fact table processes very quickly with small dimensions, so this doesn't mess up the “quick turnaround“ benefit of the sandbox, and if it did, you could just use the DSV to subset the fact table too.

It's also really easy to keep two copies of the DSV - one filtered, one not - so I can switch back to the full dataset version with no problem.


Spreadsheets and Databases


One of the first projects I did as a programmer at Hewlett-Packard back in the early 80s was to build a planning tool using this cool new “PC” (with a whopping 10 MB hard drive!). In selecting a development platform, I went through a bizarre oscillation between Lotus 1-2-3 and R:Base. I would work in the spreadsheet until I couldn't stand the limitations any more. Then I switched to the database until I couldn't stand it any more. And back, and forth. Finally, I ended up building a quirky hybrid solution, frustrated at how hard it was to exploit the benefits of the two architectures. I continued living in the twilight zone suspended between spreadsheets and databases--moving from 1-2-3 to Excel, and from R:Base to Access to SQL Server--and always trying to find better ways to integrate the calculations of a spreadsheet with the scalability of a database.

Enter OLAP. It's easy to think of Analysis Services as a database architecture, with pre-stored aggregated data. But it's at least as valid to think of a cube as a hyper-spreadsheet. When you write an MDX expression, you have direct addressable access to any cell defined in the logical space of the cube. This is exactly analogous to the way a spreadsheet formula gives you direct addressable access to any cell in the workbook, and it's conceptually backwards from the typical database concept of correlated subqueries.


Ascend training scheduling


I've started to get requests for scheding information for the Ascend training. I think the schedules are still pretty tentative. If you want information, or want to be sure you're informed of training plans, please email to the alias ascend@microsoft.com.

Mosha's Excellent Link for all things OLAP


If you're using--or thinking about using--Analysis Services, you need to use Mosha's web site.  It's mostly AS2000, but there are a couple of Yukon articles referenced as well.

http://www.mosha.com/msolap/index.htm


UDM, MDX, and SQL


One of the major new features in Yukon BI is the Unified Dimensional Model, or UDM, and one of the big benefits of the UDM is the seamless integration of relational and multi-dimensional (MOLAP) data. The idea is that you define the data logically and then decide separately how parts should be physically implement--using purely relational storage, by adding explicit MOLAP aggregations, or by some cool new combinations. It's a great concept.

Most people I've talked to who have heard about the UDM in Yukon have interpreted the idea of “seamless integration“ to mean that they could arbitrarily write either SQL or MDX queries against the UDM. I have to admit that I am one of those who made that assumption. The reality is that to use the UDX you always (for all practical purposes) use MDX, regardless of whether the data is stored relationally or multi-dimensionally. When I first found out that accessing the UDM requires MDX, I was a bit, shall we say, disappointed. I'm a big fan of MDX, and know that there are a lot of upsides to using MDX for reporting, but it has always been plagued by a few serious “downsides“ that made it unusable in some situations, so it was hard to envision building all reporting queries using MDX.

But most of the downsides of MDX have actually been limitations of Analysis Services 2000. For example, it's easy in SQL to create a query to group by, say, Invoice Number, even if there are 10 million invoices. (Not smart, necessarily, but easy.) In AS2K (at least 32 bit), it was simply impossible to create that kind of query. But the limitation was actually how big the dimension could be, not the MDX query per se.

Other limitations that were part of MDX have been removed in Yukon. For example, in SQL, if the filter is WHERE Type In (1,2,3), but there is no type 2, the result set simply does not include item 2. In AS2K MDX, if you request Types 1,2,3 on rows, and there is no type 2, the query fail. This is simply fixed in Yukon.

Some other time, I'll wax eloquent about the upsides of MDX as a reporting language.

Reed


Hitachi Consulting Yukon BI Blog


I just got back from a nice week cultivating cancer cells on a beach in Mexico when I found out that the Ascend team had asked us to start a blog with our experiences as we develop our course for Yukon BI developers. So here we are. Thanks to SQL Junkies for the space--getting the Hitachi parent company to let us add a blog to the official web site would take a bit longer than a week.

In this space, I'll keep you posted on the cool new stuff in Yukon BI.

Ciao,
Reed