Well, I think it’s time for a new post here in my blog... Yes, I’m still alive and still working in the MS BI space at Avanade... Many things changed in the last months so this blog was not on my top priority list, but maybe this will change again...
So I just want to publish some thoughts about the “Current Day” problem in SSAS... There are quite some articles about that and what I want to tell you is nothing more than bringing two posts together...
Mosha wrote quite something about this task... http://sqlblog.com/blogs/mosha/archive/2007/05/23/how-to-get-the-today-s-date-in-mdx.aspx. How to use VBA to find out the current date and how to map this with your dimension... At the end he mentions performance issues with this... And that might be right... Since this is a calculation which will be run in each MDX you send to the server there is nearly nothing it can cache... Mosha’s recommendation is to use some fixed member you change daily...
So what can you do? I came across this article... http://www.sqlbi.eu/Home/tabid/36/ctl/Details/mid/374/ItemID/8/Default.aspx. Here you find some code how to build up a simple application which changes your actual MDX script on your server...
I didn’t have time to test it but it should be pretty straight-forward to build up a tiny tool which updates a set with the actual date in your MDX script. So you have that fixed set and your performance should get better...
If someone likes to implement it and give some feedback, I’ll be more than happy to publish it...
SQL Server 2005 SP2 is finally available... http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=d07219b2-1e23-49c8-8f0c-63fa18f26d3a
I was very happy when I saw Microsoft's new BI Certifications known as MCITP (Microsoft Certified IT Professional) Business Intelligence some months ago. This certification was based on the “standard” MCTS (Microsoft Certified Technology Specialist) SQL Server which was a very generic exam covering all stuff a SQL Server offers.
Since I’m not really a good DBA I had a look at the requirements, had a look a the preparation guides and saw that this would mean quite some work for me since it covered many things I never was in touch with before (i.e. Service Broker, what are the correct parameters to run sqlcmd, ...).
Today a colleague told me that Microsoft changed these certifications yesterday. Before the new “version” of these exams you had to pass two additional exams to get an MCITP. They now changed one of the MCITP exams to the MCTS Business Intelligence (exam 70-445, no content announced, yet) which is a new requirement before starting the exam 70-446 (also no content announced, yet). The only information available is the trainings and books available (http://www.microsoft.com/learning/mcp/mcitp/bid/default.mspx). Basically it seams that 70-446 is just getting more into details than 70-445 since the books you should study are quite the same as for 70-445.
So the good news is that you have the chance to get a BI certification by passing only one exam (getting MCTS by exam 70-445) and you need only one additional exam to get an MCITP (by exam 70-446). Unfortunately both exams will be only available in early 2007 so there’s plenty of time reading through all the books…
Thomas
Today I came across some issues with “Parallel Execution” within a dataflow. Did you ever use one source system for more than one data source in a dataflow? If the queries you send of are not trivial it might affect your source system that you run these queries in parallel, which is the default for Integration Services. In the last days we struggled with a dataflow sending off 10 quite complex queries to one server using up all the memory at the source. So I was looking for a solution for this problem.
However there seams to be no solution. The only way you can handle this is by splitting up the dataflow into many and save the result i.e. in RAW files which you later use instead of querying the source.
What I would appreciate is a way to control how many queries can be send off to a connection manager at one time. I opened a suggestion for that on Microsoft Connect and everybody is invited to vote for it... http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178069
I also would like to point your attention to a suggestion Jamie posted. It’s the old issue that you can’t control how data is send to multiple destinations in a dataflow. So if you have two destinations (i.e. two tables connected by a foreign key constraint) you can’t enforce your data integrity by controlling that i.e. an order is saved before the details are added. Here’s the link to Jamie’s post and I assume that votes are appreciated as well... http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=178058
Thomas
Did you ever get this message: “Parameters cannot be extracted from the SQL command. The provider might not help to parse parameter information from the command. ...” My Avanade Colleague Marcel Franke invested some time and found out that this might be a problem with line breaks in the SQL command… Especially when you use /* comments */ in the SQL… We’re not ready for an official bug report, yet, but I came across that problem more than once and now I have an idea where to look for…
Thomas
We investigated about a solution for that… Finally my Avanade colleague Tom Lichtenstein came up with an approach weeks ago which worked perfectly… Now I finally find some time to blog about that and: well, I’m not the first one…
Did you ever want to do a Cross-Join with the Merge Join transform? There isn’t a feature like that… What you have to do is add a “dummy” column (i.e. by a Derived Column transform) in both pipelines, sort by that column and join them. That’s expensive (because of the sort) and more a hack than anything else… Perhaps you can do the Cross-Join in your SQL query of the source instead… Perhaps you can at least add the dummy column in your source queries (that works if you have two different sources you want to join) and change the sort property in the output so that SSIS things the source is already sorted by that column (so you don’t need the Sort transform anymore).
One thing you should do for sure: vote for a change in the Merge Join transform so that we’ll see a Cross-Join in the future… https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=161154
Thomas
Since nobody else on my bloglist reported this (at least I didn't notice)... Here you find the KB article with the fixed issues: http://support.microsoft.com/kb/913090/en-us. And here is the download: http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=cb6c71ea-d649-47ff-9176-e7cac58fd4bc
Wow, I really expected to see that download later... But I got some hints that it will we available in April since every vendor has to ship the version TPC benchmarks are based on a defined time after publishing the results. I'm not aware of that timeframe, but from that timeframe you could expect to see SP1 before May...
Let's see how things developed...
Thomas
I just issued a suggestion for the next SSIS version… I decided to make it public here just to ensure that I get lots of votes for it… ;-) Here is the description and at the end of the post you’ll find the link for the Product Feedback Center where you can help to make it more important…
I know that this is quite an effort, but I guess it's worth that... Currently SSIS is very "generic" when it comes to common BI tasks. You can do virtually everything with SSIS what also means that you have do invent everything on your own on the other hand. I would be very happy if you can still work very "generic" but Microsoft provides a "standard" way to go ("best practice") that makes BI with SSIS/SQL Server much easier and faster.
One "standard" issue is the process of creating/updating dimension tables, including the assignment of surrogate keys and linking them with fact tables.
The Lookup transform currently fetches the reference data at the package start-up time. If you have i.e. a star schema to populate then you have to prepare the dimension tables first before you can work on the fact table because otherwise the lookup for surrogate keys will not work. This is even worse if you think of snowflake schemas. You have to work from the "outer leaves" table by table to get to the "base". There are workarounds for that (I blogged about that http://sqljunkies.com/WebLog/tpagel/archive/2005/08/19/16495.aspx) but they are very complex and a nightmare to maintain.
So this is my idea:
Make the Lookup transform "dynamic" by allowing updates to the underlying data. The lookup still would be populated at package start-up. But some things will be different....
-
You still define a connection manager
-
You define a table for the lookup data (better no query (only views), otherwise this would be more complicated...)
-
A mapping of the pipeline to that table (for lookups/inserts/updates)
-
Within the mapping you define
a) This is the lookup key for matching the records (may be more than one field, like today with the lookup
b) This is the data I'm looking for (the values returned from the lookup, in this case this would be the surrogate key in most cases, but it might not be limited to that). For existing data you get the data out of the cache. Otherwise the record is created. If you have a surrogate key in that table you can use an own mechanism (i.e. a script) to calculate it and populate it by mapping a pipeline field to the table's field storing the surrogate key (this would also mean that you have some kind of "feedback" that you need a new key for the next record, maybe a different output for created records or a "status" variable indicating that you have to calculate a new key) or you can use a "build-in" surrogate key which is calculated automatically (you should have an option to auto-increment the last issued surrogate key). Then the cache is automatically updated (see below).
c) This is a field I want to be saved in the table (for new records, this is not to be "cached" for lookups, it's only transferred to the database for new records, only if it's marked to be "returned", than it's cached, too! So the cache is permanently updated with new records!)
d) This is a field I want to be saved in the table (also for updates, it's still not cached but updated data will be pushed to the database (i.e. changes will be committed at the end of the package), only if it’s marked to be “returned”, than it’s cached, too (like the insert))
- Maybe c and d can be omitted if inserts/updates are always done for all mapped, this would make thinks easier for the users. Maybe you simply select if updates should be committed to the table or not…
I think this can be done by “extending” the current Lookup transform. Maybe it’s a good idea to create a new transform for it.
It might be a good idea to have that for SCD type 2, too. The SCD wizard is a great thing, however it adds complexity to the project and it lacks some functionality (surrogate key generation, updated cache, …). I would keep the wizard as it is today… I just won’t dare to ask to add SCD type 2 support to this new transform, too. It’s just giving “BI Best Practices” in an easy intuitive way to the users. This will not be a 100% solution for everyone but a 90% solution offering much benefit for a huge number of users. I didn’t think about that so much but it shouldn’t be too hard… It might be a combo box where you choose between “no updates”, “overwrite updates” and “track history”. Then you need something to “outdate” the “old” record. This might be something you define in the “role” of a field (if it’s integer, it might be a “actual” flag, if it’s datetime it might be “valid from” or “valid to” (which might come from the pipeline or might be auto assigned). This might be enough for the beginning and people could get packages much easier to maintain and the learning curve wouldn’t be as steep as today…
How should that be used?
Well, this might be something like a “Validate your pipeline against the dimension table”. So if you have a package for updating a dimension table you might have a “dynamic lookup” at the end of the data flow. Before that you prepare the dimension’s data. The “dynamic lookup” checks if the data is already present in the dimension table, assigns new surrogate keys (maybe also tracks the history if SCD type 2 is not impossible ;-) ) and updates the dimension table.
If you use it in the fact load process, you can have it to create inferred members (or late arriving dimension entries). You get that for free ;-) just while finding out the surrogate keys (foreign keys) to use in your fact table…
So if you like these ideas I would like to ask your for your vote at http://lab.msdn.microsoft.com/ProductFeedback/viewFeedback.aspx?feedbackId=FDBK43386.
Thomas
Perhaps not everyone knows that it’s very easy to call stored procedures from a Data Flow and return values into the pipeline.
Just take this procedure:
CREATE PROCEDURE dbo.sp_test
@param1 int,
@param2 int,
@param3 int output
AS
BEGIN
select @param3=@param1+@param2
END
GO
Well, quite trivial, but it should work as an example…
So just imagine that you have two integer fields in your pipeline and want to have these added. The result should be in the pipeline for later use (i.e. to write it into a table). So what do you have to do?
Just add a field to the pipeline (i.e. use a Derived Column transform for that assigning a default value to it).
Then you can add an OLE DB transform to call the stored procedure. The command will look like this:
exec dbo.sp_test ?,?,? output
Then you simply have to do the column mapping for the three parameters. And that’s it…
I didn’t do any performance testing on that, but it would be quite interesting how good this performs… I’m quite curious if calling a stored procedure (perhaps even some C#/VB.net code) is faster than calling a VB.net script…
Thomas
Quite a while ago Brian Welker talked about a delivery extension to Reporting Services to directly send reports to SharePoint. Well, it’s not there and it will still take some time (Office 12 will change the situation). However I was at a customer and he asked for exactly this functionality. So I thought about different options…
The easiest solution would be just to use web folders and data driven subscriptions (or just subscriptions). You just send the reports to a network share which is a SharePoint document library. Well, that’s really easy. But unfortunately you need the “Web Client” service for that and this is disabled at the customer’s site. No idea why but somebody told me that this is quite common…
So the other option would be to leverage web services. Too bad, Microsoft “forgot” to include a “document upload” web service. There’s an example for a “workaround” in MSDN (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/odc_sp2003_ta/html/odc_writingcustomwebservicesforsppt.asp describes how to add a document upload web service to SharePoint) but since the customer has quite a lot of SharePoint servers, deployment would be a problem.
Since Reporting Services 2000 SP2 you have that nice WebParts you can use. But the customer didn’t like to do the administration for all the users accessing reports. He just wanted some static reports in SharePoint which are refreshed automatically.
So the standard subscriptions were an option, too. But if you deliver rendered reports the mail data