October 2004 - Posts

31,459 days later...

In the last 31,459 days, among other things we've seen two world wars, men walking on the moon, and the rise and fall of Pets.com...

But what we haven't seen is the Red Sox win a World Series....

Populating Links

Both of my regular readers (one of whom is my only fan) will notice that I'm populating a bunch of links into the margin of the blog.

I just want to state outright: I don't support any companies I'm linking to. I don't agree with the content I'm linking to. I don't like anything or anyone affiliated with what I'm linking to.

The reason I'm posting these links is not that I want you to see them. Rather, I have way too many things bookmarked in my browser and I want to clean them up. So I'm pasting the interesting-looking ones in here for posterity. That way, I don't lose the links, and maybe there's something good embedded in there.

Or maybe I'll really piss someone off. Bonus!!!

No, stored procedures are NOT bad

I recently found a rather old post from Frans Bouma's blog, "Stored procedures are bad, m'key?".

Since comments are closed (he posted it almost a year ago), I have to reply here.

I'll state my bottom line at the top: Stored procedures are not only not bad, they're necessary for maintaining loosely coupled, application agnostic, flexible databases. And they're also necessary for maintaining loosely coupled, database agnostic, flexible database-based applications.

Mr. Bouma's argument centers around three claims: Ad-hoc queries are easier to construct in application code, stored procedures don't provide greater security than the application already does, and in certain cases stored procedure execution plan caching can be detrimental to performance. And all three of these arguments are true! But none of them has any bearing whatsoever on the real benefits of stored procedures.

Let's have a quick refresher on Computer Science 101, as it applies to object-oriented design: Loose coupling and high cohesion. What does that mean in plain English? Each module, or object, should have one very specific job, which it can perform for a VARIETY of other modules or objects on request (high cohesion). Furthermore, no module or object (or as few as possible) should depend upon the inner workings of another module or object. This latter problem is known as tight coupling, and its consequences are dire. Change one piece of code, end up changing every other piece of code that uses it.

News flash for Mr. Bouma: This is the reason object-oriented programming was invented. We try to separate components from one another so that their functions can be re-used and re-applied to other modules, other objects, and if we've done our job really well, even other applications entirely.

So how does this apply to the relationship between a database and an application?

The database, undoubtedly, should be completely application unaware. The database is serving up data. It has no idea what application is requesting the data, or whether an application is requesting the data. It needs to have no such idea. All it needs to do is keep serving and all is happy. Furthermore, the database doesn't care if multiple applications request data, or if those applications request the same data. The database is set up to provide a single point of access for all data requests in its domain of knowledge (i.e. the data in the database being queried).

And now to the other side, the application itself. I am an advocate of the application being as database agnostic as possible. I don't believe this is entirely possible, but it is certainly a goal to which we can aspire. Applications should request data from the database using standardized, documented interfaces, after which that data should be composed as quickly as possible (at the lowest level) into native objects. This allows for changes in the data interface to have as little ripple effect as possible in the application.

Mr. Bouma makes the claim that, "changes to a relational model will have always an impact on the application that targets that model". Again, he is correct. Which is why applications should NOT target a model. Applications should have absolutely no knowledge of database schema, including table names, column names, data types, or any other information. This is the role of stored procedures. We can, using stored procedures, completely encapsulate this metadata and provide standardized interfaces into the data.

Need to change a datatype in the database? Perhaps you won't need to change the output values that the application receives from the stored procedures. Perhaps you won't need to change the parameter input values. No change is needed in application code. Need to change a column name? Same thing. Need to re-architect the entire schema? Again, just change your stored procedures. The application will keep running as if nothing changed.

Contrast this to Mr. Bouma's suggestion that we drop stored procedures altogether and instead build all queries within application code. Suddenly, any small change in the database needs to be completely regression tested throughout not only one app, but every application that uses the database. You've created an extremely tight, perhaps unbreakable coupling between the application and the database. Large-scale changes to the schema will most likely never be possible. Will there ever really be time to re-write all of that application code?

Now, back to the three central arguments: Yes, ad-hoc queries are easier to construct in application code; but it's also argued by many database experts that ad-hoc queries are a sign of either poor database design, poor application design, or both. Either way, dynamic SQL isn't too hard to work with, and I've seen plenty of extremely ad-hoc applications in which its use is hardly a stumbling block. Next, security. It's true that if a user isn't authenticated in the application, he or she won't be able to use that application to access the data. So in that case, the stored procedure does not provide greater access control. But the same is certainly not true for every other application that uses the same database. Unless, of course, code is duplicated across every application. The database should be the final word on its data. This includes data security and data integrity. Application code simply cannot do this if the database need ever be shared. Finally, Bouma's assertions about cache plans are simply not worth touching. The WITH RECOMPILE option has been around long enough that DBAs and developers know it's there and know how to use it.

Now that we're at the bottom of this post, I'll re-state my conclusion: Stored procedures are necessary for proper object-oriented database-based application development. Not using them will force code duplication, tight coupling between applications and databases, and greatly increased man hours for every change made to either application or database.

Pattern-based replacement UDF

New article posted here.

The topic du jour is a UDF for replacing substrings based on patterns (vs. T-SQL's built-in REPLACE() function, which uses static strings).

Caveats of the TEXT datatype

Someone named "Krygim" posted the following question today in the microsoft.public.sqlserver.programming newsgroup:

"Will defaulting a TEXT field to an empty string take up more space than defaulting it to Null when no data is entered into the field. [sic]"

Before reading any further, I ask that you think for a moment and consider what you think the answer should be...

If you answered "yes", you're correct. An empty TEXT column will take up more space than a NULL one -- quite a bit more space, as it turns out. Here's the test code I posted to prove it:

USE tempdb

CREATE TABLE atable(acol TEXT NULL)

INSERT atable(acol)
SELECT NULL
FROM 	pubs..authors a,
	pubs..authors b,
	pubs..authors c,
	pubs..authors d

EXEC sp_spaceused 'atable'

CREATE TABLE btable(acol TEXT NOT NULL)

INSERT btable(acol)
SELECT ''
FROM	pubs..authors a,
	pubs..authors b,
	pubs..authors c,
	pubs..authors d

EXEC sp_spaceused 'btable'

DROP TABLE atable
DROP TABLE btable

Running this on my end tells me that the NULLs took up only 3 MB, whereas the empty strings took up 33 MB -- 11 times more space used!

I did a quick scan of Inside Microsoft SQL Server 2000 and discovered that for TEXT and IMAGE datatypes:

"If the amount of data is less than 32 KB, the text pointer in the data row points to an 84-byte text root structure."

This, in addition to the 16-byte pointer to the off-row location of the data, explains the huge difference in size between these two tables.

At this point, you're probably either wondering how to fix this or you've drifted away and are half-reading, half-thinking about what's for dinner. If you're in the latter category and not also in the former category, you either know the answer already or are not enough of a certified geek to be reading my blog. And if you're in the former category and not in the latter category, your priorities are certainly skewed, as we all know that food is more important than any stupid DBMS space wasted due to empty strings problem.

But for those still following me, the answer is one of my favorite about-to-be-deprecated features of SQL Server, the mighty text-in-row option... Re-run the second part of the previous example, with a new line inserted:

USE tempdb

CREATE TABLE btable(acol TEXT NOT NULL)

EXEC sp_tableoption 'btable', 'text in row', '24'
-- HINT: This is the new line

INSERT btable(acol)
SELECT ''
FROM	pubs..authors a,
	pubs..authors b,
	pubs..authors c,
	pubs..authors d

EXEC sp_spaceused 'btable'

DROP TABLE btable

There it is. Back down to 3 MB, even with the non-empty strings in the TEXT column. By storing the small data in-row, we've eliminated the 16-byte off-row pointer and the 84-byte root structure, and other assorted bytes that are used by the TEXT datatype that I don't know about (the math didn't quite add up when I tried to calculate where all of the space went).

So what does this tell us? I'm thinking that as a best practice, perhaps the text-in-row option should be used for every table with LOB columns, and that it should be set to around 100 bytes. This will still keep row sizes down when larger data is inserted (as it will go off-row), but it will also keep overall IO way down if the amount of rows with LOB data larger than 100 bytes is fairly sparse.

Comments?

Identifying unused stored procedures

I just noticed that my article for the November issue of SQL Server Professional is online. The topic is using server-side traces to identify unused stored procedures...

You can read the article here.

Any comments are appreciated!

Great article on ADO.NET performance

Can be found here.

In related news, another article on the same site, posted today, says that SQL Server 2005 has been delayed "from the first half of 2005 until later in the summer of that year."

Awwww yeah.

Red Sox, American League champions!

Synchronous testing

As a prerequisite for this blog, since I'm going to get into a lot of performance and scalability questions, I need some way for readers to run scripts synchronously.

When I'm testing a script for production, I usually like to start with five or so concurrent execution loops and then pump it up in a stress environment, alone, to see how far I can take it. Sometimes I'll watch various performance counters during the test, but usually I'm either feeling lazy or don't think I'll gain much insight from the counters. Sometimes it is handy to watch Profiler's SP:StmtCompleted event to get a slightly more granular picture of the performance degredation, but if I'm testing a SELECT query I'll usually just watch how many times I'm able to execute it in a given amount of time. For an INSERT/UPDATE process, I may put some logging triggers in place (of course, I need to account for them in any performance figures I produce.

Usually I'll execute my test code in a simple loop, like:

DELCARE @EndDate DATETIME
SET @EndDate = DATEADD(mi, 10, GETDATE())

WHILE @EndDate > GETDATE()
    EXEC dbo.DoStuff

DoStuff is assumed to be some stored procedure that's being tested, or perhaps a stored procedure containing a bunch of other stored procedure calls and maybe some randomization or other logic to help make the tests slightly more realistic.

I like to run tests for 10 or 20 minutes at a time. And I like to have them automatically stop in case I get called away in the middle... One time I was running an INSERT load test on a shared development machine and forgot to turn it off before going home. The next morning was not fun.

Of course, none of this has any bearing on how to get these things to run synchronously! And that's really what I'd like to know from whoever's reading this post. My usual method is to copy my loop into the clipboard, then open a new instance of Query Analyzer and start hitting CTRL-N, CTRL-V, F5 as many times as I need to. Other methods I know about are the script that Ken Henderson included in The Guru's Guide To SQL Server Architecture and Internals and the Database Hammer utility included in the Microsoft SQL Server 2000 Resource Kit. I'm not a huge fan of Henderson's script -- it's just not graphical enough for my tastes -- and I haven't tried the Database Hammer yet.

So what's the best way to load test synchronous SQL? Do people prefer using just QA, or is the Database Hammer an amazing tool? I'm considering writing my own synchronous SQL execution tool for a book I'd like to write, and if such a tool would also be useful to readers of this blog it would certainly make sense for me to write it before I get to the book stage. I know I could use a better tool -- hitting CTRL-N so many times is giving me a minor case of carpal tunnel syndrome.

Any comments on exisiting tools and techniques or wishes for these kind of tools would be appreciated...

Hola

Thanks to Donny Mack of SQLJunkies.com, I finally have some blog space. So now all I need are some topics to write about (but don't worry, I have plenty of those)...

I'm going to use this space primarily to share techniques for working with SQL Server. I'll also delve into data access with .NET/C#, with an eye towards high performance and availability. And when I'm not discussing technical matters, you may see posts about music, beer, or music and beer.

A bit about me: I am a database-oriented software engineer in Boston, MA. I'm currently working for a firm called Get Connected, where I am tasked with architecting data and data access systems for the company's wireless and broadband sales systems. I'm active on a lot of technical forums (mostly newsgroups), and was recently awarded MVP for SQL Server by Microsoft.

Please excuse the dust over the next few weeks... I'm hoping to do a lot of tweaking of the look and feel of this blog, and you'll find that it may look odd from time to time ... and thank you for reading!