posted on Thursday, March 10, 2005 8:25 AM by amachanic

Tokenize UDF

Yes, another string splitting UDF from a guy who's obvioiusly become obsessed with TSQL string splitting. This time we delve into a mysterious world that I call, "Tokenization."

So what is Tokenization? It's a word I made up for this problem.

But what is it, really? It's splitting up a string based on a delimiter -- in this case, a comma -- but being wary of substring delimiters. In this case, that's a pair of apostrophes, because that's what TSQL uses for strings.

I think this is best illustrated with an example string:

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

The basic split string function that you can find will produce the following output:

SELECT * 
FROM dbo.SplitString(@Tokens, ',')

OutParam
-------------
a
'b'
''c'
'd'
'e''
f
'1
2
3
4'

Well, that's wrong. Because what I want to do is maintain the substrings (or, "tokens," as I like to call them -- thus, Tokenization!)

The output I desire is:

Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

Notice that substrings -- delimited with apostrophes -- should be maintained.

And here's how I've solved this problem...

CREATE FUNCTION dbo.Tokenize
(
	@Input NVARCHAR(2000)
)
RETURNS @Tokens TABLE 
	(
		TokenNum INT IDENTITY(1,1),
		Token NVARCHAR(2000)
	)
AS
BEGIN
	DECLARE @i INT SET @i = 0
	DECLARE @StartChar INT SET @StartChar = 1
	DECLARE @Quote INT SET @Quote = 0	

	DECLARE @Chars TABLE 
	(
		CharNum INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
		TheChar CHAR(1), 
		TheCount INT,
		StartChar INT
	)

	SET @Input = ' , ' + @Input + ' , '
	
	INSERT @Chars (TheChar)
	SELECT SUBSTRING(@Input, n.Number, 1)
	FROM Numbers n
	WHERE n.Number > 0 
		AND n.Number <= LEN(@Input)
	ORDER BY n.Number
	
	UPDATE Chars SET 
		@i = Chars.TheCount = 
			CASE 
				WHEN Chars1.TheChar = ',' 
					AND @Quote % 2 = 0 THEN 0 
				ELSE @i + 1 
			END,
		@Quote = 
			CASE  
				WHEN Chars1.TheChar = '''' THEN @Quote + 1 
				WHEN @i = 0 THEN 0 
				ELSE @Quote 
			END,
		@StartChar = Chars.StartChar =
			CASE
				WHEN @i = 1 THEN Chars1.CharNum - 1
				WHEN @i = 0 THEN @StartChar + 1
				ELSE @StartChar
			END
	FROM @Chars Chars
	JOIN @Chars Chars1 ON Chars1.CharNum = Chars.CharNum + 1

	INSERT @Tokens(Token)
	SELECT
		RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1)))
	FROM (
		SELECT StartChar, CharNum
		FROM @Chars
		WHERE TheCount = 0

		UNION ALL

		SELECT 
			MAX
			(
				CASE TheCount 
					WHEN 0 THEN CharNum 
					ELSE 0 
				END
			) + 1, 
			MAX(CharNum)
		FROM @Chars
	) x
	WHERE RTRIM(LTRIM(SUBSTRING(@Input, StartChar, CharNum - StartChar + 1))) NOT IN ('', ',')
	ORDER BY x.StartChar
	RETURN
END

A word of warning: This UDF uses the undocumented -- and unsupported -- "aggregate update" functionality. I've tested thoroughly in this case and believe it works perfectly (and it sure is handy!), but I would advise you to not use it in your own projects without extensive testing! MS doesn't support this one, so handle with care.

And by the way, you need a numbers table to use this thing. Of course.

As for using this thing, it's pretty easy:

DECLARE @Tokens VARCHAR(50)

SET @Tokens = 'a, ''b'', ''''c'', ''d'', ''e'''', f, ''1,2,3,4'''

SELECT Token
FROM dbo.Tokenize(@Tokens)


Token
--------
a
'b'
''c', 'd', 'e''
f
'1,2,3,4'

... and it even appears to work properly!

Enjoy... and application for this and other strange things I've been posting recently coming very, very soon.

Comments

# Yet another string splitting UDF: Tokenize @ Thursday, March 10, 2005 8:28 AM

Yet another string splitting UDF: Tokenize

amachanic

# Tokenize UDF @ Thursday, July 13, 2006 12:35 AM

Originally posted here.
Yes, another string splitting UDF from a guy who's obvioiusly become
obsessed...

Anonymous

# Tokenize UDF @ Thursday, July 13, 2006 12:35 AM

Originally posted here.

Yes, another string splitting UDF from a guy who's obvioiusly become
obsessed...

Anonymous

# Tokenize UDF @ Monday, January 08, 2007 2:33 PM

Originally posted here . Yes, another string splitting UDF from a guy who's obvioiusly become obsessed

Anonymous