posted on Tuesday, February 15, 2005 1:44 PM by amachanic

Pattern-based split string

"hickymanz" asked in the SQL Server Central forums for a method of counting unique words in a text column. Wayne Lawton recommended using a string split function, which was a good idea, but not quite adequate for the job in my opinion.

Typical string split functions, like this one that I wrote can handle only a single delimiter, e.g. a comma. But in the case of splitting for unique words you want all sorts of possible delimiters -- punctuation marks, white space including spaces, possibly numerics, etc.

I believe I've seen other, similar requests for getting unique words before, so I decided to solve the problem. I modified that split string function to accept a pattern of valid characters for the second argument. Anything NOT in the pattern will be treated as a delimiter. I figured that would be a bit more flexible than the other way around, so that people don't have to type in (or figure out) every single possible white space and/or punctuation character.

Anyway, I think the modification is pretty straightforward if you understand the previous function, so check it out and if you have any questions feel free to ask. Here is the function:

CREATE FUNCTION dbo.SplitStringPattern
(
	@List TEXT,
	@Pattern VARCHAR(50)
)
RETURNS @ReturnTbl TABLE (OutParam VARCHAR(20))
WITH SCHEMABINDING
AS
BEGIN
	DECLARE @LeftSplit VARCHAR(7998)
	DECLARE @SplitStart INT SET @SplitStart = 0
	DECLARE @SplitEnd INT
	SET @SplitEnd = 7998

	SET @Pattern = '%' + '[^' + RIGHT(@Pattern, LEN(@Pattern) - 1) + '%'

	DECLARE @Delimiter CHAR(1)
	SELECT @Delimiter = CHAR(MAX(Number))
	FROM dbo.Numbers
	WHERE CHAR(Number) LIKE @Pattern

	SELECT @SplitEnd = MAX(Number)
	FROM dbo.Numbers
	WHERE (SUBSTRING(@List, Number, 1) LIKE @Pattern
			OR Number = DATALENGTH(@List) + 1)
		AND Number BETWEEN @SplitStart AND @SplitEnd

	WHILE @SplitStart < DATALENGTH(@List) - 1
	BEGIN
		SET @LeftSplit = @Delimiter + SUBSTRING(@List, @SplitStart, @SplitEnd - @SplitStart) + @Delimiter

		INSERT @ReturnTbl (OutParam)
		SELECT LTRIM(RTRIM(SUBSTRING(@LeftSplit, Number + 1,
	                    PATINDEX(@Pattern, SUBSTRING(@LeftSplit, Number + 1, LEN(@LeftSplit))) - 1)))
	                AS Value
		FROM   dbo.Numbers
		WHERE  Number <= LEN(@LeftSplit) - 1
			AND SUBSTRING(@LeftSplit, Number, 1) LIKE @Pattern
			AND SUBSTRING(@LeftSplit, Number + 1,
	                    	PATINDEX(@Pattern, SUBSTRING(@LeftSplit, Number + 1, LEN(@LeftSplit))) - 1) <> ''

		SET @SplitStart = @SplitEnd + 1
		SET @SplitEnd = @SplitEnd + 7998

		SELECT @SplitEnd = MAX(Number) + @SplitStart
		FROM dbo.Numbers
		WHERE (SUBSTRING(@List, Number + @SplitStart, 1) LIKE @Pattern
				OR Number+@SplitStart = DATALENGTH(@List) + 1)
			AND Number BETWEEN 1 AND @SplitEnd - @SplitStart
	END

	RETURN
END

A note on the pattern for input: The pattern should be single-character based, and must be delimited with [ ]. Also, you must type in the actual characters in the pattern. I do not recommend using ranges; ranges cause strange side-effects because they tend to include (depending on collation) characters that you wouldn't expect to show up (e.g. characters with umlauts over them). So unless that's what you really want, don't use ranges.

Here's a usage example:

SELECT OutParam
FROM dbo.SplitStringPattern('This is a test.', '[abcdefghijklmnopqrstuvwxyz]')


OutParam             
-------------
This
is
a
test

Enjoy!

Comments

# New UDF - Pattern-based split string @ Tuesday, February 15, 2005 1:46 PM

New UDF - Pattern-based split string

amachanic

# SQL Server User Defined Function performance follow up @ Thursday, December 01, 2005 5:59 PM

http://www.luschny.de/math/factorial/SrcCsharp/index.html
http://sqljunkies.com/WebLog/amachanic/articles/PatternSplitString.asp...

Anonymous

# SQL Server User Defined Function performance follow up @ Sunday, December 04, 2005 4:17 PM

http://www.luschny.de/math/factorial/SrcCsharp/index.html
http://sqljunkies.com/WebLog/amachanic/articles/PatternSplitString.asp...

Anonymous

# SQL Server User Defined Function performance follow up @ Tuesday, May 23, 2006 4:29 AM

http://www.luschny.de/math/factorial/SrcCsharp/index.html
http://sqljunkies.com/WebLog/amachanic/articles/PatternSplitString.asp...

Anonymous

# SQL Server User Defined Function performance follow up @ Tuesday, May 23, 2006 5:27 AM

http://www.luschny.de/math/factorial/SrcCsharp/index.html
http://sqljunkies.com/WebLog/amachanic/articles/PatternSplitString.asp...

Anonymous

# Pattern-based split string @ Thursday, July 13, 2006 12:31 AM

Originally posted here.

&quot;hickymanz&quot; asked in the SQL Server Central forums
for a method of counting...

Anonymous

# Pattern-based split string @ Monday, January 08, 2007 2:33 PM

Originally posted here . &quot;hickymanz&quot; asked in the SQL Server Central forums for a method of counting

Anonymous

# SQL Server User Defined Function performance follow up @ Sunday, June 03, 2007 4:08 PM

http://www.luschny.de/math/factorial/SrcCsharp/index.html http://sqljunkies.com/WebLog/amachanic/articles/PatternSplitString.aspx

Anonymous