Enjoy Every Sandwich

Thoughts on SQL, XML, .NET and sometimes beer.




Going from Perl to VB.NET

Some Perl people probably wouldn't consider VB.NET a good environment for doing the sorts of tasks they do with Perl. Hogwash, I say, at least in some cases.

Today, I had to -- yet again -- munge some IIS logs from our Intranet: about 28 days' worth of files totalling about 3.5 gigabytes (we had over 11 million hits on that site in those 28 days). Specifically, we wanted to know the number of hits on a certain subpage. We knew we could identify that page by looking for 771 in the URI Query field, so that part was pretty easy. We also wanted to know who, by domain ID, was looking at this subpage and how many hits per day the subpage was getting. Nothing hard.

My first choice was, of course, Perl.


$start = time();
@flist = glob("ex*.log");
foreach $file (@flist) {
	print "$file\n";
	open(inf,"<$file")||die($!);
	while(<inf>) {
		if(/^\d/) {
			chomp;
			@fields = split(/ /);
			$_ = $fields[10];
			$user = $fields[3];
			$date = $fields[0];
			if(/771/) {
				$hitcountbyuser{$user}++;
				$hitcountbydate{$date}++;
			}
			$total_hits++;
		}
	}
	close(inf);
}
open(outf,">report.txt")||die($!);
print outf "Total hits = $total_hits\n\n";
print outf "Hits by user\n";
foreach $user (sort keys %hitcountbyuser) {
	print outf "$user\t$hitcountbyuser{$user}\n";
}
print outf "Hits by date\n";
foreach $date (sort keys %hitcountbydate) {
	print outf "$date\t$hitcountbydate{$date}\n";
}
$end = time();
$elapsed = $end-$start;
print outf "\nThat took roughly ".$elapsed." seconds.\n";

The only bummer about this code is that it takes about 854 seconds to complete and ties up a lot of the machine's CPU cycles. That made me wonder about rewriting this script as a VB.NET console application. That would have to be faster and wouldn't hog the CPU, right?

Well, here's the VB.NET code I came up with. I tried to keep it semantically the same as the Perl code -- basically just a translation. That didn't quite work. The first versions of this used Collections.Hashtable instead of Collections.SortedList. That's because I thought the Hashtable would be significantly faster than the SortedList (turns out, it was maybe 10 seconds faster at best). But then, I wanted essentially the same output from the VB.NET version and the Perl version, so I went with SortedLists instead.


Option Strict On
Option Explicit On 
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.IO
Imports System.Collections
Module BLP
    Private Enum IISLogColumns
        dateCol
        timeCol
        cIpCol
        csUserNameCol
        sSiteNameCol
        sComputerNameCol
        sIpCol
        sPortCol
        csMethodCol
        csUriStemCol
        csUriQueryCol
        scStatusCol
        scWin32StatusCol
        scBytesCol
        csBytesCol
        timeTakenCol
        csVersionCol
        csUserAgentCol
        csCookieCol
        csRefererCol
    End Enum
    Sub Main()
        Dim fileList() As IO.FileInfo
        Dim hitUser, hitDate, rline, curFilePath, matchField As String
        Dim dir As New IO.DirectoryInfo("..") ' note, the actual directory has been removed.
        Dim readStream As IO.StreamReader
        Dim writeStream As IO.StreamWriter
        Dim fields() As String
        Dim userHits As New Collections.SortedList(255)
        Dim dateHits As New Collections.SortedList(30)
        Dim totalHits As Integer = 0
        Dim start, done, hits As Long
        Dim elapsed As TimeSpan

        start = Now.Ticks
        Try
            fileList = dir.GetFiles("ex*.log")
            For index As Integer = 0 To fileList.Length - 1
                curFilePath = fileList(index).FullName
                Console.WriteLine(curFilePath)
                readStream = New IO.StreamReader(curFilePath)
                rline = readStream.ReadLine()
                While Not (rline Is Nothing)
                    If Regex.IsMatch(rline, "^\d") Then
                        totalHits += 1
                        fields = Split(rline, " ")
                        If fields.Length > 19 Then
                            hitUser = fields(IISLogColumns.csUserNameCol)
                            hitDate = fields(IISLogColumns.dateCol)
                            matchField = fields(IISLogColumns.csUriQueryCol)
                            ' >= 0 so a match at the very start of the field counts too, as /771/ does in the Perl version.
                            If matchField.IndexOf("771") >= 0 Then
                                If userHits.ContainsKey(hitUser) Then
                                    hits = CLng(userHits(hitUser))
                                    hits += 1
                                    userHits(hitUser) = hits
                                Else
                                    userHits.Add(hitUser, 1)
                                End If
                                If dateHits.ContainsKey(hitDate) Then
                                    hits = CLng(dateHits(hitDate))
                                    hits += 1
                                    dateHits(hitDate) = hits
                                Else
                                    dateHits.Add(hitDate, 1)
                                End If
                            End If
                        End If
                    End If
                    rline = readStream.ReadLine()
                End While
                readStream.Close()
            Next
            done = Now.Ticks
            elapsed = New TimeSpan(done - start)
            writeStream = New IO.StreamWriter(".\report.net.txt", False)
            writeStream.WriteLine("Total hits = {0}", totalHits)
            writeStream.WriteLine("Hits by User")
            For Each hitUser In userHits.Keys
                writeStream.WriteLine("{0} = {1}", hitUser, userHits(hitUser))
            Next
            writeStream.WriteLine("Hits by Date")
            For Each hitDate In dateHits.Keys
                writeStream.WriteLine("{0} = {1}", hitDate, dateHits(hitDate))
            Next
            writeStream.WriteLine("That took {0} seconds.", elapsed.TotalSeconds)
            writeStream.Flush()
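        Catch
            ' Rethrow as-is; "Throw ex" would reset the stack trace.
            Throw
        Finally
            ' Either stream may still be Nothing if an exception was thrown early.
            If Not (readStream Is Nothing) Then readStream.Close()
            If Not (writeStream Is Nothing) Then writeStream.Close()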
        End Try
    End Sub
End Module

Runtime for this in a debug build was about 331 seconds, about a 61% improvement in processing time. So the next time I have a huge amount of information to parse like this, I think I might just start with VB.NET instead. Granted, I spent maybe three more minutes writing and testing the VB.NET version than I did the Perl version, but even with that, I had a net time savings of more than five minutes. That's about as long as it took to blog this. <grin>
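
For the curious, the Hashtable approach mentioned earlier can be sketched roughly as follows. This is a minimal illustration and not the original first version, which isn't shown here. The helper CountHit, the module name and the sample user names are all made up for the example. The idea is to count into an unsorted Hashtable and sort the keys only once, at report time, which is roughly what the Perl script does with sort keys.

```vb
Option Strict On
Option Explicit On
Imports System
Imports System.Collections

Module HashtableSketch
    ' Hypothetical helper (not in the original program): add one to the
    ' count stored under userName, starting at 1 for a new name.
    Private Sub CountHit(ByVal counts As Hashtable, ByVal userName As String)
        If counts.ContainsKey(userName) Then
            counts(userName) = CInt(counts(userName)) + 1
        Else
            counts.Add(userName, 1)
        End If
    End Sub

    Sub Main()
        Dim userHits As New Hashtable()

        ' Made-up sample values standing in for the cs-username field.
        CountHit(userHits, "DOMAIN\carol")
        CountHit(userHits, "DOMAIN\alice")
        CountHit(userHits, "DOMAIN\carol")

        ' A Hashtable enumerates in no particular order, so copy the keys
        ' into an ArrayList and sort them once, just before reporting.
        Dim names As New ArrayList(userHits.Keys)
        names.Sort()
        For Each entry As Object In names
            Console.WriteLine("{0} = {1}", entry, userHits(entry))
        Next
    End Sub
End Module
```

With only a few hundred distinct users and about 28 dates, the one-time sort is cheap either way, which fits the small difference I saw between the two collection types. One caveat: ArrayList.Sort uses the default, culture-aware string comparison, so the ordering may not match Perl's sort exactly for every key.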

posted on Friday, December 05, 2003 10:31 AM by ktegels




