Going from Perl to VB.NET
Some Perl people probably wouldn't consider VB.NET a good environment for doing the sorts of tasks they do with Perl. Hogwash, I say, at least in some cases.
Today, I had to -- yet again -- munge some IIS logs from our intranet: about 28 days' worth of files totalling about 3.5 gigabytes (we had over 11 million hits on that site in those 28 days). Specifically, we wanted to know the number of hits on a certain subpage. We knew we could identify that page by looking for 771 in the URI Query field, so that part was pretty easy. We also wanted to know who, by domain ID, was looking at this subpage and how many hits per day the subpage was getting. Nothing hard.
My first choice was, of course, Perl.
$start = time();
@flist = glob("ex*.log");
foreach $file (@flist) {
    print "$file\n";
    open(inf,"<$file")||die($!);
    while(<inf>) {
        if(/^\d/) {
            chomp;
            @fields = split(/ /);
            $_ = $fields[10];
            $user = $fields[3];
            $date = $fields[0];
            if(/771/) {
                $hitcountbyuser{$user}++;
                $hitcountbydate{$date}++;
            }
            $total_hits++;
        }
    }
    close(inf);
}
open(outf,">report.txt")||die($!);
print outf "Total hits = $total_hits\n\n";
print outf "Hits by user\n";
foreach $user (sort keys %hitcountbyuser) {
    print outf "$user\t$hitcountbyuser{$user}\n";
}
print outf "Hits by date\n";
foreach $date (sort keys %hitcountbydate) {
    print outf "$date\t$hitcountbydate{$date}\n";
}
$end = time();
$elapsed = $end-$start;
print outf "\nThat took roughly ".$elapsed." seconds.\n";
The only bummer about this code was that it takes about 854 seconds to complete and ties up a lot of the machine's CPU cycles. That made me wonder about rewriting this script as a VB.NET console application. That would have to be faster and wouldn't hog the CPU, right?
Well, here's the VB.NET code I came up with. I tried to keep it semantically the same as the Perl code -- basically just a translation. That didn't quite work. The first versions of this used Collections.Hashtable instead of Collections.SortedList. That's because I thought the Hashtable would be significantly faster than the SortedList (turns out, it was maybe 10 seconds faster at best). But I wanted essentially the same output from the VB.NET version and the Perl version, so I went with SortedLists instead.
Option Strict On
Option Explicit On
Imports System.Text
Imports System.Text.RegularExpressions
Imports System.IO
Imports System.Collections
Module BLP
    ' Zero-based positions of the space-separated fields in each log line.
    Private Enum IISLogColumns
        dateCol
        timeCol
        cIpCol
        csUserNameCol
        sSiteNameCol
        sComputerNameCol
        sIpCol
        sPortCol
        csMethodCol
        csUriStemCol
        csUriQueryCol
        scStatusCol
        scWin32StatusCol
        scBytesCol
        csBytesCol
        timeTakenCol
        csVersionCol
        csUserAgentCol
        csCookieCol
        csRefererCol
    End Enum
    Sub Main()
        Dim fileList() As IO.FileInfo
        Dim hitUser, hitDate, rline, curFilePath, matchField As String
        Dim dir As New IO.DirectoryInfo("..") ' note, the actual directory has been removed.
        ' Start as Nothing so the Finally block can tell whether they were ever opened.
        Dim readStream As IO.StreamReader = Nothing
        Dim writeStream As IO.StreamWriter = Nothing
        Dim fields() As String
        Dim userHits As New Collections.SortedList(255)
        Dim dateHits As New Collections.SortedList(30)
        Dim totalHits As Integer = 0
        Dim start, done, hits As Long
        Dim elapsed As TimeSpan
        start = Now.Ticks
        Try
            fileList = dir.GetFiles("ex*.log")
            For index As Integer = 0 To fileList.Length - 1
                curFilePath = fileList(index).FullName
                Console.WriteLine(curFilePath)
                readStream = New IO.StreamReader(curFilePath)
                rline = readStream.ReadLine()
                While Not (rline Is Nothing)
                    ' Data lines start with a digit (the date); header lines start with #.
                    If Regex.IsMatch(rline, "^\d") Then
                        totalHits += 1
                        fields = Split(rline, " ")
                        If fields.Length > 19 Then
                            hitUser = fields(IISLogColumns.csUserNameCol)
                            hitDate = fields(IISLogColumns.dateCol)
                            matchField = fields(IISLogColumns.csUriQueryCol)
                            ' IndexOf returns 0 for a match at the very start, so test >= 0
                            ' to count the same lines as the Perl /771/ match.
                            If matchField.IndexOf("771") >= 0 Then
                                If userHits.ContainsKey(hitUser) Then
                                    hits = CLng(userHits(hitUser))
                                    hits += 1
                                    userHits(hitUser) = hits
                                Else
                                    userHits.Add(hitUser, 1)
                                End If
                                If dateHits.ContainsKey(hitDate) Then
                                    hits = CLng(dateHits(hitDate))
                                    hits += 1
                                    dateHits(hitDate) = hits
                                Else
                                    dateHits.Add(hitDate, 1)
                                End If
                            End If
                        End If
                    End If
                    rline = readStream.ReadLine()
                End While
                readStream.Close()
            Next
            done = Now.Ticks
            elapsed = New TimeSpan(done - start)
            writeStream = New IO.StreamWriter(".\report.net.txt", False)
            writeStream.WriteLine("Total hits = {0}", totalHits)
            writeStream.WriteLine("Hits by User")
            For Each hitUser In userHits.Keys
                writeStream.WriteLine("{0} = {1}", hitUser, userHits(hitUser))
            Next
            writeStream.WriteLine("Hits by Date")
            For Each hitDate In dateHits.Keys
                writeStream.WriteLine("{0} = {1}", hitDate, dateHits(hitDate))
            Next
            writeStream.WriteLine("That took {0} seconds.", elapsed.TotalSeconds)
            writeStream.Flush()
        Catch ex As Exception
            ' Rethrow without resetting the stack trace.
            Throw
        Finally
            ' Either stream may still be Nothing if an earlier step failed.
            If Not (readStream Is Nothing) Then readStream.Close()
            If Not (writeStream Is Nothing) Then writeStream.Close()
        End Try
    End Sub
End Module
Runtime for this in a debug build was about 331 seconds, roughly 61% less processing time than the Perl script. So the next time I have a huge amount of information to parse like this, I think I might just start with VB.NET instead. Granted, I spent maybe three more minutes writing and testing the VB.NET version than I did the Perl version, but even with that, I had a net time savings of more than five minutes. That's about how long it took to blog this. <grin>
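A footnote on the Hashtable versus SortedList choice mentioned earlier: one way to keep the Hashtable for counting and still get sorted output would be to sort the keys once, at report time, the way the Perl script does with sort keys. What follows is only a minimal sketch under that assumption. The module name, the Increment helper, and the sample keys are all made up for illustration, and it has not been timed against the logs above, so whether it would recover those 10 seconds is an open question.

```vbnet
Option Strict On
Option Explicit On
Imports System
Imports System.Collections

Module HashtableSketch
    ' Hypothetical helper: bump the counter stored under key, starting at 1.
    Private Sub Increment(ByVal counts As Hashtable, ByVal key As String)
        If counts.ContainsKey(key) Then
            counts(key) = CLng(counts(key)) + 1
        Else
            counts.Add(key, 1L)
        End If
    End Sub

    Sub Main()
        Dim userHits As New Hashtable()

        ' Made-up sample keys, not real log data.
        Increment(userHits, "DOMAIN\carol")
        Increment(userHits, "DOMAIN\alice")
        Increment(userHits, "DOMAIN\carol")

        ' Copy the keys out and sort them once, instead of keeping
        ' the collection sorted on every insert.
        Dim keys As New ArrayList(userHits.Keys)
        keys.Sort()

        For Each key As String In keys
            Console.WriteLine("{0} = {1}", key, userHits(key))
        Next
    End Sub
End Module
```

ArrayList.Sort() with no arguments uses the same default comparer as a SortedList created without one, so the report order should match the SortedList version. It is not guaranteed to match Perl's sort, which compares strings differently.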