Part 3: Testing Full-Text IFilters
In this final article of my three-part series, we’ll look at a code sample that can be used to test the output of a filter component (also called an IFilter, after the interface it implements). A filter extracts text and property data from binary or formatted documents (for example, Microsoft Word or HTML documents). I won’t dwell on the fairly extensive ‘baggage’ involved in supporting the IFilter interface (there are many approaches to hooking the various pieces together); instead, I’ll spend more time explaining how to get everything working so you can experiment further with your IFilters and fine-tune the techniques used to make these components available in C#.
Filters are used at indexing time with Full-Text Search: once ‘chunks’ are extracted from a document, their text and property values are passed to the appropriate wordbreaker and are then indexed. Filter behavior can sometimes be a bit of a mystery; hopefully the code samples provided here will help make it more transparent.
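To make that flow a little more concrete, here is a minimal sketch of the chunk-to-wordbreaker-to-index sequence. It is a toy under loud assumptions: IndexingSketch, BreakWords and BuildIndex are hypothetical names invented for illustration, the "wordbreaker" simply splits on whitespace and a few punctuation marks, and nothing here touches the real IFilter or wordbreaker COM interfaces.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; none of these members belong
// to the real IFilter or wordbreaker interfaces.
public static class IndexingSketch
{
    // Naive "wordbreaker": splits chunk text on whitespace and basic
    // punctuation, then lowercases each word.
    public static IEnumerable<string> BreakWords(string chunkText)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        foreach (string word in chunkText.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word.ToLowerInvariant();
        }
    }

    // Toy inverted index: maps each word to the ids of the chunks containing it.
    public static Dictionary<string, SortedSet<int>> BuildIndex(IList<string> chunks)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        for (int chunkId = 0; chunkId < chunks.Count; chunkId++)
        {
            foreach (string word in BreakWords(chunks[chunkId]))
            {
                SortedSet<int> ids;
                if (!index.TryGetValue(word, out ids))
                {
                    ids = new SortedSet<int>();
                    index[word] = ids;
                }
                ids.Add(chunkId);
            }
        }
        return index;
    }

    public static void Main()
    {
        // Pretend these two chunks were emitted by a filter for a small document.
        var chunks = new List<string> { "Full-Text Search", "Search the text." };
        Dictionary<string, SortedSet<int>> index = BuildIndex(chunks);
        Console.WriteLine("search -> " + string.Join(",", index["search"])); // prints "search -> 0,1"
    }
}
```

The point of the sketch is only the ordering of the stages: a filter hands over chunks, a wordbreaker turns each chunk into words, and the indexer records where each word occurred. A real wordbreaker is language-aware and considerably more sophisticated than a call to Split.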
This sample is split into two files: a filter-specific file (Filter.cs) and a main driver file (main.cs). As I’ve said in the past two articles, these samples are intended for testing purposes only. Any other use (for example, adding filtering capabilities to an application) could be illegal, as many of the components we will be testing may require a license. Also note that I am posting these samples in my spare time (i.e., not on company time), and I’m making no guarantees or warranties that the code will even work :) If you don’t like them, blame me, Andrew Cencini, and not Microsoft. Use at your own risk; I am always open to comments and suggestions. Apologies, too, if the formatting is a little funky -- this example is perhaps the hairiest of the three.
Let’s first look at Filter.cs – be sure to take a swing through MSDN (search for IFilter) to brush up on some of the finer points of the IFilter interface and its baggage.
///==============================================================
/// Filter.cs
///==============================================================
using System;
using System.Text;
using System.Runtime.InteropServices;
namespace StemText
{
[Flags]
public enum IFILTER_INIT
{
NONE = 0,
CANON_PARAGRAPHS = 1,
HARD_LINE_BREAKS = 2,
CANON_HYPHENS = 4,