Lucene Custom Tokenizer

Almost every application needs search features. There are various ways to implement them, such as SQL full-text search and other non-SQL frameworks; one of them is Lucene. Lucene extracts tokens from content, indexes them once the tokenization process is complete, and applies its search algorithms to those indexed tokens.
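
To make that flow concrete, here is a minimal sketch that indexes one document in memory and then searches it. It assumes Lucene.Net 3.0.3; the field name "content" and the sample text are arbitrary choices for illustration.

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class IndexAndSearchDemo
{
    public static void Main()
    {
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Indexing: the analyzer tokenizes the field content.
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            doc.Add(new Field("content", "we investigated the mechanisms",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }

        // Searching: the query goes through the same analysis and is
        // matched against the indexed tokens.
        using (var searcher = new IndexSearcher(directory, true))
        {
            var parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
            var hits = searcher.Search(parser.Parse("mechanisms"), 10);
            Console.WriteLine("Matches: " + hits.TotalHits);
        }
    }
}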

Lucene has many built-in tokenizers, such as StandardTokenizer and WhitespaceTokenizer. But in some cases the requirements demand writing new code or extending existing code.

Let’s consider the following example:

Data: 1097-0215 (i.v) product-123 anti-virus, we investigated the mechanisms. 2266-73 In the present study

Tokens generated with StandardTokenizer:
[1097-0215] [i.v] [product-123] [anti] [virus] [we] [investigated] [the] [mechanisms] [2266-73] [In] [the] [present] [study]

Tokens generated with WhitespaceTokenizer:
[1097-0215] [(i.v)] [product-123] [anti-virus,] [we] [investigated] [the] [mechanisms.] [2266-73] [In] [the] [present] [study]
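
Both listings can be reproduced by pulling tokens off the tokenizer's attribute stream. Below is a minimal sketch, assuming Lucene.Net 3.0.3; replace StandardTokenizer with WhitespaceTokenizer (which takes only the reader) to get the second listing.

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

public static class TokenDump
{
    public static void Main()
    {
        var text = "1097-0215 (i.v) product-123 anti-virus, we investigated the mechanisms. 2266-73 In the present study";
        var tokenizer = new StandardTokenizer(Version.LUCENE_30, new StringReader(text));
        var term = tokenizer.GetAttribute<ITermAttribute>();

        // Each IncrementToken() call advances the stream to the next token.
        while (tokenizer.IncrementToken())
            Console.Write("[" + term.Term + "] ");
    }
}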

Creating a custom tokenizer is beneficial when the requirements call for controlling the tokenization process. The customization can be done with one of the following classes:

1). Lucene.Net.Analysis.CharTokenizer
2). Lucene.Net.Analysis.Tokenizer

Let’s write code, with Lucene.Net.Analysis.CharTokenizer, for a process that creates tokens containing only alphanumeric characters:

Code Steps:
1). Create a class that inherits from Lucene.Net.Analysis.CharTokenizer.
2). Override the method IsTokenChar with the appropriate logic. In our case, we return true if a character is alphanumeric.

Code Snippet:
public class AlphaNumericTokenizer : Lucene.Net.Analysis.CharTokenizer
{
    public AlphaNumericTokenizer(System.IO.TextReader input) : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Letters and digits are token characters; everything else
        // (whitespace, punctuation) acts as a token separator.
        return char.IsLetterOrDigit(c);
    }
}

Tokens generated with AlphaNumericTokenizer:
[1097] [0215] [i] [v] [product] [123] [anti] [virus] [we] [investigated] [the] [mechanisms] [2266] [73] [In] [the] [present] [study]
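
The CharTokenizer route above works whenever a single per-character test decides token boundaries. When it does not, the second option from the list, inheriting Lucene.Net.Analysis.Tokenizer directly and overriding IncrementToken, gives full control over how tokens are produced. The sketch below is illustrative only (the DigitRunTokenizer name and its digits-only rule are assumptions, not part of this article's example) and assumes Lucene.Net 3.0.3:

using System.IO;
using System.Text;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Illustrative sketch: emits runs of digits and drops everything else.
public class DigitRunTokenizer : Tokenizer
{
    private readonly ITermAttribute termAttr;

    public DigitRunTokenizer(TextReader input) : base(input)
    {
        termAttr = AddAttribute<ITermAttribute>();
    }

    public override bool IncrementToken()
    {
        ClearAttributes();
        int c;
        // Skip separators (anything that is not a digit).
        do { c = input.Read(); } while (c != -1 && !char.IsDigit((char)c));
        if (c == -1)
            return false; // end of input reached, no more tokens

        // Collect the run of consecutive digits into one token.
        var sb = new StringBuilder();
        do { sb.Append((char)c); c = input.Read(); } while (c != -1 && char.IsDigit((char)c));
        termAttr.SetTermBuffer(sb.ToString());
        return true;
    }
}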

In Lucene, a tokenizer is composed into an analyzer. Just like a tokenizer, an analyzer can be created in one of the following ways:
1). Override the behavior of an existing analyzer (like Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net.Analysis.WhitespaceAnalyzer, etc.).
2). Create a new analyzer that inherits from Lucene.Net.Analysis.Analyzer.

Code Steps (for AlphaNumericAnalyzer):
1). Create a class that inherits from Lucene.Net.Analysis.Standard.StandardAnalyzer.
2). Override the method TokenStream with the appropriate logic. In our case, we return a TokenStream that uses AlphaNumericTokenizer.

Code Snippet:
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

public class AlphaNumericAnalyzer : StandardAnalyzer
{
    public AlphaNumericAnalyzer(Version version) : base(version) { }

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // Tokenize on alphanumeric characters, then normalize the tokens:
        // StandardFilter -> LowerCaseFilter -> StopFilter (English stop words).
        TokenStream stream = new AlphaNumericTokenizer(reader);
        stream = new StandardFilter(stream);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET, true);
        return stream;
    }
}
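
To see the analyzer's effect end to end, the sketch below (again assuming Lucene.Net 3.0.3) runs the sample data through AlphaNumericAnalyzer:

using System;
using System.IO;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

public static class AnalyzerDemo
{
    public static void Main()
    {
        var analyzer = new AlphaNumericAnalyzer(Version.LUCENE_30);
        var stream = analyzer.TokenStream("content", new StringReader(
            "1097-0215 (i.v) product-123 anti-virus, we investigated the mechanisms. 2266-73 In the present study"));
        var term = stream.GetAttribute<ITermAttribute>();

        while (stream.IncrementToken())
            Console.Write("[" + term.Term + "] ");

        // The filters now apply on top of the tokenizer: stop words such as
        // [the] and [in] are removed, and the remaining tokens are lowercased.
    }
}

Passing the same analyzer to IndexWriter and QueryParser keeps indexing and searching consistent with this tokenization.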