3.7.1.2.1 TextAnalyzer

The TextAnalyzer assumes that text values are “plain text” (i.e., no metadata or mark-up). For each field value it tokenizes, the TextAnalyzer generates zero or more terms as lowercased letter/digit/apostrophe sequences separated by consecutive whitespace/punctuation sequences. That is, each contiguous sequence of letters, digits, and/or apostrophes becomes a term. A term can contain an apostrophe, but it cannot begin or end with an apostrophe. For example, the apostrophe is included in doesn’t, but outer apostrophes in the sequence ‘tough’ are excluded, yielding the term tough. An apostrophe is any of the following characters:

• The Unicode APOSTROPHE (0x27)

• The Windows right single quote (0x92)

• The Unicode RIGHT SINGLE QUOTATION MARK (0x2019)

As an example of how text fields are tokenized by the TextAnalyzer, suppose an email object is created with the following text field values, all indexed with the TextAnalyzer:

From: John Smith

To: Betty Sue

Subject: The Office Move

Body: Hi Betty,
Just a reminder that you’re scheduled to move to your “fancy” new office tomorrow, number B413. If you have any questions, please let me know.
Thanks, John.

The TextAnalyzer indexes these fields to generate the following terms:

Field Name	Terms
From	john smith
To	betty sue
Subject	move office the
Body	a any b413 betty fancy have hi if john just know let me move new number office please questions reminder scheduled thanks that to tomorrow you your you're

As shown, terms are extracted in lowercase, and punctuation and whitespace are removed. As part of down-casing, the TextAnalyzer converts any apostrophe retained within a term to the “straight apostrophe” character (0x27). Although a term may appear multiple times within a field, it is indexed only once.
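The tokenization rules described in this section can be sketched in Python as follows. This is a minimal illustration and not the TextAnalyzer's actual implementation. The names tokenize and field_terms are hypothetical, "letters and digits" is assumed to mean Unicode alphanumeric characters, and the Windows right single quote (0x92) is assumed to arrive as the code point U+0092.

```python
import re

# The three apostrophe characters listed in this section. Treating the
# Windows byte 0x92 as code point U+0092 is an assumption of this sketch.
APOSTROPHES = "\u0027\u0092\u2019"

# One run of letters, digits, and/or apostrophes. [^\W_] matches Unicode
# letters and digits (an assumption); everything else acts as a separator.
_RUN = re.compile(r"(?:[^\W_]|[" + APOSTROPHES + r"])+")


def tokenize(text):
    """Return the terms of a text value in order of appearance (hypothetical helper)."""
    terms = []
    for run in _RUN.findall(text):
        # A term cannot begin or end with an apostrophe.
        term = run.strip(APOSTROPHES).lower()
        # Apostrophes retained inside a term become the straight apostrophe.
        term = term.replace("\u0092", "'").replace("\u2019", "'")
        if term:  # a run made only of apostrophes yields no term
            terms.append(term)
    return terms


def field_terms(text):
    """Return the distinct terms of a field value; each term is kept only once."""
    return sorted(set(tokenize(text)))


print(tokenize("It doesn\u2019t look \u2018tough\u2019"))
# prints ['it', "doesn't", 'look', 'tough']
```

Applying field_terms to the Body value of the example email should yield the same set of terms as the table above, although the sort order of terms that contain an apostrophe may differ.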

Though not shown above, the TextAnalyzer also creates a term equal to the entire field value, down-cased and enclosed in single quotes. This term is used as an optimization for equality searches. For example, the whole-field value for the field From is 'john smith'. For text fields with large values, this “whole field” value is created as an MD5 value instead of the literal text.
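The whole-field term can be pictured with the short Python sketch below. Several details are assumptions, since this section does not specify them: the function name whole_field_term is hypothetical, the cutoff of 64 characters for a "large" value is invented for illustration, and the sketch assumes the MD5 value is computed over the down-cased UTF-8 text, written as hexadecimal, and still enclosed in single quotes.

```python
import hashlib

# Hypothetical cutoff: the text says only that "large" values are hashed.
MAX_LITERAL_LENGTH = 64


def whole_field_term(value, max_literal_length=MAX_LITERAL_LENGTH):
    """Build the quoted whole-field term for a field value (illustrative only)."""
    lowered = value.lower()
    if len(lowered) > max_literal_length:
        # Assumption: hash the down-cased UTF-8 bytes and keep the hex digest.
        lowered = hashlib.md5(lowered.encode("utf-8")).hexdigest()
    return "'" + lowered + "'"


print(whole_field_term("John Smith"))
# prints 'john smith'
```

Because the term is built from the down-cased value, an equality search for JOHN SMITH would map to the same whole-field term as the stored value.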

The terms generated by the TextAnalyzer allow efficient execution of a wide range of full-text queries: single terms, phrases, wildcard terms, range clauses, etc. Searches are performed without case sensitivity: for example, the phrase “You’re Scheduled to MOVE” will match the Body field shown above.
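The case-insensitive phrase match mentioned above can be illustrated with a small Python sketch. It does not describe how the search engine actually evaluates phrases. It only shows that, once both the phrase and the field value are reduced to lowercased terms by the rules of this section, the phrase matches when its terms appear consecutively in the field. The names ordered_terms and phrase_matches are hypothetical, and the handling of 0x92 as U+0092 is an assumption.

```python
import re

# Apostrophe characters from this section (0x92 as U+0092 is an assumption).
APOSTROPHES = "\u0027\u0092\u2019"
_RUN = re.compile(r"(?:[^\W_]|[" + APOSTROPHES + r"])+")


def ordered_terms(text):
    """Lowercased terms of a text value, in order of appearance."""
    terms = []
    for run in _RUN.findall(text):
        term = run.strip(APOSTROPHES).lower()
        term = term.replace("\u0092", "'").replace("\u2019", "'")
        if term:
            terms.append(term)
    return terms


def phrase_matches(phrase, field_value):
    """True if the phrase's terms occur consecutively in the field value."""
    wanted = ordered_terms(phrase)
    have = ordered_terms(field_value)
    if not wanted:
        return False
    n = len(wanted)
    # Slide a window of n terms across the field and compare.
    return any(have[i:i + n] == wanted for i in range(len(have) - n + 1))


body = "Just a reminder that you\u2019re scheduled to move to your new office."
print(phrase_matches("You\u2019re Scheduled to MOVE", body))
# prints True
```

The curly apostrophe in the stored text and a straight apostrophe typed in a query both normalize to the same term, so either spelling of the phrase matches.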