The TextAnalyzer assumes that text values are “plain text” (i.e., no metadata or mark-up). For each field value it tokenizes, the TextAnalyzer generates zero or more terms: each contiguous sequence of letters, digits, and/or apostrophes becomes a lowercased term, and any run of whitespace or punctuation acts as a separator. A term can contain an apostrophe, but it cannot begin or end with one. For example, the apostrophe is included in the term doesn’t, but the outer apostrophes in the sequence ‘tough’ are excluded, yielding the term tough. An apostrophe is any of the following characters:
Though not shown above, the TextAnalyzer also creates a term equal to the field’s entire value, down-cased and enclosed in single quotes. This term is used as an optimization for equality searches. For example, the whole-field value for the field From is 'john smith'. For text fields with large values, this “whole field” value is created as an MD5 value instead of the literal text.
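The tokenizing rules and the whole-field term described above can be approximated in a few lines of Python. This is only a sketch, not the TextAnalyzer's actual code: the function names, the set of characters treated as apostrophes, the length threshold that triggers the MD5 form, and the choice to hash the down-cased text and keep the single quotes around the digest are all assumptions made for illustration.

```python
import hashlib
import re

# Assumption: the list of apostrophe characters is not reproduced here, so this
# sketch treats ASCII ' and the curly quotes U+2018/U+2019 as apostrophes.
APOSTROPHES = "'\u2018\u2019"

# Hypothetical threshold; the source only says "large values".
MAX_LITERAL_LENGTH = 64

# A run of letters, digits, and/or apostrophes. [^\W_] is "word character
# except underscore", i.e. letters and digits.
_TERM_RUN = re.compile(r"(?:[^\W_]|[" + APOSTROPHES + r"])+")


def tokenize(value):
    """Return the lowercased terms found in a plain-text field value."""
    terms = []
    for run in _TERM_RUN.findall(value):
        # Inner apostrophes are kept; leading/trailing ones are dropped.
        term = run.strip(APOSTROPHES).lower()
        if term:  # a run made only of apostrophes yields no term
            terms.append(term)
    return terms


def whole_field_term(value):
    """Return the down-cased, single-quoted whole-field term."""
    lowered = value.lower()
    if len(lowered) > MAX_LITERAL_LENGTH:
        # Assumption: the MD5 is taken over the down-cased UTF-8 text
        # and is quoted like the literal form.
        lowered = hashlib.md5(lowered.encode("utf-8")).hexdigest()
    return "'" + lowered + "'"


print(tokenize("He doesn\u2019t like \u2018tough\u2019 talk"))
print(whole_field_term("John Smith"))
```

Under these assumptions, the first call keeps the apostrophe inside doesn’t and drops the quotes around tough, and the second call produces 'john smith'.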
The terms generated by the TextAnalyzer allow efficient execution of a wide range of full-text queries: single terms, phrases, wildcard terms, range clauses, etc. Searches are performed without case sensitivity: for example, the phrase “You’re Scheduled to MOVE” will match the Body field shown above.
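One way to see why phrase searches are case-insensitive is to tokenize both the phrase and the field value with the same rules and then compare the resulting term sequences. The sketch below does this in Python. It is an illustration only: the helper names, the apostrophe set, and the sample Body text are hypothetical, and a real index would look up terms and positions rather than rescan the field value.

```python
import re

# Assumption: ASCII ' and the curly quotes U+2018/U+2019 count as apostrophes.
APOSTROPHES = "'\u2018\u2019"
_TERM_RUN = re.compile(r"(?:[^\W_]|[" + APOSTROPHES + r"])+")


def terms(text):
    """Lowercased letter/digit/apostrophe runs, outer apostrophes removed."""
    stripped = (run.strip(APOSTROPHES).lower() for run in _TERM_RUN.findall(text))
    return [t for t in stripped if t]


def phrase_matches(phrase, field_value):
    """True if the phrase's terms occur consecutively in the field's terms."""
    wanted = terms(phrase)
    have = terms(field_value)
    if not wanted:
        return False
    n = len(wanted)
    return any(have[i:i + n] == wanted for i in range(len(have) - n + 1))


# Hypothetical Body value, used only for this illustration.
body = "Reminder: you\u2019re scheduled to move on Friday."
print(phrase_matches("You\u2019re Scheduled to MOVE", body))
print(phrase_matches("move to", body))
```

Because both sides are lowercased during tokenizing, the mixed-case phrase matches, while the reordered phrase "move to" does not, since phrase matching depends on term order.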