public class PreprocessingContext.AllWords extends Object
Information about all unique words found in the input PreprocessingContext.documents. An entry in each parallel array corresponds to one
conflated form of a word. For example, data and DATA will most likely become
a single entry in the words table. However, different grammatical forms of a single lemma
(like computer and computers) will have different entries in the
words table. See PreprocessingContext.AllStems for inflection-conflated versions.
All arrays in this class have the same length, and values stored at the same index in different arrays describe the same word.
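The parallel-array layout described above can be sketched as follows. This is a stand-alone illustration only: the class `ParallelArraysSketch`, its data, and the `describe` helper are hypothetical and are not part of the library. Only the field names `image`, `tf` and `stemIndex` mirror the ones documented here.

```java
// Stand-in for illustration; this is NOT the library class.
final class ParallelArraysSketch {
    // Hypothetical data: index i in every array describes the same word.
    static final char[][] IMAGE = {
        "data".toCharArray(), "computer".toCharArray(), "computers".toCharArray()
    };
    static final int[] TF = {7, 3, 2};
    // "computer" and "computers" are separate words but share one stem entry.
    static final int[] STEM_INDEX = {0, 1, 1};

    /** Collects the values stored for one word across the parallel arrays. */
    static String describe(int wordIndex) {
        return new String(IMAGE[wordIndex])
            + " tf=" + TF[wordIndex]
            + " stem=" + STEM_INDEX[wordIndex];
    }

    public static void main(String[] args) {
        for (int i = 0; i < IMAGE.length; i++) {
            System.out.println(describe(i));
        }
    }
}
```

Reading one word means reading position `i` from each array, which is why all arrays must have the same length.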
| Modifier and Type | Field and Description |
|---|---|
| byte[] | fieldIndices: Bit-packed indices of all fields in which this word appears at least once. |
| char[][] | image: The most frequently appearing variant of the word with respect to case. |
| int[] | stemIndex: A pointer to the PreprocessingContext.AllStems arrays for this word. |
| int[] | tf: Term Frequency of the word, aggregated across all variants with respect to case. |
| int[][] | tfByDocument: Term Frequency of the word for each document. |
| short[] | type: Token type of this word copied from PreprocessingContext.AllTokens.type. |
| Constructor and Description |
|---|
| PreprocessingContext.AllWords() |
public char[][] image
The most frequently appearing variant of the word with respect to case.
This array is produced by CaseNormalizer.
public short[] type
Token type of this word copied from PreprocessingContext.AllTokens.type. Additional
flags are set for each word by
CaseNormalizer and LanguageModelStemmer.
This array is produced by CaseNormalizer.
This array is modified by LanguageModelStemmer.
See Also: ITokenizer
public int[] tf
Term Frequency of the word, aggregated across all variants with respect to case.
This array is produced by CaseNormalizer.
public int[][] tfByDocument
Term Frequency of the word for each document. Elements at even indices contain document
indices pointing to PreprocessingContext.documents; elements at odd indices contain the
frequency of the word in that document. For example, an array with 4 values:
[2, 15, 138, 7] means that the word appeared 15 times in the document
at index 2 and 7 times in the document at index 138.
This array is produced by CaseNormalizer. The order of documents in this
array is not defined.
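A minimal sketch of reading such an interleaved array, using the [2, 15, 138, 7] example from the description. The class `TfByDocumentSketch` and its methods `decode` and `total` are hypothetical helpers written for this illustration; they are not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the library.
final class TfByDocumentSketch {
    /** Maps document index to the word's frequency in that document. */
    static Map<Integer, Integer> decode(int[] pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("Expected (document, frequency) pairs");
        }
        Map<Integer, Integer> byDocument = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            // Even position: document index. Odd position: frequency.
            byDocument.put(pairs[i], pairs[i + 1]);
        }
        return byDocument;
    }

    /** Sums the frequencies stored at the odd positions. */
    static int total(int[] pairs) {
        int sum = 0;
        for (int i = 1; i < pairs.length; i += 2) {
            sum += pairs[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] example = {2, 15, 138, 7};
        System.out.println(decode(example)); // prints {2=15, 138=7}
        System.out.println(total(example));  // prints 22
    }
}
```

Because the order of documents in the array is not defined, code that needs a particular document should scan the even positions rather than assume they are sorted.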
public int[] stemIndex
A pointer to the PreprocessingContext.AllStems arrays for this word.
This array is produced by LanguageModelStemmer.
public byte[] fieldIndices
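Bit-packed indices of all fields in which this word appears at least once. Positions of the set bits
are indices into the PreprocessingContext.AllFields arrays. Fast conversion between the bit-packed representation
and byte[] with index values is done by PreprocessingContext.toFieldIndexes(byte).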
This array is produced by CaseNormalizer.
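To illustrate what a bit-packed representation of field indices could look like, here is a minimal sketch. It assumes that bit i being set means the word occurs in the field at index i; that bit order is an assumption, not something this page states. The class `FieldIndicesSketch` and its `unpack` method are hypothetical and do not reproduce the implementation of PreprocessingContext.toFieldIndexes(byte).

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not the library implementation.
final class FieldIndicesSketch {
    /** Expands a bit-packed byte into the positions of its set bits. */
    static byte[] unpack(byte packed) {
        int bits = packed & 0xFF;              // treat the byte as unsigned
        byte[] indices = new byte[Integer.bitCount(bits)];
        int next = 0;
        for (int i = 0; i < 8; i++) {
            if ((bits & (1 << i)) != 0) {      // assumption: bit i <=> field i
                indices[next++] = (byte) i;
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        // 0b0000_0101 has bits 0 and 2 set.
        System.out.println(Arrays.toString(unpack((byte) 0b0000_0101))); // prints [0, 2]
    }
}
```

A single byte can flag at most eight fields under this scheme, which keeps the per-word storage to one byte.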