- All Implemented Interfaces:
Accountable
int.
Implementation of the PGM-Index described at https://pgm.di.unipi.it/, based on the paper
Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8): 1162-1175, 2020.
It provides rank and range search operations. indexOf() is faster than a B+Tree, and the index is much more compact. contains() is between 4x and 7x slower than IntHashSet#contains(), but between 2.5x and 3x faster than Arrays.binarySearch(long[], long).
Its compactness (40KB for 200MB of keys) makes it efficient for very large collections, since the index fits easily in the L2 cache. The epsilon parameter should be set according to the desired space-time trade-off. A smaller value makes the estimation more precise and the search range smaller, at the cost of increased space usage. In practice, an epsilon of 64 is a good sweet spot.
Internally the index uses an optimal piecewise linear mapping from keys to their position in the sorted order. This mapping is represented as a sequence of linear models (segments) which are themselves recursively indexed by other piecewise linear mappings.
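To make the idea of an epsilon-bounded piecewise linear mapping concrete, here is a minimal sketch. It is not the IntPgmIndex implementation: it assumes strictly increasing keys, builds segments with a simple greedy "shrinking cone" pass rather than the optimal algorithm from the paper, and locates the segment with a plain binary search instead of recursive models. All class and member names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the IntPgmIndex implementation. */
final class PiecewiseLinearSketch {
  /** One linear model: position ~= firstPos + slope * (key - firstKey). */
  static final class Segment {
    final int firstKey, firstPos;
    final double slope;
    Segment(int firstKey, int firstPos, double slope) {
      this.firstKey = firstKey; this.firstPos = firstPos; this.slope = slope;
    }
  }

  final int[] keys;   // strictly increasing (assumption of this sketch)
  final int epsilon;  // maximum prediction error, >= 1
  final List<Segment> segments = new ArrayList<>();

  PiecewiseLinearSketch(int[] keys, int epsilon) {
    this.keys = keys;
    this.epsilon = epsilon;
    if (keys.length == 0) return;
    int start = 0;
    double lo = 0, hi = Double.POSITIVE_INFINITY; // feasible slope interval
    for (int i = 1; i < keys.length; i++) {
      double dx = (double) keys[i] - keys[start];
      double dy = i - start;
      double newLo = Math.max(lo, (dy - epsilon) / dx);
      double newHi = Math.min(hi, (dy + epsilon) / dx);
      if (newLo <= newHi) {   // point i still fits within +/- epsilon
        lo = newLo;
        hi = newHi;
      } else {                // close the segment and start a new one at i
        segments.add(new Segment(keys[start], start, (lo + hi) / 2));
        start = i;
        lo = 0;
        hi = Double.POSITIVE_INFINITY;
      }
    }
    double slope = Double.isInfinite(hi) ? 0 : (lo + hi) / 2;
    segments.add(new Segment(keys[start], start, slope));
  }

  /** Same return convention as Arrays.binarySearch. */
  int indexOf(int key) {
    if (keys.length == 0) return -1;
    // Last segment whose firstKey <= key; a PGM-index would use recursive models here.
    int s = 0, e = segments.size() - 1;
    while (s < e) {
      int mid = (s + e + 1) >>> 1;
      if (segments.get(mid).firstKey <= key) s = mid; else e = mid - 1;
    }
    Segment seg = segments.get(s);
    // Never predict past the start of the next segment.
    long maxPos = s + 1 < segments.size() ? segments.get(s + 1).firstPos : keys.length - 1;
    double predicted = seg.firstPos + seg.slope * ((double) key - seg.firstKey);
    int p = (int) Math.max(0L, Math.min(maxPos, Math.round(predicted)));
    // Final search restricted to the epsilon window (one extra slot for rounding).
    int from = Math.max(0, p - epsilon - 1);
    int to = Math.min(keys.length, p + epsilon + 2);
    return Arrays.binarySearch(keys, from, to, key);
  }
}
```

The final binary search only touches about 2 x epsilon keys, which is why a smaller epsilon narrows the search while requiring more segments to keep every prediction within the bound.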
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | | Builds an IntPgmIndex on a provided sorted list of keys.
protected static class | | Iterator over a range of elements in a sorted array.
-
Field Summary
Fields
Modifier and Type | Field | Description
static final int | BEYOND_EPSILON_JUMP | Initial value of the exponential jump when scanning out of the epsilon range.
static final int | DOUBLE_KEY_SIZE | 2x KEY_SIZE.
static final IntPgmIndex | EMPTY | Empty immutable IntPgmIndex.
final int | epsilon | The epsilon range used to build this index.
static final int | EPSILON | Epsilon approximation range when searching the list of keys.
static final int | EPSILON_RECURSIVE | Epsilon approximation range for the segment layers.
final int | epsilonRecursive | The recursive epsilon range used to build this index.
final int | firstKey | The lowest key in keys.
static final int | KEY_SIZE | Size of a key, measured in Integer.BYTES because the key is stored in an int[].
final IntArrayList | keys | The list of keys for which this index is built.
final int | lastKey | The highest key in keys.
final int[] | levelOffsets | The offsets in segmentData of the first segment of each segment level.
static final int | SEGMENT_DATA_SIZE | Data size of a segment, measured in Integer.BYTES, because segments are stored in an int[].
final int[] | segmentData | The index data.
final int | size | The size of the key set.
-
Method Summary
Modifier and Type | Method | Description
boolean | contains(int key) | Returns whether this key set contains the given key.
<T extends IntProcedure> T | forEachInRange(T procedure, int minKey, int maxKey) | Applies procedure to the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
int | indexOf(int key) | Searches for the specified key and returns its index in the element list.
boolean | isEmpty() | Returns whether this key set is empty.
long | ramBytesAllocated() | Estimates the allocated memory.
long | ramBytesUsed() | Estimates the bytes that are actually used.
int | rangeCardinality(int minKey, int maxKey) | Returns the number of keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
 | rangeIterator(int minKey, int maxKey) | Returns an iterator over the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
int | rank(int x) | Returns, for any value x, the number of keys in the sorted list which are smaller than x.
int | size() | Returns the size of the key set.
-
Field Details
-
EMPTY
Empty immutable IntPgmIndex. -
EPSILON
public static final int EPSILON
Epsilon approximation range when searching the list of keys. Controls the size of the returned search range; must be strictly greater than 0. It should be set according to the desired space-time trade-off. A smaller value makes the estimation more precise and the range smaller, at the cost of increased space usage.
With EPSILON=64, the benchmark with 200MB of keys shows that this PGM index requires only about 0.02% additional memory on average (40KB), depending on the distribution of the keys. This epsilon value is good even for 2MB of keys. With EPSILON=32: +5% speed, but 4x space (160KB).
-
EPSILON_RECURSIVE
public static final int EPSILON_RECURSIVE
Epsilon approximation range for the segment layers. Controls the size of the search range in the hierarchical segment lists; must be strictly greater than 0.
-
KEY_SIZE
public static final int KEY_SIZE
Size of a key, measured in Integer.BYTES because the key is stored in an int[]. -
DOUBLE_KEY_SIZE
public static final int DOUBLE_KEY_SIZE
2x KEY_SIZE. -
SEGMENT_DATA_SIZE
public static final int SEGMENT_DATA_SIZE
Data size of a segment, measured in Integer.BYTES, because segments are stored in an int[]. -
BEYOND_EPSILON_JUMP
public static final int BEYOND_EPSILON_JUMP
Initial value of the exponential jump when scanning out of the epsilon range.
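The exact scanning procedure is not described on this page. As a generic illustration of an exponential jump (also called galloping search), the sketch below starts at a given position, doubles its step until it overshoots the target, then finishes with a binary search. The class, method and parameter names are hypothetical, and `initialJump` merely plays the role that a constant such as BEYOND_EPSILON_JUMP could play.

```java
/** Illustrative sketch only; not the IntPgmIndex implementation. */
final class ExponentialJumpSketch {
  /**
   * Returns the first index i >= start such that keys[i] >= key, or keys.length if none.
   * Assumes keys is sorted, 0 <= start <= keys.length, and every key before start is < key.
   */
  static int lowerBoundFrom(int[] keys, int start, int key, int initialJump) {
    int lo = start;
    int hi = start;
    long step = initialJump;
    // Gallop forward: each probe that is still too small moves lo past it and doubles the step.
    while (hi < keys.length && keys[hi] < key) {
      lo = hi + 1;
      hi = (int) Math.min((long) keys.length, hi + step);
      step *= 2;
    }
    // The answer now lies in [lo, hi]; finish with a binary search.
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (keys[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
}
```

The appeal of this pattern is that the cost grows with the logarithm of the distance to the target, so a key that lies just outside the predicted window is found after only a few probes.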
-
keys
The list of keys for which this index is built. It is sorted and may contain duplicate elements. -
size
public final int size
The size of the key set. That is, the number of distinct elements in keys. -
firstKey
public final int firstKey
The lowest key in keys. -
lastKey
public final int lastKey
The highest key in keys. -
epsilon
public final int epsilon
The epsilon range used to build this index. -
epsilonRecursive
public final int epsilonRecursive
The recursive epsilon range used to build this index. -
levelOffsets
public final int[] levelOffsets
The offsets in segmentData of the first segment of each segment level. -
segmentData
public final int[] segmentData
The index data. It contains all the segments for all the levels.
-
-
Method Details
-
size
public int size()
Returns the size of the key set. That is, the number of distinct elements in keys. -
isEmpty
public boolean isEmpty()
Returns whether this key set is empty. -
contains
public boolean contains(int key)
Returns whether this key set contains the given key. -
indexOf
public int indexOf(int key)
Searches for the specified key and returns its index in the element list. If multiple elements are equal to the specified key, there is no guarantee which one will be found.
- Returns:
- The index of the searched key if it is present; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or keys#size() if all the elements are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.
-
rank
public int rank(int x)
Returns, for any value x, the number of keys in the sorted list which are smaller than x. It is equal to indexOf(int) if x belongs to the list, or -indexOf(int)-1 otherwise. If multiple elements are equal to the specified key, there is no guarantee which one will be found.
- Returns:
- The index of the searched key if it is present; otherwise, the insertion point. The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or keys#size() if all the elements are less than the specified key. Note that this method always returns a value >= 0.
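The relation between rank and the indexOf return convention can be sketched with java.util.Arrays.binarySearch, which documents the same (-(insertion point) - 1) encoding. This sketch assumes a plain sorted int[] of distinct keys as a stand-in for the index; the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Illustrative sketch only; Arrays.binarySearch stands in for indexOf(int). */
final class RankSketch {
  /** Number of keys strictly smaller than x, assuming sorted distinct keys. */
  static int rank(int[] sortedDistinctKeys, int x) {
    int idx = Arrays.binarySearch(sortedDistinctKeys, x);
    // Found: the index itself. Not found: decode -(insertion point) - 1.
    return idx >= 0 ? idx : -idx - 1;
  }
}
```

For the keys {10, 20, 30, 40}, the key 30 is found at index 2, while the absent key 25 yields -3, which decodes to the same rank of 2.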
-
rangeCardinality
public int rangeCardinality(int minKey, int maxKey)
Returns the number of keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive). -
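One way to picture an inclusive range count is as the difference between two boundary positions in the sorted key list. The sketch below does this on a plain sorted int[]; it is an assumption about the general technique, not the library's code, and all names are hypothetical. It counts every stored occurrence, so how duplicates are treated by the actual index is not something this sketch can confirm.

```java
/** Illustrative sketch only; not the IntPgmIndex implementation. */
final class RangeCardinalitySketch {
  /** First index whose key is >= x, or keys.length if none. */
  static int lowerBound(int[] keys, int x) {
    int lo = 0, hi = keys.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (keys[mid] < x) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** First index whose key is > x, or keys.length if none. */
  static int upperBound(int[] keys, int x) {
    int lo = 0, hi = keys.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (keys[mid] <= x) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** Number of stored keys k with minKey <= k <= maxKey. */
  static int rangeCardinality(int[] keys, int minKey, int maxKey) {
    if (minKey > maxKey) return 0;
    return upperBound(keys, maxKey) - lowerBound(keys, minKey);
  }
}
```

Using a separate upper bound, rather than searching for maxKey + 1, avoids integer overflow when maxKey is Integer.MAX_VALUE.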
rangeIterator
Returns an iterator over the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive). -
forEachInRange
Applies procedure to the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive). -
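The visitor-style contract (apply a procedure to each key in range, then hand the procedure back so the caller can read its state) can be sketched as follows. This uses java.util.function.IntConsumer as a stand-in for IntProcedure and a plain sorted int[] as a stand-in for the index; both substitutions and all names are assumptions made for illustration.

```java
import java.util.function.IntConsumer;

/** Illustrative sketch only; IntConsumer stands in for IntProcedure. */
final class ForEachInRangeSketch {
  /** Applies procedure to every key k with minKey <= k <= maxKey, in sorted order. */
  static <T extends IntConsumer> T forEachInRange(int[] keys, T procedure, int minKey, int maxKey) {
    // Locate the first key >= minKey.
    int lo = 0, hi = keys.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (keys[mid] < minKey) lo = mid + 1; else hi = mid;
    }
    // Scan forward until the keys exceed maxKey.
    for (int i = lo; i < keys.length && keys[i] <= maxKey; i++) {
      procedure.accept(keys[i]);
    }
    return procedure; // returned so the caller can inspect any accumulated state
  }
}
```

Returning the procedure lets a caller pass a stateful object, such as a counter or collector, and read the result from the returned value in a single expression.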
ramBytesAllocated
public long ramBytesAllocated()
Estimates the allocated memory. It does not count the memory for the list of keys, only for the index itself.
- Specified by:
- ramBytesAllocated in interface Accountable
- Returns:
- RAM allocated, in bytes
-
ramBytesUsed
public long ramBytesUsed()
Estimates the bytes that are actually used. It does not count the memory for the list of keys, only for the index itself.
- Specified by:
- ramBytesUsed in interface Accountable
- Returns:
- RAM used, in bytes
-