Class DoublePgmIndex

java.lang.Object
com.carrotsearch.hppc.DoublePgmIndex
All Implemented Interfaces:
Accountable

@Generated(date="2024-06-04T15:20:17+0200", value="KTypePgmIndex.java") public class DoublePgmIndex extends Object implements Accountable
Space-efficient index that enables fast rank/range search operations on a sorted sequence of double.

Implementation of the PGM-Index described at https://pgm.di.unipi.it/, based on the paper

   Paolo Ferragina and Giorgio Vinciguerra.
   The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds.
   PVLDB, 13(8): 1162-1175, 2020.
 
It provides rank and range search operations. indexOf() is faster than a B+Tree lookup, and the index is much more compact. contains() is between 4x and 7x slower than IntHashSet#contains(), but between 2.5x and 3x faster than Arrays.binarySearch(long[], long).

Its compactness (40KB for 200MB of keys) makes it efficient for very large collections, because the index fits easily in the L2 cache. The epsilon parameter should be set according to the desired space-time trade-off. A smaller value makes the position estimate more precise and the search range smaller, at the cost of increased space usage. In practice, an epsilon of 64 is a good trade-off.

Internally the index uses an optimal piecewise linear mapping from keys to their position in the sorted order. This mapping is represented as a sequence of linear models (segments) which are themselves recursively indexed by other piecewise linear mappings.
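
As a rough illustration of the idea above, the sketch below uses a single linear model to predict a key's position and then searches only within epsilon of that prediction. It is a simplified stand-in, not HPPC's implementation: the class and method names are hypothetical, the slope and intercept are assumed to be given, and the real index uses many segments organized in recursive levels.

```java
import java.util.Arrays;

/** Hypothetical illustration of an epsilon-bounded search; not HPPC code. */
final class SegmentSearchSketch {

  /** One linear model: position is approximately slope * (key - firstKey) + intercept. */
  static int approximatePosition(
      double key, double firstKey, double slope, double intercept, int size) {
    long pos = Math.round(slope * (key - firstKey) + intercept);
    // Clamp the prediction to a valid array index.
    return (int) Math.max(0, Math.min(size - 1, pos));
  }

  /**
   * Binary search restricted to [predicted - epsilon, predicted + epsilon].
   * Assumes the model's error never exceeds epsilon for keys in the array.
   */
  static int search(double[] keys, double key, double slope, double intercept, int epsilon) {
    int predicted = approximatePosition(key, keys[0], slope, intercept, keys.length);
    int from = Math.max(0, predicted - epsilon);
    int to = Math.min(keys.length, predicted + epsilon + 1); // exclusive bound
    return Arrays.binarySearch(keys, from, to, key);
  }
}
```

A smaller epsilon narrows the window passed to the binary search, but a model that tight generally needs more segments to cover the keys, which is the space-time trade-off described above.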

  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    Builds a DoublePgmIndex on a provided sorted list of keys.
    protected static class 
    Iterator over a range of elements in a sorted array.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    BEYOND_EPSILON_JUMP
    Initial value of the exponential jump when scanning out of the epsilon range.
    static final int
    DOUBLE_KEY_SIZE
    static final DoublePgmIndex
    EMPTY
    Empty immutable DoublePgmIndex.
    final int
    epsilon
    The epsilon range used to build this index.
    static final int
    EPSILON
    Epsilon approximation range when searching the list of keys.
    static final int
    EPSILON_RECURSIVE
    Epsilon approximation range for the segments layers.
    final int
    epsilonRecursive
    The recursive epsilon range used to build this index.
    final double
    firstKey
    The lowest key in keys.
    static final int
    KEY_SIZE
    Size of a key, measured in Integer.BYTES because the key is stored in an int[].
    final DoubleArrayList
    keys
    The list of keys for which this index is built.
    final double
    lastKey
    The highest key in keys.
    final int[]
    levelOffsets
    The offsets in segmentData of the first segment of each segment level.
    static final int
    SEGMENT_DATA_SIZE
    Data size of a segment, measured in Integer.BYTES, because segments are stored in an int[].
    final int[]
    segmentData
    The index data.
    final int
    size
    The size of the key set.
  • Method Summary

    Modifier and Type
    Method
    Description
    boolean
    contains(double key)
    Returns whether this key set contains the given key.
    <T extends DoubleProcedure>
    T
    forEachInRange(T procedure, double minKey, double maxKey)
    Applies procedure to the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
    int
    indexOf(double key)
    Searches for the specified key, and returns its index in the element list.
    boolean
    isEmpty()
    Returns whether this key set is empty.
    long
    ramBytesAllocated()
    Estimates the allocated memory.
    long
    ramBytesUsed()
    Estimates the bytes that are actually used.
    int
    rangeCardinality(double minKey, double maxKey)
    Returns the number of keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
    Iterator<DoubleCursor>
    rangeIterator(double minKey, double maxKey)
    Returns an iterator over the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
    int
    rank(double x)
    Returns, for any value x, the number of keys in the sorted list which are smaller than x.
    int
    size()
    Returns the size of the key set.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • EMPTY

      public static final DoublePgmIndex EMPTY
      Empty immutable DoublePgmIndex.
    • EPSILON

      public static final int EPSILON
      Epsilon approximation range when searching the list of keys. Controls the size of the returned search range, strictly greater than 0. It should be set according to the desired space-time trade-off. A smaller value makes the estimation more precise and the range smaller but at the cost of increased space usage.

      With EPSILON=64, the benchmark with 200MB of keys shows that this PGM index requires only about 0.02% additional memory on average (40KB), depending on the distribution of the keys. This epsilon value works well even for 2MB of keys. With EPSILON=32, lookups are about 5% faster, but the index takes 4x the space (160KB).

    • EPSILON_RECURSIVE

      public static final int EPSILON_RECURSIVE
      Epsilon approximation range for the segments layers. Controls the size of the search range in the hierarchical segment lists, strictly greater than 0.
    • KEY_SIZE

      public static final int KEY_SIZE
      Size of a key, measured in Integer.BYTES because the key is stored in an int[].
    • DOUBLE_KEY_SIZE

      public static final int DOUBLE_KEY_SIZE
    • SEGMENT_DATA_SIZE

      public static final int SEGMENT_DATA_SIZE
      Data size of a segment, measured in Integer.BYTES, because segments are stored in an int[].
    • BEYOND_EPSILON_JUMP

      public static final int BEYOND_EPSILON_JUMP
      Initial value of the exponential jump when scanning out of the epsilon range.
    • keys

      public final DoubleArrayList keys
      The list of keys for which this index is built. It is sorted and may contain duplicate elements.
    • size

      public final int size
      The size of the key set. That is, the number of distinct elements in keys.
    • firstKey

      public final double firstKey
      The lowest key in keys.
    • lastKey

      public final double lastKey
      The highest key in keys.
    • epsilon

      public final int epsilon
      The epsilon range used to build this index.
    • epsilonRecursive

      public final int epsilonRecursive
      The recursive epsilon range used to build this index.
    • levelOffsets

      public final int[] levelOffsets
      The offsets in segmentData of the first segment of each segment level.
    • segmentData

      public final int[] segmentData
      The index data. It contains all the segments for all the levels.
  • Method Details

    • size

      public int size()
      Returns the size of the key set. That is, the number of distinct elements in keys.
    • isEmpty

      public boolean isEmpty()
      Returns whether this key set is empty.
    • contains

      public boolean contains(double key)
      Returns whether this key set contains the given key.
    • indexOf

      public int indexOf(double key)
      Searches for the specified key, and returns its index in the element list. If multiple elements are equal to the specified key, there is no guarantee which one will be found.
      Returns:
      The index of the searched key if it is present; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or keys.size() if all the elements are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.
    • rank

      public int rank(double x)
      Returns, for any value x, the number of keys in the sorted list which are smaller than x. It is equal to indexOf(double) if x belongs to the list, or -indexOf(double)-1 otherwise.

      If multiple elements are equal to the specified key, there is no guarantee which one will be found.

      Returns:
      The index of the searched key if it is present; otherwise, the insertion point. The insertion point is defined as the point at which the key would be inserted into the list: the index of the first element greater than the key, or keys.size() if all the elements are less than the specified key. Note that this method always returns a value >= 0.
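
The relationship between rank(double) and indexOf(double) can be sketched without the index itself. The example below is a hedged illustration with hypothetical names: it uses Arrays.binarySearch, which follows the same (-(insertion point) - 1) convention, as a stand-in for indexOf, and assumes a sorted array of distinct keys.

```java
import java.util.Arrays;

/** Hypothetical illustration of the rank/indexOf relationship; not HPPC code. */
final class RankSketch {

  /** Converts an indexOf-style result into a rank, which is always >= 0. */
  static int rankFromIndexOf(int indexOfResult) {
    // Found: the index itself. Not found: decode the insertion point.
    return indexOfResult >= 0 ? indexOfResult : -indexOfResult - 1;
  }

  /** Rank of x in a sorted array of distinct keys (stand-in for the index lookup). */
  static int rank(double[] sortedDistinctKeys, double x) {
    return rankFromIndexOf(Arrays.binarySearch(sortedDistinctKeys, x));
  }
}
```

With distinct keys, the result equals the number of keys strictly smaller than x, whether or not x is present.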
    • rangeCardinality

      public int rangeCardinality(double minKey, double maxKey)
      Returns the number of keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
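
The inclusive range semantics can be illustrated on a plain sorted array by subtracting two bounds. This is a minimal sketch with hypothetical names, not the library's code: it uses ordinary binary searches instead of the learned index, and it assumes every element of the array is counted, duplicates included.

```java
/** Hypothetical illustration of an inclusive range count; not HPPC code. */
final class RangeCardinalitySketch {

  /** Index of the first element >= key, that is, the number of elements < key. */
  static int lowerBound(double[] sorted, double key) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** Index of the first element > key, that is, the number of elements <= key. */
  static int upperBound(double[] sorted, double key) {
    int lo = 0, hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= key) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  /** Counts the elements e with minKey <= e <= maxKey. */
  static int rangeCardinality(double[] sorted, double minKey, double maxKey) {
    if (minKey > maxKey) return 0; // empty range
    return upperBound(sorted, maxKey) - lowerBound(sorted, minKey);
  }
}
```

The same two bounds also delimit the span that a range iterator or a forEachInRange-style traversal would visit.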
    • rangeIterator

      public Iterator<DoubleCursor> rangeIterator(double minKey, double maxKey)
      Returns an iterator over the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
    • forEachInRange

      public <T extends DoubleProcedure> T forEachInRange(T procedure, double minKey, double maxKey)
      Applies procedure to the keys in the list that are greater than or equal to minKey (inclusive), and less than or equal to maxKey (inclusive).
    • ramBytesAllocated

      public long ramBytesAllocated()
      Estimates the allocated memory. It does not count the memory for the list of keys, only for the index itself.
      Specified by:
      ramBytesAllocated in interface Accountable
      Returns:
      RAM allocated, in bytes.
    • ramBytesUsed

      public long ramBytesUsed()
      Estimates the bytes that are actually used. It does not count the memory for the list of keys, only for the index itself.
      Specified by:
      ramBytesUsed in interface Accountable
      Returns:
      RAM used, in bytes.