Sparse matrices/arrays in Java

algorithm java sparse-array sparse-matrix

2022-08-31 13:50:31

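I'm working on a project, written in Java, that requires me to build a very large 2-D sparse array. Very sparse, if that makes a difference. Anyway: the most crucial aspect of this application is efficiency in terms of time (assume loads of memory, though not nearly so unlimited as to allow me to use a standard 2-D array; the key range is in the billions in both dimensions).

Out of the kajillion cells in the array, there will be several hundred thousand cells that contain an object. I need to be able to modify cell contents very quickly.

Anyway: does anyone know a particularly good library for this purpose? It would have to be under a Berkeley, LGPL or similar license (no GPL, as the product can't be entirely open-sourced). Or, if there's just a very simple way to make a homebrew sparse array object, that would be fine too.

I'm considering MTJ, but haven't heard any opinions on its quality.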

Answer 1

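Sparse arrays built with hash maps are very inefficient for frequently read data. The most efficient implementations use a Trie that gives access to a single vector in which the segments are distributed.

A Trie can determine whether an element is present in the table with only two read-only array lookups, which yield either the effective position where the element is stored or the fact that it is absent from the underlying store.

It can also provide a default position in the backing store for the default value of the sparse array, so that you don't need any test on the returned index: the Trie guarantees that every possible source index maps at least to the default position in the backing store (where you will typically store a zero, an empty string or a null object).

There are implementations that support fast-updatable Tries, with an optional "compact()" operation that optimizes the size of the backing store at the end of multiple operations. Tries are much faster than hash maps, because they don't need any complex hashing function and don't need to handle collisions for reads (with hash maps you have collisions both for reading and for writing, which requires a loop to skip to the next candidate position, and a test on each of them to compare the effective source index...).

In addition, Java hash maps can only be indexed by Objects, and creating an Integer object for each hashed source index (this object creation is needed for every read, not just for writes) is costly in terms of memory operations, as it stresses the garbage collector.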
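For comparison, the hash-map approach criticized above might look like the following minimal sketch. The class `BoxedHashSparseArray` and its `key` packing helper are hypothetical names chosen for illustration, not part of any library; the sketch only shows where the `Long` boxing happens on each access.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical hash-based sparse 2-D array, shown only for comparison. */
final class BoxedHashSparseArray<T> {
    private final Map<Long, T> cells = new HashMap<>();
    private final T defaultValue;

    BoxedHashSparseArray(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    /** Packs two non-negative int coordinates into one long key. */
    static long key(int i, int j) {
        return ((long) i << 32) | (j & 0xFFFFFFFFL);
    }

    T get(int i, int j) {
        // key(i, j) is autoboxed into a Long here, on every single read.
        T value = cells.get(key(i, j));
        return value != null ? value : defaultValue;
    }

    void set(int i, int j, T value) {
        // Boxed again on every write; the map also allocates an entry node.
        cells.put(key(i, j), value);
    }

    int storedCells() {
        return cells.size();
    }
}
```

This handles coordinates in the billions with memory proportional to the number of stored cells, but each `get` pays for hashing, a possible collision walk and (unless the JIT manages to remove it) a `Long` allocation.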

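I really wish the JRE included an IntegerMap<Object> as the default implementation for the slow HashMap<Integer, Object>, or a LongTrieMap<Object> as the default implementation for the even slower HashMap<Long, Object>... but this is still not the case.

You may wonder what a Trie is?

It is just a small array of integers (covering a smaller range than the full range of coordinates of your matrix) that maps the coordinates to an integer position in a vector.

For example, suppose you want a 1024*1024 matrix containing only a few non-zero values. Instead of storing that matrix in an array of 1024*1024 elements (more than 1 million), you can split it into subranges of size 16*16, and you will need just 64*64 such subranges.

In that case, the Trie index contains just 64*64 integers (4096), and there are at least 16*16 data elements (containing the default zeroes, or the most common subrange found in your sparse matrix).

The vector used to store the values contains only one copy of subranges that are equal to each other (most of them are full of zeroes, so they are all represented by the same subrange).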

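So instead of using a syntax like matrix[i][j], you would use a syntax like:

trie.values[trie.subrangePositions[((i >> 4) << 6) + (j >> 4)] +
            ((i & 15) << 4) + (j & 15)]

which is handled more conveniently through an access method of the trie object.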
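The 1024*1024 example can be sketched as a tiny read-only class. This is only an illustration under assumed names (`TwoLookupDemo`, `example()`), not the full class given below; it shows the two array lookups and the shared default block.

```java
/** Hypothetical, read-only illustration of the two-lookup access path. */
final class TwoLookupDemo {
    static final int BITS = 4;                          // 16 x 16 cells per subrange
    static final int MASK = (1 << BITS) - 1;            // 15
    static final int SUBRANGES_PER_ROW = 1024 >> BITS;  // 64

    // 64 * 64 = 4096 index entries; all start at 0, the shared default block.
    final int[] subrangePositions = new int[SUBRANGES_PER_ROW * SUBRANGES_PER_ROW];
    // Block 0 (cells 0..255) is the all-zero default, block 1 (256..511) holds data.
    final double[] values = new double[2 << (2 * BITS)];

    /** First lookup finds the block, second lookup reads the cell. */
    double get(int i, int j) {
        int subrange = (i >> BITS) * SUBRANGES_PER_ROW + (j >> BITS);
        int offset = ((i & MASK) << BITS) + (j & MASK);
        return values[subrangePositions[subrange] + offset];
    }

    /** Builds a matrix whose only non-zero cell is (100, 200) = 3.5. */
    static TwoLookupDemo example() {
        TwoLookupDemo demo = new TwoLookupDemo();
        demo.subrangePositions[(100 >> BITS) * SUBRANGES_PER_ROW + (200 >> BITS)] = 256;
        demo.values[256 + ((100 & MASK) << BITS) + (200 & MASK)] = 3.5;
        return demo;
    }
}
```

In this sketch the dense matrix would need 1,048,576 doubles (about 8 MiB), while the trie holds 4096 ints (16 KiB) plus 512 doubles (4 KiB), and no branch is needed to detect an "absent" cell.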

Here is an example, built into a commented class (I hope it compiles OK, as it was simplified; signal me if there are errors to correct):

/**
 * Implement a sparse matrix. Currently limited to a static size
 * (<code>SIZE_I</code>, <code>SIZE_J</code>).
 */
public class DoubleTrie {

    /* Matrix logical options */        
    public static final int SIZE_I = 1024;
    public static final int SIZE_J = 1024;
    public static final double DEFAULT_VALUE = 0.0;

    /* Internal splitting options */
    private static final int SUBRANGEBITS_I = 4;
    private static final int SUBRANGEBITS_J = 4;

    /* Internal derived splitting constants */
    private static final int SUBRANGE_I =
        1 << SUBRANGEBITS_I;
    private static final int SUBRANGE_J =
        1 << SUBRANGEBITS_J;
    private static final int SUBRANGEMASK_I =
        SUBRANGE_I - 1;
    private static final int SUBRANGEMASK_J =
        SUBRANGE_J - 1;
    private static final int SUBRANGE_POSITIONS =
        SUBRANGE_I * SUBRANGE_J;

    /* Internal derived default values for constructors */
    private static final int SUBRANGES_I =
        (SIZE_I + SUBRANGE_I - 1) / SUBRANGE_I;
    private static final int SUBRANGES_J =
        (SIZE_J + SUBRANGE_J - 1) / SUBRANGE_J;
    private static final int SUBRANGES =
        SUBRANGES_I * SUBRANGES_J;
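    /* All subranges initially point to position 0 (Java zero-fills int[]). */
    private static final int DEFAULT_POSITIONS[] =
        new int[SUBRANGES];
    private static final double DEFAULT_VALUES[] =
        new double[SUBRANGE_POSITIONS];
    static {
        java.util.Arrays.fill(DEFAULT_VALUES, DEFAULT_VALUE);
    }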

    /* Internal fast computations of the splitting subrange and offset. */
    private static final int subrangeOf(
            final int i, final int j) {
        /* Row of subranges times the number of subranges per row. */
        return (i >> SUBRANGEBITS_I) * SUBRANGES_J +
               (j >> SUBRANGEBITS_J);
    }
    private static final int positionOffsetOf(
            final int i, final int j) {
        /* Row inside the subrange times the width of one subrange. */
        return (i & SUBRANGEMASK_I) * SUBRANGE_J +
               (j & SUBRANGEMASK_J);
    }

    /**
     * Utility missing in java.lang.System for arrays of comparable
     * component types, including all native types like double here.
     */
    public static final int arraycompare(
            final double[] values1, final int position1,
            final double[] values2, final int position2,
            int length) { /* Not final: decremented by the loop below. */
        if (position1 >= 0 && position2 >= 0 && length >= 0) {
            while (length-- > 0) {
                double value1, value2;
                if ((value1 = values1[position1 + length]) !=
                    (value2 = values2[position2 + length])) {
                    /* Note: NaN values are different from everything including
                     * all NaN values; they are also neither lower than nor
                     * greater than everything including NaN. Note that the two
                     * infinite values, as well as denormal values, are exactly
                     * ordered and comparable with <, <=, ==, >=, >, !=. Note
                     * that in comments below, infinite is considered "defined".
                     */
                    if (value1 < value2)
                        return -1;        /* defined < defined. */
                    if (value1 > value2)
                        return 1;         /* defined > defined. */
                    if (value1 == value2)
                        return 0;         /* defined == defined. */
                    /* One or both are NaN. */
                    if (value1 == value1) /* Is not a NaN? */
                        return -1;        /* defined < NaN. */
                    if (value2 == value2) /* Is not a NaN? */
                        return 1;         /* NaN > defined. */
                    /* Otherwise, both are NaN: check their precise bits in
                     * range 0x7FF0000000000001L..0x7FFFFFFFFFFFFFFFL
                     * including the canonical 0x7FF8000000000000L, or in
                     * range 0xFFF0000000000001L..0xFFFFFFFFFFFFFFFFL.
                     * Needed for sort stability only (NaNs are otherwise
                     * unordered).
                     */
                    long raw1, raw2;
                    if ((raw1 = Double.doubleToRawLongBits(value1)) !=
                        (raw2 = Double.doubleToRawLongBits(value2)))
                        return raw1 < raw2 ? -1 : 1;
                    /* Otherwise the NaN are strictly equal, continue. */
                }
            }
            return 0;
        }
        throw new ArrayIndexOutOfBoundsException(
                "The positions and length can't be negative");
    }

    /**
     * Utility shortcut for comparing ranges in the same array.
     */
    public static final int arraycompare(
            final double[] values,
            final int position1, final int position2,
            final int length) {
        return arraycompare(values, position1, values, position2, length);
    }

    /**
     * Utility missing in java.lang.System for arrays of equalizable
     * component types, including all native types like double here.
     */ 
    public static final boolean arrayequals(
            final double[] values1, final int position1,
            final double[] values2, final int position2,
            final int length) {
        return arraycompare(values1, position1, values2, position2, length) ==
            0;
    }

    /**
     * Utility shortcut for identifying ranges in the same array.
     */
    public static final boolean arrayequals(
            final double[] values,
            final int position1, final int position2,
            final int length) {
        return arrayequals(values, position1, values, position2, length);
    }

    /**
     * Utility shortcut for copying ranges in the same array.
     */
    public static final void arraycopy(
            final double[] values,
            final int srcPosition, final int dstPosition,
            final int length) {
        System.arraycopy(values, srcPosition, values, dstPosition, length);
    }

    /**
     * Utility shortcut for resizing an array, preserving values at start.
     */
    public static final double[] arraysetlength(
            double[] values,
            final int newLength) {
        final int oldLength =
            values.length < newLength ? values.length : newLength;
        System.arraycopy(values, 0, values = new double[newLength], 0,
            oldLength);
        return values;
    }

    /* Internal instance members. */
    private double values[];
    private int subrangePositions[];
    private boolean isSharedValues;
    private boolean isSharedSubrangePositions;

    /* Internal method. */
    private final void reset(
            final double[] values,
            final int[] subrangePositions) {
        this.isSharedValues =
            (this.values = values) == DEFAULT_VALUES;
        this.isSharedSubrangePositions =
            (this.subrangePositions = subrangePositions) ==
                DEFAULT_POSITIONS;
    }

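    /**
     * Reset the matrix to fill it with the same initial value.
     *
     * @param initialValue  The value to set in all cell positions.
     */
    public void reset(final double initialValue) {
        final double[] initialValues;
        if (initialValue == DEFAULT_VALUE) {
            initialValues = DEFAULT_VALUES;
        } else {
            initialValues = new double[SUBRANGE_POSITIONS];
            java.util.Arrays.fill(initialValues, initialValue);
        }
        reset(initialValues, DEFAULT_POSITIONS);
    }

    /** Reset the matrix to fill it with DEFAULT_VALUE. */
    public void reset() {
        reset(DEFAULT_VALUE);
    }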

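    /**
     * Default constructor, using the single default value
     * DEFAULT_VALUE in all positions of the matrix.
     */
    public DoubleTrie() {
        this(DEFAULT_VALUE);
    }

    /**
     * Constructor using an alternate default value.
     *
     * @param initialValue  Alternate default value to initialize all
     *                      positions in the matrix.
     */
    public DoubleTrie(final double initialValue) {
        this.reset(initialValue);
    }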

    /**
     * This is a useful preinitialized instance containing the
     * DEFAULT_VALUE in all cells.
     */
    public static DoubleTrie DEFAULT_INSTANCE = new DoubleTrie();

    /**
     * Copy constructor. Note that the source trie may be immutable
     * or not; but this constructor will create a new mutable trie
     * even if the new trie initially shares some storage with its
     * source when that source also uses shared storage.
     */
    public DoubleTrie(final DoubleTrie source) {
        this.isSharedValues = source.isSharedValues;
        this.values = source.isSharedValues ?
            source.values :
            source.values.clone();
        this.isSharedSubrangePositions = source.isSharedSubrangePositions;
        this.subrangePositions = source.isSharedSubrangePositions ?
            source.subrangePositions :
            source.subrangePositions.clone();
    }

    /**
     * Fast indexed getter.
     *
     * @param i  Row of position to read in the matrix.
     * @param j  Column of position to read in the matrix.
     * @return   The value stored in matrix at that position.
     */
    public double getAt(final int i, final int j) {
        return values[subrangePositions[subrangeOf(i, j)] +
                      positionOffsetOf(i, j)];
    }

    /**
     * Fast indexed setter.
     * Note: this does not compact the sparse matrix after setting.
     *
     * @param i      Row of position to set in the sparse matrix.
     * @param j      Column of position to set in the sparse matrix.
     * @param value  The value to set at this position.
     * @return       The passed value.
     * @see #compact()
     */
    public double setAt(final int i, final int j, final double value) {
        final int subrange       = subrangeOf(i, j);
        final int positionOffset = positionOffsetOf(i, j);
        // Fast check to see if the assignment will change something.
        int subrangePosition, valuePosition;
        if (Double.compare(
                values[valuePosition =
                    (subrangePosition = subrangePositions[subrange]) +
                    positionOffset],
                value) != 0) {
            /* So we'll need to perform an effective assignment in values.
             * Check if the current subrange to assign is shared or not.
             * Note that we also include the DEFAULT_VALUES which may be
             * shared by several other (not tested) trie instances,
             * including those instantiated by the copy constructor. */
            if (isSharedValues) {
                values = values.clone();
                isSharedValues = false;
            }
            /* Scan all other subranges to check if the position in values
             * to assign is shared by another subrange. */
            for (int otherSubrange = subrangePositions.length;
                    --otherSubrange >= 0; ) {
                if (otherSubrange == subrange)
                    continue; /* Ignore the target subrange. */
                /* Note: the following test of range is safe with future
                 * interleaving of common subranges (TODO in compact()),
                 * even though, for now, subranges are sharing positions
                 * only between their common start and end position, so we
                 * could as well only perform the simpler test <code>
                 * (otherSubrangePosition == subrangePosition)</code>,
                 * instead of testing the two bounds of the positions
                 * interval of the other subrange. */
                final int otherSubrangePosition =
                    subrangePositions[otherSubrange];
                if (otherSubrangePosition <= valuePosition &&
                        valuePosition <
                        otherSubrangePosition + SUBRANGE_POSITIONS) {
                    /* The target position is shared by some other
                     * subrange, we need to make it unique by cloning the
                     * subrange to a larger values vector, copying all the
                     * current subrange values at end of the new vector,
                     * before assigning the new value. This will require
                     * changing the position of the current subrange, but
                     * before doing that, we first need to check if the
                     * subrangePositions array itself is also shared
                     * between instances (including the DEFAULT_POSITIONS
                     * that should be preserved, and possible arrays
                     * shared by an external factory constructor whose
                     * source trie was declared immutable in a derived
                     * class). */
                    if (isSharedSubrangePositions) {
                        subrangePositions = subrangePositions.clone();
                        isSharedSubrangePositions = false;
                    }
                    /* TODO: no attempt is made to allocate less than a
                     * fully independent subrange, using possible
                     * interleaving: this would require scanning all
                     * other existing values to find a match for the
                     * modified subrange of values; but this could
                     * potentially leave positions (in the current subrange
                     * of values) unreferenced by any subrange, after the
                     * change of position for the current subrange. This
                     * scanning could be prohibitively long for each
                     * assignment, and for now it's assumed that compact()
                     * will be used later, after those assignments. */
                    final int newSubrangePosition = values.length;
                    values = arraysetlength(values,
                        newSubrangePosition + SUBRANGE_POSITIONS);
                    /* Copy the current subrange to the end of the vector. */
                    arraycopy(values, subrangePosition,
                        newSubrangePosition, SUBRANGE_POSITIONS);
                    subrangePositions[subrange] = newSubrangePosition;
                    valuePosition = newSubrangePosition + positionOffset;
                    break;
                }
            }
            /* Now perform the effective assignment of the value. */
            values[valuePosition] = value;
        }
        return value;
    }

    /**
     * Compact the storage of common subranges.
     * TODO: This is a simple implementation without interleaving, which
     * would offer a better data compression. However, interleaving with its
     * O(N²) complexity where N is the total length of values, should
     * be attempted only after this basic compression whose complexity is
     * O(n²) with n being SUBRANGE_POSITIONS times smaller than N.
     */
    public void compact() {
        final int oldValuesLength = values.length;
        int newValuesLength = 0;
        for (int oldPosition = 0;
                 oldPosition < oldValuesLength;
                 oldPosition += SUBRANGE_POSITIONS) {
            /* Scan the already compacted values for a common subrange. */
            boolean commonSubrange = false;
            int newPosition = newValuesLength;
            while ((newPosition -= SUBRANGE_POSITIONS) >= 0)
                if (arrayequals(values, newPosition, oldPosition,
                        SUBRANGE_POSITIONS)) {
                    commonSubrange = true;
                    break;
                }
            if (!commonSubrange) {
                /* Keep this subrange: it goes at the end of the compacted
                 * values, which are then advanced to preserve it. */
                newPosition = newValuesLength;
                newValuesLength += SUBRANGE_POSITIONS;
            }
            if (newPosition == oldPosition)
                continue; /* Already at its final position. */
            if (!commonSubrange) {
                /* Move down the non-common values, if some previous
                 * subranges have been compressed when they were common. */
                if (isSharedValues) {
                    values = values.clone();
                    isSharedValues = false;
                }
                arraycopy(values, oldPosition, newPosition,
                    SUBRANGE_POSITIONS);
            }
            /* Update subrangePositions[] with all matching positions
             * from oldPosition to newPosition. There may be several
             * indexes to change, if the trie has already been
             * compacted before, and later reassigned. */
            if (isSharedSubrangePositions) {
                subrangePositions = subrangePositions.clone();
                isSharedSubrangePositions = false;
            }
            for (int subrange = subrangePositions.length;
                 --subrange >= 0; )
                if (subrangePositions[subrange] == oldPosition)
                    subrangePositions[subrange] = newPosition;
        }
        /* Check the number of compressed values. */
        if (newValuesLength < oldValuesLength) {
            values = arraysetlength(values, newValuesLength);
            isSharedValues = false;
        }
    }

}

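Note: this code is not complete, because it handles a single matrix size, and its compactor is limited to detecting common subranges, without interleaving them.

Also, the code does not determine the best width or height to use for splitting the matrix into subranges (for the x or y coordinates) according to the matrix size. It just uses the same static subrange size of 16 (for both coordinates), but it could conveniently be any other small power of 2 (a non-power of 2 would slow down the internal methods int subrangeOf(int, int) and int positionOffsetOf(int, int)), independently for both coordinates, and up to the maximum width or height of the matrix. Ideally the compact() method should be able to determine the best fitting sizes.

If these splitting subrange sizes can vary, then you will need to add instance members for these subrange sizes instead of the static SUBRANGE_POSITIONS, and to make the static methods int subrangeOf(int i, int j) and int positionOffsetOf(int i, int j) non-static; the initialization arrays DEFAULT_POSITIONS and DEFAULT_VALUES will also need to be dropped or redefined differently.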
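One possible shape for such per-instance sizes is sketched below. The class `SubrangeGeometry` and its field names are assumptions made for illustration; the original class keeps everything static. Subrange sizes are still restricted to powers of 2 so that shifts and masks can be used.

```java
/** Hypothetical per-instance replacement for the static splitting constants. */
final class SubrangeGeometry {
    private final int bitsI, bitsJ;   // log2 of the subrange height and width
    private final int maskI, maskJ;   // masks selecting the in-subrange bits
    private final int subrangeWidth;  // cells per subrange row
    private final int subrangesJ;     // subranges per matrix row

    SubrangeGeometry(int sizeJ, int bitsI, int bitsJ) {
        this.bitsI = bitsI;
        this.bitsJ = bitsJ;
        this.maskI = (1 << bitsI) - 1;
        this.maskJ = (1 << bitsJ) - 1;
        this.subrangeWidth = 1 << bitsJ;
        this.subrangesJ = (sizeJ + subrangeWidth - 1) >> bitsJ; // ceiling division
    }

    /** Index of the subrange containing cell (i, j). */
    int subrangeOf(int i, int j) {
        return (i >> bitsI) * subrangesJ + (j >> bitsJ);
    }

    /** Offset of cell (i, j) inside its subrange. */
    int positionOffsetOf(int i, int j) {
        return (i & maskI) * subrangeWidth + (j & maskJ);
    }

    /** Number of cells in one subrange (replaces SUBRANGE_POSITIONS). */
    int subrangePositions() {
        return 1 << (bitsI + bitsJ);
    }
}
```

With `new SubrangeGeometry(1024, 4, 4)` this reproduces the static 16*16 split; other combinations such as 4 rows by 32 columns only change the constructor arguments.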

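If you want to support interleaving, you basically start by dividing the existing values into two subsets of about the same size (both being a multiple of the minimum subrange size, the first subset possibly having one more subrange than the second one), and you scan the larger one at all successive positions to find a matching interleaving; then you try to match these values. Then you loop recursively by dividing the subsets into halves (also a multiple of the minimum subrange size) and scan again to match these subsets. This multiplies the number of subsets by 2, so you have to ask whether the doubled size of the subrangePositions index is worth it compared to the existing size of the values, to see if it offers an effective compression (if not, you stop there: you have found the optimum subrange size directly from the interleaving compression process). In that case, the subrange size is mutable during compaction.

But this code shows how you assign non-zero values and reallocate the values array for additional (non-zero) subranges, and then how you can optimize the storage of this data (with compact(), after assignments have been performed using the setAt(int i, int j, double value) method) when there are duplicate subranges that can be unified within the data and reindexed at the same position in the subrangePositions array.

Anyway, all the principles of a trie are implemented there:

* It is always faster (and more compact in memory, meaning better locality) to represent a matrix using a single vector instead of a double-indexed array of arrays (each one allocated separately). The improvement is visible in the double getAt(int, int) method!
* You save a lot of space, but when assigning values it may take time to reallocate new subranges. For this reason, the subranges should not be too small, or reallocations will occur too frequently while setting up your matrix.
* It is possible to transform an initial large matrix automatically into a more compact matrix by detecting common subranges. A typical implementation will then contain a method such as compact() above. However, while get() access is very fast and set() is quite fast, compact() may be very slow if there are lots of common subranges to compress (for example when subtracting a large non-sparse randomly-filled matrix from itself, or multiplying it by zero: in that case it is simpler and much faster to reset the trie by instantiating a new one and dropping the old one).
* Common subranges use common storage in the data, so this shared data must be read-only. If you must change a single value without changing the rest of the matrix, you must first make sure that it is referenced only once in the subrangePositions index. Otherwise you need to allocate a new subrange anywhere (conveniently at the end) in the values vector, and then store the position of this new subrange in the subrangePositions index.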
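The copy-on-write rule in the last point can be isolated in a much smaller setting. The following is a minimal 1-D sketch under assumed names (`CowBlockVector`, `BLOCK`); it is not the class above, only the same idea reduced to one coordinate.

```java
import java.util.Arrays;

/** Hypothetical 1-D analogue of the trie, reduced to the copy-on-write step. */
final class CowBlockVector {
    static final int BLOCK = 4;                   // cells per subrange (assumed)
    private final int[] positions;                // block index -> start in values
    private double[] values = new double[BLOCK];  // one shared all-zero block

    CowBlockVector(int blocks) {
        positions = new int[blocks];              // every block points at 0
    }

    double get(int index) {
        return values[positions[index / BLOCK] + index % BLOCK];
    }

    void set(int index, double value) {
        int block = index / BLOCK;
        int start = positions[block];
        if (values[start + index % BLOCK] == value) return; // nothing to change
        if (isShared(block, start)) {
            // Copy the shared block to the end and repoint this block only.
            int newStart = values.length;
            values = Arrays.copyOf(values, newStart + BLOCK);
            System.arraycopy(values, start, values, newStart, BLOCK);
            positions[block] = start = newStart;
        }
        values[start + index % BLOCK] = value;
    }

    /** True if some other block references the same storage. */
    private boolean isShared(int block, int start) {
        for (int other = 0; other < positions.length; other++)
            if (other != block && positions[other] == start) return true;
        return false;
    }

    int storedCells() {
        return values.length;
    }
}
```

Writing to a cell of a shared block grows the storage by exactly one block; later writes to the same block happen in place, which is why very small blocks would cause frequent reallocations.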

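Note that the generic Colt library, though very good as it is, is not as good when working on sparse matrices, because it uses hashing (or row-compressed) techniques and does not implement support for tries for now, even though tries are an excellent optimization that saves both space and time, notably for the most frequent getAt() operations.

Even the setAt() operation described here for tries saves a lot of time (in the way it is implemented here, i.e. without automatic compaction after setting; compaction could still be implemented on demand, based on estimated time, where it would save a lot of storage space at the price of time): the time saving is proportional to the number of cells per subrange, and the space saving is inversely proportional to the number of cells per subrange. A good compromise is then to use a subrange size such that the number of cells per subrange is the square root of the total number of cells in a 2D matrix (it would be a cubic root when working with 3D matrices).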
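As a rough sketch of that square-root compromise, the helper below picks the number of bits per coordinate for a square, power-of-2 subrange. The class and method names are hypothetical, and rounding to the nearest power of 2 is an assumption of this sketch rather than something prescribed above.

```java
/** Hypothetical helper applying the square-root rule to subrange sizing. */
final class SubrangeSizing {
    /**
     * Returns the bits per coordinate so that a square subrange of
     * (1 << bits) * (1 << bits) cells holds about sqrt(rows * cols) cells.
     */
    static int bitsPerCoordinate(long rows, long cols) {
        if (rows < 1 || cols < 1)
            throw new IllegalArgumentException("sizes must be positive");
        long cells = Math.multiplyExact(rows, cols);   // throws on overflow
        // ceil(log2(cells)), computed without floating point.
        int log2Cells = 64 - Long.numberOfLeadingZeros(cells - 1);
        // sqrt(cells) cells per subrange is log2Cells / 2 bits in total,
        // split over two coordinates: log2Cells / 4, rounded to nearest.
        return (log2Cells + 2) / 4;
    }
}
```

For the 1024*1024 example this suggests 5 bits (32*32 subranges of 1024 cells each) rather than the 16*16 split hard-coded in the class; which one performs better in practice would depend on how the non-default cells are clustered.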

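The hashing techniques used in Colt's sparse matrix implementations have the inconvenience that they add a lot of storage overhead, and have slow access times due to possible collisions. Tries avoid all collisions, and can therefore guarantee O(1) time instead of linear O(n) time in the worst cases, where n is the number of possible collisions (which, in the case of a sparse matrix, may be up to the number of non-default-value cells in the matrix, i.e. up to the total size of the matrix times a factor proportional to the hashing fill factor, for a non-sparse, i.e. full, matrix).

The RC (row-compressed) technique used in Colt is closer to Tries, but at another price: the compression techniques used there have very slow access times for the most frequent read-only get() operations, and very slow compression for setAt() operations. In addition, the compression used is not orthogonal, unlike in this presentation of Tries, where orthogonality is preserved. Tries would also preserve this orthogonality for related viewing operations such as striding, transposition (viewed as a striding operation based on integer cyclic modular operations), and subranging (and subselections in general, including sorting views).

I just hope that Colt will be updated in the future to add another implementation using Tries (i.e. a TrieSparseMatrix instead of just HashSparseMatrix and RCSparseMatrix). The ideas are in this article.

The Trove implementations (based on int->int maps) also rely on hashing techniques similar to Colt's HashedSparseMatrix, i.e. they have the same inconvenience. Tries will be a lot faster, at the cost of a moderate amount of additional space (but this space can be optimized, and even become better than with Trove and Colt, at a deferred time, using a final compact() operation on the resulting matrix/trie).

Note: this Trie implementation is bound to a specific native type (here double). This is deliberate, because generic implementations using boxing types have a huge space overhead (and are much slower in access time). Here it just uses native one-dimensional arrays of double rather than a generic Vector. But it is certainly possible to derive a generic implementation for Tries as well... Unfortunately, Java still does not allow writing really generic classes with all the benefits of native types, except by writing multiple implementations (for a generic Object type or for each native type) and serving all these operations via a type factory. The language should be able to instantiate the native implementations automatically and build the factory automatically (for now this is not the case, even in Java 7, and this is something where .Net still maintains its advantage with really generic types that are as fast as native types).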

Answer 2

The following framework for testing Java matrix libraries also provides a good list of them: https://lessthanoptimal.github.io/Java-Matrix-Benchmark/

Tested libraries:

* Colt
* Commons Math
* Efficient Java Matrix Library (EJML)
* Jama
* jblas
* JScience (Older benchmarks only)
* Matrix Toolkit Java (MTJ)
* OjAlgo
* Parallel Colt
* Universal Java Matrix Package (UJMP)