tacco.utils.sparse_distance_matrix

sparse_distance_matrix(A, max_distance, method='numba', numba_blocksize=None, numba_experimental_dict='auto', dtype=numpy.float64, parallel=True, batch_size=None, bin_batching=200, low_mem=False, verbose=1)[source]

Calculate a sparse pairwise distance matrix of dense inputs. Only the Euclidean metric is supported.

For a dense version, see dense_distance_matrix().
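The cutoff semantics can be illustrated with a plain dense computation (a sketch using scipy.spatial.distance.cdist rather than tacco itself): the sparse result keeps exactly those pairs whose distance does not exceed max_distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
A = rng.random((100, 2))  # a 2d ndarray of 100 points in the unit square

max_distance = 0.1
dense = cdist(A, A)  # full dense matrix of pairwise Euclidean distances
# the sparse result retains only entries with distance <= max_distance
mask = dense <= max_distance
print(mask.sum(), "of", dense.size, "pairs retained")
```

For large point sets the dense matrix becomes prohibitively big, which is what motivates the sparse variant.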

Parameters:
  • A – A 2d ndarray.

  • max_distance – A distance cutoff: all distances larger than this value are excluded from the result.

  • method

    A string indicating the method to use. Available are:

    • ’scipy’: Use scipy.spatial.cKDTree.sparse_distance_matrix(). This is most efficient for smaller numbers of points and relatively small max_distance.

    • ’numba’: Use a custom numba-based implementation, which is much faster for larger max_distance and larger datasets.

  • numba_blocksize – If method is ‘numba’, this gives the size of the blocks within which the distances are computed densely. It has to be at least max_distance and, depending on the dataset, should be several times larger for optimal performance. If None, a heuristic is used to find a reasonable value. Smaller values need less memory.

  • numba_experimental_dict – If method is ‘numba’, how to accelerate some parts of the code by using numba dictionaries, which is an experimental numba feature. If ‘auto’, runs a small test set to determine whether numba dicts seem to work and uses them accordingly.

  • dtype – The data type to use for calculations. Internally, method ‘scipy’ always uses np.float64, i.e. double precision. Method ‘numba’ can get a significant speedup from using lower precision, at the cost of rounding errors: distinct nearby points can end up with a numerical distance of 0, which is then discarded from the sparse result. All non-floating dtypes are cast to np.float64.

  • parallel – Whether to run using multiple cores. Only method ‘numba’ supports this option.

  • batch_size – The number of blocks to calculate per batch. Small values need less memory while large values tend to give higher parallel speedup. If None, uses the number of available threads - even if parallel==False and only a single thread is used.

  • bin_batching – If larger than 0, the calculations for individual blocks are grouped such that (mostly) at least bin_batching points are considered together. This reduces overhead and the impact of choosing too small a numba_blocksize.

  • low_mem – Whether to buffer large temporaries on disk. This is slower than in-memory operation, but can reduce memory consumption by a factor of 2.

Returns:

A coo_matrix containing the distances.
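For reference, the ‘scipy’ method is essentially a wrapper around scipy.spatial.cKDTree.sparse_distance_matrix(); a minimal sketch of an equivalent call (using scipy directly, not tacco) that produces a coo_matrix of pairwise distances within the cutoff:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
A = rng.random((500, 2))  # a 2d ndarray of 500 points

tree = cKDTree(A)
# sparse matrix of pairwise Euclidean distances; pairs farther apart
# than max_distance are simply absent from the result
D = tree.sparse_distance_matrix(tree, max_distance=0.05, output_type='coo_matrix')
print(D.shape, D.nnz)
```

Note that stored values of 0 can be genuine zero distances (e.g. self-pairs), which is where the lower-precision rounding caveat of the ‘numba’ method comes from.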