Prague Stringology Conference 2025

Ondřej Sladký, Pavel Veselý and Karel Břinda

Towards Efficient k-Mer Set Operations via Function-Assigned Masked Superstrings

Abstract:
The design of efficient dynamic data structures for large k-mer sets belongs to central challenges of sequence bioinformatics. Recent advances in compact k-mer set representations via Spectrum-Preserving String Sets (SPSS), culminating with the masked superstring framework, have provided data structures of remarkable space efficiency for wide ranges of k-mer sets. However, the possibility to perform set operations with the resulting indexes has remained limited due to the static nature of the underlying compact representations. Here, we develop f-masked superstrings, a concept combining masked superstrings with custom demasking functions f to enable k-mer set operations based on index merging. Combined with the FMSI index for masked superstrings, we obtain a memory-efficient k-mer membership index and compressed dictionary supporting set operations via Burrows-Wheeler Transform merging. The framework provides a promising theoretical solution to a pressing bioinformatics problem and highlights the potential of f-masked superstrings to become an elementary data type for k-mer sets.

Download paper: Article in PostScript Article in PDF BibTeX Reference
 PostScript   PDF   BibTeX reference 
Download presentation: Presentation