Table of Contents Overview Detailed commentsGeneral Abstract 1. Introduction 2. Methods2.1 Outline 2.2 Indexing super-k-mers 2.3 Lazy encoding 2.4 Probing 2.5 Superbuckets 2.6 Implementation details 3. Results3.1 Parameters 3.2 Multicore 3.4 Comparison 3.5 Query times 4. Conclusion These are some (biased) comments on Brisk, a dynamic k-mer dictionary (Smith et al. 2024). Overview As is common these days, Brisk builds a dynamic k-mer dictionary using super-kmers, like e.g. SSHash (Pibiri 2022).| Posts on CuriousCoding
Table of Contents 1Introduction1.1Overview 1.2Previous reviews 2Theory of sampling schemes2.1Questions 2.2Types of schemes 2.3Parameter regimes 2.4Different perspectives 2.5UHS vs minimizer scheme 2.6(Asymptotic) bounds 2.7Lower bounds 3Minimizer schemes3.1Orders 3.2UHS-based and search-based schemes 3.3Pure schemes 3.4Other variantsSelection schemes Canonical minimizers 4Open questions 1 Introduction sadf Lots of DNA data Most algorithms deal with k-mers. k-mers overlap, and hence considerin...| Posts on CuriousCoding
Table of Contents Overview Detailed commentsTerminology Abstract Preliminaries Methods3.5 Transformations Results Discussion Comments on “Expected density of random minimizers” These are some (biased) comments on “Generating Low-Density Minimizers” (Golan et al. 2024), which introduces the GreedyMini minimizer scheme. At the bottom, there are also some comment on Golan and Shur (2024). Overview The GreedyMinimizer is very cute and simple idea, and works as follows:| Posts on CuriousCoding
Table of Contents The importance of ordering Asymptotically optimal minimizers These are some (biased) comments on “When Less Is More: Sketching with Minimizers in Genomics” (Ndiaye et al. 2024). The importance of ordering the interest lies in constructing a minimizer with a density within a constant factor, i.e., \(O(1/w)\) for any \(k\). With lexicographic ordering, minimizers can achieve such density, but with large \(k\) values (\(\geq \log_{|Σ|}(w)-c\) for a constant \(c\)), which m...| Posts on CuriousCoding
Table of Contents 1Suffix arrays 2Searching methods2.1Naive \(O(|P|\cdot \lg_2 n)\) search 2.2Faster \(O(|P|\cdot \lg_2 n)\) search 2.3LCP-based \(O(|P| + \lg_2 n)\) search 3Analysing the faster search We’ll prove that using the “faster” binary search algorithm (see 2.2) that tracks the LCP with the left and right boundary of the remaining search interval has amortized runtime \[ O\Big(\lg_2(n) + |P| + |P| \cdot \lg_2(Occ(P))\Big), \] when \(P\) is a randomly sampled fixed-length patter...| Posts on CuriousCoding
Here I’ll briefly list some FM-index and related implementations around the web. Implementations seem relatively inconsistent, mostly because the FM-index is more of a ‘wrapper’ type around a given Burrows-Wheeler-transform and an occurrences list. Both can be implemented in various ways. In particular occurrences should be stored using a wavelet tree for optimal compressing. The nucleic-acid repo contains a completely unoptimised version. The Rust-bio crate contains a generic FM-index....| Posts on CuriousCoding
Table of Contents 1Sampling schemes1.1Definitions 1.2Miniception 1.3Mod-minimizer 1.4Forward scheme lower bound 1.5Open syncmer minimizer 1.6Open-closed minimizer 1.7New: General mod-minimizer 1.8Variant: Open-closed minimizer using offsets 2Selection schemes2.1Definition 2.2Bd-anchors 2.3New: Smallest unique substring anchors 2.4New: Anti lexicographic sorting 3More sampling schemes3.1Anti-lex sus-anchors 3.2Threshold anchors 3.3The $t$-gap disappears for large alphabets 4Computing the densi...| Posts on CuriousCoding
Table of Contents 1Steps1.1Using kwargs 2TODOs Using PyO3 and maturin, it’s very easy to call Rust code from Python. I’m mostly following the guide at pyo3.rs, but leaving out some thing related to python environments. 1 Steps Install maturin. I use the Arch package but you can also do a pip install in the environment below. Make sure you have a lib target, and add cdylib as a crate-type.| Posts on CuriousCoding
https://transformer-circuits.pub/2021/framework/index.html https://infini-gram.io/ https://sleepinyourhat.github.io/checklist/| Posts on CuriousCoding
Table of Contents 1General observations 2Heuristic track 3Parameterized track 4Exact track In this post I will collect some high level ideas and approaches used to solve the PACE 2024 challenge. Very briefly, the goal is to write fast solvers for NP-hard problems. The problem for the 2024 edition is one-side crossing minimization: Given is a bipartite graph \((A, B)\) that is drawn in standard way with the nodes of both \(A\) and \(B\) on a line, where the order of the nodes of \(A\) is fixed...| Posts on CuriousCoding
Table of Contents 1Space dust 1 Space dust What is the total mass of space dust hitting the earth during the Perseids meteor shower? References| Posts on CuriousCoding
Table of Contents A*PA PA-Bench Mod-minimizers There are links for my talks and posters. A*PA A*PA (bioinformatic)Groot Koerkamp, Ragnar, and Pesho Ivanov. 2024. “Exact Global Alignment Using A* with Chaining Seed Heuristic and Match Pruning.” Edited by Tobias Marschall. Bioinformatics 40 (3). https://doi.org/10.1093/bioinformatics/btae032.A*PA2 (WABI24)Groot Koerkamp, Ragnar. 2024. “A*PA2: Up to 19 Faster Exact Global Alignment.” In 24th International Workshop on Algorithms in Bioinf...| Posts on CuriousCoding
Table of Contents Succinct backgroundDefinitions Lower bounds A new lower bound Discussion Post scriptum Acknowledgement The results of this post are now published in Bioinformatics: DOI, PDF: Kille, Bryce, Ragnar Groot Koerkamp, Drake McAdams, Alan Liu, and Todd J Treangen. 2024. “A near-Tight Lower Bound on the Density of Forward Sampling Schemes.” Edited by Yann Ponty. Bioinformatics, December. https://doi.org/10.1093/bioinformatics/btae736. --- In this post I will prove a new lower bo...| Posts on CuriousCoding
Table of Contents 1Sapling 2PLA-Index 3LISA: learned index Let’s summarize some tools for efficiently searching suffix arrays. 1 Sapling Sapling (Kirsche, Das, and Schatz 2020) works as follows: Choose a parameter \(p\) store for each of the \(2^p\) $p$-bit prefixes the corresponding position in the suffix array. When querying, first find the bucket for the query prefix. Then do a linear interpolation inside the bucket. Search the area \([-E, +E]\) around the interpolated position, where \(...| Posts on CuriousCoding
Popular C libraries are: divsufsort libsais Both have a ..64 variant that supports input strings longer than 2GB. Rust wrappers: divsufsort: rust reimplementation, does not support large inputs. cdivsufsort: c-wrapper, does not support large inputs livdivsufsort-rs: c-wrapper, does support large inputs sais: unrelated to the original library; does not implement a linear time algorithm anyway libsais-rs: Daniel Liu’s fork-of-fork of the original, but not on crates.io. Supports multithreading...| Posts on CuriousCoding
Table of Contents Summary Background Review commentsDFS Supplementary methods Details of pruning Evals Discussion Code & repo Here are some thoughts on POASTA (van Dijk et al. 2024), a recent affine-cost sequence-to-DAG (POA) aligner inspired by WFA and using A*. Summary Take a query and a directed acyclic graph (DAG). Align the query to the full DAG. It’s like global alignment for graphs. In fact I think the graph doesn’t actually have to be acyclic, as long as it has a start and end. (W...| Posts on CuriousCoding
Table of Contents Abstract 1Introduction1.1Contributions 1.2Previous work1.2.1Needleman-Wunsch 1.2.2Graph algorithms 1.2.3Computational volumes 1.2.4Parallelism 1.2.5Tools 2Preliminaries 3Methods3.1Band-doubling 3.2Blocks 3.3Memory 3.4SIMD 3.5SIMD-friendly sequence profile 3.6Traceback 3.7A*3.7.1Bulk-contours update 3.7.2Pre-pruning 3.8Determining the rows to compute3.8.1Sparse heuristic invocation 3.9Incremental doubling 4Results4.1Setup 4.2Comparison with other aligners 4.3Effects of method...| Posts on CuriousCoding
Table of Contents Summary Main issues 1. Introduction 2. Methods2.3 heuristic 3. Results Discussion Code These are my review-like notes on refined minimizers, introduced in Pan and Reinert (2024). Summary The paper introduces refined minimizers, a new scheme for sampling canonical minimizers that is less biased than the usual scheme. Instead of taking the minimum of the minimizer of the forward and reverse strand, the minimizer of the strand with the higher GT density is chosen. The less bias...| Posts on CuriousCoding
Table of Contents Applications BackgroundMinimizers Density bounds Robust minimizers PASHA Miniception Closed syncmers Bd-anchors New: Mod-minimizers Experiments Conclusion Small k experimentsSearch methods Directed minimizer \(k=1\), \(w=2\) \(k=1\), \(w=4\) \(k=1\), \(w=5\) \(k=2\), \(w=2\) \(k=2\), \(w=4\) Notes Reading list \[ \newcommand{\d}{\mathrm{d}} \newcommand{\L}{\mathcal{L}} \] This post introduces some background for minimizers and some experiments for a new minimizer variant. Th...| Posts on CuriousCoding
Table of Contents Overview Rust featuresBasics Basic syntax Expressions everywhere! Closures Pattern matching References Ownership Containers Traits Iterators Common libraries Ecosystem Useful links Hands-onInstallation Create a project Hello, world! Small project ideas These are notes for a quick introduction to Rust. Overview Statically typed & Compiled language. Great developer experience: cargo build system rust-analyzer LSP Rust features Basics C++Rust std::size_tusize std::pointerdiff_t...| Posts on CuriousCoding
Table of Contents Paper overview Remarks on the paper Thoughts \[ \newcommand{\A}{\mathcal{A}_\ell} \newcommand{\T}{\mathcal{T}_\ell} \] These are some notes on Bidirectional String Anchors (Loukides, Pissis, and Sweering 2023), also called bd-anchors. Resources: Loukides and Pissis (2021): preceding conference paper with subset of content. Loukides, Pissis, and Sweering (2023): The paper discussed here. Ayad, Loukides, and Pissis (2023): follow-up/second paper containing a faster average-cas...| Posts on CuriousCoding
Table of Contents Paper summaryIntro Prelims Related work Sparse and skew hashing Remarks Ideas \[\newcommand{\S}{\mathcal{S}}\] Paper summary Intro SsHash (Pibiri 2022) is a datastructure for indexing kmers. Given a set of kmers \(\S\), it supports two operations: \(Lookup(g)\)return the unique id \(i\in [|\S|]\) of the kmer \(g\).\(Access(i)\)return the kmer corresponding to id \(i\).It also supports streaming queries, looking up all kmers from a longer string consecutively, by expoiting th...| Posts on CuriousCoding
Table of Contents NtHash MinimizersRobust minimizers Is NtHash injective on kmers?Searching for a collision Proving perfection Alternatives SmHasher results TODO benchmark NtHash, NtHash2, FxHash NtHash NtHash (Mohamadi et al. 2016) is a rolling hash suitable for hashing any kind of text, but made for DNA originally. For a string of length \(k\) it is a \(64\) bit value computed as: \begin{equation} h(x) = \bigoplus_{i=0}^{k-1} rot^i(h(x_i)) \end{equation}| Posts on CuriousCoding
I recently gave a talk about A*PA at CWI. Sadly the recording doesn’t show the blackboard, but either way, find it here.| Posts on CuriousCoding
Table of Contents NotesColoured Tree Problem Generic sparse suffix array Sparse suffix array on minimizers Discussion / TODOsEvals These are my running notes on implementing an algorithm for Longest Common Repeat using minimizers. Notes Coloured Tree Problem See Lemma 3 at here Generic sparse suffix array paper: https://arxiv.org/pdf/2310.09023.pdf code: https://github.com/lorrainea/SSA/blob/main/PA/ssa.cc For random strings and \(b \leq n / \log n\), direct radix sort on $2log n + log log n$...| Posts on CuriousCoding
Table of Contents MondayFimpera: bloom filter for kmers Progress of tools Order-preserving MPHF of minimizers Algorithmic bottlenecks in SSHash Fourier transform of the human genome? TuesdayVariant types WednesdaySSHash PTHash de Bruijn Graphs These are notes of discussions at the ALPACA/PANGAIA conference in November 2023. Monday I had interesting discussions with Giulio, Paul, and Lucas Robidou. Fimpera: bloom filter for kmers Idea: instead of storing $k$mers in bloom filter, store all cons...| Posts on CuriousCoding
Table of Contents Lecture 1, 14 NovemberResources Reader friendlyness Typical problems Lecture 2, 21 NovemberParagraph level expectations Flow Assignment for next week Lecture 3, 28 NovemberBad organization Figures References to figures Indicative vs Informative (ex. 7) Lecture 4, December 5Introduction ConclusionTense Lecture 5, December 12Abstracts Titles PunctuationComma Dashes Some notes from the writing course I’m taking. Lecture 1, 14 November Resources Searching phrases/alternatives ...| Posts on CuriousCoding
Steps: Clone https://github.com/RagnarGrootKoerkamp/BAPCtools Make an alias to the executable: 1 ln -s ~/git/BAPCtools/bin/tools.py ~/bin/bt Create a new problem: 1 2 3 cd ~/problems bt new_problem my_problem_name cd ~/problems/my_problem_name You now have the following: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 . ├── data │ ├── sample │ │ └── 1.in # Sample testcase input │ │ └── 1.ans # Sample testcase output │ └─...| Posts on CuriousCoding
Table of Contents Questions and remarks on PTHash paper Ideas for improvementParameters Align packed vectors to cachelines Prefetching Faster modulo operations Store dictionary \(D\) sorted using Elias-Fano coding How many bits of \(n\) and hash entropy do we need? Ideas for faster construction Implementation logHashing function Bitpacking crates Construction Fastmod TODO Try out fastdivide and reciprocal crates First benchmark Faster bucket computation Branchless, for real now! (aka the tric...| Posts on CuriousCoding
Table of Contents Possible speedup? BBHash Limasset et al. (2017) uses multiple layers to create a minimal perfect hashing functions (MPFH), that hashes some input set into \([n]\). (See also my note on PTHash (Pibiri and Trani 2021).) Simply said, it maps the \(n\) elements into \([\gamma \cdot n]\) using hashing function \(h_0\). The \(k_0\) elements that have collisions are mapped into \([\gamma \cdot k_0]\) using \(h_1\). Then, the \(k_1\) elements with collisions are mapped into \([\gamm...| Posts on CuriousCoding
Table of Contents Problem Input Example Discussion Found the bug Outlook The supplement (download) of the Loving, Hernandez, and Benson (2014) paper introduces a \(15\) operation version of Myers (1999) bitpacking algorithm, which uses \(16\) operations when modified for edit distance. I tried implementing it, but it seems to have a bug that I will describe below. The fix is here. Problem To recap, this algorithm solves the unit-cost edit distance problem by using bitpacking to compute a \(1\...| Posts on CuriousCoding
Table of Contents Shortest path algorithms .... in general .. for circuit design Bucket queues Shortest path algorithms by HadlockGrid graphs Strings Spouge’s computational volumes This note summarizes some papers I was reading while investigating the history of A* for pairwise alignment, and related to that the first usage of a bucket queue. Schrijver (2012) provides a nice overview of general shortest path methods. Shortest path algorithms .. .. in general Moore (1959) was already present...| Posts on CuriousCoding
Table of Contents Introduction Research planImprove query performance using Heavy-Light Decomposition Add more query types Extend to non-exact suffix-prefix-overlap that allows for read errors Implement an algorithm to build string graphs, and possibly a full assembler This is a research proposal for a 5 month internship at CWI during autumn/winter 2023-2024. Introduction An important problem in bioinformatics is genome assembly: DNA sequencing machines read substrings of a full DNA genome, a...| Posts on CuriousCoding
Table of Contents CommentsPrelims One-to-One One-to-All Report and Count Top-\(K\) A small rant on $τ$-micro-macro trees Ideas for simplificationReplace $τ$-micro-macro tree Heavy-Light-Decomposition (HLD) for \(Count\) queries in \(O(\log n)\) time Finding the largest \(l\) with \(Count(i, l) \geq K\) in \(O(\log n)\) time Reporting matching strings Comparison Closing thoughts \[\newcommand{\dol}{\$}\] These are some comments and new ideas on the paper by Loukides, Pissis, Thankachan, and ...| Posts on CuriousCoding
These are notes for DSB 2023. They’re not very structured though. I usually find methods more interesting than results. Day 1, Tuesday Practical data structures for longest common extensions, Alexander Herlez LCE: longest common extension: given \(i\), \(j\), the max \(k\) s.t. \(A[i, i+k) = A[j, j+k)\). alg: compare first k if same: sample a subset and use black-box datastructure. similar idea to minhash/mash kmer selection methods, same(?) as syncmers string synchronizing sets (SSS): roll...| Posts on CuriousCoding
Table of Contents Research Proposal: Near-linear exact pairwise alignmentAbstract Introduction and current state of research in the field Goals of the thesisImpact Progress to date Detailed work planWP1: A*PA v1: initial version WP2: Visualizing aligners WP3: Benchmarking aligners WP4: Theory review WP5: A*PA v2: efficient implementation WP6: Affine costs WP7: Ends-free alignment and mapping WP8: Further extension and open ended research WP9: Thesis writing Publication plan Time schedule Teac...| Posts on CuriousCoding
Table of Contents Thoughts and remarksGood Bad My programming language journeyLego mindstorms LabVIEW C++ Python Rust These are some notes on my opinions on Rust after one year of using it. Thoughts and remarks These pros and cons are mostly relative to C++, the language I used for the past ~10 years. Good Sum types! Option and enum are so much nicer than optional and in particular variant. Build system Never used 3rd party code/libs in personal C++ projects. Use random crates all the time no...| Posts on CuriousCoding
Table of Contents Complexity analysisComplexity of edit distance Complexity of affine cost alignment Comparison Implementation efficiencyBand doubling for affine scores was never implemented WFA vs band doubling for affine costs ConclusionFuture work This note explores the complexity and performance of band doubling (Edlib) and WFA under varying cost models. Edlib (Šošić and Šikić 2017) uses band doubling and runs in \(O(ns)\) time, for sequence length \(n\) and edit distance \(s\) betwe...| Posts on CuriousCoding
Select the algorithm to visualize Click the buttons, or click the canvas and use the indicated keys Suffix-array construction is explained here and BWT is explained here. Source code is on GitHub. Algorithm String Query Delay (s)| Posts on CuriousCoding
Table of Contents Linear programming Assumptions Idea for an algorithm This note contains some ideas about linear programming and most-orthogonal faces. They’re mostly on an intuitive level and not very formal. Postscriptum: The ideas here don’t work. Linear programming \begin{equation*} \newcommand{\v}[1]{\textbf{#1}} \newcommand{\x}{\v x} \newcommand{\t}{\v t} \newcommand{\b}{\v b} \end{equation*} Maximize \(\t\x\) subject to \(A\x \leq \b\). \(\x\) is a vector of \(n\) variables \(x_i\...| Posts on CuriousCoding
Table of Contents Notation Needleman-Wunsch: where it all begins Dijkstra/BFS: visiting fewer states Band doubling: Dijkstra, but more efficient GapCost: A first heuristic Computational volumes: an even smaller search Cheating: an oracle gave us \(g^*\) A*: Better heuristics Broken idea: A* and computational volumes Local doublingWithout heuristic With heuristic Diagonal Transition A* with Diagonal Transition and pruning: doing less work Goal: Diagonal Transition + pruning + local doubling Pr...| Posts on CuriousCoding
Table of Contents Burrows-Wheeler Transformation (BWT)Last-to-first mapping (LF mapping) Pattern matching Visualization Bi-directional BWT These are some notes about the Burrows-Wheeler Transform (BWT), FM-index, and variants. See my post on the linear time suffix array construction algorithm for notation and terminology. At the bottom you can find a visualization. This page has an interactive demo. Source code for visualizations is this GitHub repo. Burrows-Wheeler Transformation (BWT) The B...| Posts on CuriousCoding
Some notes regarding the identity \begin{equation} \sum_{k=0}^n \binom{2k}k \binom{2n-2k}{n-k} = 4^n \end{equation} Gould has two derivations: The first, from Jensens equality, (18) in (Jensen 1902; Shijie 1303). A second via the Chu-Vandermonde convolution: \begin{equation} \sum_{k=0}^n \binom{x}k \binom{y}{n-k} = \binom{x+y}n \end{equation} using \(x=y=-\frac 12\) and using the $-\frac 12$-transform: \begin{equation} \binom{-1/2}{n} = (-1)^n\binom{2n}{n}\frac 1 {2^{2n}} \end{equation} Duart...| Posts on CuriousCoding
Table of Contents Definitions Proof of Lemma 1 TODO Proof of Lemma 2 This is a proof that Tensor Embedding (Joudaki, Rätsch, and Kahles 2020) with $ℓ^2$-norm preserves the Hamming distance. This is in collaboration with Amir Joudaki. \begin{equation*} \newcommand{\I}{\mathcal I} \newcommand{\EE}{\mathbb E} \newcommand{\var}{\operatorname{Var}} \end{equation*} Definitions NotationThe alphabet is \(\Sigma\), of size \(|\Sigma| = \sigma\). The set of indices is \(\I := \{(i_1, \dots, i_t) \in...| Posts on CuriousCoding
Table of Contents Notation Small and Large suffixes Building the suffix array from a smaller one Visualization These are some notes about linear time suffix array (SA) construction algorithms (SACA’s). At the bottom you can find a visualization. This page has an interactive demo. History of suffix array construction algorithms: 1990 first algorithm: Manber and Myers (1993) 2002 small/large suffixes, explained below: Ko and Aluru (2005) 2009 recursion only on LMS suffixes: Nong, Zhang, and C...| Posts on CuriousCoding
Table of Contents Contest strategies Pairwise Alignment using A* Exercises Contest strategies PreparationThinking costs energy! Sleep enough; early to bed the 2 nights before. No practising on contest day (and the day before); it just takes energy. During the contestEat! At the very least take a break halfway with the entire team and eat some snacks. Make sure to read all the problems before the end of the contest. In the beginning, split the problems to find the simple ones, but towards the ...| Posts on CuriousCoding
Table of Contents Motivation Parititioning A* memory by frontsNon-consistent heuristics Front indexing Tracing back the path Here is an idea to reduce the memory usage of A* by only storing one front at a time, similar to what Edlib and WFA do. Note that for now this will not work, but I’m putting this online anyway. Motivation In our implementation of A*PA, we use a hashmap to store the value of \(g\) of all visited (explored/expanded) states by A*. This can take up a lot of memory and sim...| Posts on CuriousCoding
Table of Contents Motivation Summary Why is A* slow? Computational volumes Dealing with pruningThoughts on more aggressive pruning Algorithm summary Challenges Results What about band-doubling?Maybe doubling can work after all? TODOs Extensions This post build on top of our recent preprint Groot Koerkamp and Ivanov (2024) and gives an overview of some of my new ideas to significantly speed up exact global pairwise alignment. It’s recommended you understand the seed heuristic and match pruni...| Posts on CuriousCoding
I made an improved version of the Oxford Bioinformatics latex template. See the Github repository.| Posts on CuriousCoding
Table of Contents Motivation Path traceback: two strategies ObservationsWhat information is needed for path tracing A pragmatic solution Another interpretation Affine costs Conclusion Figure 1: Only the red substitutions and blue indel need to be stored to trace the entire path. In this post I’ll discuss an idea to run WFA using less memory, while still allowing us to trace back the optimal path from the target state back to the start of the search.| Posts on CuriousCoding
Table of Contents Tricks with match bonus or how to fool Dijkstra’s limitationsEdit graph Algorithms PotentialsMultiple variants Some notes on algorithmsWFA A* Extending to different cost modelsAffine costs Substitution matrices But not local alignment EvaluationsUnequal string length Equal string lengths Conclusion Tricks with match bonus or how to fool Dijkstra’s limitations The reader is assumed to have basic knowledge about pairwise alignment and graph theory.| Posts on CuriousCoding
Table of Contents Notation Naming and style This is a growing list of notation and style decisions Pesho and I made during the writing of our paper, written down so that we don’t have to spend time on it again next time. Notation MathModulo: \(a\bmod m\) for remainder, \(a\equiv b\pmod m\) for equivalence. Alphabet\(\Sigma\), \(|\Sigma| = 4\) Sequences\(A = \overline{a_0\dots a_{n-1}} \in \Sigma^*\), \(|A| = n\) \(B = \overline{b_0\dots b_{m-1}} \in \Sigma^*\), \(|B| = m\) Edit distance \(\...| Posts on CuriousCoding
Table of Contents Diamond transition or how technicalities can break conceptsBut let’s take a closer look Conclusion Diamond transition or how technicalities can break concepts We assume the reader has some basic knowledge about pairwise alignment and in particular the WFA algorithm. In this post we dive into a potential 2x speedup of WFA — one that turns out not to work. Let’s take a look at one of the most important and efficient algorithms for pairwise alignment — WFA (Marco-Sola e...| Posts on CuriousCoding
These are some links and papers on bidirectional A* variants. Nothing insightful at the moment. small lectureintroduces \(h_f(u) = \frac 12 (\pi_f(u) - \pi_r)\). Not found a paper yet.An Improved Bidirectional Heuristic Search Algorithm (Champeaux 1977)introduces a bidirectional variantBidirectional Heuristic Search Again (Champeaux 1983)fixes a bug in the above paperEfficient modified bidirectional A* algorithm for optimal route-findingDidn’t read closely yet.A new bidirectional algorithm ...| Posts on CuriousCoding
cross references:BiWFA GitHub issue It seems that getting the meeting/overlap condition of BiWFA (Marco-Sola et al. (2023), Algorithm 1 and Lemma 2.1) correct is tricky. Let \(p := \max(x, o+e)\) be the maximal cost of any edge in the edit graph. As in the BiWFA paper, let \(s_f\) and \(s_r\) be the distances of the forward and reverse fronts computed so far. We prove the following lemma: Lemma Once BiWFA has expanded the forward and reverse fronts up to \(s_f\) and \(s_r\) and has found some...| Posts on CuriousCoding
These are some quick notes listing papers related to A* itself and variants. In particular, here I’m interested in papers that update \(h\) during the A* search, as a background for pruning. Specifically, our version of pruning increases \(h\) during a single A* search, and in fact the heuristic becomes in-admissible after pruning. Changing \(h\) The original A* paper has a proof of optimality. Later papers consider this also with heuristics that change their value over time.| Posts on CuriousCoding
These are the slides Pesho Ivanov and I presented at IGGSY 2022 on Astarix and A*PA. Drive: here Pdf: here| Posts on CuriousCoding
Benchmarking is harder than you think, even when taking into account this rule. This post lists some lessons I learned while attempting to run benchmarks for A* pairwise aligner. I was doing this on a laptop, which likely has different characteristics from CPUs in a typical server rack. All the programs I run are single threaded. Hardware Do not run while charging the laptopCharging makes the battery hot and causes throttling. Run either on battery power or with a completely full battery to p...| Posts on CuriousCoding
It’s not the need for faster software that motivates; it’s the mathematical discovery that needs sharing.| Posts on CuriousCoding
Table of Contents Pairwise alignment in subquadratic time Random model AlgorithmSeed heuristic Match pruning AnalysisExpanded statesExcess errors Algorithmic complexity This post is a proof sketch to show that A* with the seed heuristic (Groot Koerkamp and Ivanov 2024) does exact pairwise alignment of random strings with random mutations in near linear time. Pairwise alignment in subquadratic time Backurs and Indyk (2018) show that computing edit distance can not be done in strongly subquadra...| Posts on CuriousCoding
Table of Contents Gap open Gap close Symmetric alternatives Another symmetry Conclusions cross references:BiWFA GitHub issue In this post I will explore some variations of the recursion used by WFA/BiWFA for the affine version of the diagonal transition algorithm. In particular, we will go over a gap-close variant, and look into some more symmetric formulations. Gap open WFA (Marco-Sola et al. 2020) introduces the affine cost variant of the classic diagonal transition method. Let us call it a...| Posts on CuriousCoding
This is a growing list of ambiguous terms and their definitions. More of a place to store random remarks than a complete reference for now. diagonal transitionname introduced by Navarro (2001)approximateapproximate algorithm: an algorithms that does not always give the correct answer. $k$-approximate string matching: variant semi-global alignment where we find all matches of a pattern in a reference with at most \(k\) mistakes. Also approximate string matching: alternative name for global pai...| Posts on CuriousCoding
Table of Contents Variants of pairwise alignmentCost models Alignment types A chronological overview of global pairwise alignment Algorithms in detailClassic DP algorithmsCubic algorithm of Needleman and Wunsch (1970) A quadratic DP Local alignment Affine costs Minimizing vs. maximizing duality Four Russians method TODO \(O(ns)\) methodsTODO Exponential search on band TODO LCS: thresholds, $k$-candidates and contours TODO Diagonal transition: furthest reaching and wavefronts TODO Suffixtree f...| Posts on CuriousCoding
Note: this post extends the concept of multiple-path pruning presented in Poole and Mackworth (2017). Say we’re running A* in a graph from \(s\) to \(t\). \(d(s,t)\) is the distance we are looking for. An A* heuristic has to satisfy \(h(u) \leq d(u, t)\) to be admissible: the estimated distance to the end should never be larger than the actual distance to guarantee that the algorithm finds a shortest path.| Posts on CuriousCoding
Papers AStarix: Fast and Optimal Sequence-to-Graph Alignment Fast and Optimal Sequence-to-Graph Alignment Guided by Seeds AStarix is a method for aligning sequences (reads) to graphs: InputA reference sequence or graph Alignment costs \((\Delta_{match}, \Delta_{subst}, \Delta_{del}, \Delta_{ins})\) for a match, substitution, insertion and deletion Sequence(s) to align OutputAn optimal alignment of each input sequenceThe input is a reference graph (automaton really) \(G_r = (V_r, E_r)\) with e...| Posts on CuriousCoding
Neighbour joining (NJ, paper) is a phylogeny reconstruction method. It differs from UPGMA in the way it computes the distances between clusters. This algorithm first assumes that the phylogeny is a star graph. Then it finds the pair of vertices that when merged and split out gives the minimal total edge length \(S_{ij}\) of the new almost-star graph. (See eq. (4) and figure 2a and 2b in the paper.) \[ S_{i,j} = \frac1{2(n-2)} \sum_{k\not\in \{i,j\}}(d(i, k)+d(j,k)) + \frac 12 d(i,j)+\frac 1{n...| Posts on CuriousCoding
Unweighted pair group method with arithmetic mean (UPGMA) is a phylogeny reconstruction method. InputMatrix of pairwise distancesOutputPhylogenyAlgorithmRepeatedly merge the nearest two clusters. The distance between clusters is the average of all pairwise distances between them. When merging two clusters, the distances of the new cluster are the weighted averages of distances from the two clusters being merged.Complexity\(O(n^3)\) naive, \(O(n^2 \ln n)\) using heap.| Posts on CuriousCoding
Read The F*ing Error When you complain about an error without reading it first. When you assume you understand the problem halfway through reading the error, and only after more debugging you realize you failed to read properly.| Posts on CuriousCoding
Important deadlines require important procrastination.| Posts on CuriousCoding
Experiments and their analysis should be reproducible, and all data/figures in a paper should be reviewable. Pipelines (e.g. snakemake files) to generated them should be attached to the paper. I’ve asked for automated scripts to reproduce test data on 3+ github repositories now, and got a satisfactory answer zero times: WFA: https://github.com/smarco/WFA/issues/26 Link to a datadump on the block-aligner repository. Good to have actual data, but exactly how this data was created is unclear t...| Posts on CuriousCoding
Table of Contents Background$k$-mers Sketching MinHash Terminology Introduction Spaced $k$-mer Seeded DistanceImproving performance Analysis Pruning false positive candidate matches Phylogeny reconstructionRunning the algorithm TODO Assembly \[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \] Background Quickly finding similar pieces of D...| Posts on CuriousCoding
Let’s go over some reasons for why I’m writing this blog. The internet is more accessible than papers The inspiration for this blog is the post on Succinct de Bruijn Graphs by Alex Bowe. I think blog posts are a great way to quickly learn about new ideas and concepts, since they are usually more accessible than papers. A blog post can omit some of the more formal text required in papers and spend more time explaining things on an intuitive level. This way, scientific concepts can be under...| Posts on CuriousCoding
Here’s the customary how I made this site using X post. This site is built using Hugo and ox-hugo. The source is written in Org mode, which is converted to markdown by ox-hugo. To get started yourself, check out the initial commit of the source repository and build from there. Some notes: I’m using the Hugo-coder theme. Since the conversion from Org to markdown is done using an Emacs plugin, the emacs folder contains a simple init.el to import ox-hugo and a function to export all *.org fi...| Posts on CuriousCoding
1 print("Hello, World!") 1 std::cout<<"Hello, World!"<<std::endl;| Posts on CuriousCoding
Table of Contents Spaced \(k\)-mers Minimap SPAdes MUMmer4 BLASR Bowtie 2 Patternhunter Spaced seeds improve \(k\)-mer-based metagenomic classification LoMeX Meeting notes Concepts: Mapping Map a sequence onto a reference genome/dataset Assembly Build a genome from a set of reads de novo (implied): without using a reference genome Otherwise just called mapping Typical complicating factors: read errors non-uniform coverage insert size variation chimeric reads (?) bireads non-uniform read cover...| Posts on CuriousCoding
\[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\Z}{\mathbb Z} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \] Here is an idea for an algorithm to assemble long reads. Go over all sequences and sketch their windows using the Hamming distance preserving sketch method described here. This method may need some tweaking to also work with an indel rate of around 10%. Let’s...| Posts on CuriousCoding
Table of Contents Background$k$-mers Sketching MinHash Introduction Hamming Similarity SearchImproving performance Analysis Pruning false positive candidate matches Phylogeny reconstructionRunning the algorithm Assembly \[ \newcommand{\vp}{\varphi} \newcommand{\A}{\mathcal A} \newcommand{\O}{\mathcal O} \newcommand{\N}{\mathbb N} \newcommand{\ed}{\mathrm{ed}} \newcommand{\mh}{\mathrm{mh}} \newcommand{\hash}{\mathrm{hash}} \] Background Quickly finding similar pieces of DNA within large datase...| Posts on CuriousCoding
Xrefs: PR for Sway | AUR packagesway-inhibit-fullscreen-git Once upon a time, Chromium had a bug where using $mod+f in i3 to fullscreen the Chromium window changed the window to occupy the entire screen, but didn’t actually make Chromium enter full screen mode. According to some, those1 were2 the3 good4 days5, 6. Watching 4 YouTube streams in parallel was still possibly, back in those days: Without patches, the best we can do nowadays7 is the following| Posts on CuriousCoding
Table of Contents My aur packages Some issues I reported/fixed My aur packages List on aur.archlinux.org bapctools-git: BAPCtools is used for developing ICPC style programming contest problems. feh-preload-next-image-git: Branch of Feh that loads the next image to speed up browsing images in a remote directory. i3-focus-last-git: Window switcher for i3/sway. python-pyexiftool-nocheck: the original python-pyexiftool is outdated, orphaned, and still depends on python2. sway-inhibit-fullscreen-g...| Posts on CuriousCoding
Related posts: Dark mode with Vimium Vimium (Github, Chromium extension) is not only a great way to navigate webpages; it’s also a great help to quickly search many webpages. I am using it many times a day to search for just the documentation I need. Some of the search engines I have configured: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # Documentationarchwiki:https://wiki.archlinux.org/index.php?search=%s ArchWikiaur:https://aur.archlinux.org/packages/?K=%s AURcpp:https://en.cp...| Posts on CuriousCoding
This post goes over some useful utilities I have been using on my Wayland system. Screen brightness: light Light is a nice tool to manage screen and keyboard brightness. Install light Add your user to the video group: usermod -aG video <user> I really like the light -T flag, which multiplies the current brightness by some value. This way you can have fine grained control both for very low and very high brightness values. To prevent yourself from decreasing the brightness all the way to 0, you...| Posts on CuriousCoding
Table of Contents 1 Introduction 1.1 Problem statement 1.2 Motivation 1.3 Recommended reading 1.4 Binary search and Eytzinger layout 1.5 Hugepages 1.6 A note on benchmarking 1.7 Cache lines 1.8 S-trees and B-trees 2 Optimizing find 2.1 Linear 2.2 Auto-vectorization 2.3 Trailing zeros 2.4 Popcount 2.5 Manual SIMD 3 Optimizing the search 3.1 Batching 3.2 Prefetching 3.3 Pointer arithmetic 3.3.1 Up-front splat 3.3.2 Byte-based pointers 3.3.3 The final version 3.4 Skip prefetch 3.5 Interleave 4 O...| CuriousCoding