seqtree

Fast fuzzy search over biological sequences (amino-acid or nucleotide), as a C++ core with a minimal Python binding. Build an immutable index once, then search single queries or massive batches in parallel.

Two search engines over one trie:

seqtm — branch-and-bound enumeration. Exact per-type edit caps (max_subs / max_ins / max_dels) and a fast Hamming-only path. Best for small edit distances (UMI collapse, error correction, CDR3/epitope matching).
seqtrie — banded edit-distance DP. Matrix-weighted score budgets (BLOSUM62 + gap costs), cost independent of edit count. Best for similarity-scored searches.

engine="auto" picks one per query. Results are payload-agnostic: (ref_id, score, n_subs, n_ins, n_dels). Downstream libraries map ref_id back to their own payloads (V gene, MHC, counts) and filter.

Beyond search, seqtree ships:

Substitution matrices — built-in BLOSUM62 and PAM50, plus custom matrices via SubstitutionMatrix.from_similarity (Gram-distance penalty s(a,a)+s(b,b)−2·s(a,b)).
E-values / significance — calibrate hit counts against a background control repertoire (load_control + evalues), the TCRNET approach on a finite-sample footing. See the E-value guide.

Install

pip install seqtree       # prebuilt wheels for CPython 3.10–3.13

Prebuilt wheels cover Linux x86-64, macOS arm64 (Apple Silicon), and Windows x86-64. There are no Intel/x86-64 macOS wheels — Intel Macs build from source (see below), which just needs a C++17 compiler and CMake (pulled in automatically by the build).

Build from source

bash setup.sh            # repo-local .venv + editable install
bash setup.sh --tests    # + pytest
bash setup.sh --bench     # + benchmark deps (huggingface_hub, pandas, psutil)

Quickstart

import seqtree

idx = seqtree.Index.build(["CASSLAPGATNEKLFF", "CASSLELGATNEKLFF"], alphabet="aa")

p = seqtree.SearchParams(max_subs=2, engine="seqtm")
for hit in idx.search("CASSLAPGATNEKLFF", p):
    print(hit.ref_id, hit.score, hit.n_subs)

# parallel batch (releases the GIL)
results = idx.search_batch(queries, p, threads=0)   # 0 = all cores

# matrix-weighted budget
pm = seqtree.SearchParams(matrix="BLOSUM62", max_penalty=12, engine="seqtrie")
top = idx.search_top("CASSLAPGATNEKLFF", pm, k=5)

# alignment on demand
aln = idx.align(0, "CASSLELGATNEKLFF", p)
print(aln.aligned_query, aln.aligned_ref, aln.ops)

# batch-vs-batch (auto-indexes the larger set)
pairs = seqtree.pairwise_batch(query_set, db_set, p, alphabet="aa")

# E-values against a background control repertoire (TCRNET-style significance)
control = seqtree.load_control("human_trb_aa", size=1_000_000)
target = seqtree.Index.build(vdjdb_cdr3s, alphabet="aa")
for q, r in zip(queries, seqtree.evalues(target, control, queries, p)):
    if r["p_enrichment"] < 1e-3:
        print(q, r["E"], r["n_target"], r["n_control"])

Tests

cmake -S . -B build -G Ninja -DSEQTREE_TESTS=ON
cmake --build build
ctest --test-dir build           # C++ unit tests
pytest tests/python              # Python tests

Benchmarks

python bench/bench_gnuplot.py        # throughput / scaling / matrix / collisions → SVG (needs gnuplot)
python bench/bench.py                # recall vs ground truth (real VDJdb data)
python bench/bench_evalue.py         # true E-value benchmark (target vs background control)
python bench/bench_evalue_matrix.py  # significance across reference/control/query/scope grid
python bench/bench_epitope.py        # epitope detection-complexity (GIL vs NLV)

Figures (throughput, scaling, matrix-scoring overhead, collisions, E-value matrix, epitope detection) and the full methodology are in the benchmarks docs. Set RUN_BENCHMARK=1 for the large tiers.

Development

This repo follows git-flow:

master — stable, release-ready; CI + docs deploy run here.
dev — integration branch for day-to-day work.
feature branches branch off dev and merge back via PR; releases merge dev → master.

Roadmap (affine gaps, position-specific matrices, succinct memory packing) lives in docs/roadmap.rst. Control-set E-values already ship — see the E-value guide.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github/workflows		.github/workflows
appendix		appendix
bench		bench
docs		docs
include/seqtree		include/seqtree
python/seqtree		python/seqtree
src		src
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

seqtree

Install

Build from source

Quickstart

Tests

Benchmarks

Development

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

seqtree

Install

Build from source

Quickstart

Tests

Benchmarks

Development

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages