feat: Reverse Translation Mapping Support #108
Draft
bencap wants to merge 5 commits into
Add MappingOutcome so every (variant, level) record carries a typed outcome. This distinguishes a benign absence (intronic, no protein consequence) from a genuine failure, a distinction that error_message alone cannot convey. Derive the preferred (authoritative) layer from the target's assay level rather than always preferring genomic.
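As a rough illustration of the idea in this commit, a typed outcome and an assay-driven preferred layer might look like the sketch below. Only the name MappingOutcome comes from the commit message; the member names, the Layer enum, the LayerRecord dataclass, the preferred_layer helper, and the assumption that the assay level is spelled like a layer name are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    GENOMIC = "genomic"
    CDNA = "cdna"
    PROTEIN = "protein"


class MappingOutcome(Enum):
    # Member names are illustrative guesses, not the PR's actual values.
    MAPPED = "mapped"
    NO_CONSEQUENCE = "no_consequence"  # benign absence, e.g. intronic at the protein layer
    FAILED = "failed"                  # genuine failure; error_message says why


@dataclass(frozen=True)
class LayerRecord:
    """One record per (variant, layer), always carrying a typed outcome."""
    variant_id: str
    layer: Layer
    outcome: MappingOutcome
    error_message: Optional[str] = None


def preferred_layer(assay_level: str) -> Layer:
    """Pick the authoritative layer from the assay level.

    Assumes the assay level is spelled like a layer name; anything else
    falls back to genomic, mirroring the earlier always-genomic behaviour.
    """
    try:
        return Layer(assay_level.lower())
    except ValueError:
        return Layer.GENOMIC
```

With an explicit outcome, a consumer can treat NO_CONSEQUENCE as expected rather than inferring intent from whether error_message happens to be set.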
…c projection
Select a coding transcript for NC_ protein-coding targets via Ensembl locus overlap plus MANE, so they are no longer silently skipped by reverse translation. Project each measured variant onto its deterministically reachable layers (g<->c, nucleotide->p), emitted as typed-outcome records and routed by preferred_layer_only.
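The projection rule (g<->c, nucleotide->p) can be read as a small reachability table. The sketch below is one interpretation under stated assumptions: the layer names, the REACHABLE table, and the layers_to_emit helper are hypothetical, and treating preferred_layer_only as a filter that keeps just the preferred layer is a guess at its semantics.

```python
from typing import Dict, Tuple

# Layers deterministically reachable from the measured layer:
# genomic and cDNA interconvert, either nucleotide layer projects to
# protein, and protein cannot be projected back to a unique nucleotide change.
REACHABLE: Dict[str, Tuple[str, ...]] = {
    "genomic": ("cdna", "protein"),
    "cdna": ("genomic", "protein"),
    "protein": (),
}


def layers_to_emit(measured: str, preferred: str,
                   preferred_layer_only: bool) -> Tuple[str, ...]:
    """Return the layers for which a record would be emitted."""
    layers = (measured, *REACHABLE[measured])
    if preferred_layer_only:
        # Assumed semantics: keep only the preferred layer, if reachable.
        return tuple(layer for layer in layers if layer == preferred)
    return layers
```

Under this reading, a protein-measured variant on a genomic-preferred target yields no preferred-layer record from projection alone.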
…ferred-layer records
- Track represented variant IDs at the preferred layer; re-attribute only variants that have no preferred-layer record, avoiding duplicate mapped_scores for variants with both a dead genomic attempt and a measured protein record (e.g. codon-optimised targets).
- Synthesize a preferred-layer failure for variants that mapped only at a non-preferred layer (e.g. wild-type p.= on a genomic-preferred target) so every input variant always has exactly one output record.
- Extract _map_protein_layer in vrs_map to return (mapping, reason) instead of an ad-hoc error MappedScore; a row that maps at no layer is failed once, layer-agnostically, carrying the detailed reason.
- Add TestNullFailureDedup and TestMapProteinLayerReason test coverage.
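The "exactly one output record per input variant" invariant can be sketched as follows. This is not the PR's code: the Record dataclass, the one_record_per_variant helper, and the failure messages are hypothetical, and the sketch assumes at most one input record per (variant, layer).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    variant_id: str
    layer: str
    mapped: bool
    error_message: Optional[str] = None


def one_record_per_variant(variant_ids: Iterable[str],
                           records: Iterable[Record],
                           preferred: str) -> List[Record]:
    """Keep preferred-layer records; synthesize a failure where none exists."""
    records = list(records)
    out = [r for r in records if r.layer == preferred]
    represented = {r.variant_id for r in out}  # IDs already covered
    for vid in variant_ids:
        if vid in represented:
            continue  # never add a second record for a covered variant
        others = [r for r in records if r.variant_id == vid]
        # Reuse a detailed reason from another layer when one exists.
        reason = next((r.error_message for r in others if r.error_message), None)
        if reason is None:
            reason = ("mapped only at a non-preferred layer" if others
                      else "no mapping at any layer")
        out.append(Record(vid, preferred, False, reason))
        represented.add(vid)
    return out
```

Tracking the represented IDs first is what prevents a dead genomic attempt from producing a second record for a variant that already has a measured protein record.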
…ignments
For protein-vs-DNA BLAT alignments, qcoords always increase (protein
reads N→C regardless of genome strand), so they cannot be used to
detect strand. Switch to tcoords direction for protein queries.
Also normalise hit_subranges and hit_range entries with min/max so
they are always in ascending order, which they are not when the target
gene sits on the minus strand.
- Use tcoords direction (not qcoords) for strand detection when -q=prot is in blat_params
- Wrap hit_subrange and hit_range endpoints in min/max in both _get_best_match and align_target_to_protein
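A minimal sketch of the two fixes described above, assuming each alignment block is a (start, end) pair whose endpoints may be descending on the minus strand. The Block alias and the helpers is_protein_query, detect_strand, and normalise_ranges are hypothetical names; the real functions touched are _get_best_match and align_target_to_protein, whose internals are not shown here.

```python
from typing import List, Sequence, Tuple

Block = Tuple[int, int]  # assumed representation: (start, end) per aligned block


def is_protein_query(blat_params: str) -> bool:
    """True when BLAT was run with a protein query."""
    return "-q=prot" in blat_params


def detect_strand(qcoords: Sequence[Block], tcoords: Sequence[Block],
                  blat_params: str) -> int:
    """Return 1 for plus strand, -1 for minus strand.

    Protein query coordinates always ascend (N to C), so for protein
    queries the direction of the target coordinates is used instead.
    """
    coords = tcoords if is_protein_query(blat_params) else qcoords
    first, last = coords[0][0], coords[-1][1]
    return 1 if last >= first else -1


def normalise_ranges(blocks: Sequence[Block]) -> Tuple[List[Block], Block]:
    """Order each block's endpoints ascending and compute the overall hit range."""
    subranges = [(min(s, e), max(s, e)) for s, e in blocks]
    hit_range = (min(s for s, _ in subranges), max(e for _, e in subranges))
    return subranges, hit_range
```

For example, target blocks (500, 470) and (400, 370) against an ascending protein query indicate the minus strand, and normalising them gives ascending pairs with an overall range of (370, 500).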
_get_mapped_reference_sequence had no CDNA branch, so it fell through to the genomic chromosome lookup and returned the NC_ accession as the post_mapped reference for the cdna layer. This caused target_genes post_mapped_metadata to carry NC_000017.11 under the "cdna" key instead of the NM transcript. Add a CDNA branch that resolves the NM/ENST accession from tx_output.nm (preferred, covers NC_/sequence-based targets) or from the target's own accession when it is already an NM_/ENST (cdna-source targets). Return None rather than a chromosome when no NM is resolvable. Also add unit tests for all three layer paths.
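The resolution order for the CDNA branch can be sketched as a standalone function. The name resolve_cdna_reference and its flat parameters are hypothetical; in the PR the logic lives inside _get_mapped_reference_sequence and reads tx_output.nm and the target's accession from their own objects. The accessions in the usage lines are placeholders, apart from NC_000017.11, which is quoted from the commit message.

```python
from typing import Optional


def resolve_cdna_reference(tx_nm: Optional[str],
                           target_accession: Optional[str]) -> Optional[str]:
    """Resolve the transcript accession to report for the cdna layer.

    1. Prefer the transcript chosen during mapping (tx_output.nm in the PR).
    2. Otherwise use the target's own accession if it is already a transcript.
    3. Otherwise return None; never fall back to a chromosome accession.
    """
    if tx_nm:
        return tx_nm
    if target_accession and target_accession.startswith(("NM_", "ENST")):
        return target_accession
    return None


# Placeholder accessions for illustration only.
print(resolve_cdna_reference("NM_000000.0", "NC_000017.11"))  # transcript wins
print(resolve_cdna_reference(None, "NC_000017.11"))           # None, not the chromosome
```

Returning None makes the missing transcript visible downstream instead of silently recording a chromosome under the "cdna" key.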
No description provided.