Scripts for sequencing data processing and analysis used in the Sherwood Lab at BWH. Organized by project type, with each project folder containing cluster job submission scripts (qsub_script/) and analysis/parsing scripts (script/). Cluster scripts are in SGE qsub format targeting the BGM compute cluster.
Repository Structure
.
├── Tian_demult_scripts/ # FASTQ demultiplexing for dual-indexed libraries
├── NGS_analysis_proj/ # ACCESS-seq & ATAC-seq analysis
├── PPIseq_proj/ # PPI-seq (protein-protein interaction sequencing)
├── gRNA_ct_proj/ # gRNA counting for pooled CRISPR screens
└── read_ct_proj/ # Barcode/read counting for reporter assays
Tian_demult_scripts/
Custom demultiplexing pipeline for dual-indexed paired-end libraries. Handles Illumina-format reads (two index reads + two genomic reads) as well as output from AVITI and BPFNGS sequencers. Matching supports exact-match and one-base wildcard (N) tolerance. Includes documentation (BPFNGS_demultiplex_instructions.Rmd/.pdf) with step-by-step instructions for running the BPFNGS demultiplexing workflow.
Core demultiplexers
Script
Description
demult_2reads_2index.pl
Splits paired-end R1/R2 reads into per-sample gzip-compressed FASTQ files by matching I1 + I2 index sequences. Supports N-wildcard positions in index references for flexible matching. Reports total and passed read counts per run.
demult_2reads_2index_1primer.pl
Extension of the above that additionally requires a primer sequence match within R1 (fuzzy or exact), and trims the matched primer from output reads. Useful when samples share an index pair but differ by a leading primer.
demult_2reads_2index_sc.pl
Single-cell variant of the dual-index demultiplexer; adapted for libraries where cell barcodes are embedded in the read structure.
Three-step pipeline for BPFNGS sequencer output, with a corresponding SGE submission script.
Utilities
Script
Description
count_index_pairs.pl
Reads I1 and I2 FASTQ files and tallies the frequency of every observed index-pair combination, then outputs results sorted by count. Used to audit which index combinations are present before demultiplexing.
check_R1_by_index_pairs.pl
Cross-checks R1 reads against a set of expected index pairs to verify correct library composition.
get_index_files.pl
Parses barcode pairs embedded in read headers (format NNNNNN+NNNNNN) and writes them as separate I1 and I2 FASTQ files with synthetic quality scores. Useful when index reads are not provided separately.
fix_fq.pl
Repairs malformed FASTQ records by trimming any quality string that is longer than its sequence down to the sequence length.
fq_count_Xnt.pl
Counts the number of reads with a specified sequence length in a FASTQ file.
merge_identical_files.sh
Merges files with identical content across sequencing lanes into a single output.
swap_filenames.sh
Swaps filenames between two files (used for correcting sample-name mix-ups).
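The N-wildcard index matching described for the core demultiplexers can be illustrated with a short Python sketch. It is not the logic of demult_2reads_2index.pl itself: the function names, the sample-sheet layout (sample name mapped to an I1/I2 reference pair), and the first-match-wins policy are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple


def index_matches(reference: str, observed: str) -> bool:
    """True if observed equals reference, treating 'N' in the reference as a wildcard."""
    if len(reference) != len(observed):
        return False
    return all(r == "N" or r == o for r, o in zip(reference, observed))


def assign_sample(i1: str, i2: str,
                  samples: Dict[str, Tuple[str, str]]) -> Optional[str]:
    """Return the first sample whose I1 and I2 references both match, else None.

    `samples` maps sample name -> (I1 reference, I2 reference); this layout
    is hypothetical and only serves the example.
    """
    for name, (ref1, ref2) in samples.items():
        if index_matches(ref1, i1) and index_matches(ref2, i2):
            return name
    return None


# Hypothetical index references; 'N' marks a position that accepts any base.
samples = {
    "sampleA": ("ACGTNCGT", "TTGCAAGC"),
    "sampleB": ("GGATCCAA", "CCNTTAGG"),
}

hit = assign_sample("ACGTACGT", "TTGCAAGC", samples)   # matches sampleA via the wildcard
miss = assign_sample("ACGTACGT", "TTGCAAGG", samples)  # I2 differs at a non-wildcard base
```

A read pair is written to a sample's output only when both index reads match, so a single non-wildcard mismatch in either index leaves the read unassigned in this sketch.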
NGS_analysis_proj/
End-to-end analysis pipeline for ACCESS-seq (Adenosine or Cytosine Deaminase Chromatin Editing Sequencing) and ATAC-seq. ACCESS-seq uses base editors (Ddd enzymes) to generate C→T or A→G edits as proxy signals for chromatin accessibility and TF occupancy. The pipeline covers alignment, base-level edit calling, motif-level analysis, machine learning–based TF binding state classification, and specialized workflows for single-cell, duplex sequencing, and variant-ACCESS applications.
qsub_script/ — Cluster Job Scripts
Alignment & preprocessing
Script
Description
ACCESS_bwameth_mapping.qsub
Aligns ACCESS-seq paired-end reads to the reference genome using bwameth (bisulfite-aware aligner treating C→T edits analogously to bisulfite conversions). Runs with 48 threads on the BGM cluster.
ACCESS_bwameth_mapping_DuplexSeq.qsub
bwameth alignment variant for duplex sequencing libraries.
ACCESS_bwameth_mapping_scPBMC.qsub
bwameth alignment variant for single-cell PBMC ACCESS-seq libraries.
ACCESS_hisat3n_mapping.qsub
Alternative alignment using HISAT-3N, which natively handles C→T or A→G conversion reads and may provide better splice-aware mapping.
ACCESS_iterative_mapping.qsub
Iterative trimming-and-mapping strategy: progressively trims reads that fail to align and reattempts mapping, improving recovery of edge-heavy reads.
scACCESS_preprocess.qsub
Preprocessing for single-cell ACCESS-seq (e.g., barcode extraction and FASTQ formatting).
scACCESS_iterative_mapping.qsub
Iterative mapping variant for scACCESS libraries.
cutadapt.qsub
Adapter trimming with Cutadapt before alignment.
fastq_sample.qsub
Random subsampling of FASTQ files for downsampling experiments.
Standard ATAC-seq pipeline using Bowtie2 or BWA for alignment, followed by peak calling.
Quality control
Script
Description
ACCESS_quality_metric.qsub
Computes library quality metrics including mapping rate, duplicate rate, fragment length distribution, and edit rate summaries for ACCESS-seq data.
ACCESS_tss_enrichment.qsub
Calculates TSS enrichment score (a standard ATAC-seq QC metric) from the aligned BAM using ENCODE-style normalization.
ACCESS_fragment_length.qsub
Generates fragment length distribution from BAM to assess nucleosomal banding patterns.
ACCESS_read_count.qsub
Per-sample read counting at various stages of the pipeline.
scACCESS_quality_metric.qsub
QC metrics specific to single-cell ACCESS libraries (per-barcode statistics).
Base counting & edit rate
Script
Description
ACCESS_mpileup_baseCt.qsub
Runs samtools mpileup on aligned BAM files and pipes output to the base-count parsing script, producing per-position A/T/G/C counts across the genome (or defined regions). This is the core step for edit rate computation.
baseCtByPos.qsub
Aggregates base counts by genomic position into a compact format.
get_edit_rate.qsub
Computes position-specific C→T or A→G edit rates from the base count table output by mpileup.
get_genotype_ct.qsub
Counts genotypes (reference vs edited vs other) at each position from base count files.
Base count extraction scripts tailored to specific assay designs (HDR minipool, MIAA, and MS2 hairpin variants).
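As a rough illustration of how an edit rate can be derived from a per-position base-count table, the sketch below computes the edited fraction at one position. The formula used here, edited / (reference + edited), is an assumption; the repository scripts may define the denominator differently (for example, total coverage). The function name and the count dictionary layout are hypothetical.

```python
from typing import Dict, Optional


def edit_rate(counts: Dict[str, int], ref_base: str, edited_base: str) -> Optional[float]:
    """Fraction of edited calls among reference + edited calls at one position.

    Returns None when neither base was observed, so that uncovered
    positions are not reported as a 0% edit rate.
    """
    ref_n = counts.get(ref_base, 0)
    edit_n = counts.get(edited_base, 0)
    denom = ref_n + edit_n
    if denom == 0:
        return None
    return edit_n / denom


# Hypothetical A/T/G/C counts at a reference-C position.
position_counts = {"A": 1, "T": 25, "G": 0, "C": 75}
c_to_t = edit_rate(position_counts, "C", "T")  # 25 / (75 + 25)
```

The same function covers A→G editing by passing `"A"` and `"G"` as the reference and edited bases.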
Chromatin accessibility & motif analysis
Script
Description
ACCESS_chromAcc_bin_edits.qsub
Runs chromAcc_bin_analysis.py to stratify edit fractions across chromatin accessibility bins (using a BigWig accessibility track), revealing how base editing efficiency correlates with open vs. closed chromatin.
ACCESS_motif_edit_rate.qsub
Runs get_motif_edit_rate.py to calculate C→T/A→G edit rates at positions flanking TF binding motifs, revealing TF footprints as protected (low-edit) sites within higher-edit open chromatin.
ACCESS_bg_edit_pattern.qsub
Characterizes background edit patterns (e.g., sequence context effects, strand bias) to enable normalization of motif-level signals.
ACCESSseq_selected_TF.qsub
Targeted motif edit rate analysis for a curated set of TFs.
Single-cell ACCESS-seq
Script
Description
scACCESS_sep_hyperpool.qsub
Demultiplexes hyperpooled scACCESS libraries by separating reads into per-subpool FASTQ files based on embedded pool barcodes.
bam_mapping_frac.qsub
Reports the fraction of reads mapped to various reference regions for QC in single-cell experiments.
Variant ACCESS-seq (varACCESS)
Script
Description
varACCESS_get_edits.qsub
Extracts all edit positions from aligned varACCESS reads.
Detects and counts single-allele editing events from ACCESS-seq data.
ACCESS_mutationPerPos.qsub
Counts all mutation types per genomic position (not limited to expected edit type).
ACCESS_iterative_mapping.qsub
(See alignment section above.)
ACCESS_compile_twin_reads_by_chr.sh
Chromosome-parallelized version of twin read compilation.
access_codec_master.sh
Master shell script orchestrating the full ACCESS CODEC pipeline: sample demultiplexing (step 1), adapter trimming via Cutadapt (step 2), and consensus generation (step 3), driven by a CSV sample sheet.
script/ — Analysis Scripts
Alignment & BAM utilities (Python)
Script
Description
bam2editMatrix.py
Reads a BAM file and constructs a co-occurrence matrix of paired C→T / G→A edits centered on a TF motif window. Shows which pairs of positions within a motif tend to be co-edited on the same read, enabling analysis of clonal editing patterns.
bam2editMatrix_individual.py
Per-read variant of bam2editMatrix.py that outputs individual read edit vectors rather than an aggregated matrix.
bam2mutationPerPos.py
Tallies all mutation types at every position in a BAM file, independent of expected edit type. Useful for distinguishing true edits from sequencing error or other variants.
bam2fragment_len.py
Extracts fragment length from paired-end BAM records and outputs a distribution.
bam_to_bw.py
Converts BAM coverage to BigWig format for genome browser visualization.
download_bigwig.py
Downloads BigWig files from ENCODE or other remote sources given a list of URLs.
Convert samtools mpileup output to a compact per-position base-count CSV (A/T/G/C columns), stripping indel notation. The _byChr variant processes one chromosome at a time for parallel execution.
get_baseCt_MGHNGS.pl / get_baseCt_MGHNGS_v2.pl
Demultiplex and base-count reads from MGHNGS-format amplicon libraries: matches each read to a template sequence, then counts A/T/G/C at each position for both forward and reverse strands, outputting edit rates per sample.
MS2hp_PE_baseCt.pl
Base counting for the MS2 hairpin paired-end assay: extracts and counts bases at defined positions within MS2hp constructs.
baseCtByPos.pl
Aggregates base counts by position across a set of reads matched to a template.
Compute C→T, A→G (or other) edit rates per position from a template-matched FASTQ: read start sequences are matched to a template, edits at defined positions are tallied, and per-position edit rates are output as CSV. V2/V3 add improved matching logic and additional output fields.
get_genotype_ct.pl / get_genotype_ct_v2.pl
Count how many reads carry each combination of reference and edited bases at specified positions (i.e., genotyping at edit sites).
get_edit_stats.py
Computes per-read and per-position edit statistics from an ACCESS BAM: outputs the distribution of C→T and G→A edit counts per read, and per-position base-change frequencies, for library QC.
Core ACCESS-seq analysis: given a BAM, reference FASTA, and a BigWig accessibility track, segments the genome into bins by accessibility score and calculates C→T / G→A edit fractions per bin. Generates plotnine visualizations and CSV output. The multithread version parallelizes over chromosomes; v2 adds additional filtering and output options.
acc_score_bin_edits.py
Similar binning analysis focused on peak-based accessibility scores; supports strand-aware relative positioning and trinucleotide context stratification.
Calculates edit fractions at positions flanking annotated TF binding motifs: reads motif peak BED files, fetches BAM reads over each motif, counts C→T / G→A edits at each relative position, and outputs edit rate profiles for footprint detection. V4 adds flexible di/tri-nucleotide motif detection and peak score filtering. _v2_lite is a memory-efficient version for large datasets.
get_motif_edit_matrix.py
Constructs an edit matrix (reads × relative positions) centered on motif sites; used as input for downstream modeling.
Calculates Ddd enzyme edit fractions for arbitrary user-defined genomic regions (specified as coordinate strings), with optional strand-aware positioning.
get_bg_edit_pattern.py
Characterizes background sequence-context–dependent edit rates (e.g., TCA vs. TCG context bias), used to normalize motif signals against background deaminase preferences.
Preprocesses FIMO motif BED files for ACCESS analysis: step 1 labels motif sites as active or inactive by overlapping with ChIP-seq peaks and filters by read coverage; step 2 further filters and formats BED files for model training.
3-state TF binding model (Python + TensorFlow)
The 3-state model classifies ACCESS-seq reads (or motif sites) as unbound, bound, or recently-bound based on the pattern of edits around TF binding motifs. Training uses read-level feature matrices from active and inactive motif sites.
Script
Description
calc_motif_features_step1.py
Step 1 of feature extraction: given a BAM, motif BED directory, cell type, and TF name, extracts per-read edit vectors centered on each motif instance and saves them as part files.
V2 of step 1 adds FASTA-based sequence features; step 2 aggregates the part files from step 1 into a single HDF5 file per TF/cell-type combination.
create_trainingSet_3stateModel.py / …_v2.py
Assembles training and test sets from HDF5 feature files: samples read vectors from active and inactive motif sites, assigns 3-state labels, and stores in h5py format for model training.
Trains TensorFlow neural network classifiers (one model each for base coverage channel, edit channel, and combined) on the 3-state training data, using class-weighted loss to handle label imbalance. Saves trained model dictionaries as pickle files.
evaluate_3StateModel_single_h5.py / …_v2.py
Evaluates trained models on held-out test data, computing AUPRC, AUROC, F1, and per-class fractions per TF motif.
evaluate_3StateModel_summarize.py
Aggregates evaluation metrics across all tested TF/cell-type pairs, filters high-confidence motifs by Spearman correlation with known binding data, and generates summary boxplots and CSV tables.
Co-binding analysis using model predictions: step 1 identifies nearby motif pairs (5–100 bp), computes 3-state read fractions per pair, and tests for statistical enrichment (Fisher's exact + Mann-Whitney U); step 2 summarizes odds ratios and read-type medians across all pairs.
cobinding_definedLabel.py
Co-binding analysis variant that uses user-defined activity labels rather than model predictions.
Extracts all paired-end reads belonging to a single specified cell barcode from FASTQ, writing a per-cell R1/R2 pair. Useful for targeted single-cell re-analysis.
sc_separate_hyperpool.pl
Demultiplexes hyperpooled scACCESS FASTQ by matching pool barcodes from R2, trims adapters, and outputs one gzip-compressed FASTQ pair per subpool with a read-distribution summary.
Count cell barcodes from either a BAM (via read tags) or FASTQ (via sequence matching), outputting per-barcode read counts. Used to assess barcode diversity and identify high-quality cells.
map_sc_barcodes.pl
Maps observed cell barcodes to a reference whitelist with 1-mismatch tolerance, collapsing barcode variants to their nearest valid barcode.
sc_edit_stats.py
Aggregates per-barcode ACCESS-seq edit statistics (total reads, edited reads, C→T count, G→A count) from a BAM file; supports optional filtering against a barcode whitelist.
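Whitelist mapping with 1-mismatch tolerance, as described for map_sc_barcodes.pl, can be sketched as follows. This is an illustrative Python version, not a port of the Perl script: the function names are hypothetical, and discarding barcodes that are one mismatch away from more than one whitelist entry is an assumed policy.

```python
from typing import Iterator, Optional, Set


def hamming_neighbors(barcode: str, alphabet: str = "ACGT") -> Iterator[str]:
    """Yield every sequence that differs from barcode at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def correct_barcode(observed: str, whitelist: Set[str]) -> Optional[str]:
    """Map an observed barcode to the whitelist, allowing one mismatch.

    Exact matches win. Otherwise the barcode is corrected only when exactly
    one whitelist entry is a single substitution away (assumed policy:
    ambiguous barcodes are dropped rather than guessed).
    """
    if observed in whitelist:
        return observed
    hits = {n for n in hamming_neighbors(observed) if n in whitelist}
    return hits.pop() if len(hits) == 1 else None


# Hypothetical 4 nt whitelist, kept short for readability.
whitelist = {"AAAA", "CCCC", "AATT"}
```

Enumerating the neighbors of the observed barcode keeps each lookup proportional to barcode length rather than whitelist size, which matters for whitelists with many entries.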
Variant ACCESS-seq (Python / Perl)
varACCESS is a modification of ACCESS-seq that encodes TF binding site variants as short "phrase" sequences, allowing simultaneous measurement of accessibility at thousands of sequence variants in a single experiment.
Script
Description
varACCESS_match_phrases.py
Filters length-selected FASTQ reads and matches them against a phrase library (loaded from XLSX), allowing C→T / G→A edits plus a defined number of mismatches. Processes in parallel; classifies reads as C2T-dominant, G2A-dominant, or mixed; outputs matched reads as FASTA and an optional edit-position summary CSV.
varACCESS_match_phrases_v2.py
Updated version with additional filtering options and improved parallelization.
varACCESS_fasta_to_edits.py
Tallies C→T and G→A edit positions from phrase-matched FASTA files, processing headers to recover phrase identity and edit type; outputs per-position and per-phrase edit fraction CSVs.
Perl-based edit extraction for varACCESS reads: matches reads to phrase library, calls edits at each position, and outputs edit tables. _allPos version reports all positions rather than only predefined ones.
varACCESS_match_stats.py
Summarizes phrase match rates, edit-type fractions, and per-phrase coverage statistics across a run.
liftover_vcf_hg19_to_hg38.py / …_v2.py
Converts VCF coordinates from hg19 to hg38 using pyliftover chain files; supports SNV-only and primary-chromosome-only filters; optionally outputs a CSV with coordinate mapping metadata.
Duplex sequencing (Python / Perl)
Script
Description
process_duplex_seq.py
Preprocesses raw duplex sequencing FASTQ: extracts 5 bp end-barcodes from read start/end to construct unique molecule identifiers (UMIs), and outputs trimmed/labeled FASTQ files for downstream consensus calling.
detect_twin_reads.py
Scans a BAM for read pairs where one read shows predominantly C→T edits and the other shows predominantly G→A edits at the same positions (i.e., "twin" reads from opposite strands of the same molecule); outputs pairing statistics.
Merges detected twin read pairs into consensus sequences by combining the complementary edit information from both strands; the _by_chr version parallelizes over chromosomes.
duplex_seq_to_twin_csv.py
Exports identified twin read pair data to CSV for summary analysis.
ACCESS CODEC pipeline (Python)
ACCESS CODEC (COnsensus Deaminase Chromatin sequencing) is a library preparation method that generates overlapping paired-end reads to improve single-molecule accuracy.
Script
Description
access_codec_step1.py
Demultiplexes ACCESS CODEC FASTQ files by matching expected read-start sequences (loaded from a sample sheet); trims matched sequences and outputs per-sample FASTQ pairs.
access_codec_step2.py
Detects overlapping regions between R1 and R2 reads, generates a consensus sequence from the overlap, and resolves C/T and G/A ambiguities using both strands; outputs consensus FASTA with edit-position annotations and error-rate statistics.
access_codec.ipynb
Jupyter notebook for interactive exploration and QC of ACCESS CODEC results.
Parse 150 nt pegRNA reporter assay reads: match reads to the pegRNA library (guide RNA + PBS + reporter sequence), then count unedited vs. edited reporter variants per construct to quantify prime editing efficiency. The _fused variant handles FLASH-merged (fused) read input.
pegRNA_endo_150nt_step1.pl / …_step2.pl
Two-step pipeline for endogenous pegRNA editing: step 1 extracts and matches reads to endogenous amplicon sequences; step 2 counts base identities at the edit site to measure in-situ prime editing efficiency.
Other assays (Perl)
Script
Description
ShuttleSeq_parse.pl / ShuttleSeq_parse_v2.pl
Parse ShuttleSeq reads: match a fixed scaffold sequence (allowing C/T or G/A degeneracy at edit positions), extract the gRNA insert, and count per-gRNA reads with edit state classification.
TFome_reporter_get_edits.pl
Extract and count edits from TFome reporter reads: matches reads to TF-editor fusion constructs and reports per-TFBS-phrase edit counts, supporting multiple edit modes (A2G, G2A, T2C, C2T).
vORF_barcode_ct.pl
Counts barcodes associated with variant ORFs from FASTQ by matching a fixed pre-sequence and extracting barcode sequences.
HDRminipool_build_dict.pl
Builds a barcode-to-construct dictionary for HDR minipool libraries by parsing reads and collapsing barcodes within Hamming distance 1.
HDRminipool_edit_count.pl
Counts edits per construct in HDR minipool assays using the dictionary built by HDRminipool_build_dict.pl.
spike_in_analysis.pl / spike_in_process_fastq.pl
Process spike-in control reads for normalization: spike_in_process_fastq.pl demultiplexes the spike-in FASTQ, and spike_in_analysis.pl computes per-spike-in read counts and normalization factors.
sam_demult_ATAC.pl
Demultiplexes ATAC-seq reads from SAM format by barcode, writing per-sample output files.
Brandon_script.pl
Custom analysis script (contact lab for details).
encode_lib_common.py / encode_lib_genomic.py
Utility libraries adapted from the ENCODE ATAC-seq pipeline, providing functions for file I/O, genome arithmetic, and BAM processing.
encode_task_tss_enrich.py
Calculates TSS enrichment score for ATAC-seq QC using the metaseq library: aggregates read coverage in windows around TSS regions and applies Greenleaf-lab normalization, generating enrichment plots and a numeric score.
PPIseq_proj/
Pipeline for PPI-seq (Protein-Protein Interaction sequencing), a base-editor–coupled screen that measures protein interactions by linking interacting ORF pairs to uniquely decodable barcode combinations. When two proteins interact, their fusion constructs co-localize, allowing the base editor to mark both barcodes with characteristic A→G or T→C edits. Sequencing the edited barcodes (cDNA) and comparing to unedited (gDNA) counts reveals interaction enrichment. The pipeline supports multiple library formats: ORFeome, RBP, ClinORFeome, 100ZF, coIP-PPIseq, and DMS-PPIseq.
qsub_script/ — Cluster Job Scripts
The qsub scripts follow a two-step pattern per library type:
Build dictionary: Map barcodes → ORF/gene assignments from plasmid sequencing data (BLAST or cDNA-based).
Compute edit rates in parallel: Split the barcode space, assign cDNA reads, and compute enrichment scores.
Script
Description
ORFeomePPIseq_master.sh
Master shell script orchestrating the full ORFeome PPI-seq pipeline.
*_build_dict_master.qsub
Builds barcode-to-ORF dictionaries for each library variant (RBP, ClinORFeome, ORFeome, StitchR-COMMD, ABE-ORF, ABE25ORF-MCPL5ORF, coIP).
*_editRate_parallel.qsub
Computes per-barcode edit rates across the library in parallel (100ZF, RBP, ClinORFeome, ORFeome, RPIseq, ABE-MCP, DMS).
Parallel step 2 for the ABE-ORF / MCP-ORF dual-barcode multi-PPIseq.
merge_job_out.sh
Merges output files from parallel edit-rate jobs back into a single result file.
Ondrej_analysis.qsub
Custom analysis job for Ondrej's PPI-seq dataset.
script/ — Analysis Scripts
FASTQ-to-barcode conversion (Perl)
Script
Description
fqPE_to_faBC.pl
Extracts the first 15 nt of R1 as the barcode and uses R2 as the associated sequence, writing R1-barcode / R2-sequence pairs as FASTA. The first step of the dictionary-building pipeline for most PPIseq formats.
fqPE_to_faBC_pairs.pl
Extracts barcode pairs from paired-end reads for dual-barcode libraries.
fqPE_to_faBC_variable_BC_length.pl
Variant that handles variable-length barcodes (e.g., when barcode length varies by ORF).
partition_barcodes.pl
Splits a full barcode list into N equal partitions for parallel downstream processing; each partition is written to a separate file to enable job array–style parallelization.
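The partitioning step can be pictured with a small Python sketch. It assumes contiguous partitions whose sizes differ by at most one; partition_barcodes.pl may distribute barcodes differently, and the function name is hypothetical.

```python
from typing import List, Sequence


def partition(items: Sequence[str], n_parts: int) -> List[List[str]]:
    """Split items into n_parts contiguous chunks whose sizes differ by at most one."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    base, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` partitions absorb the remainder, one item each.
        size = base + (1 if i < extra else 0)
        parts.append(list(items[start:start + size]))
        start += size
    return parts


# Hypothetical barcode list split for three parallel jobs.
barcodes = ["bc1", "bc2", "bc3", "bc4", "bc5", "bc6", "bc7"]
chunks = partition(barcodes, 3)
```

Each chunk would then be written to its own file so that one cluster job can process one partition.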
cDNA parsing — ORF assignment (Perl)
These scripts read cDNA FASTQ files, extract barcodes, match them against a pre-built barcode dictionary, validate the edit pattern (T→C or A→G at expected positions), and output per–gene-pair read counts.
Script
Description
ORFeomePPIseq_cDNA_parse.pl / …_v2.pl
Core ORFeome PPIseq cDNA parser: extracts the 15 nt barcode from each read, looks it up in a barcode→gene dictionary, validates T→C or A→G edits at predetermined positions, and outputs read counts per gene with edit statistics.
RPIseq_cDNA_parse.pl
cDNA parsing for RPI-seq (RNA-Protein Interaction sequencing) format.
Ondrej_cDNA_parse.pl
Custom parser for Ondrej's library format.
multiPPIseq_ABEL5ORF_MCPL5ORF_cDNA_parse.pl
Dual-barcode cDNA parser for the ABE-L5ORF / MCP-L5ORF multi-PPIseq format: extracts variable-length ABE and fixed-length MCP barcodes, validates edit patterns on both barcodes, and counts gene-pair interactions.
multiPPIseq_ABE25ORF_MCPL5ORF_cDNA_parse.pl
Dual-barcode cDNA parser for the ABE25ORF / MCPL5ORF variant.
*_cDNA_parse_parallel/
Subdirectories containing parallelized (partitioned-barcode) cDNA parse scripts for large libraries: 100ZF, ClinORFeome, RBP, LDL_PPI_Super, multiPPIseq ABE-MCP, and multiPPIseq DMS.
BLAST-based barcode-to-ORF assignment (Perl)
Used when barcode-to-ORF mapping is established via BLAST of barcode sequences against an ORF library rather than by direct read matching.
Script
Description
ORFeomePPIseq_blastn_parse.pl / …_v2.pl
Parses BLAST tabular output (outfmt 6) to assign each barcode its best ORF match: filters by >95% identity and >15 bp alignment length, requires >2 supporting reads, and collapses near-duplicate barcodes within Hamming distance 1.
100ZF_blastn_parse.pl
BLASTN-based ORF assignment for the 100 zinc-finger library.
Parses paired-end FASTQ to extract barcodeA, barcodeB, and ORF sequences from fixed-position read coordinates; matches ORFs against a reference library; outputs a barcode-pair dictionary CSV with read counts, dominance statistics, and distribution histograms. The parallel version partitions processing for large libraries.
coIP_PPIseq_filter_dict.py
Filters the coIP dictionary by minimum read support and dominance thresholds to remove low-confidence barcode assignments.
coIP_PPIseq_process_cDNA.py
Processes cDNA FASTQ for coIP-PPIseq: extracts barcode pairs using fixed flanking sequences and counts read support per pair against the filtered dictionary, outputting per–gene-pair interaction counts.
multiPPIseq_DMS_dict_build/
Subdirectory with dictionary-building utilities for the DMS-PPIseq format.
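The BLAST tabular filtering described for ORFeomePPIseq_blastn_parse.pl (>95% identity, >15 bp alignment length) can be sketched in Python as below. The column positions follow the default 12-column outfmt 6 layout (qseqid, sseqid, pident, length, ..., bitscore). Choosing the best hit by bitscore is an assumption, and the read-support and Hamming-distance collapsing steps are left out.

```python
import csv
import io
from typing import Dict


def best_hits(tabular_text: str, min_identity: float = 95.0,
              min_length: int = 15) -> Dict[str, str]:
    """Return query -> best subject for hits passing identity and length filters."""
    best = {}  # query -> (subject, bitscore)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        identity, length, bitscore = float(row[2]), int(row[3]), float(row[11])
        # Strict inequalities mirror the ">95%" and ">15 bp" thresholds.
        if identity <= min_identity or length <= min_length:
            continue
        # Assumption: the highest bitscore defines the "best" ORF match.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Hypothetical outfmt 6 rows, built with explicit tabs.
rows = [
    ["bc1", "ORF_A", "99.0", "120", "1", "0", "1", "120", "5", "124", "1e-50", "210"],
    ["bc1", "ORF_B", "96.0", "100", "4", "0", "1", "100", "5", "104", "1e-30", "150"],
    ["bc2", "ORF_C", "90.0", "80", "8", "0", "1", "80", "5", "84", "1e-10", "100"],
    ["bc3", "ORF_D", "100.0", "15", "0", "0", "1", "15", "5", "19", "1e-02", "30"],
]
blast_text = "\n".join("\t".join(r) for r in rows)
assignments = best_hits(blast_text)
```

In this example only `bc1` receives an assignment: `bc2` fails the identity filter and `bc3` has an alignment length of exactly 15, which does not exceed the threshold.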
Barcode filtering (Perl)
Script
Description
filter_ABE_barcodes_by_dict.pl
For dual-barcode systems, filters BLAST results by matching the ABE component against a validated barcode dictionary (1-base Hamming tolerance) and combines with the MCP barcode annotation to produce a paired barcode–gene table.
filter_ABE_barcodes_by_lib.pl
Filters ABE barcodes against a known library list rather than a dictionary.
gRNA_ct_proj/
Counts gRNA sequences from FASTQ files for pooled CRISPR screens. Supports both single-guide libraries and dual-guide (paired) libraries. Integrates with BEAN for Bayesian effect estimation.
qsub_script/
Script
Description
fq_gRNA_ct.qsub
Cluster job for single gRNA counting.
fq_gRNA_pair_ct.qsub
Cluster job for paired gRNA counting.
trim_count.qsub
Trims reads then runs counting.
run_BEAN.qsub
Runs BEAN (Bayesian Estimation of Allelic effects from NGS) on gRNA count output for base-editing screen analysis.
script/
Script
Description
fq_gRNA_ct.pl
Counts gRNA read frequencies from FASTQ: locates a fixed pre-sequence in each read, extracts the 20 nt that follow it, and matches them against a gRNA library, using 20 nt matching with a 19 nt fallback. Outputs per-gRNA read counts with total and unmapped statistics.
fq_gRNA_ct_alt.pl
Alternate single-guide counter that uses a different read-anchor strategy for libraries with non-standard pre-sequence positions.
fq_gRNA_pair_ct.pl
Counts paired gRNA reads from R1/R2 FASTQ files: extracts partial sequences from each read and searches for complementary pairs in a gRNA-pair library, tracking reads matching both guides, either guide alone, or a paired combination. Outputs per–gRNA-pair mapping counts.
fq_gRNA_pair_ct_alt.pl
Alternate paired-guide counting variant for a different library architecture.
trim_count.pl
Extracts 20 nt sequence variants from FASTQ, sorts by frequency, and maps to gRNA library entries; outputs per-sequence counts with gRNA IDs. Simpler variant of fq_gRNA_ct.pl useful for QC and troubleshooting.
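The single-guide counting idea (anchor on a pre-sequence, take the following 20 nt, fall back to 19 nt) can be sketched in Python. This is not a port of fq_gRNA_ct.pl. In particular, the fallback rule used here is an assumption: it supposes the read lacks the guide's first base, so the 19 nt after the pre-sequence are compared with each library guide minus its first base. The function name, pre-sequence, and library are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_grnas(reads: Iterable[str], library: Dict[str, str],
                pre_seq: str, guide_len: int = 20) -> Tuple[Counter, int]:
    """Count reads per gRNA; return (per-gRNA counts, number of unmapped reads)."""
    full = {seq: name for name, seq in library.items()}
    short = {}  # guide minus its first base -> list of gRNA names
    for name, seq in library.items():
        short.setdefault(seq[1:], []).append(name)

    counts = Counter()
    unmapped = 0
    for read in reads:
        pos = read.find(pre_seq)
        if pos < 0:
            unmapped += 1
            continue
        start = pos + len(pre_seq)
        guide = read[start:start + guide_len]
        if guide in full:
            counts[full[guide]] += 1
            continue
        # Assumed 19 nt fallback; used only when it identifies a single gRNA.
        names = short.get(read[start:start + guide_len - 1], [])
        if len(names) == 1:
            counts[names[0]] += 1
        else:
            unmapped += 1
    return counts, unmapped


# Hypothetical library and reads.
library = {"g1": "ACGTACGTACGTACGTACGT", "g2": "TTTTGGGGCCCCAAAATTTT"}
pre = "CACCG"
reads = [
    "NN" + pre + library["g1"] + "GTTT",      # full 20 nt match
    pre + library["g2"][1:] + "GTTTAAGA",     # first guide base missing
    pre + "A" * 20,                           # pre-sequence found, no library match
    "GGGGGGGGGGGGGGGGGGGGGGGGGGGG",           # no pre-sequence
]
counts, unmapped = count_grnas(reads, library, pre)
```

Here `counts` holds one read each for `g1` and `g2`, and the remaining two reads are reported as unmapped.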
read_ct_proj/
Barcode and read counting for a range of massively parallel reporter assays (MPRA), CRISPRa screens, and splicing assays. Most assays follow a multi-step workflow: (1) extract and assign barcodes from sequencing reads, then (2) count cDNA and gDNA barcodes separately to compute enrichment ratios.
qsub_script/
Script
Description
read_ct.qsub
General-purpose barcode/read counting job.
fq_gRNA_ct.qsub
gRNA counting job for CRISPRa screens (shared with gRNA_ct_proj).
Counts oligonucleotides for ADRD regulatory variant assay.
dCas9Sun_count_oligo.qsub
Counts oligos for dCas9-SunTag CRISPRa assay.
dCasRx_SF.qsub
dCasRx splicing factor reporter assay job.
UDN_MPSA.qsub
UDN massively parallel splicing assay job.
script/
CRISPRa reporter assay pipeline (Perl)
This multi-step pipeline pairs each gRNA to its set of promoter barcodes (from plasmid sequencing), then quantifies cDNA and gDNA barcode abundance to compute per-promoter transcriptional enrichment scores.
Script
Description
CRISPRa_scriptV2_step1.pl
Step 1: processes paired FASTQ files by matching gRNA sequences (from R1, after a fixed pre-sequence) and promoter barcodes (from R2), assigns gRNA–promoter barcode pairs with Hamming-distance–1 error correction, and outputs a barcode→gRNA→promoter assignment table with QC metrics.
CRISPRa_scriptV2_step1_gRNA_only.pl
Step 1 variant that extracts gRNA reads only (no promoter barcode), used for gRNA-only counting.
CRISPRa_scriptV2_step1_subsample.pl
Step 1 with read subsampling for sensitivity analysis.
CRISPRa_scriptV2_step2a.pl
Step 2a: extracts barcode sequences from cDNA and gDNA FASTQ files and outputs per-barcode read counts (partitioned into multiple files for parallel step 2b/2c processing).
CRISPRa_scriptV2_step2b.pl
Step 2b: collapses cDNA/gDNA barcode counts from step 2a onto the proper-barcode set from step 1, using Hamming-distance–1 merging to correct sequencing errors.
CRISPRa_scriptV2_step2c.pl
Step 2c: aggregates collapsed barcode counts into per-gRNA and per-promoter enrichment ratios (cDNA / gDNA).
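The final aggregation into cDNA / gDNA enrichment ratios can be illustrated with a minimal Python sketch. It assumes raw count ratios summed over all barcodes assigned to a group (a gRNA or promoter), with no depth normalization or pseudocounts; CRISPRa_scriptV2_step2c.pl may differ. The function name, the `min_gdna` cutoff, and the input dictionaries are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def enrichment_by_group(barcode_to_group: Dict[str, str],
                        cdna: Dict[str, int],
                        gdna: Dict[str, int],
                        min_gdna: int = 1) -> Dict[str, float]:
    """Sum cDNA and gDNA counts per group and return cDNA / gDNA ratios.

    Groups whose summed gDNA count is below min_gdna are omitted, which
    also avoids division by zero.
    """
    cdna_total = defaultdict(int)
    gdna_total = defaultdict(int)
    for barcode, group in barcode_to_group.items():
        cdna_total[group] += cdna.get(barcode, 0)
        gdna_total[group] += gdna.get(barcode, 0)
    return {
        group: cdna_total[group] / gdna_total[group]
        for group in gdna_total
        if gdna_total[group] >= min_gdna
    }


# Hypothetical barcode assignments and collapsed counts.
barcode_to_group = {"AAA": "g1", "AAC": "g1", "GGG": "g2", "TTT": "g3"}
cdna_counts = {"AAA": 30, "AAC": 10, "GGG": 5}
gdna_counts = {"AAA": 10, "AAC": 10, "GGG": 20}
ratios = enrichment_by_group(barcode_to_group, cdna_counts, gdna_counts)
```

Summing counts before dividing weights each barcode by its abundance; averaging per-barcode ratios instead would be a different, equally defensible choice.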
Barcode counting utilities (Perl)
Script
Description
barcodeCt.pl
Extracts 15 nt barcodes from reads after a fixed pre-sequence; collapses barcodes within Hamming distance 1 to correct sequencing errors; outputs final barcode → count mapping.
barcodeCt_byPromoter.pl
Barcode counts stratified by promoter identity (uses the gRNA–promoter assignment from step 1).
barcodeCt_bygRNA.pl
Barcode counts stratified by gRNA identity.
readCt_byBarcode.pl
Matches barcode sequences from a reference library to FASTQ reads using regex search within a 60 bp window; outputs per-barcode read counts.
readCt_byLib.pl
Read counts grouped by library construct (rather than individual barcode).
readCt_PEvsWT.pl
Compares prime-edited (PE) vs. wild-type (WT) read counts in a pegRNA reporter assay: searches for PE and WT diagnostic sequences and reports the ratio of PE to WT reads.
fq_gRNA_ct.pl
gRNA counting (also used in gRNA_ct_proj; see above).
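Hamming-distance-1 collapsing, which several of the counting scripts above use to correct sequencing errors, can be sketched in Python. The greedy strategy shown here (process barcodes from most to least abundant and merge each into the first already-kept barcode one mismatch away) is an assumption, not necessarily what barcodeCt.pl does, and the function names are hypothetical.

```python
from typing import Dict


def hamming1(a: str, b: str) -> bool:
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def collapse(counts: Dict[str, int]) -> Dict[str, int]:
    """Greedily merge low-count barcodes into abundant neighbors at distance 1."""
    merged = {}
    # Most abundant first; ties broken alphabetically for reproducibility.
    for barcode, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next((kept for kept in merged if hamming1(kept, barcode)), None)
        if parent is None:
            merged[barcode] = n
        else:
            merged[parent] += n
    return merged


# Hypothetical raw barcode counts.
raw = {"AAAA": 100, "AAAT": 3, "CCCC": 50, "CCCG": 2, "GGGG": 1}
collapsed = collapse(raw)
```

This pairwise comparison is quadratic in the number of distinct barcodes, so it suits a sketch or a small library; large libraries would need a neighbor-lookup approach instead.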
Splicing & reporter assays (Perl)
Script
Description
UDN_MPSA_step1.pl
Step 1 of the UDN massively parallel splicing assay: matches 40 bp oligo and 15 bp barcode reads from paired FASTQ by fixed flanking sequences, builds oligo–barcode associations, and outputs oligo and barcode statistics with paired assignments.
UDN_MPSA_step2.pl
Step 2: uses the oligo–barcode map from step 1 to quantify each variant's representation in the spliced (included) and unspliced (excluded) RNA fractions, computing per-oligo splicing efficiency.
dCasRx_SF_script.pl / dCasRx_SF_script_temp.pl
dCasRx splicing factor assay: matches R1 library sequences (20 bp) and R2 exon variant markers (10 bp) to their respective reference libraries, then outputs per–library-member counts of exon-included, exon-excluded, and other variants for splicing analysis.
dCas9Sun_count_oligo.pl
Counts oligo sequences for the dCas9-SunTag CRISPRa transcriptional reporter assay.
ADRD_count_oligo.pl
Counts 40 bp oligonucleotide sequences from FASTQ for the ADRD (Alzheimer's Disease and Related Dementias) regulatory variant reporter assay: matches a fixed leading sequence, extracts the oligo, and accumulates per-oligo read counts.