Sherwood Lab – NGS Analysis Scripts

Scripts for sequencing data processing and analysis used in the Sherwood Lab at BWH. The repository is organized by project type; each project folder contains cluster job submission scripts (qsub_script/) and analysis/parsing scripts (script/). Cluster scripts are written in SGE qsub format and target the BGM compute cluster.

Repository Structure

.
├── Tian_demult_scripts/        # FASTQ demultiplexing for dual-indexed libraries
├── NGS_analysis_proj/          # ACCESS-seq & ATAC-seq analysis
├── PPIseq_proj/                # PPI-seq (protein-protein interaction sequencing)
├── gRNA_ct_proj/               # gRNA counting for pooled CRISPR screens
└── read_ct_proj/               # Barcode/read counting for reporter assays

`Tian_demult_scripts/`

Custom demultiplexing pipeline for dual-indexed paired-end libraries. Handles Illumina-format reads (two index reads + two genomic reads) as well as output from AVITI and BPFNGS sequencers. Matching supports exact-match and one-base wildcard (N) tolerance. Includes documentation (BPFNGS_demultiplex_instructions.Rmd/.pdf) with step-by-step instructions for running the BPFNGS demultiplexing workflow.

Core demultiplexers

Script	Description
`demult_2reads_2index.pl`	Splits paired-end R1/R2 reads into per-sample gzip-compressed FASTQ files by matching I1 + I2 index sequences. Supports N-wildcard positions in index references for flexible matching. Reports total and passed read counts per run.
`demult_2reads_2index_1primer.pl`	Extension of the above that additionally requires a primer sequence match within R1 (fuzzy or exact), and trims the matched primer from output reads. Useful when samples share an index pair but differ by a leading primer.
`demult_2reads_2index_sc.pl`	Single-cell variant of the dual-index demultiplexer; adapted for libraries where cell barcodes are embedded in the read structure.
`demult_2reads_2index_1primer_sc.pl`	Combined single-cell + primer-match demultiplexer.
`demult_V2.pl`	Updated demultiplexer that reads a sample sheet to build a dynamic index-pair lookup, then routes reads to per-sample output files.
`demult_master_Tian.qsub` / `demult_master_Tian_v2.qsub`	SGE job scripts that orchestrate multi-sample demultiplexing on the cluster.
`demult_master_Tian_Aviti_step1/2.sh`	Two-step shell pipeline for AVITI sequencer output: step 1 handles pre-processing/format conversion, step 2 runs demultiplexing. Includes a `_sc` variant for single-cell AVITI runs.
`demult_master_Tian_BPFNGS_step1/2/3.sh` / `demult_master_Tian_BPFNGS.qsub`	Three-step pipeline for BPFNGS sequencer output, with a corresponding SGE submission script.
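
The N-wildcard index matching described for `demult_2reads_2index.pl` can be illustrated with a short Python sketch. This is not the Perl implementation; the function names (`index_matches`, `assign_sample`) and the sample sheet layout are hypothetical and chosen only for illustration.

```python
from typing import Optional


def index_matches(reference: str, observed: str) -> bool:
    """True if `observed` equals `reference`, with N in the reference matching any base."""
    if len(reference) != len(observed):
        return False
    return all(r == "N" or r == o for r, o in zip(reference, observed))


def assign_sample(i1: str, i2: str, sample_sheet: dict) -> Optional[str]:
    """Return the first sample whose (I1, I2) reference pair matches, else None."""
    for sample, (ref_i1, ref_i2) in sample_sheet.items():
        if index_matches(ref_i1, i1) and index_matches(ref_i2, i2):
            return sample
    return None


# Hypothetical sample sheet: sample name -> (I1 reference, I2 reference)
SHEET = {
    "sampleA": ("ACGTNCGT", "TTGACCAA"),
    "sampleB": ("GGGTACGT", "TTGACCAA"),
}

# The fifth base of sampleA's I1 reference is N, so any base is accepted there.
print(assign_sample("ACGTACGT", "TTGACCAA", SHEET))
```

A read whose index pair matches no entry returns `None`; in a demultiplexer such reads would count toward the total but not toward the passed reads.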

Utilities

Script	Description
`count_index_pairs.pl`	Reads I1 and I2 FASTQ files and tallies the frequency of every observed index-pair combination, then outputs results sorted by count. Used to audit which index combinations are present before demultiplexing.
`check_R1_by_index_pairs.pl`	Cross-checks R1 reads against a set of expected index pairs to verify correct library composition.
`get_index_files.pl`	Parses barcode pairs embedded in read headers (format `NNNNNN+NNNNNN`) and writes them as separate I1 and I2 FASTQ files with synthetic quality scores. Useful when index reads are not provided separately.
`fix_fq.pl`	Repairs malformed FASTQ records by trimming quality strings to match the corresponding sequence length when they are inadvertently longer.
`fq_count_Xnt.pl`	Counts the number of reads with a specified sequence length in a FASTQ file.
`merge_identical_files.sh`	Merges files with identical content across sequencing lanes into a single output.
`swap_filenames.sh`	Swaps filenames between two files (used for correcting sample-name mix-ups).
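
As an illustration of what `get_index_files.pl` is described as doing, the sketch below pulls an `NNNNNN+NNNNNN` index pair out of a read header and builds I1/I2 FASTQ records with a constant placeholder quality. The regular expression, the function name `header_to_index_records`, and the example header are assumptions, not taken from the Perl script.

```python
import re

# Matches a trailing index pair such as "ACGTAC+TTGCAA" at the end of a header line.
INDEX_PAIR = re.compile(r"([ACGTN]+)\+([ACGTN]+)\s*$")


def header_to_index_records(header, qual_char="I"):
    """Build (I1, I2) FASTQ records from a read header, or return None if no pair is found.

    qual_char is a synthetic quality value; "I" encodes Phred 40 in Phred+33.
    """
    match = INDEX_PAIR.search(header)
    if match is None:
        return None
    read_name = header.split()[0]
    records = []
    for index_seq in match.groups():
        records.append(f"{read_name}\n{index_seq}\n+\n{qual_char * len(index_seq)}\n")
    return tuple(records)


# Made-up header in the common Illumina layout
example = "@INSTR:1:FLOWCELL:1:1101:15000:1330 1:N:0:ACGTAC+TTGCAA"
i1_record, i2_record = header_to_index_records(example)
print(i1_record, end="")
print(i2_record, end="")
```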

`NGS_analysis_proj/`

End-to-end analysis pipeline for ACCESS-seq (Adenosine or Cytosine Deaminase Chromatin Editing Sequencing) and ATAC-seq. ACCESS-seq uses base editors (Ddd enzymes) to generate C→T or A→G edits as proxy signals for chromatin accessibility and TF occupancy. The pipeline covers alignment, base-level edit calling, motif-level analysis, machine learning–based TF binding state classification, and specialized workflows for single-cell, duplex sequencing, and variant-ACCESS applications.

`qsub_script/` — Cluster Job Scripts

Alignment & preprocessing

Script	Description
`ACCESS_bwameth_mapping.qsub`	Aligns ACCESS-seq paired-end reads to the reference genome using bwameth (bisulfite-aware aligner treating C→T edits analogously to bisulfite conversions). Runs with 48 threads on the BGM cluster.
`ACCESS_bwameth_mapping_DuplexSeq.qsub`	bwameth alignment variant for duplex sequencing libraries.
`ACCESS_bwameth_mapping_scPBMC.qsub`	bwameth alignment variant for single-cell PBMC ACCESS-seq libraries.
`ACCESS_hisat3n_mapping.qsub`	Alternative alignment using HISAT-3N, which natively handles C→T or A→G conversion reads and may provide better splice-aware mapping.
`ACCESS_iterative_mapping.qsub`	Iterative trimming-and-mapping strategy: progressively trims reads that fail to align and reattempts mapping, improving recovery of edge-heavy reads.
`scACCESS_preprocess.qsub`	Preprocessing for single-cell ACCESS-seq (e.g., barcode extraction and FASTQ formatting).
`scACCESS_iterative_mapping.qsub`	Iterative mapping variant for scACCESS libraries.
`cutadapt.qsub`	Adapter trimming with Cutadapt before alignment.
`fastq_sample.qsub`	Random subsampling of FASTQ files for downsampling experiments.
`ATACseq_run_script.qsub` / `ATACseq_run_script_BWA.qsub`	Standard ATAC-seq pipeline using Bowtie2 or BWA for alignment, followed by peak calling.

Quality control

Script	Description
`ACCESS_quality_metric.qsub`	Computes library quality metrics including mapping rate, duplicate rate, fragment length distribution, and edit rate summaries for ACCESS-seq data.
`ACCESS_tss_enrichment.qsub`	Calculates TSS enrichment score (a standard ATAC-seq QC metric) from the aligned BAM using ENCODE-style normalization.
`ACCESS_fragment_length.qsub`	Generates fragment length distribution from BAM to assess nucleosomal banding patterns.
`ACCESS_read_count.qsub`	Per-sample read counting at various stages of the pipeline.
`scACCESS_quality_metric.qsub`	QC metrics specific to single-cell ACCESS libraries (per-barcode statistics).

Base counting & edit rate

Script	Description
`ACCESS_mpileup_baseCt.qsub`	Runs samtools mpileup on aligned BAM files and pipes output to the base-count parsing script, producing per-position A/T/G/C counts across the genome (or defined regions). This is the core step for edit rate computation.
`baseCtByPos.qsub`	Aggregates base counts by genomic position into a compact format.
`get_edit_rate.qsub`	Computes position-specific C→T or A→G edit rates from the base count table output by mpileup.
`get_genotype_ct.qsub`	Counts genotypes (reference vs edited vs other) at each position from base count files.
`get_baseCt_HDRminipool.qsub` / `get_baseCt_MIAA.qsub` / `get_baseCt_MS2hp.qsub`	Base count extraction scripts tailored to specific assay designs (HDR minipool, MIAA, and MS2 hairpin variants).
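
For orientation, a position-specific edit rate can be derived from a per-position base-count row roughly as follows. The denominator used here (reference base plus edited base only) is an assumption; the lab scripts may count other bases or apply coverage filters.

```python
def edit_rate(counts, edit="C>T"):
    """Edit rate at one position from a dict of base counts.

    `edit` is written "REF>ALT" (for example "C>T" or "A>G"). The caller is
    expected to apply this only at positions whose reference base is REF.
    Returns None when neither base was observed.
    """
    ref, alt = edit.split(">")
    n_ref = counts.get(ref, 0)
    n_alt = counts.get(alt, 0)
    denominator = n_ref + n_alt
    return n_alt / denominator if denominator else None


# Hypothetical base counts at a reference C position
row = {"A": 0, "T": 10, "G": 0, "C": 90}
print(edit_rate(row, "C>T"))
```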

Chromatin accessibility & motif analysis

Script	Description
`ACCESS_chromAcc_bin_edits.qsub`	Runs `chromAcc_bin_analysis.py` to stratify edit fractions across chromatin accessibility bins (using a BigWig accessibility track), revealing how base editing efficiency correlates with open vs. closed chromatin.
`ACCESS_motif_edit_rate.qsub`	Runs `get_motif_edit_rate.py` to calculate C→T/A→G edit rates at positions flanking TF binding motifs, revealing TF footprints as protected (low-edit) sites within higher-edit open chromatin.
`ACCESS_bg_edit_pattern.qsub`	Characterizes background edit patterns (e.g., sequence context effects, strand bias) to enable normalization of motif-level signals.
`ACCESSseq_selected_TF.qsub`	Targeted motif edit rate analysis for a curated set of TFs.

Single-cell ACCESS-seq

Script	Description
`scACCESS_sep_hyperpool.qsub`	Demultiplexes hyperpooled scACCESS libraries by separating reads into per-subpool FASTQ files based on embedded pool barcodes.
`bam_mapping_frac.qsub`	Reports the fraction of reads mapped to various reference regions for QC in single-cell experiments.

Variant ACCESS-seq (varACCESS)

Script	Description
`varACCESS_get_edits.qsub`	Extracts all edit positions from aligned varACCESS reads.
`varACCESS_periodicTFs.qsub` / `varACCESS_periodicTFs_v2.qsub`	Full pipeline for periodic TF varACCESS experiments: runs Cutadapt trimming, FLASH read fusion, phrase matching (`varACCESS_match_phrases.py`), and edit tallying (`varACCESS_fasta_to_edits.py`) in sequence.
`varACCESS_phrase_edit_validation.sh`	Validates that phrase-matched reads have the expected edit patterns.
`varACCESS_tkfo.sh`	Shell pipeline for the "TKFO" varACCESS experiment variant.

Duplex sequencing

Script	Description
`duplex_reads.qsub`	Processes raw duplex sequencing FASTQ files (barcode extraction, trimming, pairing).
`process_DuplexSeq_reads_se.qsub`	Single-end processing variant for duplex sequencing reads.
`ACCESS_compile_twin_reads.qsub`	Compiles paired "twin" read groups (one C→T strand, one G→A strand) from the BAM for duplex consensus analysis.
`ACCESS_detect_twin_reads.qsub`	Flags and enumerates twin read pairs in a BAM file to assess duplex library quality.
`bam_dedup_by_samtools.sh`	Deduplicates BAM using samtools before duplex processing.
`ACCESS_ATAC_signal.sh`	Computes ATAC-seq–like open chromatin signal from ACCESS-seq data.

Reporter & other assays

Script	Description
`pegRNA_reporter_assay.qsub`	Processes pegRNA reporter assay FASTQ: parses 150 nt reads to count edited vs. unedited reporter constructs and quantify prime editing efficiency.
`pegRNA_endo_assay.qsub`	Endogenous pegRNA editing assay: counts edits at endogenous target sites using amplicon reads.
`ShuttleSeq.qsub`	Runs ShuttleSeq library processing: extracts gRNA sequences and counts integration events or editing patterns.
`TFome_reporter.qsub`	TFome reporter assay: processes sequencing reads from a TF–editor fusion reporter to extract per-site edit counts.
`vORF_barcode_count.qsub`	Counts barcodes linked to variant ORFs in a vORF library.
`HepG2_hg19_hg38_liftOver.qsub`	Lifts over HepG2 variant coordinates from hg19 to hg38.
`filter_bam_by_bed.qsub`	Filters a BAM to only reads overlapping regions specified in a BED file.
`merge.qsub`	Merges multiple BAM or output files into a single file.
`ACCESS_single_allele.qsub` / `ACCESS_single_allele_2.qsub`	Detects and counts single-allele editing events from ACCESS-seq data.
`ACCESS_mutationPerPos.qsub`	Counts all mutation types per genomic position (not limited to expected edit type).
`ACCESS_iterative_mapping.qsub`	(See alignment section above.)
`ACCESS_compile_twin_reads_by_chr.sh`	Chromosome-parallelized version of twin read compilation.
`access_codec_master.sh`	Master shell script orchestrating the full ACCESS CODEC pipeline: sample demultiplexing (step 1), adapter trimming via Cutadapt (step 2), and consensus generation (step 3), driven by a CSV sample sheet.

`script/` — Analysis Scripts

Alignment & BAM utilities (Python)

Script	Description
`bam2editMatrix.py`	Reads a BAM file and constructs a co-occurrence matrix of paired C→T / G→A edits centered on a TF motif window. Shows which pairs of positions within a motif tend to be co-edited on the same read, enabling analysis of clonal editing patterns.
`bam2editMatrix_individual.py`	Per-read variant of `bam2editMatrix.py` that outputs individual read edit vectors rather than an aggregated matrix.
`bam2mutationPerPos.py`	Tallies all mutation types at every position in a BAM file, independent of expected edit type. Useful for distinguishing true edits from sequencing error or other variants.
`bam2fragment_len.py`	Extracts fragment length from paired-end BAM records and outputs a distribution.
`bam_to_bw.py`	Converts BAM coverage to BigWig format for genome browser visualization.
`download_bigwig.py`	Downloads BigWig files from ENCODE or other remote sources given a list of URLs.
`parse_flagstat_iterative_mapping.pl` / `parse_flagstat_v2.pl`	Parse samtools flagstat output from iterative mapping steps into tabular summaries.

Base counting & edit rate (Perl)

Script	Description
`ATACseq_mpileup2baseCt.pl` / `ATACseq_mpileup2baseCt_byChr.pl`	Convert samtools mpileup output to a compact per-position base-count CSV (A/T/G/C columns), stripping indel notation. The `_byChr` variant processes one chromosome at a time for parallel execution.
`get_baseCt_MGHNGS.pl` / `get_baseCt_MGHNGS_v2.pl`	Demultiplex and base-count reads from MGHNGS-format amplicon libraries: matches each read to a template sequence, then counts A/T/G/C at each position for both forward and reverse strands, outputting edit rates per sample.
`MS2hp_PE_baseCt.pl`	Base counting for the MS2 hairpin paired-end assay: extracts and counts bases at defined positions within MS2hp constructs.
`baseCtByPos.pl`	Aggregates base counts by position across a set of reads matched to a template.
`get_edit_rate.pl` / `get_edit_rate_V2.pl` / `get_edit_rate_V3.pl`	Compute C→T, A→G (or other) edit rates per position from a template-matched FASTQ: read start sequences are matched to a template, edits at defined positions are tallied, and per-position edit rates are output as CSV. V2/V3 add improved matching logic and additional output fields.
`get_genotype_ct.pl` / `get_genotype_ct_v2.pl`	Count how many reads carry each combination of reference and edited bases at specified positions (i.e., genotyping at edit sites).
`get_edit_stats.py`	Computes per-read and per-position edit statistics from an ACCESS BAM: outputs the distribution of C→T and G→A edit counts per read, and per-position base-change frequencies, for library QC.
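
The mpileup-to-base-count conversion can be sketched as below. This is a Python illustration of parsing the read-bases column of samtools mpileup output while skipping indel notation; it is not a port of `ATACseq_mpileup2baseCt.pl`, and the function name `count_bases` is hypothetical.

```python
def count_bases(read_bases, ref_base):
    """Count A/T/G/C in one mpileup read-bases string, ignoring indels and markers."""
    counts = {"A": 0, "T": 0, "G": 0, "C": 0}
    ref = ref_base.upper()
    i, n = 0, len(read_bases)
    while i < n:
        ch = read_bases[i]
        if ch == "^":
            # Read-start marker; the next character is a mapping quality, not a base.
            i += 2
        elif ch in "+-":
            # Indel: sign, decimal length, then that many inserted/deleted bases.
            j = i + 1
            while j < n and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
        else:
            if ch in ".,":
                # Match to the reference on the forward (.) or reverse (,) strand.
                if ref in counts:
                    counts[ref] += 1
            elif ch.upper() in counts:
                counts[ch.upper()] += 1
            # Anything else ($, *, #, <, >, N) is skipped.
            i += 1
    return counts


# Three reference matches, a read start, two T mismatches, a 2 bp insertion, a read end, a deletion placeholder
print(count_bases(".,^I.T+2AGt$*", "C"))
```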

Chromatin accessibility & motif analysis (Python)

Script	Description
`chromAcc_bin_analysis.py` / `chromAcc_bin_analysis_multithread.py` / `chromAcc_bin_analysis_v2.py`	Core ACCESS-seq analysis: given a BAM, reference FASTA, and a BigWig accessibility track, segments the genome into bins by accessibility score and calculates C→T / G→A edit fractions per bin. Generates plotnine visualizations and CSV output. The multithread version parallelizes over chromosomes; v2 adds additional filtering and output options.
`acc_score_bin_edits.py`	Similar binning analysis focused on peak-based accessibility scores; supports strand-aware relative positioning and trinucleotide context stratification.
`get_motif_edit_rate.py` / `…_v2.py` / `…_v2_lite.py` / `…_v3.py` / `…_v4.py`	Calculates edit fractions at positions flanking annotated TF binding motifs: reads motif peak BED files, fetches BAM reads over each motif, counts C→T / G→A edits at each relative position, and outputs edit rate profiles for footprint detection. V4 adds flexible di/tri-nucleotide motif detection and peak score filtering. `_v2_lite` is a memory-efficient version for large datasets.
`get_motif_edit_matrix.py`	Constructs an edit matrix (reads × relative positions) centered on motif sites; used as input for downstream modeling.
`get_motif_edit_rate_per_site.py`	Reports per-motif-site (rather than aggregate) edit rates, allowing site-level variability analysis.
`get_region_edit_ct.py`	Calculates Ddd enzyme edit fractions for arbitrary user-defined genomic regions (specified as coordinate strings), with optional strand-aware positioning.
`get_bg_edit_pattern.py`	Characterizes background sequence-context–dependent edit rates (e.g., TCA vs. TCG context bias), used to normalize motif signals against background deaminase preferences.
`preprocess_motif_beds_step1.py` / `preprocess_motif_beds_step2.py`	Preprocesses FIMO motif BED files for ACCESS analysis: step 1 labels motif sites as active or inactive by overlapping with ChIP-seq peaks and filters by read coverage; step 2 further filters and formats BED files for model training.
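
The binning idea behind `chromAcc_bin_analysis.py` (edit fraction stratified by accessibility score) can be sketched on plain tuples. The real scripts read a BAM, a reference FASTA, and a BigWig track; the input layout, bin edges, and function name below are assumptions made to keep the example self-contained.

```python
from bisect import bisect_right


def edit_fraction_by_bin(sites, bin_edges):
    """Pool edit counts into accessibility bins and return the edit fraction per bin.

    sites: iterable of (accessibility_score, edited_count, total_count)
    bin_edges: sorted thresholds; k edges define k + 1 bins, and a score equal
    to an edge falls in the upper bin. Empty bins yield None.
    """
    edited = [0] * (len(bin_edges) + 1)
    total = [0] * (len(bin_edges) + 1)
    for score, n_edited, n_total in sites:
        b = bisect_right(bin_edges, score)
        edited[b] += n_edited
        total[b] += n_total
    return [e / t if t else None for e, t in zip(edited, total)]


# Hypothetical sites: (accessibility score, edited bases, covered bases)
sites = [(0.5, 1, 100), (2.0, 10, 100), (2.5, 30, 100), (10.0, 40, 50)]
print(edit_fraction_by_bin(sites, bin_edges=[1.0, 5.0]))
```

Counts are pooled within each bin before dividing, so high-coverage sites weigh more than low-coverage ones; averaging per-site fractions instead would be a different, equally defensible choice.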

3-state TF binding model (Python + TensorFlow)

The 3-state model classifies ACCESS-seq reads (or motif sites) as unbound, bound, or recently-bound based on the pattern of edits around TF binding motifs. Training uses read-level feature matrices from active and inactive motif sites.

Script	Description
`calc_motif_features_step1.py`	Step 1 of feature extraction: given a BAM, motif BED directory, cell type, and TF name, extracts per-read edit vectors centered on each motif instance and saves them as part files.
`calc_motif_features_step1_v2.py` / `calc_motif_features_step2.py`	V2 of step 1 adds FASTA-based sequence features; step 2 aggregates the part files from step 1 into a single HDF5 file per TF/cell-type combination.
`create_trainingSet_3stateModel.py` / `…_v2.py`	Assembles training and test sets from HDF5 feature files: samples read vectors from active and inactive motif sites, assigns 3-state labels, and stores them in HDF5 format (via h5py) for model training.
`train_3StateModel.py` / `train_3StateModel_single_h5.py`	Trains TensorFlow neural network classifiers (one model each for base coverage channel, edit channel, and combined) on the 3-state training data, using class-weighted loss to handle label imbalance. Saves trained model dictionaries as pickle files.
`evaluate_3StateModel_single_h5.py` / `…_v2.py`	Evaluates trained models on held-out test data, computing AUPRC, AUROC, F1, and per-class fractions per TF motif.
`evaluate_3StateModel_summarize.py`	Aggregates evaluation metrics across all tested TF/cell-type pairs, filters high-confidence motifs by Spearman correlation with known binding data, and generates summary boxplots and CSV tables.
`cobinding_3StateModel_step1.py` / `cobinding_3StateModel_step2.py`	Co-binding analysis using model predictions: step 1 identifies nearby motif pairs (5–100 bp), computes 3-state read fractions per pair, and tests for statistical enrichment (Fisher's exact + Mann-Whitney U); step 2 summarizes odds ratios and read-type medians across all pairs.
`cobinding_definedLabel.py`	Co-binding analysis variant that uses user-defined activity labels rather than model predictions.
`access_util.py`	Shared utility library: neural network architecture definitions, data preprocessing functions (edit matrix construction, label encoding), and statistical helpers (Spearman correlation, motif-centered window extraction).
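
Class-weighted loss needs a weight per label. One common heuristic, sketched below, is the "balanced" weighting `n_samples / (n_classes * class_count)`. Whether `train_3StateModel.py` uses this exact formula is not stated here, so treat it as an assumption; the label names are also illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced 3-state label set
labels = ["unbound"] * 6 + ["bound"] * 3 + ["recently_bound"] * 1
for cls, weight in balanced_class_weights(labels).items():
    print(cls, round(weight, 3))
```

A dictionary keyed by integer class index in this form is what Keras accepts as the `class_weight` argument to `Model.fit`, so string labels would need to be encoded first.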

Single-cell ACCESS-seq (Perl / Python)

Script	Description
`sc_sep_fqPE_by_barcode.pl`	Extracts all paired-end reads belonging to a single specified cell barcode from FASTQ, writing a per-cell R1/R2 pair. Useful for targeted single-cell re-analysis.
`sc_separate_hyperpool.pl`	Demultiplexes hyperpooled scACCESS FASTQ by matching pool barcodes from R2, trims adapters, and outputs one gzip-compressed FASTQ pair per subpool with a read-distribution summary.
`sc_count_barcodes.pl` / `sc_count_barcodes_from_bam.pl` / `sc_count_barcodes_from_fastq.pl`	Count cell barcodes from either a BAM (via read tags) or FASTQ (via sequence matching), outputting per-barcode read counts. Used to assess barcode diversity and identify high-quality cells.
`map_sc_barcodes.pl`	Maps observed cell barcodes to a reference whitelist with 1-mismatch tolerance, collapsing barcode variants to their nearest valid barcode.
`sc_edit_stats.py`	Aggregates per-barcode ACCESS-seq edit statistics (total reads, edited reads, C→T count, G→A count) from a BAM file; supports optional filtering against a barcode whitelist.
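
Whitelist mapping with 1-mismatch tolerance, as described for `map_sc_barcodes.pl`, can be sketched by precomputing every single-substitution variant of each valid barcode. How the Perl script resolves variants that are one mismatch away from two different whitelist barcodes is not documented here; this sketch drops them, which is an assumption.

```python
def build_lookup(whitelist, alphabet="ACGTN"):
    """Map every barcode within Hamming distance 1 of the whitelist to its valid barcode."""
    valid = set(whitelist)
    lookup = {bc: bc for bc in valid}
    ambiguous = set()
    for bc in valid:
        for i, original in enumerate(bc):
            for base in alphabet:
                if base == original:
                    continue
                variant = bc[:i] + base + bc[i + 1:]
                if variant in valid:
                    continue  # an exact whitelist hit always wins
                if variant in lookup and lookup[variant] != bc:
                    ambiguous.add(variant)  # equally close to two valid barcodes
                else:
                    lookup[variant] = bc
    for variant in ambiguous:
        del lookup[variant]
    return lookup


# Toy whitelist; real cell barcodes are longer
lookup = build_lookup(["AAAA", "CCCC"])
print(lookup.get("AAAT"), lookup.get("AACC"))
```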

Variant ACCESS-seq (Python / Perl)

varACCESS is a modification of ACCESS-seq that encodes TF binding site variants as short "phrase" sequences, allowing simultaneous measurement of accessibility at thousands of sequence variants in a single experiment.

Script	Description
`varACCESS_match_phrases.py`	Filters length-selected FASTQ reads and matches them against a phrase library (loaded from XLSX), allowing C→T / G→A edits plus a defined number of mismatches. Processes in parallel; classifies reads as C2T-dominant, G2A-dominant, or mixed; outputs matched reads as FASTA and an optional edit-position summary CSV.
`varACCESS_match_phrases_v2.py`	Updated version with additional filtering options and improved parallelization.
`varACCESS_fasta_to_edits.py`	Tallies C→T and G→A edit positions from phrase-matched FASTA files, processing headers to recover phrase identity and edit type; outputs per-position and per-phrase edit fraction CSVs.
`varACCESS_get_edits.pl` / `varACCESS_get_edits_allPos.pl`	Perl-based edit extraction for varACCESS reads: matches reads to phrase library, calls edits at each position, and outputs edit tables. `_allPos` version reports all positions rather than only predefined ones.
`varACCESS_match_stats.py`	Summarizes phrase match rates, edit-type fractions, and per-phrase coverage statistics across a run.
`liftover_vcf_hg19_to_hg38.py` / `…_v2.py`	Converts VCF coordinates from hg19 to hg38 using pyliftover chain files; supports SNV-only and primary-chromosome-only filters; optionally outputs a CSV with coordinate mapping metadata.
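
Edit-tolerant phrase matching can be sketched as a position-wise comparison in which C→T and G→A differences are counted as edits rather than mismatches. The labeling rule at the end (majority edit type, "mixed" on a tie) is an assumption and may differ from how `varACCESS_match_phrases.py` defines C2T-dominant and G2A-dominant reads.

```python
def match_with_edits(read, phrase, max_mismatches=1):
    """Compare a read to a phrase of equal length, tolerating deaminase edits.

    Returns (label, c2t, g2a, mismatches), or None if the read does not match.
    """
    if len(read) != len(phrase):
        return None
    c2t = g2a = mismatches = 0
    for observed, expected in zip(read, phrase):
        if observed == expected:
            continue
        if expected == "C" and observed == "T":
            c2t += 1
        elif expected == "G" and observed == "A":
            g2a += 1
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                return None
    if c2t == 0 and g2a == 0:
        label = "unedited"
    elif c2t > g2a:
        label = "C2T"
    elif g2a > c2t:
        label = "G2A"
    else:
        label = "mixed"
    return label, c2t, g2a, mismatches


# Hypothetical 6 nt phrase with two C->T edits in the read
print(match_with_edits("ATGTTG", "ACGTCG"))
```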

Duplex sequencing (Python / Perl)

Script	Description
`process_duplex_seq.py`	Preprocesses raw duplex sequencing FASTQ: extracts 5 bp end-barcodes from read start/end to construct unique molecule identifiers (UMIs), and outputs trimmed/labeled FASTQ files for downstream consensus calling.
`detect_twin_reads.py`	Scans a BAM for read pairs where one read shows predominantly C→T edits and the other shows predominantly G→A edits at the same positions (i.e., "twin" reads from opposite strands of the same molecule); outputs pairing statistics.
`compile_twin_reads.py` / `compile_twin_reads_by_chr.py`	Merges detected twin read pairs into consensus sequences by combining the complementary edit information from both strands; the `_by_chr` version parallelizes over chromosomes.
`duplex_seq_to_twin_csv.py`	Exports identified twin read pair data to CSV for summary analysis.
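
The end-barcode step of `process_duplex_seq.py` can be pictured with the sketch below, which joins the first and last 5 bp of a read into a molecule identifier and trims them from the sequence and quality strings. The concatenation order and the minimum-length rule are assumptions.

```python
def extract_end_barcodes(seq, qual, bc_len=5):
    """Return (umi, trimmed_seq, trimmed_qual), or None if the read is too short."""
    if len(seq) != len(qual) or len(seq) < 2 * bc_len + 1:
        return None
    umi = seq[:bc_len] + seq[-bc_len:]
    return umi, seq[bc_len:-bc_len], qual[bc_len:-bc_len]


# Toy read: 5 bp barcode, 6 bp insert, 5 bp barcode
print(extract_end_barcodes("ACGTAGGGCCCTTGCA", "I" * 16))
```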

ACCESS CODEC pipeline (Python)

ACCESS CODEC (COnsensus Deaminase Chromatin sequencing) is a library preparation method that generates overlapping paired-end reads to improve single-molecule accuracy.

Script	Description
`access_codec_step1.py`	Demultiplexes ACCESS CODEC FASTQ files by matching expected read-start sequences (loaded from a sample sheet); trims matched sequences and outputs per-sample FASTQ pairs.
`access_codec_step2.py`	Detects overlapping regions between R1 and R2 reads, generates a consensus sequence from the overlap, and resolves C/T and G/A ambiguities using both strands; outputs consensus FASTA with edit-position annotations and error-rate statistics.
`access_codec.ipynb`	Jupyter notebook for interactive exploration and QC of ACCESS CODEC results.

pegRNA assays (Perl)

Script	Description
`pegRNA_reporter_150nt_assay.pl` / `…_v2.pl` / `…_v2_fused.pl`	Parse 150 nt pegRNA reporter assay reads: match reads to the pegRNA library (guide RNA + PBS + reporter sequence), then count unedited vs. edited reporter variants per construct to quantify prime editing efficiency. The `_fused` variant handles FLASH-merged (fused) read input.
`pegRNA_endo_150nt_step1.pl` / `…_step2.pl`	Two-step pipeline for endogenous pegRNA editing: step 1 extracts and matches reads to endogenous amplicon sequences; step 2 counts base identities at the edit site to measure in-situ prime editing efficiency.

Other assays (Perl)

Script	Description
`ShuttleSeq_parse.pl` / `ShuttleSeq_parse_v2.pl`	Parse ShuttleSeq reads: match a fixed scaffold sequence (allowing C/T or G/A degeneracy at edit positions), extract the gRNA insert, and count per-gRNA reads with edit state classification.
`TFome_reporter_get_edits.pl`	Extract and count edits from TFome reporter reads: matches reads to TF-editor fusion constructs and reports per-TFBS-phrase edit counts, supporting multiple edit modes (A2G, G2A, T2C, C2T).
`vORF_barcode_ct.pl`	Counts barcodes associated with variant ORFs from FASTQ by matching a fixed pre-sequence and extracting barcode sequences.
`HDRminipool_build_dict.pl`	Builds a barcode-to-construct dictionary for HDR minipool libraries by parsing reads and collapsing barcodes within Hamming distance 1.
`HDRminipool_edit_count.pl`	Counts edits per construct in HDR minipool assays using the dictionary built by `HDRminipool_build_dict.pl`.
`spike_in_analysis.pl` / `spike_in_process_fastq.pl`	Process spike-in control reads for normalization: `process_fastq` demultiplexes spike-in FASTQ, `analysis` computes per-spike-in read counts and normalization factors.
`sam_demult_ATAC.pl`	Demultiplexes ATAC-seq reads from SAM format by barcode, writing per-sample output files.
`Brandon_script.pl`	Custom analysis script (contact lab for details).
`encode_lib_common.py` / `encode_lib_genomic.py`	Utility libraries adapted from the ENCODE ATAC-seq pipeline, providing functions for file I/O, genome arithmetic, and BAM processing.
`encode_task_tss_enrich.py`	Calculates TSS enrichment score for ATAC-seq QC using the metaseq library: aggregates read coverage in windows around TSS regions and applies Greenleaf-lab normalization, generating enrichment plots and a numeric score.

`PPIseq_proj/`

Pipeline for PPI-seq (Protein-Protein Interaction sequencing), a base-editor–coupled screen that measures protein interactions by linking interacting ORF pairs to uniquely decodable barcode combinations. When two proteins interact, their fusion constructs co-localize, allowing the base editor to mark both barcodes with characteristic A→G or T→C edits. Sequencing the edited barcodes (cDNA) and comparing to unedited (gDNA) counts reveals interaction enrichment. The pipeline supports multiple library formats: ORFeome, RBP, ClinORFeome, 100ZF, coIP-PPIseq, and DMS-PPIseq.

`qsub_script/` — Cluster Job Scripts

The qsub scripts follow a two-step pattern per library type:

1. Build dictionary: Map barcodes → ORF/gene assignments from plasmid sequencing data (BLAST or cDNA-based).
2. Compute edit rates in parallel: Split the barcode space, assign cDNA reads, and compute enrichment scores.

Script	Description
`ORFeomePPIseq_master.sh`	Master shell script orchestrating the full ORFeome PPI-seq pipeline.
`*_build_dict_master.qsub`	Builds barcode-to-ORF dictionaries for each library variant (RBP, ClinORFeome, ORFeome, StitchR-COMMD, ABE-ORF, ABE25ORF-MCPL5ORF, coIP).
`*_editRate_parallel.qsub`	Computes per-barcode edit rates across the library in parallel (100ZF, RBP, ClinORFeome, ORFeome, RPIseq, ABE-MCP, DMS).
`coIP_PPIseq_build_dict.qsub` / `coIP_PPIseq_process_cDNA.qsub`	coIP-PPIseq–specific dictionary building and cDNA processing jobs.
`multiPPIseq_DMS_build_dict/`	Subdirectory containing DMS (deep mutational scanning) PPIseq dictionary-building pipeline scripts.
`multiPPIseq_ABE-ORF_MCP-ORF_step2_parallel.sh`	Parallel step 2 for the ABE-ORF / MCP-ORF dual-barcode multi-PPIseq.
`merge_job_out.sh`	Merges output files from parallel edit-rate jobs back into a single result file.
`Ondrej_analysis.qsub`	Custom analysis job for Ondrej's PPI-seq dataset.

`script/` — Analysis Scripts

FASTQ-to-barcode conversion (Perl)

Script	Description
`fqPE_to_faBC.pl`	Extracts the first 15 nt of R1 as the barcode and uses R2 as the associated sequence, writing R1-barcode / R2-sequence pairs as FASTA. The first step of the dictionary-building pipeline for most PPIseq formats.
`fqPE_to_faBC_pairs.pl`	Extracts barcode pairs from paired-end reads for dual-barcode libraries.
`fqPE_to_faBC_variable_BC_length.pl`	Variant that handles variable-length barcodes (e.g., when barcode length varies by ORF).
`partition_barcodes.pl`	Splits a full barcode list into N equal partitions for parallel downstream processing; each partition is written to a separate file to enable job array–style parallelization.
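
Splitting a barcode list into N near-equal partitions, as `partition_barcodes.pl` is described as doing, can be sketched as follows. How the Perl script handles a remainder is not stated; here the first partitions each take one extra item.

```python
def partition(items, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


# Seven toy barcodes split three ways
print(partition(["bc1", "bc2", "bc3", "bc4", "bc5", "bc6", "bc7"], 3))
```

Each chunk would then be written to its own file and handed to one task of a cluster job array.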

cDNA parsing — ORF assignment (Perl)

These scripts read cDNA FASTQ files, extract barcodes, match them against a pre-built barcode dictionary, validate the edit pattern (T→C or A→G at expected positions), and output per–gene-pair read counts.

Script	Description
`ORFeomePPIseq_cDNA_parse.pl` / `…_v2.pl`	Core ORFeome PPIseq cDNA parser: extracts the 15 nt barcode from each read, looks it up in a barcode→gene dictionary, validates T→C or A→G edits at predetermined positions, and outputs read counts per gene with edit statistics.
`RPIseq_cDNA_parse.pl`	cDNA parsing for RPI-seq (RNA-Protein Interaction sequencing) format.
`Ondrej_cDNA_parse.pl`	Custom parser for Ondrej's library format.
`multiPPIseq_ABEL5ORF_MCPL5ORF_cDNA_parse.pl`	Dual-barcode cDNA parser for the ABE-L5ORF / MCP-L5ORF multi-PPIseq format: extracts variable-length ABE and fixed-length MCP barcodes, validates edit patterns on both barcodes, and counts gene-pair interactions.
`multiPPIseq_ABE25ORF_MCPL5ORF_cDNA_parse.pl`	Dual-barcode cDNA parser for the ABE25ORF / MCPL5ORF variant.
`*_cDNA_parse_parallel/`	Subdirectories containing parallelized (partitioned-barcode) cDNA parse scripts for large libraries: 100ZF, ClinORFeome, RBP, LDL_PPI_Super, multiPPIseq ABE-MCP, and multiPPIseq DMS.

BLAST-based barcode-to-ORF assignment (Perl)

Used when barcode-to-ORF mapping is established via BLAST of barcode sequences against an ORF library rather than by direct read matching.

Script	Description
`ORFeomePPIseq_blastn_parse.pl` / `…_v2.pl`	Parses BLAST tabular output (outfmt 6) to assign each barcode its best ORF match: filters by >95% identity and >15 bp alignment length, requires >2 supporting reads, and collapses near-duplicate barcodes within Hamming distance 1.
`100ZF_blastn_parse.pl`	BLASTN-based ORF assignment for the 100 zinc-finger library.
`multiPPIseq_ABE-MCP_blastn_parse_v2.pl` / `multiPPIseq_ABE-MCP_blastn_parse_noHD.pl`	BLASTN parsing for the dual ABE/MCP barcode system; v2 adds Hamming-distance merging; `noHD` skips merging.
`multiPPIseq_ABE-ORF_blastn_parse_v2.pl` / `…_v2_ABE24.pl`	BLASTN parsing for ABE-ORF format; the `_ABE24` variant targets ABE24 barcode architecture.
`*_blastn_parse_parallel/`	Parallelized BLASTN parsing scripts for ABE-MCP and 100ZF formats.
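
The filtering thresholds listed for `ORFeomePPIseq_blastn_parse.pl` can be sketched against standard BLAST tabular output (outfmt 6: qseqid, sseqid, pident, length, ...). Two assumptions are made for illustration: query IDs have the hypothetical form `<barcode>_<read number>`, and each retained line represents one supporting read. Hamming-distance collapsing is left out.

```python
from collections import Counter, defaultdict


def assign_barcodes(blast_lines, min_identity=95.0, min_length=15, min_reads=2):
    """Assign each barcode its most frequently hit ORF from outfmt 6 lines.

    Keeps hits with pident > min_identity and alignment length > min_length,
    then requires more than min_reads supporting hits for the winning ORF.
    """
    votes = defaultdict(Counter)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete outfmt 6 record
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        if pident > min_identity and length > min_length:
            barcode = qseqid.rsplit("_", 1)[0]  # hypothetical "<barcode>_<n>" query IDs
            votes[barcode][sseqid] += 1
    assignments = {}
    for barcode, orf_counts in votes.items():
        orf, support = orf_counts.most_common(1)[0]
        if support > min_reads:
            assignments[barcode] = (orf, support)
    return assignments


def _hit(query, subject, pident):
    """Build one made-up outfmt 6 line."""
    return f"{query}\t{subject}\t{pident}\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180"


lines = [
    _hit("AAAC_1", "ORF1", 99.0), _hit("AAAC_2", "ORF1", 98.5), _hit("AAAC_3", "ORF1", 100.0),
    _hit("AAAC_4", "ORF3", 90.0),  # fails the identity filter
    _hit("GGGT_1", "ORF2", 99.0), _hit("GGGT_2", "ORF2", 99.0),  # only two supporting reads
]
print(assign_barcodes(lines))
```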

Dictionary building (Python)

Used for coIP-PPIseq, where barcodes are assigned via a co-immunoprecipitation read structure rather than BLAST.

Script	Description
`coIP_PPIseq_build_dict.py` / `coIP_PPIseq_build_dict_parallel.py`	Parses paired-end FASTQ to extract barcodeA, barcodeB, and ORF sequences from fixed-position read coordinates; matches ORFs against a reference library; outputs a barcode-pair dictionary CSV with read counts, dominance statistics, and distribution histograms. The parallel version partitions processing for large libraries.
`coIP_PPIseq_filter_dict.py`	Filters the coIP dictionary by minimum read support and dominance thresholds to remove low-confidence barcode assignments.
`coIP_PPIseq_process_cDNA.py`	Processes cDNA FASTQ for coIP-PPIseq: extracts barcode pairs using fixed flanking sequences and counts read support per pair against the filtered dictionary, outputting per–gene-pair interaction counts.
`multiPPIseq_DMS_dict_build/`	Subdirectory with dictionary-building utilities for the DMS-PPIseq format.

Barcode filtering (Perl)

Script	Description
`filter_ABE_barcodes_by_dict.pl`	For dual-barcode systems, filters BLAST results by matching the ABE component against a validated barcode dictionary (1-base Hamming tolerance) and combines with the MCP barcode annotation to produce a paired barcode–gene table.
`filter_ABE_barcodes_by_lib.pl`	Filters ABE barcodes against a known library list rather than a dictionary.

`gRNA_ct_proj/`

Counts gRNA sequences from FASTQ files for pooled CRISPR screens. Supports both single-guide libraries and dual-guide (paired) libraries. Integrates with BEAN for Bayesian effect estimation.

`qsub_script/`

Script	Description
`fq_gRNA_ct.qsub`	Cluster job for single gRNA counting.
`fq_gRNA_pair_ct.qsub`	Cluster job for paired gRNA counting.
`trim_count.qsub`	Trims reads then runs counting.
`run_BEAN.qsub`	Runs BEAN (Bayesian Estimation of Allelic effects from NGS) on gRNA count output for base-editing screen analysis.

`script/`

Script	Description
`fq_gRNA_ct.pl`	Counts gRNA read frequencies from FASTQ: extracts the 20 nt that follow a fixed pre-sequence in each read and matches them against a gRNA library, first as a 20 nt sequence and then with a 19 nt fallback. Outputs per-gRNA read counts with total and unmapped statistics.
`fq_gRNA_ct_alt.pl`	Alternate single-guide counter that uses a different read-anchor strategy for libraries with non-standard pre-sequence positions.
`fq_gRNA_pair_ct.pl`	Counts paired gRNA reads from R1/R2 FASTQ files: extracts partial sequences from each read and searches for complementary pairs in a gRNA-pair library, tracking reads matching both guides, either guide alone, or a paired combination. Outputs per–gRNA-pair mapping counts.
`fq_gRNA_pair_ct_alt.pl`	Alternate paired-guide counting variant for different library architecture.
`trim_count.pl`	Extracts 20 nt sequence variants from FASTQ, sorts by frequency, and maps to gRNA library entries; outputs per-sequence counts with gRNA IDs. Simpler variant of `fq_gRNA_ct.pl` useful for QC and troubleshooting.
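
The anchor-and-match logic described for `fq_gRNA_ct.pl` can be sketched in Python as follows. This is an illustration, not a port of the Perl script: the pre-sequence `CACCG`, the function name, and the exact form of the 19 nt fallback (comparing the last 19 nt of the extracted sequence against the last 19 nt of each library guide) are all assumptions.

```python
from collections import Counter

def count_grnas(reads, library, pre_seq, guide_len=20):
    """Count reads per gRNA ID.

    library maps 20-nt guide sequence -> gRNA ID.
    Returns (counts, unmapped), where unmapped is the number of reads with
    no anchor, a truncated guide, or no library match.
    """
    # Assumed fallback index: guides keyed by their last 19 nt.
    # If two guides share the same 19-nt suffix, the later one wins here.
    fallback = {seq[1:]: gid for seq, gid in library.items()}
    counts = Counter()
    unmapped = 0
    for read in reads:
        pos = read.find(pre_seq)
        if pos == -1:
            unmapped += 1
            continue
        start = pos + len(pre_seq)
        guide = read[start:start + guide_len]
        if len(guide) < guide_len:
            unmapped += 1
            continue
        gid = library.get(guide) or fallback.get(guide[1:])
        if gid is None:
            unmapped += 1
        else:
            counts[gid] += 1
    return counts, unmapped

# Hypothetical two-guide library and three toy reads.
library = {"ACGTACGTACGTACGTACGT": "g1", "TTTTGGGGCCCCAAAATTTT": "g2"}
reads = [
    "NNCACCGACGTACGTACGTACGTACGTGTTT",  # exact 20-nt match to g1
    "CACCGGTTTGGGGCCCCAAAATTTTGT",      # first base differs, 19-nt fallback to g2
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",   # no pre-sequence
]
counts, unmapped = count_grnas(reads, library, pre_seq="CACCG")
```

Reads are passed in as plain sequence strings to keep the sketch short; parsing gzip-compressed FASTQ would be a separate step.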

`read_ct_proj/`

Barcode and read counting for a range of massively parallel reporter assays (MPRA), CRISPRa screens, and splicing assays. Most assays follow a multi-step workflow: (1) extract and assign barcodes from sequencing reads, then (2) count cDNA and gDNA barcodes separately to compute enrichment ratios.

`qsub_script/`

Script	Description
`read_ct.qsub`	General-purpose barcode/read counting job.
`fq_gRNA_ct.qsub`	gRNA counting job for CRISPRa screens (shared with gRNA_ct_proj).
`CRISPRaLDL.qsub` / `CRISPRaLDL_v2.qsub`	CRISPRa LDL reporter assay (MPRA for LDL pathway). V2 adds subsampling.
`CRISPRaLDL_v2_subsample.qsub`	Subsampled variant for CRISPRaLDL.
`ADRD_count_oligo.qsub`	Counts oligonucleotides for ADRD regulatory variant assay.
`dCas9Sun_count_oligo.qsub`	Counts oligos for dCas9-SunTag CRISPRa assay.
`dCasRx_SF.qsub`	dCasRx splicing factor reporter assay job.
`UDN_MPSA.qsub`	UDN massively parallel splicing assay job.

`script/`

CRISPRa reporter assay pipeline (Perl)

This multi-step pipeline pairs each gRNA to its set of promoter barcodes (from plasmid sequencing), then quantifies cDNA and gDNA barcode abundance to compute per-promoter transcriptional enrichment scores.

Script	Description
`CRISPRa_scriptV2_step1.pl`	Step 1: processes paired FASTQ files by matching gRNA sequences (from R1, after a fixed pre-sequence) and promoter barcodes (from R2), assigns gRNA–promoter barcode pairs with Hamming-distance–1 error correction, and outputs a barcode→gRNA→promoter assignment table with QC metrics.
`CRISPRa_scriptV2_step1_gRNA_only.pl`	Step 1 variant that extracts gRNA reads only (no promoter barcode), used for gRNA-only counting.
`CRISPRa_scriptV2_step1_subsample.pl`	Step 1 with read subsampling for sensitivity analysis.
`CRISPRa_scriptV2_step2a.pl`	Step 2a: extracts barcode sequences from cDNA and gDNA FASTQ files and outputs per-barcode read counts (partitioned into multiple files for parallel step 2b/2c processing).
`CRISPRa_scriptV2_step2b.pl`	Step 2b: collapses cDNA/gDNA barcode counts from step 2a onto the proper-barcode set from step 1, using Hamming-distance–1 merging to correct sequencing errors.
`CRISPRa_scriptV2_step2c.pl`	Step 2c: aggregates collapsed barcode counts into per-gRNA and per-promoter enrichment ratios (cDNA / gDNA).
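
The final aggregation in step 2c, where collapsed barcode counts become per-gRNA or per-promoter cDNA / gDNA ratios, can be sketched as below. The function and variable names are hypothetical, and the sketch applies no depth normalization or pseudocounts; whether the actual script does so is not stated here.

```python
from collections import defaultdict

def enrichment_by_group(assignment, cdna_counts, gdna_counts):
    """Sum barcode counts per group and return cDNA / gDNA ratios.

    assignment maps barcode -> group (a gRNA or promoter ID).
    Groups with zero gDNA reads get None instead of a ratio.
    """
    cdna_total = defaultdict(int)
    gdna_total = defaultdict(int)
    for barcode, group in assignment.items():
        cdna_total[group] += cdna_counts.get(barcode, 0)
        gdna_total[group] += gdna_counts.get(barcode, 0)
    return {
        group: (cdna_total[group] / gdna_total[group]) if gdna_total[group] else None
        for group in gdna_total
    }

# Toy data: two barcodes assigned to g1, one to g2 (no gDNA coverage).
assignment = {"AAA": "g1", "AAC": "g1", "GGG": "g2"}
cdna = {"AAA": 30, "AAC": 10, "GGG": 5}
gdna = {"AAA": 10, "AAC": 10}
ratios = enrichment_by_group(assignment, cdna, gdna)
```

Summing counts before dividing weights each barcode by its abundance; averaging per-barcode ratios instead is an alternative design with different sensitivity to low-count barcodes.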

Barcode counting utilities (Perl)

Script	Description
`barcodeCt.pl`	Extracts 15 nt barcodes from reads after a fixed pre-sequence; collapses barcodes within Hamming distance 1 to correct sequencing errors; outputs final barcode → count mapping.
`barcodeCt_byPromoter.pl`	Barcode counts stratified by promoter identity (uses the gRNA–promoter assignment from step 1).
`barcodeCt_bygRNA.pl`	Barcode counts stratified by gRNA identity.
`readCt_byBarcode.pl`	Matches barcode sequences from a reference library to FASTQ reads using regex search within a 60 bp window; outputs per-barcode read counts.
`readCt_byLib.pl`	Read counts grouped by library construct (rather than individual barcode).
`readCt_PEvsWT.pl`	Compares prime-edited (PE) vs. wild-type (WT) read counts in a pegRNA reporter assay: searches for PE and WT diagnostic sequences and reports the ratio of PE to WT reads.
`fq_gRNA_ct.pl`	gRNA counting (also used in gRNA_ct_proj; see above).
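
Collapsing barcodes within Hamming distance 1, as described for `barcodeCt.pl`, is often done greedily: barcodes are visited from most to least abundant, and each one is merged into an already-accepted barcode one mismatch away if such a barcode exists. The sketch below shows that common approach in Python; it is an assumption that the Perl script follows the same order and tie-breaking.

```python
from collections import Counter

def neighbors(barcode, alphabet="ACGT"):
    """Yield every sequence exactly one substitution away from barcode."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]

def collapse_hamming1(counts):
    """Greedily merge low-count barcodes into abundant one-mismatch parents."""
    collapsed = Counter()
    # Most abundant first; ties broken alphabetically for determinism.
    for barcode, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parents = [p for p in neighbors(barcode) if p in collapsed]
        if parents:
            best = max(parents, key=lambda p: collapsed[p])
            collapsed[best] += n
        else:
            collapsed[barcode] = n
    return collapsed

# Toy counts with 4-nt barcodes (the script itself uses 15 nt barcodes).
raw = {"ACGT": 100, "ACGA": 3, "TTTT": 50, "TTTA": 2, "GGGG": 1}
merged = collapse_hamming1(raw)
```

Here `ACGA` folds into `ACGT` and `TTTA` into `TTTT`, while `GGGG` has no accepted neighbor and is kept as its own barcode.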

Splicing & reporter assays (Perl)

Script	Description
`UDN_MPSA_step1.pl`	Step 1 of the UDN massively parallel splicing assay: matches 40 bp oligo and 15 bp barcode reads from paired FASTQ by fixed flanking sequences, builds oligo–barcode associations, and outputs oligo and barcode statistics with paired assignments.
`UDN_MPSA_step2.pl`	Step 2: uses the oligo–barcode map from step 1 to quantify each variant's representation in the spliced (included) and unspliced (excluded) RNA fractions, computing per-oligo splicing efficiency.
`dCasRx_SF_script.pl` / `dCasRx_SF_script_temp.pl`	dCasRx splicing factor assay: matches R1 library sequences (20 bp) and R2 exon variant markers (10 bp) to their respective reference libraries, then outputs per–library-member counts of exon-included, exon-excluded, and other variants for splicing analysis.
`dCas9Sun_count_oligo.pl`	Counts oligo sequences for the dCas9-SunTag CRISPRa transcriptional reporter assay.
`ADRD_count_oligo.pl`	Counts 40 bp oligonucleotide sequences from FASTQ for the ADRD (Alzheimer's Disease and Related Dementias) regulatory variant reporter assay: matches a fixed leading sequence, extracts the oligo, and accumulates per-oligo read counts.
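
The per-oligo splicing efficiency computed in `UDN_MPSA_step2.pl` can be illustrated with a short sketch. It assumes efficiency is defined as included reads divided by the sum of included and excluded reads, which is a common convention but is not confirmed by the description above; all names are hypothetical.

```python
from collections import defaultdict

def splicing_efficiency(barcode_to_oligo, included, excluded):
    """Return oligo -> included / (included + excluded), or None if no reads.

    barcode_to_oligo is the oligo-barcode map from step 1;
    included and excluded map barcode -> read count in each RNA fraction.
    """
    inc = defaultdict(int)
    exc = defaultdict(int)
    for barcode, oligo in barcode_to_oligo.items():
        inc[oligo] += included.get(barcode, 0)
        exc[oligo] += excluded.get(barcode, 0)
    result = {}
    for oligo in set(barcode_to_oligo.values()):
        total = inc[oligo] + exc[oligo]
        result[oligo] = inc[oligo] / total if total else None
    return result

# Toy data: two barcodes for oligo o1, one unobserved barcode for o2.
bc_map = {"b1": "o1", "b2": "o1", "b3": "o2"}
included_counts = {"b1": 30, "b2": 10}
excluded_counts = {"b1": 5, "b2": 5}
efficiency = splicing_efficiency(bc_map, included_counts, excluded_counts)
```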
