Jeffrey Hoelzel Jr.

About

PepSeqPred is an open-source machine learning pipeline I developed to predict antibody epitope locations across large pathogen proteins. The project was motivated by a major bottleneck in immunological research: experimentally mapping antibody binding sites through peptide serology is slow, expensive, and difficult to scale across large pathogen proteomes. PepSeqPred addresses this by learning sequence and structural patterns associated with antibody recognition, allowing researchers to prioritize high-likelihood peptide targets before committing to costly experimental validation. The project was conducted through Northern Arizona University's Pathogen and Microbiome Institute under a HURA research grant, with mentorship from Dr. Jason Ladner and Dr. Evan Elko.

At the core of PepSeqPred is a residue-level classification pipeline that converts protein sequences into ESM-2 protein language model embeddings, then passes those embeddings through a fully-connected neural network to generate per-residue epitope probability predictions. Because ESM-2 has a sequence length limit, I implemented a sliding window embedding strategy for long proteins, averaging overlapping regions to better preserve full sequence context. The model uses peptide-level PepSeq serology reactivity measurements and projects those labels down to the residue level, enabling training on highly multiplexed data. To reduce leakage and improve generalization, the training pipeline also supports family-aware splitting, k-fold ensembling, deterministic multi-seeded experiments, and class imbalance handling.
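As a rough illustration of the two steps described above, the following sketch averages overlapping window embeddings and projects peptide-level labels onto residues. It is not taken from the PepSeqPred codebase: `embed_window` is a hypothetical stand-in for an ESM-2 call that returns one vector per residue, the window and stride values are arbitrary, and the labeling rule (a residue is positive if any reactive peptide covers it) is an assumption made for the example.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def sliding_window_embed(
    sequence: str,
    embed_window: Callable[[str], List[Vector]],  # hypothetical embedder
    window: int = 1000,  # illustrative value, not the project's setting
    stride: int = 500,   # illustrative value, not the project's setting
) -> List[Vector]:
    """Embed overlapping windows and average vectors where windows overlap."""
    n = len(sequence)
    if n == 0:
        return []
    # Window start positions; add a final window flush with the sequence end
    # so that trailing residues are always covered.
    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:
        starts.append(n - window)

    sums: List[Optional[Vector]] = [None] * n
    counts = [0] * n
    for start in starts:
        chunk = embed_window(sequence[start:start + window])
        for offset, vec in enumerate(chunk):
            pos = start + offset
            if sums[pos] is None:
                sums[pos] = list(vec)
            else:
                sums[pos] = [a + b for a, b in zip(sums[pos], vec)]
            counts[pos] += 1
    # Divide each accumulated vector by the number of windows that covered it.
    return [[v / counts[i] for v in sums[i]] for i in range(n)]


def project_peptide_labels(
    protein_length: int,
    peptides: Sequence[Tuple[int, int, int]],  # (start, end_exclusive, label)
) -> List[int]:
    """Project peptide labels (0 or 1) onto residues; -1 means unlabeled.

    Assumed rule: any covering reactive peptide makes the residue positive.
    """
    labels = [-1] * protein_length
    for start, end, label in peptides:
        for pos in range(max(start, 0), min(end, protein_length)):
            labels[pos] = max(labels[pos], label)
    return labels
```

Residues covered by no peptide stay at -1 so that a training loop can mask them out rather than treat them as negatives.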

A major focus of the project was building a reproducible research pipeline rather than a one-off model. I designed the repository to support end-to-end preprocessing, ESM-2 embedding generation, residue-level label construction, fully-connected neural network training, Optuna hyperparameter tuning, prediction, evaluation, and HPC orchestration. The developer workflow includes command-line tools for each pipeline stage, SLURM scripts for large-scale runs on NAU's Monsoon supercomputing cluster, and support for distributed training across multiple GPUs. This allowed PepSeqPred to scale across large embedding corpora while maintaining experiment reproducibility, deterministic split planning, and clear artifact contracts for checkpoints, ensemble manifests, label shards, and evaluation outputs.
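One way deterministic, family-aware split planning can be made reproducible is to derive each fold from a hash of the protein family identifier, so that the same inputs always yield the same plan and no family is divided across folds. The sketch below is an assumption about how such a planner could work, not the repository's actual implementation; the function names and the `seed` convention are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def family_fold(family_id: str, n_folds: int, seed: int = 0) -> int:
    """Map a family to a fold using a stable hash (independent of run order)."""
    digest = hashlib.sha256(f"{seed}:{family_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def plan_folds(
    protein_to_family: Dict[str, str],
    n_folds: int = 5,
    seed: int = 0,
) -> Dict[int, List[str]]:
    """Assign every protein to the fold of its family.

    All proteins that share a family land in the same fold, which keeps
    closely related sequences from appearing on both sides of a split.
    """
    folds: Dict[int, List[str]] = defaultdict(list)
    for protein_id in sorted(protein_to_family):
        fold = family_fold(protein_to_family[protein_id], n_folds, seed)
        folds[fold].append(protein_id)
    return dict(folds)
```

Hashing does not balance fold sizes, so a real planner would likely add a balancing step; the point here is only that the plan depends on nothing but the family labels, the fold count, and the seed.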

PepSeqPred was evaluated using metrics designed for extreme class imbalance, with precision-recall AUC prioritized over accuracy or ROC AUC because true epitope residues are rare. In validation, the model achieved a PR AUC of 0.246 compared to a random baseline of 0.070, representing a 3.53x improvement in ranking true epitopes. The model also showed improved ROC AUC, balanced accuracy, F1, and MCC compared to baseline expectations. External validation on an independent fungal dataset demonstrated that the model could maintain useful signal outside the human virome training domain, suggesting that PepSeqPred learned transferable patterns of antibody recognition rather than simply memorizing training data.
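To make the comparison against a random baseline concrete: for a ranking that carries no information, the expected PR AUC is approximately the fraction of residues that are true epitopes, which is why the baseline sits far below 0.5 when positives are rare. The sketch below computes average precision (a common estimator of PR AUC) and that prevalence baseline in plain Python. It assumes binary labels and scores without ties, and it is an illustration rather than the evaluation code used in the project.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each true positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank residues from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


def random_baseline(labels: Sequence[int]) -> float:
    """Expected PR AUC of an uninformative ranking: the positive rate."""
    return sum(labels) / len(labels)


def fold_improvement(labels: Sequence[int], scores: Sequence[float]) -> float:
    """How many times better than chance the ranking is."""
    return average_precision(labels, scores) / random_baseline(labels)
```

Reporting the ratio alongside the raw value keeps results comparable across datasets with different epitope prevalence.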

To make the work practically usable, I packaged PepSeqPred as a Python library with a lightweight inference API and bundled pretrained models. Users can install the package with pip install pepseqpred, load a pretrained predictor, and generate residue-aligned binary epitope masks from a protein sequence or FASTA file. This turns the research pipeline into a tool that can be used directly by scientists to prioritize antibody targets, guide serological assay design, and accelerate large-scale studies of infection history.

Tech Stack

Python, PyTorch, ESM-2, Sequence Models, Optuna, SLURM, HPC, PyPI, Pipeline, Next.js
