Peptide Databases

How Immunoinformatics Predicts Peptide Immune Responses

13 min read|March 22, 2026

AUC above 0.97

Modern neural network-based MHC binding predictors such as NetMHCpan reach AUC values above 0.97 when distinguishing peptides that bind MHC class I molecules from those that do not.

Nielsen et al., Annual Review of Biomedical Data Science, 2020

Computational visualization of a peptide fitting into an MHC binding groove with prediction scores

Every protein in a virus, bacterium, or tumor cell gets chopped into short peptide fragments by the cell's proteasome. Some of those fragments bind to MHC molecules and are displayed on the cell surface. A fraction of those MHC-peptide complexes get recognized by T cells, triggering an immune response. The problem: a single pathogen proteome can yield anywhere from thousands to millions of candidate peptide fragments, but only a tiny fraction makes it through the MHC binding filter and triggers a T cell response.[1] Immunoinformatics is the field that uses computational tools to predict which peptides will pass each filter, and it has become essential to modern vaccine design, cancer immunotherapy, and peptide drug development. For a broader look at how computational tools are catalogued and accessed, see our pillar article on peptide databases.

Key Takeaways

  • MHC class I molecules bind peptides of 8-11 amino acids with specific anchor residue preferences at position 2 and the C-terminus (position 9 in a 9-mer); predicting these binding events is the core task of immunoinformatics (Nielsen et al., 2020)
  • NetMHCpan 4.1, trained on both binding affinity data and eluted ligand data from mass spectrometry, achieves an AUC above 0.97 for MHC class I binding prediction (Nielsen et al., 2020)
  • Pan-allele prediction methods can generate binding forecasts for any of the 12,000+ known HLA alleles, even those with few or no experimental binding measurements (Farrell et al., 2021)
  • Peptide-MHC binding is necessary but not sufficient for immune activation; proteasomal cleavage, TAP transport, T cell receptor recognition, and immunodominance hierarchies all filter which peptides actually trigger responses (Nielsen et al., 2020)
  • Multi-epitope vaccine designs use immunoinformatics to combine predicted epitopes from multiple antigens into single constructs that maximize population coverage (Lu et al., 2017)
  • Neoantigen prediction for personalized cancer vaccines relies on the same MHC binding algorithms applied to tumor-specific mutations identified by sequencing (Rao et al., 2020)

The antigen presentation pipeline

Before a peptide can trigger a T cell response, it must pass through a multi-step processing and presentation pipeline. Immunoinformatics attempts to predict the outcome at each step.

Step 1: Proteasomal cleavage. Intracellular proteins are degraded by the proteasome into peptide fragments of varying lengths. Not all potential cleavage sites are used equally. Computational proteasome cleavage predictors (like NetChop) estimate which fragments will actually be generated.

Step 2: TAP transport. Peptide fragments must be transported from the cytoplasm into the endoplasmic reticulum by the transporter associated with antigen processing (TAP). TAP has its own peptide selectivity, favoring fragments of 8-16 amino acids with specific C-terminal residues.

Step 3: MHC binding. Inside the ER, peptides compete for binding to newly synthesized MHC class I molecules. Only peptides that fit the MHC binding groove with sufficient affinity will form stable complexes. This is the most selective step and the one where computational prediction has achieved the greatest accuracy.[1]

Step 4: Surface presentation. Stable peptide-MHC complexes travel to the cell surface, where they are displayed for T cell surveillance.

Step 5: T cell recognition. A T cell receptor (TCR) must recognize the specific peptide-MHC complex. This step is the least predictable computationally because TCR diversity is enormous and the structural determinants of TCR-pMHC recognition are less well understood than MHC binding.

Most immunoinformatics tools focus on Step 3 because MHC binding is the strongest bottleneck and the most tractable prediction problem.
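To make the size of the search space concrete, the sketch below enumerates every 8-11 residue window in a protein, which is the candidate list a binding predictor then has to rank. The function names and the toy sequence are illustrative assumptions, and real proteasomal cleavage does not produce every window with equal probability.

```python
def candidate_peptides(protein, min_len=8, max_len=11):
    """Yield (start, peptide) for every window of min_len..max_len residues."""
    for k in range(min_len, max_len + 1):
        for start in range(len(protein) - k + 1):
            yield start, protein[start:start + k]


def window_count(n_residues, min_len=8, max_len=11):
    """Closed-form count of windows for a protein of n_residues."""
    return sum(max(n_residues - k + 1, 0) for k in range(min_len, max_len + 1))


# Hypothetical 22-residue sequence, used only to exercise the functions.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"
peptides = list(candidate_peptides(toy_protein))
print(len(peptides), window_count(len(toy_protein)))
```

Because the count grows roughly as four windows per residue, a proteome of about 10,000 residues yields on the order of 40,000 candidates before any filtering.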

How MHC binding prediction works

MHC class I molecules bind peptides of 8-11 amino acids in a groove formed by two alpha-helices sitting on a beta-sheet floor. The groove has pockets (labeled A through F) that accommodate specific amino acid side chains from the bound peptide. Two positions are particularly important: the anchor residues at positions 2 and the C-terminus (usually position 9 for a 9-mer), which make the strongest contacts with MHC groove pockets B and F.[1]

Different HLA alleles have different groove shapes, meaning they prefer different amino acids at anchor positions. HLA-A*02:01 (the most common allele in many populations) prefers leucine or methionine at position 2 and valine or leucine at the C-terminal position (position 9 in a 9-mer). HLA-B*07:02 prefers proline at position 2. These allele-specific preferences create "binding motifs" that prediction algorithms learn to recognize.
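The anchor preferences described above can be written as a simple lookup, shown here as a rough sketch. This is a deliberately crude filter built only from the residues named in the text; it is not how NetMHCpan or any trained predictor works, and the dictionary layout and function name are assumptions made for illustration.

```python
# Simplified anchor preferences taken from the text; real motifs are
# position-weighted and tolerate more residues than listed here.
ANCHOR_MOTIFS = {
    "HLA-A*02:01": {"p2": set("LM"), "c_term": set("VL")},
    # The text gives no C-terminal preference for B*07:02, so none is enforced.
    "HLA-B*07:02": {"p2": set("P"), "c_term": None},
}


def matches_anchor_motif(peptide, allele):
    """Return True if the peptide has the listed anchor residues for the allele."""
    motif = ANCHOR_MOTIFS[allele]
    if not 8 <= len(peptide) <= 11:
        return False
    if peptide[1] not in motif["p2"]:  # position 2 is index 1
        return False
    c_term = motif["c_term"]
    return c_term is None or peptide[-1] in c_term


print(matches_anchor_motif("NLVPMVATV", "HLA-A*02:01"))
print(matches_anchor_motif("GILGFVFTL", "HLA-A*02:01"))
```

NLVPMVATV passes the check, while GILGFVFTL, a well-known HLA-A*02:01-restricted influenza epitope, fails because isoleucine at position 2 is not in the simplified set. Misses of this kind are one reason trained models replaced hand-written motifs.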

Machine learning approaches

Modern MHC binding predictors use neural networks trained on large datasets of experimentally measured peptide-MHC binding affinities and mass spectrometry-derived eluted ligand data.[1]

Allele-specific models train separate neural networks for each HLA allele with sufficient binding data. This works well for common alleles but fails for rare alleles with few measurements.

Pan-allele models solve this by incorporating the amino acid sequence of the MHC binding groove as an input feature alongside the peptide sequence. This allows the algorithm to generalize across alleles, predicting binding for any HLA allele based on its groove chemistry.[2] NetMHCpan, the leading pan-allele predictor, can generate binding predictions for all 12,000+ known HLA class I alleles.
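A minimal sketch of the pan-allele idea is to build one input vector from both the peptide and the groove residues of the allele, so that a single model sees both. The one-hot scheme, zero padding, and the groove strings below are placeholder assumptions; they are not NetMHCpan's actual encoding or real HLA sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Flatten a sequence into a list of 20-wide one-hot rows."""
    vec = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1.0
        vec.extend(row)
    return vec


def encode_pair(peptide, groove_residues, max_pep_len=11):
    """Concatenate a zero-padded peptide encoding with the groove encoding."""
    padding = [0.0] * len(AMINO_ACIDS) * (max_pep_len - len(peptide))
    return one_hot(peptide) + padding + one_hot(groove_residues)


# Placeholder groove strings, NOT real HLA pseudo-sequences.
groove_a = "ACDEFGHIKL"
groove_b = "ACDEFGHIKV"

x_a = encode_pair("NLVPMVATV", groove_a)
x_b = encode_pair("NLVPMVATV", groove_b)
print(len(x_a), x_a != x_b)
```

The same peptide produces a different input vector for each allele, which is what lets one trained network generalize to alleles that have little or no binding data of their own.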

Training data integration. Nielsen et al.'s 2020 review described how combining two types of experimental data dramatically improved prediction accuracy: binding affinity measurements (IC50 values from competitive binding assays) and eluted ligand data (peptides stripped from MHC molecules on actual cell surfaces and identified by mass spectrometry). Models trained on both data types achieve AUC values exceeding 0.97, meaning that when given one randomly chosen true binder and one randomly chosen non-binder, they score the binder higher about 97% of the time.[1]
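The ranking interpretation of AUC can be computed directly by comparing every binder score against every non-binder score. The sketch below does this with made-up scores; the function name and the example numbers are assumptions for illustration only.

```python
def auc(binder_scores, nonbinder_scores):
    """Fraction of (binder, non-binder) pairs where the binder scores higher.

    Ties count as half a win, matching the usual rank-based definition.
    """
    wins = 0.0
    for b in binder_scores:
        for n in nonbinder_scores:
            if b > n:
                wins += 1.0
            elif b == n:
                wins += 0.5
    return wins / (len(binder_scores) * len(nonbinder_scores))


# Hypothetical predicted scores, higher meaning "more likely to bind".
binders = [0.9, 0.8, 0.4]
nonbinders = [0.5, 0.3, 0.1]
print(round(auc(binders, nonbinders), 3))
```

An AUC of 1.0 means every binder outranks every non-binder, and 0.5 is what random scoring would give. The pairwise loop is quadratic, so larger benchmarks normally use a sort-based implementation instead.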

MHC class II: a harder prediction problem

MHC class II molecules present longer peptides (12-25 amino acids) to CD4+ helper T cells. The class II binding groove is open at both ends, allowing peptides to extend beyond it. This creates a harder prediction problem because the binding core (typically 9 amino acids) must be identified within a longer peptide, and flanking residues outside the groove also influence binding.[1]
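The binding core search can be sketched as a sliding window: score every 9-residue register inside the longer peptide and keep the best one. The scoring function here is a toy stand-in (it counts hydrophobic residues at core positions 1, 4, 6, and 9) and both it and the example peptide are hypothetical; real class II predictors learn the core register jointly with the binding score and also use flanking residues.

```python
HYDROPHOBIC = set("AVILMFYW")


def toy_core_score(core):
    """Toy scorer: count hydrophobic residues at core positions 1, 4, 6, 9."""
    return sum(1 for i in (0, 3, 5, 8) if core[i] in HYDROPHOBIC)


def best_binding_core(peptide, score_core, core_len=9):
    """Return (offset, core, score) for the highest-scoring register, or None."""
    best = None
    for offset in range(len(peptide) - core_len + 1):
        core = peptide[offset:offset + core_len]
        score = score_core(core)
        if best is None or score > best[2]:
            best = (offset, core, score)
    return best


# Hypothetical 16-residue peptide used only to exercise the search.
print(best_binding_core("GGGGFGGLGAGGVGGG", toy_core_score))
```

A 16-residue peptide has eight possible registers, so the predictor has to solve an alignment problem in addition to the binding problem, which class I prediction largely avoids.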

Class II prediction accuracy lags behind class I, with AUC values typically in the 0.85-0.92 range. The smaller volume of training data for class II alleles and the added complexity of binding core identification contribute to this gap. Tools like NetMHCIIpan address this with pan-allele approaches similar to those used for class I.

Key prediction tools

Several tools dominate the immunoinformatics landscape:[2]

NetMHCpan 4.1: The current gold standard for MHC class I binding prediction. Uses an ensemble of neural networks trained on binding affinity and eluted ligand data. Covers all known HLA-A, -B, and -C alleles through pan-allele prediction.

MHCflurry 2.0: An open-source alternative that uses a similar neural network architecture and achieves comparable performance. Includes antigen processing prediction (proteasomal cleavage and TAP transport) alongside MHC binding.

epitopepredict: A programmatic framework created by Farrell et al. that integrates multiple binding prediction algorithms under a single interface, enabling whole-proteome screening across multiple HLA alleles simultaneously.[2]

IEDB Analysis Resource: The Immune Epitope Database provides a web-based suite of prediction tools including binding, processing, and T cell reactivity predictors, with over 1.5 million curated epitope records for validation.

For how these tools fit into the broader AI revolution in peptide drug discovery, see our dedicated article.

Applications: from vaccines to cancer

Peptide vaccine design

Immunoinformatics enables rational vaccine design by identifying epitopes most likely to trigger protective immune responses across diverse human populations. Lu et al.'s 2017 multi-epitope vaccine approach exemplifies this: computational prediction identifies candidate epitopes from pathogen proteins, selects those with broad HLA coverage, and assembles them into synthetic constructs that maximize population-level immunity.[3]

The COVID-19 pandemic accelerated this approach. Multiple groups used immunoinformatics to predict SARS-CoV-2 T cell epitopes within weeks of the genome sequence being published, identifying conserved candidate epitopes whose predictions held up when subsequently tested experimentally.

For a specific example of peptide vaccines in clinical development, see our article on gp100 peptide vaccines for melanoma.

Personalized cancer immunotherapy

Neoantigen prediction represents the cutting edge of immunoinformatics applied to oncology. Tumor cells accumulate somatic mutations that create altered peptide sequences (neoantigens) not present in normal tissue. Rao et al.'s 2020 ProTECT tool predicts which tumor-specific mutations will generate peptides that bind the patient's own HLA alleles and trigger T cell responses.[4]

The workflow: sequence the tumor genome, identify somatic mutations, generate all possible mutant peptide fragments, predict which bind the patient's HLA alleles, filter for immunogenicity, and synthesize a personalized vaccine or identify targets for T cell therapy. This entire pipeline depends on accurate MHC binding prediction as its foundation.
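The "generate all possible mutant peptide fragments" step of this workflow can be sketched for a single missense mutation: apply the substitution, then collect every 8-11 residue window that overlaps the mutated position. The function name, the 0-based position convention, and the toy sequence are assumptions; this is not taken from ProTECT or any other published pipeline, and it ignores frameshifts and insertions.

```python
def mutant_peptides(protein, position, mutant_residue, min_len=8, max_len=11):
    """Return all windows of min_len..max_len that contain the mutated residue.

    `position` is a 0-based index into `protein`.
    """
    mutated = protein[:position] + mutant_residue + protein[position + 1:]
    peptides = set()
    for k in range(min_len, max_len + 1):
        first = max(0, position - k + 1)          # leftmost window covering position
        last = min(position, len(mutated) - k)    # rightmost window covering position
        for start in range(first, last + 1):
            peptides.add(mutated[start:start + k])
    return sorted(peptides)


# Hypothetical 30-residue protein with a substitution at index 15.
toy_protein = "A" * 30
candidates = mutant_peptides(toy_protein, 15, "W")
print(len(candidates))
```

Each candidate would then be scored against the patient's HLA alleles, and only the peptides predicted to bind are carried forward to the immunogenicity filter.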

For the clinical progress of peptide cancer vaccines, see our article on HER2 peptide vaccines for breast cancer.

Peptide drug immunogenicity screening

Therapeutic peptides can trigger unwanted immune responses that reduce efficacy or cause adverse reactions. Immunoinformatics tools are increasingly used during drug development to predict whether a candidate peptide drug will contain T cell epitopes that could provoke anti-drug antibodies. Peptide sequences can be computationally screened for MHC binding before synthesis, allowing immunogenic sequences to be modified ("deimmunized") early in development.[1]

Limitations of current prediction

Predicting MHC binding is not the same as predicting immune responses. Several gaps remain:

T cell receptor recognition. A peptide can bind MHC perfectly but never trigger a T cell response if no TCR in the individual's repertoire recognizes the complex. TCR prediction is far less developed than MHC binding prediction.

Immunodominance. When multiple epitopes are presented simultaneously, the immune system focuses its response on a subset (immunodominant epitopes). The rules governing immunodominance are poorly understood and not captured by current binding predictors.

Post-translational modifications. Proteins undergo phosphorylation, glycosylation, and other modifications that change the chemistry of individual residues without changing the underlying amino acid sequence. Most prediction tools operate on canonical amino acid sequences and cannot account for these modifications.

Tolerance and self-reactivity. The immune system is educated to ignore self-peptides. Predicting whether a peptide will be seen as foreign versus self requires modeling thymic selection, which current tools do not do well.

MHC class II accuracy. Class II prediction, essential for CD4+ T cell responses and vaccine helper epitopes, remains less accurate than class I prediction.

Where the field is heading

Several trends are advancing immunoinformatics beyond MHC binding:

Integrated pipelines combine proteasomal cleavage, TAP transport, MHC binding, and surface stability predictions into single-score immunogenicity estimates. These multi-step models better approximate the full antigen presentation pipeline.
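One simple way to picture a single-score estimate is to treat each step as a probability and multiply them, as in the sketch below. The independence assumption, the field names, and the example values are all illustrative; published tools typically learn how to weight the steps rather than multiplying them directly.

```python
from dataclasses import dataclass


@dataclass
class StepScores:
    """Per-step scores in [0, 1] for one peptide (hypothetical values)."""
    cleavage: float      # chance the proteasome generates this fragment
    tap: float           # chance TAP transports it into the ER
    mhc_binding: float   # chance it binds the MHC allele of interest
    stability: float     # chance the complex survives to the cell surface


def presentation_score(s):
    """Combine step scores by multiplication, assuming the steps are independent."""
    return s.cleavage * s.tap * s.mhc_binding * s.stability


strong_binder = StepScores(cleavage=0.2, tap=0.5, mhc_binding=0.95, stability=0.9)
balanced = StepScores(cleavage=0.8, tap=0.8, mhc_binding=0.7, stability=0.8)
print(presentation_score(strong_binder) < presentation_score(balanced))
```

In this toy comparison the peptide with the best binding score ranks lower overall because it is rarely generated by cleavage, which is the kind of reordering that multi-step models are meant to capture.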

TCR-pMHC interaction prediction using deep learning is emerging but remains in early development. Predicting which TCR will recognize a given peptide-MHC complex would close the largest remaining gap in immunoinformatics.

Population-level coverage optimization algorithms select minimal epitope sets that cover the maximum proportion of a target population's HLA diversity. This is critical for designing peptide vaccines with global applicability rather than coverage limited to specific ethnic populations.
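Selecting a small epitope set with broad HLA coverage is a form of the set cover problem, and a greedy heuristic is a common starting point. The sketch below is one such heuristic under loud assumptions: the epitope names, allele assignments, and frequencies are invented, and summing allele frequencies is only a rough proxy for true population coverage because each person carries several alleles.

```python
def greedy_epitope_set(epitope_alleles, allele_freq, max_epitopes):
    """Greedily pick epitopes that add the most not-yet-covered allele frequency."""
    chosen, covered = [], set()
    for _ in range(max_epitopes):
        best, best_gain = None, 0.0
        for epitope, alleles in epitope_alleles.items():
            if epitope in chosen:
                continue
            gain = sum(allele_freq[a] for a in alleles - covered)
            if gain > best_gain:
                best, best_gain = epitope, gain
        if best is None:  # nothing left adds coverage
            break
        chosen.append(best)
        covered |= epitope_alleles[best]
    return chosen, covered


# Hypothetical predicted binders and made-up allele frequencies.
epitope_alleles = {
    "EP1": {"HLA-A*02:01", "HLA-A*24:02"},
    "EP2": {"HLA-A*24:02", "HLA-B*07:02"},
    "EP3": {"HLA-A*11:01"},
}
allele_freq = {
    "HLA-A*02:01": 0.30,
    "HLA-A*24:02": 0.20,
    "HLA-B*07:02": 0.10,
    "HLA-A*11:01": 0.25,
}
print(greedy_epitope_set(epitope_alleles, allele_freq, max_epitopes=2))
```

With a budget of two epitopes the heuristic picks EP1 and then EP3, skipping EP2 because most of the alleles it binds are already covered. Greedy selection is not guaranteed to be optimal, which is why dedicated coverage tools use more careful optimization.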

For how in silico toxicity prediction complements immunogenicity prediction in the peptide development pipeline, see our sibling article.

The Bottom Line

Immunoinformatics predicts which peptide fragments will bind MHC molecules and potentially trigger T cell immune responses. Neural network-based tools like NetMHCpan achieve AUC values above 0.97 for MHC class I binding prediction by integrating binding affinity and mass spectrometry data across pan-allele models covering 12,000+ HLA variants. These predictions are foundational to modern peptide vaccine design, personalized cancer neoantigen therapy, and therapeutic peptide immunogenicity screening. The primary limitation is that MHC binding prediction alone does not capture the full determinants of immune activation, including TCR recognition, immunodominance, and immune tolerance.

Frequently Asked Questions