Peptide Databases: Finding Every Bioactive Peptide
Peptide Bioinformatics
863,498 AMPs cataloged
The AMPSphere, the largest antimicrobial peptide resource, contains 863,498 non-redundant sequences identified through machine learning analysis of 63,410 metagenomes.
Santos-Junior et al., Cell, 2024
Peptide databases are the infrastructure that makes modern peptide research possible. Every computational prediction of antimicrobial activity, every machine learning model for peptide toxicity, and every virtual screening campaign for therapeutic peptide candidates depends on curated databases of known peptide sequences, structures, and biological activities. These resources range from focused collections of a few thousand experimentally validated antimicrobial peptides to massive computationally generated catalogs of hundreds of thousands of predicted sequences. Understanding which databases exist, what they contain, and how they differ is essential for researchers navigating the peptide bioinformatics landscape. This article surveys the major peptide databases, explains their strengths and limitations, and shows how they connect to the computational tools that are transforming peptide discovery and design.
Key Takeaways
- The Antimicrobial Peptide Database (APD3) has curated over 3,000 natural antimicrobial peptides with validated sequences, activities, and structures since its founding in 2003
- DBAASP v3 contains over 15,700 entries with experimentally determined antimicrobial/cytotoxic activity and structural data, enabling structure-activity relationship analysis
- PeptideAtlas maps peptides observed in mass spectrometry experiments across multiple organisms, providing proteome-level peptide detection evidence
- Machine learning models trained on database entries can now predict antimicrobial function from sequence alone with over 83% accuracy (Ma et al., Nature Biotechnology, 2022)
- Database-guided peptide design without machine learning produced potent, non-hemolytic AMPs using physicochemical property templates from curated databases (Mechesso et al., 2026)
- AI-driven immunopeptidomics databases catalog peptides presented by HLA molecules, enabling infection diagnostics and vaccine design (Vo et al., 2025)
The Antimicrobial Peptide Database (APD)
The Antimicrobial Peptide Database, maintained at the University of Nebraska Medical Center, is one of the oldest and most carefully curated peptide resources. Founded in 2003, APD focuses exclusively on natural antimicrobial peptides, with four requirements for inclusion: the peptide must (1) be naturally occurring, (2) have a known amino acid sequence, (3) have demonstrated biological activity, and (4) contain fewer than 100 amino acid residues. APD3, the current version, contains over 3,000 entries spanning peptides from bacteria (261 bacteriocins), archaea, protists, fungi, plants (321 entries), and animals (1,972 host defense peptides).
What distinguishes APD from larger, less curated databases is its emphasis on experimentally validated data. Every entry has been manually reviewed for sequence accuracy and biological activity confirmation. The database provides a classification system based on peptide source, target organism, mechanism of action, and structural class. Researchers can search by amino acid composition, net charge, hydrophobic percentage, or structural features (alpha-helix, beta-sheet, extended, cyclic).
APD also provides tools for peptide analysis beyond simple searching. Users can calculate physicochemical properties (molecular weight, net charge, hydrophobic ratio, Boman index) for any input sequence, compare a novel peptide against the entire APD collection, and identify the most similar known AMPs. The prediction tool estimates whether an input sequence is likely to have antimicrobial activity based on its physicochemical profile relative to known AMPs. These built-in analysis features make APD a workbench as well as a database.
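These descriptor calculations are easy to reproduce outside the website. The sketch below approximates two of the properties APD reports, net charge and hydrophobic ratio, using a deliberately simplified model (termini and histidine protonation ignored, and one common choice of hydrophobic residue set); it illustrates the idea rather than reproducing APD's own code.

```python
# Simplified APD-style descriptors. Assumptions: net charge counts only
# Lys/Arg (+) and Asp/Glu (-), ignoring termini and His protonation;
# the hydrophobic residue set is one common choice among several.
HYDROPHOBIC = set("AILMFWVC")

def net_charge(seq: str) -> int:
    """Approximate net charge at neutral pH."""
    return sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")

def hydrophobic_ratio(seq: str) -> float:
    """Fraction of residues classed as hydrophobic."""
    return sum(1 for a in seq if a in HYDROPHOBIC) / len(seq)

magainin2 = "GIGKFLHSAKKFGKAFVGEIMNS"  # a well-characterized frog AMP
print(net_charge(magainin2))                   # 3
print(round(hydrophobic_ratio(magainin2), 2))  # 0.43
```

Comparing an input sequence against the full collection, as APD's similarity tool does, is then a matter of ranking database entries by distance in this descriptor space.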
APD's primary limitation is its conservative scope. By restricting entries to natural peptides under 100 residues with confirmed activity, it excludes synthetic analogs, computationally predicted peptides, and larger antimicrobial proteins. This makes it an excellent gold-standard reference set for training machine learning models but an incomplete catalog of the full antimicrobial peptide landscape. The curation lag is another constraint: new peptides published in the literature may take months to appear in APD, meaning the most recently discovered AMPs may not yet be included. For more on the APD and how it catalogs nature's antibiotics, see our dedicated article.
DBAASP: Structure Meets Activity
The Database of Antimicrobial Activity and Structure of Peptides (DBAASP) takes a different approach from APD. Rather than restricting entries to natural peptides, DBAASP includes both natural and synthetic antimicrobial peptides, and it emphasizes the relationship between peptide structure and antimicrobial/cytotoxic activity. Version 3.0 contains over 15,700 entries with experimentally determined activity data against specific target organisms at defined concentrations.
DBAASP's structure-activity focus makes it particularly valuable for peptide design. Each entry includes minimum inhibitory concentration (MIC) values against specific bacterial strains, hemolytic activity data (toxicity to red blood cells), and where available, three-dimensional structural data from NMR or X-ray crystallography. This combination allows researchers to identify which structural features correlate with potent antimicrobial activity and low toxicity, the central challenge in therapeutic peptide development.
Mechesso et al. demonstrated the practical value of this approach in 2026 by using database-guided physicochemical property templates (rather than machine learning) to design potent, non-hemolytic antimicrobial peptides in a single step.[1] By extracting charge, hydrophobicity, and amphipathicity parameters from the most active and least toxic peptides in curated databases, they defined a property space that guided the design of novel sequences with optimized therapeutic indices. This demonstrates that well-curated databases enable rational design even without sophisticated computational models.
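A property-space template of this kind takes only a few lines to sketch. The code below computes the Eisenberg mean helical hydrophobic moment (assuming an ideal alpha-helix, 100 degrees of rotation per residue) and screens a candidate against an illustrative charge and amphipathicity window; the window bounds are hypothetical placeholders, not the parameters Mechesso et al. actually derived.

```python
import math

# Eisenberg consensus hydrophobicity scale (Eisenberg et al., 1984)
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}

def hydrophobic_moment(seq: str, delta_deg: float = 100.0) -> float:
    """Mean helical hydrophobic moment, assuming an ideal alpha-helix
    (100 degrees of rotation per residue)."""
    s = sum(EISENBERG[a] * math.sin(math.radians(delta_deg * i))
            for i, a in enumerate(seq))
    c = sum(EISENBERG[a] * math.cos(math.radians(delta_deg * i))
            for i, a in enumerate(seq))
    return math.hypot(s, c) / len(seq)

def in_template(seq: str, charge_range=(2, 9), min_moment=0.25) -> bool:
    """Check a candidate against an illustrative property window;
    the bounds here are placeholders, not published values."""
    charge = sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")
    return (charge_range[0] <= charge <= charge_range[1]
            and hydrophobic_moment(seq) >= min_moment)
```

In a real design campaign, the window would be fitted to the most active and least hemolytic entries pulled from DBAASP, then used to accept or reject generated sequences.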
Santaweesuk et al. used DBAASP and related databases to guide the design of novel AMPs with enhanced efficacy and selectivity against methicillin-resistant Staphylococcus aureus (MRSA) in 2026.[2] The database-derived structural insights enabled selective targeting of the pathogen while minimizing toxicity to human cells.
DRAMP: Patents and Clinical Data
The Data Repository of Antimicrobial Peptides (DRAMP) fills a niche that APD and DBAASP do not: it includes patent entries and clinical-stage peptides alongside general research entries. Version 2.0 contains approximately 19,900 entries total, with 5,084 general entries, 14,739 patent entries, and 76 clinical entries. This makes DRAMP the most commercially oriented AMP database, useful for freedom-to-operate analyses, competitive intelligence, and tracking the clinical development pipeline.
The patent entries are particularly valuable because they capture peptide sequences that may not appear in the scientific literature. Companies often file patents on antimicrobial peptide sequences before (or instead of) publishing them in journals, meaning that patent databases contain peptide diversity invisible to literature-only resources. DRAMP's integration of this patent data with activity data provides a more complete picture of the AMP landscape than any single academic database.
The 76 clinical entries in DRAMP represent AMPs that have entered human clinical trials, with information on trial phase, indication, and development status. This clinical tracking function is unique among AMP databases and provides context for how the theoretical promise of antimicrobial peptides is translating (or failing to translate) into approved medicines. As of recent updates, the clinical success rate for AMP drug candidates remains low compared to small-molecule antibiotics, though several peptide antibiotics including daptomycin, polymyxins, and nisin have achieved clinical or commercial success.
DRAMP also categorizes entries by antimicrobial spectrum, mechanism of action, and source organism, enabling filtered searches that answer specific research questions. A researcher seeking beta-hairpin AMPs active against gram-negative bacteria at concentrations below 10 micromolar can construct that query across the DRAMP dataset, retrieving a focused candidate set for further analysis.
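In code, such a filtered query reduces to a conjunction of predicates over annotated records. The records below are mock data in a DRAMP-like shape; the field names are illustrative, not DRAMP's actual schema or API.

```python
# Mock entries in a DRAMP-like shape (field names are illustrative).
entries = [
    {"name": "pep1", "fold": "beta-hairpin", "target": "gram-negative", "mic_uM": 4.0},
    {"name": "pep2", "fold": "alpha-helix",  "target": "gram-negative", "mic_uM": 2.0},
    {"name": "pep3", "fold": "beta-hairpin", "target": "gram-positive", "mic_uM": 8.0},
    {"name": "pep4", "fold": "beta-hairpin", "target": "gram-negative", "mic_uM": 32.0},
]

# Beta-hairpin AMPs active against gram-negative bacteria below 10 uM.
hits = [e["name"] for e in entries
        if e["fold"] == "beta-hairpin"
        and e["target"] == "gram-negative"
        and e["mic_uM"] < 10.0]
print(hits)  # ['pep1']
```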
PeptideAtlas: The Proteomics Evidence Map
PeptideAtlas takes a fundamentally different approach from the databases described above. Rather than cataloging peptides with known bioactive functions, PeptideAtlas maps peptides that have been experimentally observed through mass spectrometry across proteomics experiments. It answers the question: which peptides have actually been detected in biological samples?
The Human PeptideAtlas contains hundreds of thousands of peptide identifications from thousands of mass spectrometry experiments, organized by tissue, cell type, and experimental condition. Recent builds include specialized resources such as the Human Phosphoproteome Atlas. For researchers working on peptide biomarker discovery or metaproteomic profiling, PeptideAtlas provides the reference framework for confirming that a peptide of interest is detectable by current analytical methods.
Chen et al. applied related mass spectrometry approaches to discover peptide quality markers for quality control and identification of complex biological products in 2025.[3] PeptideAtlas-type resources are essential for this work because they provide the spectral libraries against which new mass spectrometry data is matched.
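The first step in any such matching workflow is computing theoretical peptide masses to compare against observed spectra. A minimal sketch using standard monoisotopic residue masses (for unmodified linear peptides only; modifications would add mass offsets):

```python
# Monoisotopic residue masses in daltons; a peptide's neutral mass is
# the sum of its residue masses plus one water.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.010565, 1.007276

def mono_mass(seq: str) -> float:
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[a] for a in seq) + WATER

def mz(seq: str, charge: int = 1) -> float:
    """m/z of the [M+nH]n+ ion."""
    return (mono_mass(seq) + charge * PROTON) / charge

print(round(mono_mass("SAMPLER"), 4))  # neutral mass in Da
print(round(mz("SAMPLER", 2), 4))      # doubly protonated m/z
```

Search engines score candidate sequences whose predicted m/z falls within the instrument's tolerance of an observed precursor, which is why comprehensive reference builds like PeptideAtlas matter: they constrain the candidate space.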
PeptideAtlas differs from functional databases in a fundamental way: it makes no claims about what peptides do, only that they exist in particular biological contexts. This observation-without-interpretation approach creates a resource that is useful precisely because it is unbiased. If a peptide appears in PeptideAtlas, it means the peptide was detected in a real biological sample by a validated analytical method. This evidence of existence complements the evidence of function provided by databases like APD and DBAASP.
The Human Plasma PeptideAtlas is particularly relevant for biomarker research. It catalogs peptides detectable in human blood plasma, providing the reference against which candidate diagnostic peptides can be evaluated. If a peptide proposed as a disease biomarker does not appear in the Plasma PeptideAtlas, researchers must consider whether it is genuinely absent from plasma or simply below the detection limits of the experiments included in the atlas.
Specialized Databases: Beyond Antimicrobials
The peptide database ecosystem extends far beyond antimicrobial peptides. Specialized databases exist for nearly every peptide functional class:
Immunopeptide databases catalog peptides presented by MHC/HLA molecules on cell surfaces. Boehm et al. developed improved machine learning approaches for predicting peptide presentation by MHC class I molecules, training their models on immunopeptidome data derived from mass spectrometry experiments.[4] Vo et al. reviewed how AI is transforming the immunopeptidomics landscape in 2025, noting that data-driven prediction models now complement experimental profiling for identifying presented peptides.[5] Willems et al. demonstrated that data-independent immunopeptidomics can detect low-abundant bacterial epitopes that standard approaches miss, expanding the accessible immunopeptidome.[6] For more on how immunoinformatics predicts peptide immune responses, see our dedicated article.
Peptipedia is a comprehensive database that integrates information on bioactive peptides of all functional classes (antimicrobial, antihypertensive, anticancer, antioxidant, and others) with machine learning-powered prediction tools. The database contains over 58,000 experimentally validated therapeutic peptides with annotated structure and multi-function property information.
ConoServer focuses specifically on conotoxins, the peptide toxins from cone snails that have yielded approved drugs (ziconotide/Prialt for intractable pain) and numerous ion channel research tools. ConoServer includes sequence, structure, pharmacological target, and post-translational modification data for over 4,000 conopeptide sequences. CyBase catalogs cyclic peptides, including cyclotides from plants that have exceptional stability due to their cyclic cystine knot topology. The database's focus on cyclic peptides is valuable because cyclization dramatically improves protease resistance and oral bioavailability, making cyclic peptide scaffolds attractive starting points for drug design.
TumorHoPe collects experimentally validated tumor-homing peptides that bind selectively to tumor vasculature or cancer cells. These peptides are used as targeting ligands for drug delivery rather than as therapeutics themselves. CancerPPD catalogs anticancer peptides and peptidomimetics, including cell-penetrating peptides with demonstrated cytotoxicity against cancer cell lines. BaAMPs (Biofilm-Active AMPs) focuses specifically on peptides with activity against bacterial biofilms, a growing concern in hospital-acquired infections and chronic wounds.
SATPdb (Structurally Annotated Therapeutic Peptides Database) provides structural information for therapeutic peptides, with a particular strength in hemolysis data. This is critical for drug development because therapeutic peptides must avoid destroying red blood cells at therapeutic concentrations, and SATPdb provides the largest collection of hemolysis measurements for structure-activity analysis.
Machine Learning on Database Data
The value of peptide databases is multiplied when their curated data trains machine learning models. The shift from database-as-reference to database-as-training-set has transformed peptide bioinformatics.
Ghulam et al. developed AMP-CapsNet in 2026, a multi-view feature fusion approach using capsule networks for antimicrobial peptide prediction.[7] The model integrates sequence features, physicochemical properties, and evolutionary information extracted from database entries to predict whether a novel sequence has antimicrobial activity. Capsule networks are particularly suited to this task because they preserve spatial relationships between sequence features that conventional neural networks may lose.
Lin et al. introduced PepGraphormer in 2026, an ESM-GAT hybrid deep learning framework for AMP prediction that combines protein language model embeddings (ESM) with graph attention networks.[8] This architecture treats each peptide as a graph where amino acids are nodes and their interactions are edges, capturing structural relationships that sequence-only models miss. The model was trained and validated on curated database entries, demonstrating how database quality directly determines model performance. A model trained on a noisy database with inconsistent activity annotations will learn noise; a model trained on a carefully curated dataset with standardized activity measurements will learn biology. This dependency creates a virtuous cycle: better databases produce better models, which in turn identify errors and gaps in databases.
The progression from simple sequence-based classifiers to protein language models represents a qualitative shift in what databases enable. Early AMP prediction tools used hand-crafted features (amino acid composition, charge distribution, hydrophobic moment) extracted from database entries. Current models like ESM-2 are pre-trained on millions of protein sequences to learn general protein "language" patterns, then fine-tuned on AMP database entries for the specific task of antimicrobial activity prediction. This transfer learning approach means that even small, focused databases like APD can produce powerful prediction models by leveraging pre-trained representations learned from much larger protein sequence resources.
These prediction models have immediate practical applications. Fatima et al. applied AI-driven peptide discovery for endometrial cancer in 2026, using deep generative modeling and molecular simulation to design candidate therapeutic peptides entirely in silico.[9] The generative model learned peptide design rules from database entries and produced novel sequences predicted to bind cancer-relevant targets. Schofield et al. used computational design, trained on database structural data, to create stapled peptide-based antagonists of the CGRP receptor.[10]
Wang et al. demonstrated in 2023 that machine learning models trained on peptide stability data could predict peptide degradation rates in the gastrointestinal tract, enabling rational design of orally stable peptides.[11] The training data came from curated database entries linking peptide sequence features to experimentally measured stability, illustrating how databases enable predictive applications far beyond their original scope.
For more on how these deep learning models predict peptide properties and how researchers design peptides from scratch, see our dedicated articles. For computational safety assessment, see in silico toxicity prediction.
From Database to Discovery: Integrated Workflows
Modern peptide discovery pipelines integrate databases at every stage. A typical workflow begins with database mining to identify natural peptide templates with desired properties. Machine learning models trained on database data predict which structural modifications might improve activity, reduce toxicity, or enhance stability. Candidate sequences are then screened computationally against target structures before the most promising designs advance to experimental synthesis and testing.
Martell-Huguet et al. exemplified this integrated approach in 2026, combining computational and experimental methods to discover multifunctional peptides from marine organisms.[12] Their pipeline used database-derived natural peptide sequences as starting points, applied computational optimization, and validated predictions experimentally, demonstrating how the database-computation-experiment cycle accelerates discovery.
Al-Mamari et al. used a combined bioinformatics approach to identify and characterize novel antimicrobial peptides from camel (Camelus dromedarius) sequences in 2026.[13] By cross-referencing genomic data against AMP database entries and structural prediction tools, they identified candidate sequences that would have been invisible to traditional wet-lab screening.
Hasannejad-Asl et al. explored novel antimicrobial peptides from gut probiotic Enterococcus species against drug-resistant pathogens in 2026.[14] Their work used database resources to identify AMP-encoding genes in probiotic genomes, then characterized the predicted peptides experimentally. This gut-microbiome-to-database-to-drug pipeline connects microbial genomics with peptide bioinformatics in a workflow that would be impossible without curated database infrastructure.
Ramtel et al. analyzed beta-hairpin antimicrobial peptides for class diversity and sequence patterns in 2026, using database-scale sequence analysis to identify conserved structural motifs across hundreds of related peptides.[15] This type of large-scale structural classification depends entirely on database completeness and annotation quality.
Database Limitations and Future Directions
Current peptide databases face several challenges. Data quality varies across resources: some databases accept computationally predicted entries alongside experimentally validated ones without clear distinction. Activity data is often reported under non-standardized conditions (different bacterial strains, media, incubation times), making cross-database comparisons unreliable. Structural coverage remains incomplete, with three-dimensional structures available for only a fraction of cataloged peptides.
Database maintenance is another concern. Academic databases depend on grant funding and individual investigators, creating sustainability risks: some established resources have gone offline or stopped updating when their maintainers retired or lost funding.
The integration problem is perhaps the most pressing. With over a dozen major AMP databases, each with different entry formats, activity reporting standards, and inclusion criteria, researchers must manually cross-reference multiple resources to get a complete picture. Efforts to create unified meta-databases or standardized data exchange formats are underway but not yet widely adopted. The comprehensive therapeutic peptide dataset published in Scientific Data in 2025, containing 58,583 experimentally validated peptides with multi-function annotations, represents one attempt to consolidate dispersed data, but it is a static snapshot rather than a continuously updated resource.
Annotation inconsistency creates practical problems for machine learning. A peptide reported as "active" against E. coli in one database may have been tested at a different concentration, in a different growth medium, and with a different activity threshold than an "active" entry in another database. These methodological differences introduce noise into training datasets, potentially degrading model performance. Standardized reporting frameworks, analogous to the MIAME standards that transformed microarray data quality, are needed for peptide activity data but do not yet exist.
The sustainability risk is concrete: several peptide databases that were valuable resources a decade ago are now inactive or inaccessible, and collections representing years of curation effort could vanish when their creators retire, change institutions, or move on. Dedicated staff, sustainable funding models, and governance frameworks of the kind that sustain UniProt, GenBank, and the Protein Data Bank have yet to emerge for peptide resources.
Despite these limitations, peptide databases remain indispensable. Every new therapeutic peptide approved, every machine learning model that predicts peptide function, and every computational design that produces a novel bioactive sequence traces back to curated database entries generated by decades of experimental work. The databases are imperfect, but the alternative, working without them, would set the field back by decades. The ongoing challenge is maintaining and improving these resources as the peptide field grows in both academic and commercial significance. Researchers, funders, and publishers all have roles to play: depositing new peptide data in standardized formats, supporting database infrastructure through dedicated funding mechanisms, and requiring data deposition as a condition of publication.
The future of peptide databases likely involves greater integration with protein structure prediction tools (AlphaFold, ESMFold), real-time updating from literature mining using natural language processing, and federated architectures that allow distributed databases to share data without centralization. The peptide bioinformatics field is small enough that coordinated community efforts could address these challenges, but large enough that the impact of doing so would accelerate peptide drug discovery across therapeutic areas from antimicrobials to cancer to metabolic disease.
The Bottom Line
Peptide databases provide the curated sequence, structure, and activity data that powers modern peptide research. Resources range from carefully curated collections like APD3 (3,000+ natural AMPs) and DBAASP (15,700+ entries with structure-activity data) to massive computationally generated catalogs like the AMPSphere (863,498 sequences). Machine learning models trained on these databases can predict antimicrobial function, peptide stability, and immune presentation with increasing accuracy. The databases face challenges in standardization, maintenance, and integration, but they remain essential infrastructure for peptide drug discovery, bioinformatics research, and computational peptide design.