
About methods, citation, and license · v0.9 · 2026-05-14

What is mpCGCdb?

mpCGCdb (the marine polysaccharide CAZyme gene clusterer database) catalogs every carbohydrate-active enzyme gene cluster (CGC) predicted from >24,000 medium- and high-quality metagenome-assembled genomes (MAGs) in the Global Ocean Microbiome Catalog (GOMC; Chen et al., 2024). Each CGC contains at least one CAZyme or sulfatase plus its co-localized neighbours (transporters, peptidases, sigma/anti-sigma factors, transcriptional regulators, etc.), capturing the gene-level context in which marine prokaryotes process polysaccharides.
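The inclusion rule above (at least one CAZyme or sulfatase per cluster) can be illustrated with a minimal Python sketch. The `Gene` record, the category labels, and the `is_valid_cgc` helper are hypothetical names chosen for illustration; they are not taken from the mpCGCdb pipeline or schema.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical category labels; the real annotation vocabulary may differ.
SIGNATURE_CATEGORIES = {"CAZyme", "sulfatase"}


@dataclass(frozen=True)
class Gene:
    gene_id: str
    category: str  # e.g. "CAZyme", "sulfatase", "transporter", "peptidase", "TF"


def is_valid_cgc(genes: Iterable[Gene]) -> bool:
    """Return True if the cluster holds at least one CAZyme or sulfatase."""
    return any(g.category in SIGNATURE_CATEGORIES for g in genes)


# A toy cluster: one CAZyme flanked by a transporter and a regulator.
cluster = [
    Gene("g1", "transporter"),
    Gene("g2", "CAZyme"),
    Gene("g3", "TF"),
]
neighbours_only = [Gene("g4", "transporter"), Gene("g5", "peptidase")]

print(is_valid_cgc(cluster))
print(is_valid_cgc(neighbours_only))
```

The sketch covers only the membership test; how co-localized neighbours are delimited on the contig is not described here and is left out.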


This atlas links >289,000 CGCs through a 531-million-edge sequence similarity network (SSN), then partitions that network with Leiden clustering into >67,000 similarity colocalization network (SCoNe) communities. Each community is summarized by a representative protein, its InterPro architecture, its host distribution under GTDB taxonomy, and, where available, a curated link to an EC-numbered activity in CAZy or SulfAtlas.
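As a rough illustration of the per-community summary, the following Python sketch takes a toy set of weighted similarity edges and a community assignment that is assumed to have been computed already (for example by a Leiden implementation), then reports a representative and a host tally for each community. All identifiers and data are hypothetical, and picking the member with the highest within-community edge weight as the representative is an assumption made for this example, not the documented mpCGCdb rule.

```python
from collections import Counter, defaultdict

# Toy inputs (all hypothetical): SSN edges as (protein_a, protein_b, weight),
# a precomputed community assignment, and a host phylum per protein.
edges = [
    ("p1", "p2", 0.9),
    ("p2", "p3", 0.8),
    ("p1", "p3", 0.7),
    ("p4", "p5", 0.95),
]
community_of = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1}
host_phylum = {
    "p1": "Bacteroidota",
    "p2": "Bacteroidota",
    "p3": "Pseudomonadota",
    "p4": "Planctomycetota",
    "p5": "Planctomycetota",
}


def summarize(edges, community_of, host_phylum):
    """Summarize each community by size, representative, and host counts."""
    # Weighted degree, counting only edges that stay inside one community.
    strength = defaultdict(float)
    for a, b, weight in edges:
        if community_of[a] == community_of[b]:
            strength[a] += weight
            strength[b] += weight

    members = defaultdict(list)
    for protein, community in community_of.items():
        members[community].append(protein)

    summary = {}
    for community, proteins in members.items():
        # Assumed rule: highest within-community strength; ties go to the
        # alphabetically first identifier because the list is sorted first.
        representative = max(sorted(proteins), key=lambda p: strength[p])
        summary[community] = {
            "representative": representative,
            "size": len(proteins),
            "hosts": Counter(host_phylum[p] for p in proteins),
        }
    return summary


result = summarize(edges, community_of, host_phylum)
for community, info in sorted(result.items()):
    print(community, info["representative"], info["size"], dict(info["hosts"]))
```

The partitioning step itself is deliberately omitted: the sketch starts from a finished assignment so that it stays dependency-free.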

Abbreviations

Database & data model
mpCGCdb: marine polysaccharide CAZyme gene clusterer database
CGC: carbohydrate-active enzyme gene cluster
MAG: metagenome-assembled genome
GOMC: Global Ocean Microbiome Catalog
GTDB: Genome Taxonomy Database
PUL: polysaccharide utilization locus

Enzyme families
CAZyme: carbohydrate-active enzyme
GH: glycoside hydrolase
PL: polysaccharide lyase
CE: carbohydrate esterase
AA: auxiliary activity
GT: glycosyltransferase
CBM: carbohydrate-binding module
S1 / S2 / S3: SulfAtlas sulfatase families
EC: Enzyme Commission (number)
STP: signal transduction protein
TF: transcription factor

Networks & annotation
SSN: sequence similarity network
SCoNe: similarity colocalization network

Genome metrics
bp / kb / Mb: base pairs / kilobases / megabases
N50: assembly contiguity statistic
GC: guanine–cytosine content

Data sources

Genomes: GOMC (22,607 marine MAGs, GTDB r207)
CAZyme calls: dbCAN v4.1.4 / CAZy DB pull 2024-11
Sulfatases: SulfAtlas v2.4
EC nomenclature: ExPASy ENZYME.DAT (IUBMB)
Domain & family architecture: InterProScan v5.66
Taxonomy: GTDB r207 (2022-04-08)

Citation

If mpCGCdb informs your work, please cite the preprint:

Oliver A, et al. (in prep). mpCGCdb: a sequence-community atlas of carbohydrate-active gene clusters across the global marine microbiome. bioRxiv DOI to be assigned.

BibTeX:

@unpublished{mpcgcdb2026,
  title  = {mpCGCdb: a sequence-community atlas of carbohydrate-active gene clusters across the global marine microbiome},
  author = {Oliver, Aaron A. and others},
  year   = {2026},
  note   = {Manuscript in preparation},
  url    = {https://mpcgcdb.com}
}

Important works

mpCGCdb is built on top of these resources. Please cite the primary sources you use alongside mpCGCdb.

  1. Paoli et al. (2022). Biosynthetic potential of the global ocean microbiome. Nature. doi:10.1038/s41586-022-04862-3
  2. Chen et al. (2024). Global marine microbial diversity and its potential in bioprospecting. Nature. doi:10.1038/s41586-024-07891-2
  3. Nishimura et al. (2022). The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments. Scientific Data. doi:10.1038/s41597-022-01392-5
  4. Parks et al. (2025). GTDB release 10: a complete and systematic taxonomy for 715,230 bacterial and 17,245 archaeal genomes. Nucleic Acids Research. doi:10.1093/nar/gkaf1040
  5. Zheng et al. (2023). dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Research. doi:10.1093/nar/gkad328
  6. Stam et al. (2023). SulfAtlas, the sulfatase database: state of the art and new developments. Nucleic Acids Research. doi:10.1093/nar/gkac977
  7. Saier et al. (2021). The Transporter Classification Database (TCDB): 2021 update. Nucleic Acids Research. doi:10.1093/nar/gkaa1004

License

Database content: CC BY 4.0 (free to use, share, and adapt with attribution).
Source code: MIT (build pipeline and rendering scripts at github.com/AaronAOliver/mpcgc).
Third-party data: respect upstream licenses for CAZy, SulfAtlas, GTDB, InterPro, and GOMC.