About methods, citation, and license · v0.9 · 2026-05-14
What is mpCGCdb?
mpCGCdb (the marine polysaccharide CAZyme gene clusterer database) catalogs every carbohydrate-active enzyme gene cluster (CGC) predicted from > 24,000 medium- and high-quality metagenome-assembled genomes (MAGs) in the Global Ocean Microbiome Catalog (GOMC; Chen et al., 2024). Each CGC contains at least one CAZyme or sulfatase plus its co-localized neighbours (transporters, peptidases, sigma/anti-sigma factors, transcriptional regulators, etc.), capturing the gene-level context in which marine prokaryotes process polysaccharides.
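As an illustration of the CGC definition above, here is a minimal Python sketch that groups annotated genes along a contig into clusters seeded by a CAZyme or sulfatase. It is not the mpCGCdb pipeline: the `Gene` record, the role labels, the `max_gap` threshold of two intervening genes, and the `min_signature` requirement are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    contig: str
    index: int  # ordinal position of the gene along its contig
    role: str   # hypothetical labels: "CAZyme", "sulfatase", "transporter", ...


# Roles that can seed a cluster, and roles that count as signature neighbours.
SEED_ROLES = {"CAZyme", "sulfatase"}
SIGNATURE_ROLES = SEED_ROLES | {"transporter", "peptidase", "STP", "TF"}


def call_cgcs(genes, max_gap=2, min_signature=2):
    """Return a list of clusters, each a list of Gene objects in contig order.

    Signature genes separated by at most `max_gap` non-signature genes are
    chained into a run. A run is kept if it has at least one seed gene and at
    least `min_signature` signature genes. Both thresholds are assumptions.
    """
    by_contig = {}
    for gene in genes:
        by_contig.setdefault(gene.contig, []).append(gene)

    cgcs = []
    for contig_genes in by_contig.values():
        contig_genes.sort(key=lambda g: g.index)
        runs, current = [], []
        for gene in contig_genes:
            if gene.role not in SIGNATURE_ROLES:
                continue
            # Start a new run when too many non-signature genes intervene.
            if current and gene.index - current[-1].index - 1 > max_gap:
                runs.append(current)
                current = []
            current.append(gene)
        if current:
            runs.append(current)

        for run in runs:
            has_seed = any(g.role in SEED_ROLES for g in run)
            if has_seed and len(run) >= min_signature:
                lo, hi = run[0].index, run[-1].index
                # Report the full span, including intervening genes.
                cgcs.append([g for g in contig_genes if lo <= g.index <= hi])
    return cgcs
```

Returning the full span rather than only the signature genes keeps the gene-level context between them, which is the point of recording a cluster instead of isolated enzyme calls.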
This atlas links > 289,000 CGCs through a 531-million-edge sequence similarity network (SSN), then partitions that network with Leiden clustering into > 67,000 similarity colocalization network (SCoNe) communities. Each community is summarized by a representative protein, its InterPro domain architecture, its GTDB-level host distribution, and, where available, a curated link to an EC-numbered activity in CAZy or SulfAtlas.
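To make the per-community host distribution concrete, the sketch below tallies the host phylum of each CGC within its community. It assumes the Leiden partition has already been computed and is available as a plain mapping. The function name, the input dictionaries, and the `"unclassified"` fallback are hypothetical and do not reflect the database schema.

```python
from collections import Counter, defaultdict


def summarize_hosts(cgc_to_community, cgc_to_phylum):
    """Return {community: {phylum: fraction of CGCs}} for each community.

    cgc_to_community: CGC id -> community id (e.g. from a Leiden partition)
    cgc_to_phylum:    CGC id -> GTDB phylum of the host MAG
    CGCs with no phylum entry are counted as "unclassified" (an assumption).
    """
    counts = defaultdict(Counter)
    for cgc_id, community in cgc_to_community.items():
        phylum = cgc_to_phylum.get(cgc_id, "unclassified")
        counts[community][phylum] += 1

    summary = {}
    for community, counter in counts.items():
        total = sum(counter.values())
        # most_common() orders phyla from most to least frequent.
        summary[community] = {p: n / total for p, n in counter.most_common()}
    return summary


# Toy input with made-up identifiers.
communities = {"cgc1": "S1", "cgc2": "S1", "cgc3": "S1", "cgc4": "S2"}
phyla = {
    "cgc1": "p__Bacteroidota",
    "cgc2": "p__Bacteroidota",
    "cgc3": "p__Proteobacteria",
}
host_summary = summarize_hosts(communities, phyla)
```

The same tally could be made at any other GTDB rank by swapping the second mapping.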
Abbreviations
| Database & data model | |
|---|---|
| mpCGCdb | marine polysaccharide CAZyme gene clusterer database |
| CGC | carbohydrate-active enzyme gene cluster |
| MAG | metagenome-assembled genome |
| GOMC | Global Ocean Microbiome Catalog |
| GTDB | Genome Taxonomy Database |
| PUL | polysaccharide utilization locus |
| Enzyme families | |
| CAZyme | carbohydrate-active enzyme |
| GH | glycoside hydrolase |
| PL | polysaccharide lyase |
| CE | carbohydrate esterase |
| AA | auxiliary activity |
| GT | glycosyltransferase |
| CBM | carbohydrate-binding module |
| S1 / S2 / S3 | SulfAtlas sulfatase families |
| EC | Enzyme Commission (number) |
| STP | signal transduction protein |
| TF | transcription factor |
| Networks & annotation | |
| SSN | sequence similarity network |
| SCoNe | similarity colocalization network |
| Genome metrics | |
| bp / kb / Mb | base pairs / kilobases / megabases |
| N50 | assembly contiguity statistic |
| GC | guanine–cytosine content |
Data sources
| Genomes | GOMC (22,607 marine MAGs, GTDB r207) |
|---|---|
| CAZyme calls | dbCAN v4.1.4 / CAZy DB pull 2024-11 |
| Sulfatases | SulfAtlas v2.4 |
| EC nomenclature | ExPASy ENZYME.DAT (IUBMB) |
| Domain & family architecture | InterProScan v5.66 |
| Taxonomy | GTDB r207 (2022-04-08) |
Citation
If mpCGCdb informs your work, please cite the preprint:
BibTeX:
Important works
mpCGCdb is built on top of these resources. Please cite the primary sources you use alongside mpCGCdb.
- Paoli et al. (2022). Biosynthetic potential of the global ocean microbiome. Nature. doi:10.1038/s41586-022-04862-3
- Chen et al. (2024). Global marine microbial diversity and its potential in bioprospecting. Nature. doi:10.1038/s41586-024-07891-2
- Nishimura et al. (2022). The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments. Scientific Data. doi:10.1038/s41597-022-01392-5
- Parks et al. (2025). GTDB release 10: a complete and systematic taxonomy for 715,230 bacterial and 17,245 archaeal genomes. Nucleic Acids Research. doi:10.1093/nar/gkaf1040
- Zheng et al. (2023). dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Research. doi:10.1093/nar/gkad328
- Stam et al. (2023). SulfAtlas, the sulfatase database: state of the art and new developments. Nucleic Acids Research. doi:10.1093/nar/gkac977
- Saier et al. (2021). The Transporter Classification Database (TCDB): 2021 update. Nucleic Acids Research. doi:10.1093/nar/gkaa1004
License
| Database content | CC BY 4.0: free to use, share, and adapt with attribution. |
|---|---|
| Source code | MIT: build pipeline & rendering scripts at github.com/AaronAOliver/mpcgc. |
| Third-party data | Respect upstream licenses for CAZy, SulfAtlas, GTDB, InterPro, GOMC. |