
Downloads

21 bulk tables · 1.6 GB total · CC BY 4.0

Bulk tables hosted on Zenodo

All tables below are mirrored to a citable Zenodo record so downloads don't burn through this site's bandwidth and so the dataset gets a permanent DOI for the manuscript.

Record: https://zenodo.org/records/20219287

Each "download" link below resolves directly to that file on Zenodo.

Primary tables (4 files)

all_cgc_magmapped.tsv
All CGC gene rows
350.1 MB
One row per gene in every CGC across all MAGs (CGC#, MAG, gene type, contig, protein ID, coords, strand, annotation).
mag_metadata.txt
MAG metadata
5.1 MB
Per-MAG completeness, contamination, GC, N50, gene count, GTDB lineage, and extras (Cas, MGE, WD40, ARG).
mag_taxonomy.tsv
MAG → GTDB taxonomy
851 KB
Two-column lookup from bin_id to full GTDB v207 lineage string.
gomc_metadata.tsv
GOMC sample metadata
5.7 MB
Provenance for each MAG: bioproject, sample, depth, environment, geographic region.
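
As a rough illustration of how the two-column mag_taxonomy.tsv lookup might be consumed, the sketch below splits a GTDB lineage string into named ranks. It assumes the standard GTDB `d__…;p__…;…;s__…` prefix convention and a tab-separated layout with bin_id first; whether the file carries a header row is not stated on this page, so the loader simply skips any row whose second column has no rank prefix. The bin ID and lineage in the demo are made up.

```python
import csv
import io

# GTDB rank prefixes mapped to readable rank names.
RANKS = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_lineage(lineage: str) -> dict[str, str]:
    """Split 'd__X;p__Y;...' into {'domain': 'X', 'phylum': 'Y', ...}."""
    ranks = {}
    for part in lineage.split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and prefix in RANKS:
            ranks[RANKS[prefix]] = name  # name is '' for unclassified ranks
    return ranks


def load_taxonomy(handle) -> dict[str, dict[str, str]]:
    """Read bin_id -> parsed lineage from a tab-separated file handle."""
    lookup = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or "__" not in row[1]:
            continue  # skips blank lines and a header row, if present
        lookup[row[0]] = parse_lineage(row[1])
    return lookup


# Hypothetical demo row; replace with open("mag_taxonomy.tsv", newline="").
demo = io.StringIO(
    "bin_id\tlineage\n"
    "MAG_0001\td__Bacteria;p__Bacteroidota;c__Bacteroidia;"
    "o__Flavobacteriales;f__Flavobacteriaceae;g__Polaribacter;s__\n"
)
taxonomy = load_taxonomy(demo)
print(taxonomy["MAG_0001"]["genus"])
```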

Sequence clustering / SSN (8 files)

global_clusters_r1p0.tsv
Leiden communities (r = 1.0, primary)
123.9 MB
Protein → community ID at resolution 1.0 (the partition used throughout the site).
global_clusters_r0p5.tsv
Leiden communities (r = 0.5)
123.8 MB
Coarser partition — larger and fewer communities. Useful for higher-level taxonomy of activities.
global_clusters_r4p0.tsv
Leiden communities (r = 4.0)
124.1 MB
Finer partition — many small, specific communities. Useful for narrowing in on isofunctional groups.
global_protein_to_cgc.tsv
Protein → CGC map
179.1 MB
Lookup table joining 2.44 M protein IDs to their parent CGC.
global_protein_annotation.tsv
Per-protein annotation
125.2 MB
CAZyme/sulfatase/TC/peptidase calls per protein, with HMMER + DIAMOND + eCAMI consensus.
global_protein_category.tsv
Protein functional category
129.8 MB
Coarse category tag (CAZyme / Sulfatase / TC / Peptidase / Regulator / Other) per protein.
global_scone_edges.tsv
Sequence-Colocalization Network (SCoNe) edges
22.6 MB
Pairwise sequence-co-occurrence weights between communities (the SCoNe network backbone).
global_scone_layout.tsv
SCoNe layout coordinates
4.4 MB
Pre-computed 2-D coordinates for the SCoNe network nodes (matches the Cytoscape sessions).
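
To compare the three Leiden resolutions listed above, one simple check is the size distribution of communities in each table. The sketch below counts members per community while streaming the file row by row. The column order (protein ID, then community ID) and the presence of a header row are assumptions, exposed as parameters; the demo rows are invented.

```python
import csv
import io
from collections import Counter


def community_sizes(handle, community_col: int = 1, has_header: bool = True) -> Counter:
    """Count proteins per community from a tab-separated file handle.

    Assumption: one protein per row, community ID in column `community_col`.
    """
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # discard the header row
    sizes = Counter()
    for row in reader:
        if len(row) > community_col:
            sizes[row[community_col]] += 1
    return sizes


# Hypothetical demo; replace with open("global_clusters_r1p0.tsv", newline="").
demo = io.StringIO("protein_id\tcommunity\nprotA\t0\nprotB\t0\nprotC\t1\n")
sizes = community_sizes(demo)
print(sizes.most_common(1))
```

Running the same function over the r = 0.5 and r = 4.0 tables should show fewer, larger communities and more, smaller ones respectively, as described in the entries above.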

Community-level summaries (2 files)

global_community_reps.tsv
Community representatives
2.8 MB
Representative protein, top family call, member count, and host distribution per community.
global_community_interpro.tsv
InterPro architectures per community
6.1 MB
Distribution of InterPro domain calls within each community (dominant + secondary architectures).

Annotation references (4 files)

cazy_metadata.tsv
CAZy family metadata
78 KB
Curated activity / EC / substrate descriptors for every GH, PL, CE, AA, GT, CBM family.
sulfatlas_metadata.tsv
SulfAtlas family metadata
372 B
Subfamily-level summaries from SulfAtlas v2.4 (S1_1 … S1_x, S2, S3).
sulfatlas_curated.tsv
SulfAtlas curated EC / activity
3 KB
Manually-curated EC numbers and substrate calls for sulfatase subfamilies.
ec_functions.tsv
EC → function map
609 KB
IUBMB ENZYME.DAT export linking EC numbers to canonical reaction descriptors.

Representatives & FASTAs (3 files)

representatives.tsv
Representative-protein registry
420.1 MB
FASTA-coordinate lookup for the representative protein chosen for every community.
noncatalytic_cgcs.tsv
Non-catalytic CGCs
4.7 MB
Subset of CGCs whose only CAZyme call is to a non-catalytic family (CBMs, regulators, etc.).
gomc_nmpfs.tsv
NMPF (novel metagenomic protein family) catalog
1.6 MB
Curated list of novel metagenomic protein families produced from the SSN tail.

Per-entity downloads

Each entity page exposes its own slice of the data. To bulk-collect family- or MAG-specific files, point a download tool such as wget (in recursive mode) or curl at the matching directory:

Per-family: https://mpcgcdb.com/<family>/{cgcs.txt, ranked_families.tsv} (e.g. GH13/cgcs.txt)
Per-MAG: https://mpcgcdb.com/genomes/<bin_id>/{cgcs.txt, proteins.tsv, ranked_families.tsv}
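
The two URL patterns above can be expanded programmatically before handing the list to a downloader. This is a minimal sketch: `family_urls` and `mag_urls` are hypothetical helper names, and the bin ID in the usage line is a placeholder rather than a real MAG.

```python
BASE_URL = "https://mpcgcdb.com"

# File names per entity type, as listed in the patterns above.
FAMILY_FILES = ("cgcs.txt", "ranked_families.tsv")
MAG_FILES = ("cgcs.txt", "proteins.tsv", "ranked_families.tsv")


def family_urls(family: str) -> list[str]:
    """Expand the per-family pattern, e.g. family='GH13'."""
    return [f"{BASE_URL}/{family}/{name}" for name in FAMILY_FILES]


def mag_urls(bin_id: str) -> list[str]:
    """Expand the per-MAG pattern for one bin ID."""
    return [f"{BASE_URL}/genomes/{bin_id}/{name}" for name in MAG_FILES]


for url in family_urls("GH13") + mag_urls("EXAMPLE_BIN"):
    print(url)
```

Writing the printed URLs to a text file gives an input list that tools such as `wget -i urls.txt` can consume.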

Protein sequences (FASTA) are not served per-entity. The full proteome is available from the Global Ocean Microbiome Catalog (GOMC); download the protein catalog from that page.

Filenames are stable across releases. Changes between versions will be tracked in the GitHub releases.

License & reuse

All tables on this page are released under CC BY 4.0. Use freely with attribution; please cite mpCGCdb plus the upstream primary sources (CAZy, SulfAtlas, GTDB, GOMC) where applicable. See About / cite for the recommended citation.