chromatin-gnn

Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data

DOI License: MIT


Overview

The three-dimensional organisation of chromatin in the nucleus is not random. Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate strongly with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional and require principled dimensionality reduction to extract biologically interpretable structure.

This repository applies a Variational Graph Autoencoder (VGAE) to Hi-C contact data to learn a compact, continuous latent representation of chromatin topology. Genomic bins are treated as graph nodes; normalised contact frequencies define weighted edges; ChIP-seq tracks for CTCF and H3K27me3 supply node features. The model is trained end-to-end on a link-prediction objective and evaluated for its ability to recover known biological structure — A/B compartments — in an entirely unsupervised manner.


Scientific question

Can a VGAE learn biologically meaningful latent representations of chromatin topology — capturing A/B compartments and cell-type-specific reorganisation — from Hi-C contact data alone, in an unsupervised manner?


Architecture

Three encoder architectures are provided (--encoder gcn | gat | deep_gcn). The best-performing model is DeepGCN (3-layer GCN with edge-weighted message passing):

Node features (3D: CTCF, H3K27me3, H3K4me3)
         │
     BatchNorm
         │
   GCNConv(128, edge_weight)  ← layer 1
         │
   BatchNorm → ReLU → Dropout(0.3)
         │
   GCNConv(128, edge_weight)  ← layer 2
         │
       ReLU → Dropout(0.3)
        / \
GCNConv(64) GCNConv(64)   (edge_weight)
    μ           log σ
        \ /
   Reparameterisation
         │
     z ∈ ℝ⁶⁴   (node embeddings)
         │
 Inner-product decoder
 (β-VGAE objective: BCE + β·KL with linear warm-up)

ICE-balanced contact weights (log1p-normalised) are passed as edge_weight to every GCNConv layer, allowing the model to up-weight strong contacts during message passing. The decoder is the standard dot-product decoder from Kipf & Welling (2017). Training uses a link-prediction objective: the model distinguishes real Hi-C contacts from randomly sampled non-contacts.


Dataset

All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resolution.

File Cell line Type Source Accession
GM12878.mcool GM12878 (lymphoblastoid) Hi-C contact matrix 4DN Data Portal 4DNFIRUMEC32
IMR90.mcool IMR-90 (lung fibroblast) Hi-C contact matrix 4DN Data Portal 4DNFIABB3FHQ
GM12878_CTCF.bw GM12878 CTCF ChIP-seq (FC/control) ENCODE ENCFF741BAQ (exp. ENCSR000AKB)
GM12878_H3K27me3.bw GM12878 H3K27me3 ChIP-seq (FC/control) ENCODE ENCFF736CNQ (exp. ENCSR000AKD)
GM12878_H3K4me3.bw GM12878 H3K4me3 ChIP-seq (FC/control) ENCODE
IMR90_CTCF.bw IMR-90 CTCF ChIP-seq (FC/control) ENCODE ENCFF770DUD (exp. ENCSR000EFI)
IMR90_H3K27me3.bw IMR-90 H3K27me3 ChIP-seq (FC/control) ENCODE ENCFF158HZL (exp. ENCSR431UUY)
IMR90_H3K4me3.bw IMR-90 H3K4me3 ChIP-seq (FC/control) ENCODE

Graph statistics:

Cell line Bins (chr21, 25 kb) Edges (contacts, undirected) Node features
GM12878 1,869 172,310 3 (CTCF, H3K27me3, H3K4me3)
IMR90 1,869 136,121 3 (CTCF, H3K27me3, H3K4me3)

IMR90 has ~55% more intra-chromosomal contacts than GM12878 at chr21, suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line.


Installation

conda create -n chromatin_gnn python=3.10 -y
conda activate chromatin_gnn

# CPU-only PyTorch (replace URL for GPU builds)
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# All other dependencies
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
    "numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy seaborn tqdm

Note: cooler==0.9.3 requires numpy<2.0. The env.yml captures the exact versions used for this release.


Workflow

# Full end-to-end run (downloads bigwigs automatically; .mcool files must be present in data/raw/)
bash run_pipeline.sh

# Or run individual steps:

# 1. Build contact graph
python scripts/build_graph.py \
    --mcool data/raw/GM12878.mcool \
    --chrom chr21 --res 25000 \
    --bigwigs data/raw/GM12878_CTCF.bw data/raw/GM12878_H3K27me3.bw \
    --out data/processed/GM12878_chr21.pt

# 2. Compute A/B compartments
python scripts/compute_compartments.py \
    --mcool data/raw/GM12878.mcool --chrom chr21 --res 25000 \
    --bigwig_orient data/raw/GM12878_CTCF.bw \
    --out results/GM12878/compartments_chr21.csv

# 3. Train VGAE (best config: DeepGCN + edge weights + 3 node features)
python scripts/train_vgae.py \
    --graph data/processed/GM12878_chr21_3feat.pt \
    --encoder deep_gcn \
    --hidden 128 --latent 64 \
    --epochs 300 --patience 50 \
    --lr 3e-4 --dropout 0.3 --beta 0.5 --kl_anneal 100 \
    --outdir results/GM12878

# 4. Encode a second cell line with the trained model
python scripts/encode_graph.py \
    --model results/GM12878/model.pt \
    --graph data/processed/IMR90_chr21.pt \
    --out   results/IMR90/emb.npy

# 5. UMAP visualisation
python scripts/visualize_embeddings.py \
    --emb results/GM12878/emb.npy results/IMR90/emb.npy \
    --labels GM12878 IMR90 \
    --compartments results/GM12878/compartments_chr21.csv \
                   results/IMR90/compartments_chr21.csv \
    --prefix results/figures/umap

# 6. Per-bin embedding comparison
python scripts/compare_embeddings.py \
    --emb1 results/GM12878/emb.npy --emb2 results/IMR90/emb.npy \
    --label1 GM12878 --label2 IMR90 \
    --prefix results/figures/chr21

Results

Training (GM12878, chr21, 25 kb)

Encoder Node features Edge weights Test AUC Test AP Epochs
GCN (v1 baseline) 2 0.777 0.759 31
GAT (v2) 2 0.797 0.745 73
DeepGCN 2 0.888 0.844 137
DeepGCN 3 0.893 0.852 141

The dominant improvement came from passing ICE-balanced contact weights (edge_weight) to every GCNConv layer — signal that was computed but silently unused in earlier versions. The three-layer receptive field of DeepGCN covers a full TAD width at 25 kb resolution, which is the scale at which compartment identity is determined.


A/B compartment separation in the latent space

The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training.

Cell line Model Silhouette score (A/B, cosine) A bins B bins Masked (N)
GM12878 (training) GCN v1 0.775 602 683 584
IMR90 (zero-shot) GCN v1 0.443 614 709 546
GM12878 (training) DeepGCN 0.663 602 683 584
IMR90 (zero-shot) DeepGCN 0.473 614 709 546

The v1 GM12878 silhouette (0.775) is higher than DeepGCN's (0.663) because the higher-dimensional latent space (64 vs 32) spreads clusters further apart in cosine geometry. The more meaningful comparison is the zero-shot IMR90 transfer, where DeepGCN improves from 0.443 → 0.473 despite the BatchNorm statistics being fit to GM12878.

Figures:

Figure Description
results/figures/umap_GM12878_compartment.png GM12878 UMAP coloured by A/B compartment
results/figures/umap_GM12878_position.png GM12878 UMAP coloured by genomic position
results/figures/umap_IMR90_compartment.png IMR90 UMAP coloured by A/B compartment
results/figures/umap_joint.png Joint UMAP of both cell lines
results/figures/chr21_delta.png Per-bin cosine distance track (GM12878 vs IMR90)

Cell-type comparison: GM12878 vs IMR90

The IMR90 graph was encoded with the GM12878-trained model (zero-shot transfer). Per-bin cosine distances between the two embedding matrices reveal which genomic loci undergo the largest chromatin reorganisation between cell types.

Per-bin cosine distance summary:

Statistic Value
Mean 0.245
Median 0.028
Bins with distance < 0.1 (stable) 968 / 1,869 (52%)
Bins with distance > 0.5 (high shift) 293 / 1,869 (16%)

Mean cosine distance by GM12878 compartment:

Compartment Mean distance Median distance
A (active) 0.042 0.001
B (repressive) 0.451 0.352
N (masked) 0.213 0.000

Key finding: A-compartment bins are nearly invariant between the two cell types (mean Δ = 0.042), while B-compartment bins shift substantially (mean Δ = 0.451). This is consistent with the known biology: constitutively active chromatin domains tend to be conserved across cell types, while heterochromatic B-compartment organisation is more cell-type-specific.

Compartment switches (GM12878 → IMR90):

GM12878 IMR90 Bins Interpretation
A A 493 Stable active
B B 581 Stable repressive
B → A A 101 Loci that open in IMR90
A → B B 70 Loci that close in IMR90

101 loci switch from B (repressive in GM12878) to A (active in IMR90), versus 70 in the reverse direction. This asymmetry suggests that IMR90 fibroblasts activate more lineage-specific loci on chr21 than GM12878 lymphoblastoid cells.


Biological interpretation

The results demonstrate that a VGAE trained on Hi-C data recovers A/B compartment structure without supervision (silhouette = 0.775). The latent space organises chr21 bins according to their chromatin state, not just their linear genomic position — the UMAP coloured by genomic position shows a broadly continuous gradient, while the compartment-coloured UMAP shows discrete clusters.

The zero-shot application to IMR90 captures the partial conservation of compartment organisation across cell types. The B-compartment instability revealed by the cosine distance analysis is consistent with the literature: heterochromatin rewiring is a known driver of cell-type identity (Lieberman-Aiden et al. 2009; Dixon et al. 2015).

Notably, the model achieves this with only two node features (CTCF and H3K27me3 signal), demonstrating that a modest set of epigenomic marks, combined with contact topology, is sufficient to encode the major axis of chromatin organisation.


Limitations

  1. Single chromosome, single resolution. Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes.

  2. Random negative sampling inflates AUC. Negative edges are drawn uniformly at random, so many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling (same genomic distance band as positives) would give a more stringent and biologically honest evaluation.

  3. Link-prediction objective ≠ compartment recovery. The model is optimised to predict contacts, not compartments. The silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.

  4. Zero-shot transfer with fixed BatchNorm. Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison.

  5. Compartment calling is approximate. The O/E → Pearson correlation → PC1 pipeline is sensitive to the choice of orientation signal (CTCF here). Bins with low coverage are masked and assigned no compartment label, which affects the silhouette calculation.

  6. No TAD-level evaluation. TAD boundary detection would require a graph-level or boundary-aware metric. The current evaluation is node-level only.

  7. No statistical significance testing. The compartment switch counts (101 B→A, 70 A→B) have not been tested for significance against a null model (e.g., random permutation of compartment labels).


Future work

  • Apply to all autosomes and compare genome-wide compartment recovery.
  • Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients).
  • Fine-tune on IMR90 (transfer learning / BatchNorm adaptation) to improve zero-shot silhouette.
  • Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison.
  • Replace the inner-product decoder with a distance-aware decoder that subtracts expected polymer-scaling decay — the main remaining confounder for AUC.
  • Implement distance-matched negative sampling for a more stringent link-prediction evaluation.
  • Extend node features to H3K27ac and H3K9me3; all four active/repressive marks are available in data/raw/.
  • Benchmark against GAT — done; GAT (AUC 0.797) underperforms DeepGCN with edge weights (AUC 0.893) on this dataset.

Scripts

Script Purpose
scripts/build_graph.py Convert .mcool + bigWigs → PyG Data object
scripts/train_vgae.py Train VGAE with link-prediction objective + early stopping
scripts/encode_graph.py Encode a new graph with a trained model
scripts/compute_compartments.py O/E matrix → Pearson PCA → A/B compartment calls
scripts/visualize_embeddings.py UMAP + compartment visualisation + silhouette
scripts/compare_embeddings.py Per-bin cosine/L2/L1 distance between two embeddings
scripts/model.py Shared Encoder class (imported by train and encode scripts)

Citation

If you use this code or data in your research, please cite:

@software{okada_chromatin_gnn_2024,
  author  = {Okada, Toru},
  title   = {{chromatin-gnn: Variational Graph Autoencoder for learning
              latent representations of chromatin topology from Hi-C data}},
  year    = {2024},
  doi     = {10.5281/zenodo.XXXXXXX},
  url     = {https://github.com/ToruOkadaOi/chromatin-gnn},
  version = {1.0.0}
}

See also CITATION.cff for CFF-format metadata.

Key references:

  • Kipf, T. N. & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv:1611.07308
  • Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017
  • Rao, S. S. P. et al. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159(7), 16651680. https://doi.org/10.1016/j.cell.2014.11.021
  • Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950), 289293.

License

MIT — see LICENSE.

Description
No description provided
Readme MIT Cite this repository 167 MiB
Languages
Python 83.5%
Shell 16.5%