aman acadbd780c v1.0.0: VGAE applied to GM12878 vs IMR90 chr21 Hi-C at 25kb
Full reproducible pipeline: .mcool + ChIP-seq bigwigs → latent
  embeddings → A/B compartment calls → cross-cell comparison.

  Key results (chr21, 25 kb, latent dim=32):
  - Test AUC=0.777, AP=0.759 (converged epoch 31/300)
  - GM12878 A/B silhouette (cosine) = 0.775
  - IMR90 zero-shot silhouette = 0.443
  - A-compartment bins stable across cell types (mean cosine Δ=0.042)
  - B-compartment bins shift substantially (mean cosine Δ=0.451)
  - 101 B→A and 70 A→B compartment switches GM12878→IMR90
2026-05-15 01:53:04 +02:00
2026-04-12 01:09:35 +02:00

chromatin-gnn

Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data

DOI License: MIT


Overview

The three-dimensional organisation of chromatin in the nucleus is not random. Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate strongly with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional and require principled dimensionality reduction to extract biologically interpretable structure.

This repository applies a Variational Graph Autoencoder (VGAE) to Hi-C contact data to learn a compact, continuous latent representation of chromatin topology. Genomic bins are treated as graph nodes; normalised contact frequencies define weighted edges; ChIP-seq tracks for CTCF and H3K27me3 supply node features. The model is trained end-to-end on a link-prediction objective and evaluated for its ability to recover known biological structure — A/B compartments — in an entirely unsupervised manner.


Scientific question

Can a VGAE learn biologically meaningful latent representations of chromatin topology — capturing A/B compartments and cell-type-specific reorganisation — from Hi-C contact data alone, in an unsupervised manner?


Architecture

Node features (2D: CTCF, H3K27me3)
         │
     BatchNorm
         │
   GCNConv(64)  ← shared message-passing layer
         │
       ReLU + Dropout(0.2)
        / \
GCNConv(32) GCNConv(32)
    μ           log σ
        \ /
   Reparameterisation
         │
     z ∈ ℝ³²   (node embeddings)
         │
 Inner-product decoder
 (link prediction objective: binary cross-entropy + KL divergence)

The encoder is a two-layer Graph Convolutional Network (Kipf & Welling 2016, 2017) with a BatchNorm input layer. The decoder is the standard dot-product decoder used in the original VGAE paper. Training uses a link-prediction objective: the model is asked to distinguish real Hi-C contacts from randomly sampled non-contacts.


Dataset

All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resolution.

File Cell line Type Source Accession
GM12878.mcool GM12878 (lymphoblastoid) Hi-C contact matrix 4DN Data Portal 4DNFIRUMEC32
IMR90.mcool IMR-90 (lung fibroblast) Hi-C contact matrix 4DN Data Portal 4DNFIABB3FHQ
GM12878_CTCF.bw GM12878 CTCF ChIP-seq (FC/control) ENCODE ENCFF741BAQ (exp. ENCSR000AKB)
GM12878_H3K27me3.bw GM12878 H3K27me3 ChIP-seq (FC/control) ENCODE ENCFF736CNQ (exp. ENCSR000AKD)
IMR90_CTCF.bw IMR-90 CTCF ChIP-seq (FC/control) ENCODE ENCFF770DUD (exp. ENCSR000EFI)
IMR90_H3K27me3.bw IMR-90 H3K27me3 ChIP-seq (FC/control) ENCODE ENCFF158HZL (exp. ENCSR431UUY)

Graph statistics:

Cell line Bins (chr21, 25 kb) Edges (contacts) Node features
GM12878 1,869 87,557 2 (CTCF, H3K27me3)
IMR90 1,869 136,121 2 (CTCF, H3K27me3)

IMR90 has ~55% more intra-chromosomal contacts than GM12878 at chr21, suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line.


Installation

conda create -n chromatin_gnn python=3.10 -y
conda activate chromatin_gnn

# CPU-only PyTorch (replace URL for GPU builds)
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# All other dependencies
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
    "numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy seaborn tqdm

Note: cooler==0.9.3 requires numpy<2.0. The env.yml captures the exact versions used for this release.


Workflow

# Full end-to-end run (downloads bigwigs automatically; .mcool files must be present in data/raw/)
bash run_pipeline.sh

# Or run individual steps:

# 1. Build contact graph
python scripts/build_graph.py \
    --mcool data/raw/GM12878.mcool \
    --chrom chr21 --res 25000 \
    --bigwigs data/raw/GM12878_CTCF.bw data/raw/GM12878_H3K27me3.bw \
    --out data/processed/GM12878_chr21.pt

# 2. Compute A/B compartments
python scripts/compute_compartments.py \
    --mcool data/raw/GM12878.mcool --chrom chr21 --res 25000 \
    --bigwig_orient data/raw/GM12878_CTCF.bw \
    --out results/GM12878/compartments_chr21.csv

# 3. Train VGAE
python scripts/train_vgae.py \
    --graph data/processed/GM12878_chr21.pt \
    --epochs 300 --patience 20 --hidden 64 --latent 32 \
    --outdir results/GM12878

# 4. Encode a second cell line with the trained model
python scripts/encode_graph.py \
    --model results/GM12878/model.pt \
    --graph data/processed/IMR90_chr21.pt \
    --out   results/IMR90/emb.npy

# 5. UMAP visualisation
python scripts/visualize_embeddings.py \
    --emb results/GM12878/emb.npy results/IMR90/emb.npy \
    --labels GM12878 IMR90 \
    --compartments results/GM12878/compartments_chr21.csv \
                   results/IMR90/compartments_chr21.csv \
    --prefix results/figures/umap

# 6. Per-bin embedding comparison
python scripts/compare_embeddings.py \
    --emb1 results/GM12878/emb.npy --emb2 results/IMR90/emb.npy \
    --label1 GM12878 --label2 IMR90 \
    --prefix results/figures/chr21

Results

Training (GM12878, chr21, 25 kb)

Metric Value
Epochs to convergence 31 / 300 (early stopping, patience=20)
Validation AUC (link prediction) 0.774
Test AUC 0.777
Test AP 0.759
Latent dimensionality 32

The model converged rapidly, suggesting that the graph structure of chr21 at 25 kb is learnable with a shallow two-layer GCN.


A/B compartment separation in the latent space

The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training.

Cell line Silhouette score (A/B, cosine) A bins B bins Masked (N)
GM12878 (training) 0.775 602 683 584
IMR90 (zero-shot) 0.443 614 709 546

The GM12878 silhouette of 0.775 indicates that the VGAE has learned a latent space in which A and B compartments are nearly linearly separable — a strong signal given that compartment identity was never provided as a training label.

For IMR90, encoded zero-shot with the GM12878-trained model, the silhouette drops to 0.443. This is expected: the model's BatchNorm statistics were fit to GM12878, and IMR90's chromatin organisation partially diverges.

Figures:

Figure Description
results/figures/umap_GM12878_compartment.png GM12878 UMAP coloured by A/B compartment
results/figures/umap_GM12878_position.png GM12878 UMAP coloured by genomic position
results/figures/umap_IMR90_compartment.png IMR90 UMAP coloured by A/B compartment
results/figures/umap_joint.png Joint UMAP of both cell lines
results/figures/chr21_delta.png Per-bin cosine distance track (GM12878 vs IMR90)

Cell-type comparison: GM12878 vs IMR90

The IMR90 graph was encoded with the GM12878-trained model (zero-shot transfer). Per-bin cosine distances between the two embedding matrices reveal which genomic loci undergo the largest chromatin reorganisation between cell types.

Per-bin cosine distance summary:

Statistic Value
Mean 0.245
Median 0.028
Bins with distance < 0.1 (stable) 968 / 1,869 (52%)
Bins with distance > 0.5 (high shift) 293 / 1,869 (16%)

Mean cosine distance by GM12878 compartment:

Compartment Mean distance Median distance
A (active) 0.042 0.001
B (repressive) 0.451 0.352
N (masked) 0.213 0.000

Key finding: A-compartment bins are nearly invariant between the two cell types (mean Δ = 0.042), while B-compartment bins shift substantially (mean Δ = 0.451). This is consistent with the known biology: constitutively active chromatin domains tend to be conserved across cell types, while heterochromatic B-compartment organisation is more cell-type-specific.

Compartment switches (GM12878 → IMR90):

GM12878 IMR90 Bins Interpretation
A A 493 Stable active
B B 581 Stable repressive
B → A A 101 Loci that open in IMR90
A → B B 70 Loci that close in IMR90

101 loci switch from B (repressive in GM12878) to A (active in IMR90), versus 70 in the reverse direction. This asymmetry suggests that IMR90 fibroblasts activate more lineage-specific loci on chr21 than GM12878 lymphoblastoid cells.


Biological interpretation

The results demonstrate that a VGAE trained on Hi-C data recovers A/B compartment structure without supervision (silhouette = 0.775). The latent space organises chr21 bins according to their chromatin state, not just their linear genomic position — the UMAP coloured by genomic position shows a broadly continuous gradient, while the compartment-coloured UMAP shows discrete clusters.

The zero-shot application to IMR90 captures the partial conservation of compartment organisation across cell types. The B-compartment instability revealed by the cosine distance analysis is consistent with the literature: heterochromatin rewiring is a known driver of cell-type identity (Lieberman-Aiden et al. 2009; Dixon et al. 2015).

Notably, the model achieves this with only two node features (CTCF and H3K27me3 signal), demonstrating that a modest set of epigenomic marks, combined with contact topology, is sufficient to encode the major axis of chromatin organisation.


Limitations

  1. Single chromosome, single resolution. Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes.

  2. Shallow encoder. The two-layer GCN has a local receptive field (2-hop neighbourhood). Long-range chromatin interactions spanning multiple TADs are not directly encoded. Deeper networks or attention-based architectures may capture these better.

  3. Link-prediction objective ≠ compartment recovery. The model is optimised to predict contacts, not compartments. The strong silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.

  4. Zero-shot transfer with fixed BatchNorm. Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison.

  5. Compartment calling is approximate. The O/E → Pearson correlation → PC1 pipeline is sensitive to the choice of orientation signal (CTCF here). Bins with low coverage are masked and assigned no compartment label, which affects the silhouette calculation.

  6. No TAD-level evaluation. TAD boundary detection would require a graph-level or boundary-aware metric. The current evaluation is node-level only.

  7. No statistical significance testing. The compartment switch counts (101 B→A, 70 A→B) have not been tested for significance against a null model (e.g., random permutation of compartment labels).


Future work

  • Apply to all autosomes and compare genome-wide compartment recovery.
  • Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients).
  • Fine-tune on IMR90 (transfer learning) to improve the IMR90 silhouette score.
  • Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison.
  • Replace the inner-product decoder with a distance-aware decoder that incorporates linear genomic distance.
  • Benchmark against PCA/UMAP of the raw contact matrix and against other graph-based methods (GraphSAGE, GAT).
  • Extend node features to include additional histone marks (H3K4me3, H3K27ac, H3K9me3) to test whether richer epigenomic context improves compartment recovery.

Scripts

Script Purpose
scripts/build_graph.py Convert .mcool + bigWigs → PyG Data object
scripts/train_vgae.py Train VGAE with link-prediction objective + early stopping
scripts/encode_graph.py Encode a new graph with a trained model
scripts/compute_compartments.py O/E matrix → Pearson PCA → A/B compartment calls
scripts/visualize_embeddings.py UMAP + compartment visualisation + silhouette
scripts/compare_embeddings.py Per-bin cosine/L2/L1 distance between two embeddings
scripts/model.py Shared Encoder class (imported by train and encode scripts)

Citation

If you use this code or data in your research, please cite:

@software{okada_chromatin_gnn_2024,
  author  = {Okada, Toru},
  title   = {{chromatin-gnn: Variational Graph Autoencoder for learning
              latent representations of chromatin topology from Hi-C data}},
  year    = {2024},
  doi     = {10.5281/zenodo.XXXXXXX},
  url     = {https://github.com/ToruOkadaOi/chromatin-gnn},
  version = {1.0.0}
}

See also CITATION.cff for CFF-format metadata.

Key references:

  • Kipf, T. N. & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv:1611.07308
  • Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017
  • Rao, S. S. P. et al. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159(7), 16651680. https://doi.org/10.1016/j.cell.2014.11.021
  • Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950), 289293.

License

MIT — see LICENSE.

Description
No description provided
Readme MIT Cite this repository 167 MiB
Languages
Python 83.5%
Shell 16.5%