The three-dimensional organisation of chromatin in the nucleus is not random. Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate strongly with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional and require principled dimensionality reduction to extract biologically interpretable structure.

This repository applies a Variational Graph Autoencoder (VGAE) to Hi-C contact data to learn a compact, continuous latent representation of chromatin topology. Genomic bins are treated as graph nodes; normalised contact frequencies define weighted edges; ChIP-seq tracks for CTCF and H3K27me3 supply node features. The model is trained end-to-end on a link-prediction objective and evaluated for its ability to recover known biological structure — A/B compartments — in an entirely unsupervised manner.

Scientific question

Can a VGAE learn biologically meaningful latent representations of chromatin topology — capturing A/B compartments and cell-type-specific reorganisation — from Hi-C contact data alone, in an unsupervised manner?

Architecture

Three encoder architectures are provided (--encoder gcn | gat | deep_gcn). The best-performing model is DeepGCN (3-layer GCN with edge-weighted message passing):

Node features (3D: CTCF, H3K27me3, H3K4me3)
         │
     BatchNorm
         │
   GCNConv(128, edge_weight)  ← layer 1
         │
   BatchNorm → ReLU → Dropout(0.3)
         │
   GCNConv(128, edge_weight)  ← layer 2
         │
       ReLU → Dropout(0.3)
        / \
GCNConv(64) GCNConv(64)   (edge_weight)
    μ           log σ
        \ /
   Reparameterisation
         │
     z ∈ ℝ⁶⁴   (node embeddings)
         │
 Inner-product decoder
 (β-VGAE objective: BCE + β·KL with linear warm-up)

ICE-balanced contact weights (log1p-normalised) are passed as edge_weight to every GCNConv layer, allowing the model to up-weight strong contacts during message passing. The decoder is the standard dot-product decoder from Kipf & Welling (2017). Training uses a link-prediction objective: the model distinguishes real Hi-C contacts from randomly sampled non-contacts.

Dataset

All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resolution.

File	Cell line	Type	Source	Accession
GM12878.mcool	GM12878 (lymphoblastoid)	Hi-C contact matrix	4DN Data Portal	4DNFIRUMEC32
IMR90.mcool	IMR-90 (lung fibroblast)	Hi-C contact matrix	4DN Data Portal	4DNFIABB3FHQ
GM12878_CTCF.bw	GM12878	CTCF ChIP-seq (FC/control)	ENCODE	ENCFF741BAQ (exp. ENCSR000AKB)
GM12878_H3K27me3.bw	GM12878	H3K27me3 ChIP-seq (FC/control)	ENCODE	ENCFF736CNQ (exp. ENCSR000AKD)
GM12878_H3K4me3.bw	GM12878	H3K4me3 ChIP-seq (FC/control)	ENCODE	—
IMR90_CTCF.bw	IMR-90	CTCF ChIP-seq (FC/control)	ENCODE	ENCFF770DUD (exp. ENCSR000EFI)
IMR90_H3K27me3.bw	IMR-90	H3K27me3 ChIP-seq (FC/control)	ENCODE	ENCFF158HZL (exp. ENCSR431UUY)
IMR90_H3K4me3.bw	IMR-90	H3K4me3 ChIP-seq (FC/control)	ENCODE	—

Graph statistics:

Cell line	Bins (chr21, 25 kb)	Edges (contacts, undirected)	Node features
GM12878	1,869	172,310	3 (CTCF, H3K27me3, H3K4me3)
IMR90	1,869	136,121	3 (CTCF, H3K27me3, H3K4me3)

IMR90 has ~55% more intra-chromosomal contacts than GM12878 at chr21, suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line.

Installation

conda create -n chromatin_gnn python=3.10 -y
conda activate chromatin_gnn

# CPU-only PyTorch (replace URL for GPU builds)
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu

# All other dependencies
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
    "numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy seaborn tqdm

Note: cooler==0.9.3 requires numpy<2.0. The env.yml captures the exact versions used for this release.

Workflow

# Full end-to-end run (downloads bigwigs automatically; .mcool files must be present in data/raw/)
bash run_pipeline.sh

# Or run individual steps:

# 1. Build contact graph
python scripts/build_graph.py \
    --mcool data/raw/GM12878.mcool \
    --chrom chr21 --res 25000 \
    --bigwigs data/raw/GM12878_CTCF.bw data/raw/GM12878_H3K27me3.bw \
    --out data/processed/GM12878_chr21.pt

# 2. Compute A/B compartments
python scripts/compute_compartments.py \
    --mcool data/raw/GM12878.mcool --chrom chr21 --res 25000 \
    --bigwig_orient data/raw/GM12878_CTCF.bw \
    --out results/GM12878/compartments_chr21.csv

# 3. Train VGAE (best config: DeepGCN + edge weights + 3 node features)
python scripts/train_vgae.py \
    --graph data/processed/GM12878_chr21_3feat.pt \
    --encoder deep_gcn \
    --hidden 128 --latent 64 \
    --epochs 300 --patience 50 \
    --lr 3e-4 --dropout 0.3 --beta 0.5 --kl_anneal 100 \
    --outdir results/GM12878

# 4. Encode a second cell line with the trained model
python scripts/encode_graph.py \
    --model results/GM12878/model.pt \
    --graph data/processed/IMR90_chr21.pt \
    --out   results/IMR90/emb.npy

# 5. UMAP visualisation
python scripts/visualize_embeddings.py \
    --emb results/GM12878/emb.npy results/IMR90/emb.npy \
    --labels GM12878 IMR90 \
    --compartments results/GM12878/compartments_chr21.csv \
                   results/IMR90/compartments_chr21.csv \
    --prefix results/figures/umap

# 6. Per-bin embedding comparison
python scripts/compare_embeddings.py \
    --emb1 results/GM12878/emb.npy --emb2 results/IMR90/emb.npy \
    --label1 GM12878 --label2 IMR90 \
    --prefix results/figures/chr21

Results

Training (GM12878, chr21, 25 kb)

Encoder	Node features	Edge weights	Test AUC	Test AP	Epochs
GCN (v1 baseline)	2	✗	0.777	0.759	31
GAT (v2)	2	✗	0.797	0.745	73
DeepGCN	2	✓	0.888	0.844	137
DeepGCN	3	✓	0.893	0.852	141

The dominant improvement came from passing ICE-balanced contact weights (edge_weight) to every GCNConv layer — signal that was computed but silently unused in earlier versions. The three-layer receptive field of DeepGCN covers a full TAD width at 25 kb resolution, which is the scale at which compartment identity is determined.

A/B compartment separation in the latent space

The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training.

Cell line	Model	Silhouette score (A/B, cosine)	A bins	B bins	Masked (N)
GM12878 (training)	GCN v1	0.775	602	683	584
IMR90 (zero-shot)	GCN v1	0.443	614	709	546
GM12878 (training)	DeepGCN	0.663	602	683	584
IMR90 (zero-shot)	DeepGCN	0.473	614	709	546

The v1 GM12878 silhouette (0.775) is higher than DeepGCN's (0.663) because the higher-dimensional latent space (64 vs 32) spreads clusters further apart in cosine geometry. The more meaningful comparison is the zero-shot IMR90 transfer, where DeepGCN improves from 0.443 → 0.473 despite the BatchNorm statistics being fit to GM12878.

Figures:

Figure	Description
`results/figures/umap_GM12878_compartment.png`	GM12878 UMAP coloured by A/B compartment
`results/figures/umap_GM12878_position.png`	GM12878 UMAP coloured by genomic position
`results/figures/umap_IMR90_compartment.png`	IMR90 UMAP coloured by A/B compartment
`results/figures/umap_joint.png`	Joint UMAP of both cell lines
`results/figures/chr21_delta.png`	Per-bin cosine distance track (GM12878 vs IMR90)

Cell-type comparison: GM12878 vs IMR90

The IMR90 graph was encoded with the GM12878-trained model (zero-shot transfer). Per-bin cosine distances between the two embedding matrices reveal which genomic loci undergo the largest chromatin reorganisation between cell types.

Per-bin cosine distance summary:

Statistic	Value
Mean	0.245
Median	0.028
Bins with distance < 0.1 (stable)	968 / 1,869 (52%)
Bins with distance > 0.5 (high shift)	293 / 1,869 (16%)

Mean cosine distance by GM12878 compartment:

Compartment	Mean distance	Median distance
A (active)	0.042	0.001
B (repressive)	0.451	0.352
N (masked)	0.213	0.000

Key finding: A-compartment bins are nearly invariant between the two cell types (mean Δ = 0.042), while B-compartment bins shift substantially (mean Δ = 0.451). This is consistent with the known biology: constitutively active chromatin domains tend to be conserved across cell types, while heterochromatic B-compartment organisation is more cell-type-specific.

Compartment switches (GM12878 → IMR90):

GM12878	IMR90	Bins	Interpretation
A	A	493	Stable active
B	B	581	Stable repressive
B → A	A	101	Loci that open in IMR90
A → B	B	70	Loci that close in IMR90

101 loci switch from B (repressive in GM12878) to A (active in IMR90), versus 70 in the reverse direction. This asymmetry suggests that IMR90 fibroblasts activate more lineage-specific loci on chr21 than GM12878 lymphoblastoid cells.

Biological interpretation

The results demonstrate that a VGAE trained on Hi-C data recovers A/B compartment structure without supervision (silhouette = 0.775). The latent space organises chr21 bins according to their chromatin state, not just their linear genomic position — the UMAP coloured by genomic position shows a broadly continuous gradient, while the compartment-coloured UMAP shows discrete clusters.

The zero-shot application to IMR90 captures the partial conservation of compartment organisation across cell types. The B-compartment instability revealed by the cosine distance analysis is consistent with the literature: heterochromatin rewiring is a known driver of cell-type identity (Lieberman-Aiden et al. 2009; Dixon et al. 2015).

Notably, the model achieves this with only two node features (CTCF and H3K27me3 signal), demonstrating that a modest set of epigenomic marks, combined with contact topology, is sufficient to encode the major axis of chromatin organisation.

Limitations

Single chromosome, single resolution. Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes.
Random negative sampling inflates AUC. Negative edges are drawn uniformly at random, so many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling (same genomic distance band as positives) would give a more stringent and biologically honest evaluation.
Link-prediction objective ≠ compartment recovery. The model is optimised to predict contacts, not compartments. The silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.
Zero-shot transfer with fixed BatchNorm. Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison.
Compartment calling is approximate. The O/E → Pearson correlation → PC1 pipeline is sensitive to the choice of orientation signal (CTCF here). Bins with low coverage are masked and assigned no compartment label, which affects the silhouette calculation.
No TAD-level evaluation. TAD boundary detection would require a graph-level or boundary-aware metric. The current evaluation is node-level only.
No statistical significance testing. The compartment switch counts (101 B→A, 70 A→B) have not been tested for significance against a null model (e.g., random permutation of compartment labels).

Future work

Apply to all autosomes and compare genome-wide compartment recovery.
Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients).
Fine-tune on IMR90 (transfer learning / BatchNorm adaptation) to improve zero-shot silhouette.
Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison.
Replace the inner-product decoder with a distance-aware decoder that subtracts expected polymer-scaling decay — the main remaining confounder for AUC.
Implement distance-matched negative sampling for a more stringent link-prediction evaluation.
Extend node features to H3K27ac and H3K9me3; all four active/repressive marks are available in data/raw/.
~~Benchmark against GAT~~ — done; GAT (AUC 0.797) underperforms DeepGCN with edge weights (AUC 0.893) on this dataset.

Scripts

Script	Purpose
`scripts/build_graph.py`	Convert .mcool + bigWigs → PyG Data object
`scripts/train_vgae.py`	Train VGAE with link-prediction objective + early stopping
`scripts/encode_graph.py`	Encode a new graph with a trained model
`scripts/compute_compartments.py`	O/E matrix → Pearson PCA → A/B compartment calls
`scripts/visualize_embeddings.py`	UMAP + compartment visualisation + silhouette
`scripts/compare_embeddings.py`	Per-bin cosine/L2/L1 distance between two embeddings
`scripts/model.py`	Shared Encoder class (imported by train and encode scripts)

Citation

If you use this code or data in your research, please cite:

@software{okada_chromatin_gnn_2024,
  author  = {Okada, Toru},
  title   = {{chromatin-gnn: Variational Graph Autoencoder for learning
              latent representations of chromatin topology from Hi-C data}},
  year    = {2024},
  doi     = {10.5281/zenodo.XXXXXXX},
  url     = {https://github.com/ToruOkadaOi/chromatin-gnn},
  version = {1.0.0}
}

See also CITATION.cff for CFF-format metadata.

Key references:

Kipf, T. N. & Welling, M. (2016). Variational Graph Auto-Encoders. arXiv:1611.07308
Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017
Rao, S. S. P. et al. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell, 159(7), 1665–1680. https://doi.org/10.1016/j.cell.2014.11.021
Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950), 289–293.

License

MIT — see LICENSE.

README.md Unescape Escape

chromatin-gnn

Overview