Files
chromatin-vgae-hic/README.md
2026-05-15 03:41:47 +02:00

326 lines
15 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# chromatin-gnn
**Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data**
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.XXXXXXX.svg)](https://doi.org/10.5281/zenodo.XXXXXXX)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
---
## Overview
The three-dimensional organisation of chromatin in the nucleus is not random. Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate strongly with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional and require principled dimensionality reduction to extract biologically interpretable structure.
This repository applies a **Variational Graph Autoencoder (VGAE)** to Hi-C contact data to learn a compact, continuous latent representation of chromatin topology. Genomic bins are treated as graph nodes; normalised contact frequencies define weighted edges; ChIP-seq tracks for CTCF and H3K27me3 supply node features. The model is trained end-to-end on a link-prediction objective and evaluated for its ability to recover known biological structure — A/B compartments — in an entirely unsupervised manner.
---
## Scientific question
> Can a VGAE learn biologically meaningful latent representations of chromatin topology — capturing A/B compartments and cell-type-specific reorganisation — from Hi-C contact data alone, in an unsupervised manner?
---
## Architecture
Three encoder architectures are provided (`--encoder gcn | gat | deep_gcn`).
The best-performing model is **DeepGCN** (3-layer GCN with edge-weighted message passing):
```
Node features (3D: CTCF, H3K27me3, H3K4me3)
BatchNorm
GCNConv(128, edge_weight) ← layer 1
BatchNorm → ReLU → Dropout(0.3)
GCNConv(128, edge_weight) ← layer 2
ReLU → Dropout(0.3)
/ \
GCNConv(64) GCNConv(64) (edge_weight)
μ log σ
\ /
Reparameterisation
z ∈ ℝ⁶⁴ (node embeddings)
Inner-product decoder
(β-VGAE objective: BCE + β·KL with linear warm-up)
```
ICE-balanced contact weights (log1p-normalised) are passed as `edge_weight` to every
GCNConv layer, allowing the model to up-weight strong contacts during message passing.
The decoder is the standard dot-product decoder from Kipf & Welling (2017). Training
uses a link-prediction objective: the model distinguishes real Hi-C contacts from
randomly sampled non-contacts.
---
## Dataset
All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resolution.
| File | Cell line | Type | Source | Accession |
|------|-----------|------|--------|-----------|
| GM12878.mcool | GM12878 (lymphoblastoid) | Hi-C contact matrix | 4DN Data Portal | 4DNFIRUMEC32 |
| IMR90.mcool | IMR-90 (lung fibroblast) | Hi-C contact matrix | 4DN Data Portal | 4DNFIABB3FHQ |
| GM12878_CTCF.bw | GM12878 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF741BAQ (exp. ENCSR000AKB) |
| GM12878_H3K27me3.bw | GM12878 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF736CNQ (exp. ENCSR000AKD) |
| GM12878_H3K4me3.bw | GM12878 | H3K4me3 ChIP-seq (FC/control) | ENCODE | — |
| IMR90_CTCF.bw | IMR-90 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF770DUD (exp. ENCSR000EFI) |
| IMR90_H3K27me3.bw | IMR-90 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF158HZL (exp. ENCSR431UUY) |
| IMR90_H3K4me3.bw | IMR-90 | H3K4me3 ChIP-seq (FC/control) | ENCODE | — |
**Graph statistics:**
| Cell line | Bins (chr21, 25 kb) | Edges (contacts, undirected) | Node features |
|-----------|---------------------|------------------------------|---------------|
| GM12878 | 1,869 | 172,310 | 3 (CTCF, H3K27me3, H3K4me3) |
| IMR90 | 1,869 | 136,121 | 3 (CTCF, H3K27me3, H3K4me3) |
IMR90 has ~55% more intra-chromosomal contacts than GM12878 at chr21, suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line.
---
## Installation
```bash
conda create -n chromatin_gnn python=3.10 -y
conda activate chromatin_gnn
# CPU-only PyTorch (replace URL for GPU builds)
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu
# All other dependencies
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
"numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy seaborn tqdm
```
> **Note:** `cooler==0.9.3` requires `numpy<2.0`. The env.yml captures the exact versions used for this release.
---
## Workflow
```bash
# Full end-to-end run (downloads bigwigs automatically; .mcool files must be present in data/raw/)
bash run_pipeline.sh
# Or run individual steps:
# 1. Build contact graph
python scripts/build_graph.py \
--mcool data/raw/GM12878.mcool \
--chrom chr21 --res 25000 \
--bigwigs data/raw/GM12878_CTCF.bw data/raw/GM12878_H3K27me3.bw \
--out data/processed/GM12878_chr21.pt
# 2. Compute A/B compartments
python scripts/compute_compartments.py \
--mcool data/raw/GM12878.mcool --chrom chr21 --res 25000 \
--bigwig_orient data/raw/GM12878_CTCF.bw \
--out results/GM12878/compartments_chr21.csv
# 3. Train VGAE (best config: DeepGCN + edge weights + 3 node features)
python scripts/train_vgae.py \
--graph data/processed/GM12878_chr21_3feat.pt \
--encoder deep_gcn \
--hidden 128 --latent 64 \
--epochs 300 --patience 50 \
--lr 3e-4 --dropout 0.3 --beta 0.5 --kl_anneal 100 \
--outdir results/GM12878
# 4. Encode a second cell line with the trained model
python scripts/encode_graph.py \
--model results/GM12878/model.pt \
--graph data/processed/IMR90_chr21.pt \
--out results/IMR90/emb.npy
# 5. UMAP visualisation
python scripts/visualize_embeddings.py \
--emb results/GM12878/emb.npy results/IMR90/emb.npy \
--labels GM12878 IMR90 \
--compartments results/GM12878/compartments_chr21.csv \
results/IMR90/compartments_chr21.csv \
--prefix results/figures/umap
# 6. Per-bin embedding comparison
python scripts/compare_embeddings.py \
--emb1 results/GM12878/emb.npy --emb2 results/IMR90/emb.npy \
--label1 GM12878 --label2 IMR90 \
--prefix results/figures/chr21
```
---
## Results
### Training (GM12878, chr21, 25 kb)
| Encoder | Node features | Edge weights | Test AUC | Test AP | Epochs |
|---------|--------------|-------------|----------|---------|--------|
| GCN (v1 baseline) | 2 | ✗ | 0.777 | 0.759 | 31 |
| GAT (v2) | 2 | ✗ | 0.797 | 0.745 | 73 |
| DeepGCN | 2 | ✓ | 0.888 | 0.844 | 137 |
| **DeepGCN** | **3** | **✓** | **0.893** | **0.852** | **141** |
The dominant improvement came from passing ICE-balanced contact weights (`edge_weight`)
to every GCNConv layer — signal that was computed but silently unused in earlier versions.
The three-layer receptive field of DeepGCN covers a full TAD width at 25 kb resolution,
which is the scale at which compartment identity is determined.
---
### A/B compartment separation in the latent space
The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training.
| Cell line | Model | Silhouette score (A/B, cosine) | A bins | B bins | Masked (N) |
|-----------|-------|-------------------------------|--------|--------|------------|
| GM12878 (training) | GCN v1 | 0.775 | 602 | 683 | 584 |
| IMR90 (zero-shot) | GCN v1 | 0.443 | 614 | 709 | 546 |
| GM12878 (training) | **DeepGCN** | **0.663** | 602 | 683 | 584 |
| IMR90 (zero-shot) | **DeepGCN** | **0.473** | 614 | 709 | 546 |
The v1 GM12878 silhouette (0.775) is higher than DeepGCN's (0.663) because the
higher-dimensional latent space (64 vs 32) spreads clusters further apart in cosine
geometry. The more meaningful comparison is the zero-shot IMR90 transfer, where
DeepGCN improves from 0.443 → 0.473 despite the BatchNorm statistics being fit to
GM12878.
**Figures:**
| Figure | Description |
|--------|-------------|
| `results/figures/umap_GM12878_compartment.png` | GM12878 UMAP coloured by A/B compartment |
| `results/figures/umap_GM12878_position.png` | GM12878 UMAP coloured by genomic position |
| `results/figures/umap_IMR90_compartment.png` | IMR90 UMAP coloured by A/B compartment |
| `results/figures/umap_joint.png` | Joint UMAP of both cell lines |
| `results/figures/chr21_delta.png` | Per-bin cosine distance track (GM12878 vs IMR90) |
---
### Cell-type comparison: GM12878 vs IMR90
The IMR90 graph was encoded with the GM12878-trained model (zero-shot transfer). Per-bin cosine distances between the two embedding matrices reveal which genomic loci undergo the largest chromatin reorganisation between cell types.
**Per-bin cosine distance summary:**
| Statistic | Value |
|-----------|-------|
| Mean | 0.245 |
| Median | 0.028 |
| Bins with distance < 0.1 (stable) | 968 / 1,869 (52%) |
| Bins with distance > 0.5 (high shift) | 293 / 1,869 (16%) |
**Mean cosine distance by GM12878 compartment:**
| Compartment | Mean distance | Median distance |
|-------------|--------------|-----------------|
| A (active) | 0.042 | 0.001 |
| B (repressive) | 0.451 | 0.352 |
| N (masked) | 0.213 | 0.000 |
**Key finding:** A-compartment bins are nearly invariant between the two cell types (mean Δ = 0.042), while B-compartment bins shift substantially (mean Δ = 0.451). This is consistent with the known biology: constitutively active chromatin domains tend to be conserved across cell types, while heterochromatic B-compartment organisation is more cell-type-specific.
**Compartment switches (GM12878 → IMR90):**
| GM12878 | IMR90 | Bins | Interpretation |
|---------|-------|------|----------------|
| A | A | 493 | Stable active |
| B | B | 581 | Stable repressive |
| B → A | A | 101 | Loci that open in IMR90 |
| A → B | B | 70 | Loci that close in IMR90 |
101 loci switch from B (repressive in GM12878) to A (active in IMR90), versus 70 in the reverse direction. This asymmetry suggests that IMR90 fibroblasts activate more lineage-specific loci on chr21 than GM12878 lymphoblastoid cells.
---
## Biological interpretation
The results demonstrate that a VGAE trained on Hi-C data **recovers A/B compartment structure without supervision** (silhouette = 0.775). The latent space organises chr21 bins according to their chromatin state, not just their linear genomic position — the UMAP coloured by genomic position shows a broadly continuous gradient, while the compartment-coloured UMAP shows discrete clusters.
The zero-shot application to IMR90 captures the partial conservation of compartment organisation across cell types. The B-compartment instability revealed by the cosine distance analysis is consistent with the literature: heterochromatin rewiring is a known driver of cell-type identity (Lieberman-Aiden et al. 2009; Dixon et al. 2015).
Notably, the model achieves this with only two node features (CTCF and H3K27me3 signal), demonstrating that a modest set of epigenomic marks, combined with contact topology, is sufficient to encode the major axis of chromatin organisation.
---
## Limitations
1. **Single chromosome, single resolution.** Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes.
2. **Random negative sampling inflates AUC.** Negative edges are drawn uniformly at random, so many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling (same genomic distance band as positives) would give a more stringent and biologically honest evaluation.
3. **Link-prediction objective ≠ compartment recovery.** The model is optimised to predict contacts, not compartments. The silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.
4. **Zero-shot transfer with fixed BatchNorm.** Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison.
5. **Compartment calling is approximate.** The O/E → Pearson correlation → PC1 pipeline is sensitive to the choice of orientation signal (CTCF here). Bins with low coverage are masked and assigned no compartment label, which affects the silhouette calculation.
6. **No TAD-level evaluation.** TAD boundary detection would require a graph-level or boundary-aware metric. The current evaluation is node-level only.
7. **No statistical significance testing.** The compartment switch counts (101 B→A, 70 A→B) have not been tested for significance against a null model (e.g., random permutation of compartment labels).
---
## Future work
- Apply to all autosomes and compare genome-wide compartment recovery.
- Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients).
- Fine-tune on IMR90 (transfer learning / BatchNorm adaptation) to improve zero-shot silhouette.
- Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison.
- Replace the inner-product decoder with a distance-aware decoder that subtracts expected polymer-scaling decay — the main remaining confounder for AUC.
- Implement distance-matched negative sampling for a more stringent link-prediction evaluation.
- Extend node features to H3K27ac and H3K9me3; all four active/repressive marks are available in `data/raw/`.
- ~~Benchmark against GAT~~ — done; GAT (AUC 0.797) underperforms DeepGCN with edge weights (AUC 0.893) on this dataset.
---
## Scripts
| Script | Purpose |
|--------|---------|
| `scripts/build_graph.py` | Convert .mcool + bigWigs → PyG Data object |
| `scripts/train_vgae.py` | Train VGAE with link-prediction objective + early stopping |
| `scripts/encode_graph.py` | Encode a new graph with a trained model |
| `scripts/compute_compartments.py` | O/E matrix → Pearson PCA → A/B compartment calls |
| `scripts/visualize_embeddings.py` | UMAP + compartment visualisation + silhouette |
| `scripts/compare_embeddings.py` | Per-bin cosine/L2/L1 distance between two embeddings |
| `scripts/model.py` | Shared Encoder class (imported by train and encode scripts) |
---
## Citation
If you use this code or data in your research, please cite:
```bibtex
@software{okada_chromatin_gnn_2024,
author = {Okada, Toru},
title = {{chromatin-gnn: Variational Graph Autoencoder for learning
latent representations of chromatin topology from Hi-C data}},
year = {2024},
doi = {10.5281/zenodo.XXXXXXX},
url = {https://github.com/ToruOkadaOi/chromatin-gnn},
version = {1.0.0}
}
```
See also `CITATION.cff` for CFF-format metadata.
**Key references:**
- Kipf, T. N. & Welling, M. (2016). Variational Graph Auto-Encoders. *arXiv:1611.07308*
- Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. *ICLR 2017*
- Rao, S. S. P. et al. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. *Cell*, 159(7), 16651680. https://doi.org/10.1016/j.cell.2014.11.021
- Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. *Science*, 326(5950), 289293.
---
## License
MIT — see [LICENSE](LICENSE).