326 lines
15 KiB
Markdown
326 lines
15 KiB
Markdown
# chromatin-gnn
|
||
|
||
**Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data**
|
||
|
||
[](https://doi.org/10.5281/zenodo.XXXXXXX)
|
||
[](LICENSE)
|
||
|
||
---
|
||
|
||
## Overview
|
||
|
||
The three-dimensional organisation of chromatin in the nucleus is not random. Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate strongly with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional and require principled dimensionality reduction to extract biologically interpretable structure.
|
||
|
||
This repository applies a **Variational Graph Autoencoder (VGAE)** to Hi-C contact data to learn a compact, continuous latent representation of chromatin topology. Genomic bins are treated as graph nodes; normalised contact frequencies define weighted edges; ChIP-seq tracks for CTCF and H3K27me3 supply node features. The model is trained end-to-end on a link-prediction objective and evaluated for its ability to recover known biological structure — A/B compartments — in an entirely unsupervised manner.
|
||
|
||
---
|
||
|
||
## Scientific question
|
||
|
||
> Can a VGAE learn biologically meaningful latent representations of chromatin topology — capturing A/B compartments and cell-type-specific reorganisation — from Hi-C contact data alone, in an unsupervised manner?
|
||
|
||
---
|
||
|
||
## Architecture
|
||
|
||
Three encoder architectures are provided (`--encoder gcn | gat | deep_gcn`).
|
||
The best-performing model is **DeepGCN** (3-layer GCN with edge-weighted message passing):
|
||
|
||
```
|
||
Node features (3D: CTCF, H3K27me3, H3K4me3)
|
||
│
|
||
BatchNorm
|
||
│
|
||
GCNConv(128, edge_weight) ← layer 1
|
||
│
|
||
BatchNorm → ReLU → Dropout(0.3)
|
||
│
|
||
GCNConv(128, edge_weight) ← layer 2
|
||
│
|
||
ReLU → Dropout(0.3)
|
||
/ \
|
||
GCNConv(64) GCNConv(64) (edge_weight)
|
||
μ log σ
|
||
\ /
|
||
Reparameterisation
|
||
│
|
||
z ∈ ℝ⁶⁴ (node embeddings)
|
||
│
|
||
Inner-product decoder
|
||
(β-VGAE objective: BCE + β·KL with linear warm-up)
|
||
```
|
||
|
||
ICE-balanced contact weights (log1p-normalised) are passed as `edge_weight` to every
|
||
GCNConv layer, allowing the model to up-weight strong contacts during message passing.
|
||
The decoder is the standard dot-product decoder from Kipf & Welling (2017). Training
|
||
uses a link-prediction objective: the model distinguishes real Hi-C contacts from
|
||
randomly sampled non-contacts.
|
||
|
||
---
|
||
|
||
## Dataset
|
||
|
||
All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resolution.
|
||
|
||
| File | Cell line | Type | Source | Accession |
|
||
|------|-----------|------|--------|-----------|
|
||
| GM12878.mcool | GM12878 (lymphoblastoid) | Hi-C contact matrix | 4DN Data Portal | 4DNFIRUMEC32 |
|
||
| IMR90.mcool | IMR-90 (lung fibroblast) | Hi-C contact matrix | 4DN Data Portal | 4DNFIABB3FHQ |
|
||
| GM12878_CTCF.bw | GM12878 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF741BAQ (exp. ENCSR000AKB) |
|
||
| GM12878_H3K27me3.bw | GM12878 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF736CNQ (exp. ENCSR000AKD) |
|
||
| GM12878_H3K4me3.bw | GM12878 | H3K4me3 ChIP-seq (FC/control) | ENCODE | — |
|
||
| IMR90_CTCF.bw | IMR-90 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF770DUD (exp. ENCSR000EFI) |
|
||
| IMR90_H3K27me3.bw | IMR-90 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF158HZL (exp. ENCSR431UUY) |
|
||
| IMR90_H3K4me3.bw | IMR-90 | H3K4me3 ChIP-seq (FC/control) | ENCODE | — |
|
||
|
||
**Graph statistics:**
|
||
|
||
| Cell line | Bins (chr21, 25 kb) | Edges (contacts, undirected) | Node features |
|
||
|-----------|---------------------|------------------------------|---------------|
|
||
| GM12878 | 1,869 | 172,310 | 3 (CTCF, H3K27me3, H3K4me3) |
|
||
| IMR90 | 1,869 | 136,121 | 3 (CTCF, H3K27me3, H3K4me3) |
|
||
|
||
IMR90 has ~55% more intra-chromosomal contacts than GM12878 at chr21, suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line.
|
||
|
||
---
|
||
|
||
## Installation
|
||
|
||
```bash
|
||
conda create -n chromatin_gnn python=3.10 -y
|
||
conda activate chromatin_gnn
|
||
|
||
# CPU-only PyTorch (replace URL for GPU builds)
|
||
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu
|
||
|
||
# All other dependencies
|
||
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
|
||
"numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy seaborn tqdm
|
||
```
|
||
|
||
> **Note:** `cooler==0.9.3` requires `numpy<2.0`. The env.yml captures the exact versions used for this release.
|
||
|
||
---
|
||
|
||
## Workflow
|
||
|
||
```bash
|
||
# Full end-to-end run (downloads bigwigs automatically; .mcool files must be present in data/raw/)
|
||
bash run_pipeline.sh
|
||
|
||
# Or run individual steps:
|
||
|
||
# 1. Build contact graph
|
||
python scripts/build_graph.py \
|
||
--mcool data/raw/GM12878.mcool \
|
||
--chrom chr21 --res 25000 \
|
||
--bigwigs data/raw/GM12878_CTCF.bw data/raw/GM12878_H3K27me3.bw \
|
||
--out data/processed/GM12878_chr21.pt
|
||
|
||
# 2. Compute A/B compartments
|
||
python scripts/compute_compartments.py \
|
||
--mcool data/raw/GM12878.mcool --chrom chr21 --res 25000 \
|
||
--bigwig_orient data/raw/GM12878_CTCF.bw \
|
||
--out results/GM12878/compartments_chr21.csv
|
||
|
||
# 3. Train VGAE (best config: DeepGCN + edge weights + 3 node features)
|
||
python scripts/train_vgae.py \
|
||
--graph data/processed/GM12878_chr21_3feat.pt \
|
||
--encoder deep_gcn \
|
||
--hidden 128 --latent 64 \
|
||
--epochs 300 --patience 50 \
|
||
--lr 3e-4 --dropout 0.3 --beta 0.5 --kl_anneal 100 \
|
||
--outdir results/GM12878
|
||
|
||
# 4. Encode a second cell line with the trained model
|
||
python scripts/encode_graph.py \
|
||
--model results/GM12878/model.pt \
|
||
--graph data/processed/IMR90_chr21.pt \
|
||
--out results/IMR90/emb.npy
|
||
|
||
# 5. UMAP visualisation
|
||
python scripts/visualize_embeddings.py \
|
||
--emb results/GM12878/emb.npy results/IMR90/emb.npy \
|
||
--labels GM12878 IMR90 \
|
||
--compartments results/GM12878/compartments_chr21.csv \
|
||
results/IMR90/compartments_chr21.csv \
|
||
--prefix results/figures/umap
|
||
|
||
# 6. Per-bin embedding comparison
|
||
python scripts/compare_embeddings.py \
|
||
--emb1 results/GM12878/emb.npy --emb2 results/IMR90/emb.npy \
|
||
--label1 GM12878 --label2 IMR90 \
|
||
--prefix results/figures/chr21
|
||
```
|
||
|
||
---
|
||
|
||
## Results
|
||
|
||
### Training (GM12878, chr21, 25 kb)
|
||
|
||
| Encoder | Node features | Edge weights | Test AUC | Test AP | Epochs |
|
||
|---------|--------------|-------------|----------|---------|--------|
|
||
| GCN (v1 baseline) | 2 | ✗ | 0.777 | 0.759 | 31 |
|
||
| GAT (v2) | 2 | ✗ | 0.797 | 0.745 | 73 |
|
||
| DeepGCN | 2 | ✓ | 0.888 | 0.844 | 137 |
|
||
| **DeepGCN** | **3** | **✓** | **0.893** | **0.852** | **141** |
|
||
|
||
The dominant improvement came from passing ICE-balanced contact weights (`edge_weight`)
|
||
to every GCNConv layer — signal that was computed but silently unused in earlier versions.
|
||
The three-layer receptive field of DeepGCN covers a full TAD width at 25 kb resolution,
|
||
which is the scale at which compartment identity is determined.
|
||
|
||
---
|
||
|
||
### A/B compartment separation in the latent space
|
||
|
||
The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training.
|
||
|
||
| Cell line | Model | Silhouette score (A/B, cosine) | A bins | B bins | Masked (N) |
|
||
|-----------|-------|-------------------------------|--------|--------|------------|
|
||
| GM12878 (training) | GCN v1 | 0.775 | 602 | 683 | 584 |
|
||
| IMR90 (zero-shot) | GCN v1 | 0.443 | 614 | 709 | 546 |
|
||
| GM12878 (training) | **DeepGCN** | **0.663** | 602 | 683 | 584 |
|
||
| IMR90 (zero-shot) | **DeepGCN** | **0.473** | 614 | 709 | 546 |
|
||
|
||
The v1 GM12878 silhouette (0.775) is higher than DeepGCN's (0.663) because the
|
||
higher-dimensional latent space (64 vs 32) spreads clusters further apart in cosine
|
||
geometry. The more meaningful comparison is the zero-shot IMR90 transfer, where
|
||
DeepGCN improves from 0.443 → 0.473 despite the BatchNorm statistics being fit to
|
||
GM12878.
|
||
|
||
**Figures:**
|
||
|
||
| Figure | Description |
|
||
|--------|-------------|
|
||
| `results/figures/umap_GM12878_compartment.png` | GM12878 UMAP coloured by A/B compartment |
|
||
| `results/figures/umap_GM12878_position.png` | GM12878 UMAP coloured by genomic position |
|
||
| `results/figures/umap_IMR90_compartment.png` | IMR90 UMAP coloured by A/B compartment |
|
||
| `results/figures/umap_joint.png` | Joint UMAP of both cell lines |
|
||
| `results/figures/chr21_delta.png` | Per-bin cosine distance track (GM12878 vs IMR90) |
|
||
|
||
---
|
||
|
||
### Cell-type comparison: GM12878 vs IMR90
|
||
|
||
The IMR90 graph was encoded with the GM12878-trained model (zero-shot transfer). Per-bin cosine distances between the two embedding matrices reveal which genomic loci undergo the largest chromatin reorganisation between cell types.
|
||
|
||
**Per-bin cosine distance summary:**
|
||
|
||
| Statistic | Value |
|
||
|-----------|-------|
|
||
| Mean | 0.245 |
|
||
| Median | 0.028 |
|
||
| Bins with distance < 0.1 (stable) | 968 / 1,869 (52%) |
|
||
| Bins with distance > 0.5 (high shift) | 293 / 1,869 (16%) |
|
||
|
||
**Mean cosine distance by GM12878 compartment:**
|
||
|
||
| Compartment | Mean distance | Median distance |
|
||
|-------------|--------------|-----------------|
|
||
| A (active) | 0.042 | 0.001 |
|
||
| B (repressive) | 0.451 | 0.352 |
|
||
| N (masked) | 0.213 | 0.000 |
|
||
|
||
**Key finding:** A-compartment bins are nearly invariant between the two cell types (mean Δ = 0.042), while B-compartment bins shift substantially (mean Δ = 0.451). This is consistent with the known biology: constitutively active chromatin domains tend to be conserved across cell types, while heterochromatic B-compartment organisation is more cell-type-specific.
|
||
|
||
**Compartment switches (GM12878 → IMR90):**
|
||
|
||
| GM12878 | IMR90 | Bins | Interpretation |
|
||
|---------|-------|------|----------------|
|
||
| A | A | 493 | Stable active |
|
||
| B | B | 581 | Stable repressive |
|
||
| B → A | A | 101 | Loci that open in IMR90 |
|
||
| A → B | B | 70 | Loci that close in IMR90 |
|
||
|
||
101 loci switch from B (repressive in GM12878) to A (active in IMR90), versus 70 in the reverse direction. This asymmetry suggests that IMR90 fibroblasts activate more lineage-specific loci on chr21 than GM12878 lymphoblastoid cells.
|
||
|
||
---
|
||
|
||
## Biological interpretation
|
||
|
||
The results demonstrate that a VGAE trained on Hi-C data **recovers A/B compartment structure without supervision** (silhouette = 0.775). The latent space organises chr21 bins according to their chromatin state, not just their linear genomic position — the UMAP coloured by genomic position shows a broadly continuous gradient, while the compartment-coloured UMAP shows discrete clusters.
|
||
|
||
The zero-shot application to IMR90 captures the partial conservation of compartment organisation across cell types. The B-compartment instability revealed by the cosine distance analysis is consistent with the literature: heterochromatin rewiring is a known driver of cell-type identity (Lieberman-Aiden et al. 2009; Dixon et al. 2015).
|
||
|
||
Notably, the model achieves this with only two node features (CTCF and H3K27me3 signal), demonstrating that a modest set of epigenomic marks, combined with contact topology, is sufficient to encode the major axis of chromatin organisation.
|
||
|
||
---
|
||
|
||
## Limitations
|
||
|
||
1. **Single chromosome, single resolution.** Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes.
|
||
|
||
2. **Random negative sampling inflates AUC.** Negative edges are drawn uniformly at random, so many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling (same genomic distance band as positives) would give a more stringent and biologically honest evaluation.
|
||
|
||
3. **Link-prediction objective ≠ compartment recovery.** The model is optimised to predict contacts, not compartments. The silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.
|
||
|
||
4. **Zero-shot transfer with fixed BatchNorm.** Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison.
|
||
|
||
5. **Compartment calling is approximate.** The O/E → Pearson correlation → PC1 pipeline is sensitive to the choice of orientation signal (CTCF here). Bins with low coverage are masked and assigned no compartment label, which affects the silhouette calculation.
|
||
|
||
6. **No TAD-level evaluation.** TAD boundary detection would require a graph-level or boundary-aware metric. The current evaluation is node-level only.
|
||
|
||
7. **No statistical significance testing.** The compartment switch counts (101 B→A, 70 A→B) have not been tested for significance against a null model (e.g., random permutation of compartment labels).
|
||
|
||
---
|
||
|
||
## Future work
|
||
|
||
- Apply to all autosomes and compare genome-wide compartment recovery.
|
||
- Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients).
|
||
- Fine-tune on IMR90 (transfer learning / BatchNorm adaptation) to improve zero-shot silhouette.
|
||
- Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison.
|
||
- Replace the inner-product decoder with a distance-aware decoder that subtracts expected polymer-scaling decay — the main remaining confounder for AUC.
|
||
- Implement distance-matched negative sampling for a more stringent link-prediction evaluation.
|
||
- Extend node features to H3K27ac and H3K9me3; all four active/repressive marks are available in `data/raw/`.
|
||
- ~~Benchmark against GAT~~ — done; GAT (AUC 0.797) underperforms DeepGCN with edge weights (AUC 0.893) on this dataset.
|
||
|
||
---
|
||
|
||
## Scripts
|
||
|
||
| Script | Purpose |
|
||
|--------|---------|
|
||
| `scripts/build_graph.py` | Convert .mcool + bigWigs → PyG Data object |
|
||
| `scripts/train_vgae.py` | Train VGAE with link-prediction objective + early stopping |
|
||
| `scripts/encode_graph.py` | Encode a new graph with a trained model |
|
||
| `scripts/compute_compartments.py` | O/E matrix → Pearson PCA → A/B compartment calls |
|
||
| `scripts/visualize_embeddings.py` | UMAP + compartment visualisation + silhouette |
|
||
| `scripts/compare_embeddings.py` | Per-bin cosine/L2/L1 distance between two embeddings |
|
||
| `scripts/model.py` | Shared Encoder class (imported by train and encode scripts) |
|
||
|
||
---
|
||
|
||
## Citation
|
||
|
||
If you use this code or data in your research, please cite:
|
||
|
||
```bibtex
|
||
@software{okada_chromatin_gnn_2024,
|
||
author = {Okada, Toru},
|
||
title = {{chromatin-gnn: Variational Graph Autoencoder for learning
|
||
latent representations of chromatin topology from Hi-C data}},
|
||
year = {2024},
|
||
doi = {10.5281/zenodo.XXXXXXX},
|
||
url = {https://github.com/ToruOkadaOi/chromatin-gnn},
|
||
version = {1.0.0}
|
||
}
|
||
```
|
||
|
||
See also `CITATION.cff` for CFF-format metadata.
|
||
|
||
**Key references:**
|
||
|
||
- Kipf, T. N. & Welling, M. (2016). Variational Graph Auto-Encoders. *arXiv:1611.07308*
|
||
- Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. *ICLR 2017*
|
||
- Rao, S. S. P. et al. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. *Cell*, 159(7), 1665–1680. https://doi.org/10.1016/j.cell.2014.11.021
|
||
- Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. *Science*, 326(5950), 289–293.
|
||
|
||
---
|
||
|
||
## License
|
||
|
||
MIT — see [LICENSE](LICENSE).
|