# chromatin-gnn **Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data** [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.XXXXXXX.svg)](https://doi.org/10.5281/zenodo.XXXXXXX) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) --- ## Overview The three-dimensional organisation of chromatin in the nucleus is not random. Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate strongly with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional and require principled dimensionality reduction to extract biologically interpretable structure. This repository applies a **Variational Graph Autoencoder (VGAE)** to Hi-C contact data to learn a compact, continuous latent representation of chromatin topology. Genomic bins are treated as graph nodes; normalised contact frequencies define weighted edges; ChIP-seq tracks for CTCF and H3K27me3 supply node features. The model is trained end-to-end on a link-prediction objective and evaluated for its ability to recover known biological structure — A/B compartments — in an entirely unsupervised manner. --- ## Scientific question > Can a VGAE learn biologically meaningful latent representations of chromatin topology — capturing A/B compartments and cell-type-specific reorganisation — from Hi-C contact data alone, in an unsupervised manner? --- ## Architecture ``` Node features (2D: CTCF, H3K27me3) │ BatchNorm │ GCNConv(64) ← shared message-passing layer │ ReLU + Dropout(0.2) / \ GCNConv(32) GCNConv(32) μ log σ \ / Reparameterisation │ z ∈ ℝ³² (node embeddings) │ Inner-product decoder (link prediction objective: binary cross-entropy + KL divergence) ``` The encoder is a two-layer Graph Convolutional Network (Kipf & Welling 2016, 2017) with a BatchNorm input layer. The decoder is the standard dot-product decoder used in the original VGAE paper. Training uses a link-prediction objective: the model is asked to distinguish real Hi-C contacts from randomly sampled non-contacts. --- ## Dataset All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resolution. | File | Cell line | Type | Source | Accession | |------|-----------|------|--------|-----------| | GM12878.mcool | GM12878 (lymphoblastoid) | Hi-C contact matrix | 4DN Data Portal | 4DNFIRUMEC32 | | IMR90.mcool | IMR-90 (lung fibroblast) | Hi-C contact matrix | 4DN Data Portal | 4DNFIABB3FHQ | | GM12878_CTCF.bw | GM12878 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF741BAQ (exp. ENCSR000AKB) | | GM12878_H3K27me3.bw | GM12878 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF736CNQ (exp. ENCSR000AKD) | | IMR90_CTCF.bw | IMR-90 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF770DUD (exp. ENCSR000EFI) | | IMR90_H3K27me3.bw | IMR-90 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF158HZL (exp. ENCSR431UUY) | **Graph statistics:** | Cell line | Bins (chr21, 25 kb) | Edges (contacts) | Node features | |-----------|---------------------|------------------|---------------| | GM12878 | 1,869 | 87,557 | 2 (CTCF, H3K27me3) | | IMR90 | 1,869 | 136,121 | 2 (CTCF, H3K27me3) | IMR90 has ~55% more intra-chromosomal contacts than GM12878 at chr21, suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line. --- ## Installation ```bash conda create -n chromatin_gnn python=3.10 -y conda activate chromatin_gnn # CPU-only PyTorch (replace URL for GPU builds) pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu # All other dependencies pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \ "numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy seaborn tqdm ``` > **Note:** `cooler==0.9.3` requires `numpy<2.0`. The env.yml captures the exact versions used for this release. --- ## Workflow ```bash # Full end-to-end run (downloads bigwigs automatically; .mcool files must be present in data/raw/) bash run_pipeline.sh # Or run individual steps: # 1. Build contact graph python scripts/build_graph.py \ --mcool data/raw/GM12878.mcool \ --chrom chr21 --res 25000 \ --bigwigs data/raw/GM12878_CTCF.bw data/raw/GM12878_H3K27me3.bw \ --out data/processed/GM12878_chr21.pt # 2. Compute A/B compartments python scripts/compute_compartments.py \ --mcool data/raw/GM12878.mcool --chrom chr21 --res 25000 \ --bigwig_orient data/raw/GM12878_CTCF.bw \ --out results/GM12878/compartments_chr21.csv # 3. Train VGAE python scripts/train_vgae.py \ --graph data/processed/GM12878_chr21.pt \ --epochs 300 --patience 20 --hidden 64 --latent 32 \ --outdir results/GM12878 # 4. Encode a second cell line with the trained model python scripts/encode_graph.py \ --model results/GM12878/model.pt \ --graph data/processed/IMR90_chr21.pt \ --out results/IMR90/emb.npy # 5. UMAP visualisation python scripts/visualize_embeddings.py \ --emb results/GM12878/emb.npy results/IMR90/emb.npy \ --labels GM12878 IMR90 \ --compartments results/GM12878/compartments_chr21.csv \ results/IMR90/compartments_chr21.csv \ --prefix results/figures/umap # 6. Per-bin embedding comparison python scripts/compare_embeddings.py \ --emb1 results/GM12878/emb.npy --emb2 results/IMR90/emb.npy \ --label1 GM12878 --label2 IMR90 \ --prefix results/figures/chr21 ``` --- ## Results ### Training (GM12878, chr21, 25 kb) | Metric | Value | |--------|-------| | Epochs to convergence | 31 / 300 (early stopping, patience=20) | | Validation AUC (link prediction) | 0.774 | | Test AUC | 0.777 | | Test AP | 0.759 | | Latent dimensionality | 32 | The model converged rapidly, suggesting that the graph structure of chr21 at 25 kb is learnable with a shallow two-layer GCN. --- ### A/B compartment separation in the latent space The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training. | Cell line | Silhouette score (A/B, cosine) | A bins | B bins | Masked (N) | |-----------|-------------------------------|--------|--------|------------| | GM12878 (training) | **0.775** | 602 | 683 | 584 | | IMR90 (zero-shot) | 0.443 | 614 | 709 | 546 | The GM12878 silhouette of **0.775** indicates that the VGAE has learned a latent space in which A and B compartments are nearly linearly separable — a strong signal given that compartment identity was never provided as a training label. For IMR90, encoded zero-shot with the GM12878-trained model, the silhouette drops to **0.443**. This is expected: the model's BatchNorm statistics were fit to GM12878, and IMR90's chromatin organisation partially diverges. **Figures:** | Figure | Description | |--------|-------------| | `results/figures/umap_GM12878_compartment.png` | GM12878 UMAP coloured by A/B compartment | | `results/figures/umap_GM12878_position.png` | GM12878 UMAP coloured by genomic position | | `results/figures/umap_IMR90_compartment.png` | IMR90 UMAP coloured by A/B compartment | | `results/figures/umap_joint.png` | Joint UMAP of both cell lines | | `results/figures/chr21_delta.png` | Per-bin cosine distance track (GM12878 vs IMR90) | --- ### Cell-type comparison: GM12878 vs IMR90 The IMR90 graph was encoded with the GM12878-trained model (zero-shot transfer). Per-bin cosine distances between the two embedding matrices reveal which genomic loci undergo the largest chromatin reorganisation between cell types. **Per-bin cosine distance summary:** | Statistic | Value | |-----------|-------| | Mean | 0.245 | | Median | 0.028 | | Bins with distance < 0.1 (stable) | 968 / 1,869 (52%) | | Bins with distance > 0.5 (high shift) | 293 / 1,869 (16%) | **Mean cosine distance by GM12878 compartment:** | Compartment | Mean distance | Median distance | |-------------|--------------|-----------------| | A (active) | 0.042 | 0.001 | | B (repressive) | 0.451 | 0.352 | | N (masked) | 0.213 | 0.000 | **Key finding:** A-compartment bins are nearly invariant between the two cell types (mean Δ = 0.042), while B-compartment bins shift substantially (mean Δ = 0.451). This is consistent with the known biology: constitutively active chromatin domains tend to be conserved across cell types, while heterochromatic B-compartment organisation is more cell-type-specific. **Compartment switches (GM12878 → IMR90):** | GM12878 | IMR90 | Bins | Interpretation | |---------|-------|------|----------------| | A | A | 493 | Stable active | | B | B | 581 | Stable repressive | | B → A | A | 101 | Loci that open in IMR90 | | A → B | B | 70 | Loci that close in IMR90 | 101 loci switch from B (repressive in GM12878) to A (active in IMR90), versus 70 in the reverse direction. This asymmetry suggests that IMR90 fibroblasts activate more lineage-specific loci on chr21 than GM12878 lymphoblastoid cells. --- ## Biological interpretation The results demonstrate that a VGAE trained on Hi-C data **recovers A/B compartment structure without supervision** (silhouette = 0.775). The latent space organises chr21 bins according to their chromatin state, not just their linear genomic position — the UMAP coloured by genomic position shows a broadly continuous gradient, while the compartment-coloured UMAP shows discrete clusters. The zero-shot application to IMR90 captures the partial conservation of compartment organisation across cell types. The B-compartment instability revealed by the cosine distance analysis is consistent with the literature: heterochromatin rewiring is a known driver of cell-type identity (Lieberman-Aiden et al. 2009; Dixon et al. 2015). Notably, the model achieves this with only two node features (CTCF and H3K27me3 signal), demonstrating that a modest set of epigenomic marks, combined with contact topology, is sufficient to encode the major axis of chromatin organisation. --- ## Limitations 1. **Single chromosome, single resolution.** Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes. 2. **Shallow encoder.** The two-layer GCN has a local receptive field (2-hop neighbourhood). Long-range chromatin interactions spanning multiple TADs are not directly encoded. Deeper networks or attention-based architectures may capture these better. 3. **Link-prediction objective ≠ compartment recovery.** The model is optimised to predict contacts, not compartments. The strong silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses. 4. **Zero-shot transfer with fixed BatchNorm.** Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison. 5. **Compartment calling is approximate.** The O/E → Pearson correlation → PC1 pipeline is sensitive to the choice of orientation signal (CTCF here). Bins with low coverage are masked and assigned no compartment label, which affects the silhouette calculation. 6. **No TAD-level evaluation.** TAD boundary detection would require a graph-level or boundary-aware metric. The current evaluation is node-level only. 7. **No statistical significance testing.** The compartment switch counts (101 B→A, 70 A→B) have not been tested for significance against a null model (e.g., random permutation of compartment labels). --- ## Future work - Apply to all autosomes and compare genome-wide compartment recovery. - Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients). - Fine-tune on IMR90 (transfer learning) to improve the IMR90 silhouette score. - Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison. - Replace the inner-product decoder with a distance-aware decoder that incorporates linear genomic distance. - Benchmark against PCA/UMAP of the raw contact matrix and against other graph-based methods (GraphSAGE, GAT). - Extend node features to include additional histone marks (H3K4me3, H3K27ac, H3K9me3) to test whether richer epigenomic context improves compartment recovery. --- ## Scripts | Script | Purpose | |--------|---------| | `scripts/build_graph.py` | Convert .mcool + bigWigs → PyG Data object | | `scripts/train_vgae.py` | Train VGAE with link-prediction objective + early stopping | | `scripts/encode_graph.py` | Encode a new graph with a trained model | | `scripts/compute_compartments.py` | O/E matrix → Pearson PCA → A/B compartment calls | | `scripts/visualize_embeddings.py` | UMAP + compartment visualisation + silhouette | | `scripts/compare_embeddings.py` | Per-bin cosine/L2/L1 distance between two embeddings | | `scripts/model.py` | Shared Encoder class (imported by train and encode scripts) | --- ## Citation If you use this code or data in your research, please cite: ```bibtex @software{okada_chromatin_gnn_2024, author = {Okada, Toru}, title = {{chromatin-gnn: Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data}}, year = {2024}, doi = {10.5281/zenodo.XXXXXXX}, url = {https://github.com/ToruOkadaOi/chromatin-gnn}, version = {1.0.0} } ``` See also `CITATION.cff` for CFF-format metadata. **Key references:** - Kipf, T. N. & Welling, M. (2016). Variational Graph Auto-Encoders. *arXiv:1611.07308* - Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. *ICLR 2017* - Rao, S. S. P. et al. (2014). A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. *Cell*, 159(7), 1665–1680. https://doi.org/10.1016/j.cell.2014.11.021 - Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. *Science*, 326(5950), 289–293. --- ## License MIT — see [LICENSE](LICENSE).