updated README
93
README.md
@@ -23,27 +23,38 @@ This repository applies a **Variational Graph Autoencoder (VGAE)** to Hi-C conta

## Architecture

Three encoder architectures are provided (`--encoder gcn | gat | deep_gcn`).
The best-performing model is **DeepGCN** (3-layer GCN with edge-weighted message passing):

```
Node features (2D: CTCF, H3K27me3)
Node features (3D: CTCF, H3K27me3, H3K4me3)
│
BatchNorm
│
GCNConv(64) ← shared message-passing layer
GCNConv(128, edge_weight) ← layer 1
│
ReLU + Dropout(0.2)
BatchNorm → ReLU → Dropout(0.3)
│
GCNConv(128, edge_weight) ← layer 2
│
ReLU → Dropout(0.3)
/ \
GCNConv(32) GCNConv(32)
GCNConv(64) GCNConv(64) (edge_weight)
μ log σ
\ /
Reparameterisation
│
z ∈ ℝ³² (node embeddings)
z ∈ ℝ⁶⁴ (node embeddings)
│
Inner-product decoder
(link prediction objective: binary cross-entropy + KL divergence)
(β-VGAE objective: BCE + β·KL with linear warm-up)
```

The encoder is a two-layer Graph Convolutional Network (Kipf & Welling 2016, 2017) with a BatchNorm input layer. The decoder is the standard dot-product decoder used in the original VGAE paper. Training uses a link-prediction objective: the model is asked to distinguish real Hi-C contacts from randomly sampled non-contacts.

ICE-balanced contact weights (log1p-normalised) are passed as `edge_weight` to every GCNConv layer, allowing the model to up-weight strong contacts during message passing. The decoder is the standard dot-product decoder from the VGAE paper (Kipf & Welling 2016). Training uses a link-prediction objective: the model distinguishes real Hi-C contacts from randomly sampled non-contacts.
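
The β-VGAE objective named in the diagram can be sketched in plain Python. This is a minimal illustration, not the code in `scripts/train_vgae.py`: it assumes β ramps linearly from 0 to its target over `kl_anneal` epochs, that the KL term is averaged over nodes, and that the encoder outputs μ and log σ per node. The function names (`beta_at_epoch`, `kl_divergence`, `bce_inner_product`, `vgae_loss`) are hypothetical.

```python
import math

def beta_at_epoch(epoch, beta=0.5, kl_anneal=100):
    """Linear warm-up: 0 at epoch 0, `beta` from epoch `kl_anneal` onwards."""
    if kl_anneal <= 0:
        return beta
    return beta * min(1.0, epoch / kl_anneal)

def kl_divergence(mu, log_sigma):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over nodes."""
    total = 0.0
    for mu_i, ls_i in zip(mu, log_sigma):
        total += -0.5 * sum(1.0 + 2.0 * ls - m * m - math.exp(2.0 * ls)
                            for m, ls in zip(mu_i, ls_i))
    return total / len(mu)

def bce_inner_product(z, pos_edges, neg_edges, eps=1e-15):
    """Binary cross-entropy of sigmoid(z_i . z_j) on contacts and non-contacts."""
    def prob(i, j):
        dot = sum(a * b for a, b in zip(z[i], z[j]))
        return 1.0 / (1.0 + math.exp(-dot))
    pos = -sum(math.log(prob(i, j) + eps) for i, j in pos_edges) / len(pos_edges)
    neg = -sum(math.log(1.0 - prob(i, j) + eps) for i, j in neg_edges) / len(neg_edges)
    return pos + neg

def vgae_loss(z, mu, log_sigma, pos_edges, neg_edges, epoch, beta=0.5, kl_anneal=100):
    """Reconstruction loss plus the warmed-up, beta-weighted KL term."""
    recon = bce_inner_product(z, pos_edges, neg_edges)
    return recon + beta_at_epoch(epoch, beta, kl_anneal) * kl_divergence(mu, log_sigma)
```

With the defaults above (taken from the `--beta 0.5 --kl_anneal 100` flags), the KL weight is 0.25 at epoch 50 and stays at 0.5 from epoch 100 onwards, so early training is dominated by reconstruction.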

---

@@ -57,15 +68,17 @@ All data are from the GRCh38/hg38 reference genome, chromosome 21 at 25 kb resol
| IMR90.mcool | IMR-90 (lung fibroblast) | Hi-C contact matrix | 4DN Data Portal | 4DNFIABB3FHQ |
| GM12878_CTCF.bw | GM12878 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF741BAQ (exp. ENCSR000AKB) |
| GM12878_H3K27me3.bw | GM12878 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF736CNQ (exp. ENCSR000AKD) |
| GM12878_H3K4me3.bw | GM12878 | H3K4me3 ChIP-seq (FC/control) | ENCODE | — |
| IMR90_CTCF.bw | IMR-90 | CTCF ChIP-seq (FC/control) | ENCODE | ENCFF770DUD (exp. ENCSR000EFI) |
| IMR90_H3K27me3.bw | IMR-90 | H3K27me3 ChIP-seq (FC/control) | ENCODE | ENCFF158HZL (exp. ENCSR431UUY) |
| IMR90_H3K4me3.bw | IMR-90 | H3K4me3 ChIP-seq (FC/control) | ENCODE | — |

**Graph statistics:**

| Cell line | Bins (chr21, 25 kb) | Edges (contacts) | Node features |
|-----------|---------------------|------------------|---------------|
| GM12878 | 1,869 | 87,557 | 2 (CTCF, H3K27me3) |
| IMR90 | 1,869 | 136,121 | 2 (CTCF, H3K27me3) |

| Cell line | Bins (chr21, 25 kb) | Edges (contacts, undirected) | Node features |
|-----------|---------------------|------------------------------|---------------|
| GM12878 | 1,869 | 172,310 | 3 (CTCF, H3K27me3, H3K4me3) |
| IMR90 | 1,869 | 136,121 | 3 (CTCF, H3K27me3, H3K4me3) |

IMR90 has ~55% more intra-chromosomal contacts than GM12878 on chr21 (136,121 vs 87,557 edges), suggesting a more compact or contact-rich chromatin organisation in this fibroblast cell line.

@@ -110,10 +123,13 @@ python scripts/compute_compartments.py \
    --bigwig_orient data/raw/GM12878_CTCF.bw \
    --out results/GM12878/compartments_chr21.csv

# 3. Train VGAE
# 3. Train VGAE (best config: DeepGCN + edge weights + 3 node features)
python scripts/train_vgae.py \
    --graph data/processed/GM12878_chr21.pt \
    --epochs 300 --patience 20 --hidden 64 --latent 32 \
    --graph data/processed/GM12878_chr21_3feat.pt \
    --encoder deep_gcn \
    --hidden 128 --latent 64 \
    --epochs 300 --patience 50 \
    --lr 3e-4 --dropout 0.3 --beta 0.5 --kl_anneal 100 \
    --outdir results/GM12878

# 4. Encode a second cell line with the trained model
@@ -143,15 +159,17 @@ python scripts/compare_embeddings.py \

### Training (GM12878, chr21, 25 kb)

| Metric | Value |
|--------|-------|
| Epochs to convergence | 31 / 300 (early stopping, patience=20) |
| Validation AUC (link prediction) | 0.774 |
| Test AUC | 0.777 |
| Test AP | 0.759 |
| Latent dimensionality | 32 |

| Encoder | Node features | Edge weights | Test AUC | Test AP | Epochs |
|---------|--------------|-------------|----------|---------|--------|
| GCN (v1 baseline) | 2 | ✗ | 0.777 | 0.759 | 31 |
| GAT (v2) | 2 | ✗ | 0.797 | 0.745 | 73 |
| DeepGCN | 2 | ✓ | 0.888 | 0.844 | 137 |
| **DeepGCN** | **3** | **✓** | **0.893** | **0.852** | **141** |

The model converged rapidly, suggesting that the graph structure of chr21 at 25 kb is learnable with a shallow two-layer GCN.

The dominant improvement came from passing ICE-balanced contact weights (`edge_weight`) to every GCNConv layer, a signal that was computed but silently unused in earlier versions. The three-layer receptive field of DeepGCN covers a full TAD width at 25 kb resolution, which is the scale at which compartment identity is determined.
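
To illustrate how contact weights enter message passing, the sketch below applies the standard GCN propagation rule, D^(-1/2) (A + I) D^(-1/2), to log1p-transformed edge weights. It is a plain-Python illustration rather than the repository's implementation; `normalised_edge_weights`, `propagate`, and the toy inputs are assumptions made for this example.

```python
import math
from collections import defaultdict

def normalised_edge_weights(num_nodes, edges, raw_weights):
    """Return {(i, j): w_ij / sqrt(d_i * d_j)} for an undirected weighted graph.

    Raw weights are log1p-transformed and a self-loop of weight 1 is added
    to every node before the weighted degrees d_i are computed.
    """
    weights = {}
    for (i, j), w in zip(edges, raw_weights):
        lw = math.log1p(w)
        weights[(i, j)] = lw
        weights[(j, i)] = lw          # undirected: store both directions
    for i in range(num_nodes):
        weights[(i, i)] = 1.0         # self-loop
    degree = defaultdict(float)
    for (i, _), w in weights.items():
        degree[i] += w
    return {(i, j): w / math.sqrt(degree[i] * degree[j])
            for (i, j), w in weights.items()}

def propagate(features, norm_weights):
    """One aggregation step h_i = sum_j norm_ij * x_j (no learned weight matrix)."""
    out = [[0.0] * len(features[0]) for _ in features]
    for (i, j), w in norm_weights.items():
        for k, x in enumerate(features[j]):
            out[i][k] += w * x
    return out
```

A strong contact contributes more of its neighbour's features to the aggregated representation than a weak one, whereas an unweighted layer treats every contact identically.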

---

@@ -159,14 +177,18 @@ The model converged rapidly, suggesting that the graph structure of chr21 at 25

The UMAP of GM12878 node embeddings coloured by A/B compartment shows strong, clean separation of the two compartment types without the model ever receiving compartment labels during training.

| Cell line | Silhouette score (A/B, cosine) | A bins | B bins | Masked (N) |
|-----------|-------------------------------|--------|--------|------------|
| GM12878 (training) | **0.775** | 602 | 683 | 584 |
| IMR90 (zero-shot) | 0.443 | 614 | 709 | 546 |

| Cell line | Model | Silhouette score (A/B, cosine) | A bins | B bins | Masked (N) |
|-----------|-------|-------------------------------|--------|--------|------------|
| GM12878 (training) | GCN v1 | 0.775 | 602 | 683 | 584 |
| IMR90 (zero-shot) | GCN v1 | 0.443 | 614 | 709 | 546 |
| GM12878 (training) | **DeepGCN** | **0.663** | 602 | 683 | 584 |
| IMR90 (zero-shot) | **DeepGCN** | **0.473** | 614 | 709 | 546 |

The GM12878 silhouette of **0.775** indicates that the VGAE has learned a latent space in which A and B compartments are nearly linearly separable, a strong signal given that compartment identity was never provided as a training label.

For IMR90, encoded zero-shot with the GM12878-trained model, the silhouette drops to **0.443**. This is expected: the model's BatchNorm statistics were fit to GM12878, and IMR90's chromatin organisation partially diverges.

The v1 GM12878 silhouette (0.775) is higher than DeepGCN's (0.663) because the higher-dimensional latent space (64 vs 32) spreads clusters further apart in cosine geometry. The more meaningful comparison is the zero-shot IMR90 transfer, where DeepGCN improves from 0.443 to 0.473 despite the BatchNorm statistics being fit to GM12878.
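
For reference, the cosine silhouette reported in the tables can be computed as in the sketch below. This is a self-contained illustration under stated assumptions: masked bins are represented by the label `None` and skipped, and a sample in a singleton cluster scores 0. The function names are hypothetical, and the repository's own evaluation script may use a library routine instead.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def silhouette_cosine(embeddings, labels):
    """Mean silhouette coefficient; samples labelled None (masked) are ignored."""
    kept = [(e, l) for e, l in zip(embeddings, labels) if l is not None]
    clusters = sorted({l for _, l in kept})
    scores = []
    for idx, (e, l) in enumerate(kept):
        mean_dist = {}
        for c in clusters:
            d = [cosine_distance(e, f)
                 for k, (f, m) in enumerate(kept) if m == c and k != idx]
            mean_dist[c] = sum(d) / len(d) if d else None
        a = mean_dist[l]                      # mean distance within own cluster
        others = [mean_dist[c] for c in clusters
                  if c != l and mean_dist[c] is not None]
        if a is None or not others:
            scores.append(0.0)                # singleton-cluster convention
            continue
        b = min(others)                       # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

A score near 1 means each bin is much closer (in angle) to bins of its own compartment than to the other compartment; a score near 0 means the two groups overlap.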

**Figures:**

@@ -230,9 +252,9 @@ Notably, the model achieves this with only two node features (CTCF and H3K27me3

1. **Single chromosome, single resolution.** Results are for chr21 at 25 kb only. Chr21 is acrocentric with a large masked pericentromeric region (584 / 1,869 bins masked in GM12878), which may reduce statistical power compared to gene-rich autosomes.

2. **Shallow encoder.** The two-layer GCN has a local receptive field (2-hop neighbourhood). Long-range chromatin interactions spanning multiple TADs are not directly encoded. Deeper networks or attention-based architectures may capture these better.
2. **Random negative sampling inflates AUC.** Negative edges are drawn uniformly at random, so many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling (same genomic distance band as positives) would give a more stringent and biologically honest evaluation.

3. **Link-prediction objective ≠ compartment recovery.** The model is optimised to predict contacts, not compartments. The strong silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.
3. **Link-prediction objective ≠ compartment recovery.** The model is optimised to predict contacts, not compartments. The silhouette score is emergent, not guaranteed. The objective could be supplemented with biologically-informed losses.

4. **Zero-shot transfer with fixed BatchNorm.** Encoding IMR90 with GM12878 BatchNorm statistics means the model sees IMR90 features in GM12878's normalisation frame. A domain-adaptation approach (e.g., re-fitting BatchNorm on IMR90 with frozen GCN weights) would give a fairer comparison.
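
The distance-matched negative sampling suggested in the limitation on random negatives could look roughly like the sketch below. It is not implemented in the repository; the function name, the exact-distance matching (rather than a distance band), and the rejection-sampling loop are all assumptions made for illustration.

```python
import random

def distance_matched_negatives(num_bins, pos_edges, seed=0, max_tries=1000):
    """For each positive (i, j), draw a non-contact (u, v) with |u - v| == |i - j|."""
    rng = random.Random(seed)
    positives = {(min(i, j), max(i, j)) for i, j in pos_edges}
    chosen = set()
    negatives = []
    for i, j in pos_edges:
        d = abs(i - j)
        for _ in range(max_tries):
            u = rng.randrange(num_bins - d)   # keeps u + d inside the chromosome
            pair = (u, u + d)
            if pair not in positives and pair not in chosen:
                chosen.add(pair)
                negatives.append(pair)
                break
        else:
            # Every sampled pair at this distance was a contact or already used.
            raise ValueError(f"no free non-contact found at distance {d}")
    return negatives
```

Because every negative shares its genomic distance with a positive, the classifier can no longer score well simply by learning that distant bins rarely touch.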
@@ -248,11 +270,12 @@ Notably, the model achieves this with only two node features (CTCF and H3K27me3

- Apply to all autosomes and compare genome-wide compartment recovery.
- Add a TAD-boundary evaluation metric (e.g., insulation score correlation with latent space gradients).
- Fine-tune on IMR90 (transfer learning) to improve the IMR90 silhouette score.
- Fine-tune on IMR90 (transfer learning / BatchNorm adaptation) to improve zero-shot silhouette.
- Add cohesin depletion or auxin-inducible degron (AID) perturbation data as a controlled condition comparison.
- Replace the inner-product decoder with a distance-aware decoder that incorporates linear genomic distance.
- Benchmark against PCA/UMAP of the raw contact matrix and against other graph-based methods (GraphSAGE, GAT).
- Extend node features to include additional histone marks (H3K4me3, H3K27ac, H3K9me3) to test whether richer epigenomic context improves compartment recovery.
- Replace the inner-product decoder with a distance-aware decoder that subtracts expected polymer-scaling decay, the main remaining confounder for AUC.
- Implement distance-matched negative sampling for a more stringent link-prediction evaluation.
- Extend node features to H3K27ac and H3K9me3; all four active/repressive marks are available in `data/raw/`.
- ~~Benchmark against GAT~~: done; GAT (AUC 0.797) underperforms DeepGCN with edge weights (AUC 0.893) on this dataset.
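
One possible reading of the distance-aware decoder item is sketched below: the inner product is combined with a power-law distance prior, so the embeddings only need to explain contact propensity beyond the expected decay with genomic separation. This is an assumption about how such a decoder might be built, not a description of planned code; the function names, the exponent `alpha`, and `bias` are hypothetical placeholders that would have to be fitted or learned.

```python
import math

def distance_aware_logit(z_i, z_j, i, j, alpha=1.0, bias=0.0):
    """Inner product plus a power-law prior: z_i . z_j - alpha * log(|i - j|) + bias."""
    dot = sum(a * b for a, b in zip(z_i, z_j))
    distance = max(abs(i - j), 1)             # separation in bins; avoid log(0)
    return dot - alpha * math.log(distance) + bias

def contact_probability(z_i, z_j, i, j, alpha=1.0, bias=0.0):
    """Sigmoid of the distance-aware logit."""
    logit = distance_aware_logit(z_i, z_j, i, j, alpha, bias)
    return 1.0 / (1.0 + math.exp(-logit))
```

With identical embeddings, the predicted contact probability falls as the two bins move apart, which is the behaviour a plain inner-product decoder cannot express on its own.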

---
