chromatin-vgae-hic/README.md

# chromatin-gnn

**Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data**

---

## Overview

Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional, distance-dominated, and require principled dimensionality reduction to extract biologically interpretable structure.

This repository applies a **Variational Graph Autoencoder (VGAE)** to Hi-C contact data to learn compact, continuous latent representations of chromatin topology. Genomic bins are graph nodes;

ICE-balanced (normalized) contact frequencies are the weighted edges;

ChIP-seq tracks supply node features.

The model is trained end-to-end on a link-prediction objective and evaluated across 3 hypotheses as below

---

## Scientific hypotheses

### Hypothesis 1 — Representation learning

> A VGAE trained on Hi-C contact graphs learns low-dimensional latent representations that capture biologically meaningful chromatin organisation in an unsupervised manner.

### Hypothesis 2 — Dynamic rewiring

> Changes in chromatin state between conditions manifest as measurable shifts in latent graph embeddings, with stable regulatory domains remaining conserved while loop-dependent architecture exhibits greater topological drift.

### Hypothesis 3 — Long-range topology *(in progress)*

> Weighted graph message passing over long-range chromatin interactions (> 1 Mb) enables the emergence of latent structure beyond what is explainable by local contact density alone.

---

## Architecture

Three encoder architectures are available (`--encoder gcn | gat | deep_gcn`).

The best-performing model is **DeepGCN** (3-layer GCN with edge-weighted message passing):

![β-VGAE architecture](chromatin-gnn/vgae_architecture.png)


ICE-balanced contact weights (log1p-normalised) are passed as `edge_weight` to every GCNConv layer.

The decoder is the standard dot-product decoder from Kipf & Welling (2016).

Training uses a link-prediction objective.

---

## Repository structure

<details>
<summary>Show layout</summary>

```
chromatin_gnn/              # importable package — model and graph utilities
experiments/
  h1_representation/        # H1: GM12878 VGAE + IMR90 zero-shot transfer
  h2_rewiring/              # H2: RAD21 depletion perturbation analysis
  h3_longrange/             # H3: graph ablation
data/
  raw/                      # gitignored — mcool, bigWig files
  processed/
    gm12878/  imr90/  hct116/
results/
  h1_representation/        # model, embeddings, compartments, figures
  h2_rewiring/              # model, drift arrays, stats, figures
  h3_longrange/
pyproject.toml              # pip install -e . makes chromatin_gnn importable
```

</details>

---

## Installation

<details>
<summary>Show install steps</summary>

```bash
conda create -n chromatin_gnn python=3.10 -y
conda activate chromatin_gnn
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
    "numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy tqdm

# local install
pip install -e .
```

####  future to do - dockerize and add mlops setup

</details>

---

## H1 — Representation learning

### Dataset

GM12878 lymphoblastoid and IMR-90 fibroblast cell lines.

Chr1 at 25 kb resolution. 25kb seems reasonable to keep the TADs & loops.

| File | Cell line | Type | Source | Accession |
|------|-----------|------|--------|-----------|
| GM12878.mcool | GM12878 | Hi-C | 4DN | 4DNFIRUMEC32 |
| IMR90.mcool | IMR-90 | Hi-C | 4DN | 4DNFIABB3FHQ |
| GM12878_CTCF.bw | GM12878 | CTCF ChIP-seq | ENCODE | ENCFF741BAQ |
| GM12878_H3K27me3.bw | GM12878 | H3K27me3 ChIP-seq | ENCODE | ENCFF736CNQ |
| GM12878_H3K4me3.bw | GM12878 | H3K4me3 ChIP-seq | ENCODE | ENCSR057BWO |
| IMR90_CTCF.bw | IMR-90 | CTCF ChIP-seq | ENCODE | ENCFF770DUD |
| IMR90_H3K27me3.bw | IMR-90 | H3K27me3 ChIP-seq | ENCODE | ENCFF158HZL |
| IMR90_H3K4me3.bw | IMR-90 | H3K4me3 ChIP-seq | ENCODE | ENCFF127NCC |

**Graph statistics (chr1, 25 kb):**

| Cell line | Bins | Edges | Node features |
|-----------|------|-------|---------------|
| GM12878 | 9,959 | ~625k | 3 (CTCF, H3K27me3, H3K4me3) |
| IMR90 | 9,959 | ~625k | 3 |

### Modeling strategy

The VGAE is trained on GM12878 chr1 with a link-prediction objective.

No compartment labels are used during training.

After training, the GM12878 model is applied zero-shot to encode IMR90 — the same model weights, no retraining.

### Results

**Link prediction (training objective):**

| Encoder | Features | Edge weights | Val AUC | Test AUC | Test AP |
|---------|----------|-------------|---------|----------|---------|
| DeepGCN | 3 ChIP | ✓ | 0.823 | 0.823 | 0.766 |

**Compartment classification (biological validation):**

A/B compartments are called by O/E Pearson PCA on the contact matrix (standard Hi-C method). VGAE embeddings are never shown compartment labels.

    To test biological validity, a logistic regression is trained on embeddings to predict A/B labels (5-fold cross-validation).

| Method | What it uses | GM12878 AUC | GM12878 AP |
|--------|-------------|-------------|------------|
| Hi-C PC1 (classical) | Full contact matrix | 1.000* | 1.000* |
| ChIP-seq features only (PCA) | Node features, no graph | 0.530 | 0.580 |
| **VGAE in-domain** | **Graph + features** | **0.700** | **0.760** |
| VGAE zero-shot IMR90 | GM12878 model on IMR90 | 0.584 | 0.549 |

\* *The Hi-C PC1 baseline is 1.000 by construction: A/B labels are defined as sign(PC1), so PC1 predicts them easily. It represents the standard method*

**Key findings:**
- VGAE AUC 0.700 > feature-only 0.530 (+0.17): the graph topology contributes significantly beyond ChIP-seq features alone.
- Spearman r(PC1, best latent dim) = **0.372** for GM12878 — the latent space has compartment-correlated structure without compartment supervision.
- Zero-shot IMR90 AUC = 0.584: partial but significant cross-cell-type transfer.

**Topology ablation (H1 extension):**

To isolate whether graph structure or node features drive compartment recovery, a VGAE is trained with **constant node features** (ones vector, 1D — no ChIP-seq). Every node looks identical as input; the model can only learn from *who contacts whom*.

| Model | Node features | Link pred AUC | Compartment AUC |
|-------|--------------|---------------|-----------------|
| Full (VGAE + ChIP) | 3 ChIP tracks | 0.823 | **0.700** |
| Topology-only | no ChIP | 0.531 | **0.487** |
| Feature-only (PCA) | 3 ChIP, no graph | — | **0.530** |

**The topology-only model converges to near-random performance (link pred AUC 0.531, compartment AUC 0.487 ≈ chance, early stopped at 58/300 epochs).**


Neither the contact graph alone nor the ChIP-seq features alone achieve meaningful compartment recovery.

The VGAE's power comes from **message passing that propagates epigenomic signals through the contact topology**: what a bin encodes in the latent space is not just its own ChIP-seq state, but the combined epigenomic state of its contact neighbourhood.

#### I need models with efficient message passing

### Genome-wide replication (chr1 – chr22)

To test whether the chr1 result generalises, the full H1 pipeline was rerun independently on all 22 autosomes (per-chromosome models, identical architecture and hyperparameters; 5-fold CV LogReg on standardised embeddings).

**Summary across 22 autosomes:**

| Metric | Mean ± SD | Range |
|---|---|---|
| Link prediction test AUC | 0.813 ± 0.032 | 0.755 – 0.887 |
| GM12878 in-domain compartment AUC | **0.833 ± 0.082** | 0.700 – 0.993 |
| IMR90 zero-shot compartment AUC | **0.839 ± 0.091** | 0.690 – 0.985 |
| GM12878 Spearman r (PC1, best dim) | 0.503 ± 0.205 | 0.21 – 0.92 |
| IMR90 Spearman r (PC1, best dim) | 0.405 ± 0.209 | 0.12 – 0.84 |

**Key findings:**
- **Zero-shot transfer matches in-domain performance** (0.839 vs 0.833). The GM12878-trained model recovers IMR90 compartments essentially as well as it recovers GM12878's own — substantially stronger evidence of biologically meaningful representations than chr1 alone suggested.
- **Link AUC is highly stable** (CV ≈ 4 %) across all 22 chromosomes — training is robust, not chr1-specific.
- **Strong chromosome-size dependence in compartment AUC:** small chromosomes recover compartments far better than large ones (chr21 AUC 0.993, chr19 0.972, chr22 0.888 vs chr1 0.700, chr8 0.734). Likely contributors: (a) smaller chromosomes have simpler / more dominant A/B partitions; (b) fewer bins lead to easier binary classification with less label imbalance.
- Methodology note: the per-chromosome aggregate uses 5-fold CV LogReg on standardised embeddings and gives chr1 GM12878 AUC = 0.700 (identical to the chr1 table above) and chr1 IMR90 AUC = 0.690 (higher than the 0.584 reported in the chr1 table, which used a single train/test split). The genome-wide protocol is the more rigorous evaluation.

Per-chrom breakdown: `results/h1_representation/genome_summary.csv`.
Run: `bash experiments/h1_representation/run_genome.sh` → `python experiments/h1_representation/aggregate.py`.

### Workflow

<details>
<summary>Show commands</summary>

```bash
export PYTHONPATH=$(pwd)
bash experiments/h1_representation/run.sh # full

# individual steps:
python experiments/h1_representation/train.py \
    --graph data/processed/gm12878/chr1.pt \
    --encoder deep_gcn --hidden 256 --latent 64 \
    --epochs 300 --patience 50 --lr 3e-4 \
    --outdir results/h1_representation

python experiments/h1_representation/evaluate.py \
    --gm12878_emb   results/h1_representation/gm12878_emb.npy \
    --imr90_emb     results/h1_representation/imr90_emb.npy \
    --gm12878_graph data/processed/gm12878/chr1.pt \
    --imr90_graph   data/processed/imr90/chr1.pt \
    --comp_gm12878  results/h1_representation/compartments/gm12878_chr1.csv \
    --comp_imr90    results/h1_representation/compartments/imr90_chr1.csv \
    --out           results/h1_representation/evaluation.json
```

</details>

### Figures

<details>
<summary>Show figure list</summary>

| Figure | Description |
|--------|-------------|
| `results/h1_representation/figures/umap_GM12878_compartment.png` | GM12878 UMAP coloured by A/B compartment |
| `results/h1_representation/figures/umap_IMR90_compartment.png` | IMR90 UMAP coloured by A/B compartment (zero-shot) |
| `results/h1_representation/figures/umap_joint.png` | Joint UMAP of both cell lines |
| `results/h1_representation/figures/chr21_delta.png` | Per-bin cosine distance (GM12878 vs IMR90) |

</details>

### Biological interpretation

- The VGAE recovers A/B compartment structure as an *emergent property* of learning contact graph topology — without any compartment supervision. The graph topology contributes +0.17 AUC over ChIP-seq features alone (0.700 vs 0.530), demonstrating that it is contact structure, not just epigenomic marking, that encodes compartment identity in the learned representation. The partial zero-shot transfer to IMR90 (AUC 0.584) reflects the partial conservation of compartment architecture across cell types.

- The VGAE does not outperform the classical Hi-C PC1 method at compartment calling. The VGAE is a general representation that captures compartment structure as a byproduct of link prediction, and which can be applied to new cell types, conditions, or downstream tasks without retraining.

- The ablation makes the mechanism precise: the graph topology and ChIP-seq features are *multiplicatively complementary*, not independently sufficient. Contact topology without epigenomic marking is uninterpretable (AUC 0.487 ≈ chance); epigenomic marking without contact topology captures only marginal structure (AUC 0.530). ***Message passing*** through the contact graph which aggregates each bin's ChIP-seq neighbourhood is the operation that produces the compartment signal.

---

## H2 — Dynamic rewiring

### Dataset

HCT-116 colorectal carcinoma, RAD21-mAC auxin-inducible degron (Rao et al. 2017, *Cell* 171:305-320). G1/S-synchronised cells. Chr1 at 25 kb resolution, 4 ChIP-seq node features.

| File | Condition | Source | Accession |
|------|-----------|--------|-----------|
| HCT116_control.mcool | Untreated G1/S | 4DN | 4DNFINIQYFKT |
| HCT116_treated_6h.mcool | 6 h auxin (RAD21-depleted) G1/S | 4DN | 4DNFIAVRY6RG |
| HCT116_CTCF.bw | HCT-116 | CTCF ChIP-seq | ENCODE | ENCFF388PVO |
| HCT116_H3K27me3.bw | HCT-116 | H3K27me3 ChIP-seq | ENCODE | ENCFF717ZKL |
| HCT116_H3K27ac.bw | HCT-116 | H3K27ac ChIP-seq | ENCODE | ENCFF984WLE |
| HCT116_H3K4me3.bw | HCT-116 | H3K4me3 ChIP-seq | ENCODE | ENCFF649ZLF |

#### more time interval including datasets?

**Graph statistics (chr1, 25 kb):**

| Condition | Bins | Edges | Node features |
|-----------|------|-------|---------------|
| Control (G1/S) | 9,959 | 947,463 | 4 (CTCF, H3K27me3, H3K27ac, H3K4me3) |
| 6 h auxin (G1/S) | 9,959 | 946,994 | 4 |

### Modeling strategy

The VGAE is trained on the **control condition only** (link-prediction objective). The trained model is then applied zero-shot to encode the treated (RAD21-depleted) graph.

Per-bin cosine distance between control and treated embeddings quantifies the embedding drift induced by cohesin loss.

Drift is decomposed into two distance bands reflecting different biological scales:
- **Short-range drift** (edges < 1 Mb): loop scale. expected high at former loop anchors after RAD21 depletion
- **Long-range drift** (edges 2–5 Mb): sub-compartment scale — expected LOW genome-wide (compartments are preserved)

### Loop anchor definition

The Rao 2017 loop list was not publicly deposited. Therefore loops were called with `cooltools dots` (v0.7.1, FDR ≤ 0.1) on the control mcool at 10 kb resolution. The G1/S-synchronised sample (~1.35 GB) is ~10× shallower than the async sample (~12.8 GB), yielding only **1 statistically significant loop** on chr1 — insufficient for enrichment testing.

The **CTCF ChIP-seq peak bins** was therefore used as a proxy (top 25th percentile CTCF signal at 25 kb). This is biologically motivated: CTCF anchors cohesin-dependent loops; RAD21 depletion dissolves loops while leaving CTCF binding intact. The proxy is fully independent of the Hi-C data (no circularity). It over-includes CTCF sites that are not loop anchors at this resolution, making the test conservative.

### Results

**Training (HCT-116 control, chr1, 25 kb):**

| Encoder | Node features | Edge weights | Test AUC | Test AP | Epochs |
|---------|--------------|-------------|----------|---------|--------|
| DeepGCN | 4 ChIP | ✓ | **0.864** | **0.843** | 300 |

**Drift decomposition and anchor enrichment:**

| Metric | Value |
|--------|-------|
| Mean full embedding drift (cosine) | 0.128 |
| Mean short-range drift (< 1 Mb; loop scale) | 0.141 |
| Mean long-range drift (2–5 Mb; sub-compartment) | 0.174 |
| CTCF anchor bins (chr1, top 25% signal) | 2,301 / 9,959 (23%) |
| Mean short-range drift at CTCF anchors | **0.152** |
| Mean short-range drift at non-anchors | **0.138** |
| Permutation p-value (1,000 shuffles) | **0.003** |

Short-range drift is significantly enriched at CTCF loop anchor bins (p = 0.003), consistent with loop-scale contact loss after RAD21/cohesin depletion (Rao et al. 2017).

> **Caveat:** The graph `max_dist` filter of 5 Mb means the "long-range" band (2–5 Mb) is sub-compartment scale, not true compartment scale (> 10 Mb). We cannot directly test A/B compartment preservation with this graph. Rebuilding with `max_dist ≥ 20 Mb` is future work. # i need a gpu server for this!

### Genome-wide replication (chr1 – chr22)

To test whether the chr1 perturbation result is general, the H2 pipeline was rerun on all 22 autosomes — per-chromosome models, per-chromosome permutation tests (1,000 shuffles each), then combined.

**Summary across 22 autosomes (independent per-chrom permutation tests):**

| Metric | Value |
|---|---|
| Chromosomes with predicted positive contrast (anchor SR > non-anchor SR) | **19 / 22** |
| Chromosomes individually significant at p < 0.05 | 13 / 22 |
| Chromosomes individually significant at p < 0.01 | 8 / 22 |
| **Fisher's combined p-value (22 independent tests)** | **1.0 × 10⁻¹⁶** |
| Mean SR drift at CTCF anchors | 0.153 |
| Mean SR drift at non-anchor bins | 0.138 |
| Mean contrast (anchor − non-anchor) | **+0.015** (≈ 11 % relative) |

**Key findings:**
- The chr1 result generalises. RAD21 depletion produces embedding drift preferentially concentrated at CTCF loop-anchor bins genome-wide.
- The effect is **small in absolute magnitude but consistent across 19 / 22 chromosomes** — the signature of a biologically real, mechanistically specific signal rather than chr1-specific noise. Even a conservative binomial sign test on the 19/22 directional outcomes gives p ≈ 7 × 10⁻⁴.
- 3 chromosomes (chr19 Δ = −0.003, chr20 Δ = −0.006, chr22 Δ = −0.012) show slight inverse contrast; chr7 is essentially flat (Δ ≈ 0). Notably chr21 — similar in size to chr19/20/22 — gives the *strongest* positive signal (Δ = +0.071, p = 0), so this is not a small-chromosome artefact. Possible explanations include chromosome-specific differences in CTCF anchor density or model convergence quality (chr22 stopped early at 58 epochs in H1, suggesting that chromosome's training may be undertrained for the perturbation test as well).
- The genome-wide Fisher combined p of 10⁻¹⁶ accumulates partial-signal chromosomes into overwhelming significance, while honestly preserving the per-chrom distribution.

Per-chrom breakdown: `results/h2_rewiring/genome_summary.csv`, `results/h2_rewiring/genome_summary.json`.
Run: `bash experiments/h2_rewiring/run_genome.sh` → `python experiments/h2_rewiring/aggregate.py`.

### Workflow

<details>
<summary>Show commands</summary>

```bash
export PYTHONPATH=$(pwd)
bash experiments/h2_rewiring/run.sh

# Or run the analysis step alone after training:
python experiments/h2_rewiring/perturbation_analysis.py \
    --control_graph data/processed/hct116/control_chr1.pt \
    --treated_graph data/processed/hct116/treated_chr1.pt \
    --model         results/h2_rewiring/model.pt \
    --loops         data/processed/hct116/loops_chr1_10kb.bedpe \
    --chrom chr1 --res 25000 \
    --short_cutoff 1000000 --long_cutoff 2000000 \
    --outdir results/h2_rewiring

python experiments/h2_rewiring/perturbation_viz.py \
    --control_emb results/h2_rewiring/control_emb.npy \
    --treated_emb results/h2_rewiring/treated_emb.npy \
    --drift_full  results/h2_rewiring/drift_full.npy \
    --drift_short results/h2_rewiring/drift_short.npy \
    --drift_long  results/h2_rewiring/drift_long.npy \
    --anchor_mask results/h2_rewiring/anchor_mask.npy \
    --stats       results/h2_rewiring/drift_stats.json \
    --outdir      results/h2_rewiring/figures --res 25000
```

</details>

### Figures

<details>
<summary>Show figure list</summary>

| Figure | Description |
|--------|-------------|
| `results/h2_rewiring/figures/umap_perturbation.png` | UMAP of control vs RAD21-depleted embeddings |
| `results/h2_rewiring/figures/drift_track_chr1.png` | Genome-browser drift track: short-range (signal) and long-range (background) along chr1 |
| `results/h2_rewiring/figures/scatter_lr_vs_sr_drift.png` | Long-range vs short-range drift scatter; CTCF anchors highlighted |
| `results/h2_rewiring/figures/barplot_anchor_drift.png` | Mean short-range drift at loop anchors vs non-anchors, with permutation p-value |

</details>

### Biological interpretation

- A VGAE trained only on the control condition detects loop-scale chromatin rewiring as an emergent shift in latent embeddings. The permutation test confirms that short-range drift is non-randomly concentrated at CTCF loop-anchor bins (p = 0.003) — the exact loci where RAD21/cohesin-dependent contacts are expected to be lost. **This shows that latent embedding drift is a biologically structured signal, not noise.**

- The training objective (link prediction) is not designed to be sensitive to contact changes between conditions. The signal emerges from the model's learned representation of neighbourhood structure: after cohesin loss, a bin's contact neighbourhood changes, and the latent vector drifts accordingly. A perturbation-aware training objective (e.g., contrastive loss) would likely amplify this signal. # but then the model becomes less exciting, drift detection should come like a surprise

---

## H3 — Long-range topology

**Hypothesis (as posed):** Chromatin interactions beyond the local genomic neighbourhood (> 1 Mb) encode non-trivial topological structure that cannot be explained by distance-decay alone, and should carry *more* compartment signal than local contacts.

**Approach.** Graph ablation. Train three VGAE variants on the same GM12878 chr1 graph (same architecture, same node features, same hyperparameters as H1):

| Variant | Edges kept | # edges (chr1) |
|---------|------------|----------------|
| Full | All ≤ 5 Mb | ~625k |
| Local only | < 250 kb (within-TAD) | ~173k |
| Long-range only | > 1 Mb (sub-compartment) | ~339k |

### Result

| Variant | Link AUC | Compartment AUC (5-fold CV) | Spearman r (best dim, PC1) |
|---------|---------:|----------------------------:|----------------------------:|
| Full | 0.823 | **0.700** | 0.372 |
| Local only | 0.874 | 0.632 | 0.266 |
| Long-range only | 0.683 | 0.577 | 0.259 |

**Hypothesis not supported as stated.** Local edges alone outperform long-range edges alone on compartment recovery (Δ +0.055 AUC). The naive prediction — that megabase-scale interactions are the dominant compartment-carrying signal — is wrong in this setup.

**What we *did* find.** The two scales are *complementary*: the full graph beats both ablations (Δ +0.068 over local-only, Δ +0.123 over long-range-only). Three observations explain the pattern:

1. **Link prediction is dominated by local structure.** Local-only achieves the highest link AUC (0.874) because near-diagonal contacts are essentially predictable from genomic distance alone. Long-range-only collapses to 0.683 — confirming long-range contacts are genuinely harder to predict (sparse, non-trivial), but the training objective derives most of its loss from local edges.
2. **Compartments are visible at every scale through node features.** Both ablations carry the same CTCF/H3K27me3/H3K27ac/H3K4me3 features, which are themselves compartment-correlated. Local message-passing smooths these over neighbouring bins (which already share compartment identity) — extracting compartment signal "for free" without using long-range topology.
3. **Long-range edges add information the local graph cannot recover.** The Δ between full (0.700) and local-only (0.632) is non-zero, so long-range contacts contribute. They are not redundant — but in this objective, they are not dominant either.

**Reframing.** The cleaner question is *not* "which scale carries compartment signal" (both do, via different routes) but "which scale carries signal that *node features alone cannot supply*" — answerable with the feature × edge cross-ablation below.

### Feature × edge cross-ablation

To separate two confounded sources of signal — feature smoothing through the graph vs. genuine topological information — train a 2×3 grid (constant-features row trained on RTX 4090, ~6.5 min total for the three new cells; identical architecture and hyperparameters to the real-features cells):

|                              | Full edges | Local (<250 kb) | Long-range (>1 Mb) |
|------------------------------|:----------:|:---------------:|:------------------:|
| **Real features**            | 0.700      | 0.632           | 0.577              |
| **Constant features (ones)** | 0.488      | 0.488           | 0.487              |
| Δ (real − const)             | +0.212     | +0.144          | +0.090             |

**Result: H3 is firmly not supported.** The bottom row is flat at 0.487–0.488 — essentially indistinguishable from random (0.50). All three edge bands carry zero compartment-discriminative information *from topology alone*. The 0.001 spread between `const_local` and `const_longrange` is well inside the ±0.02 noise band; the apparent edge of long-range over local is a coin flip.

**What this means.**
1. **Compartments are feature-driven, not topology-driven.** The full real-features cell scores 0.700 not because the graph encodes compartments topologically, but because (a) the node features (CTCF, H3K4me3, H3K27ac, H3K27me3) are themselves compartment-correlated, and (b) message-passing smooths those features over neighbouring bins — which already share compartment identity.
2. **The original H3 ordering (full > local > longrange under real features) is fully explained by smoothing reach.** Local edges connect bins that share features and compartment identity, so smoothing is efficient. Long-range edges bridge bins of more variable compartment identity (different compartments are physically close in 3D space), so smoothing is less coherent.
3. **The graph's role is the smoothing structure, not the signal.** The VGAE, in this setup, is a sophisticated feature-smoother where the contact graph determines *which* neighbours to average over but not *what* signal to propagate.

**Wider implication for H2.** This reframes the perturbation drift result. Epigenetic marks don't change in 6 h of cohesin loss; what changes is the *topology* — and therefore which neighbours each bin's features are smoothed against. The H2 drift signal is exactly this: a shift in the smoothing neighbourhood, not a change in the underlying biology. The VGAE captures topology change *because* topology is the smoothing operator, not because topology carries independent biological signal.

Scripts: `experiments/h3_longrange/`. Run end-to-end (idempotent, GPU-aware): `bash experiments/h3_longrange/run.sh`. Output: `results/h3_longrange/cross_ablation.json`, `results/h3_longrange/ablation_comparison.json`.

---

## Limitations

1. **H3 is chr1-only.** H1 and H2 have both been replicated genome-wide on chr1–chr22 (see the *Genome-wide replication* subsections above). The H3 cross-ablation is still chr1-only; genome-wide H3 replication is future work.

2. **Random negative sampling.** Negative edges are drawn uniformly at random — many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling would give a more stringent AUC.

3. **Link-prediction ≠ compartment recovery.** The VGAE is optimised for contact prediction, not compartment calling. Compartment AUC is emergent. The classical Hi-C PC1 method is better at compartment calling by design (AUC 1.0 by construction).

4. **Zero-shot transfer with fixed BatchNorm.** Encoding IMR90 or treated HCT-116 with GM12878/control BatchNorm statistics introduces a normalisation mismatch. BatchNorm adaptation (re-fitting statistics with frozen GCN weights) would give a fairer transfer evaluation.

5. **CTCF proxy ≠ loop calls (H2).** CTCF bins over-represent the anchor set. A clean validation requires either deeper Hi-C or a published HCT-116 loop list.

6. **max_dist = 5 Mb (H2).** Cannot test compartment-scale contact preservation (> 10 Mb).

7. **G1/S synchronisation compresses variance (H2).** The synchronised control may have attenuated chromatin variance relative to asynchronous cells, potentially underestimating the perturbation effect.

8. **No TAD-level evaluation.** TAD boundary detection requires a graph- or boundary-level metric. The current evaluation is node-level only.

---

## Future work

- **H3:** Genome-wide replication (chr2–22) of the cross-ablation to confirm the chr1 null is not chromosome-specific. Also: distance-matched negative sampling for link prediction (current AUCs are inflated for the local-only model because near-diagonal positives are trivially separable from random negatives).

- **All:** Distance-matched negative sampling for link prediction AUC.
- **All:** Replace inner-product decoder with a distance-aware decoder (corrects for polymer-scaling decay).

---

## Scripts and package

<details>
<summary>Show module and script tables</summary>

**Shared library** (`chromatin_gnn/` package):

| Module | Purpose |
|--------|---------|
| `chromatin_gnn/model.py` | GCNEncoder, GATEncoder, DeepGCNEncoder, build_encoder() |
| `chromatin_gnn/build_graph.py` | Convert .mcool + bigWigs → PyG Data object |

**H1 — Representation learning** (`experiments/h1_representation/`):

| Script | Purpose |
|--------|---------|
| `train.py` | Train VGAE with link-prediction objective + early stopping |
| `encode.py` | Encode a new graph with a trained model |
| `compute_compartments.py` | O/E matrix → Pearson PCA → A/B compartment calls |
| `evaluate.py` | Compartment AUC, ablation comparison, Spearman r, zero-shot AUC |
| `visualize.py` | UMAP + compartment visualisation + silhouette |
| `compare.py` | Per-bin cosine/L2/L1 distance between two embeddings |

**H2 — Dynamic rewiring** (`experiments/h2_rewiring/`):

| Script | Purpose |
|--------|---------|
| `call_loops.py` | Call chromatin loops from mcool with cooltools dots |
| `perturbation_analysis.py` | Encode both conditions, decompose drift by distance band, permutation test |
| `perturbation_viz.py` | UMAP, genome-browser drift track, scatter, bar plot |

**H3 — Long-range topology** (`experiments/h3_longrange/`, planned):

| Script | Purpose |
|--------|---------|
| `build_ablation_graphs.py` | Filter edges from graph by distance band |
| `train_ablation.py` | Train full / local / long-range VGAE variants |
| `evaluate_ablation.py` | Compare AUC/AP across variants |

</details>

---

**Key reference:**

- Rao, S. S. P. et al. (2017). Cohesin Loss Eliminates All Loop Domains. *Cell* 171(2). https://doi.org/10.1016/j.cell.2017.09.026

---

## License

MIT — see [LICENSE](LICENSE).