Files
chromatin-vgae-hic/README.md

29 KiB
Raw Permalink Blame History

chromatin-gnn

Variational Graph Autoencoder for learning latent representations of chromatin topology from Hi-C data


Overview

Chromosomes fold into compartments, topologically associating domains (TADs), and loop structures that correlate with gene regulation. Hi-C sequencing captures these contacts genome-wide, but the resulting data are high-dimensional, distance-dominated, and require principled dimensionality reduction to extract biologically interpretable structure.

This repository applies a Variational Graph Autoencoder (VGAE) to Hi-C contact data to learn compact, continuous latent representations of chromatin topology. Genomic bins are graph nodes;

ICE-balanced (normalized) contact frequencies are the weighted edges;

ChIP-seq tracks supply node features.

The model is trained end-to-end on a link-prediction objective and evaluated across 3 hypotheses as below


Scientific hypotheses

Hypothesis 1 — Representation learning

A VGAE trained on Hi-C contact graphs learns low-dimensional latent representations that capture biologically meaningful chromatin organisation in an unsupervised manner.

Hypothesis 2 — Dynamic rewiring

Changes in chromatin state between conditions manifest as measurable shifts in latent graph embeddings, with stable regulatory domains remaining conserved while loop-dependent architecture exhibits greater topological drift.

Hypothesis 3 — Long-range topology (in progress)

Weighted graph message passing over long-range chromatin interactions (> 1 Mb) enables the emergence of latent structure beyond what is explainable by local contact density alone.


Architecture

Three encoder architectures are available (--encoder gcn | gat | deep_gcn).

The best-performing model is DeepGCN (3-layer GCN with edge-weighted message passing):

β-VGAE architecture

ICE-balanced contact weights (log1p-normalised) are passed as edge_weight to every GCNConv layer.

The decoder is the standard dot-product decoder from Kipf & Welling (2016).

Training uses a link-prediction objective.


Repository structure

Show layout
chromatin_gnn/              # importable package — model and graph utilities
experiments/
  h1_representation/        # H1: GM12878 VGAE + IMR90 zero-shot transfer
  h2_rewiring/              # H2: RAD21 depletion perturbation analysis
  h3_longrange/             # H3: graph ablation
data/
  raw/                      # gitignored — mcool, bigWig files
  processed/
    gm12878/  imr90/  hct116/
results/
  h1_representation/        # model, embeddings, compartments, figures
  h2_rewiring/              # model, drift arrays, stats, figures
  h3_longrange/
pyproject.toml              # pip install -e . makes chromatin_gnn importable

Installation

Show install steps
conda create -n chromatin_gnn python=3.10 -y
conda activate chromatin_gnn
pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cpu
pip install torch-geometric==2.5.3 cooler==0.9.3 pyBigWig pandas \
    "numpy>=1.24,<2.0" scikit-learn matplotlib umap-learn scipy tqdm

# local install
pip install -e .

future to do - dockerize and add mlops setup


H1 — Representation learning

Dataset

GM12878 lymphoblastoid and IMR-90 fibroblast cell lines.

Chr1 at 25 kb resolution. 25kb seems reasonable to keep the TADs & loops.

File Cell line Type Source Accession
GM12878.mcool GM12878 Hi-C 4DN 4DNFIRUMEC32
IMR90.mcool IMR-90 Hi-C 4DN 4DNFIABB3FHQ
GM12878_CTCF.bw GM12878 CTCF ChIP-seq ENCODE ENCFF741BAQ
GM12878_H3K27me3.bw GM12878 H3K27me3 ChIP-seq ENCODE ENCFF736CNQ
GM12878_H3K4me3.bw GM12878 H3K4me3 ChIP-seq ENCODE ENCSR057BWO
IMR90_CTCF.bw IMR-90 CTCF ChIP-seq ENCODE ENCFF770DUD
IMR90_H3K27me3.bw IMR-90 H3K27me3 ChIP-seq ENCODE ENCFF158HZL
IMR90_H3K4me3.bw IMR-90 H3K4me3 ChIP-seq ENCODE ENCFF127NCC

Graph statistics (chr1, 25 kb):

Cell line Bins Edges Node features
GM12878 9,959 ~625k 3 (CTCF, H3K27me3, H3K4me3)
IMR90 9,959 ~625k 3

Modeling strategy

The VGAE is trained on GM12878 chr1 with a link-prediction objective.

No compartment labels are used during training.

After training, the GM12878 model is applied zero-shot to encode IMR90 — the same model weights, no retraining.

Results

Link prediction (training objective):

Encoder Features Edge weights Val AUC Test AUC Test AP
DeepGCN 3 ChIP 0.823 0.823 0.766

Compartment classification (biological validation):

A/B compartments are called by O/E Pearson PCA on the contact matrix (standard Hi-C method). VGAE embeddings are never shown compartment labels.

To test biological validity, a logistic regression is trained on embeddings to predict A/B labels (5-fold cross-validation).
Method What it uses GM12878 AUC GM12878 AP
Hi-C PC1 (classical) Full contact matrix 1.000* 1.000*
ChIP-seq features only (PCA) Node features, no graph 0.530 0.580
VGAE in-domain Graph + features 0.700 0.760
VGAE zero-shot IMR90 GM12878 model on IMR90 0.584 0.549

* The Hi-C PC1 baseline is 1.000 by construction: A/B labels are defined as sign(PC1), so PC1 predicts them easily. It represents the standard method

Key findings:

  • VGAE AUC 0.700 > feature-only 0.530 (+0.17): the graph topology contributes significantly beyond ChIP-seq features alone.
  • Spearman r(PC1, best latent dim) = 0.372 for GM12878 — the latent space has compartment-correlated structure without compartment supervision.
  • Zero-shot IMR90 AUC = 0.584: partial but significant cross-cell-type transfer.

Topology ablation (H1 extension):

To isolate whether graph structure or node features drive compartment recovery, a VGAE is trained with constant node features (ones vector, 1D — no ChIP-seq). Every node looks identical as input; the model can only learn from who contacts whom.

Model Node features Link pred AUC Compartment AUC
Full (VGAE + ChIP) 3 ChIP tracks 0.823 0.700
Topology-only no ChIP 0.531 0.487
Feature-only (PCA) 3 ChIP, no graph 0.530

The topology-only model converges to near-random performance (link pred AUC 0.531, compartment AUC 0.487 ≈ chance, early stopped at 58/300 epochs).

Neither the contact graph alone nor the ChIP-seq features alone achieve meaningful compartment recovery.

The VGAE's power comes from message passing that propagates epigenomic signals through the contact topology: what a bin encodes in the latent space is not just its own ChIP-seq state, but the combined epigenomic state of its contact neighbourhood.

I need models with efficient message passing

Genome-wide replication (chr1 chr22)

To test whether the chr1 result generalises, the full H1 pipeline was rerun independently on all 22 autosomes (per-chromosome models, identical architecture and hyperparameters; 5-fold CV LogReg on standardised embeddings).

Summary across 22 autosomes:

Metric Mean ± SD Range
Link prediction test AUC 0.813 ± 0.032 0.755 0.887
GM12878 in-domain compartment AUC 0.833 ± 0.082 0.700 0.993
IMR90 zero-shot compartment AUC 0.839 ± 0.091 0.690 0.985
GM12878 Spearman r (PC1, best dim) 0.503 ± 0.205 0.21 0.92
IMR90 Spearman r (PC1, best dim) 0.405 ± 0.209 0.12 0.84

Key findings:

  • Zero-shot transfer matches in-domain performance (0.839 vs 0.833). The GM12878-trained model recovers IMR90 compartments essentially as well as it recovers GM12878's own — substantially stronger evidence of biologically meaningful representations than chr1 alone suggested.
  • Link AUC is highly stable (CV ≈ 4 %) across all 22 chromosomes — training is robust, not chr1-specific.
  • Strong chromosome-size dependence in compartment AUC: small chromosomes recover compartments far better than large ones (chr21 AUC 0.993, chr19 0.972, chr22 0.888 vs chr1 0.700, chr8 0.734). Likely contributors: (a) smaller chromosomes have simpler / more dominant A/B partitions; (b) fewer bins lead to easier binary classification with less label imbalance.
  • Methodology note: the per-chromosome aggregate uses 5-fold CV LogReg on standardised embeddings and gives chr1 GM12878 AUC = 0.700 (identical to the chr1 table above) and chr1 IMR90 AUC = 0.690 (higher than the 0.584 reported in the chr1 table, which used a single train/test split). The genome-wide protocol is the more rigorous evaluation.

Per-chrom breakdown: results/h1_representation/genome_summary.csv. Run: bash experiments/h1_representation/run_genome.shpython experiments/h1_representation/aggregate.py.

Workflow

Show commands
export PYTHONPATH=$(pwd)
bash experiments/h1_representation/run.sh # full

# individual steps:
python experiments/h1_representation/train.py \
    --graph data/processed/gm12878/chr1.pt \
    --encoder deep_gcn --hidden 256 --latent 64 \
    --epochs 300 --patience 50 --lr 3e-4 \
    --outdir results/h1_representation

python experiments/h1_representation/evaluate.py \
    --gm12878_emb   results/h1_representation/gm12878_emb.npy \
    --imr90_emb     results/h1_representation/imr90_emb.npy \
    --gm12878_graph data/processed/gm12878/chr1.pt \
    --imr90_graph   data/processed/imr90/chr1.pt \
    --comp_gm12878  results/h1_representation/compartments/gm12878_chr1.csv \
    --comp_imr90    results/h1_representation/compartments/imr90_chr1.csv \
    --out           results/h1_representation/evaluation.json

Figures

Show figure list
Figure Description
results/h1_representation/figures/umap_GM12878_compartment.png GM12878 UMAP coloured by A/B compartment
results/h1_representation/figures/umap_IMR90_compartment.png IMR90 UMAP coloured by A/B compartment (zero-shot)
results/h1_representation/figures/umap_joint.png Joint UMAP of both cell lines
results/h1_representation/figures/chr21_delta.png Per-bin cosine distance (GM12878 vs IMR90)

Biological interpretation

  • The VGAE recovers A/B compartment structure as an emergent property of learning contact graph topology — without any compartment supervision. The graph topology contributes +0.17 AUC over ChIP-seq features alone (0.700 vs 0.530), demonstrating that it is contact structure, not just epigenomic marking, that encodes compartment identity in the learned representation. The partial zero-shot transfer to IMR90 (AUC 0.584) reflects the partial conservation of compartment architecture across cell types.

  • The VGAE does not outperform the classical Hi-C PC1 method at compartment calling. The VGAE is a general representation that captures compartment structure as a byproduct of link prediction, and which can be applied to new cell types, conditions, or downstream tasks without retraining.

  • The ablation makes the mechanism precise: the graph topology and ChIP-seq features are multiplicatively complementary, not independently sufficient. Contact topology without epigenomic marking is uninterpretable (AUC 0.487 ≈ chance); epigenomic marking without contact topology captures only marginal structure (AUC 0.530). Message passing through the contact graph which aggregates each bin's ChIP-seq neighbourhood is the operation that produces the compartment signal.


H2 — Dynamic rewiring

Dataset

HCT-116 colorectal carcinoma, RAD21-mAC auxin-inducible degron (Rao et al. 2017, Cell 171:305-320). G1/S-synchronised cells. Chr1 at 25 kb resolution, 4 ChIP-seq node features.

File Condition Source Accession
HCT116_control.mcool Untreated G1/S 4DN 4DNFINIQYFKT
HCT116_treated_6h.mcool 6 h auxin (RAD21-depleted) G1/S 4DN 4DNFIAVRY6RG
HCT116_CTCF.bw HCT-116 CTCF ChIP-seq ENCODE
HCT116_H3K27me3.bw HCT-116 H3K27me3 ChIP-seq ENCODE
HCT116_H3K27ac.bw HCT-116 H3K27ac ChIP-seq ENCODE
HCT116_H3K4me3.bw HCT-116 H3K4me3 ChIP-seq ENCODE

more time interval including datasets?

Graph statistics (chr1, 25 kb):

Condition Bins Edges Node features
Control (G1/S) 9,959 947,463 4 (CTCF, H3K27me3, H3K27ac, H3K4me3)
6 h auxin (G1/S) 9,959 946,994 4

Modeling strategy

The VGAE is trained on the control condition only (link-prediction objective). The trained model is then applied zero-shot to encode the treated (RAD21-depleted) graph.

Per-bin cosine distance between control and treated embeddings quantifies the embedding drift induced by cohesin loss.

Drift is decomposed into two distance bands reflecting different biological scales:

  • Short-range drift (edges < 1 Mb): loop scale. expected high at former loop anchors after RAD21 depletion
  • Long-range drift (edges 25 Mb): sub-compartment scale — expected LOW genome-wide (compartments are preserved)

Loop anchor definition

The Rao 2017 loop list was not publicly deposited. Therefore loops were called with cooltools dots (v0.7.1, FDR ≤ 0.1) on the control mcool at 10 kb resolution. The G1/S-synchronised sample (~1.35 GB) is ~10× shallower than the async sample (~12.8 GB), yielding only 1 statistically significant loop on chr1 — insufficient for enrichment testing.

The CTCF ChIP-seq peak bins was therefore used as a proxy (top 25th percentile CTCF signal at 25 kb). This is biologically motivated: CTCF anchors cohesin-dependent loops; RAD21 depletion dissolves loops while leaving CTCF binding intact. The proxy is fully independent of the Hi-C data (no circularity). It over-includes CTCF sites that are not loop anchors at this resolution, making the test conservative.

Results

Training (HCT-116 control, chr1, 25 kb):

Encoder Node features Edge weights Test AUC Test AP Epochs
DeepGCN 4 ChIP 0.864 0.843 300

Drift decomposition and anchor enrichment:

Metric Value
Mean full embedding drift (cosine) 0.128
Mean short-range drift (< 1 Mb; loop scale) 0.141
Mean long-range drift (25 Mb; sub-compartment) 0.174
CTCF anchor bins (chr1, top 25% signal) 2,301 / 9,959 (23%)
Mean short-range drift at CTCF anchors 0.152
Mean short-range drift at non-anchors 0.138
Permutation p-value (1,000 shuffles) 0.003

Short-range drift is significantly enriched at CTCF loop anchor bins (p = 0.003), consistent with loop-scale contact loss after RAD21/cohesin depletion (Rao et al. 2017).

Caveat: The graph max_dist filter of 5 Mb means the "long-range" band (25 Mb) is sub-compartment scale, not true compartment scale (> 10 Mb). We cannot directly test A/B compartment preservation with this graph. Rebuilding with max_dist ≥ 20 Mb is future work. # i need a gpu server for this!

Genome-wide replication (chr1 chr22)

To test whether the chr1 perturbation result is general, the H2 pipeline was rerun on all 22 autosomes — per-chromosome models, per-chromosome permutation tests (1,000 shuffles each), then combined.

Summary across 22 autosomes (independent per-chrom permutation tests):

Metric Value
Chromosomes with predicted positive contrast (anchor SR > non-anchor SR) 19 / 22
Chromosomes individually significant at p < 0.05 13 / 22
Chromosomes individually significant at p < 0.01 8 / 22
Fisher's combined p-value (22 independent tests) 1.0 × 10⁻¹⁶
Mean SR drift at CTCF anchors 0.153
Mean SR drift at non-anchor bins 0.138
Mean contrast (anchor non-anchor) +0.015 (≈ 11 % relative)

Key findings:

  • The chr1 result generalises. RAD21 depletion produces embedding drift preferentially concentrated at CTCF loop-anchor bins genome-wide.
  • The effect is small in absolute magnitude but consistent across 19 / 22 chromosomes — the signature of a biologically real, mechanistically specific signal rather than chr1-specific noise. Even a conservative binomial sign test on the 19/22 directional outcomes gives p ≈ 7 × 10⁻⁴.
  • 3 chromosomes (chr19 Δ = 0.003, chr20 Δ = 0.006, chr22 Δ = 0.012) show slight inverse contrast; chr7 is essentially flat (Δ ≈ 0). Notably chr21 — similar in size to chr19/20/22 — gives the strongest positive signal (Δ = +0.071, p = 0), so this is not a small-chromosome artefact. Possible explanations include chromosome-specific differences in CTCF anchor density or model convergence quality (chr22 stopped early at 58 epochs in H1, suggesting that chromosome's training may be undertrained for the perturbation test as well).
  • The genome-wide Fisher combined p of 10⁻¹⁶ accumulates partial-signal chromosomes into overwhelming significance, while honestly preserving the per-chrom distribution.

Per-chrom breakdown: results/h2_rewiring/genome_summary.csv, results/h2_rewiring/genome_summary.json. Run: bash experiments/h2_rewiring/run_genome.shpython experiments/h2_rewiring/aggregate.py.

Workflow

Show commands
export PYTHONPATH=$(pwd)
bash experiments/h2_rewiring/run.sh

# Or run the analysis step alone after training:
python experiments/h2_rewiring/perturbation_analysis.py \
    --control_graph data/processed/hct116/control_chr1.pt \
    --treated_graph data/processed/hct116/treated_chr1.pt \
    --model         results/h2_rewiring/model.pt \
    --loops         data/processed/hct116/loops_chr1_10kb.bedpe \
    --chrom chr1 --res 25000 \
    --short_cutoff 1000000 --long_cutoff 2000000 \
    --outdir results/h2_rewiring

python experiments/h2_rewiring/perturbation_viz.py \
    --control_emb results/h2_rewiring/control_emb.npy \
    --treated_emb results/h2_rewiring/treated_emb.npy \
    --drift_full  results/h2_rewiring/drift_full.npy \
    --drift_short results/h2_rewiring/drift_short.npy \
    --drift_long  results/h2_rewiring/drift_long.npy \
    --anchor_mask results/h2_rewiring/anchor_mask.npy \
    --stats       results/h2_rewiring/drift_stats.json \
    --outdir      results/h2_rewiring/figures --res 25000

Figures

Show figure list
Figure Description
results/h2_rewiring/figures/umap_perturbation.png UMAP of control vs RAD21-depleted embeddings
results/h2_rewiring/figures/drift_track_chr1.png Genome-browser drift track: short-range (signal) and long-range (background) along chr1
results/h2_rewiring/figures/scatter_lr_vs_sr_drift.png Long-range vs short-range drift scatter; CTCF anchors highlighted
results/h2_rewiring/figures/barplot_anchor_drift.png Mean short-range drift at loop anchors vs non-anchors, with permutation p-value

Biological interpretation

  • A VGAE trained only on the control condition detects loop-scale chromatin rewiring as an emergent shift in latent embeddings. The permutation test confirms that short-range drift is non-randomly concentrated at CTCF loop-anchor bins (p = 0.003) — the exact loci where RAD21/cohesin-dependent contacts are expected to be lost. This shows that latent embedding drift is a biologically structured signal, not noise.

  • The training objective (link prediction) is not designed to be sensitive to contact changes between conditions. The signal emerges from the model's learned representation of neighbourhood structure: after cohesin loss, a bin's contact neighbourhood changes, and the latent vector drifts accordingly. A perturbation-aware training objective (e.g., contrastive loss) would likely amplify this signal. # but then the model becomes less exciting, drift detection should come like a surprise


H3 — Long-range topology

Hypothesis (as posed): Chromatin interactions beyond the local genomic neighbourhood (> 1 Mb) encode non-trivial topological structure that cannot be explained by distance-decay alone, and should carry more compartment signal than local contacts.

Approach. Graph ablation. Train three VGAE variants on the same GM12878 chr1 graph (same architecture, same node features, same hyperparameters as H1):

Variant Edges kept # edges (chr1)
Full All ≤ 5 Mb ~625k
Local only < 250 kb (within-TAD) ~173k
Long-range only > 1 Mb (sub-compartment) ~339k

Result

Variant Link AUC Compartment AUC (5-fold CV) Spearman r (best dim, PC1)
Full 0.823 0.700 0.372
Local only 0.874 0.632 0.266
Long-range only 0.683 0.577 0.259

Hypothesis not supported as stated. Local edges alone outperform long-range edges alone on compartment recovery (Δ +0.055 AUC). The naive prediction — that megabase-scale interactions are the dominant compartment-carrying signal — is wrong in this setup.

What we did find. The two scales are complementary: the full graph beats both ablations (Δ +0.068 over local-only, Δ +0.123 over long-range-only). Three observations explain the pattern:

  1. Link prediction is dominated by local structure. Local-only achieves the highest link AUC (0.874) because near-diagonal contacts are essentially predictable from genomic distance alone. Long-range-only collapses to 0.683 — confirming long-range contacts are genuinely harder to predict (sparse, non-trivial), but the training objective derives most of its loss from local edges.
  2. Compartments are visible at every scale through node features. Both ablations carry the same CTCF/H3K27me3/H3K27ac/H3K4me3 features, which are themselves compartment-correlated. Local message-passing smooths these over neighbouring bins (which already share compartment identity) — extracting compartment signal "for free" without using long-range topology.
  3. Long-range edges add information the local graph cannot recover. The Δ between full (0.700) and local-only (0.632) is non-zero, so long-range contacts contribute. They are not redundant — but in this objective, they are not dominant either.

Reframing. The cleaner question is not "which scale carries compartment signal" (both do, via different routes) but "which scale carries signal that node features alone cannot supply" — answerable with the feature × edge cross-ablation below.

Feature × edge cross-ablation

To separate two confounded sources of signal — feature smoothing through the graph vs. genuine topological information — train a 2×3 grid (constant-features row trained on RTX 4090, ~6.5 min total for the three new cells; identical architecture and hyperparameters to the real-features cells):

Full edges Local (<250 kb) Long-range (>1 Mb)
Real features 0.700 0.632 0.577
Constant features (ones) 0.488 0.488 0.487
Δ (real const) +0.212 +0.144 +0.090

Result: H3 is firmly not supported. The bottom row is flat at 0.4870.488 — essentially indistinguishable from random (0.50). All three edge bands carry zero compartment-discriminative information from topology alone. The 0.001 spread between const_local and const_longrange is well inside the ±0.02 noise band; the apparent edge of long-range over local is a coin flip.

What this means.

  1. Compartments are feature-driven, not topology-driven. The full real-features cell scores 0.700 not because the graph encodes compartments topologically, but because (a) the node features (CTCF, H3K4me3, H3K27ac, H3K27me3) are themselves compartment-correlated, and (b) message-passing smooths those features over neighbouring bins — which already share compartment identity.
  2. The original H3 ordering (full > local > longrange under real features) is fully explained by smoothing reach. Local edges connect bins that share features and compartment identity, so smoothing is efficient. Long-range edges bridge bins of more variable compartment identity (different compartments are physically close in 3D space), so smoothing is less coherent.
  3. The graph's role is the smoothing structure, not the signal. The VGAE, in this setup, is a sophisticated feature-smoother where the contact graph determines which neighbours to average over but not what signal to propagate.

Wider implication for H2. This reframes the perturbation drift result. Epigenetic marks don't change in 6 h of cohesin loss; what changes is the topology — and therefore which neighbours each bin's features are smoothed against. The H2 drift signal is exactly this: a shift in the smoothing neighbourhood, not a change in the underlying biology. The VGAE captures topology change because topology is the smoothing operator, not because topology carries independent biological signal.

Scripts: experiments/h3_longrange/. Run end-to-end (idempotent, GPU-aware): bash experiments/h3_longrange/run.sh. Output: results/h3_longrange/cross_ablation.json, results/h3_longrange/ablation_comparison.json.


Limitations

  1. H3 is chr1-only. H1 and H2 have both been replicated genome-wide on chr1chr22 (see the Genome-wide replication subsections above). The H3 cross-ablation is still chr1-only; genome-wide H3 replication is future work.

  2. Random negative sampling. Negative edges are drawn uniformly at random — many are long-range pairs with trivially near-zero contact frequency. Distance-matched negative sampling would give a more stringent AUC.

  3. Link-prediction ≠ compartment recovery. The VGAE is optimised for contact prediction, not compartment calling. Compartment AUC is emergent. The classical Hi-C PC1 method is better at compartment calling by design (AUC 1.0 by construction).

  4. Zero-shot transfer with fixed BatchNorm. Encoding IMR90 or treated HCT-116 with GM12878/control BatchNorm statistics introduces a normalisation mismatch. BatchNorm adaptation (re-fitting statistics with frozen GCN weights) would give a fairer transfer evaluation.

  5. CTCF proxy ≠ loop calls (H2). CTCF bins over-represent the anchor set. A clean validation requires either deeper Hi-C or a published HCT-116 loop list.

  6. max_dist = 5 Mb (H2). Cannot test compartment-scale contact preservation (> 10 Mb).

  7. G1/S synchronisation compresses variance (H2). The synchronised control may have attenuated chromatin variance relative to asynchronous cells, potentially underestimating the perturbation effect.

  8. No TAD-level evaluation. TAD boundary detection requires a graph- or boundary-level metric. The current evaluation is node-level only.


Future work

  • H3: Genome-wide replication (chr222) of the cross-ablation to confirm the chr1 null is not chromosome-specific. Also: distance-matched negative sampling for link prediction (current AUCs are inflated for the local-only model because near-diagonal positives are trivially separable from random negatives).

  • All: Distance-matched negative sampling for link prediction AUC.

  • All: Replace inner-product decoder with a distance-aware decoder (corrects for polymer-scaling decay).


Scripts and package

Show module and script tables

Shared library (chromatin_gnn/ package):

Module Purpose
chromatin_gnn/model.py GCNEncoder, GATEncoder, DeepGCNEncoder, build_encoder()
chromatin_gnn/build_graph.py Convert .mcool + bigWigs → PyG Data object

H1 — Representation learning (experiments/h1_representation/):

Script Purpose
train.py Train VGAE with link-prediction objective + early stopping
encode.py Encode a new graph with a trained model
compute_compartments.py O/E matrix → Pearson PCA → A/B compartment calls
evaluate.py Compartment AUC, ablation comparison, Spearman r, zero-shot AUC
visualize.py UMAP + compartment visualisation + silhouette
compare.py Per-bin cosine/L2/L1 distance between two embeddings

H2 — Dynamic rewiring (experiments/h2_rewiring/):

Script Purpose
call_loops.py Call chromatin loops from mcool with cooltools dots
perturbation_analysis.py Encode both conditions, decompose drift by distance band, permutation test
perturbation_viz.py UMAP, genome-browser drift track, scatter, bar plot

H3 — Long-range topology (experiments/h3_longrange/, planned):

Script Purpose
build_ablation_graphs.py Filter edges from graph by distance band
train_ablation.py Train full / local / long-range VGAE variants
evaluate_ablation.py Compare AUC/AP across variants

Key reference:


License

MIT — see LICENSE.