Method · Applied Research · arxiviq.com

Case study: production training inside a 1 GB container

This site's serving model is trained in production on the platform's own lifecycle: corpus → pair synthesis → head-only training (frozen backbone, features extracted once) → evaluation gates on the served representation → charter → deploy. The trained artifact is a ~3 MB projection head stored in Postgres; the backbone loads from the baked HuggingFace cache at serve time, with no GPU, no shared filesystem, and no plan upgrade.
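One way to picture "features extracted once" is a cache placed in front of the frozen backbone, so the head-training loop never calls the encoder again. The sketch below is illustrative only: `fake_backbone`, `extract_once`, and the toy head update are hypothetical stand-ins, not the platform's training code.

```python
from typing import Callable

backbone_calls = 0  # counts how often the (stand-in) frozen encoder is invoked


def fake_backbone(text: str) -> list[float]:
    """Hypothetical stand-in for the frozen encoder; returns a toy 2-d feature."""
    global backbone_calls
    backbone_calls += 1
    return [float(len(text)), float(text.count(" ") + 1)]


def extract_once(texts: list[str], encode: Callable[[str], list[float]]) -> dict[str, list[float]]:
    """Run the frozen backbone a single time per distinct text and keep the features."""
    return {t: encode(t) for t in dict.fromkeys(texts)}


def train_head(features: dict[str, list[float]], epochs: int = 3) -> list[float]:
    """Toy update: only the head weights change, and only cached features are read."""
    head = [0.0, 0.0]
    for _ in range(epochs):
        for vec in features.values():
            head = [w + 0.01 * x for w, x in zip(head, vec)]
    return head


texts = ["dense retrieval", "sentence embeddings", "dense retrieval"]
features = extract_once(texts, fake_backbone)
head = train_head(features, epochs=3)
```

However many epochs run, the backbone is called once per distinct text, which is what keeps this path feasible without a GPU.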

Model	nDCG@10	Effective rank	Provenance
`asn-head-1825460` (deployed)	0.9539	26.82	prod eval, gate slice
BAAI/bge-small-en-v1.5	0.9193	22.83	local, same pairs + metric code
all-MiniLM-L6-v2 (raw)	0.8847	25.97	local, same pairs + metric code

Method notes: no ASN weight surgery on this path (rejected 0/4 in fleet evidence); gates grade the representation that actually serves; baseline numbers were measured on the identical pair set and metric code (provenance differs by host and is labeled). Full row: EVIDENCE.md § production head-only split.
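For readers who want to reproduce the nDCG@10 column, a minimal sketch of the metric follows. It assumes linear gain and that the ideal ordering is computed from the same retrieved list; the page does not state which nDCG variant the platform's metric code uses, so treat both choices as assumptions.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """nDCG@k for one query; relevances are listed in the order the model ranked them."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant item in the list
    return dcg_at_k(ranked_relevances, k) / ideal
```

A reported figure such as 0.9539 would then be the mean of `ndcg_at_k` over all evaluation queries.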

Org recipe (charter): config/recipes/research.json · updated 2026-07-02

Base model: sentence-transformers/all-MiniLM-L6-v2
Epochs / batch: 3 / 16
Loss: InfoNCE temp=0.07, zELO w=0
PEFT: on
ASN: kStrong=8, kTail=8, λ=0.5
Newton-Schulz: 5 steps
Matryoshka dims: 64 · 128 · 256 · 384
Rollback: Revert if rotating-slice nDCG drops >0.02 vs charter baseline.
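The rollback rule in the recipe reduces to a single comparison. The sketch below assumes the drop is measured as charter baseline minus the current rotating-slice score; the function name and the example scores are hypothetical.

```python
def should_rollback(charter_ndcg: float, rotating_slice_ndcg: float, max_drop: float = 0.02) -> bool:
    """Return True when the rotating-slice nDCG has fallen more than max_drop below the charter baseline."""
    return (charter_ndcg - rotating_slice_ndcg) > max_drop


# Illustrative values only, not measurements from the evidence program.
print(should_rollback(charter_ndcg=0.95, rotating_slice_ndcg=0.92))  # drop of 0.03
print(should_rollback(charter_ndcg=0.95, rotating_slice_ndcg=0.94))  # drop of 0.01
```

An improvement over the baseline yields a negative drop and never triggers a revert.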

Evidence program

Based on 891 runs and 1,190 trainings. Sources: EVIDENCE.md §3.1–§3.8, SWEEP_FINDINGS.md, SWEEP_REPORT.md.

Evidence timeline

2026-06-27
EVIDENCE §3.1

Root cause fixed; trigger bug identified

Encoder effective-rank collapse traced to a projector-window trigger bug, not the contrastive loss. Re-run with the fixed trigger.
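The page does not define effective rank. One common definition, assumed here, is the exponential of the Shannon entropy of the normalized singular values of the embedding matrix. The sketch takes the singular values as input, so the SVD itself is left to whatever linear algebra library is in use.

```python
import math


def effective_rank(singular_values: list[float]) -> float:
    """exp(entropy) of the singular-value distribution; assumed definition, not confirmed by the source."""
    positive = [s for s in singular_values if s > 0.0]
    total = sum(positive)
    if total == 0.0:
        return 0.0
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Under this definition a fully collapsed encoder (one dominant singular value) scores close to 1, and a spectrum spread evenly over n directions scores n.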

2026-06-27
EVIDENCE §3.2

Three-tier spectral surgery REJECTED

Original ASN weight-space SVD surgery makes collapse worse in a real collapse regime (served rank 3.4 → 1.0; kNN −12 to −21 pts). Archived.

2026-06-27
EVIDENCE §3.4

Loss-space rank floor (VICReg) SUPPORTED

VICReg prevents collapse where it occurs (non-contrastive / low-negative regimes). Neutral on standard InfoNCE real-text. Kept as insurance.
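As a rough illustration of a loss-space rank floor, the variance term from VICReg penalizes any embedding dimension whose batch standard deviation falls below a target. The sketch below follows the published formulation (hinge at `gamma`, small `eps` inside the square root); the weights and settings used in the evidence program are not given on this page, so the defaults here are assumptions.

```python
import math
from statistics import variance


def variance_hinge(embeddings: list[list[float]], gamma: float = 1.0, eps: float = 1e-4) -> float:
    """Mean over dimensions of max(0, gamma - std); needs at least two embeddings in the batch."""
    dims = list(zip(*embeddings))  # one tuple of values per embedding dimension
    penalties = [max(0.0, gamma - math.sqrt(variance(col) + eps)) for col in dims]
    return sum(penalties) / len(penalties)


collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # every row identical
spread = [[-2.0, 2.0], [0.0, 0.0], [2.0, -2.0]]   # per-dimension std of 2
```

A collapsed batch is penalized on every dimension, while a batch whose dimensions already vary enough contributes nothing, which is consistent with the term being neutral when InfoNCE keeps the representation spread out.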

2026-06-27
EVIDENCE §3.6

Real-text validation: VICReg neutral; domain fine-tune beats SOTA in-domain

AG News kNN 0.892 / nDCG 0.822 after fine-tune. Per-tenant domain fine-tuning is the defensible moat: it beats general commercial embeddings in-domain with no OOD forgetting. Reconfirmed 2026-06-28.

2026-06-28
EVIDENCE §3.7

Tenant corpus vs commercial baseline (Phase A fleet)

Barlow Twins was the top method in the Bayesian TPE search (robust score 1.42, #1 of 6). Matryoshka (MRL) truncation training was promoted as the real lever for cheap truncated serving. int8 quantized serving is essentially lossless (kNN_int8 ≈ kNN_full).
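To make the int8 claim concrete, here is a minimal sketch of symmetric per-vector quantization. The page does not say which quantization scheme is served, so the scaling rule and all function names are assumptions.

```python
import math


def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    peak = max((abs(x) for x in vec), default=0.0)
    scale = peak / 127.0 if peak > 0.0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


vec = [0.5, -1.0, 0.25]
codes, scale = quantize_int8(vec)
similarity = cosine(vec, dequantize(codes, scale))
```

Because cosine similarity is insensitive to the per-vector scale, the only error left is rounding, which is why nearest-neighbour results can stay close to the full-precision ones.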

2026-06-28
EVIDENCE §3.8

Large-scale VICReg campaign (500 runs)

VICReg rank floor confirmed regime-specific: +0.32 kNN for SimSiam, ~0 for InfoNCE batch≥16. Not a default term; insurance only.

2026-07-02
EVIDENCE §2 / REV-905

Eval gate now fails closed

Removed the silent demo-pair fallback in core-api eval. Gates can no longer pass on hard-coded demo pairs; MIN_REAL_PAIRS_EVAL=8 is enforced, so the deploy spine certifies only against real pairs.
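The fail-closed behaviour can be sketched as a guard that raises instead of substituting demo data. Only `MIN_REAL_PAIRS_EVAL=8` comes from the page; the exception and function names are hypothetical and this is not the core-api code.

```python
MIN_REAL_PAIRS_EVAL = 8  # threshold stated in the evidence entry


class InsufficientEvalPairs(Exception):
    """Raised when evaluation would otherwise have to run on too few real pairs."""


def select_eval_pairs(real_pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the real pairs unchanged, or fail; there is deliberately no demo-pair fallback."""
    if len(real_pairs) < MIN_REAL_PAIRS_EVAL:
        raise InsufficientEvalPairs(
            f"need at least {MIN_REAL_PAIRS_EVAL} real pairs, got {len(real_pairs)}"
        )
    return real_pairs
```

Raising here means the gate result is "not measured" rather than a pass earned on canned data.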

What we ship, promote, and reject

In production

InfoNCE contrastive (default)
AG News kNN 0.892 / nDCG 0.822 · EVIDENCE §3.6, §3.7 (H-A)
int8 quantized serving
kNN_int8 ≈ kNN_full everywhere · EVIDENCE §3.7 (B)
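For reference, an in-batch InfoNCE loss at the recipe's temperature of 0.07 can be sketched as follows. It assumes a square similarity matrix with positives on the diagonal and scores only the query-to-document direction; the production loss may differ, for example by being symmetric.

```python
import math


def info_nce(sim: list[list[float]], temperature: float = 0.07) -> float:
    """sim[i][j] is the cosine similarity of query i and document j; document i is the positive."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

With n uninformative candidates per query the loss is log n, and it approaches zero as each positive pulls away from the in-batch negatives.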

Promoted to Business Development

Domain fine-tuning
+1.5% (300 pairs) → +3.0% (1,200 pairs) in-domain · EVIDENCE §3.6, §3.7 (H-C)
Barlow Twins
robust score 1.42 (#1 of 6 methods) · EVIDENCE §3.7 wave-2 TPE
Matryoshka (MRL) truncation
best truncated-dim kNN among methods · EVIDENCE §3.7 (B, wave-2)
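Serving a Matryoshka-trained embedding at a smaller size amounts to keeping a prefix and re-normalizing it. The sketch below uses the four dimensions from the recipe; the function name and the validation behaviour are assumptions.

```python
import math

MATRYOSHKA_DIMS = (64, 128, 256, 384)  # from the org recipe


def truncate_embedding(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale the prefix to unit length."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}, got {dim}")
    if dim > len(vec):
        raise ValueError(f"embedding has only {len(vec)} components")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0.0 else prefix
```

Truncation is only cheap to serve if training already placed the most useful information in the leading dimensions, which is the role of the MRL training objective.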

Tested and rejected

Three-tier spectral surgery (original ASN)
rank 3.4 → 1.0; kNN −12 to −21 pts · EVIDENCE §3.2
spectral_lift (rank-floor surgery)
rank 2.06 vs do-nothing 3.40 · EVIDENCE §3.3
Sleep / SHY consolidation (AwakenedSleepNet)
phasic rank 9.99 / 5.71 vs continuous 21.0 · EVIDENCE §3.5
DINO-style centering
robust score 1.08 (#6 of 6) · EVIDENCE §3.7 wave-2

Deploy gates (eval-harness)

A model reaches arxiviq.com only when every gate passes on real measurements. Gates fail closed when a dimension is unmeasured; they are never stubbed to true.

rankAboveBaseline: effective_rank > 8.0 · measured

ndcgNonRegression: nDCG ≥ 0.35 (k=2 proxy; k=10 benchmark pending exam port) · measured

mrlWithinTolerance: knn_t8 ≥ 0.30 AND (knn_full − knn_t8) ≤ 0.05; fail-closed if unmeasured · measured (fail-closed)

sufficientEvalPairs: ≥ 8 real collection pairs (no demo-pair fallback) · measured (REV-905)
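Taken together, the four gates can be sketched as one evaluator in which a missing measurement counts as a failure. The thresholds are the ones listed above; the metric keys `ndcg` and `real_pairs`, the function names, and the kNN example values are hypothetical, and this is not the eval-harness implementation.

```python
from typing import Optional

Metrics = dict[str, Optional[float]]


def evaluate_gates(m: Metrics) -> dict[str, bool]:
    """Evaluate each deploy gate; any unmeasured input makes its gate fail (fail closed)."""

    def measured(*keys: str) -> bool:
        return all(m.get(k) is not None for k in keys)

    return {
        "rankAboveBaseline": measured("effective_rank") and m["effective_rank"] > 8.0,
        "ndcgNonRegression": measured("ndcg") and m["ndcg"] >= 0.35,
        "mrlWithinTolerance": (
            measured("knn_t8", "knn_full")
            and m["knn_t8"] >= 0.30
            and (m["knn_full"] - m["knn_t8"]) <= 0.05
        ),
        "sufficientEvalPairs": measured("real_pairs") and m["real_pairs"] >= 8,
    }


def may_deploy(m: Metrics) -> bool:
    """A model ships only when every gate is True."""
    return all(evaluate_gates(m).values())


# knn_full, knn_t8, and real_pairs are illustrative values, not reported measurements.
complete = {"effective_rank": 26.82, "ndcg": 0.9539, "knn_full": 0.89, "knn_t8": 0.86, "real_pairs": 12}
missing_mrl = {"effective_rank": 26.82, "ndcg": 0.9539, "knn_full": 0.89, "real_pairs": 12}
```

Checking `measured(...)` first means a missing key can never be read as a pass, which is the property the REV-905 change is meant to guarantee.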

Corpus: 702 arXiv abstracts harvested from cs.CL, retrieval, and embedding queries into data/corpora/research/corpus.jsonl. To re-run the kickoff, use `pnpm harvest:arxiv && pnpm kickoff:orgs` after the stack is up.