Research Lab — Blue Hen RE — Applied Research

Live model versions

Measured model versions from this workspace's core-api /v1/models, ranked by effective rank.

asn-head-8080041 · erank 27.1 · nDCG@10 0.942

asn-head-1825460 (deployed) · erank 26.8 · trunc 256 · nDCG@10 0.954

asn-head-8282654 · erank 26.0 · nDCG@10 0.908

Based on 891 runs and 1,190 trainings · evidence: EVIDENCE.md §3.1–§3.7, SWEEP_FINDINGS.md, SWEEP_REPORT.md · updated 2026-06-28
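For readers unfamiliar with the two numbers on each card, here is a minimal sketch of how they are commonly computed. It assumes "erank" is the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values) and that nDCG@10 uses linear gain with a log2 discount. Both definitions are assumptions, as are the function names; this page does not state the exact formulas core-api uses.

```python
import math

def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to 1 (assumed definition)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of the returned results, in ranked order.

    The ideal ordering is taken from the same list, which assumes the list
    contains every relevant item (a simplification).
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

flat = effective_rank([1.0] * 8)                   # every direction used equally
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])   # a single direction
swapped = ndcg_at_k([0, 1])                        # relevant item ranked second
```

Under this definition a fully collapsed representation has effective rank 1, and a representation that spreads variance evenly over d directions has effective rank d.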

How the pipeline works

The research org tests every embedding method on measured benchmarks. Methods that prove their worth are promoted to Business Development for real-world tenant pilots; the Execution team then implements the winners in production. Dead ends are archived with the evidence that killed them.

In Research (2) → Promoted to Business Development (3) → In Execution (2) → Archived (rejected) (4)

In Research

2 methods

Under active investigation in the research org: measured, but not yet promoted.

VICReg (variance + covariance rank floor)

Measured (regime-specific)

Loss-space rank floor. Prevents collapse where it occurs (non-contrastive / low-negative regimes) but is neutral on standard InfoNCE real-text training. Kept as insurance, not a default.

Key metric · +0.32 kNN for SimSiam; ~0 for InfoNCE at batch ≥16; neutral on real text

Evidence: EVIDENCE §3.4, §3.6, §3.7 (H-A)
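As an illustration of what a loss-space rank floor looks like, the sketch below computes the variance and covariance regularizers from the published VICReg formulation on a small batch. The invariance term is omitted, and the function name, `gamma`, and `eps` are illustrative assumptions rather than this workspace's training code.

```python
import math

def vicreg_regularizers(z, gamma=1.0, eps=1e-4):
    """z: batch of embeddings as a list of N rows, each with D floats (N >= 2)."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    # Unbiased covariance matrix across the batch.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Variance term: hinge that pushes each dimension's std up to gamma.
    variance = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    # Covariance term: penalize off-diagonal entries so dimensions decorrelate.
    covariance = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return variance, covariance

collapsed = vicreg_regularizers([[1.0, 2.0]] * 3)   # every row identical
healthy = vicreg_regularizers([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]])
```

The hinge is only active when a dimension's spread falls below `gamma`, which is consistent with the card's finding that the term matters in collapse-prone regimes and is close to a no-op otherwise.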

Differentiable rank-floor regularizer

Measured (mid-pack)

Adds a differentiable effective-rank maximization term to InfoNCE. Works but ranked below Barlow/InfoNCE in the search.

Key metric · robust score 1.31 (#3 of 6)

Evidence: EVIDENCE §3.7 wave-2

Promoted to Business Development

3 methods

Earned their keep in research. Handed to Business Development for real-world tenant pilots.

Domain fine-tuning

Measured

Cheap per-tenant fine-tuning on the tenant's domain. The defensible product moat: beats general commercial embeddings in-domain, with no out-of-domain forgetting at this scale.

Key metric · +1.5% (300 pairs) → +3.0% (1,200 pairs) in-domain; OOD improved

Evidence: EVIDENCE §3.6, §3.7 (H-C)

Barlow Twins

Validating on real text

Redundancy-reduction objective. Top method in the Bayesian search; keeps full-dim quality while resisting collapse. Real-text confirmation vs VICReg in progress.

Key metric · robust score 1.42 (#1 of 6 methods, synthetic)

Evidence: EVIDENCE §3.7 wave-2 TPE
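To make "redundancy reduction" concrete, here is a minimal sketch of the Barlow Twins objective as published: standardize each dimension over the batch, build the cross-correlation matrix between two views, then pull the diagonal toward 1 and the off-diagonal toward 0. The helper names and the `lam` value are assumptions for illustration, not the settings used in the search.

```python
import math

def standardize(z, eps=1e-12):
    """Return D columns of length N, each with zero mean and unit std over the batch."""
    n, d = len(z), len(z[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        cols.append([(x - mean) / (std + eps) for x in col])
    return cols

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: embeddings of two views of the same batch, N rows of D floats."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(z_a), len(a)
    on_diag = off_diag = 0.0
    for i in range(d):
        for j in range(d):
            c = sum(x * y for x, y in zip(a[i], b[j])) / n  # cross-correlation C[i][j]
            if i == j:
                on_diag += (1.0 - c) ** 2   # invariance: same feature agrees across views
            else:
                off_diag += c ** 2          # redundancy: different features decorrelate
    return on_diag + lam * off_diag

decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
loss_good = barlow_twins_loss(decorrelated, decorrelated)
loss_bad = barlow_twins_loss(redundant, redundant)
```

The off-diagonal penalty is what discourages dimensions from duplicating each other, which is the mechanism behind the collapse resistance described in the card.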

Matryoshka (MRL) truncation training

Validating on real text

Trains nested prefixes of the served representation so it degrades gracefully when truncated. The real lever for cheap truncated serving (decorrelation is not).

Key metric · best truncated-dim kNN among methods (synthetic)

Evidence: EVIDENCE §3.7 (B, wave-2)
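The nested-prefix idea can be sketched as follows: apply the same base loss to several truncations of one embedding and average the results, so each prefix is trained to be usable on its own. The prefix sizes, the stand-in base loss, and all function names below are hypothetical; a real run would plug in the production objective.

```python
import math

def truncate(vec, dim):
    """Keep the first `dim` coordinates and L2-renormalize."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix)) or 1.0
    return [x / norm for x in prefix]

def mean_cosine_distance(batch_a, batch_b):
    """Stand-in base loss: mean (1 - cosine) over paired, unit-norm rows."""
    dists = [1.0 - sum(x * y for x, y in zip(a, b)) for a, b in zip(batch_a, batch_b)]
    return sum(dists) / len(dists)

def matryoshka_loss(batch_a, batch_b, loss_fn, dims=(64, 128, 256)):
    """Average `loss_fn` over nested prefixes of the same embeddings."""
    total = 0.0
    for dim in dims:
        a = [truncate(v, dim) for v in batch_a]
        b = [truncate(v, dim) for v in batch_b]
        total += loss_fn(a, b)
    return total / len(dims)

# Tiny 4-dimensional demo: the pair agrees on the 2-dim prefix but not on the full vector.
demo = matryoshka_loss([[1.0, 0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0, 1.0]],
                       mean_cosine_distance, dims=(2, 4))
```

At serving time only `truncate` is needed: the stored vector is cut to the chosen prefix length and renormalized before similarity search.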

In Execution

2 methods

Validated and shipped. Part of the production serving / training path today.

InfoNCE contrastive (default)

Measured

Standard contrastive objective. Inherently resists collapse with adequate negatives (batch ≥16); the production default training path.

Key metric · AG News kNN 0.892 / nDCG 0.822 after fine-tune

Evidence: EVIDENCE §3.6, §3.7 (H-A)
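For reference, a minimal sketch of the standard InfoNCE loss with in-batch negatives is shown below. It assumes the inputs are L2-normalized and that `keys[i]` is the positive for `queries[i]`; the temperature value and function name are illustrative assumptions, not the production settings.

```python
import math

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of picking keys[i] for queries[i] among all keys in the batch."""
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
collapsed = info_nce([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

When every embedding collapses to the same point, the loss sits at log(batch size) because positives and negatives are indistinguishable. More negatives in the batch make that penalty larger, which fits the card's note that collapse resistance depends on adequate negatives.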

int8 quantized serving

Measured

Per-vector int8 quantization for edge serving. Essentially lossless; ship it as the default cheap-serving tier.

Key metric · kNN_int8 ≈ kNN_full everywhere

Evidence: EVIDENCE §3.7 (B)
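Per-vector quantization can be sketched in a few lines: each vector gets its own scale derived from its largest absolute value, and coordinates are rounded to signed 8-bit codes. This is a generic symmetric scheme under assumed names, not necessarily the exact one in the serving path.

```python
def quantize_int8(vec):
    """Return (codes, scale) with codes in [-127, 127] and vec ≈ codes * scale."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0  # fall back to 1.0 for an all-zero vector
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

vec = [0.5, -0.25, 0.125, 0.0]
codes, scale = quantize_int8(vec)
restored = dequantize(codes, scale)
max_error = max(abs(x - y) for x, y in zip(vec, restored))
```

The reconstruction error per coordinate is bounded by half a quantization step (`scale / 2`), which is small relative to typical embedding magnitudes and is one plausible reason nearest-neighbour results barely move.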

Archived (rejected)

4 methods

Tested honestly and rejected. Kept here so we don't re-litigate dead ends.

Three-tier spectral surgery (original ASN)

Rejected

Periodic weight-space SVD surgery shrinking the weak singular band. Fights anisotropy, not collapse. In a real collapse regime it made collapse worse (rank 3.4→1.0).

Key metric · served rank 3.4 → 1.0; kNN −12 to −21 pts

Evidence: EVIDENCE §3.2

spectral_lift (rank-floor surgery)

Rejected

Redesigned surgery that lifts weak singular values. Fixes the harm of three-tier but still loses to doing nothing; weight-space surgery can't outrun the collapse gradient.

Key metric · rank 2.06 vs do-nothing 3.40

Evidence: EVIDENCE §3.3

Sleep / SHY consolidation (AwakenedSleepNet)

Rejected

Bio-inspired wake/sleep phasing with synaptic downscaling + dream pruning + replay. Phasic consolidation fails; only a continuous in-loss rank floor works. Downscaling survives as a collapse-neutral renormalizer.

Key metric · phasic rank 9.99 / 5.71 vs continuous 21.0

Evidence: EVIDENCE §3.5

DINO-style centering

Rejected

Self-distillation with centering + sharpening. Worst method in the Bayesian search on this task.

Key metric · robust score 1.08 (#6 of 6)

Evidence: EVIDENCE §3.7 wave-2
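For completeness, centering and sharpening in the DINO formulation can be sketched as below: the teacher's logits are shifted by a running center, then passed through a low-temperature softmax, and the center is updated as an exponential moving average of batch means. The temperature, momentum, and function names are assumptions taken from the general recipe, not from the rejected run.

```python
import math

def teacher_probs(logits, center, temperature=0.04):
    """Centering (subtract center) followed by sharpening (low-temperature softmax)."""
    shifted = [(l - c) / temperature for l, c in zip(logits, center)]
    m = max(shifted)  # stabilize the softmax
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, batch_logits, momentum=0.9):
    """Exponential moving average of the per-dimension batch mean of teacher logits."""
    n = len(batch_logits)
    batch_mean = [sum(row[j] for row in batch_logits) / n for j in range(len(center))]
    return [momentum * c + (1.0 - momentum) * m for c, m in zip(center, batch_mean)]

uniform = teacher_probs([2.0, 1.0], [2.0, 1.0])          # center cancels the logits
new_center = update_center([0.0, 0.0], [[1.0, 3.0], [3.0, 5.0]])
```

Centering alone pushes the teacher toward a uniform output and sharpening alone pushes it toward a single dominant dimension; the method relies on the two balancing each other.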