Benchmarks

Real-world search quality measurements on a production PHP codebase: Popup Maker core + Pro (611 files, 2,574 chunks across two repositories).

Test Suite

20 ground-truth queries spanning the full codebase — popup lifecycle, trigger system, conditions, cookies, forms, REST API, admin, DI architecture, and Pro features. Each query has 1-2 expected files that a developer would navigate to when investigating that topic.

Metrics:

  • MRR (Mean Reciprocal Rank): The average of 1/rank of the correct file across all queries, so hits near the top count far more than hits further down. 1.0 = always rank 1.
  • Top-K Accuracy: Percentage of queries where the correct file appears in the top K results.
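Both metrics can be computed from ranked result lists in a few lines. The sketch below is illustrative and is not taken from the benchmark scripts. It assumes that when a query has two expected files, the best-ranked one determines the score. The file lists are toy data.

```python
from typing import List, Sequence, Set, Tuple

# One run = (ranked result paths, set of expected paths) for a single query.
Run = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(results: Sequence[str], expected: Set[str]) -> float:
    """Return 1/rank of the first expected file, or 0.0 if none was returned."""
    for rank, path in enumerate(results, start=1):
        if path in expected:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(results, expected) for results, expected in runs) / len(runs)


def top_k_accuracy(runs: List[Run], k: int) -> float:
    """Fraction of queries with at least one expected file in the top k results."""
    hits = sum(1 for results, expected in runs if any(p in expected for p in results[:k]))
    return hits / len(runs)


# Toy data (hypothetical rankings, not benchmark output).
RUNS: List[Run] = [
    (["Popups.php", "Triggers.php", "Cookies.php"], {"Popups.php"}),           # hit at rank 1
    (["Admin.php", "Triggers.php", "Cookies.php"], {"Triggers.php"}),          # hit at rank 2
    (["Admin.php", "Forms.php", "Rest.php", "Cookies.php"], {"Cookies.php"}),  # hit at rank 4
    (["Admin.php", "Forms.php"], {"Subscribe.php"}),                           # miss
]

print(mean_reciprocal_rank(RUNS))   # 0.4375 = (1 + 1/2 + 1/4 + 0) / 4
print(top_k_accuracy(RUNS, 1))      # 0.25
print(top_k_accuracy(RUNS, 3))      # 0.5
```

With only 20 queries, each query is worth 5 percentage points of Top-K accuracy, which is why every accuracy figure on this page is a multiple of 5%.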

Zero-Config Default (~200MB)

pip install "tessera-idx[embed]"
tessera index /path/to/project

Uses BGE-small-en-v1.5 (67MB, 384d) + Jina-reranker-v1-tiny (130MB). Total ~200MB downloaded on first run. No GPU, no server, no config.

Expected quality: 0.739 MRR, 85% Top-3, 95% Top-10.

Quality Local (~340MB)

tessera index /path/to/project \
  --embedding-model BAAI/bge-base-en-v1.5

Uses BGE-base-en-v1.5 (210MB, 768d) + Jina-reranker-v1-tiny (130MB).

Expected quality: 0.766 MRR, 85% Top-3, 100% Top-10.

Maximum Local (~590MB)

tessera index /path/to/project \
  --embedding-model thenlper/gte-base \
  --reranking-model jinaai/jina-reranker-v1-turbo-en

Uses GTE-base (440MB, 768d) + Jina-reranker-v1-turbo (150MB). Matches the Qwen3 gateway setup (0.825 MRR) with zero server setup.

Expected quality: 0.825 MRR, 90% Top-3, 100% Top-10.

Maximum Quality (model server)

Run an embedding + reranker server (e.g., LM Studio, vLLM, or a local gateway):

tessera index /path/to/project \
  --embedding-endpoint http://localhost:8800/v1/embeddings \
  --embedding-model nomic-embed

With Nomic-embed-text (768d) + Jina cross-encoder reranking via HTTP.

Expected quality: 0.854 MRR, 95% Top-3, 100% Top-10.
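Before indexing against a model server, it can help to confirm that the endpoint answers in the OpenAI-compatible embeddings format. The following is a minimal standalone check using only the standard library. It is not how Tessera itself calls the endpoint, and the function names are hypothetical. The URL and model name are the ones used in the command above.

```python
import json
import urllib.request
from typing import List


def build_embedding_request(endpoint: str, model: str, texts: List[str]) -> urllib.request.Request:
    """Build a POST request in the OpenAI-style embeddings format."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(endpoint: str, model: str, texts: List[str], timeout: float = 10.0) -> List[List[float]]:
    """Send the request and return one vector per input text, in input order."""
    request = build_embedding_request(endpoint, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry {"data": [{"index": i, "embedding": [...]}, ...]}.
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

With a server running, `embed("http://localhost:8800/v1/embeddings", "nomic-embed", ["popup trigger registry"])` should return a single vector. For Nomic-embed-text its length should be 768. A connection error or a missing `data` key means the endpoint is not ready for indexing.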


Full Embedding Model Comparison

All 12 fastembed-compatible models tested with VEC+code mode (semantic search, code files only), sorted by reranked MRR. Reranker: Jina-reranker-v1-tiny (130MB) unless noted.

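| Model | Size | Dim | Index Time | VEC MRR | VEC Top-1 | VEC Top-3 | VEC Top-10 | +Rerank MRR | +Rerank Top-1 | +Rerank Top-3 | +Rerank Top-10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BGE-base | 210MB | 768d | 187s | 0.621 | 45% | 85% | 90% | 0.766 | 65% | 85% | 100% |
| GTE-base | 440MB | 768d | 183s | 0.696 | 55% | 80% | 100% | 0.743 | 60% | 85% | 100% |
| BGE-small | 67MB | 384d | 64s | 0.609 | 45% | 80% | 95% | 0.739 | 60% | 85% | 95% |
| Jina-Code | 640MB | 768d | 1109s | 0.475 | 35% | 50% | 70% | 0.716 | 55% | 85% | 95% |
| Arctic-XS | 90MB | 384d | 32s | 0.626 | 45% | 75% | 90% | 0.704 | 60% | 70% | 95% |
| Arctic-S | 130MB | 384d | 59s | 0.623 | 40% | 85% | 100% | 0.704 | 55% | 80% | 100% |
| Arctic-M | 430MB | 768d | 182s | 0.490 | 35% | 55% | 80% | 0.645 | 55% | 65% | 85% |
| MxBAI-large | 640MB | 1024d | 706s | 0.592 | 40% | 80% | 90% | 0.625 | 45% | 75% | 95% |
| Jina-small | 120MB | 512d | 195s | 0.440 | 30% | 55% | 75% | 0.617 | 50% | 70% | 80% |
| MiniLM-L6 | 90MB | 384d | 21s | 0.422 | 25% | 50% | 80% | 0.609 | 45% | 70% | 90% |
| Nomic-full | 520MB | 768d | 1011s | 0.401 | 30% | 40% | 65% | 0.468 | 35% | 55% | 70% |
| Nomic-Q | 130MB | 768d | 790s | 0.346 | 25% | 40% | 50% | 0.399 | 30% | 45% | 60% |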

Self-hosted gateway models (HTTP endpoint required):

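| Model | Dim | VEC MRR | VEC Top-1 | VEC Top-3 | VEC Top-10 | +Rerank MRR | +Rerank Top-1 | +Rerank Top-3 | +Rerank Top-10 |
|---|---|---|---|---|---|---|---|---|---|
| Nomic-embed-text | 768d | 0.696 | 55% | 75% | 100% | 0.854 | 75% | 95% | 100% |
| Qwen3-embed | 1024d | 0.615 | 45% | 70% | 100% | 0.825 | 70% | 100% | 100% |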

Cloud API models (paid, per-token pricing):

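| Model | Dim | Cost | VEC MRR | VEC Top-1 | VEC Top-3 | VEC Top-10 | +Rerank MRR | +Rerank Top-1 | +Rerank Top-3 | +Rerank Top-10 |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI text-embedding-3-large | 1024d | $0.13/1M | 0.566 | 35% | 75% | 100% | 0.722 | 60% | 80% | 100% |
| OpenAI text-embedding-3-large | 3072d | $0.13/1M | 0.571 | 35% | 75% | 100% | 0.687 | 55% | 80% | 100% |
| OpenAI text-embedding-3-small | 1536d | $0.02/1M | 0.558 | 40% | 65% | 80% | 0.668 | 60% | 65% | 85% |
| OpenAI text-embedding-3-small | 512d | $0.02/1M | 0.491 | 35% | 60% | 80% | 0.627 | 55% | 70% | 85% |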

OpenAI embeddings underperform on code search

OpenAI's general-purpose embeddings score below the best local models on this benchmark. Their best configuration (text-embedding-3-large at 1024d + local reranker, 0.722 MRR) loses to the free 67MB BGE-small (0.739 MRR) and to BGE-base (0.766 MRR). OpenAI's models are optimized for general text retrieval, not code search.

Cross-Test: Embedder x Reranker Matrix

The top 3 local embedders (GTE-base, BGE-base, BGE-small) tested against all 4 rerankers, sorted by MRR. The best reranker depends on which embedder you use.

| Embedder | Reranker | Total Size | MRR | Top-1 | Top-3 | Top-10 |
|---|---|---|---|---|---|---|
| GTE-base 768d | Jina-turbo (150MB) | 590MB | 0.825 | 70% | 90% | 100% |
| GTE-base 768d | MiniLM-L12 (120MB) | 560MB | 0.806 | 70% | 95% | 100% |
| GTE-base 768d | MiniLM-L6 (80MB) | 520MB | 0.795 | 70% | 85% | 100% |
| BGE-base 768d | Jina-tiny (130MB) | 340MB | 0.766 | 65% | 85% | 100% |
| GTE-base 768d | Jina-tiny (130MB) | 570MB | 0.743 | 60% | 85% | 100% |
| BGE-small 384d | Jina-tiny (130MB) | 197MB | 0.739 | 60% | 85% | 95% |
| BGE-base 768d | Jina-turbo (150MB) | 360MB | 0.731 | 55% | 95% | 100% |
| BGE-base 768d | MiniLM-L12 (120MB) | 330MB | 0.726 | 60% | 90% | 100% |
| BGE-small 384d | MiniLM-L12 (120MB) | 187MB | 0.721 | 60% | 85% | 95% |
| BGE-base 768d | MiniLM-L6 (80MB) | 290MB | 0.718 | 60% | 85% | 100% |
| BGE-small 384d | Jina-turbo (150MB) | 217MB | 0.708 | 55% | 85% | 95% |
| BGE-small 384d | MiniLM-L6 (80MB) | 147MB | 0.703 | 60% | 75% | 95% |

Reranker interaction matters

Jina-tiny is the best reranker for BGE models (0.739, 0.766) but the worst for GTE-base (0.743 vs 0.825 with Jina-turbo). Always cross-test your specific combination.
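The pairing effect can be read straight off the matrix. This short sketch picks the best reranker for each embedder and reports how much MRR separates the best and worst pairing. The MRR values are copied from the cross-test table; the helper names are illustrative.

```python
from typing import Dict, Tuple

# (embedder, reranker) -> MRR, copied from the cross-test matrix.
CROSS_TEST_MRR: Dict[Tuple[str, str], float] = {
    ("GTE-base", "Jina-turbo"): 0.825,
    ("GTE-base", "MiniLM-L12"): 0.806,
    ("GTE-base", "MiniLM-L6"): 0.795,
    ("GTE-base", "Jina-tiny"): 0.743,
    ("BGE-base", "Jina-tiny"): 0.766,
    ("BGE-base", "Jina-turbo"): 0.731,
    ("BGE-base", "MiniLM-L12"): 0.726,
    ("BGE-base", "MiniLM-L6"): 0.718,
    ("BGE-small", "Jina-tiny"): 0.739,
    ("BGE-small", "MiniLM-L12"): 0.721,
    ("BGE-small", "Jina-turbo"): 0.708,
    ("BGE-small", "MiniLM-L6"): 0.703,
}


def best_reranker_per_embedder(scores: Dict[Tuple[str, str], float]) -> Dict[str, Tuple[str, float]]:
    """Map each embedder to its highest-scoring (reranker, MRR) pair."""
    best: Dict[str, Tuple[str, float]] = {}
    for (embedder, reranker), mrr in scores.items():
        if embedder not in best or mrr > best[embedder][1]:
            best[embedder] = (reranker, mrr)
    return best


def reranker_spread(scores: Dict[Tuple[str, str], float], embedder: str) -> float:
    """Difference between the best and worst reranker MRR for one embedder."""
    values = [mrr for (emb, _), mrr in scores.items() if emb == embedder]
    return round(max(values) - min(values), 3)


for embedder, (reranker, mrr) in best_reranker_per_embedder(CROSS_TEST_MRR).items():
    spread = reranker_spread(CROSS_TEST_MRR, embedder)
    print(f"{embedder}: best with {reranker} ({mrr:.3f}), spread {spread:.3f}")
```

The reranker choice moves GTE-base by 0.082 MRR, more than it moves either BGE model (0.048 and 0.036), so GTE-base is the embedder for which cross-testing matters most.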

Reranker Comparison

All rerankers tested with GTE-base (768d) embeddings.

| Reranker | Size | MRR | Top-1 | Top-3 | Top-10 |
|---|---|---|---|---|---|
| Jina-turbo | 150MB | 0.825 | 70% | 90% | 100% |
| MiniLM-L12 | 120MB | 0.806 | 70% | 95% | 100% |
| MiniLM-L6 | 80MB | 0.795 | 70% | 85% | 100% |
| Jina-tiny | 130MB | 0.743 | 60% | 85% | 100% |

Key Findings

  • Cross-encoder reranking is the single biggest quality lever. It raised MRR for every embedder tested, by +0.03 (MxBAI-large) to +0.24 (Jina-Code); the BGE models and the Nomic gateway setup gained +0.13 to +0.16.
  • Free local models beat paid cloud APIs for code search. BGE-small (67MB, free) scores 0.739 MRR vs OpenAI text-embedding-3-large ($0.13/1M) at 0.722. General-purpose cloud embeddings aren't trained for code retrieval.
  • Bigger is NOT better for local ONNX models. Nomic-full (520MB) and MxBAI-large (640MB) scored worse than BGE-small (67MB). ONNX quantization and model architecture matter more than parameter count.
  • The 200MB default stack (BGE-small + Jina-tiny) is the sweet spot for zero-config. 87% of the gateway's best score, zero server setup, fast indexing (64s).
  • 590MB gets you gateway-level quality locally. GTE-base + Jina-turbo hits 0.825 MRR — matching the Qwen3 gateway setup.
  • Higher dimensions don't guarantee better results. BGE-small (384d) outperforms MxBAI-large (1024d) and OpenAI's 3072d model. Model training quality dominates.
  • Nomic's fastembed ONNX variants perform poorly. Nomic-Q (0.399 MRR) and Nomic-full (0.468 MRR) both score far below Nomic via HTTP gateway (0.854). Because Nomic-full is also weak, quantization is unlikely to be the only cause.
  • Reranker-embedder interaction is real. Jina-tiny pairs best with BGE models; Jina-turbo pairs best with GTE. Always cross-test.
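  • OpenAI dimension reduction helps with reranking. text-embedding-3-large at 1024d (0.722) outperforms full 3072d (0.687) when paired with a cross-encoder reranker, even though the two are nearly tied without reranking (0.566 vs 0.571). The reranker scores query and document text rather than embeddings, so the difference most likely comes from which candidates each variant retrieves; on 20 queries, a 0.035 MRR gap amounts to one or two rank changes.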
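One way to check the size-versus-quality claims above is to compute which local configurations are Pareto-optimal, meaning no smaller configuration scores at least as well. The sketch below uses the total sizes and MRR values from the cross-test matrix. The function name and the sweep approach are illustrative choices, not part of the benchmark scripts.

```python
from typing import List, Tuple

# (configuration, total size in MB, MRR), copied from the cross-test matrix.
CONFIGS: List[Tuple[str, int, float]] = [
    ("GTE-base + Jina-turbo", 590, 0.825),
    ("GTE-base + MiniLM-L12", 560, 0.806),
    ("GTE-base + MiniLM-L6", 520, 0.795),
    ("BGE-base + Jina-tiny", 340, 0.766),
    ("GTE-base + Jina-tiny", 570, 0.743),
    ("BGE-small + Jina-tiny", 197, 0.739),
    ("BGE-base + Jina-turbo", 360, 0.731),
    ("BGE-base + MiniLM-L12", 330, 0.726),
    ("BGE-small + MiniLM-L12", 187, 0.721),
    ("BGE-base + MiniLM-L6", 290, 0.718),
    ("BGE-small + Jina-turbo", 217, 0.708),
    ("BGE-small + MiniLM-L6", 147, 0.703),
]


def pareto_front(configs: List[Tuple[str, int, float]]) -> List[Tuple[str, int, float]]:
    """Keep configurations that beat every smaller configuration on MRR."""
    front: List[Tuple[str, int, float]] = []
    best_mrr = float("-inf")
    # Sweep from smallest to largest; on equal size, consider the higher MRR first.
    for name, size_mb, mrr in sorted(configs, key=lambda c: (c[1], -c[2])):
        if mrr > best_mrr:
            front.append((name, size_mb, mrr))
            best_mrr = mrr
    return front


for name, size_mb, mrr in pareto_front(CONFIGS):
    print(f"{size_mb:>4}MB  {mrr:.3f}  {name}")
```

Seven of the twelve configurations survive, and the three recommended tiers (197MB, 340MB, 590MB) are all among them. The remaining five are each matched or beaten by something smaller; for example, GTE-base + Jina-tiny (570MB, 0.743) loses to BGE-base + Jina-tiny (340MB, 0.766).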

Search Mode Comparison

All modes tested with Nomic 768d via gateway.

| Mode | Description | MRR | Top-1 | Top-3 | Top-10 | Avg Latency |
|---|---|---|---|---|---|---|
| VEC+rerank | Semantic + cross-encoder reranking | 0.854 | 75% | 95% | 100% | 298ms |
| HYB+rerank | Hybrid + cross-encoder reranking | 0.817 | 75% | 90% | 90% | 202ms |
| VEC+code | Semantic only, code files | 0.696 | 55% | 75% | 100% | 13ms |
| VEC+PPR | Semantic + PageRank graph | 0.647 | 45% | 80% | 95% | 16ms |
| HYBRID+code | Keyword + semantic, code files | 0.550 | 40% | 60% | 80% | 9ms |
| LEX-only | FTS5 keyword only | 0.307 | 15% | 25% | 35% | 5ms |

Key findings:

  • Semantic search (VEC) dominates keyword search (LEX) for natural language queries against code.
  • PPR graph ranking helps structural queries (e.g., "singleton registry for popup trigger types" jumps from rank 2 to rank 1) but is gated to only fire when query terms match actual symbol names — preventing noise on conceptual queries.
  • HYBRID mode underperforms VEC-only because FTS5 tokenization doesn't align well with natural language queries against PHP code.
  • Reranking adds ~200-300ms latency but the quality gain is substantial.
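The latency cost of reranking follows from the shape of a two-stage pipeline: a cheap vector comparison selects a candidate pool, then a more expensive scorer evaluates each query-candidate pair one by one. The sketch below shows that shape with toy data. It is not Tessera's implementation: the vectors and texts are made up, and `token_overlap` is only a stand-in for a real cross-encoder model.

```python
import math
from typing import Callable, Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], vectors: Dict[str, List[float]], pool_size: int) -> List[str]:
    """Stage 1: rank every file by embedding similarity and keep the top pool."""
    ranked = sorted(vectors, key=lambda path: cosine(query_vec, vectors[path]), reverse=True)
    return ranked[:pool_size]


def rerank(query: str, candidates: List[str], texts: Dict[str, str],
           score_pair: Callable[[str, str], float]) -> List[str]:
    """Stage 2: rescore each (query, candidate text) pair and reorder the pool."""
    return sorted(candidates, key=lambda path: score_pair(query, texts[path]), reverse=True)


def token_overlap(query: str, text: str) -> float:
    """Stand-in pair scorer; a cross-encoder would run a model on the pair here."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Toy index (hypothetical vectors and chunk texts).
VECTORS = {"Triggers.php": [1.0, 0.0], "Cookies.php": [0.9, 0.1], "Admin.php": [0.0, 1.0]}
TEXTS = {
    "Triggers.php": "singleton registry for popup trigger types",
    "Cookies.php": "popup cookie expiry settings",
    "Admin.php": "admin settings page",
}

pool = retrieve([1.0, 0.2], VECTORS, pool_size=2)
final = rerank("trigger registry", pool, TEXTS, token_overlap)
print(pool)   # ['Cookies.php', 'Triggers.php']
print(final)  # ['Triggers.php', 'Cookies.php']
```

Stage 2 can only reorder what stage 1 returned, so a file that misses the candidate pool cannot be recovered by the reranker. The pool size therefore trades latency (more pairs to score) against recall.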

PPR Graph Ranking

PageRank-based ranking uses the code's reference graph (who-calls-what) to boost structurally important files. Tessera gates PPR activation on symbol name matching — it only fires when query terms match actual symbol names in the index.

| Query | Without PPR | With PPR | Notes |
|---|---|---|---|
| Frontend rendering (Popups.php) | rank 5 | rank 1 | Structural hub |
| Trigger registry (Triggers.php) | rank 2 | rank 1 | High fan-in symbol |
| Newsletter AJAX (Subscribe.php) | rank 9 | rank 7 | Weak structural signal |
| Scheduling (scheduling.php) | rank 4 | rank 6 | PPR noise (gated in current build) |
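The gating idea can be illustrated with a small personalized PageRank over a toy call graph. This is a sketch under stated assumptions, not Tessera's code. The graph, the symbol names, and the matching rule (case-insensitive equality between a query word and a symbol name) are all hypothetical. Edges point from caller to callee.

```python
from typing import Dict, List, Optional, Set

Graph = Dict[str, List[str]]  # caller -> list of callees


def all_symbols(graph: Graph) -> Set[str]:
    """Every symbol that appears as a caller or a callee."""
    return set(graph) | {callee for callees in graph.values() for callee in callees}


def personalized_pagerank(graph: Graph, seeds: Set[str],
                          damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Power iteration where the random walk restarts at the seed symbols."""
    if not seeds:
        raise ValueError("personalized PageRank needs at least one seed")
    nodes = all_symbols(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            callees = graph.get(src, [])
            if callees:
                share = damping * rank[src] / len(callees)
                for callee in callees:
                    nxt[callee] += share
            else:
                # Dangling symbol: hand its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[src] * restart[n]
        rank = nxt
    return rank


def gated_ppr(query: str, graph: Graph) -> Optional[Dict[str, float]]:
    """Run PPR only if a query word equals a symbol name; otherwise return None."""
    words = set(query.lower().split())
    seeds = {s for s in all_symbols(graph) if s.lower() in words}
    if not seeds:
        return None  # gate closed: leave the semantic ranking untouched
    return personalized_pagerank(graph, seeds)


# Toy call graph (hypothetical symbols).
CALL_GRAPH: Graph = {
    "render": ["triggers", "cookies"],
    "admin": ["triggers"],
    "triggers": ["registry"],
    "cookies": [],
}

structural = gated_ppr("popup triggers registry", CALL_GRAPH)
conceptual = gated_ppr("how do popups decide when to open", CALL_GRAPH)
print(structural is not None, conceptual is None)
```

In this sketch a conceptual query yields `None`, so the caller keeps the plain semantic ranking. That is the behavior the scheduling row motivates: without a symbol match, graph scores reflect structure unrelated to the query.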

Reproducing

All benchmarks are fully reproducible using the scripts in the scripts/ directory.

# Self-benchmark (indexes Tessera's own codebase)
uv run python scripts/benchmark_quick.py

# Full PM benchmark — single model (requires Popup Maker source)
uv run python scripts/benchmark_pm.py --all              # gateway models (HTTP endpoint)
uv run python scripts/benchmark_pm.py --provider fastembed --all  # local fastembed models

# Batch benchmark — all 12 local embedding models + 4 rerankers
uv run python scripts/benchmark_all_models.py              # embedding models only
uv run python scripts/benchmark_all_models.py --rerankers   # + reranker comparison

# Cross-test matrix — top embedders × all rerankers
uv run python scripts/benchmark_cross.py

# Cloud API benchmark — OpenAI (requires OPENAI_API_KEY)
uv run python scripts/benchmark_cloud.py                    # all OpenAI variants
uv run python scripts/benchmark_cloud.py --voyage           # + Voyage (needs VOYAGE_API_KEY)

Requirements

  • Local benchmarks: Requires pip install tessera-idx[embed] (installs fastembed). Models auto-download on first run.
  • Gateway benchmarks: Requires an OpenAI-compatible embedding endpoint (e.g., LM Studio, vLLM) at http://localhost:8800/v1/embeddings
  • Cloud benchmarks: Requires OPENAI_API_KEY env var. Optional VOYAGE_API_KEY for Voyage models. API costs are minimal (~$0.50 for all 4 OpenAI model variants).
  • PM benchmark: Requires Popup Maker core + Pro source code at ~/Projects/ProContent/ProductCode/popup-maker{,-pro}