How to Implement Semantic Search

Search that understands meaning - the bi-encoder, hybrid, and reranking architecture that beats keyword search.

Keyword search (BM25/Elasticsearch) is fast and reliable but literal. It cannot find relevant documents when the user's query and the document use different words for the same concept. Semantic search solves this - but a naive implementation trades precision for recall in ways that frustrate users. This guide covers the hybrid architecture that captures the best of both.

No fluff. Production-grade answers from engineers who ship AI into real products.

Bi-Encoder vs Cross-Encoder: The Architecture Tradeoff

Bi-encoder: encode the query and the documents independently, then compare their embeddings with cosine similarity. This is fast because documents can be pre-encoded and indexed, which makes it suitable for retrieval over large corpora, but precision is lower than with a cross-encoder.

Cross-encoder: encode the query and a document together and predict a relevance score. Precision is much higher, but it is much slower because every query-document pair needs its own forward pass, so it is not suitable for first-stage retrieval over large corpora.

Production pattern: use a bi-encoder for retrieval (top-100 candidates), then a cross-encoder for reranking (final top-10).
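The retrieve-then-rerank pattern can be sketched in a few lines. This is a minimal illustration, not a production implementation: the hand-written 2-d vectors stand in for real bi-encoder embeddings, and `overlap_score` is a hypothetical stand-in for a cross-encoder. The function names `cosine`, `retrieve`, and `rerank` are also our own.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, doc_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1 (bi-encoder): rank pre-encoded documents by cosine similarity."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: List[str], texts: Dict[str, str],
           score_pair: Callable[[str, str], float], k: int) -> List[str]:
    """Stage 2 (cross-encoder): score each (query, document) pair jointly."""
    ranked = sorted(candidates, key=lambda d: score_pair(query, texts[d]), reverse=True)
    return ranked[:k]


# Toy corpus; in a real system the vectors come from the bi-encoder at index time.
texts = {
    "a": "reset your account password",
    "b": "change login credentials",
    "c": "quarterly sales report",
}
doc_vecs = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9]}


def overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: count of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


query = "change password credentials"
query_vec = [1.0, 0.2]  # assumed output of the same bi-encoder used for doc_vecs

candidates = retrieve(query_vec, doc_vecs, k=2)             # cheap, wide net
top = rerank(query, candidates, texts, overlap_score, k=1)  # expensive, narrow
```

On this toy data the first stage keeps documents "a" and "b", and the second stage promotes "b" over "a". The point of the split is cost: the expensive pair scorer only ever sees the short candidate list, never the whole corpus.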

At Valletta Software, we focus on:

Bi-encoder retrieval: SBERT (sentence-transformers) for encoding - pre-encode all documents at index time

Hybrid search: combine BM25 keyword score and vector similarity score - neither alone is best

Reciprocal Rank Fusion: merge BM25 and vector retrieval result lists - simple, effective, no training needed

Cross-encoder reranking: Cohere Rerank or ms-marco-MiniLM - rerank top-100 to final top-10

Query expansion: LLM-generated query variants - retrieve with multiple queries and combine the results

Sparse and dense: SPLADE for sparse neural retrieval - better than BM25 for some domains

Evaluation: NDCG, MRR, and Recall@K on a golden query set - measure before and after every change
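To make the fusion and evaluation items above concrete, here is a small sketch that merges a BM25 ranking and a vector ranking with Reciprocal Rank Fusion, then scores the result against a golden query. Everything here is illustrative: the document ids, relevance labels, and function names are invented, and `k=60` is a commonly used default for the RRF constant rather than a tuned value.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def rrf(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant hit; MRR is the mean of this over all queries."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance; unlabeled documents count as gain 0."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


# One golden query with invented rankings and labels.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
gains = {"d3": 2.0, "d1": 1.0}   # graded relevance from the golden set
relevant = set(gains)

fused = rrf([bm25_ranking, dense_ranking])

recall_bm25 = recall_at_k(bm25_ranking, relevant, k=2)
recall_fused = recall_at_k(fused, relevant, k=2)
ndcg_bm25 = ndcg_at_k(bm25_ranking, gains, k=2)
ndcg_fused = ndcg_at_k(fused, gains, k=2)
```

In this example the fused list places both relevant documents in the top 2, while BM25 alone finds only one of them. Running the same metric functions over the full golden set before and after a change is what turns "it feels better" into a number.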

The Indexing and Serving Architecture That Scales

Search must be fast. Slow search is abandoned search.

Elasticsearch / OpenSearch: BM25 baseline with dense_vector field for hybrid - single infrastructure
Dedicated vector DB: Qdrant or Weaviate for pure semantic search - better ANN performance
Index freshness: incremental upsert pipeline - new documents searchable within minutes, not hours
Query latency SLA: p95 under 200ms for user-facing search - measure in production, not just dev
Caching: cache popular query results with TTL - significant load reduction for head queries
Faceted filtering: filter before vector search - category, date, and author filters reduce the candidate set
Analytics: log queries, results, and clicks - train the reranker and improve relevance with real user signals
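The caching item above can be sketched as a small in-memory cache with a time-to-live. This is an assumption-laden illustration: the `TTLCache` class and `fake_search` function are hypothetical, the store is unbounded, and a production setup would more likely use a shared cache with a size limit. The clock is injectable only so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Tiny in-memory query-result cache with a per-entry time-to-live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get_or_compute(self, query: str,
                       search: Callable[[str], List[str]]) -> List[str]:
        # Normalise case and whitespace so trivially different queries share an entry.
        key = " ".join(query.lower().split())
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]              # fresh entry: skip the search backend
        results = search(query)        # miss or expired: run the real search
        self._store[key] = (now, results)
        return results


# Demo with a fake backend and a hand-controlled clock.
backend_calls: List[str] = []


def fake_search(q: str) -> List[str]:
    backend_calls.append(q)
    return ["doc-1", "doc-2"]


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])

cache.get_or_compute("Reset Password", fake_search)    # miss: hits the backend
cache.get_or_compute("reset   password", fake_search)  # hit: same normalised key
fake_now[0] = 61.0
cache.get_or_compute("reset password", fake_search)    # expired: hits the backend again
```

The TTL is the knob that trades load reduction against index freshness: a cached result list cannot reflect documents indexed after it was stored, so the TTL should not exceed the freshness target for the index.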

Build RAG pipelines, agents, and LLM integrations from day one

Ship AI features 3x faster with AI-native tooling and methodology

Deploy to production - not just Jupyter notebooks and prototypes

Evaluate output quality - hallucination detection, cost optimization, monitoring

How to Implement Semantic Search - With Engineers Who Build It in Production

Forget the hype. We make AI work in the real world.

Our engineers are trained in the latest AI tooling - Copilot, Claude Code, Cursor, LangChain, and vector databases - and use them daily to ship production AI features, not just prototypes.

Choose from a solo dev, mini team, or full squad. All powered by AI and ready to build from day one.

Let's keep it simple.

Our AI engineers build semantic search with Elasticsearch hybrid (BM25 plus dense vector), bi-encoder retrieval, Cohere reranking, and NDCG evaluation on golden query sets.

Ready to Ship AI into Production? Let's Build It.

Our AI engineers have done this before - RAG pipelines, LLM integrations, agents, MLOps. On real products, under real deadlines.

Rates from EUR 45/h • Free consultation • No commitment required • Response within 24 hours