Search Internals

All search in Compass is Postgres-native. No Elasticsearch, no external vector database.

Full-Text Search (tsvector)

Entities have a search_vector generated column with weighted fields:

Weight | Fields      | Purpose
A      | URN, name   | Highest relevance — exact identity matches
B      | Description | Content relevance
C      | Source      | Metadata-level matching

A GIN index on search_vector enables fast full-text queries. PostgreSQL's ts_rank function scores results.
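To build intuition for the weighting, here is a deliberately naive Python sketch. It is not ts_rank (which also accounts for term frequency, proximity, and normalization); it only shows why a hit on an A-weighted field outranks a hit on a lower-weighted one. The entity shapes are illustrative, but the weight values match PostgreSQL's defaults ({D: 0.1, C: 0.2, B: 0.4, A: 1.0}).

```python
# Toy weighted-field ranking. Postgres's default ts_rank weights are
# {D: 0.1, C: 0.2, B: 0.4, A: 1.0}; we use the A/B/C subset from the table.
WEIGHTS = {"A": 1.0, "B": 0.4, "C": 0.2}

def toy_rank(query: str, entity: dict) -> float:
    """Sum the weight of every field group whose text contains the query term."""
    fields = {
        "A": [entity.get("urn", ""), entity.get("name", "")],
        "B": [entity.get("description", "")],
        "C": [entity.get("source", "")],
    }
    score = 0.0
    for weight_class, texts in fields.items():
        if any(query.lower() in t.lower() for t in texts):
            score += WEIGHTS[weight_class]
    return score

# Hypothetical entities: a name/URN match beats a description-only match.
orders_table = {"urn": "urn:pg:orders", "name": "orders",
                "description": "Customer order facts", "source": "warehouse"}
billing_doc = {"urn": "urn:doc:billing", "name": "billing",
               "description": "Explains how orders are invoiced", "source": "wiki"}

assert toy_rank("orders", orders_table) > toy_rank("orders", billing_doc)
```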

Fuzzy Matching (pg_trgm)

The pg_trgm extension provides trigram-based similarity matching. This handles:

  • Typos (ordrs matches orders)
  • Partial terms (ord matches orders)
  • Similar names across different naming conventions

Trigram indexes are GIN-based and work alongside tsvector.
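The similarity metric itself is simple enough to sketch in pure Python, for intuition only: pg_trgm lowercases the string, pads it with two leading spaces and one trailing space, extracts all 3-character substrings, and scores two strings by the Jaccard similarity of their trigram sets (word-boundary handling in the real extension is slightly richer).

```python
# Pure-Python sketch of pg_trgm's trigram similarity metric.
def trigrams(s: str) -> set[str]:
    # pg_trgm pads with two leading spaces and one trailing space.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    # Jaccard similarity: shared trigrams over all distinct trigrams.
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# The typo example from above still clears pg_trgm's default 0.3 threshold:
print(round(similarity("ordrs", "orders"), 2))  # → 0.44
```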

Semantic Search (pgvector)

Vector embeddings are stored in an embeddings table with HNSW indexes for cosine similarity. The embedding pipeline:

  1. Entity/document content is serialized into text
  2. Text is split into token-aware chunks with overlap
  3. Chunks are embedded via the configured provider (OpenAI or Ollama)
  4. Embeddings are stored and indexed

Semantic search finds conceptually related entities even when exact terms don't overlap.
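A minimal sketch of the retrieval step, assuming embeddings are plain float vectors. In Compass the vectors live in Postgres and the HNSW index answers the nearest-neighbor query; here we brute-force cosine similarity over a toy corpus (the URNs and 3-dimensional vectors are invented for illustration — real embeddings have hundreds of dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings keyed by URN.
corpus = {
    "urn:doc:adcs-recovery": [0.9, 0.1, 0.0],
    "urn:doc:billing":       [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # embedding of a query like "attitude control failure"

ranked = sorted(corpus, key=lambda urn: cosine_similarity(query_vec, corpus[urn]),
                reverse=True)
print(ranked[0])  # → urn:doc:adcs-recovery
```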

Chunking

Large documents can't be embedded as a single vector — detail is lost and token limits are exceeded. Compass chunks content before embedding:

Entities are serialized into a single text block (name, type, URN, source, description, properties) and embedded as one chunk. Most entities fit within the 512-token default.

Documents go through a multi-step split:

  1. Structural split — Split on markdown headings (h1-h4). Each section becomes a candidate chunk.
  2. Size check — Sections under the token limit (default 512) become one chunk. Oversized sections are split further on paragraph boundaries.
  3. Contextual prefix — Each chunk is prefixed with breadcrumb context (e.g., Document: ADCS Recovery > Section: Recovery Steps). This is the single biggest quality improvement — a chunk about "verify mode via HK telemetry" is meaningless alone but specific with context.
  4. Overlap — Adjacent chunks share ~50 tokens (configurable) to avoid losing information at boundaries.
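The four steps above can be sketched as one function, under simplifying assumptions: "tokens" are whitespace-separated words (a real tokenizer differs), content starts with a heading, and oversized sections use a fixed sliding window rather than paragraph boundaries. The function and parameter names are illustrative, not Compass's actual API.

```python
import re

def chunk_document(title: str, markdown: str, max_tokens: int = 512,
                   overlap: int = 50) -> list[str]:
    chunks = []
    # 1. Structural split on markdown headings (h1-h4).
    sections = re.split(r"^#{1,4} ", markdown, flags=re.MULTILINE)
    for section in filter(None, sections):
        heading, _, body = section.partition("\n")
        # 3. Contextual prefix: breadcrumb carried into every chunk.
        prefix = f"Document: {title} > Section: {heading.strip()}\n"
        tokens = body.split()
        # 2. Size check: sections under the limit become a single chunk.
        if len(tokens) <= max_tokens:
            chunks.append(prefix + body.strip())
            continue
        # 4. Overlap: adjacent windows share `overlap` tokens.
        step = max_tokens - overlap
        for start in range(0, len(tokens), step):
            chunks.append(prefix + " ".join(tokens[start:start + max_tokens]))
    return chunks

chunks = chunk_document("ADCS Recovery",
                        "# Recovery Steps\nverify mode via HK telemetry")
print(chunks[0])  # → Document: ADCS Recovery > Section: Recovery Steps\n...
```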

Chunks are an indexing mechanism, not knowledge. They're stored separately from entities and documents, linked by URN, and are fully rebuildable.

Hybrid Ranking (RRF)

Reciprocal Rank Fusion combines keyword and semantic results:

RRF_score(d) = 1/(k + rank_keyword(d)) + 1/(k + rank_semantic(d))

Where k is a constant (typically 60). Documents that rank well in both lists get the highest combined score. This balances keyword precision with semantic recall.
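The formula is small enough to show as a pure function. Ranks are 1-based; a document missing from one list simply contributes nothing for that list, which is the usual RRF convention (an assumption here, not confirmed for Compass). The table names are invented for the example.

```python
def rrf(keyword_ranked: list[str], semantic_ranked: list[str],
        k: int = 60) -> dict[str, float]:
    """Reciprocal Rank Fusion over two ranked lists of document IDs."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

keyword  = ["orders", "order_items", "invoices"]
semantic = ["order_items", "shipments", "orders"]

scores = rrf(keyword, semantic)
# order_items ranks 2nd and 1st → 1/62 + 1/61, the best combined score.
print(max(scores, key=scores.get))  # → order_items
```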

Graph Traversal

Context assembly and impact analysis use PostgreSQL recursive CTEs:

  • Context (GetBidirectional) — Bidirectional traversal from a root entity, following edges in both directions up to the specified depth. Uses a path array for cycle detection and DISTINCT for deduplication. Results capped at 1000 edges.
  • Impact (GetDownstream) — Unidirectional downstream traversal. Same CTE pattern, single direction.

Both use the frontier pattern: each recursive step expands from the frontier URNs discovered in the previous step, checking each candidate against the accumulated path to prevent cycles.
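As an in-memory sketch of what the recursive CTEs compute (names and the edge representation are illustrative): edges are (src, dst) URN pairs, each iteration expands the frontier one hop, a per-path tuple plays the role of the CTE's path array for cycle detection, and results are capped.

```python
from collections import defaultdict

def traverse(edges: list[tuple[str, str]], root: str, depth: int,
             bidirectional: bool = True,
             max_edges: int = 1000) -> set[tuple[str, str]]:
    out, inc = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
        inc[dst].append(src)
    found: set[tuple[str, str]] = set()  # set membership = DISTINCT dedup
    frontier = [(root, (root,))]         # (urn, path walked so far)
    for _ in range(depth):
        next_frontier = []
        for urn, path in frontier:
            neighbors = list(out[urn]) + (list(inc[urn]) if bidirectional else [])
            for nxt in neighbors:
                if nxt in path:          # path check prevents cycles
                    continue
                found.add((urn, nxt))
                next_frontier.append((nxt, path + (nxt,)))
                if len(found) >= max_edges:  # result cap (1000 in Compass)
                    return found
        frontier = next_frontier
    return found

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("x", "a")]
# Downstream-only (impact) from "a", two hops; the c→a cycle edge is skipped.
print(sorted(traverse(edges, "a", depth=2, bidirectional=False)))
# → [('a', 'b'), ('b', 'c')]
```

Setting bidirectional=True from the same root also reaches x and c in one hop, mirroring the context-assembly traversal.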