
Memory Retrieval Latency

<150ms

Under 150ms p95 for semantic memory recall.

Stable

Memory retrieval latency measures the time to fetch the top-10 semantically relevant facts from a user's memory store via pgvector cosine distance search. This operation runs on every non-trivial user message to inject personal context into the LLM system prompt — preferences, past interactions, dietary restrictions, favorite venues, and other learned facts. The sub-150ms p95 means that 95% of memory retrievals complete in under 150 milliseconds.

The retrieval pipeline has three stages: (1) embed the user's message using text-embedding-3-small via the OpenAI API (~60ms median), (2) run a pgvector cosine distance query against the user's memory partition (~30ms median with HNSW index), and (3) format and inject the top-10 results into the system prompt (~5ms). The embedding API call dominates latency, which is why we cache recent embeddings in Redis with a 1-hour TTL.
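The three stages above can be sketched in a few lines of Python. This is a minimal, illustrative sketch: `retrieve_memories` uses a brute-force scan where production runs a pgvector HNSW query, and the function and field names (`fact`, `embedding`) are assumptions, not the actual schema.

```python
import math

def cosine_distance(a, b):
    """Cosine distance, matching pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def retrieve_memories(query_embedding, memory_store, k=10):
    """Stage 2: rank a user's stored facts by cosine distance.

    Brute force here for clarity; production uses an HNSW index,
    which returns approximately the same top-k much faster.
    """
    ranked = sorted(
        memory_store,
        key=lambda m: cosine_distance(query_embedding, m["embedding"]),
    )
    return ranked[:k]

def format_context(memories):
    """Stage 3: render the top-k facts for injection into the system prompt."""
    lines = [f"- {m['fact']}" for m in memories]
    return "Known facts about this user:\n" + "\n".join(lines)
```

Stage 1 (the embedding call) is omitted because it is a plain API request; the latency story is dominated by that call, which motivates the cache described next.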

Fast memory retrieval is critical because it sits in the hot path of every conversation turn. If memory is slow, the entire response is slow. We've optimized aggressively: HNSW indexes with ef_search=40 (lower than venue search because memory stores are smaller per-user), connection pooling to avoid PostgreSQL connection setup overhead, and the embedding cache that hits ~35% of the time for users in active conversation.
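The embedding cache mentioned above can be sketched as a TTL key-value store keyed by a hash of the message text. This is an assumed shape, not the production implementation: a plain dict stands in for Redis, and the injectable `clock` exists only to make expiry testable.

```python
import hashlib
import time

class EmbeddingCache:
    """Sketch of a TTL embedding cache (1-hour TTL in production).

    Keys are a SHA-256 of the message text so identical messages in an
    active conversation hit the cache and skip the embedding API call.
    """

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text):
        entry = self._store.get(self._key(text))
        if entry is None:
            return None
        embedding, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            # Expired: drop the entry and report a miss.
            del self._store[self._key(text)]
            return None
        return embedding

    def put(self, text, embedding):
        self._store[self._key(text)] = (embedding, self.clock())
```

With a ~35% hit rate, the ~60ms median embedding step drops to a dictionary lookup on roughly a third of retrievals, which is where most of the p95 headroom comes from.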

Methodology

Measured via structured log spans wrapping the full retrieval pipeline (embed + query + format). Spans are tagged with user_id (hashed), memory_store_size, and cache_hit status. Percentiles are computed from a Redis-backed latency histogram updated on every retrieval. The <150ms figure is the p95 computed over a 24-hour rolling window. We exclude the first retrieval after a cold start (when the PostgreSQL connection pool is being warmed), as it can spike to 400ms+ due to connection establishment overhead.
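The percentile computation described above amounts to a nearest-rank p95 over eligible spans. The sketch below assumes each span is a dict with `duration_ms` and `cold_start` fields; those names are illustrative, not the actual log schema.

```python
import math

def p95_latency(spans):
    """Nearest-rank p95 over retrieval spans, excluding cold starts.

    Mirrors the methodology: cold-start retrievals (connection pool
    warming) are filtered out before the percentile is taken.
    """
    samples = sorted(
        s["duration_ms"] for s in spans if not s["cold_start"]
    )
    if not samples:
        raise ValueError("no eligible samples")
    rank = math.ceil(0.95 * len(samples)) - 1
    return samples[rank]
```

In production the histogram is updated incrementally rather than sorted per query, but the resulting p95 is the same quantity this function computes.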


See it in action.

Under 150ms p95 — real numbers from production. Try the live scan demo or explore more benchmarks.