Memory Retrieval Latency
<150ms
Under 150ms p95 for semantic memory recall.
Memory retrieval latency measures the time to fetch the top-10 semantically relevant facts from a user's memory store via pgvector cosine distance search. This operation runs on every non-trivial user message to inject personal context into the LLM system prompt — preferences, past interactions, dietary restrictions, favorite venues, and other learned facts. The sub-150ms p95 means that 95% of memory retrievals complete in under 150 milliseconds.
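The core operation, ranking a user's stored facts by cosine distance and keeping the top 10, can be sketched in plain Python. This is illustrative only: production runs the equivalent query inside PostgreSQL with pgvector's HNSW index rather than scanning in application code, and the `top_k_facts` / `cosine_distance` names are ours, not part of any library.

```python
import math

def cosine_distance(a, b):
    # Mirrors pgvector's cosine distance: 1 - cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_k_facts(query_vec, memory, k=10):
    # memory: list of (fact_text, embedding) pairs from one user's partition.
    ranked = sorted(memory, key=lambda item: cosine_distance(query_vec, item[1]))
    return [fact for fact, _ in ranked[:k]]
```

An HNSW index gives approximately this ranking without comparing the query against every stored vector, which is what keeps stage 2 of the pipeline around 30ms.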
The retrieval pipeline has three stages: (1) embed the user's message using text-embedding-3-small via the OpenAI API (~60ms median), (2) run a pgvector cosine distance query against the user's memory partition (~30ms median with HNSW index), and (3) format and inject the top-10 results into the system prompt (~5ms). The embedding API call dominates latency, which is why we cache recent embeddings in Redis with a 1-hour TTL.
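The three stages compose into one small function. A minimal sketch, with the embedding call and the vector search passed in as callables; in production those would be the OpenAI client and a pgvector query, and the `retrieve_memory_context` name and prompt wording here are illustrative assumptions, not the actual implementation.

```python
def retrieve_memory_context(message, embed, search, k=10):
    # Stage 1 (~60ms median): embed the user's message
    # (text-embedding-3-small via the OpenAI API in production).
    vec = embed(message)
    # Stage 2 (~30ms median): cosine-distance search over the
    # user's memory partition, top-k results.
    facts = search(vec, k)
    # Stage 3 (~5ms): format the facts for the system prompt.
    lines = "\n".join(f"- {fact}" for fact in facts)
    return f"Known facts about this user:\n{lines}"
```

Keeping the stages as injected callables also makes the hot path easy to test with stubs, without a live API key or database.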
Fast memory retrieval is critical because it sits in the hot path of every conversation turn. If memory is slow, the entire response is slow. We've optimized aggressively: HNSW indexes with ef_search=40 (lower than venue search because memory stores are smaller per-user), connection pooling to avoid PostgreSQL connection setup overhead, and the embedding cache that hits ~35% of the time for users in active conversation.
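The embedding cache follows a standard cache-aside pattern with a TTL. A sketch in plain Python standing in for Redis (which handles the TTL and eviction server-side via `SET ... EX 3600`); the `EmbeddingCache` class and its method names are ours for illustration.

```python
import time

CACHE_TTL_S = 3600  # 1-hour TTL, matching the Redis cache described above

class EmbeddingCache:
    def __init__(self, ttl_s=CACHE_TTL_S, clock=time.monotonic):
        self._store = {}   # text -> (inserted_at, embedding)
        self._ttl = ttl_s
        self._clock = clock

    def get_or_embed(self, text, embed_fn):
        now = self._clock()
        hit = self._store.get(text)
        if hit is not None and now - hit[0] < self._ttl:
            # Cache hit: skip the ~60ms embedding API call entirely.
            return hit[1]
        vec = embed_fn(text)
        self._store[text] = (now, vec)
        return vec
```

At a ~35% hit rate this removes the dominant ~60ms stage from roughly a third of retrievals, which is a meaningful pull on the p95.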
Methodology
See it in action.
Under 150ms p95, measured in production. Try the live scan demo or explore more benchmarks.