大模型应用开发实战:语义缓存 — 降低 LLM 调用成本 70%

发布时间:2026/6/29 22:35:07
大模型应用开发实战:语义缓存 — 降低 LLM 调用成本 70% 一、问题同样的答案你付费了 1000 次用户 A: Python 怎么读取 CSV 文件 → LLM → $0.0003 用户 B: How to read CSV in Python? → LLM → $0.0003 用户 C: Python 读取csv文件的方法 → LLM → $0.0003 用户 D: python read csv file example → LLM → $0.0003四个用户问了本质相同的问题LLM 答了 4 次你付了 4 次钱。如果有 100 万用户30% 问题语义相似——每年多花 10 万美元。LLM 缓存的特殊挑战Redis 精确匹配只能命中完全相同的字符串同义改写“读CSV” vs “read csv”精确匹配会 Miss温度参数导致相同 prompt 可能不同输出解决方案精确缓存 语义缓存。二、两级缓存架构┌──────────────────────────┐ │ Cache Manager │ │ │ Request ──►│ ┌────────────────────┐ │ │ │ L1: Exact Cache │ │ 命中率 ~15% │ │ · MD5 hash │ │ 延迟 1ms │ │ · 本地 LRU │ │ │ └───────┬────────────┘ │ │ │ Miss │ │ ┌───────▼────────────┐ │ │ │ L2: Semantic Cache │ │ 命中率 ~25% │ │ · Embedding sim │ │ 延迟 ~10ms │ │ · Redis FAISS │ │ │ └───────┬────────────┘ │ │ │ Miss │ │ ┌───────▼────────────┐ │ │ │ L3: LLM API │ │ 回源 │ └────────────────────┘ │ └──────────────────────────┘三、完整实现# semantic_cache.py - 两级 LLM 语义缓存# pip install openai numpy sentence-transformers cachetoolsimporthashlibimportjsonimporttimeimportnumpyasnpfromtypingimportOptional,Dict,Any,List,Tuplefromdataclassesimportdataclass,fieldfromcachetoolsimportLRUCachefromopenaiimportOpenAI# # 1. 精确缓存 (L1)# dataclassclassCacheEntry:# 缓存条目query:strresponse:strusage:Dict[str,int]cost_saved:floatcreated_at:floatfield(default_factorytime.time)hit_count:int0classExactCache:# L1: 基于 MD5 的精确匹配缓存def__init__(self,max_size:int10000):self._cacheLRUCache(maxsizemax_size)def_key(self,messages:List[Dict],model:str,temperature:float)-str:rawjson.dumps({messages:messages,model:model,temperature:temperature,},sort_keysTrue,ensure_asciiFalse)returnhashlib.md5(raw.encode()).hexdigest()defget(self,messages:List[Dict],model:str,temperature:float0.0)-Optional[CacheEntry]:keyself._key(messages,model,temperature)entryself._cache.get(key)ifentry:entry.hit_count1returnentrydefset(self,messages:List[Dict],model:str,temperature:float,response:str,usage:Dict,cost:float):keyself._key(messages,model,temperature)entryCacheEntry(querymessages[-1][content]ifmessageselse,responseresponse,usageusage,cost_savedcost,)self._cache[key]entrydefstats(self)-dict:total_hitssum(e.hit_countforeinself._cache.values())return{size:len(self._cache),max_size:self._cache.maxsize,total_hits:total_hits,}# # 2. 语义缓存 (L2)# classSemanticCache:# L2: 基于 Embedding 余弦相似度的语义缓存# 可选: 用本地 sentence-transformers 代替 OpenAI Embedding APIdef__init__(self,embedding_model:strtext-embedding-3-small,similarity_threshold:float0.92,max_size:int50000,use_local:boolFalse):self.similarity_thresholdsimilarity_threshold self.embedding_modelembedding_model self.max_sizemax_size self.use_localuse_local# 本地向量存储self._embeddings:List[np.ndarray][]self._entries:List[CacheEntry][]# 本地模型 (更快更便宜)self._local_modelNoneifuse_local:try:fromsentence_transformersimportSentenceTransformer self._local_modelSentenceTransformer(all-MiniLM-L6-v2)exceptImportError:passself._openaiNonedef_get_client(self):ifself._openaiisNone:self._openaiOpenAI()returnself._openaidef_get_embedding(self,text:str)-np.ndarray:# 本地模型优先ifself._local_model:embself._local_model.encode(text,normalize_embeddingsTrue)returnnp.array(emb)# 否则用 OpenAI APIclientself._get_client()respclient.embeddings.create(modelself.embedding_model,inputtext,)embnp.array(resp.data[0].embedding)returnemb/np.linalg.norm(emb)defsearch(self,query:str)-Optional[CacheEntry]:ifnotself._embeddings:returnNonequery_embself._get_embedding(query)similaritiesnp.dot(np.array(self._embeddings),query_emb)best_idxint(np.argmax(similarities))best_scorefloat(similarities[best_idx])ifbest_scoreself.similarity_threshold:entryself._entries[best_idx]entry.hit_count1returnentryreturnNonedefadd(self,query:str,response:str,usage:Dict,cost:float):embself._get_embedding(query)iflen(self._embeddings)self.max_size:# FIFO 淘汰self._embeddings.pop(0)self._entries.pop(0)self._embeddings.append(emb)self._entries.append(CacheEntry(queryquery,responseresponse,usageusage,cost_savedcost,))defstats(self)-dict:total_hitssum(e.hit_countforeinself._entries)return{size:len(self._entries),total_hits:total_hits,}# # 3. 两级缓存管理器# dataclassclassCacheStats:l1_hits:int0l2_hits:int0misses:int0total_cost_saved:float0.0propertydeftotal_requests(self)-int:returnself.l1_hitsself.l2_hitsself.missespropertydefhit_rate(self)-float:ifself.total_requests0:return0.0return(self.l1_hitsself.l2_hits)/self.total_requestsdefsummary(self)-str:return(fRequests:{self.total_requests}| fHit Rate:{self.hit_rate:.1%}| fL1:{self.l1_hits}L2:{self.l2_hits}Miss:{self.misses}| fCost Saved: ${self.total_cost_saved:.4f})classCachedLLM:# 带两级缓存的 LLM 客户端def__init__(self,openai_client:OpenAI,exact_cache:ExactCacheNone,semantic_cache:SemanticCacheNone,auto_cache:boolTrue):self.clientopenai_client self.exactexact_cacheorExactCache()self.semanticsemantic_cacheorSemanticCache()self.auto_cacheauto_cache self.statsCacheStats()defchat(self,messages:List[Dict[str,str]],model:strgpt-4o-mini,temperature:float0.0,max_tokens:int4096,enable_semantic:boolTrue,**kwargs)-Tuple[str,Dict]:# 返回 (response_text, usage_dict)# 提取最后一条 user messageuser_queryforminreversed(messages):ifm[role]user:user_querym[content]break# L1: 精确缓存cachedself.exact.get(messages,model,temperature)ifcached:self.stats.l1_hits1self.stats.total_cost_savedcached.cost_savedreturncached.response,cached.usage# L2: 语义缓存ifenable_semanticanduser_query:cachedself.semantic.search(user_query)ifcached:self.stats.l2_hits1self.stats.total_cost_savedcached.cost_savedreturncached.response,cached.usage# L3: 调用 LLMself.stats.misses1respself.client.chat.completions.create(modelmodel,messagesmessages,temperaturetemperature,max_tokensmax_tokens,**kwargs,)response_textresp.choices[0].message.content usage{prompt_tokens:resp.usage.prompt_tokensifresp.usageelse0,completion_tokens:resp.usage.completion_tokensifresp.usageelse0,total_tokens:resp.usage.total_tokensifresp.usageelse0,}# 估算成本rates{gpt-4o-mini:(0.15,0.60)}# per 1M tokensinput_rate,output_raterates.get(model,(0.15,0.60))cost(usage[prompt_tokens]/1_000_000*input_rateusage[completion_tokens]/1_000_000*output_rate)# 写入缓存ifself.auto_cache:self.exact.set(messages,model,temperature,response_text,usage,cost)ifenable_semanticanduser_query:self.semantic.add(user_query,response_text,usage,cost)returnresponse_text,usage# # 4. 基准测试# if__name____main__:print(*60)print(语义缓存基准测试 (模拟))print(*60)cached_llmCachedLLM(openai_clientOpenAI(api_keysk-fake),exact_cacheExactCache(max_size1000),semantic_cacheSemanticCache(similarity_threshold0.85),auto_cacheFalse,)queries[Python 如何读取 CSV 文件?,How to read CSV file in Python?,Python 读取csv文件的方法,python read csv example,read csv using pandas python,Java 怎么读取 CSV?,Python 如何写入 JSON 文件?,python csv read tutorial,how to parse csv in python,What is the meaning of life?,]fake_response(import csv\nwith open(file.csv) as f:\n reader csv.reader(f))fake_usage{prompt_tokens:50,completion_tokens:30,total_tokens:80}fake_cost50/1_000_000*0.1530/1_000_000*0.60# 预热cached_llm.exact.set([{role:user,content:queries[0]}],gpt-4o-mini,0.0,fake_response,fake_usage,fake_cost,)print(f\n已预热精确缓存: {queries[0]})fori,qinenumerate(queries):print(f\n--- Query{i1}: {q} ---)l1cached_llm.exact.get([{role:user,content:q}],gpt-4o-mini,0.0)ifl1:print( L1 HIT! (exact))cached_llm.stats.l1_hits1cached_llm.stats.total_cost_savedl1.cost_savedcontinuecached_llm.stats.misses1ifqinqueries[:5]orqinqueries[7:9]:print( L2 WOULD HIT (similarity 0.85))else:print( L2 miss, would call LLM)print(f\n{*60})print(统计汇总:)print(f 10 条查询, 约 6 条可从缓存命中)print(f 理论节省: ~60% LLM 调用)四、缓存策略配置# 生产环境推荐配置classCacheConfig:# L1 精确缓存: temperature0 的确定性任务L1_MAX_SIZE10000# 本地 LRU, 内存 ~50MB# L2 语义缓存L2_SIMILARITY_THRESHOLD0.92# 推荐区间 0.90-0.95L2_MAX_SIZE100000# Redis FAISS# 跳过缓存的情况SKIP_IF_TEMPERATURE_GT0.3# temp 0.3 不缓存SKIP_FOR_TOOL_CALLSTrue# tool call 不缓存SKIP_FOR_STREAMINGTrue# 流式不缓存相似度阈值选择指南阈值命中率准确率适用场景0.98~5%极高金融/医疗零容忍0.92~20%高通用生产 (推荐)0.85~35%中客服/FAQ 容错场景0.75~50%低不推荐五、成本收益分析日均 100 万次请求, 每次 $0.0003:缓存层命中率日节省年节省仅 L1 (精确)~15%$45$16,425L1 L2 (语义)~40%$120$43,800L1 L2 优化~55%$165$60,225额外成本:Embedding API: ~$0.02/1M tokens → 约 $10-20/月本地模型 (all-MiniLM-L6-v2): 免费, CPU 即可Redis 内存: ~2GB (10 万条向量)ROI: 额外成本 $50/月, 年节省 $4-6 万。六、生产化注意事项缓存一致性— 模型升级后清空语义缓存 (embedding 可能变化)温度参数— temperature 0 不缓存 (输出随机)Tool calls— 函数调用不缓存 (结果可能过时)多租户—key tenant_id:hash隔离监控— 命中率 / 节省成本 / 相似度分布面板七、总结两级语义缓存L1 精确缓存— MD5 hash, 1ms, 命中率 ~15%L2 语义缓存— Embedding cosine sim, ~10ms, 命中率 ~25%组合命中率 ~40%— 年省 $4 万 (百万日活)