Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

*Equal Contribution
1Nanyang Technological University
2Tsinghua University
3University of Illinois at Urbana-Champaign
4University of Illinois Chicago
5Sun Yat-sen University
6The Hong Kong University of Science and Technology (Guangzhou)

Overview

BudgetMem rethinks runtime agent memory from the lens of explicit performance–cost control. Rather than relying on offline, query-agnostic memory construction, BudgetMem performs on-demand memory extraction at runtime and makes the computation spent on memory both budget-aware and controllable. It organizes memory extraction as a modular pipeline, where each module provides Low/Mid/High budget tiers and can be instantiated under different tiering strategies.

BudgetMem further learns a lightweight budget-tier router that selects tiers module-wise based on the query and intermediate states, trained with a cost-aware reinforcement learning objective. Using BudgetMem as a unified testbed, we systematically study three complementary ways to realize tiers—implementation, reasoning, and capacity—and characterize their performance–cost behaviors across budget regimes. Experiments on LoCoMo, LongMemEval, and HotpotQA show that BudgetMem achieves strong performance in performance-first settings and yields clear performance–cost frontiers under tighter budgets, offering practical insights for building controllable proactive memory systems.

BudgetMem Architecture

BudgetMem is a runtime agent memory framework that enables explicit performance–cost control for on-demand memory extraction. Given a user query, BudgetMem first retrieves a candidate set of raw chunks from the chunked history (without offline memory construction) and then processes them through a modular memory pipeline. Each module takes the query and intermediate states as input and progressively refines query-relevant information, producing an extracted memory that conditions the final answer generation.

A key design in BudgetMem is that every module exposes three budget tiers (Low/Mid/High), which correspond to different cost–quality behaviors under a chosen tiering strategy (implementation, reasoning, or capacity). A shared lightweight router performs budget-tier routing module-wise as the query flows through the pipeline, selecting which tier to apply at each module based on the current context. The router is trained with reinforcement learning under a cost-aware objective, enabling controllable performance–cost behavior and providing a unified testbed to study how different tiering strategies translate compute into downstream gains.
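The module-wise routing loop described above can be sketched in a few lines. This is an illustrative toy, not the released implementation: the module names, the string-tagging stand-ins for real extraction modules, and the fixed `always_low` router are all hypothetical (in BudgetMem the router is a learned component and each tier is an LLM-backed variant of the module).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

TIERS = ("low", "mid", "high")

@dataclass
class Module:
    """One pipeline stage; each tier is a cheaper or costlier variant of the stage."""
    name: str
    tiers: Dict[str, Callable[[str, str], str]]  # tier -> fn(query, state) -> new state

def route_and_run(query: str, state: str, pipeline: List[Module],
                  router: Callable[[str, str, str], str]) -> str:
    """Flow the query through the pipeline, letting the router pick a
    budget tier per module from the query and the current intermediate state."""
    for module in pipeline:
        tier = router(query, state, module.name)   # module-wise tier decision
        state = module.tiers[tier](query, state)   # refine the intermediate state
    return state

# Toy instantiation: string-tagging stand-ins for real extraction modules.
filt = Module("filter", {t: (lambda q, s, t=t: f"[{t}-filtered] {s}") for t in TIERS})
summ = Module("summarize", {t: (lambda q, s, t=t: f"[{t}-summary] {s}") for t in TIERS})
always_low = lambda query, state, module_name: "low"  # stand-in for the learned router
memory = route_and_run("when did Alice move?", "raw chunks", [filt, summ], always_low)
```

The point of the sketch is the control flow: the tier decision is made per module, conditioned on the state produced by the previous module, rather than once per query.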


BudgetMem architecture overview.

Comparison Experiments

Under the performance-first setting, we evaluate BudgetMem on LoCoMo, LongMemEval, and HotpotQA against a diverse set of representative memory systems. BudgetMem consistently achieves strong gains in both F1 and LLM-as-a-judge scores, indicating more effective long-context evidence utilization. Despite prioritizing performance, BudgetMem remains cost-efficient in practice: its on-demand design retrieves query-relevant raw chunks and spends extraction compute only when needed, avoiding unnecessary processing over the full history (with smaller cost gaps on LoCoMo, where histories are shorter). Finally, BudgetMem is strong in aggregate across datasets and backbones, and the learned router transfers from LLaMA to Qwen without retraining, suggesting the routing policy generalizes beyond a single base model.


Performance-first results on LoCoMo, LongMemEval, and HotpotQA.

Exploring Trade-offs Across Tiering Strategies

We systematically compare performance–cost behaviors on LoCoMo across the three tiering axes (implementation, reasoning, and capacity). By varying the cost weight λ, BudgetMem traces smooth and controllable trade-off frontiers that consistently envelop prior baselines in both low- and high-cost regimes, achieving higher Judge at similar cost or lower cost at similar performance.
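The λ knob can be read as a standard cost-weighted scalar objective: reward = task quality − λ · cost. The toy numbers below (per-tier quality and cost) are made up purely to illustrate the mechanism; they are not measured values from the paper.

```python
def routing_reward(task_reward: float, cost: float, lam: float) -> float:
    """Cost-aware scalar reward: answer quality minus lambda-weighted compute cost."""
    return task_reward - lam * cost

# Hypothetical (quality, cost) pairs per tier, for illustration only.
tiers = {"low": (0.6, 1.0), "mid": (0.8, 2.0), "high": (0.9, 4.0)}

def best_tier(lam: float) -> str:
    """Tier that maximizes the cost-aware reward at a given cost weight."""
    return max(tiers, key=lambda t: routing_reward(*tiers[t], lam))
```

Sweeping λ from small to large shifts the argmax from High through Mid to Low, which is exactly the frontier-tracing behavior described above: each λ selects a different operating point on the performance–cost curve.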

The three axes exhibit distinct budget coverage. Implementation and capacity tiering span a broader cost range: implementation tiering yields rapid quality gains under moderate budgets, while capacity tiering continues to improve as budget increases and achieves the best high-budget quality. In contrast, reasoning tiering concentrates in a narrower cost band, acting as a finer-grained quality knob with less cost spread. Overall, these results highlight complementary strengths of different tiering strategies for shaping the Pareto frontier.


Performance–cost frontiers on LoCoMo across tiering strategies.

Ablation Study

We ablate the proposed reward-scale alignment on LoCoMo under the capacity tiering strategy. Because task reward and cost reward can differ substantially in scale and variability, removing alignment destabilizes optimization and can bias learning toward the cost term, leading to an overly conservative routing behavior.

Empirically, without reward-scale alignment (with λ=0.3), the router heavily favors the Low tier across modules, sharply reducing answer quality and yielding the lowest Judge scores. In contrast, enabling reward-scale alignment encourages a more graded use of tiers and produces a smoother, better-behaved performance–cost frontier. Overall, this ablation shows that reward-scale alignment is important for balancing learning signals and supporting meaningful performance–cost control.
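One common way to align reward scales is to standardize each reward stream over a batch before mixing the two terms. The sketch below uses a batch z-score as an assumed illustration of the idea; the paper's exact alignment scheme may differ.

```python
import statistics

def zscore(xs):
    """Standardize a batch of rewards to zero mean and unit scale."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs) or 1.0  # guard against a constant batch
    return [(x - mu) / sd for x in xs]

def aligned_rewards(task_rs, cost_rs, lam):
    """Combine task and cost rewards after putting them on a comparable
    scale, so neither term dominates purely by magnitude."""
    t, c = zscore(task_rs), zscore(cost_rs)
    return [ti - lam * ci for ti, ci in zip(t, c)]
```

Without such normalization, a cost term measured in (say) thousands of tokens can swamp a task reward in [0, 1], which matches the failure mode above: the optimizer collapses onto the cheapest tiers regardless of answer quality.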


Ablation of reward-scale alignment on LoCoMo (capacity tiering).

Discussion

Budget-tier selection ratio. We analyze module-level routing behavior on LongMemEval (capacity tiering) by reporting the selection ratios of Low/Mid/High tiers under different cost weights λ. The router exhibits a clear budget response: as cost pressure increases, it systematically shifts probability mass from higher-cost tiers to cheaper ones, providing interpretable, module-level evidence that BudgetMem allocates computation in a cost-aware manner.
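Selection ratios of this kind can be computed directly from logged routing decisions; a minimal sketch (the logging format is assumed, not from the paper):

```python
from collections import Counter

def tier_ratios(decisions):
    """Fraction of logged routing decisions that landed in each tier."""
    counts = Counter(decisions)
    return {t: counts.get(t, 0) / len(decisions) for t in ("low", "mid", "high")}
```

Computing these ratios per module and per λ yields exactly the kind of interpretable budget-response table discussed above.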

Retrieval-size sensitivity. We also study how the number of retrieved raw chunks affects cost and quality on LoCoMo (evaluated under all three tiering strategies). Increasing retrieval size predictably raises cost and often improves Judge by providing more evidence, but the gain is not monotonic: retrieving too many chunks introduces redundant or weakly relevant content that increases noise and can hurt Judge, while retrieving too few chunks provides insufficient evidence. In our setting, retrieving 5 chunks offers the best balance between cost and quality.


Budget-tier selection ratios on LongMemEval.


Retrieval-size sensitivity on LoCoMo.

BibTeX

@article{BudgetMem,
  author    = {Haozhen Zhang and Haodong Yue and Tao Feng and Quanyu Long and Jianzhu Bao and Bowen Jin and Weizhi Zhang and Xiao Li and Jiaxuan You and Chengwei Qin and Wenya Wang},
  title     = {Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory},
  journal   = {arXiv preprint arXiv:xxxx.xxxxx},
  year      = {2026},
}