MemSkill shifts agent memory from turn-level, hand-designed operations to a new paradigm of span-level, skill-conditioned generation. Instead of applying a fixed procedure after every single turn, MemSkill groups interaction history into larger spans and conditions memory construction on a composed set of skills, making memory extraction more scalable, more reusable, and easier to adapt across tasks and domains.
Crucially, MemSkill treats the skill bank as a living component: skills are not only invoked but also iteratively refined and expanded from hard cases, so the system can continually improve without relying on extensive manual redesign. Across long conversations, long-form text, and embodied interaction, this approach yields stronger memory and more robust downstream behavior, pointing toward self-evolving memory management for large language model agents.
Prior methods interleave handcrafted operations with LLM calls to incrementally extract and revise memory turn by turn, while MemSkill selects a small set of skills from a shared skill bank and applies them in a single pass over each span to produce skill-guided memories.
Comparison between (a) prior turn-level, handcrafted operations and (b) MemSkill's span-level, skill-conditioned generation.
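To make this contrast concrete, the toy sketch below reimplements both paradigms in a few lines; every name in it (the `llm` callable, `turn_level_memory`, `span_level_memory`, the fixed operation list) is illustrative rather than part of MemSkill's actual code.

```python
# Toy contrast between the two paradigms; all names are illustrative,
# not MemSkill's released interfaces.
from typing import Callable, List

LLM = Callable[[str], str]  # stand-in signature for a language-model call


def turn_level_memory(turns: List[str], ops: List[str], llm: LLM) -> List[str]:
    """Prior paradigm: run each hand-designed operation after every single turn."""
    memory: List[str] = []
    for turn in turns:
        for op in ops:  # fixed procedure: one LLM call per operation per turn
            memory.append(llm(f"[{op}] turn: {turn} | memory size: {len(memory)}"))
    return memory


def span_level_memory(turns: List[str], skill_bank: List[str], llm: LLM,
                      span_size: int = 4, top_k: int = 3) -> List[str]:
    """Span-level, skill-conditioned paradigm: one generation call per span."""
    memory: List[str] = []
    for start in range(0, len(turns), span_size):
        span = turns[start:start + span_size]  # group several turns into one span
        skills = skill_bank[:top_k]            # stand-in for the learned skill selector
        memory.append(llm(f"skills: {skills} | span: {' '.join(span)}"))
    return memory
```

Under these toy settings, 12 turns with four operations would trigger 48 LLM calls in the first variant but only 3 span-level calls in the second; the real systems differ in prompts and bookkeeping, but this is the scaling intuition.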
MemSkill is built around a shared skill bank and three cooperating components. Given an interaction trace, we process it span by span: for each span, a controller learns to select a small set of relevant skills based on the current span and retrieved memories, and an LLM-based executor conditions on the selected skills to generate skill-guided memory for that span in a single step. In parallel, a designer periodically reviews representative hard cases collected during training and uses them to refine existing skills and propose new ones, expanding the skill bank over time. Together, this yields a closed loop where the agent improves both how it uses skills and what the skills are, enabling progressively stronger memory construction without relying on a fixed, hand-designed operation pipeline.
MemSkill architecture overview.
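The sketch below compresses the three roles into a single class to show how they might fit together; the class, its keyword-overlap scoring, and the prompt strings are our own stand-ins for the learned components described above, not the paper's implementation.

```python
# Schematic closed loop over the controller, executor, and designer roles;
# data structures and scoring here are stand-ins, not MemSkill's actual code.
from dataclasses import dataclass, field
from typing import Callable, List

LLM = Callable[[str], str]


@dataclass
class MemSkillLoop:
    skill_bank: List[str]
    llm: LLM
    memory: List[str] = field(default_factory=list)
    hard_cases: List[str] = field(default_factory=list)

    def controller(self, span: str, retrieved: List[str], k: int = 3) -> List[str]:
        """Select a small set of relevant skills for this span (learned in the
        paper; a crude token-overlap score stands in for the learned policy)."""
        context = (span + " " + " ".join(retrieved)).lower()

        def overlap(skill: str) -> int:
            return sum(word in context for word in skill.lower().split())

        return sorted(self.skill_bank, key=overlap, reverse=True)[:k]

    def executor(self, span: str, skills: List[str], retrieved: List[str]) -> str:
        """Generate skill-guided memory for the span in a single LLM call."""
        return self.llm(f"skills: {skills}\nretrieved: {retrieved}\nspan: {span}")

    def designer(self) -> None:
        """Periodically turn collected hard cases into refined or new skills."""
        if self.hard_cases:
            new_skill = self.llm(f"propose a skill from hard cases: {self.hard_cases}")
            self.skill_bank.append(new_skill)
            self.hard_cases.clear()

    def process(self, spans: List[str]) -> None:
        for span in spans:
            retrieved = [m for m in self.memory if any(w in m for w in span.split())][:5]
            skills = self.controller(span, retrieved)
            self.memory.append(self.executor(span, skills, retrieved))
        self.designer()  # skill evolution happens between passes, driven by hard cases
```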
Across LoCoMo, LongMemEval, and ALFWorld, MemSkill delivers the strongest overall results among all compared methods. On the conversational benchmarks, MemSkill achieves the best LLM-judge scores on both LoCoMo and LongMemEval within each base-model block, indicating higher-quality constructed memories than those produced by hand-designed systems such as MemoryBank, A-MEM, and MemoryOS. On ALFWorld, MemSkill also reaches the highest success rates on both seen and unseen splits, showing that skill-guided memory helps not only offline querying but also long-horizon embodied decision making. Notably, MemSkill is trained only with LLaMA, yet its learned skill bank transfers to Qwen without any additional training and remains highly competitive, demonstrating strong generalization across base models and suggesting that the evolved skills capture reusable memory behaviors rather than benchmark-specific heuristics.
Experimental results on LoCoMo, LongMemEval, and ALFWorld.
We further test whether the skill bank learned on LoCoMo remains effective under a clear shift in interaction format and evidence structure by transferring it directly to HotpotQA. We report results on three increasingly difficult settings with 50, 100, and 200 concatenated documents.
Across all three context lengths, MemSkill consistently outperforms both baselines, with the margin becoming most pronounced in the hardest 200-document setting. In addition, varying the number of selected skills provides a lightweight sensitivity check: performance generally improves as we increase the skill budget, and the best results are achieved with larger Top-$K$ (typically $K{=}7$). These findings indicate that the learned skills transfer beyond dialogue-style memory benchmarks and remain useful for document-centric QA, while the ability to compose more skills becomes increasingly valuable as context difficulty grows.
Extensive analysis of skill generation under distribution shift.
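The Top-$K$ sensitivity check can be pictured with a small sweep like the one below; `select_skills`, the `answer` hook, and the token-overlap scoring are hypothetical placeholders, and the assumed budgets simply run up to the $K{=}7$ setting discussed above.

```python
# Hypothetical harness for the Top-K sensitivity check; scoring and the
# downstream answer function are placeholders, not the paper's pipeline.
from typing import Callable, Dict, List, Tuple


def select_skills(skill_bank: List[str], query: str, k: int) -> List[str]:
    """Rank skills by a crude token-overlap score and keep the top K."""
    q_tokens = set(query.lower().split())

    def overlap(skill: str) -> int:
        return len(set(skill.lower().split()) & q_tokens)

    return sorted(skill_bank, key=overlap, reverse=True)[:k]


def budget_sweep(skill_bank: List[str],
                 examples: List[Tuple[str, str]],
                 answer: Callable[[str, List[str]], str],
                 budgets: Tuple[int, ...] = (1, 3, 5, 7)) -> Dict[int, float]:
    """Measure downstream accuracy as the skill budget K grows."""
    results: Dict[int, float] = {}
    for k in budgets:
        correct = sum(answer(q, select_skills(skill_bank, q, k)) == gold
                      for q, gold in examples)
        results[k] = correct / len(examples)
    return results
```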
To make MemSkill more interpretable, we inspect the final evolved skill bank and report representative skills learned from LoCoMo and ALFWorld. A key takeaway is that the evolved skills reflect what each setting repeatedly requires memory to preserve, without being manually hard-coded.
For LoCoMo, the highlighted skills focus on capturing temporal context and activity details, suggesting that dialogue memory often needs lightweight structure, such as when something happened and the contextual details that make an event retrievable later. For ALFWorld, the skills prioritize action constraints and object locations, indicating that embodied success depends on maintaining a compact, actionable state summary that supports long-horizon execution. Overall, these examples show how skill evolution yields setting-relevant behaviors while keeping the skill bank reusable and transferable.
Case study on LoCoMo and ALFWorld.
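For a sense of what such entries can look like, the snippet below lists paraphrased, hypothetical skills in the spirit of the behaviors described above; the wording is ours and does not reproduce the actual skill bank.

```python
# Paraphrased, illustrative skill entries; phrasing is ours and only conveys
# the flavor of the temporal-context and action-constraint skills discussed above.
example_skills = {
    "locomo_temporal_context": (
        "When a span mentions an event, record when it happened so the event "
        "stays retrievable later."
    ),
    "locomo_activity_details": (
        "Preserve who was involved and what was planned or done, even if it is "
        "only mentioned in passing."
    ),
    "alfworld_action_constraints": (
        "Note preconditions on actions (e.g., an object must be picked up "
        "before it can be heated) to avoid invalid steps."
    ),
    "alfworld_object_locations": (
        "Keep a compact record of where task-relevant objects were last seen so "
        "long-horizon plans can return to them."
    ),
}
```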
@article{MemSkill,
  author  = {Haozhen Zhang and Quanyu Long and Jianzhu Bao and Tao Feng and Weizhi Zhang and Haodong Yue and Wenya Wang},
  title   = {MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents},
  journal = {arXiv preprint arXiv:2602.02474},
  year    = {2026},
}