arXiv preprint · 2026 · survey paper

SoK: Agentic Skills - Beyond Tool Use in LLM Agents

A systematization of how reusable procedural modules are defined, acquired, executed, secured, and evaluated in modern LLM agent systems.

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, Guangsheng Yu

University of Technology Sydney · CSIRO Data61

Corpus: 65 papers, 24 systems

Taxonomy: 7 patterns, 5 representation classes

Anchor result: curated skills +16.2pp, self-generated skills -1.3pp

Seven design patterns for agentic skills — The paper's central systems view: seven design patterns for how skills are packaged, loaded, and executed.

Summary

What this paper contributes

A unified skill abstraction

Skills are formalized as reusable modules with applicability, execution policy, termination, and interface.

A lifecycle model

Discovery, refinement, distillation, storage, composition, execution, and update are treated as one pipeline.

A production-facing lens

The paper connects representation to trust tiers, supply-chain risk, deterministic evaluation, and governance.

Abstract

A survey of the skill layer in agent systems

Agentic systems increasingly rely on reusable procedural capabilities, or agentic skills, to execute long-horizon workflows reliably. This paper argues that the skill layer deserves its own systems view: not just how a model calls a tool once, but how reusable procedures are discovered, stored, governed, and evaluated over time.

The survey introduces two complementary organizing lenses. The first is a seven-pattern taxonomy for how skills are represented and deployed in real systems, from metadata-driven disclosure and executable code skills to self-evolving libraries and marketplace distribution. The second is an orthogonal representation-by-scope view that separates what a skill is from where it operates.

Beyond taxonomy, the paper treats skills as a security and operations problem. It analyzes trust tiers, prompt-injection-style skill payloads, supply-chain risks in skill marketplaces, and deterministic evaluation pipelines for deciding whether a skill helps in practice.

Framework

What counts as an agentic skill

The paper separates reusable skills from one-off plans, tool calls, and memory records by treating each skill as a bounded procedural module.

Internal anatomy of an agentic skill — The anatomy view used throughout the paper: applicability, policy, termination, and interface.

Formalization: S = (C, π, T, R)

C is the applicability condition, π the executable policy, T the termination condition, and R the reusable callable interface.

When to invoke

The applicability condition decides whether a skill belongs in the current task context.

How to execute

The policy may be natural language, code, a workflow, or a hybrid package.

When to stop

The termination rule gives each skill an explicit stopping point, which makes it auditable in a way that open-ended prompting is not.

How to reuse

The interface exposes the skill as a callable artifact that can be routed and composed.
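The four components can be made concrete with a small sketch. This is an illustrative Python rendering, not code from the paper: the `Skill` class, its field names, the `max_steps` bound, and the toy `strip_dashes` skill are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

State = Dict[str, Any]  # assumption: task context modeled as a plain dict


@dataclass(frozen=True)
class Skill:
    """Illustrative rendering of S = (C, pi, T, R); all names are hypothetical."""
    name: str                              # identifier exposed through the interface R
    applicable: Callable[[State], bool]    # C: when to invoke
    policy: Callable[[State], State]       # pi: how to execute one step
    terminated: Callable[[State], bool]    # T: when to stop
    max_steps: int = 10                    # hard bound so every run ends

    def __call__(self, state: State) -> State:
        """R: the callable interface a router or composer would use."""
        if not self.applicable(state):
            raise ValueError(f"skill {self.name!r} is not applicable")
        steps = 0
        while not self.terminated(state):
            if steps >= self.max_steps:
                raise RuntimeError(f"skill {self.name!r} did not terminate")
            state = self.policy(state)
            steps += 1
        return state


# Toy skill: drop trailing dashes from a string, one character per step.
strip_dashes = Skill(
    name="strip_dashes",
    applicable=lambda s: isinstance(s.get("text"), str),
    policy=lambda s: {**s, "text": s["text"][:-1]},
    terminated=lambda s: not s["text"].endswith("-"),
)

result = strip_dashes({"text": "report--"})
print(result)  # {'text': 'report'}
```

Keeping the termination check separate from the policy is the point of the sketch: a caller can inspect or test the stopping rule without running the procedure itself.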

Lifecycle

Skills are treated as evolving system components

The lifecycle view is the operational core of the survey: skills are acquired, stored, executed, evaluated, and revised as durable system assets.

Agentic skill lifecycle diagram — The lifecycle model connects discovery, refinement, distillation, storage, composition, execution, and update.

Discovery

Identify recurring task patterns, bottlenecks, or failure modes worth encapsulating.

Practice and refinement

Improve candidate procedures through execution feedback, reflection, or external supervision.

Distillation

Compress successful trajectories into a reusable procedural artifact with explicit boundaries.

Storage and retrieval

Index skills, manage versions, and route the right skill into the right runtime context.

Execution and update

Run within permission boundaries, then regression-test, refine, or retire based on evidence.
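The storage and update stages can be sketched as a versioned store in which a skill version is published, retrieved, and later retired. This is a minimal in-memory illustration under assumed names (`SkillStore`, `SkillVersion`, `publish`, `retire`, `latest`); the paper does not prescribe this design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SkillVersion:
    version: int
    body: str
    retired: bool = False


@dataclass
class SkillStore:
    """Hypothetical in-memory store mapping a skill name to its version history."""
    _versions: Dict[str, List[SkillVersion]] = field(default_factory=dict)

    def publish(self, name: str, body: str) -> int:
        """Append a new version and return its number (1-based)."""
        history = self._versions.setdefault(name, [])
        history.append(SkillVersion(version=len(history) + 1, body=body))
        return history[-1].version

    def retire(self, name: str, version: int) -> None:
        """Mark one version as retired, for example after a failed regression run."""
        for entry in self._versions.get(name, []):
            if entry.version == version:
                entry.retired = True

    def latest(self, name: str) -> Optional[SkillVersion]:
        """Return the newest version that has not been retired, if any."""
        for entry in reversed(self._versions.get(name, [])):
            if not entry.retired:
                return entry
        return None


store = SkillStore()
store.publish("summarize_logs", "v1: group lines by error code")
store.publish("summarize_logs", "v2: group lines by stack trace")
store.retire("summarize_logs", 2)  # evidence showed v2 regressed
active = store.latest("summarize_logs")
print(active.version)  # 1
```

Retiring a version instead of deleting it keeps the history available for later audits and lets retrieval fall back to the last version that passed.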

Patterns

Seven recurrent ways systems package and expose skills

The taxonomy is intentionally non-exclusive: strong systems often combine multiple patterns.

P1

Disclosure and selection

Metadata-driven disclosure keeps context small while making skill selection explicit.
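One way to picture metadata-driven disclosure: every skill contributes a one-line description to the context, and a skill's full body is loaded only once it has been selected. The sketch below assumes hypothetical names (`SkillCard`, `build_context`) and a toy catalog; it illustrates the idea and is not the mechanism of any particular system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SkillCard:
    name: str
    description: str              # short metadata, always placed in context
    load_body: Callable[[], str]  # full instructions, fetched only on selection


def build_context(cards: List[SkillCard], selected: List[str]) -> str:
    """List every skill's metadata, then append bodies for selected skills only."""
    lines = ["Available skills:"]
    for card in cards:
        lines.append(f"- {card.name}: {card.description}")
    for card in cards:
        if card.name in selected:
            lines.append(f"\n[{card.name}]\n{card.load_body()}")
    return "\n".join(lines)


catalog = [
    SkillCard("pdf-extract", "Extract text from PDF files",
              lambda: "1. Open the file. 2. Read each page. 3. Join the text."),
    SkillCard("csv-clean", "Normalize CSV column names",
              lambda: "1. Lowercase headers. 2. Replace spaces with underscores."),
]

metadata_only = build_context(catalog, selected=[])
with_pdf_body = build_context(catalog, selected=["pdf-extract"])
print(len(metadata_only) < len(with_pdf_body))  # True
```

Because `load_body` is a callable, unselected skills cost only their description line, which is what keeps the context small.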

P2-P3

Executable control

Code-as-skill and workflow enforcement turn reusable procedures into testable runtime behavior.

P4-P6

Self-improving skill systems

Self-evolving libraries, hybrid packages, and meta-skills move skill management into the agent loop.

P7

Distribution and governance

Marketplace distribution expands reuse, but also turns skills into a supply-chain security surface.

Skill composition and orchestration — Runtime composition is itself a systems problem: retrieve, route, decompose, recover, and retry.
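The recover-and-retry part of composition can be sketched as a small runner that executes steps in order, retries a failing step a bounded number of times, and uses a fallback if retries run out. The function name `run_with_recovery`, its parameters, and the toy steps are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

Step = Callable[[str], str]


def run_with_recovery(
    steps: List[Step],
    payload: str,
    max_retries: int = 2,
    fallback: Optional[Step] = None,
) -> Tuple[str, List[str]]:
    """Run steps in order; retry failures, then fall back or raise."""
    log: List[str] = []
    for index, step in enumerate(steps):
        for attempt in range(1, max_retries + 2):
            try:
                payload = step(payload)
                log.append(f"step {index} ok on attempt {attempt}")
                break
            except Exception as exc:
                log.append(f"step {index} failed on attempt {attempt}: {exc}")
        else:
            # Every attempt failed: recover through the fallback if one exists.
            if fallback is None:
                raise RuntimeError(f"step {index} exhausted retries")
            payload = fallback(payload)
            log.append(f"step {index} recovered via fallback")
    return payload, log


calls = {"count": 0}


def flaky_upper(text: str) -> str:
    """Toy step that fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("transient")
    return text.upper()


output, log = run_with_recovery([str.strip, flaky_upper], "  hello ")
print(output)  # HELLO
```

The returned log records each attempt, so a failed composition can be inspected after the fact.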

Relationship between MCP and skills — The discussion section also situates skills relative to MCP and other emerging infrastructure layers.

Security

Skill ecosystems inherit the supply-chain problem

Threats the paper emphasizes

Poisoned skill retrieval through adversarial metadata
Malicious payloads in code or natural-language skill bodies
Cross-tenant leakage and confused-deputy behavior
Applicability-condition poisoning and skill drift

ClawHavoc anchor case

The paper grounds governance in a real marketplace incident: 1,184 malicious skills, 36.8% flawed listings in one audit, and credential theft spanning API keys, wallets, browsers, and SSH keys.

Trust-tiered threat model for skill governance — The proposed trust model moves from metadata-only exposure to supervised and then autonomous execution.
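The progression from metadata-only exposure to supervised and then autonomous execution can be sketched as a simple gate. The tier names follow the caption above, but the `TrustTier` enum, the `may_execute` function, and the approval flag are assumptions for illustration, not the paper's specification.

```python
from enum import IntEnum


class TrustTier(IntEnum):
    METADATA_ONLY = 0  # skill may be listed and described, never run
    SUPERVISED = 1     # skill may run only with explicit human approval
    AUTONOMOUS = 2     # skill may run without per-call approval


def may_execute(tier: TrustTier, human_approved: bool = False) -> bool:
    """Decide whether a skill at this tier is allowed to run right now."""
    if tier == TrustTier.METADATA_ONLY:
        return False
    if tier == TrustTier.SUPERVISED:
        return human_approved
    return True


print(may_execute(TrustTier.METADATA_ONLY, human_approved=True))  # False
print(may_execute(TrustTier.SUPERVISED, human_approved=True))     # True
print(may_execute(TrustTier.AUTONOMOUS))                          # True
```

Under this reading, a newly installed marketplace skill would start at `METADATA_ONLY` and be promoted only after review.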

Evaluation

The strongest empirical result is about curation, not raw generation

+16.2pp

Curated skills

Average pass-rate lift in SkillsBench, from 24.3% to 40.6%.

-1.3pp

Self-generated skills

Average degradation relative to the no-skills baseline in open-ended settings.

7,308

Trajectories

Benchmark scale cited in the paper's SkillsBench case study.

Evaluation dimensions

Correctness, robustness, efficiency, generalization, and safety.

Industrial evaluation pipeline for skills — The industrial view maps skills to CI-style verification: regression suites, comparator agents, and versioned updates.

The paper's evaluation stance is pragmatic. What matters is not whether a skill looks elegant in isolation, but whether it reliably improves downstream outcomes under deterministic verification.

That is why the survey elevates benchmark harnesses, outcome-based verification, and industrial regression infrastructure. In this framing, a skill is closer to production logic than to a clever prompt.
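An outcome-based check of this kind can be sketched as a comparison of pass rates with and without a skill on the same task set, reported in percentage points. The helper names (`pass_rate`, `lift_pp`, `accept_skill`), the acceptance threshold, and the outcome lists are invented for illustration and are not SkillsBench data.

```python
from typing import List


def pass_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks that passed a deterministic verifier."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def lift_pp(baseline: List[bool], with_skill: List[bool]) -> float:
    """Pass-rate change in percentage points; negative means the skill hurt."""
    return pass_rate(with_skill) - pass_rate(baseline)


def accept_skill(baseline: List[bool], with_skill: List[bool],
                 min_lift_pp: float = 0.0) -> bool:
    """Hypothetical gate: keep a skill only if it beats the no-skill baseline."""
    return lift_pp(baseline, with_skill) > min_lift_pp


# Made-up outcomes on the same four tasks, without and with a skill.
no_skill = [True, False, False, False]
curated = [True, True, False, False]
self_generated = [False, False, False, False]

print(lift_pp(no_skill, curated))              # 25.0
print(accept_skill(no_skill, curated))         # True
print(accept_skill(no_skill, self_generated))  # False
```

Reporting the difference in percentage points, as the headline figures above do, keeps the lift comparable across baselines of different sizes.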

Citation

Cite the paper directly

@article{jiang2026agenticskills,
  title   = {SoK: Agentic Skills - Beyond Tool Use in LLM Agents},
  author  = {Jiang, Yanna and Li, Delong and Deng, Haiyu and Ma, Baihe and
             Wang, Xu and Wang, Qin and Yu, Guangsheng},
  journal = {arXiv preprint arXiv:2602.20867},
  year    = {2026}
}