Behind The Scenes: Yarix Approach to LLM Security
Behind The Scenes: Yarix Approach to LLM SecurityReading Time: 11 minutesLar 2026-5-28 08:6:32 Author: labs.yarix.com(查看原文) 阅读量:16 收藏

Behind The Scenes: Yarix Approach to LLM Security

Reading Time: 11 minutes

Large Language Models are rapidly becoming load-bearing components of modern applications. This article takes you behind the scenes of how Yarix structures an LLM Security Assessment, from threat modeling to hands-on adversarial testing, combining international standards with the kind of methodical analysis that surfaces what automated tools and generic checklists consistently miss.

Introduction

Large Language Models enrich business workflows, assist decision-making, summarize sensitive documents, orchestrate automations, handle customer conversations, and integrate with systems through APIs, plugins, or entire RAG (Retrieval-Augmented Generation) pipelines. The adoption pace is remarkable, and so is the gap between how quickly companies deploy these systems and how carefully they think about the attack surface they are introducing in the process.

That attack surface is not limited to the model itself. It expands to everything around it: the application that wraps it, the data pipelines that feed it, the tools it can invoke, the APIs it can call, and the business logic that decides what to do with its output. Every one of these integration points is a potential entry path for an attacker who understands how the system is assembled.

In real-world environments we encounter, with striking regularity, the same set of structural problems:

  • Chatbot agents handling confidential data without proper filtering or data classification controls.
  • LLMs integrated directly into business logic without isolation boundaries or meaningful authorization controls on what the model is allowed to do.
  • Third-party models treated as "secure by default," with no clear understanding of where provider responsibility ends and client responsibility begins.
  • Error-handling paths that inadvertently leak sensitive context into model prompts, effectively giving the model information it should never see.
  • LLM-based features bolted onto existing web applications without any revisiting of the underlying security architecture that was designed before AI was part of the picture.

The attack surface is an architecture to understand, a threat model to build, and a set of controls to verify. Prompt injection, data leakage, unintended autonomous operations, resource exhaustion: these are concrete vectors exploited in production today, and the only way to assess them rigorously is to treat the work as methodology, not experimentation.

Part I: The Core Challenge - Testing a System That Does Not Behave Like Software

Determinism vs Probabilistic Reasoning

Traditional penetration testing is built on reproducibility: an input produces a predictable output; a vulnerability either exists or it does not. LLMs shatter this assumption entirely. The same prompt, on the same model, at the same temperature, can produce meaningfully different outputs on successive calls, a fundamental consequence of probabilistic token prediction, not a bug.

Even harder is defining what secure means in this context. The boundary between secure and insecure behavior is not a static property of the system but a semantic one, shaped by deployment context, intended audience, and what the model is permitted to do. For example: a model that freely discusses chemical synthesis might be perfectly appropriate in a research assistant and a critical liability in a children's platform.

Defining "vulnerable" requires understanding context at a depth that cannot be automated away, and that context changes with every new deployment.

The Velocity of the Threat Landscape

The threat landscape for LLM applications does not move on annual CVE cycles. It moves in weeks, sometimes days. Novel jailbreaking techniques, new prompt injection vectors, expanding multi-modal attack surfaces, increasingly sophisticated agentic chains where a single successful injection cascades into real-world consequences through tool calls: emails sent, files deleted, financial transactions initiated.

What is considered a robust guardrail today may be trivially bypassed by a technique published next week. This means LLM security cannot be treated as a point-in-time checkbox exercise. The threat model is a living document, not a deliverable produced at project kickoff and filed away.

The Semantic Gap

Traditional software has a bounded set of intended behaviors encoded in its logic. You can enumerate its functionalities, test and verify them. An LLM has learned patterns from vast corpora and can generalize in directions its designers never anticipated. The complete set of potentially harmful outputs cannot be enumerated in advance, because it is not defined by code paths but by the intersection of model capabilities, deployment context, and attacker creativity.

This is why LLM penetration testing requires a fundamentally different mindset. It is less about running known exploits against known vulnerability classes, and more about adversarial exploration of a behavioral state space whose boundaries are not fully defined even by the people who built the system.

Part II: The Yarix Approach - Methodology and Frameworks

LLM assessments require a combined mindset that spans offensive security, risk management, software engineering, and AI governance. Getting any one of these dimensions wrong tends to produce assessments that are either technically shallow, practically irrelevant, or missing the governance context that makes findings actionable at the organizational level.

At Yarix, our methodology for LLM security is primarily guided by the OWASP AI Exchange and the OWASP GenAI project. Both of these frameworks are explicitly designed to connect with complementary standards from MITRE, NIST, ISO, and others, which means they do not exist in isolation but serve as an organizing layer over a broader ecosystem of guidance. What we value about this foundation is not just the vulnerability lists it produces, but the structured threat modeling approach it enables: different deployment scenarios, different trust boundaries, different required controls, each mapped to the specific context of the system being assessed. The framework is also actively maintained by a community that tracks the threat landscape in real time, which matters enormously in a space that changes as quickly as this one does.

We complement this OWASP-centric foundation with:

  • MITRE ATLAS, for structured adversarial TTP mapping across the machine learning kill chain, from reconnaissance and resource development through to exfiltration and impact.
  • NIST AI RMF and AI 100-2, for governance language that translates technical findings into business risk, and for the adversarial input taxonomy that informs how we structure test cases.
  • ISO/IEC 42001, for AI management system requirements in engagements where organizational compliance is a defined objective.

We monitor new standards, community outputs, and research publications continuously.

Two Types of Engagement

One of the first structural decisions in any LLM security assessment is understanding what the client actually owns and controls. This single question determines everything that follows: the scope of the attack surface, the relevant threat model, the applicable controls, and the location of responsibility boundaries that define what the client can fix versus what they must negotiate with a vendor.

At Yarix, we proposed two different assessments based on the answer to the above question:

  1. Assessments of systems using third-party LLMs: the client has integrated an external model through an API and controls none of what sits underneath: not the weights, not the training data, not the safety alignment. The relevant attack surface lies entirely in the integration layer: how the system prompt is structured and protected, how the retrieval pipeline is designed, how tool integrations are scoped and authorized, and what the surrounding business logic does with the model's responses. This is where most of the exploitable vulnerabilities live, not in the model itself.
  2. Assessments of proprietary LLM development: the client owns the model, which means they own the entire lifecycle: training data quality and integrity, fine-tuning procedures, alignment validation, deployment environment hardening, inference infrastructure security. The threat surface expands accordingly to include attacks against the model and its lifecycle. Deeper access, broader scope, and a different set of technical competencies in the testing team are required, as well as the engagement’s requirements to start with the activity.

The distinction between these two engagement types matters in practice because it defines where security responsibility actually sits. A significant part of our methodology involves making this boundary explicit to the client, documenting which controls fall under the model provider's responsibility and cannot be remediated unilaterally on the client side, and which shared responsibilities require coordinated action rather than a clean assignment to one party.

Part III: Threat Modeling - The Desk Phase That Earns Its Keep

The two engagement types described above share the same first step: understanding what the system actually does, how it is built, and where the real attack surface lies. Whether the client owns the model or has integrated a third-party API, we invest heavily in mapping the architecture, tracing data flows, and identifying which threats are genuinely relevant in that specific environment before any adversarial testing begins. This phase is technical, structured, and non-negotiable. It is also the phase that establishes the baseline against which everything else is measured: what developers claim to enforce, what controls are assumed to be in place, what the system is believed to do. A second phase of active testing then exists to verify whether that picture holds under adversarial pressure, or whether the gap between what the team thinks the system does and what it can actually be made to do is wider than anyone realized.

Understanding the Client's Context

For third-party integrations, we map the full application stack: web application, APIs, plugins, internal tools, and any orchestration or middleware layer that sits between the user and the model. We trace data flows and trust boundaries, asking specifically where untrusted input can enter the model's context window and what retrieval sources the model queries.

For proprietary model development, the perimeter expands upstream: training data pipelines, fine-tuning infrastructure, and the processes that govern how the model was built become part of the architectural picture.

In both cases, we document third-party dependencies, review authentication layers, and assess what observability the client actually has over the model's behavior in production. We then determine how much control the client has over the LLM itself, because that single variable shapes the entire engagement: it determines which attack surfaces are in scope for remediation and which are architectural constraints the client must work around rather than fix.

This phase regularly surfaces things the client did not know were relevant: unexpected data passing through prompts that no one had tracked, transformations applied by intermediate services that silently bypass intended filters, implicit assumptions about what the model provider handles that turn out to be incorrect.

Selecting the Right Threat Model

With context in hand, we choose between two primary modeling approaches based on the type of engagement and the client's ownership profile:

  • Full Threat Model: applied when the client owns and controls the model and its training components, covering the complete ML lifecycle from data collection through inference.
  • Integration-Focused Threat Model: applied when the LLM is third-party and the relevant attack surface lies in the integration layer. This is the more common scenario and requires a precise understanding of where the integration begins and ends.

We explain this choice explicitly to the client and document the reasoning, because that reasoning is itself a deliverable. It establishes in writing where security responsibility sits, which creates the accountability structure needed to act on findings after the assessment concludes.

Illustrative examples only

Mapping Controls to Architecture

Here comes the part where each control defined in the OWASP AI Exchange is mapped to the client's specific architecture and assigned a status that reflects its current implementation state:

Status Meaning
Done Fully implemented and verified
Partial In place but incomplete or inconsistently applied
Missing Absent, remediation required
N/A Not applicable to this deployment, with documented justification
Supplier-driven Responsibility of the model provider, not the client

The Supplier-driven category deserves particular attention because it is consistently the most underestimated aspect of LLM security governance. Organizations frequently assume that security responsibilities are cleanly partitioned between themselves and their model provider, when in reality many controls involve shared or distributed responsibility.

As examples, click on the following cards to have an overview of what we did in practice for three different controls applied to specific business functionalities introduced on a web application, workflow or something else:

Illustrative examples only - real assessment outcomes depend on system-specific evidence

Part IV: Hands-On Testing - What We Actually Do to Break Things

With the threat model established and the control evaluation complete, we move to the practical stage. This is where the methodology meets the keyboard, and where the desk-phase analysis is validated, extended, or contradicted by what the system does under pressure.

Testing the Application Layer: the first layer of testing relies on the application around the model. The LLM is a feature of a larger application, and we test it the same way we would test a payment flow, a file upload endpoint, or any other sensitive functionality: authentication, session management, API endpoints, rate limiting, access controls, error handling, logging. The fact that the feature involves a language model is irrelevant at this level. What matters is whether the application logic around it is built correctly.

Model Interaction Layer: the second layer of testing is where LLM-specific attacks come into play. Here the model is no longer background infrastructure, it is the attack surface. Our approach is anchored in the threat mapping produced during the desk phase: we test against the threats identified as relevant for this specific deployment, not a generic checklist that treats every LLM integration as equivalent. The attack surface changes substantially depending on whether the model has tool-use capabilities, whether it feeds a RAG pipeline, whether it processes user-uploaded documents, and what the real-world consequences of successful manipulation are in this specific context.


The most analytically valuable step in the active testing phase is the comparison between what the design claimed to prevent, what the threat model predicted should be enforced, and what the system actually did when placed under adversarial conditions.

This gap between stated controls and demonstrated behavior is where the most actionable findings consistently live. It exposes design oversights that were never caught in code review, incorrect assumptions about what the model’s provider handles, misconfigurations that silently nullify intended controls, and the endemic misalignment between the security expectations of the development team and the actual behavior of the integrated system. It is the same comparative methodology we apply in any security assessment, applied to a component that is substantially harder to reason about deterministically than the systems security teams are accustomed to evaluating.

Part V: Deliverables

At the conclusion of the entire assessment, the client receives a structured report suite that reflects both phases of the work: the theoretical control evaluation produced during the desk phase and the empirical exploitation results produced during active testing.

The two phases are explicitly connected, so findings are traceable back to the threat model that predicted them and the controls that failed to prevent them. The most valuable insights emerge from the gap between design claims, threat model predictions, and actual system behavior under pressure. This comparison exposes design oversights, misconfigurations, and false assumptions about model provider responsibilities that code reviews often miss. Ultimately, this methodology highlights the systemic misalignment between a team's security expectations and the non-deterministic reality of the integrated LLM.

The delivery consists of four distinct documents:

  • The Threat Modeling Report: A high-level, discursive report detailing the project’s architecture, chosen methodology, and the strategic rationale behind the security approach. It provides an executive and technical narrative of the entire activity.
  • The Security Assessment Report: A typical penetration testing report following standard professional formats. It includes executive summaries and deep technical sections documenting every finding with full reproduction steps and proof-of-concepts.
  • The Control Mapping Matrix: A dedicated functional resource that maps the system against the OWASP LLM control framework. This provides a granular, easy-to-navigate view of the defensive posture that traditional document formats cannot offer.
  • The Remediation & Vulnerability Manager: A streamlined technical file designed specifically for tracking findings, managing the remediation plan, and integrating with the client’s internal vulnerability management workflows.

Part VI: Emerging Frontiers

The threat landscape for LLM applications is still consolidating. Several areas are developing rapidly enough to warrant explicit attention in any assessment conducted today.

Multi-modal attack surfaces. As models gain the ability to process images, audio, and uploaded files, entirely new injection channels open that were not part of the threat model for text-only systems. Adversarial text can be embedded in images, both as visible text and through pixel-level perturbations that are imperceptible to human reviewers but reliably read by vision models. Audio transcripts processed by speech-to-text pipelines can carry injected instructions. Document layouts and formatting can exploit multimodal parsing behavior in ways that differ from how a human reader would interpret the same content.

Memory and long-context manipulation. As LLMs are given persistent memory stores, typically external vector databases that accumulate context across sessions and across users, those stores become a persistent injection surface. An attacker who can write to a model's memory, by contributing content to a shared knowledge base or by influencing what the model retains from a session, can potentially influence the model's behavior across all future interactions with all users who query that memory. Persistence is what makes this category particularly concerning: unlike a successful prompt injection that affects a single session, a memory poisoning attack can persist indefinitely without leaving obvious traces.

Fine-tuning and RLHF attacks. For organizations fine-tuning foundation models on proprietary data, the fine-tuning pipeline itself becomes an attack surface that extends upstream of the inference endpoint. Data poisoning attacks, in which adversarial examples are injected into the fine-tuning dataset to create behavioral backdoors that activate under specific trigger conditions, are technically feasible and difficult to detect without rigorous dataset auditing. The subtlety of these attacks is a significant part of what makes them dangerous: a poisoned model may behave completely normally in all respects except when the trigger condition is met.

LLM-to-LLM injection in multi-agent systems. When LLMs orchestrate other LLMs, passing instructions and context between agents as part of a coordinated workflow, trust propagation across the agent graph is poorly defined and inconsistently implemented. An injection that succeeds against a low-privilege agent in the system can propagate upward as legitimate-appearing instructions to a high-privilege orchestrator, because the orchestrator has no native mechanism to verify that the instructions it receives originated from a trusted source rather than from injected content. This represents one of the most structurally open problems in agentic AI security, and one that becomes more relevant as multi-agent architectures move from research demonstrations into production deployments.

Conclusion

Securing Large Language Models is a technical challenge that cannot be solved by treating the model as a black box. Many vulnerabilities usually live in the integration layer: the data pipelines, the API permissions, and the specific ways the application handles model output.

Effective protection requires a methodology that bridges the gap between traditional software security and the non-deterministic nature of AI. Success depends on mapping precise trust boundaries and testing them against evolving adversarial techniques. A structured assessment reveals where developer assumptions fail and where shared responsibility with providers creates hidden gaps. As these systems move from isolated experiments to autonomous components, security must remain a rigorous architectural discipline rather than a one-time check.

See how this attack is simulated in real scenarios.

Authors

Andrea Varischio, of Yarix’s Red Team, graduated from UNIPD with a degree in telecommunications engineering. During his studies, he became passionate about computer security, with a focus on that of Android devices, which was also his master’s thesis topic. Now his focus is on Red Teaming, Red Team tool development, payments penetration tests and mobile assessments. Drummer in his spare time, he loves music, math, martial arts and board games.

Paolo Serra is a Red Team member at Yarix, specializing in application security. He brings a hands-on approach to ethical hacking, often working closely with web and mobile technologies.

Marco Negro, the pillar of the team. Do you have a question? Ask to him and you'll get the answer. He is a master in every area of security.

Claudio Codice, new entry.

See how this attack is simulated in real scenarios.


文章来源: https://labs.yarix.com/2026/05/behind-the-scenes-yarix-approach-to-llm-security/
如有侵权请联系:admin#unsafe.sh