Reading Time: 11 minutes
Large Language Models are rapidly becoming load-bearing components of modern applications. This article takes you behind the scenes of how Yarix structures an LLM Security Assessment, from threat modeling to hands-on adversarial testing, combining international standards with the kind of methodical analysis that surfaces what automated tools and generic checklists consistently miss.
Large Language Models enrich business workflows, assist decision-making, summarize sensitive documents, orchestrate automations, handle customer conversations, and integrate with systems through APIs, plugins, or entire RAG (Retrieval-Augmented Generation) pipelines. The adoption pace is remarkable, and so is the gap between how quickly companies deploy these systems and how carefully they think about the attack surface they are introducing in the process.
That attack surface is not limited to the model itself. It expands to everything around it: the application that wraps it, the data pipelines that feed it, the tools it can invoke, the APIs it can call, and the business logic that decides what to do with its output. Every one of these integration points is a potential entry path for an attacker who understands how the system is assembled.
In real-world environments we encounter, with striking regularity, the same set of structural problems:
The attack surface is an architecture to understand, a threat model to build, and a set of controls to verify. Prompt injection, data leakage, unintended autonomous operations, resource exhaustion: these are concrete vectors exploited in production today, and the only way to assess them rigorously is to treat the work as methodology, not experimentation.
Traditional penetration testing is built on reproducibility: an input produces a predictable output; a vulnerability either exists or it does not. LLMs shatter this assumption entirely. The same prompt, on the same model, at the same temperature, can produce meaningfully different outputs on successive calls, a fundamental consequence of probabilistic token prediction, not a bug.
Even harder is defining what secure means in this context. The boundary between secure and insecure behavior is not a static property of the system but a semantic one, shaped by deployment context, intended audience, and what the model is permitted to do. For example: a model that freely discusses chemical synthesis might be perfectly appropriate in a research assistant and a critical liability in a children's platform.
Defining "vulnerable" requires understanding context at a depth that cannot be automated away, and that context changes with every new deployment.
The threat landscape for LLM applications does not move on annual CVE cycles. It moves in weeks, sometimes days. Novel jailbreaking techniques, new prompt injection vectors, expanding multi-modal attack surfaces, increasingly sophisticated agentic chains where a single successful injection cascades into real-world consequences through tool calls: emails sent, files deleted, financial transactions initiated.
What is considered a robust guardrail today may be trivially bypassed by a technique published next week. This means LLM security cannot be treated as a point-in-time checkbox exercise. The threat model is a living document, not a deliverable produced at project kickoff and filed away.
Traditional software has a bounded set of intended behaviors encoded in its logic. You can enumerate its functionalities, test and verify them. An LLM has learned patterns from vast corpora and can generalize in directions its designers never anticipated. The complete set of potentially harmful outputs cannot be enumerated in advance, because it is not defined by code paths but by the intersection of model capabilities, deployment context, and attacker creativity.
This is why LLM penetration testing requires a fundamentally different mindset. It is less about running known exploits against known vulnerability classes, and more about adversarial exploration of a behavioral state space whose boundaries are not fully defined even by the people who built the system.
LLM assessments require a combined mindset that spans offensive security, risk management, software engineering, and AI governance. Getting any one of these dimensions wrong tends to produce assessments that are either technically shallow, practically irrelevant, or missing the governance context that makes findings actionable at the organizational level.
At Yarix, our methodology for LLM security is primarily guided by the OWASP AI Exchange and the OWASP GenAI project. Both of these frameworks are explicitly designed to connect with complementary standards from MITRE, NIST, ISO, and others, which means they do not exist in isolation but serve as an organizing layer over a broader ecosystem of guidance. What we value about this foundation is not just the vulnerability lists it produces, but the structured threat modeling approach it enables: different deployment scenarios, different trust boundaries, different required controls, each mapped to the specific context of the system being assessed. The framework is also actively maintained by a community that tracks the threat landscape in real time, which matters enormously in a space that changes as quickly as this one does.
We complement this OWASP-centric foundation with:
We monitor new standards, community outputs, and research publications continuously.
One of the first structural decisions in any LLM security assessment is understanding what the client actually owns and controls. This single question determines everything that follows: the scope of the attack surface, the relevant threat model, the applicable controls, and the location of responsibility boundaries that define what the client can fix versus what they must negotiate with a vendor.
At Yarix, we proposed two different assessments based on the answer to the above question:
The distinction between these two engagement types matters in practice because it defines where security responsibility actually sits. A significant part of our methodology involves making this boundary explicit to the client, documenting which controls fall under the model provider's responsibility and cannot be remediated unilaterally on the client side, and which shared responsibilities require coordinated action rather than a clean assignment to one party.
The two engagement types described above share the same first step: understanding what the system actually does, how it is built, and where the real attack surface lies. Whether the client owns the model or has integrated a third-party API, we invest heavily in mapping the architecture, tracing data flows, and identifying which threats are genuinely relevant in that specific environment before any adversarial testing begins. This phase is technical, structured, and non-negotiable. It is also the phase that establishes the baseline against which everything else is measured: what developers claim to enforce, what controls are assumed to be in place, what the system is believed to do. A second phase of active testing then exists to verify whether that picture holds under adversarial pressure, or whether the gap between what the team thinks the system does and what it can actually be made to do is wider than anyone realized.
For third-party integrations, we map the full application stack: web application, APIs, plugins, internal tools, and any orchestration or middleware layer that sits between the user and the model. We trace data flows and trust boundaries, asking specifically where untrusted input can enter the model's context window and what retrieval sources the model queries.
For proprietary model development, the perimeter expands upstream: training data pipelines, fine-tuning infrastructure, and the processes that govern how the model was built become part of the architectural picture.
In both cases, we document third-party dependencies, review authentication layers, and assess what observability the client actually has over the model's behavior in production. We then determine how much control the client has over the LLM itself, because that single variable shapes the entire engagement: it determines which attack surfaces are in scope for remediation and which are architectural constraints the client must work around rather than fix.
This phase regularly surfaces things the client did not know were relevant: unexpected data passing through prompts that no one had tracked, transformations applied by intermediate services that silently bypass intended filters, implicit assumptions about what the model provider handles that turn out to be incorrect.
With context in hand, we choose between two primary modeling approaches based on the type of engagement and the client's ownership profile:
We explain this choice explicitly to the client and document the reasoning, because that reasoning is itself a deliverable. It establishes in writing where security responsibility sits, which creates the accountability structure needed to act on findings after the assessment concludes.
Illustrative examples only
Here comes the part where each control defined in the OWASP AI Exchange is mapped to the client's specific architecture and assigned a status that reflects its current implementation state:
| Status | Meaning |
|---|---|
| Done | Fully implemented and verified |
| Partial | In place but incomplete or inconsistently applied |
| Missing | Absent, remediation required |
| N/A | Not applicable to this deployment, with documented justification |
| Supplier-driven | Responsibility of the model provider, not the client |
The Supplier-driven category deserves particular attention because it is consistently the most underestimated aspect of LLM security governance. Organizations frequently assume that security responsibilities are cleanly partitioned between themselves and their model provider, when in reality many controls involve shared or distributed responsibility.
As examples, click on the following cards to have an overview of what we did in practice for three different controls applied to specific business functionalities introduced on a web application, workflow or something else:
Illustrative examples only - real assessment outcomes depend on system-specific evidence
With the threat model established and the control evaluation complete, we move to the practical stage. This is where the methodology meets the keyboard, and where the desk-phase analysis is validated, extended, or contradicted by what the system does under pressure.
Testing the Application Layer: the first layer of testing relies on the application around the model. The LLM is a feature of a larger application, and we test it the same way we would test a payment flow, a file upload endpoint, or any other sensitive functionality: authentication, session management, API endpoints, rate limiting, access controls, error handling, logging. The fact that the feature involves a language model is irrelevant at this level. What matters is whether the application logic around it is built correctly.
Model Interaction Layer: the second layer of testing is where LLM-specific attacks come into play. Here the model is no longer background infrastructure, it is the attack surface. Our approach is anchored in the threat mapping produced during the desk phase: we test against the threats identified as relevant for this specific deployment, not a generic checklist that treats every LLM integration as equivalent. The attack surface changes substantially depending on whether the model has tool-use capabilities, whether it feeds a RAG pipeline, whether it processes user-uploaded documents, and what the real-world consequences of successful manipulation are in this specific context.
The most analytically valuable step in the active testing phase is the comparison between what the design claimed to prevent, what the threat model predicted should be enforced, and what the system actually did when placed under adversarial conditions.
This gap between stated controls and demonstrated behavior is where the most actionable findings consistently live. It exposes design oversights that were never caught in code review, incorrect assumptions about what the model’s provider handles, misconfigurations that silently nullify intended controls, and the endemic misalignment between the security expectations of the development team and the actual behavior of the integrated system. It is the same comparative methodology we apply in any security assessment, applied to a component that is substantially harder to reason about deterministically than the systems security teams are accustomed to evaluating.
At the conclusion of the entire assessment, the client receives a structured report suite that reflects both phases of the work: the theoretical control evaluation produced during the desk phase and the empirical exploitation results produced during active testing.
The two phases are explicitly connected, so findings are traceable back to the threat model that predicted them and the controls that failed to prevent them. The most valuable insights emerge from the gap between design claims, threat model predictions, and actual system behavior under pressure. This comparison exposes design oversights, misconfigurations, and false assumptions about model provider responsibilities that code reviews often miss. Ultimately, this methodology highlights the systemic misalignment between a team's security expectations and the non-deterministic reality of the integrated LLM.
The delivery consists of four distinct documents:
The threat landscape for LLM applications is still consolidating. Several areas are developing rapidly enough to warrant explicit attention in any assessment conducted today.
Multi-modal attack surfaces. As models gain the ability to process images, audio, and uploaded files, entirely new injection channels open that were not part of the threat model for text-only systems. Adversarial text can be embedded in images, both as visible text and through pixel-level perturbations that are imperceptible to human reviewers but reliably read by vision models. Audio transcripts processed by speech-to-text pipelines can carry injected instructions. Document layouts and formatting can exploit multimodal parsing behavior in ways that differ from how a human reader would interpret the same content.
Memory and long-context manipulation. As LLMs are given persistent memory stores, typically external vector databases that accumulate context across sessions and across users, those stores become a persistent injection surface. An attacker who can write to a model's memory, by contributing content to a shared knowledge base or by influencing what the model retains from a session, can potentially influence the model's behavior across all future interactions with all users who query that memory. Persistence is what makes this category particularly concerning: unlike a successful prompt injection that affects a single session, a memory poisoning attack can persist indefinitely without leaving obvious traces.
Fine-tuning and RLHF attacks. For organizations fine-tuning foundation models on proprietary data, the fine-tuning pipeline itself becomes an attack surface that extends upstream of the inference endpoint. Data poisoning attacks, in which adversarial examples are injected into the fine-tuning dataset to create behavioral backdoors that activate under specific trigger conditions, are technically feasible and difficult to detect without rigorous dataset auditing. The subtlety of these attacks is a significant part of what makes them dangerous: a poisoned model may behave completely normally in all respects except when the trigger condition is met.
LLM-to-LLM injection in multi-agent systems. When LLMs orchestrate other LLMs, passing instructions and context between agents as part of a coordinated workflow, trust propagation across the agent graph is poorly defined and inconsistently implemented. An injection that succeeds against a low-privilege agent in the system can propagate upward as legitimate-appearing instructions to a high-privilege orchestrator, because the orchestrator has no native mechanism to verify that the instructions it receives originated from a trusted source rather than from injected content. This represents one of the most structurally open problems in agentic AI security, and one that becomes more relevant as multi-agent architectures move from research demonstrations into production deployments.
Securing Large Language Models is a technical challenge that cannot be solved by treating the model as a black box. Many vulnerabilities usually live in the integration layer: the data pipelines, the API permissions, and the specific ways the application handles model output.
Effective protection requires a methodology that bridges the gap between traditional software security and the non-deterministic nature of AI. Success depends on mapping precise trust boundaries and testing them against evolving adversarial techniques. A structured assessment reveals where developer assumptions fail and where shared responsibility with providers creates hidden gaps. As these systems move from isolated experiments to autonomous components, security must remain a rigorous architectural discipline rather than a one-time check.
See how this attack is simulated in real scenarios.
Andrea Varischio, of Yarix’s Red Team, graduated from UNIPD with a degree in telecommunications engineering. During his studies, he became passionate about computer security, with a focus on that of Android devices, which was also his master’s thesis topic. Now his focus is on Red Teaming, Red Team tool development, payments penetration tests and mobile assessments. Drummer in his spare time, he loves music, math, martial arts and board games.
Paolo Serra is a Red Team member at Yarix, specializing in application security. He brings a hands-on approach to ethical hacking, often working closely with web and mobile technologies.
Marco Negro, the pillar of the team. Do you have a question? Ask to him and you'll get the answer. He is a master in every area of security.
Claudio Codice, new entry.