AI Systems Penetration Testing: How to Test AI Security

TL;DR

AI systems introduce attack surfaces that no standard penetration test covers. Prompt injection, model manipulation, indirect data extraction, and compromised retrieval pipelines are real attack vectors that organisations are deploying into production without testing. The EU AI Act sets a 2 August 2026 deadline for high-risk AI systems to demonstrate cybersecurity compliance. This guide explains what AI penetration testing involves, how it differs from conventional application security testing, and what your organisation needs to assess before deploying AI in a regulated or sensitive context.

Why AI systems need their own security testing

A web application has a defined attack surface: endpoints, authentication flows, session management, input handling, and server configuration. A penetration tester knows what to look for and has established methodologies for finding it.

An AI system has all of that and more. The model itself is an attack surface. The way it processes instructions is an attack surface. The external data it retrieves is an attack surface. The tools it can invoke are an attack surface. None of these are addressed by standard penetration testing methodology.

Organisations that run a standard web application pentest on an AI-powered application and consider it tested are creating a false sense of security. The most dangerous attack vectors against that application are the ones the standard test did not look for.

The attack surface of an AI system

The model and its instructions

The system prompt is the developer's instruction set. It tells the model what it is, what it can do, and what it must not do. Prompt injection attacks attempt to override these instructions. A tester probes whether the model can be made to reveal the system prompt, ignore its restrictions, impersonate a different system, or behave in ways the developer did not intend.

Retrieval-augmented generation pipelines

Many AI systems retrieve external content to ground their responses: documents, knowledge bases, database records, emails, web pages. This retrieval pipeline is a primary vector for indirect prompt injection. An attacker who can influence what the system retrieves can plant instructions that the model will execute when it processes that content.

RAG pipeline testing covers the security of the retrieval mechanism itself, including whether document access controls are correctly enforced, whether an attacker can insert documents into the retrieval index, and whether the model correctly attributes and limits the use of retrieved content.

Tool and API integrations

AI agents are increasingly connected to tools: email, calendar, code execution, database queries, external APIs, file systems. Each tool the agent can invoke is a potential damage amplifier if an injection attack succeeds. Testing covers whether tool invocations can be triggered through injection, whether the model applies appropriate constraints before invoking tools, and whether tool outputs are handled securely.

Data extraction through model outputs

Models that have been fine-tuned on proprietary data or that have access to sensitive information in context can be made to leak that information through carefully crafted queries. Testing covers whether the model can be induced to output training data, context contents, or information from other users' sessions.

Underlying application and infrastructure

The application hosting the AI system has its own attack surface: authentication, authorisation, API security, rate limiting, logging, and infrastructure configuration. A complete AI security assessment covers the model-specific attack vectors and the conventional application security layer.

How AI pentesting works in practice

Architecture review

Before any attack testing begins, the tester maps the AI system's architecture: what model or models are involved, how the system prompt is structured and stored, what data sources the system retrieves from, what tools it can invoke, how authentication and authorisation are implemented, and what logging and monitoring is in place. This determines which attack vectors are in scope and in which order to pursue them.

Prompt injection testing

Direct injection testing probes the model through its intended interface with instructions designed to override the system prompt, extract configuration details, bypass restrictions, or trigger unintended behaviour. Indirect injection testing introduces malicious instructions through external data sources the system retrieves, documents it processes, or content it summarises.

Safety guardrail assessment

Systems deployed with content restrictions, topic limitations, or prohibited output categories are tested for bypass techniques including jailbreaking, context manipulation, role-playing pretexts, and multi-turn manipulation strategies. The goal is to determine whether the guardrails hold under adversarial pressure or whether they can be circumvented by a motivated attacker.

Tool invocation and integration testing

Where the AI system has tool access, the tester attempts to trigger tool invocations through injection, tests whether tool parameters can be manipulated, and verifies that the system applies appropriate confirmation steps for high-impact actions. This is particularly critical for systems that can send communications, modify data, or make external API calls.

Conventional application security testing

The model API, authentication layer, rate limiting, and infrastructure are tested using standard application penetration testing methodology. This layer is often under-tested in AI-focused assessments because attention goes to the model-specific attack vectors.

AI pentesting and the EU AI Act

The EU AI Act requires high-risk AI systems to achieve appropriate levels of robustness, accuracy, and cybersecurity throughout their lifecycle. Organisations deploying AI systems classified as high-risk under Annex III must demonstrate these properties before placing their system on the market or putting it into service. The deadline for these requirements is 2 August 2026.

A standard penetration test report does not satisfy this requirement. A report documenting that the system was tested for code vulnerabilities and configuration weaknesses does not address the cybersecurity properties specific to AI systems. A structured AI security assessment that explicitly covers adversarial inputs, prompt injection, model manipulation, and integration security provides the evidence base that compliance requires.

High-risk AI use cases under Annex III include AI used in critical infrastructure, educational or vocational training, employment and worker management, access to essential services, law enforcement, migration and border control, administration of justice, and democratic processes. Organisations in regulated sectors should assess whether their AI deployments fall within scope before the August 2026 deadline.

What good AI security looks like

Least privilege for AI agents. An AI agent should only be able to invoke the tools and access the data it strictly needs. Every capability granted to the agent that is not required is an additional attack surface.

Human confirmation for consequential actions. Any action with real-world consequences, sending an email, modifying a record, calling an external service, should require explicit human approval before execution. This breaks the attack chain even when injection succeeds.

Separation of instructions and data. Where architecturally possible, instructions from developers and data retrieved from external sources should be handled through distinct channels with different trust levels. This reduces the attack surface for indirect injection.

Comprehensive logging. Log inputs, outputs, tool invocations, and retrieved content in sufficient detail to detect injection attempts and reconstruct what happened after a security event.

Recurring testing. AI systems evolve. Model updates, new tool integrations, changes to the retrieval pipeline, and new use cases change the attack surface. Security testing should be repeated when significant changes occur, not just at initial deployment.

FAQ

What is AI penetration testing?

AI penetration testing is a structured security assessment of an AI system, including its model, integrations, APIs, and data pipelines. It tests attack surfaces that do not exist in conventional software, such as prompt injection, model manipulation, data extraction through model outputs, and the security of retrieval-augmented generation pipelines. A standard application penetration test does not cover these attack surfaces.

How does AI pentesting differ from a standard application pentest?

A standard application pentest focuses on code vulnerabilities, authentication flaws, injection attacks in structured data, and configuration weaknesses. An AI pentest adds a layer of model-specific attack techniques: prompt injection through direct and indirect channels, context manipulation, jailbreaking, model inversion, membership inference, and the security of any tools or APIs the model can invoke. The tester needs to understand how the specific model processes instructions and what it is connected to.

Does the EU AI Act require penetration testing of AI systems?

The EU AI Act does not mandate penetration testing by name. It requires high-risk AI systems to achieve appropriate levels of robustness, accuracy, and cybersecurity throughout their lifecycle. In practice, demonstrating these properties requires structured testing that covers adversarial inputs and attack scenarios. A penetration test report documenting AI-specific attack coverage is the most credible evidence of cybersecurity compliance under the Act. The deadline for high-risk AI system requirements is 2 August 2026.

What is the most dangerous attack against an AI system in production?

Indirect prompt injection is consistently the most dangerous attack against AI systems in production. It does not require the attacker to interact with the system directly. Instead, the attacker plants malicious instructions in content the AI system retrieves, such as documents, emails, or web pages. When the system processes this content, it executes the attacker's instructions. AI agents with tool access are particularly vulnerable because the injected instruction can trigger real actions such as sending emails, exfiltrating data, or making API calls.

What does an AI security assessment typically cover?

A structured AI security assessment covers direct and indirect prompt injection, system prompt extraction, jailbreaking and safety guardrail bypass, context manipulation, RAG pipeline security including document poisoning, model API security, tool and integration abuse, data exfiltration through model outputs, and the underlying application and infrastructure security. The scope depends on the architecture of the specific system being tested.

Which organisations need to test their AI systems?

Any organisation deploying AI systems in a way that affects users, employees, customers, or regulated processes should test those systems. Under the EU AI Act, organisations deploying high-risk AI systems as defined in Annex III must meet cybersecurity requirements before the 2 August 2026 deadline. Beyond regulatory obligations, any AI system that can take actions, process sensitive data, or make decisions that affect people represents a security risk that should be assessed before deployment.

Related services and resources

Sectricity conducts AI systems penetration testing covering prompt injection, RAG pipeline security, tool abuse, and model manipulation across LLM-based and agentic deployments. For a deeper look at specific attack vectors, see our guides on prompt injection and MCP security. For the regulatory context, see our EU AI Act guide. Not sure where to start? Begin with a free security scan.

AI Systems Penetration Testing: How to Test the Security of an AI System