Research & Tradecraft

Jailbreaker: LLM Jailbreak Testing You Can Actually Repeat

Author

Read Time

12 mins

Published

Jun 29, 2026

You may have read about our new GhostWorks initiative here at SpecterOps. As part of this effort, we continually trial different ways of evaluating and improving model behavior to better understand which techniques can be applied to our research.

One of the things that has been winding me up about working with LLMs is how difficult it can be to tell whether a safety improvement actually worked.

You tweak a prompt, add a guardrail, adjust a refusal policy, and for a while it feels like the model is behaving better. Then a slightly different prompt lands and the same failure mode appears again. The issue is that most of us are still relying heavily on intuition when deciding whether a change made any measurable difference.

TL;DR Jailbreaker is a defensive LLM jailbreak testing tool that helps operators run jailbreak, prompt-injection, and agent-behavior tests without turning the work into a mess of copied prompts, spreadsheets, screenshots, and one-off scripts. It gives you a UI for configuring targets, running techniques, tracking experiments, reviewing evidence, and comparing results across models and prompts.

Access the GitHub repo: Jailbreaker

Introduction: What Jailbreaker is meant to solve

If you have tested an LLM application manually, the first few minutes usually feel simple enough.

You open the chatbot, paste in a prompt, tweak the wording, paste again, save a screenshot, and make a note that the model either refused or answered. Maybe you try a roleplay prompt next. Maybe you base64-encode the request. Maybe you run the same thing against another model to see if the behavior changes.

This works for a quick gut check.

It does not work well when you need to do the work repeatedly.

The painful part of jailbreak testing is not just coming up with clever prompts. It is managing the whole loop around those prompts: what target was tested, what system prompt was active, what model answered, what technique generated the final prompt, what the model returned, whether a baseline behaved differently, which runs need to be repeated, and what evidence should be shown to someone else later.

That is the problem Jailbreaker is meant to solve.

The manual jailbreak testing problem

Manual testing tends to start as a list of prompts and quickly turns into unmanageable states.

You might have:

a few target models
a local model and a cloud model
different system prompts
a set of known jailbreak techniques
an attacker model for iterative techniques
a judge model for scoring responses
datasets of unsafe intents
screenshots and copied responses
notes about which prompts “worked”

This is obviously a very operator-heavy workflow. The more serious the test becomes, the more time goes into bookkeeping instead of analysis.

There are a few common failure modes:

You forget which prompt produced a response.
You rerun a technique but accidentally change the target configuration.
You only test the jailbreak prompt and forget to compare it with a baseline.
You have results, but they are scattered across browser tabs and notes.
You want to test ten prompts across twenty techniques, but that means hundreds of manual runs.
You want to explain the result to someone else, but the evidence is not organized.

This is the same kind of workflow problem that shows up in many security tools: the individual action is easy, but scaling consistently requires a platform.

Jailbreaker: An LLM jailbreak testing tool

Jailbreaker treats jailbreak testing as an evaluation workflow instead of a collection of ad hoc prompts.

At a high level, the tool gives you a place to:

Configure the target system.
Pick the technique or technique family you want to run.
Run a baseline for comparison.
Execute one technique, a family, or the full set.
Inspect prompts, responses, traces, judge output, and scoring details.
Save and revisit results.
Run structured experiments across prompt datasets.

The goal is not to magically decide whether a model is “secure.” The goal is to make the testing loop repeatable enough that you can compare behavior, fix issues, and run the same checks again.

Learn more about the SpecterOps Ghostworks initiative

Dashboard: operator console for LLM testing

The Dashboard is the ad hoc operator console.

This is where you configure the target model, provide the intent you want to test, pick jailbreaking techniques, and run the evaluation. Instead of manually copying a jailbreak prompt into a chatbot and tracking the result somewhere else, the platform keeps the run tied to its configuration and output.

For example, say you want to test a customer-support assistant before it goes live. The manual process is familiar: paste an unsafe request into the bot, change the wording a few times, maybe try a roleplay prompt, and then copy the interesting response into notes.

In Jailbreaker, the same quick test becomes a managed run. You select the saved target profile for the support bot, provide the intent you want to test, and run a baseline first. Then you run one technique, a jailbreak family like roleplay or prompt injection, or you can select the full registry of techniques. The app keeps the unsafe intent, generated prompt, model configuration, response, scoring, and evidence together.

From the Dashboard you can:

run a baseline of the model
run individual jailbreak techniques
run a whole jailbreak technique family
run the full jailbreak technique registry
configure target, attacker, and judge models
inspect generated prompts and target responses
open conversation details and trace artifacts
save reusable custom techniques

Having a record of attempted jailbreak attempts is useful because many jailbreak results only make sense in context. A target response is more meaningful when you can see the original unsafe intent, the technique that transformed it, the final prompt that was sent, the model configuration, and the comparison baseline.

That last point is important. If a technique produces an alarming response, you can open the conversation details and see exactly what happened instead of trying to reconstruct it later from screenshots. If the baseline already produced the same behavior, that tells you something different than a technique creating a new failure mode.

Techniques: from prompt lists to a registry

One of the first things that breaks down in manual testing is the prompt list.

Someone has a file of prompts. Someone else has a different file. A third person has a few iterative scripts. Some techniques are direct. Some mutate the input. Some need an attacker model. Some need multiple rounds. Some are really families of related tests.

Jailbreaker provides a technique registry to help solve this problem.

The built-in registry includes direct and indirect prompt injection, roleplay and instruction override, encoding and multilingual obfuscation, system prompt extraction, iterative techniques such as PAIR, TAP, Crescendo, AutoDAN, and GPTFuzz, context poisoning, and customer-service boundary tests.

The important part is not just that these techniques exist, it is that they can be selected, run, compared, retried, and inspected through the same consistent workflow.

If a technique needs custom scoring behavior or an iterative loop, that behavior belongs to the technique instead of living in someone’s temporary script.

The current LLM jailbreak testing landscape

Any discussion about LLM red teaming tools eventually runs into PyRIT. That makes sense, PyRIT is a strong project, and it has done a lot to push AI red teaming toward reusable components instead of one-off scripts.

PyRIT is a flexible framework. It gives you building blocks: targets, converters, attacks, scorers, memory, scenarios, a scanner, and a GUI. If your goal is to build custom red teaming automation in Python, wire together your own components, or experiment with new attack strategies, that kind of framework is useful.

Jailbreaker is built around the operator workflow first. The question is not just, “Can I automate this attack?” The question is, “Can I manage the entire jailbreak testing process from target setup to results review without writing glue code for every assessment?” That changes the shape of the tool.

In Jailbreaker, the default experience is already a platform workflow:

configure the target, attacker, and judge roles in the UI
run a baseline next to jailbreak techniques
execute one technique, a family, or the full registry
launch dataset-by-technique experiment matrices
monitor progress and retry failures
inspect prompt construction, target responses, traces, and judge output
review heatmaps, metrics, exports, and SQL-backed analysis views
revisit previous experiments without rebuilding context

This is where Jailbreaker pulls ahead for many day-to-day users. You do not need to start by deciding which framework primitives to compose. You start by running an evaluation.

Jailbreaker is a better fit when the user wants a jailbreak testing platform: something an operator can clone, run, point at a model, and use to manage the assessment lifecycle.

That does not mean Jailbreaker and PyRIT cannot coexist. Some PyRIT-derived request transformation ideas already fit naturally into Jailbreaker’s technique model. The point is that Jailbreaker packages those ideas inside a workflow that is focused on repeatability, evidence, comparison, and review.

Model roles without reconfiguring everything

Jailbreak testing often involves more than one model.

Jailbreaker separates the common roles:

Target: the system being tested.
Attacker: the model used by iterative techniques to refine attacks.
Judge: the model used to evaluate responses when automated judging is useful.

This matters in practice. Sometimes you want the target model to judge itself. Sometimes you want the attacker model to also act as the judge. Sometimes you want a dedicated judge model, or sometimes you just want one local model to do everything while you test the workflow.

The saved LLM profile flow in Jailbreaker is there to reduce repeated configuration work. Save a model profile once, then select it for target, attacker, or judge. When one profile exists, each role gets a dropdown, and you can add or edit profiles from the same console.

The Vault supports this by storing reusable profile credentials behind an unlock step. The larger workflow benefit is that model configuration becomes a reusable part of the evaluation instead of a set of fields you keep filling out.

Experiments: when one prompt becomes a matrix

At some point, one-off testing is not enough.

You might want to know how a model behaves across a dataset of unsafe intents. Or you may want to measure how a technique family performs against different prompt categories, or if a system prompt change improved behavior across the same test set you ran yesterday.

This is where the Experiments page comes in.

Imagine a support-bot test with Jailbreaker found one suspicious response. The next question is usually not, “Can I make this happen once?” It is, “How often does this class of thing happen, and which techniques are getting us there?”

That is where a one-off run naturally turns into an experiment. You can define or upload a small dataset of unsafe intents, tag them by category, choose the techniques or technique families you care about, and launch the matrix. Instead of manually running every prompt-technique combination, the platform tracks the cells, records the results, and gives you a place to review what happened.

From the experiment workflow you can:

upload prompt datasets
select technique families
run matrix-style evaluations
monitor cell progress
retry failed cells
inspect experiment history
export CSV or JSON
review aggregate metrics by technique and prompt category

This is the difference between “I tried some jailbreaks” and “I ran the same evaluation set before and after a mitigation and can compare the results.”

For example, you might run the same dataset before and after tightening a system prompt, changing a retrieval policy, or adding a tool-use guardrail. The value is not just that Jailbreaker can send the prompts. The value is that the run is structured enough to compare.

Results and analysis

Jailbreak results are only useful if you can make sense of them later. Jailbreaker stores completed runs, detections, conversations, traces, experiment metadata, experiment cells, datasets, and custom techniques. The UI then gives you multiple ways to inspect that data.

The Results view helps you review a specific experiment: aggregate metrics, technique breakdowns, heatmaps, evidence, exports, and per-cell details.

In the support-bot example, this is where you stop looking at isolated model responses and start looking for patterns. Maybe encoding tricks are mostly blocked, but roleplay techniques still produce policy leaks. Maybe one prompt category behaves much worse than the others. Maybe the mitigation helped the average score but still left a small number of high-risk failures.

The Analysis page exposes SQL-backed reporting views for broader questions:

Which techniques are producing the most concerning responses?
Which models are more resistant across the same dataset?
Which prompt categories are most problematic?
What happened in a specific round of an iterative run?

It is meant to get the evidence into one place so the review is less painful.

Clone it and run it

Jailbreaker is designed to be easy to setup from a fresh git clone.

git clone <https://github.com/SpecterOps/Jailbreaker-CE>
cd Jailbreaker-CE
cp .env.example .env
docker compose up -d --build frontend backend database

The default stack includes the frontend, backend, and PostgreSQL database. Once it is running, open:

<http://localhost:3000>

From there, configure a target, pick an unsafe intent, and start running techniques.

The current focus is the clone-and-run workflow: a practical app that helps operators manage jailbreak testing without needing to wire up a custom evaluation harness first.

Closing thoughts

Manual jailbreak testing is useful, but it does not scale well as a workflow.

The interesting work is not just getting one model to answer one bad prompt. The interesting work is being able to run the same checks again, compare results, review evidence, and understand whether your changes actually improved the system.

That is what Jailbreaker tries to make easier.

It gives operators a place to manage targets, techniques, model roles, experiments, results, and analysis in one platform. The prompts still matter. The techniques still matter. But the workflow around them matters just as much.

You can find Jailbreaker on GitHub here: Jailbreaker

PS: We will occasionally share new techniques that we research through Jailbreaker with the community as they become available and are appropriately reviewed.

Post Views: 4,350

Neeraj Gupta

AI/ML Applied Scientist

Neeraj Gupta is a tinkerer, working as an AI/ML Applied Scientist at SpecterOps. He likes to make the machines learn new things and authored What’s Your Secret?: Secret Scanning by DeepPass2 while creating DeepPass2 tool.

Jailbreaker: LLM Jailbreak Testing You Can Actually Repeat

Share

Introduction: What Jailbreaker is meant to solve

The manual jailbreak testing problem

Jailbreaker: An LLM jailbreak testing tool

Dashboard: operator console for LLM testing

Techniques: from prompt lists to a registry

The current LLM jailbreak testing landscape

Model roles without reconfiguring everything

Experiments: when one prompt becomes a matrix

Results and analysis

Clone it and run it

Closing thoughts

Ready to get started?

You might also be interested in

There and Back Again: An Operators Guide on NTLM Relaying Egress

Finding SOCKS with Proxywatch

Building a Mental Model for Kubernetes Security Research