This repository contains an experimental framework for generating jailbreak attack prompts, running them against local large language models, applying defenses, and evaluating the resulting responses.
The maintained implementation lives in my_implementation/. For detailed usage
instructions, see my_implementation/README.md.
- `my_implementation/attacks/`
  Individual jailbreak attack implementations.
- `my_implementation/defense/`
  Baseline defenses, prompt-rewrite defenses, and rule-tree utilities.
- `my_implementation/evaluate/`
  Evaluation scripts and selected examples for defense training.
- `my_implementation/scripts/`
  Small runners used by the orchestrator inside PBS jobs.
- `my_implementation/run_orchestrator.py`
  Main orchestration CLI. It creates PBS job scripts and submits them with `qsub` unless `dry_run` is enabled.
- `my_implementation/config_orchestrator.yaml`
  Main configuration file for paths, model selection, backend selection, attacks, defenses, and evaluation.
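A minimal configuration sketch follows. Only the keys that appear elsewhere in this README (`dry_run`, `target_model`, `single_attack`, `use_ollama`, `results_dir`) are grounded; the exact layout is an illustrative assumption, not a copy of the real file:

```yaml
# Illustrative fragment of config_orchestrator.yaml; see the real file for all keys.
dry_run: true                 # write PBS job scripts but do not submit them with qsub
target_model: "falcon3:3b"    # model the attacks are run against
single_attack: "_1_cypher"    # attack used by --attack-single
use_ollama: true              # true = Ollama HTTP API, false = local vLLM model directory
results_dir: results          # generated job scripts land under results_dir/jobs/
```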
From the cluster environment:

    module add mambaforge
    mamba activate /storage/brno2/home/xkaska01/.conda/envs/diplomka
    cd /storage/brno2/home/xkaska01/master/my_implementation

List prepared attack JSON files:

    python3 run_orchestrator.py --config config_orchestrator.yaml --list-attacks

Create or submit batch attack jobs:

    python3 run_orchestrator.py --config config_orchestrator.yaml --attack-batch

Create or submit one selected attack for `target_model`:

    python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

Run defenses:

    python3 run_orchestrator.py --config config_orchestrator.yaml --defense ea
    python3 run_orchestrator.py --config config_orchestrator.yaml --defense rallm
    python3 run_orchestrator.py --config config_orchestrator.yaml --defense llamaguard
    python3 run_orchestrator.py --config config_orchestrator.yaml --defense safeguard

Before submitting many PBS jobs, set this in
`my_implementation/config_orchestrator.yaml`:

    dry_run: true
    target_model: "falcon3:3b"
    single_attack: "_1_cypher"

Then run:

    python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

Inspect the generated job script under `results_dir/jobs/`. If it is correct,
set `dry_run: false` and run the command again.
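The dry-run gate described above can be sketched as follows. The function name, signature, and layout are hypothetical illustrations, not the orchestrator's actual code; only the `qsub`-unless-`dry_run` behavior is taken from this README:

```python
import subprocess
from pathlib import Path
from typing import Optional


def submit_job(script_text: str, jobs_dir: Path, name: str,
               dry_run: bool) -> Optional[str]:
    """Write a PBS job script; submit it with qsub unless dry_run is set."""
    jobs_dir.mkdir(parents=True, exist_ok=True)
    script_path = jobs_dir / f"{name}.sh"
    script_path.write_text(script_text)
    if dry_run:
        # Dry run: leave the script on disk for inspection, submit nothing.
        return None
    # qsub prints the new job ID on stdout.
    result = subprocess.run(["qsub", str(script_path)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

With `dry_run: true` this only materializes the script under the jobs directory, which is what makes the inspect-then-resubmit workflow above safe.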
The project supports two inference backends:

- `use_ollama: true`
  Use an Ollama model through the local Ollama HTTP API.
- `use_ollama: false`
  Use a local model directory through vLLM.
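The backend switch might be implemented along these lines. This is a sketch under the assumption of Ollama's default local endpoint (`http://localhost:11434/api/generate`) and vLLM's offline `LLM`/`SamplingParams` interface; the function names are hypothetical:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_ollama_payload(model: str, prompt: str) -> dict:
    # /api/generate takes a model tag and prompt; stream=False returns one JSON object.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, use_ollama: bool, model: str) -> str:
    if use_ollama:
        # use_ollama: true -> talk to the local Ollama HTTP API.
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(build_ollama_payload(model, prompt)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["response"]
    # use_ollama: false -> load a local model directory with vLLM.
    from vllm import LLM, SamplingParams
    outputs = LLM(model=model).generate([prompt], SamplingParams(max_tokens=256))
    return outputs[0].outputs[0].text
```

Note that `model` is overloaded in this sketch: an Ollama tag such as `falcon3:3b` for the HTTP path, a filesystem path to a model directory for the vLLM path.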
For the current environment, vLLM jobs request gpu_cap=cuda80 in
my_implementation/scripts/job_templates.py. This avoids Blackwell sm_120
GPUs that are not supported by the installed PyTorch build.
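For context, a MetaCentrum-style PBS resource request carrying this constraint could look like the fragment below. Only `gpu_cap=cuda80` comes from this README; every other resource value is an illustrative assumption, and the real template lives in `my_implementation/scripts/job_templates.py`:

```shell
#PBS -l select=1:ncpus=4:ngpus=1:mem=32gb:gpu_cap=cuda80
#PBS -l walltime=24:00:00
```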
The detailed project documentation is maintained in:
my_implementation/README.md
The main Python entry points include Doxygen-style docstrings with @brief,
@param, and @return tags so generated API documentation can be added later.
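As an illustration of that docstring style (the function, its parameters, and its return value are hypothetical, not taken from the codebase):

```python
def run_attack(attack_name: str, target_model: str) -> dict:
    """!
    @brief Run a single jailbreak attack against the target model.
    @param attack_name Name of the prepared attack JSON file to use.
    @param target_model Model identifier, e.g. an Ollama tag such as falcon3:3b.
    @return Dictionary mapping generated attack prompts to model responses.
    """
    # Hypothetical stub; the real entry points live in my_implementation/.
    return {}
```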