This repository contains an experimental framework for generating jailbreak attack prompts, running them against local large language models, applying defenses, and evaluating the resulting responses.
The maintained implementation lives in my_implementation/. For detailed usage
instructions, see my_implementation/README.md.
- `my_implementation/attacks/`
  Individual jailbreak attack implementations.
- `my_implementation/defense/`
  Baseline defenses, prompt-rewrite defenses, and rule-tree utilities.
- `my_implementation/evaluate/`
  Evaluation scripts and selected examples for defense training.
- `my_implementation/scripts/`
  Small runners used by the orchestrator inside PBS jobs.
- `my_implementation/run_orchestrator.py`
  Main orchestration CLI. It creates PBS job scripts and submits them with `qsub` unless `dry_run` is enabled.
- `my_implementation/config_orchestrator.yaml`
  Main configuration file for paths, model selection, backend selection, attacks, defenses, and evaluation.
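A minimal configuration sketch follows. Only the keys that appear elsewhere in this README (`dry_run`, `target_model`, `single_attack`, `use_ollama`, `results_dir`) are grounded; the exact layout is an illustrative assumption, not a copy of the real file:

```yaml
# Illustrative fragment of config_orchestrator.yaml; see the real file for all keys.
dry_run: true                 # write PBS job scripts but do not submit them with qsub
target_model: "falcon3:3b"    # model the attacks are run against
single_attack: "_1_cypher"    # attack used by --attack-single
use_ollama: true              # true = Ollama HTTP API, false = local vLLM model directory
results_dir: results          # generated job scripts land under results_dir/jobs/
```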
From the cluster environment:

    module add mambaforge
    mamba activate /storage/brno2/home/xkaska01/.conda/envs/diplomka
    cd /storage/brno2/home/xkaska01/master/my_implementation

List prepared attack JSON files:

    python3 run_orchestrator.py --config config_orchestrator.yaml --list-attacks

Create or submit batch attack jobs:

    python3 run_orchestrator.py --config config_orchestrator.yaml --attack-batch

Create or submit one selected attack for `target_model`:

    python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

Run defenses:

    python3 run_orchestrator.py --config config_orchestrator.yaml --defense ea
    python3 run_orchestrator.py --config config_orchestrator.yaml --defense rallm
    python3 run_orchestrator.py --config config_orchestrator.yaml --defense llamaguard
    python3 run_orchestrator.py --config config_orchestrator.yaml --defense safeguard

Before submitting many PBS jobs, set this in
`my_implementation/config_orchestrator.yaml`:

    dry_run: true
    target_model: "falcon3:3b"
    single_attack: "_1_cypher"

Then run:

    python3 run_orchestrator.py --config config_orchestrator.yaml --attack-single

Inspect the generated job script under `results_dir/jobs/`. If it is correct,
set `dry_run: false` and run the command again.
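The dry-run gate described above can be sketched as follows. The function name, signature, and layout are hypothetical illustrations, not the orchestrator's actual code; only the `qsub`-unless-`dry_run` behavior is taken from this README:

```python
import subprocess
from pathlib import Path
from typing import Optional


def submit_job(script_text: str, jobs_dir: Path, name: str,
               dry_run: bool) -> Optional[str]:
    """Write a PBS job script; submit it with qsub unless dry_run is set."""
    jobs_dir.mkdir(parents=True, exist_ok=True)
    script_path = jobs_dir / f"{name}.sh"
    script_path.write_text(script_text)
    if dry_run:
        # Dry run: leave the script on disk for inspection, submit nothing.
        return None
    # qsub prints the new job ID on stdout.
    result = subprocess.run(["qsub", str(script_path)],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

With `dry_run: true` this only materializes the script under the jobs directory, which is what makes the inspect-then-resubmit workflow above safe.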
The project supports two inference backends:

- `use_ollama: true`
  Use an Ollama model through the local Ollama HTTP API.
- `use_ollama: false`
  Use a local model directory through vLLM.
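The backend switch might be implemented along these lines. This is a sketch under the assumption of Ollama's default local endpoint (`http://localhost:11434/api/generate`) and vLLM's offline `LLM`/`SamplingParams` interface; the function names are hypothetical:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def build_ollama_payload(model: str, prompt: str) -> dict:
    # /api/generate takes a model tag and prompt; stream=False returns one JSON object.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, use_ollama: bool, model: str) -> str:
    if use_ollama:
        # use_ollama: true -> talk to the local Ollama HTTP API.
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(build_ollama_payload(model, prompt)).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["response"]
    # use_ollama: false -> load a local model directory with vLLM.
    from vllm import LLM, SamplingParams
    outputs = LLM(model=model).generate([prompt], SamplingParams(max_tokens=256))
    return outputs[0].outputs[0].text
```

Note that `model` is overloaded in this sketch: an Ollama tag such as `falcon3:3b` for the HTTP path, a filesystem path to a model directory for the vLLM path.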
For the current environment, vLLM jobs request gpu_cap=cuda80 in
my_implementation/scripts/job_templates.py. This avoids Blackwell sm_120
GPUs that are not supported by the installed PyTorch build.
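For context, a MetaCentrum-style PBS resource request carrying this constraint could look like the fragment below. Only `gpu_cap=cuda80` comes from this README; every other resource value is an illustrative assumption, and the real template lives in `my_implementation/scripts/job_templates.py`:

```shell
#PBS -l select=1:ncpus=4:ngpus=1:mem=32gb:gpu_cap=cuda80
#PBS -l walltime=24:00:00
```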
The detailed project documentation is maintained in:
my_implementation/README.md
The main Python entry points include Doxygen-style docstrings with @brief,
@param, and @return tags so generated API documentation can be added later.
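As an illustration of that docstring style (the function, its parameters, and its return value are hypothetical, not taken from the codebase):

```python
def run_attack(attack_name: str, target_model: str) -> dict:
    """!
    @brief Run a single jailbreak attack against the target model.
    @param attack_name Name of the prepared attack JSON file to use.
    @param target_model Model identifier, e.g. an Ollama tag such as falcon3:3b.
    @return Dictionary mapping generated attack prompts to model responses.
    """
    # Hypothetical stub; the real entry points live in my_implementation/.
    return {}
```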