This repository contains implementations of deep reinforcement learning algorithms, trained and evaluated on multiple environments:
- A3C (Asynchronous Advantage Actor-Critic) — Kuka Pick & Place manipulation task
- A2C & REINFORCE — LunarLander-v2 continuous control task
```
MLR/RL/
├── code/
│   ├── A2C/                      # Actor-Critic implementations for LunarLander
│   │   ├── actor.py              # Actor network
│   │   ├── critic.py             # Critic network
│   │   ├── train.py              # Training script
│   │   ├── eval.py               # Evaluation script
│   │   ├── compute_objectives.py # Loss computation
│   │   ├── utils.py              # Utility functions
│   │   ├── config.json           # Configuration
│   │   ├── checkpoints/          # Trained models
│   │   ├── plots/                # Training curves
│   │   └── videos/               # Evaluation videos
│   └── A3C/                      # A3C implementation for Kuka
│       ├── main.py               # Entry point
│       ├── eval.py               # Evaluation script
│       ├── plot_training.py      # Visualization
│       ├── config/               # Environment and model configs
│       ├── lib/                  # A3C algorithm implementation
│       ├── helpers/              # Helper utilities
│       ├── models/               # Trained checkpoints
│       ├── logs/                 # Training logs
│       ├── plots/                # Training curves
│       └── requirements.txt      # Dependencies
└── Project3.pdf                  # Assignment specification
```
Asynchronous Advantage Actor-Critic with 4 parallel workers, trained on a robotic manipulation task.
Environment: KukaDiverseObjectEnv (PyBullet)
- Observation: RGB image (40×40, downsampled from 128×128)
- Action space: 3D continuous (end-effector control)
- Task: Pick and place diverse objects
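The 128×128 camera frames are reduced to 40×40 before reaching the network. A minimal sketch of that preprocessing step, assuming NumPy and a nearest-neighbour scheme (the `preprocess` helper is illustrative, not the repo's actual code):

```python
import numpy as np

def preprocess(frame, size=40):
    """Nearest-neighbour downsample of an HxWx3 RGB frame to size x size,
    scaled to [0, 1]. Hypothetical helper mirroring the 128x128 -> 40x40 step."""
    h, w, _ = frame.shape
    rows = np.arange(size) * h // size   # source row index for each output row
    cols = np.arange(size) * w // size   # source column index for each output column
    small = frame[rows][:, cols]
    return small.astype(np.float32) / 255.0

frame = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
obs = preprocess(frame)
print(obs.shape)  # (40, 40, 3)
```

The actual resizing and normalisation used for training live in the A3C `helpers/` and `lib/` code.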
Training Parameters:
- Episodes: 10,000
- Workers: 4 (asynchronous)
- Training time: ~4.5 hours (CPU)
- Device: CPU (CUDA unavailable)
Results:
- Final average reward: ~0.35–0.38
- Evaluation (100 episodes): 31% success rate
- Average reward: 0.310
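As a sketch of how such evaluation numbers are aggregated (the reward threshold used as the success criterion here is an assumption; the actual criterion lives in `eval.py`):

```python
def evaluate(episode_rewards, success_threshold=0.5):
    """Summarise an evaluation run: mean episode reward, plus the fraction of
    episodes whose reward clears an (assumed) success threshold."""
    n = len(episode_rewards)
    avg = sum(episode_rewards) / n
    rate = sum(r >= success_threshold for r in episode_rewards) / n
    return avg, rate

avg, rate = evaluate([1.0, 0.0, 0.0, 1.0])
print(avg, rate)  # 0.5 0.5
```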
REINFORCE, a Monte-Carlo policy gradient method, applied to continuous control.
Environment: LunarLander-v2
- Observation: 8D state vector (position, velocity, angles, contact)
- Action space: 2D continuous (thrust, rotation)
- Task: Land the lunar module safely
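With a 2D continuous action space, the policy is typically a diagonal Gaussian whose samples are squashed into the actuator range. A hedged sketch of action sampling (the function name and tanh squashing are assumptions about the implementation):

```python
import math
import random

def sample_action(mean, log_std):
    """Sample a continuous action from a diagonal Gaussian policy and squash
    each dimension into [-1, 1] with tanh. Illustrative helper only."""
    action = []
    for m, ls in zip(mean, log_std):
        a = random.gauss(m, math.exp(ls))  # one independent Gaussian per dimension
        action.append(math.tanh(a))        # keep thrust / rotation bounded
    return action

random.seed(0)
act = sample_action(mean=[0.0, 0.0], log_std=[-1.0, -1.0])
print(len(act), all(-1.0 <= a <= 1.0 for a in act))  # 2 True
```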
Training metrics:
- Episodes trained: 5,000
- Learning rate: Adaptive scheduling
- Convergence: ~500-1000 episodes
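REINFORCE weights each log-probability by the full Monte-Carlo return from that timestep onward. The return computation can be sketched as:

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte-Carlo returns used by REINFORCE: G_t = r_t + gamma * G_{t+1},
    computed in a single backward pass over an episode's rewards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```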
Actor-critic method combining policy learning with value-function learning.
Environment: LunarLander-v2 (same as REINFORCE)
Architecture:
- Shared hidden layers: [128, 64]
- Actor head: outputs action mean and log_std
- Critic head: outputs state value estimate
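The critic's value estimate lets A2C replace full Monte-Carlo returns with a lower-variance one-step advantage. A scalar sketch of the resulting objectives (the function name and exact loss weighting are assumptions; the repo's version is in `compute_objectives.py`):

```python
def a2c_losses(log_prob, value, reward, next_value, gamma=0.99):
    """One-step advantage actor-critic objectives (scalar sketch):
    advantage A = r + gamma * V(s') - V(s);
    actor loss  = -log pi(a|s) * A;
    critic loss = A^2 (squared TD error)."""
    advantage = reward + gamma * next_value - value
    actor_loss = -log_prob * advantage
    critic_loss = advantage ** 2
    return actor_loss, critic_loss

al, cl = a2c_losses(log_prob=-1.2, value=0.5, reward=1.0, next_value=0.0, gamma=0.9)
print(round(al, 2), round(cl, 2))  # 0.6 0.25
```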
Training metrics:
- Episodes trained: 5,000
- Workers/parallel processes: 1
- Convergence: ~1000-2000 episodes
```bash
cd code/A3C
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Run training:

```bash
python main.py
```

Run evaluation (with GUI):

```bash
python eval.py --checkpoint models/a3c_kuka_model_final.pth --episodes 10 --render
```

Plot training curves:

```bash
python plot_training.py
```

```bash
cd code/A2C
pip install -r requirements.txt
```

Run training:

```bash
python train.py
```

Run evaluation:

```bash
python eval.py --checkpoint checkpoints/lunar_lander_actor.pt --episodes 5
```

| Algorithm | Environment | Episodes | Success Rate | Avg Reward |
|---|---|---|---|---|
| A3C | Kuka Pick & Place | 10,000 | 31% | 0.310 |
| REINFORCE | LunarLander-v2 | 5,000 | ~80%* | -50 to 0 |
| A2C | LunarLander-v2 | 5,000 | ~85%* | -20 to 0 |
*Approximate success rates estimated from episode-reward thresholds; higher average reward indicates better performance.



