AI-powered legal document analysis for Indian citizens — understand contracts before you sign.
LawSahayak analyzes PDF contracts and tenancy agreements, detects risky clauses, cites relevant Indian laws, explains everything in plain English, and translates explanations into Hindi and other regional languages.
- Features
- Architecture
- Tech Stack
- Project Structure
- Getting Started
- Environment Variables
- API Reference
- Running Tests
- Known Limitations
- Roadmap
- PDF Analysis — Extracts and segments clauses from uploaded PDFs with OCR fallback for scanned documents
- Rule-Based Detection — Flags violations across 20 Indian legal rules (Indian Contract Act 1872, Transfer of Property Act 1882, Model Tenancy Act 2021, RERA 2016, and more)
- ML Risk Scoring — XGBoost classifier assigns LOW / MEDIUM / HIGH risk to each clause
- LLM Analysis — Google Gemini generates plain-English summaries, legal citations, and actionable advice per clause
- RAG Context — ChromaDB retrieval over the CUAD dataset enriches clause analysis with similar legal precedents
- Multilingual Translation — Sarvam AI translates explanations into Hindi and 9 other Indian languages
- Analytics — Supabase stores analysis results and user feedback
- Graceful Degradation — Friendly error messages for LLM timeouts, missing model files, translation failures, and backend offline states
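The rule-based detection listed above is described elsewhere in this README as keyword-based. A minimal sketch of that idea follows. The `Rule` fields and the two sample rules are assumptions made for illustration; the real `rules.json` schema and keyword lists may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Hypothetical schema; the project's rules.json layout may differ.
    rule_id: str
    law: str
    keywords: tuple[str, ...]


def match_rules(clause_text: str, rules: list[Rule]) -> list[Rule]:
    """Return every rule that has at least one keyword in the clause (case-insensitive)."""
    lowered = clause_text.lower()
    return [r for r in rules if any(k.lower() in lowered for k in r.keywords)]


# Illustrative rules only, not copied from the project's rules.json.
SAMPLE_RULES = [
    Rule("R1", "Indian Contract Act 1872", ("waives all rights",)),
    Rule("R2", "Model Tenancy Act 2021", ("security deposit",)),
]
```

Plain substring matching is cheap and easy to audit, but it misses paraphrases, which is presumably why the pipeline also runs the ML and LLM stages.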
PDF Upload
│
▼
Text Extraction (PyMuPDF + Tesseract OCR fallback)
│
▼
Clause Segmentation (up to 50 clauses per document)
│
▼
Rule Matching ──────────────────────────────────► 20 Indian legal rules (rules.json)
│
▼
ML Risk Classification (XGBoost)
│
▼
RAG Retrieval (ChromaDB + CUAD dataset)
│
▼
LLM Analysis — batched (Google Gemini)
│
▼
Translation (Sarvam AI)
│
▼
JSON Response → Next.js Frontend → Supabase (analytics)
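The clause segmentation stage in the diagram above can be sketched as follows. The 50-clause cap comes from the diagram; the splitting heuristic (a clause starts at a line beginning with `1.`, `2.1`, or `Clause 3`) is an assumption, and `ingestion.py` may segment differently.

```python
import re

MAX_CLAUSES = 50  # cap stated in the pipeline diagram

# Assumption: a clause begins at a line that starts with a number
# ("1.", "2.1", "3)") or with the word "Clause" followed by a number.
_CLAUSE_START = re.compile(
    r"^(?=[ \t]*(?:clause\s+)?\d+(?:\.\d+)*[.)]?\s)",
    flags=re.MULTILINE | re.IGNORECASE,
)


def segment_clauses(text: str, limit: int = MAX_CLAUSES) -> list[str]:
    """Split text at clause starts, drop empty pieces, and keep at most `limit` clauses."""
    parts = _CLAUSE_START.split(text)
    clauses = [p.strip() for p in parts if p.strip()]
    return clauses[:limit]
```

Any text before the first numbered line is kept as its own leading segment, so a preamble is not silently dropped.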
| Layer | Technology |
|---|---|
| Backend API | FastAPI, Uvicorn, Pydantic |
| LLM | Google Gemini 2.0 Flash (with fallback chain) |
| Embeddings | sentence-transformers/all-MiniLM-L6-v2 |
| Vector DB | ChromaDB (persistent local client) |
| ML Risk Model | XGBoost (trained on CUAD dataset) |
| Translation | Sarvam AI |
| PDF Processing | PyMuPDF, OpenCV, Tesseract OCR |
| Database | Supabase (PostgreSQL) |
| Legal Dataset | CUAD (510 contracts for RAG context) |
| Frontend | Next.js 14, TypeScript, Tailwind CSS |
LawSahayak/
├── backend/
│ ├── main.py # FastAPI app — 6 endpoints
│ ├── pipeline/
│ │ ├── ingestion.py # PDF extraction + OCR + clause segmentation
│ │ ├── embedder.py # Sentence-transformer embeddings
│ │ ├── rag.py # ChromaDB ingestion and retrieval
│ │ ├── rules_engine.py # Keyword-based Indian law violation detection
│ │ ├── risk_classifier.py # XGBoost clause risk scoring
│ │ ├── gemini_analyzer.py # Batched Gemini LLM analysis with retry/fallback
│ │ └── sarvam_translator.py # Multilingual translation
│ ├── utils/
│ │ └── helpers.py
│ ├── test_backend.py # Endpoint smoke tests
│ ├── test_all.py # Integration tests
│ ├── download_cuad.py # CUAD dataset downloader
│ ├── .env.example # Environment variable template
│ └── SECRETS_ROTATION.md # Key rotation guidance
├── data/
│ ├── cuad/ # CUAD dataset (gitignored)
│ └── chroma_db/ # ChromaDB vector store (gitignored)
├── models/
│ └── risk_clf.pkl # Trained XGBoost model (gitignored)
├── frontend/
│ ├── app/
│ │ ├── upload/page.tsx
│ │ ├── processing/page.tsx
│ │ └── results/page.tsx
│ ├── context/AnalysisContext.tsx
│ ├── lib/
│ │ ├── api.ts # Backend API client
│ │ ├── types.ts # TypeScript interfaces
│ │ └── mock.ts # Demo mode data
│ └── .env.local # Frontend env (gitignored)
├── rules.json # 20 Indian legal rules
├── requirements.txt
├── README.md
└── .gitignore
- Python 3.10+
- Node.js 18+
- Tesseract OCR installed on your system
- API keys for Gemini and Sarvam (see Environment Variables)
# 1. Clone the repo
git clone https://github.com/your-username/LawSahayak.git
cd LawSahayak
# 2. Create and activate a virtual environment
cd backend
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
# 3. Install dependencies
pip install -r ../requirements.txt
# 4. Set up environment variables
cp .env.example .env
# Edit backend/.env and add your API keys
# 5. Start the backend
python -m uvicorn main:app --host 127.0.0.1 --port 8000 --reload
Backend runs at: http://localhost:8000
Swagger docs: http://localhost:8000/docs
cd frontend
# Install dependencies
npm install
# Set environment variables
# Create frontend/.env.local with:
# NEXT_PUBLIC_API_URL=http://localhost:8000
# NEXT_PUBLIC_USE_MOCK_API=false
# Start the dev server
npm run dev
Frontend runs at: http://localhost:3000
# Required
GEMINI_API_KEY=your_gemini_api_key
SARVAM_API_KEY=your_sarvam_api_key
# Optional — Supabase analytics
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_anon_key
# Optional — configurable paths (defaults shown)
CHROMA_PATH=../data/chroma_db
CUAD_PATH=../data/cuad/CUAD_v1.json
RISK_MODEL_PATH=../models/risk_clf.pkl
# Frontend (frontend/.env.local)
NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_USE_MOCK_API=false
Security note: Never commit `.env` or `.env.local` files. Both are in `.gitignore`. If keys were previously exposed, rotate them immediately — see `backend/SECRETS_ROTATION.md`.
Analyzes a PDF document and returns risk-classified clauses.
Request: multipart/form-data
| Field | Type | Description |
|---|---|---|
| file | File | PDF document (max 10 MB) |
| language | Query param | Translation language (default: hindi) |
Response:
{
"success": true,
"document_name": "contract.pdf",
"total_clauses": 7,
"summary": { "high": 2, "medium": 1, "low": 4 },
"processing_time_ms": 33589,
"clauses": [
{
"risk_level": "HIGH",
"risk_score": 90,
"plain_english": "...",
"why_risky": "...",
"legal_citation": "Indian Contract Act, 1872, Section 23",
"red_flag": true,
"advice": "...",
"clause_text": "...",
"rule_violations": [...],
"translated_explanation": "..."
}
]
}
Translates text to a specified Indian language.
Request body:
{
"text": "The tenant must pay all damages.",
"language": "hindi"
}
Submits user feedback on a clause analysis.
Request body:
{
"clause_text": "...",
"risk_level": "HIGH",
"user_verdict": "correct",
"correct_risk": null,
"document_type": null
}
Returns aggregate analytics.
{
"total_documents_analyzed": 42,
"total_feedback_received": 8
}
Returns service status.
{ "status": "ok", "timestamp": 1780672690.26 }
# Endpoint smoke tests (requires backend running)
cd backend
python test_backend.py
# Integration tests (embeddings, Gemini, Sarvam, CUAD)
python test_all.py
# Rules engine standalone check
python pipeline/rules_engine.py
- Gemini rate limits — Free tier has per-minute and per-day quotas. The analyzer uses a fallback model chain (`gemini-2.0-flash-lite` → `gemini-2.0-flash` → `gemini-2.5-flash`) and retries transient errors.
- Risk model — `models/risk_clf.pkl` must be trained or provided. Without it, risk scores default to 50 with a warning.
- CUAD dataset — RAG context requires the CUAD dataset at `CUAD_PATH`. Without it, RAG is silently skipped.
- Sarvam translation — `verified: false` in translation responses indicates a fallback translator was used; check your `SARVAM_API_KEY`.
- Frontend — UI is functional but not yet deployed. Run locally for full integration.
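The fallback chain with retries mentioned in the Gemini limitation above could look roughly like this. The model order is taken from that note; the retry count, the exponential backoff, the `TransientError` class, and the injected `call_model` callable are all assumptions, and `gemini_analyzer.py` may behave differently.

```python
import time
from typing import Callable

# Model order taken from the limitation note above.
FALLBACK_MODELS = ("gemini-2.0-flash-lite", "gemini-2.0-flash", "gemini-2.5-flash")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as rate limits or timeouts."""


def generate_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],  # (model_name, prompt) -> response text
    models: tuple[str, ...] = FALLBACK_MODELS,
    retries_per_model: int = 2,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying transient errors before moving on."""
    last_error: Exception | None = None
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return call_model(model, prompt)
            except TransientError as exc:
                last_error = exc
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("All models in the fallback chain failed") from last_error
```

Injecting `call_model` and `sleep` keeps the sketch independent of any particular Gemini client and lets tests run without network access or real delays.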
- Deploy backend to Render / Railway
- Deploy frontend to Vercel
- Train and ship `risk_clf.pkl` artifact
- Add support for more document types (DOCX, images)
- Expand rules.json to cover more Indian statutes
- Add user authentication and document history
- Multilingual UI (not just translated clause text)
Justice Belongs to Everyone.