LawSahayak ⚖️

AI-powered legal document analysis for Indian citizens — understand contracts before you sign.

LawSahayak analyzes PDF contracts and tenancy agreements, detects risky clauses, cites relevant Indian laws, explains everything in plain English, and translates explanations into Hindi and other regional languages.

Features

PDF Analysis — Extracts and segments clauses from uploaded PDFs with OCR fallback for scanned documents
Rule-Based Detection — Flags violations across 20 Indian legal rules (Indian Contract Act 1872, Transfer of Property Act 1882, Model Tenancy Act 2021, RERA 2016, and more)
ML Risk Scoring — XGBoost classifier assigns LOW / MEDIUM / HIGH risk to each clause
LLM Analysis — Google Gemini generates plain-English summaries, legal citations, and actionable advice per clause
RAG Context — ChromaDB retrieval over the CUAD dataset enriches clause analysis with similar legal precedents
Multilingual Translation — Sarvam AI translates explanations into Hindi and 9 other Indian languages
Analytics — Supabase stores analysis results and user feedback
Graceful Degradation — Friendly error messages for LLM timeouts, missing model files, translation failures, and backend offline states
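
The rule-based detection described above can be pictured as a keyword scan over each clause. The following is a minimal sketch only: the `Rule` shape, the rule IDs, and the keywords are hypothetical and are not taken from rules.json, whose real schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape; the real rules.json schema may differ."""
    rule_id: str
    citation: str
    keywords: tuple[str, ...]


def match_rules(clause_text: str, rules: list[Rule]) -> list[Rule]:
    """Return every rule with at least one keyword present in the clause."""
    text = clause_text.lower()
    return [r for r in rules if any(k.lower() in text for k in r.keywords)]


# Illustrative rules only, not copied from rules.json.
EXAMPLE_RULES = [
    Rule("deposit-cap", "Model Tenancy Act, 2021", ("security deposit",)),
    Rule("unlawful-object", "Indian Contract Act, 1872, Section 23",
         ("waives all legal rights",)),
]

hits = match_rules(
    "The tenant waives all legal rights to dispute eviction.", EXAMPLE_RULES
)
print([r.rule_id for r in hits])
```

Plain substring matching is easy to audit but misses paraphrases, which is presumably why the pipeline also runs the ML and LLM stages.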

Architecture

PDF Upload
    │
    ▼
Text Extraction (PyMuPDF + Tesseract OCR fallback)
    │
    ▼
Clause Segmentation (up to 50 clauses per document)
    │
    ▼
Rule Matching ──────────────────────────────────► 20 Indian legal rules (rules.json)
    │
    ▼
ML Risk Classification (XGBoost)
    │
    ▼
RAG Retrieval (ChromaDB + CUAD dataset)
    │
    ▼
LLM Analysis — batched (Google Gemini)
    │
    ▼
Translation (Sarvam AI)
    │
    ▼
JSON Response → Next.js Frontend → Supabase (analytics)
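
As a rough illustration of the flow above, the stages can be modelled as functions applied in order to a capped list of clause records. This is a sketch under assumptions: `run_pipeline`, `Stage`, and `fake_risk_stage` are hypothetical names, and the actual backend may enforce the 50-clause limit differently.

```python
from typing import Callable

MAX_CLAUSES = 50  # cap stated in the architecture diagram

# A stage takes the clause records and returns enriched records.
Stage = Callable[[list[dict]], list[dict]]


def run_pipeline(clauses: list[str], stages: list[Stage]) -> list[dict]:
    """Cap the clause list, then pass the records through each stage in order."""
    records = [{"clause_text": c} for c in clauses[:MAX_CLAUSES]]
    for stage in stages:
        records = stage(records)
    return records


def fake_risk_stage(records: list[dict]) -> list[dict]:
    # Placeholder for the XGBoost step: marks every clause LOW.
    return [{**r, "risk_level": "LOW"} for r in records]


result = run_pipeline([f"Clause {i}" for i in range(60)], [fake_risk_stage])
print(len(result), result[0])
```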

Tech Stack

Layer	Technology
Backend API	FastAPI, Uvicorn, Pydantic
LLM	Google Gemini 2.0 Flash (with fallback chain)
Embeddings	sentence-transformers/all-MiniLM-L6-v2
Vector DB	ChromaDB (persistent local client)
ML Risk Model	XGBoost (trained on CUAD dataset)
Translation	Sarvam AI
PDF Processing	PyMuPDF, OpenCV, Tesseract OCR
Database	Supabase (PostgreSQL)
Legal Dataset	CUAD (510 contracts for RAG context)
Frontend	Next.js 14, TypeScript, Tailwind CSS

Project Structure

LawSahayak/
├── backend/
│   ├── main.py                    # FastAPI app — 6 endpoints
│   ├── pipeline/
│   │   ├── ingestion.py           # PDF extraction + OCR + clause segmentation
│   │   ├── embedder.py            # Sentence-transformer embeddings
│   │   ├── rag.py                 # ChromaDB ingestion and retrieval
│   │   ├── rules_engine.py        # Keyword-based Indian law violation detection
│   │   ├── risk_classifier.py     # XGBoost clause risk scoring
│   │   ├── gemini_analyzer.py     # Batched Gemini LLM analysis with retry/fallback
│   │   └── sarvam_translator.py   # Multilingual translation
│   ├── utils/
│   │   └── helpers.py
│   ├── test_backend.py            # Endpoint smoke tests
│   ├── test_all.py                # Integration tests
│   ├── download_cuad.py           # CUAD dataset downloader
│   ├── .env.example               # Environment variable template
│   └── SECRETS_ROTATION.md        # Key rotation guidance
├── data/
│   ├── cuad/                      # CUAD dataset (gitignored)
│   └── chroma_db/                 # ChromaDB vector store (gitignored)
├── models/
│   └── risk_clf.pkl               # Trained XGBoost model (gitignored)
├── frontend/
│   ├── app/
│   │   ├── upload/page.tsx
│   │   ├── processing/page.tsx
│   │   └── results/page.tsx
│   ├── context/AnalysisContext.tsx
│   ├── lib/
│   │   ├── api.ts                 # Backend API client
│   │   ├── types.ts               # TypeScript interfaces
│   │   └── mock.ts                # Demo mode data
│   └── .env.local                 # Frontend env (gitignored)
├── rules.json                     # 20 Indian legal rules
├── requirements.txt
├── README.md
└── .gitignore

Getting Started

Prerequisites

Python 3.10+
Node.js 18+
Tesseract OCR installed on your system
API keys for Gemini and Sarvam (see Environment Variables)

Backend Setup

# 1. Clone the repo
git clone https://github.com/your-username/LawSahayak.git
cd LawSahayak

# 2. Create and activate a virtual environment
cd backend
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r ../requirements.txt

# 4. Set up environment variables
cp .env.example .env           # macOS/Linux
# copy .env.example .env       # Windows
# Edit backend/.env and add your API keys

# 5. Start the backend
python -m uvicorn main:app --host 127.0.0.1 --port 8000 --reload

Backend runs at: http://localhost:8000
Swagger docs: http://localhost:8000/docs

Frontend Setup

cd frontend

# Install dependencies
npm install

# Set environment variables
# Create frontend/.env.local with:
# NEXT_PUBLIC_API_URL=http://localhost:8000
# NEXT_PUBLIC_USE_MOCK_API=false

# Start the dev server
npm run dev

Frontend runs at: http://localhost:3000

Environment Variables

Backend (`backend/.env`)

# Required
GEMINI_API_KEY=your_gemini_api_key
SARVAM_API_KEY=your_sarvam_api_key

# Optional — Supabase analytics
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_anon_key

# Optional — configurable paths (defaults shown)
CHROMA_PATH=../data/chroma_db
CUAD_PATH=../data/cuad/CUAD_v1.json
RISK_MODEL_PATH=../models/risk_clf.pkl
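
For orientation, here is one way these variables could be read in Python, with the documented defaults for the optional paths. It is a sketch, not the backend's actual loader: `Settings` and `load_settings` are hypothetical names, and whether the real code fails fast on missing keys is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    sarvam_api_key: str
    chroma_path: str
    cuad_path: str
    risk_model_path: str


def load_settings(env=None) -> Settings:
    """Read required keys and apply the documented defaults for paths."""
    env = os.environ if env is None else env
    missing = [k for k in ("GEMINI_API_KEY", "SARVAM_API_KEY") if not env.get(k)]
    if missing:
        # Assumption: fail fast when a required key is absent.
        raise RuntimeError(f"Missing required variables: {', '.join(missing)}")
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        sarvam_api_key=env["SARVAM_API_KEY"],
        chroma_path=env.get("CHROMA_PATH", "../data/chroma_db"),
        cuad_path=env.get("CUAD_PATH", "../data/cuad/CUAD_v1.json"),
        risk_model_path=env.get("RISK_MODEL_PATH", "../models/risk_clf.pkl"),
    )


demo = load_settings({"GEMINI_API_KEY": "g", "SARVAM_API_KEY": "s"})
print(demo.chroma_path)
```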

Frontend (`frontend/.env.local`)

NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_USE_MOCK_API=false

Security note: Never commit .env or .env.local files. Both are in .gitignore. If keys were previously exposed, rotate them immediately — see backend/SECRETS_ROTATION.md.

API Reference

`POST /analyze`

Analyzes a PDF document and returns risk-classified clauses.

Request: multipart/form-data

Field	Type	Description
`file`	File	PDF document (max 10 MB)
`language`	Query param	Translation language (default: `hindi`)

Response:

{
  "success": true,
  "document_name": "contract.pdf",
  "total_clauses": 7,
  "summary": { "high": 2, "medium": 1, "low": 4 },
  "processing_time_ms": 33589,
  "clauses": [
    {
      "risk_level": "HIGH",
      "risk_score": 90,
      "plain_english": "...",
      "why_risky": "...",
      "legal_citation": "Indian Contract Act, 1872, Section 23",
      "red_flag": true,
      "advice": "...",
      "clause_text": "...",
      "rule_violations": [...],
      "translated_explanation": "..."
    }
  ]
}
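
A client will usually want the riskiest clauses first. The sketch below filters a response of the shape shown above; `flagged_clauses` is a hypothetical helper and the sample data is made up for illustration.

```python
def flagged_clauses(response: dict) -> list[dict]:
    """Return HIGH-risk or red-flagged clauses, highest risk_score first."""
    flagged = [
        c for c in response.get("clauses", [])
        if c.get("risk_level") == "HIGH" or c.get("red_flag")
    ]
    return sorted(flagged, key=lambda c: c.get("risk_score", 0), reverse=True)


# Made-up sample in the documented shape, trimmed to the fields used here.
sample = {
    "success": True,
    "summary": {"high": 1, "medium": 0, "low": 1},
    "clauses": [
        {"risk_level": "LOW", "risk_score": 20, "red_flag": False,
         "legal_citation": ""},
        {"risk_level": "HIGH", "risk_score": 90, "red_flag": True,
         "legal_citation": "Indian Contract Act, 1872, Section 23"},
    ],
}

for clause in flagged_clauses(sample):
    print(clause["risk_score"], clause["legal_citation"])
```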

`POST /translate`

Translates text to a specified Indian language.

Request body:

{
  "text": "The tenant must pay all damages.",
  "language": "hindi"
}
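
The request body above can be sent with the Python standard library. This sketch only builds the request; `build_translate_request` is a hypothetical helper, and the base URL assumes the local backend address from Backend Setup. The response shape is not documented here, so the example does not parse it.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed local backend address


def build_translate_request(text: str, language: str = "hindi") -> urllib.request.Request:
    """Build a JSON POST request for the /translate endpoint."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/translate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_translate_request("The tenant must pay all damages.")

# To send it against a running backend:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp))
```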

`POST /feedback`

Submits user feedback on a clause analysis.

Request body:

{
  "clause_text": "...",
  "risk_level": "HIGH",
  "user_verdict": "correct",
  "correct_risk": null,
  "document_type": null
}
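
A small builder can catch malformed feedback before it is posted. This is a sketch with assumptions: `build_feedback` is a hypothetical helper, and it assumes `correct_risk` accepts the same LOW / MEDIUM / HIGH values as `risk_level`. The allowed values of `user_verdict` are not documented, so they are not validated.

```python
RISK_LEVELS = {"LOW", "MEDIUM", "HIGH"}


def build_feedback(clause_text, risk_level, user_verdict,
                   correct_risk=None, document_type=None) -> dict:
    """Assemble a /feedback body, rejecting unknown risk levels."""
    if risk_level not in RISK_LEVELS:
        raise ValueError(f"unknown risk_level: {risk_level!r}")
    # Assumption: correct_risk uses the same vocabulary as risk_level.
    if correct_risk is not None and correct_risk not in RISK_LEVELS:
        raise ValueError(f"unknown correct_risk: {correct_risk!r}")
    return {
        "clause_text": clause_text,
        "risk_level": risk_level,
        "user_verdict": user_verdict,
        "correct_risk": correct_risk,
        "document_type": document_type,
    }


payload = build_feedback("The tenant must pay all damages.", "HIGH", "correct")
print(payload)
```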

`GET /stats`

Returns aggregate analytics.

{
  "total_documents_analyzed": 42,
  "total_feedback_received": 8
}

`GET /health`

Returns service status.

{ "status": "ok", "timestamp": 1780672690.26 }

Running Tests

# Endpoint smoke tests (requires backend running)
cd backend
python test_backend.py

# Integration tests (embeddings, Gemini, Sarvam, CUAD)
python test_all.py

# Rules engine standalone check
python pipeline/rules_engine.py

Known Limitations

Gemini rate limits — Free tier has per-minute and per-day quotas. The analyzer uses a fallback model chain (gemini-2.0-flash-lite → gemini-2.0-flash → gemini-2.5-flash) and retries transient errors.
Risk model — models/risk_clf.pkl must be trained or provided. Without it, risk scores default to 50 with a warning.
CUAD dataset — RAG context requires the CUAD dataset at CUAD_PATH. Without it, RAG is silently skipped.
Sarvam translation — verified: false in translation responses indicates a fallback translator was used; check your SARVAM_API_KEY.
Frontend — UI is functional but not yet deployed. Run locally for full integration.
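
The fallback chain mentioned under Gemini rate limits can be sketched as a nested loop: retry each model a few times, then move to the next one. This is an illustration rather than the code in gemini_analyzer.py; `call_with_fallback`, `TransientError`, and `fake_client` are hypothetical, and the retry count and backoff values are assumptions.

```python
import time
from typing import Callable

# Model order as listed under Known Limitations.
MODEL_CHAIN = ["gemini-2.0-flash-lite", "gemini-2.0-flash", "gemini-2.5-flash"]


class TransientError(Exception):
    """Stand-in for a rate-limit or timeout error from the LLM client."""


def call_with_fallback(prompt: str, call_model: Callable[[str, str], str],
                       models=MODEL_CHAIN, retries_per_model: int = 2,
                       backoff_s: float = 0.0) -> str:
    """Try each model in order, retrying transient errors before moving on."""
    last_error = None
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return call_model(model, prompt)
            except TransientError as exc:
                last_error = exc
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("All models in the fallback chain failed") from last_error


def fake_client(model: str, prompt: str) -> str:
    # Simulates the first two models being rate limited.
    if model != "gemini-2.5-flash":
        raise TransientError(f"{model} is rate limited")
    return f"[{model}] summary of: {prompt}"


answer = call_with_fallback("Clause 4", fake_client)
print(answer)
```

Passing the client in as `call_model` keeps the retry logic testable without network access or an API key.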

Roadmap

Deploy backend to Render / Railway
Deploy frontend to Vercel
Train and ship risk_clf.pkl artifact
Add support for more document types (DOCX, images)
Expand rules.json to cover more Indian statutes
Add user authentication and document history
Multilingual UI (not just translated clause text)

Justice Belongs to Everyone.
