AGTechathon-2-0/Probably-Coding

LawSahayak ⚖️

AI-powered legal document analysis for Indian citizens — understand contracts before you sign.

LawSahayak analyzes PDF contracts and tenancy agreements, detects risky clauses, cites relevant Indian laws, explains everything in plain English, and translates explanations into Hindi and other regional languages.



Features

  • PDF Analysis — Extracts and segments clauses from uploaded PDFs with OCR fallback for scanned documents
  • Rule-Based Detection — Flags violations across 20 Indian legal rules (Indian Contract Act 1872, Transfer of Property Act 1882, Model Tenancy Act 2021, RERA 2016, and more)
  • ML Risk Scoring — XGBoost classifier assigns LOW / MEDIUM / HIGH risk to each clause
  • LLM Analysis — Google Gemini generates plain-English summaries, legal citations, and actionable advice per clause
  • RAG Context — ChromaDB retrieval over the CUAD dataset enriches clause analysis with similar legal precedents
  • Multilingual Translation — Sarvam AI translates explanations into Hindi and 9 other Indian languages
  • Analytics — Supabase stores analysis results and user feedback
  • Graceful Degradation — Friendly error messages for LLM timeouts, missing model files, translation failures, and backend offline states

Architecture

PDF Upload
    │
    ▼
Text Extraction (PyMuPDF + Tesseract OCR fallback)
    │
    ▼
Clause Segmentation (up to 50 clauses per document)
    │
    ▼
Rule Matching ──────────────────────────────────► 20 Indian legal rules (rules.json)
    │
    ▼
ML Risk Classification (XGBoost)
    │
    ▼
RAG Retrieval (ChromaDB + CUAD dataset)
    │
    ▼
LLM Analysis — batched (Google Gemini)
    │
    ▼
Translation (Sarvam AI)
    │
    ▼
JSON Response → Next.js Frontend → Supabase (analytics)
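The rule-matching stage is described elsewhere in this README as keyword-based. The sketch below shows roughly what such a matcher could look like. The `Rule` fields, the two sample rules, and their keywords are invented for illustration; the actual schema of rules.json is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    # Hypothetical rule shape; the real rules.json schema may differ.
    rule_id: str
    citation: str
    keywords: tuple[str, ...]


# Illustrative sample rules only, not taken from rules.json.
SAMPLE_RULES = (
    Rule("R1", "Indian Contract Act, 1872, Section 23",
         ("waive all legal rights", "unlawful")),
    Rule("R2", "Model Tenancy Act, 2021",
         ("without notice", "evict immediately")),
)


def match_rules(clause_text: str, rules=SAMPLE_RULES) -> list[Rule]:
    """Return every rule with at least one keyword present in the clause."""
    text = clause_text.lower()
    return [rule for rule in rules if any(kw in text for kw in rule.keywords)]


if __name__ == "__main__":
    for hit in match_rules("The landlord may evict immediately and without notice."):
        print(hit.rule_id, hit.citation)
```

Plain substring matching is cheap and easy to audit, but it can miss paraphrased clauses, which is presumably why the pipeline also runs the ML and LLM stages after it.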

Tech Stack

Layer           Technology
Backend API     FastAPI, Uvicorn, Pydantic
LLM             Google Gemini 2.0 Flash (with fallback chain)
Embeddings      sentence-transformers/all-MiniLM-L6-v2
Vector DB       ChromaDB (persistent local client)
ML Risk Model   XGBoost (trained on CUAD dataset)
Translation     Sarvam AI
PDF Processing  PyMuPDF, OpenCV, Tesseract OCR
Database        Supabase (PostgreSQL)
Legal Dataset   CUAD (510 contracts for RAG context)
Frontend        Next.js 14, TypeScript, Tailwind CSS

Project Structure

LawSahayak/
├── backend/
│   ├── main.py                    # FastAPI app — 6 endpoints
│   ├── pipeline/
│   │   ├── ingestion.py           # PDF extraction + OCR + clause segmentation
│   │   ├── embedder.py            # Sentence-transformer embeddings
│   │   ├── rag.py                 # ChromaDB ingestion and retrieval
│   │   ├── rules_engine.py        # Keyword-based Indian law violation detection
│   │   ├── risk_classifier.py     # XGBoost clause risk scoring
│   │   ├── gemini_analyzer.py     # Batched Gemini LLM analysis with retry/fallback
│   │   └── sarvam_translator.py   # Multilingual translation
│   ├── utils/
│   │   └── helpers.py
│   ├── test_backend.py            # Endpoint smoke tests
│   ├── test_all.py                # Integration tests
│   ├── download_cuad.py           # CUAD dataset downloader
│   ├── .env.example               # Environment variable template
│   └── SECRETS_ROTATION.md        # Key rotation guidance
├── data/
│   ├── cuad/                      # CUAD dataset (gitignored)
│   └── chroma_db/                 # ChromaDB vector store (gitignored)
├── models/
│   └── risk_clf.pkl               # Trained XGBoost model (gitignored)
├── frontend/
│   ├── app/
│   │   ├── upload/page.tsx
│   │   ├── processing/page.tsx
│   │   └── results/page.tsx
│   ├── context/AnalysisContext.tsx
│   ├── lib/
│   │   ├── api.ts                 # Backend API client
│   │   ├── types.ts               # TypeScript interfaces
│   │   └── mock.ts                # Demo mode data
│   └── .env.local                 # Frontend env (gitignored)
├── rules.json                     # 20 Indian legal rules
├── requirements.txt
├── README.md
└── .gitignore

Getting Started

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • Tesseract OCR installed on your system
  • API keys for Gemini and Sarvam (see Environment Variables)

Backend Setup

# 1. Clone the repo
git clone https://github.com/your-username/LawSahayak.git
cd LawSahayak

# 2. Create and activate a virtual environment
cd backend
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

# 3. Install dependencies
pip install -r ../requirements.txt

# 4. Set up environment variables
cp .env.example .env
# Edit backend/.env and add your API keys

# 5. Start the backend
python -m uvicorn main:app --host 127.0.0.1 --port 8000 --reload

Backend runs at: http://localhost:8000
Swagger docs: http://localhost:8000/docs

Frontend Setup

cd frontend

# Install dependencies
npm install

# Set environment variables
# Create frontend/.env.local with:
# NEXT_PUBLIC_API_URL=http://localhost:8000
# NEXT_PUBLIC_USE_MOCK_API=false

# Start the dev server
npm run dev

Frontend runs at: http://localhost:3000


Environment Variables

Backend (backend/.env)

# Required
GEMINI_API_KEY=your_gemini_api_key
SARVAM_API_KEY=your_sarvam_api_key

# Optional — Supabase analytics
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_anon_key

# Optional — configurable paths (defaults shown)
CHROMA_PATH=../data/chroma_db
CUAD_PATH=../data/cuad/CUAD_v1.json
RISK_MODEL_PATH=../models/risk_clf.pkl
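As a rough illustration of how the backend variables above could be read, the sketch below fails fast when a required key is missing and falls back to the documented defaults for the optional paths. The `Settings` class and `load_settings` function are hypothetical names, not part of the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names mirror the variables listed above.
    gemini_api_key: str
    sarvam_api_key: str
    supabase_url: Optional[str]
    supabase_key: Optional[str]
    chroma_path: str
    cuad_path: str
    risk_model_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, raising if a required key is absent."""
    required = ("GEMINI_API_KEY", "SARVAM_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        sarvam_api_key=env["SARVAM_API_KEY"],
        # Supabase is optional, so empty or unset values become None.
        supabase_url=env.get("SUPABASE_URL") or None,
        supabase_key=env.get("SUPABASE_KEY") or None,
        chroma_path=env.get("CHROMA_PATH", "../data/chroma_db"),
        cuad_path=env.get("CUAD_PATH", "../data/cuad/CUAD_v1.json"),
        risk_model_path=env.get("RISK_MODEL_PATH", "../models/risk_clf.pkl"),
    )
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.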

Frontend (frontend/.env.local)

NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_USE_MOCK_API=false

Security note: Never commit .env or .env.local files. Both are in .gitignore. If keys were previously exposed, rotate them immediately — see backend/SECRETS_ROTATION.md.


API Reference

POST /analyze

Analyzes a PDF document and returns risk-classified clauses.

Request: multipart/form-data

Field     Type         Description
file      File         PDF document (max 10 MB)
language  Query param  Translation language (default: hindi)

Response:

{
  "success": true,
  "document_name": "contract.pdf",
  "total_clauses": 7,
  "summary": { "high": 2, "medium": 1, "low": 4 },
  "processing_time_ms": 33589,
  "clauses": [
    {
      "risk_level": "HIGH",
      "risk_score": 90,
      "plain_english": "...",
      "why_risky": "...",
      "legal_citation": "Indian Contract Act, 1872, Section 23",
      "red_flag": true,
      "advice": "...",
      "clause_text": "...",
      "rule_violations": [...],
      "translated_explanation": "..."
    }
  ]
}
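To show how a client might consume this response, the sketch below recomputes the `summary` counts from the `clauses` array and picks out red-flagged clauses. The helper names are hypothetical. It assumes `risk_level` is always one of LOW, MEDIUM, or HIGH, as described under Features.

```python
from collections import Counter


def summarize(clauses: list[dict]) -> dict[str, int]:
    """Count clauses per risk level, using the lowercase keys of `summary`."""
    counts = Counter(str(c.get("risk_level", "")).lower() for c in clauses)
    return {level: counts.get(level, 0) for level in ("high", "medium", "low")}


def red_flags(clauses: list[dict]) -> list[dict]:
    """Return only the clauses the backend marked with red_flag = true."""
    return [c for c in clauses if c.get("red_flag")]


if __name__ == "__main__":
    sample = [
        {"risk_level": "HIGH", "red_flag": True},
        {"risk_level": "LOW", "red_flag": False},
        {"risk_level": "LOW", "red_flag": False},
    ]
    print(summarize(sample))
    print(len(red_flags(sample)))
```

A check like this can also serve as a quick sanity test that `summary` and `clauses` in a response agree with each other.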

POST /translate

Translates text to a specified Indian language.

Request body:

{
  "text": "The tenant must pay all damages.",
  "language": "hindi"
}
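A minimal sketch of building this request with the Python standard library follows. The function name is hypothetical and the base URL assumes the local backend from Getting Started. The response format is not documented in this README, so the sketch stops at constructing the request.

```python
import json
import urllib.request


def build_translate_request(
    base_url: str, text: str, language: str = "hindi"
) -> urllib.request.Request:
    """Build a JSON POST request for the /translate endpoint."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_translate_request(
        "http://localhost:8000", "The tenant must pay all damages."
    )
    print(req.get_method(), req.full_url)
```

With the backend running, passing the request to `urllib.request.urlopen(req)` would send it.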

POST /feedback

Submits user feedback on a clause analysis.

Request body:

{
  "clause_text": "...",
  "risk_level": "HIGH",
  "user_verdict": "correct",
  "correct_risk": null,
  "document_type": null
}

GET /stats

Returns aggregate analytics.

{
  "total_documents_analyzed": 42,
  "total_feedback_received": 8
}

GET /health

Returns service status.

{ "status": "ok", "timestamp": 1780672690.26 }

Running Tests

# Endpoint smoke tests (requires backend running)
cd backend
python test_backend.py

# Integration tests (embeddings, Gemini, Sarvam, CUAD)
python test_all.py

# Rules engine standalone check
python pipeline/rules_engine.py

Known Limitations

  • Gemini rate limits — Free tier has per-minute and per-day quotas. The analyzer uses a fallback model chain (gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash) and retries transient errors.
  • Risk model — models/risk_clf.pkl must be trained or provided. Without it, risk scores default to 50 with a warning.
  • CUAD dataset — RAG context requires the CUAD dataset at CUAD_PATH. Without it, RAG is silently skipped.
  • Sarvam translation — verified: false in translation responses indicates a fallback translator was used; check your SARVAM_API_KEY.
  • Frontend — UI is functional but not yet deployed. Run locally for full integration.
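The fallback-and-retry behaviour mentioned in the first limitation can be sketched as below. This is an assumption about the general pattern, not the code in gemini_analyzer.py: `TransientError`, `generate_with_fallback`, the retry count, and the backoff values are all hypothetical, and `call_model` stands in for whatever function actually calls Gemini.

```python
import time
from typing import Callable, Optional, Sequence

# Model names as listed in the limitation above.
MODEL_CHAIN = ("gemini-2.0-flash-lite", "gemini-2.0-flash", "gemini-2.5-flash")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, rate limit)."""


def generate_with_fallback(
    call_model: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str] = MODEL_CHAIN,
    retries_per_model: int = 2,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying transient errors with backoff."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(retries_per_model):
            try:
                return call_model(model, prompt)
            except TransientError as exc:
                last_error = exc
                # Exponential backoff: 1x, 2x, ... the base delay.
                sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("All models in the fallback chain failed") from last_error
```

Injecting `sleep` makes the retry loop testable without real delays. Non-transient exceptions propagate immediately, so a bad API key is not retried across the whole chain.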

Roadmap

  • Deploy backend to Render / Railway
  • Deploy frontend to Vercel
  • Train and ship risk_clf.pkl artifact
  • Add support for more document types (DOCX, images)
  • Expand rules.json to cover more Indian statutes
  • Add user authentication and document history
  • Multilingual UI (not just translated clause text)

Justice Belongs to Everyone.
