🕷️ Official Scrapegraph API SDK: Effortlessly extract content from any website. AI-powered. 🤖 Hassle-free web scraping made simple.


🌐 ScrapeGraph AI SDK


Official Python SDK for the ScrapeGraph AI API - Intelligent web scraping and search powered by AI. Extract structured data from any webpage or perform AI-powered web searches with natural language prompts.

Get your API key!

Features

  • 🤖 SmartScraper: Extract structured data from webpages using natural language prompts
  • 🔍 SearchScraper: AI-powered web search with structured results and reference URLs
  • 📝 Markdownify: Convert any webpage into clean, formatted markdown
  • 🕷️ SmartCrawler: Intelligently crawl and extract data from multiple pages
  • 🤖 AgenticScraper: Perform automated browser actions with AI-powered session management
  • 📄 Scrape: Convert webpages to HTML with JavaScript rendering and custom headers
  • ⏰ Scheduled Jobs: Create and manage automated scraping workflows with cron scheduling
  • 💳 Credits Management: Monitor API usage and credit balance
  • 💬 Feedback System: Provide ratings and feedback to improve service quality

🚀 Quick Links

ScrapeGraphAI offers seamless integration with popular frameworks and tools to enhance your scraping capabilities. Whether you're building with Python, using LLM frameworks, or working with no-code platforms, we've got you covered with our comprehensive integration options.

You can find more information at the following link:

Integrations:

📦 Installation

```bash
pip install scrapegraph-py
```

🎯 Core Features

  • 🤖 AI-Powered Extraction & Search: Use natural language to extract data or search the web
  • 📊 Structured Output: Get clean, structured data with optional schema validation
  • 🔄 Multiple Formats: Extract data as JSON, Markdown, or custom schemas
  • ⚡ High Performance: Concurrent processing and automatic retries
  • 🔒 Enterprise Ready: Production-grade security and rate limiting

🛠️ Available Endpoints

🤖 SmartScraper

Extract structured data from any webpage or HTML content using natural language prompts.

Example Usage:

```python
from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Extract data from a webpage
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading, description, and summary of the webpage",
)

print(f"Request ID: {response['request_id']}")
print(f"Result: {response['result']}")

client.close()
```
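The SDK returns plain dictionaries, so it can help to verify the shape of a result before relying on it downstream. The sketch below is an illustration, not part of the SDK; the keys `heading`, `description`, and `summary` are assumptions mirroring the prompt above, not a fixed API contract:

```python
def validate_result(result: dict, required_keys: list) -> list:
    """Return the list of required keys missing from an extraction result."""
    return [key for key in required_keys if key not in result]

# Illustrative result dict, shaped like a possible SmartScraper extraction
sample_result = {"heading": "Example Domain", "description": "Illustrative description"}

missing = validate_result(sample_result, ["heading", "description", "summary"])
print(missing)  # keys the extraction did not return
```

A check like this lets you retry with a refined prompt, or fall back gracefully, when the model omits a field.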

🔍 SearchScraper

Perform AI-powered web searches with structured results and reference URLs.

Example Usage:

```python
from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Perform AI-powered web search
response = client.searchscraper(
    user_prompt="What is the latest version of Python and what are its main features?",
    num_results=3,  # Number of websites to search (default: 3)
)

print(f"Result: {response['result']}")
print("\nReference URLs:")
for url in response["reference_urls"]:
    print(f"- {url}")

client.close()
```
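If you want to keep source attribution alongside the answer, the reference URLs can be rendered into a markdown list for reports or logs. A minimal sketch, assuming a response dict shaped like the one printed above:

```python
def format_references(response: dict) -> str:
    """Render the reference_urls of a SearchScraper-style response as a markdown list."""
    urls = response.get("reference_urls", [])
    return "\n".join(f"- {url}" for url in urls)

# Illustrative response dict (not a live API call)
sample_response = {
    "result": "Python 3.x is the latest major version...",
    "reference_urls": ["https://example.com/a", "https://example.com/b"],
}

print(format_references(sample_response))
```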

📝 Markdownify

Convert any webpage into clean, formatted markdown.

Example Usage:

```python
from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Convert webpage to markdown
response = client.markdownify(
    website_url="https://example.com",
)

print(f"Request ID: {response['request_id']}")
print(f"Markdown: {response['result']}")

client.close()
```
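Since the result is already markdown, a common next step is writing it to disk. A small helper sketch, assuming (as in the example above) that the `result` key holds the markdown string:

```python
from pathlib import Path

def save_markdown(response: dict, path: str) -> int:
    """Write the markdown result of a Markdownify-style response to a file.

    Returns the number of characters written.
    """
    markdown = response.get("result", "")
    return Path(path).write_text(markdown, encoding="utf-8")

# Illustrative usage with a fake response dict:
# save_markdown({"result": "# Example Domain"}, "page.md")
```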

🕷️ SmartCrawler

Intelligently crawl and extract data from multiple pages with configurable depth and batch processing.

Example Usage:

```python
from scrapegraph_py import Client
import os
import time
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Start crawl job
crawl_response = client.crawl(
    url="https://example.com",
    prompt="Extract page titles and main headings",
    data_schema={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "headings": {"type": "array", "items": {"type": "string"}}
        }
    },
    depth=2,
    max_pages=5,
    same_domain_only=True,
)

crawl_id = crawl_response.get("id") or crawl_response.get("task_id")

# Poll for results
if crawl_id:
    for _ in range(10):
        time.sleep(5)
        result = client.get_crawl(crawl_id)
        if result.get("status") == "success":
            print("Crawl completed:", result["result"]["llm_result"])
            break

client.close()
```
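The fixed-interval polling loop above can be generalized into a reusable helper with a polling budget and optional backoff. This is a sketch, not part of the SDK; it accepts any zero-argument fetch callable, such as `lambda: client.get_crawl(crawl_id)`, and assumes the terminal status strings are `"success"` and `"failed"`:

```python
import time

def poll_until_done(fetch, max_attempts=10, delay=5.0, backoff=1.0):
    """Call fetch() until it reports a terminal status or the budget runs out.

    Sleeps delay * backoff**attempt seconds between tries; backoff > 1.0
    gives exponential backoff, backoff == 1.0 gives a fixed interval.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result.get("status") in ("success", "failed"):
            return result
        time.sleep(delay * (backoff ** attempt))
    raise TimeoutError("job did not finish within the polling budget")

# Illustrative usage against the client above:
# result = poll_until_done(lambda: client.get_crawl(crawl_id), delay=5.0, backoff=1.5)
```

Raising on timeout (rather than silently falling through, as the inline loop does) makes it harder to accidentally use a half-finished result.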

🤖 AgenticScraper

Perform automated browser actions on webpages using AI-powered agentic scraping with session management.

Example Usage:

```python
from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Perform automated browser actions
response = client.agenticscraper(
    url="https://example.com",
    use_session=True,
    steps=[
        "Type [email protected] in email input box",
        "Type password123 in password input box",
        "click on login"
    ],
    ai_extraction=False  # Set to True for AI extraction
)

print(f"Request ID: {response['request_id']}")
print(f"Status: {response.get('status')}")

# Get results
result = client.get_agenticscraper(response['request_id'])
print(f"Result: {result.get('result')}")

client.close()
```

📄 Scrape

Convert webpages into HTML format with optional JavaScript rendering and custom headers.

Example Usage:

```python
from scrapegraph_py import Client
import os
from dotenv import load_dotenv

load_dotenv()

# Initialize the client
client = Client(api_key=os.getenv("SGAI_API_KEY"))

# Get HTML content from webpage
response = client.scrape(
    website_url="https://example.com",
    render_heavy_js=False,  # Set to True for JavaScript-heavy sites
)

print(f"Request ID: {response['request_id']}")
print(f"HTML length: {len(response.get('html', ''))} characters")

client.close()
```
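Once you have the raw HTML, the standard library is often enough for light post-processing. For example, a sketch that collects link targets with `html.parser` (the sample HTML below is illustrative, standing in for `response.get('html')`):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<html><body><a href="https://example.com/docs">Docs</a></body></html>')
print(collector.links)
```

For heavier parsing you would likely reach for a dedicated library, but a targeted `HTMLParser` subclass keeps the dependency footprint at zero.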

⏰ Scheduled Jobs

Create, manage, and monitor scheduled scraping jobs with cron expressions and execution history.
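Assuming the schedules use standard 5-field cron expressions (minute, hour, day of month, month, day of week) — the exact scheduling API is covered in the documentation — a small sketch that splits an expression into named fields can make schedules easier to read and sanity-check:

```python
def describe_cron(expression: str) -> dict:
    """Split a standard 5-field cron expression into named fields."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day-of-month month day-of-week")
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    return dict(zip(names, fields))

# "0 9 * * 1" means: minute 0, hour 9, any day of month, any month, Monday
print(describe_cron("0 9 * * 1"))
```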

💳 Credits

Check your API credit balance and usage.

💬 Feedback

Send feedback and ratings for scraping requests to help improve the service.

🌟 Key Benefits

  • 📝 Natural Language Queries: No complex selectors or XPath needed
  • 🎯 Precise Extraction: AI understands context and structure
  • 🔄 Adaptive Processing: Works with both web content and direct HTML
  • 📊 Schema Validation: Ensure data consistency with Pydantic
  • ⚡ Async Support: Handle multiple requests efficiently
  • 🔍 Source Attribution: Get reference URLs for search results

💡 Use Cases

  • 🏢 Business Intelligence: Extract company information and contacts
  • 📊 Market Research: Gather product data and pricing
  • 📰 Content Aggregation: Convert articles to structured formats
  • 🔍 Data Mining: Extract specific information from multiple sources
  • 📱 App Integration: Feed clean data into your applications
  • 🌐 Web Research: Perform AI-powered searches with structured results

📖 Documentation

For detailed documentation and examples, visit:

💬 Support & Feedback

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by ScrapeGraph AI
