MAGIK-935: Website URL Provider — Meta/OG/JSON-LD Extraction #116

@MAGIKBIT

Epic: EPIC-025 — #113
Priority: P0
Estimate: 5 SP
Depends on: MAGIK-934

Description

Create WebsiteProvider that crawls a given website URL and extracts profile-relevant data from HTML meta tags, Open Graph tags, JSON-LD structured data, schema.org markup, and visible page content.

Implementation

Class: Libraries/Enrichment/WebsiteProvider.php

URL matching: Any valid HTTP/HTTPS URL not matched by social-specific providers.
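A minimal sketch of the fallback matching rule, written in Python for illustration only (the actual class is PHP). The `SOCIAL_HOSTS` list and the function names are assumptions; in the real service the exclusion could just as well come from provider registration order rather than an explicit host list.

```python
from urllib.parse import urlparse

# Hypothetical hosts claimed by social-specific providers (assumption, not from the ticket).
SOCIAL_HOSTS = ("facebook.com", "instagram.com", "youtube.com", "twitter.com", "linkedin.com")


def _hostname(url: str) -> str:
    """Return the lowercased hostname, or an empty string if the URL cannot be parsed."""
    try:
        return (urlparse(url.strip()).hostname or "").lower()
    except ValueError:
        return ""


def is_valid_http_url(url: str) -> bool:
    """True when the URL has an http/https scheme and a non-empty host."""
    try:
        scheme = urlparse(url.strip()).scheme
    except ValueError:
        return False
    return scheme in ("http", "https") and bool(_hostname(url))


def is_social_url(url: str) -> bool:
    """True when the host is a known social domain or one of its subdomains."""
    host = _hostname(url)
    return any(host == h or host.endswith("." + h) for h in SOCIAL_HOSTS)


def can_handle_url(url: str) -> bool:
    """Fallback match: valid HTTP/HTTPS and not claimed by a social provider."""
    return is_valid_http_url(url) and not is_social_url(url)
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as a query string that happens to contain "facebook.com".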

Extraction layers (in priority order):

  1. JSON-LD / schema.org: <script type="application/ld+json"> blocks for Organization, LocalBusiness, Person
  2. Open Graph tags: og:title, og:description, og:image, og:url, og:site_name
  3. HTML meta tags: <meta name="description">, <meta name="author">, <link rel="icon">
  4. Visible content heuristics: mailto: and tel: links, then regex for emails, phone patterns, address blocks
  5. Social link discovery: <a href> values matching known social platform URL patterns
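The first three layers and the anchor collection for layer 5 can be sketched in a single parsing pass. This is an illustrative Python sketch using only the standard library, not the planned HtmlExtractor.php; the class name `HeadExtractor` and its attribute names are assumptions.

```python
import json
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collects JSON-LD blocks, Open Graph tags, basic meta/link tags, and anchor hrefs."""

    def __init__(self):
        super().__init__()
        self.json_ld = []  # parsed JSON-LD objects
        self.og = {}       # e.g. {"og:title": "..."}
        self.meta = {}     # "description", "author", "icon"
        self.links = []    # raw <a href> values for later mailto/tel/social matching
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and a.get("content") is not None:
            prop = a.get("property") or ""
            name = (a.get("name") or "").lower()
            if prop.startswith("og:"):
                self.og.setdefault(prop, a["content"])  # first occurrence wins
            elif name in ("description", "author"):
                self.meta.setdefault(name, a["content"])
        elif tag == "link" and "icon" in (a.get("rel") or "").lower().split():
            if a.get("href"):
                self.meta.setdefault("icon", a["href"])
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # malformed JSON-LD is skipped so lower layers can still answer
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


def extract(html: str) -> HeadExtractor:
    """Parse an HTML string and return the populated extractor."""
    parser = HeadExtractor()
    parser.feed(html)
    parser.close()
    return parser
```

Malformed JSON-LD is dropped silently here so that the Open Graph and meta layers can still supply values; whether the real provider should also record such failures as evidence is left open.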

Extracted fields:

| Field | Source (priority order) | Confidence |
|---|---|---|
| Company/Site name | JSON-LD > OG > <title> | 0.9 / 0.8 / 0.6 |
| Description | JSON-LD > OG > meta description | 0.9 / 0.8 / 0.7 |
| Logo/Favicon | JSON-LD logo > OG image > <link rel="icon"> | 0.9 / 0.7 / 0.5 |
| Emails | JSON-LD > mailto: links > regex | 0.9 / 0.8 / 0.5 |
| Phones | JSON-LD > tel: links > regex | 0.9 / 0.8 / 0.4 |
| Address | JSON-LD PostalAddress > address block regex | 0.9 / 0.3 |
| Social links | <a> href matching fb/ig/yt/tw/li patterns | 0.8 |
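One way to read the table above is as a per-field lookup: walk the sources in priority order and return the first non-empty value with that source's confidence. The Python sketch below illustrates this; the source keys (`json_ld`, `og`, `title`, and so on) and the `Candidate` type are hypothetical names, while the confidence numbers are taken from the table.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    value: str
    source: str        # which layer produced the value (source evidence)
    confidence: float


# Source priority and confidence per field, as listed in the table above.
PRIORITY = {
    "name":        [("json_ld", 0.9), ("og", 0.8), ("title", 0.6)],
    "description": [("json_ld", 0.9), ("og", 0.8), ("meta", 0.7)],
    "logo":        [("json_ld", 0.9), ("og", 0.7), ("icon", 0.5)],
    "email":       [("json_ld", 0.9), ("mailto", 0.8), ("regex", 0.5)],
    "phone":       [("json_ld", 0.9), ("tel", 0.8), ("regex", 0.4)],
    "address":     [("json_ld", 0.9), ("regex", 0.3)],
}


def resolve(field: str, found: dict) -> Optional[Candidate]:
    """Return the highest-priority non-empty value for a field, or None.

    `found` maps a source key to the value that layer extracted, if any.
    """
    for source, confidence in PRIORITY[field]:
        value = found.get(source)
        if value:
            return Candidate(value, source, confidence)
    return None
```

For example, `resolve("name", {"og": "Acme OG", "title": "Acme"})` returns the Open Graph value with confidence 0.8, because no JSON-LD name was found.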

HTTP client: CodeIgniter's CURLRequest with 10s timeout, User-Agent: MagikTap-Enrichment/1.0, robots.txt check.
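The robots.txt check can be kept separate from the fetch so that it is testable against fixture text. The sketch below uses Python's standard `urllib.robotparser` purely for illustration; the PHP implementation would need its own parser or a library, and the function names here are assumptions. The network call itself is omitted.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "MagikTap-Enrichment/1.0"  # from the ticket
TIMEOUT_SECONDS = 10                    # from the ticket; applies to the omitted HTTP fetch


def robots_url(page_url: str) -> str:
    """Build the robots.txt location for the site that hosts page_url."""
    parts = urlparse(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_txt: str, page_url: str, user_agent: str = USER_AGENT) -> bool:
    """Check an already-downloaded robots.txt body against a URL.

    An empty body (for example, when the site has no robots.txt) allows everything.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)
```

A caller would fetch `robots_url(url)` first, pass the body to `is_allowed`, and only then request the page, returning a blocked-site error when the check fails.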

Files

| File | Action |
|---|---|
| Libraries/Enrichment/WebsiteProvider.php | Create |
| Libraries/Enrichment/HtmlExtractor.php | Create (shared HTML parsing utility) |
| Libraries/EnrichmentService.php | Modify (register provider) |
| Config/Enrichment.php | Modify (add website config) |

Acceptance Criteria

  • canHandleUrl() matches any valid HTTP/HTTPS URL (fallback provider)
  • Extracts name, description, logo from OG tags
  • Extracts structured data from JSON-LD when available
  • Discovers email/phone from visible content
  • Discovers social media links from page anchors
  • Each field includes confidence score and source evidence
  • Respects robots.txt and 10s timeout
  • Returns graceful error on unreachable/blocked sites
  • Unit tests with fixture HTML files
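The link-based discovery criteria above (emails, phones, social links, each with confidence and source evidence) could look roughly like the following Python sketch. The domain lists, including `fb.com`, `youtu.be`, and `x.com`, are assumptions, as are the function name and result shape; the 0.8 confidence values come from the extracted-fields table.

```python
import re
from urllib.parse import unquote, urlparse

# Assumed platform domains; the ticket only names fb/ig/yt/tw/li.
SOCIAL_DOMAINS = {
    "facebook": ("facebook.com", "fb.com"),
    "instagram": ("instagram.com",),
    "youtube": ("youtube.com", "youtu.be"),
    "twitter": ("twitter.com", "x.com"),
    "linkedin": ("linkedin.com",),
}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def _evidence(value, source):
    """Wrap a value with its source and the 0.8 link-level confidence."""
    return {"value": value, "source": source, "confidence": 0.8}


def classify_links(hrefs):
    """Sort raw <a href> values into emails, phones, and social profile links."""
    found = {"emails": [], "phones": [], "social": {}}
    for href in hrefs:
        href = href.strip()
        lower = href.lower()
        if lower.startswith("mailto:"):
            # Drop any ?subject=... query and decode percent-escapes.
            address = unquote(href[len("mailto:"):].split("?", 1)[0])
            if EMAIL_RE.fullmatch(address):
                found["emails"].append(_evidence(address, "mailto"))
        elif lower.startswith("tel:"):
            found["phones"].append(_evidence(href[len("tel:"):], "tel"))
        else:
            try:
                host = (urlparse(href).hostname or "").lower()
            except ValueError:
                continue  # unparseable href, ignore it
            for platform, domains in SOCIAL_DOMAINS.items():
                if any(host == d or host.endswith("." + d) for d in domains):
                    # Keep the first link seen per platform.
                    found["social"].setdefault(platform, _evidence(href, "anchor"))
    return found
```

Because the function takes a plain list of hrefs, a unit test can feed it values pulled from a fixture HTML file without touching the network.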
