piedomains package

Submodules

piedomains.api module

Modern, intuitive API for piedomains domain classification.

This module provides a clean, class-based interface for domain content classification with support for text analysis, image analysis, and historical archive.org snapshots.

class piedomains.api.DomainClassifier(cache_dir=None)[source]

Bases: object

Main interface for domain content classification.

Supports multiple classification approaches:
  • Traditional ML: text-based, image-based, and combined classification
  • Modern AI: LLM-based classification with multimodal support
  • Historical analysis via archive.org snapshots

Example (Traditional ML):
>>> classifier = DomainClassifier()
>>> results = classifier.classify(["google.com", "facebook.com"])
>>> for result in results:
...     print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")
google.com: search (0.892)
facebook.com: socialnet (0.967)

Example (Historical analysis):
>>> results = classifier.classify(["google.com"], archive_date="20200101")
>>> print(f"Archive: {results[0]['category']} from {results[0]['date_time_collected']}")

Example (LLM-based):
>>> classifier = DomainClassifier()
>>> classifier.configure_llm(
...     provider="openai",
...     model="gpt-4o",
...     api_key="sk-...",
...     categories=["news", "shopping", "social", "tech"]
... )
>>> results = classifier.classify_by_llm(["cnn.com"])
>>> print(f"LLM: {results[0]['category']} - {results[0]['reason']}")
Example (Separated workflow):
>>> classifier = DomainClassifier()
>>> collection = classifier.collect_content(["example.com"])
>>> text_results = classifier.classify_from_collection(collection, method="text")
>>> image_results = classifier.classify_from_collection(collection, method="images")
>>> # Same collected content, different classification approaches
JSON Output Schema:

All classification methods return List[Dict] with consistent structure:

Collection Data Schema (from collect_content):

{
    "collection_id": str,               # Unique identifier for collection
    "timestamp": str,                   # ISO 8601 collection timestamp
    "config": {
        "cache_dir": str,               # Cache directory path
        "archive_date": str,            # Archive.org date (YYYYMMDD) or null
        "fetcher_type": str,            # "live" or "archive"
        "max_parallel": int             # Parallel fetch limit
    },
    "domains": [                        # List of domain results
        {
            "url": str,                 # Original input URL/domain
            "domain": str,              # Parsed domain name
            "text_path": str,           # Path to HTML file (relative to cache_dir)
            "image_path": str,          # Path to screenshot (relative to cache_dir)
            "date_time_collected": str, # ISO 8601 timestamp
            "fetch_success": bool,      # Whether data collection succeeded
            "cached": bool,             # Whether data was retrieved from cache
            "error": str,               # Error message if fetch_success is false
            "title": str,               # Page title (optional)
            "meta_description": str     # Meta description (optional)
        }
    ],
    "summary": {
        "total_domains": int,           # Total domains requested
        "successful": int,              # Successfully collected
        "failed": int                   # Failed collections
    }
}

Classification Result Schema (from classify methods):

[
    {
        "url": str,                     # Original input URL/domain
        "domain": str,                  # Parsed domain name
        "text_path": str,               # Path to HTML file
        "image_path": str,              # Path to screenshot
        "date_time_collected": str,     # ISO 8601 timestamp
        "model_used": str,              # Model identifier (e.g. "text/shallalist_ml")
        "category": str,                # Predicted category
        "confidence": float,            # Confidence score (0.0-1.0)
        "reason": str,                  # LLM reasoning (null for ML models)
        "error": str,                   # Error message if classification failed
        "raw_predictions": dict,        # Full probability distribution

        # Combined classification specific fields:
        "text_category": str,           # Text-only prediction
        "text_confidence": float,       # Text confidence
        "image_category": str,          # Image-only prediction
        "image_confidence": float       # Image confidence
    }
]
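
Example (working with results; a minimal sketch that relies only on the fields documented in the schema above, with placeholder domains):

>>> results = classifier.classify(["cnn.com", "bad-domain.invalid"])
>>> for result in results:
...     if result.get("error"):
...         print(f"{result['domain']}: failed ({result['error']})")
...     else:
...         print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")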

Supported Categories: adv, aggressive, alcohol, anonvpn, automobile, chatphisher, cooking, dating, downloads, drugs, education, finance, forum, gamble, government, hacking, health, hobby, homehealth, imagehosting, jobsearch, lingerie, music, news, occult, onlinemarketing, politics, porn, publicite, radiotv, recreation, religion, remotecontrol, shopping, socialnet, spyware, updatesites, urlshortener, violence, warez, weapons, webmail, webphone, webradio, webtv

__init__(cache_dir=None)[source]

Initialize domain classifier.

Parameters:

cache_dir (str, optional) – Directory for caching downloaded content. Defaults to “cache” in current directory.

classify(domains, archive_date=None, use_cache=True, latest=False)[source]

Classify domains using combined text and image analysis.

This is the most comprehensive classification method, using both textual content and homepage screenshots for maximum accuracy.

Parameters:
  • domains (list[str]) – List of domain names or URLs to classify, e.g., ["google.com", "https://facebook.com/page"]

  • archive_date (str or datetime, optional) – For historical analysis. Format: “YYYYMMDD” or datetime object

  • use_cache (bool) – Whether to reuse cached content (default: True)

  • latest (bool) – Whether to download latest model versions (default: False)

Returns:

Classification results in JSON format with fields:
  • url: Original URL/domain input

  • domain: Parsed domain name

  • text_path: Path to collected HTML file

  • image_path: Path to collected screenshot

  • date_time_collected: When data was collected (ISO format)

  • model_used: “combined/text_image_ml”

  • category: Best prediction (ensemble of text + image)

  • confidence: Confidence score (0-1)

  • reason: None (reasoning field for LLM models)

  • error: Error message if classification failed

  • text_category: Text-only prediction

  • text_confidence: Text confidence

  • image_category: Image-only prediction

  • image_confidence: Image confidence

  • raw_predictions: Full probability distributions

Return type:

list[dict]

Raises:

ValueError – If domains list is empty

Example

>>> classifier = DomainClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} ({results[0]['confidence']:.3f})")
cnn.com: news (0.876)
classify_by_text(domains, archive_date=None, use_cache=True, latest=False)[source]

Classify domains using only text content analysis.

Faster than combined analysis, good for batch processing or when screenshots are not needed.

Parameters:
  • domains (list[str]) – List of domain names or URLs to classify

  • archive_date (str or datetime, optional) – For historical analysis

  • use_cache (bool) – Whether to reuse cached content (default: True)

  • latest (bool) – Whether to download latest model versions (default: False)

Returns:

Text classification results in JSON format with fields:
  • url: Original URL/domain input

  • domain: Parsed domain name

  • text_path: Path to collected HTML file

  • image_path: Path to collected screenshot (may be None)

  • date_time_collected: When data was collected (ISO format)

  • model_used: “text/shallalist_ml”

  • category: Text classification prediction

  • confidence: Text confidence score (0-1)

  • reason: None (reasoning field for LLM models)

  • error: Error message if classification failed

  • raw_predictions: Full text probability distribution

Return type:

list[dict]

Example

>>> classifier = DomainClassifier()
>>> results = classifier.classify_by_text(["wikipedia.org"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} ({results[0]['confidence']:.3f})")
wikipedia.org: education (0.823)
classify_by_images(domains, archive_date=None, use_cache=True, latest=False)[source]

Classify domains using only homepage screenshot analysis.

Good for visual content classification, especially when text content is minimal or misleading.

Parameters:
  • domains (list[str]) – List of domain names or URLs to classify

  • archive_date (str or datetime, optional) – For historical analysis

  • use_cache (bool) – Whether to reuse cached content (default: True)

  • latest (bool) – Whether to download latest model versions (default: False)

Returns:

Image classification results in JSON format with fields:
  • url: Original URL/domain input

  • domain: Parsed domain name

  • text_path: Path to collected HTML file (may be None)

  • image_path: Path to collected screenshot

  • date_time_collected: When data was collected (ISO format)

  • model_used: “image/shallalist_ml”

  • category: Image classification prediction

  • confidence: Image confidence score (0-1)

  • reason: None (reasoning field for LLM models)

  • error: Error message if classification failed

  • raw_predictions: Full image probability distribution

Return type:

list[dict]

Example

>>> classifier = DomainClassifier()
>>> results = classifier.classify_by_images(["instagram.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} ({results[0]['confidence']:.3f})")
instagram.com: socialnet (0.912)
configure_llm(provider, model, api_key=None, categories=None, **kwargs)[source]

Configure LLM for AI-powered domain classification.

Parameters:
  • provider (str) – LLM provider (‘openai’, ‘anthropic’, ‘google’, etc.)

  • model (str) – Model name (‘gpt-4o’, ‘claude-3-5-sonnet-20241022’, ‘gemini-1.5-pro’)

  • api_key (str | None) – API key for the provider (or set via environment variable)

  • categories (list[str] | None) – Custom classification categories

  • **kwargs – Additional LLMConfig parameters (temperature, max_tokens, etc.)

Return type:

None

Example

>>> classifier = DomainClassifier()
>>> classifier.configure_llm(
...     provider="openai",
...     model="gpt-4o",
...     api_key="sk-...",
...     categories=["news", "shopping", "social", "tech"]
... )
classify_by_llm(domains, custom_instructions=None, use_cache=True, mode='text')[source]

Classify domains using LLM analysis.

Parameters:
  • domains (list[str]) – List of domain names to classify

  • custom_instructions (str | None) – Optional custom classification instructions

  • use_cache (bool) – Whether to use cached content (default: True)

  • mode (str) – LLM mode - “text”, “image”, or “multimodal” (default: “text”)

Returns:

LLM classification results in JSON format with fields:
  • url: Original URL/domain input

  • domain: Parsed domain name

  • text_path: Path to collected HTML file

  • image_path: Path to collected screenshot (if applicable)

  • date_time_collected: When data was collected (ISO format)

  • model_used: “text/llm_{provider}_{model}” or similar

  • category: LLM classification prediction

  • confidence: LLM confidence score (0-1)

  • reason: LLM reasoning explanation

  • error: Error message if classification failed

Return type:

list[dict]

Raises:

RuntimeError – If LLM not configured

Example

>>> classifier = DomainClassifier()
>>> classifier.configure_llm("openai", "gpt-4o", api_key="sk-...")
>>> results = classifier.classify_by_llm(["cnn.com", "amazon.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} - {results[0]['reason']}")
cnn.com: news - This domain contains current events and journalism content
classify_by_llm_multimodal(domains, custom_instructions=None, use_cache=True)[source]

Classify domains using LLM multimodal analysis (text + screenshots).

Parameters:
  • domains (list[str]) – List of domain names to classify

  • custom_instructions (str | None) – Optional custom classification instructions

  • use_cache (bool) – Whether to use cached content (default: True)

Returns:

Multimodal LLM classification results in JSON format

Return type:

list[dict]

Raises:

RuntimeError – If LLM not configured

Example

>>> classifier = DomainClassifier()
>>> classifier.configure_llm("openai", "gpt-4o", api_key="sk-...")
>>> results = classifier.classify_by_llm_multimodal(["cnn.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} - {results[0]['reason']}")
cnn.com: news - Based on text content and visual layout typical of news websites
get_llm_usage_stats()[source]

Get LLM usage statistics and cost tracking.

Return type:

dict | None

Returns:

Dictionary with usage stats or None if LLM not configured

Example

>>> classifier = DomainClassifier()
>>> classifier.configure_llm("openai", "gpt-4o")
>>> classifier.classify_by_llm(["example.com"])
>>> stats = classifier.get_llm_usage_stats()
>>> print(f"Cost: ${stats['estimated_cost_usd']:.4f}")
collect_content(domains, archive_date=None, collection_id=None, use_cache=True, batch_size=10)[source]

Collect website content for domains without performing inference.

Separates content collection from classification, enabling:
  • Content reuse across multiple models
  • Clear data lineage and inspection
  • Reproducible analysis workflows

Parameters:
  • domains (list[str]) – List of domain names or URLs to collect content for

  • archive_date (str or datetime, optional) – For historical analysis

  • collection_id (str, optional) – Identifier for this collection

  • use_cache (bool) – Whether to use cached content when available

  • batch_size (int) – Number of domains to process in parallel

Returns:

Collection metadata with file paths for downstream inference

Return type:

dict

Example

>>> classifier = DomainClassifier()
>>> collection = classifier.collect_content(["cnn.com", "bbc.com"])
>>> print(collection["domains"][0]["text_path"])
html/cnn.com.html
classify_from_collection(collection_data, method='combined', output_file=None, latest=False)[source]

Perform inference on previously collected content.

Parameters:
  • collection_data (dict) – Collection metadata from collect_content()

  • method (str) – Classification method - “text”, “images”, “combined”, or “llm”

  • output_file (str, optional) – Path to save JSON results

  • latest (bool) – Whether to use latest model versions (default: False)

Returns:

Classification results in JSON format

Return type:

list[dict]

Example

>>> classifier = DomainClassifier()
>>> collection = classifier.collect_content(["cnn.com"])
>>> results = classifier.classify_from_collection(collection, method="text")
>>> print(results[0]["category"])
news
piedomains.api.classify_domains(domains, method='combined', archive_date=None, cache_dir=None)[source]

Quick domain classification function.

Parameters:
  • domains (list[str]) – List of domain names or URLs to classify

  • method (str) – Classification method - “combined”, “text”, or “images”

  • archive_date (str | datetime | None) – Optional historical date for archive.org analysis

  • cache_dir (str | None) – Optional cache directory override

Returns:

Classification results in JSON format

Return type:

list[dict]

Example

>>> results = classify_domains(["cnn.com", "github.com"])
>>> for result in results:
...     print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")
cnn.com: news (0.876)
github.com: computers (0.892)

piedomains.archive_org_downloader module

Archive.org content retrieval and historical data access utilities.

This module provides functionality for accessing historical snapshots of web content through the Internet Archive’s Wayback Machine. It includes utilities for querying available snapshots within date ranges and downloading content from archived pages.

The module supports the piedomains package’s historical domain analysis capabilities by providing structured access to archived web content for training and classification on historical data.

Example

Basic usage for getting archived content:
>>> from piedomains.archive_org_downloader import get_urls_year, download_from_archive_org
>>> urls = get_urls_year("example.com", year=2020)
>>> if urls:
...     content = download_from_archive_org(urls[0])
...     print(f"Retrieved {len(content)} characters of historical content")
piedomains.archive_org_downloader.get_urls_year(domain, year=2014, status_code=200, limit=None)[source]

Retrieve all archived URLs for a domain within a specific year.

This function queries the Internet Archive’s CDX API to find all available snapshots of a domain within the specified year that returned the given HTTP status code.

Parameters:
  • domain (str) – Domain name to search for (e.g., “example.com” or “https://example.com”). The function will handle both domain names and full URLs.

  • year (int) – Year to search within (e.g., 2020). Defaults to 2014. Must be between 1996 (first archive) and current year.

  • status_code (int) – HTTP status code to filter by. Only snapshots that returned this status code will be included. Defaults to 200.

  • limit (Optional[int]) – Maximum number of URLs to return. If None, returns all available URLs. Useful for large domains with many snapshots.

Returns:

List of complete Wayback Machine URLs for accessing archived content.

Each URL is formatted as ‘https://web.archive.org/web/{timestamp}/{original_url}’. Returns empty list if no snapshots are found or if an error occurs.

Return type:

List[str]

Raises:
  • ValueError – If year is outside valid range or domain is invalid.

  • requests.RequestException – If API request fails (logged but not raised).

Example

>>> # Get all snapshots for 2020
>>> urls = get_urls_year("cnn.com", year=2020)
>>> print(f"Found {len(urls)} snapshots")
>>> # Get limited snapshots with error handling
>>> urls = get_urls_year("example.com", year=2019, limit=10)
>>> if urls:
...     print(f"First snapshot: {urls[0]}")
... else:
...     print("No snapshots found")

Note

  • The function searches for snapshots with successful HTTP responses (200 by default)

  • Results are ordered chronologically by snapshot timestamp

  • Large popular domains may have thousands of snapshots per year

  • Use the limit parameter to avoid excessive API calls

piedomains.archive_org_downloader.download_from_archive_org(url, timeout=30, clean_content=True)[source]

Download and extract text content from an archived webpage.

This function retrieves content from a Wayback Machine URL and extracts the visible text content, optionally cleaning archive-specific elements.

Parameters:
  • url (str) – Complete Wayback Machine URL (from get_urls_year or similar). Should be in format ‘https://web.archive.org/web/{timestamp}/{original_url}’.

  • timeout (int) – HTTP request timeout in seconds. Defaults to 30 seconds as archived pages can be slow to load.

  • clean_content (bool) – If True, removes archive.org specific navigation and metadata elements. Defaults to True for cleaner content.

Returns:

Extracted text content from the archived page. Returns empty string if download fails or no content is found.

Return type:

str

Raises:
  • ValueError – If URL is not a valid Wayback Machine URL.

  • requests.RequestException – If HTTP request fails (logged but not raised).

Example

>>> # Download content from archived page
>>> wayback_url = "https://web.archive.org/web/20200101120000/https://example.com"
>>> content = download_from_archive_org(wayback_url)
>>> print(f"Retrieved {len(content)} characters")
>>> # Download with custom timeout and no cleaning
>>> raw_content = download_from_archive_org(
...     wayback_url,
...     timeout=60,
...     clean_content=False
... )

Note

  • Only works with archive.org URLs, not live web pages

  • Extracted text includes all visible page content

  • Archive pages may load slowly due to Internet Archive infrastructure

  • Some archived pages may be incomplete or corrupted

piedomains.archive_org_downloader.get_closest_snapshot(domain, target_date, status_code=200)[source]

Find the archived snapshot closest to a specific target date.

This function uses the Wayback Machine availability API to find the snapshot that was captured closest in time to the specified target date.

Parameters:
  • domain (str) – Domain name to search for (e.g., “example.com”).

  • target_date (Union[str, datetime]) – Target date as ‘YYYYMMDD’ string or datetime object.

  • status_code (int) – HTTP status code to filter by. Defaults to 200.

Returns:

Wayback Machine URL of the closest snapshot if found, None if no snapshots are available near the target date.

Return type:

Optional[str]

Raises:

ValueError – If target_date format is invalid or domain is invalid.

Example

>>> from datetime import datetime
>>>
>>> # Using string date
>>> url = get_closest_snapshot("cnn.com", "20200315")
>>> if url:
...     content = download_from_archive_org(url)
>>> # Using datetime object
>>> target = datetime(2019, 6, 15)
>>> url = get_closest_snapshot("example.com", target)

Note

  • Returns the snapshot with timestamp closest to target_date

  • Preference is given to snapshots after the target date if available

  • Uses the Wayback Machine availability API for efficient lookup

piedomains.base module

Base class infrastructure for model management and data loading.

This module provides the foundational base class for all machine learning models in the piedomains package. It handles model file management, automatic downloading from remote repositories, and local caching for improved performance.

The Base class serves as the foundation for all classifier implementations, providing standardized model loading, caching, and resource management capabilities.

class piedomains.base.Base[source]

Bases: object

Base class for all machine learning model implementations in piedomains.

This class provides standardized functionality for model data management, including automatic downloading, caching, and loading of model files from remote repositories. All classifier classes should inherit from this base class to ensure consistent behavior.

Class Attributes:

MODELFN (str | None): Relative path to the model directory within the package. Must be set by subclasses to specify their model location.

Example

Creating a custom classifier that inherits from Base:

>>> class MyClassifier(Base):
...     MODELFN = "model/my_classifier"
...
...     def __init__(self):
...         self.model_path = self.load_model_data("my_model.zip")

Note

Subclasses must define the MODELFN class attribute to specify the model directory path. The load_model_data method will create this directory and download model files as needed.

MODELFN: str | None = None
classmethod load_model_data(file_name, latest=False)[source]

Load model data from local cache or download from remote repository.

This method handles the complete lifecycle of model data:
1. Checks if the local model directory exists, creating it if needed
2. Verifies whether model files exist locally or an update is requested
3. Downloads model data from the remote repository if necessary
4. Returns the local path to the model data for loading

Parameters:
  • file_name (str) – Name of the model data file to download (e.g., “model.zip”). This should be the filename as it exists in the remote repository.

  • latest (bool) – If True, forces download of latest model data even if local files exist. Useful for model updates. Defaults to False.

Returns:

Absolute path to the local model directory containing the downloaded and extracted model files. Returns empty string if MODELFN is not set or if download fails.

Return type:

str

Raises:
  • OSError – If model directory cannot be created due to permission issues.

  • ConnectionError – If model download fails due to network issues.

Example

>>> class TextClassifier(Base):
...     MODELFN = "model/text_classifier"
...
>>> classifier = TextClassifier()
>>> model_path = classifier.load_model_data("text_model.zip")
>>> print(f"Model loaded from: {model_path}")
Model loaded from: /path/to/piedomains/model/text_classifier
>>> # Force download of latest model
>>> latest_path = classifier.load_model_data("text_model.zip", latest=True)

Note

  • This method only downloads if the saved_model directory doesn’t exist or if latest=True is specified

  • Model files are cached locally to avoid repeated downloads

  • Downloads happen only on first use or when explicitly requested

classmethod get_model_info()[source]

Get information about the model configuration and paths.

Returns:

Dictionary containing model information:
  • ’class_name’: Name of the classifier class

  • ’model_fn’: Model directory path (MODELFN)

  • ’model_path’: Absolute path to model directory if it exists

  • ’has_saved_model’: Whether saved_model directory exists

Return type:

dict[str, str]

Example

>>> info = TextClassifier.get_model_info()
>>> print(info)
{
    'class_name': 'TextClassifier',
    'model_fn': 'model/text_classifier',
    'model_path': '/path/to/piedomains/model/text_classifier',
    'has_saved_model': True
}
classmethod __init_subclass__(**kwargs)[source]

Validate subclass configuration when class is defined.

Raises:

ValueError – If MODELFN is not properly set by subclass.

piedomains.config module

Configuration management for piedomains.

class piedomains.config.Config(config_dict=None)[source]

Bases: object

Configuration class for piedomains settings.

DEFAULT_CONFIG = {'allowed_content_types': ['text/html', 'application/xhtml+xml', 'application/xml', 'text/xml', 'text/plain'], 'archive_429_wait_time': 60, 'archive_cdx_rate_limit': 1.0, 'archive_max_parallel': 2, 'archive_page_delay': 0.5, 'archive_retry_on_429': True, 'batch_size': 50, 'block_media': True, 'block_resources': ['media', 'video', 'font', 'websocket', 'manifest'], 'blocked_extensions': ['.exe', '.msi', '.scr', '.bat', '.cmd', '.com', '.pif', '.vbs', '.jar', '.app', '.dmg', '.pkg', '.deb', '.rpm', '.run', '.bin', '.elf', '.so', '.dll', '.dylib'], 'content_length_limits': {'application/pdf': 52428800, 'default': 10485760, 'text/html': 5242880}, 'content_safety_mode': 'moderate', 'enable_content_validation': True, 'html_extension': '.html', 'http_timeout': 10, 'image_extension': '.png', 'image_size': (254, 254), 'log_format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s', 'log_level': 'INFO', 'max_content_length': 10485760, 'max_parallel': 4, 'max_retries': 3, 'model_cache_dir': None, 'parallel_workers': 4, 'playwright_headless': True, 'playwright_timeout': 30000, 'playwright_viewport': {'height': 1024, 'width': 1280}, 'retry_delay': 1, 'sandbox_mode_required': False, 'suspicious_url_patterns': ['.*\\/[^\\/]*\\.(exe|msi|scr|bat|cmd|pif|vbs|jar)(\\?.*)?$', '.*\\.com\\/.*\\.(exe|msi|scr|bat|cmd|pif|vbs|jar)(\\?.*)?$', '.*\\/download\\/.*\\.(zip|rar|7z|tar\\.gz|tgz)(\\?.*)?$', '.*\\/attachment\\/.*', '.*[?&](download|attachment)=.*'], 'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'validate_domain_extensions': False, 'webdriver_timeout': 30, 'webdriver_window_size': '1280,1024'}
__init__(config_dict=None)[source]

Initialize configuration.

Parameters:

config_dict (Dict[str, Any]) – Optional configuration overrides

get(key, default=None)[source]

Get configuration value.

Parameters:
  • key (str) – Configuration key

  • default – Default value if key not found

Returns:

Configuration value

set(key, value)[source]

Set configuration value.

Parameters:
  • key (str) – Configuration key

  • value (any) – Configuration value

update(config_dict)[source]

Update multiple configuration values.

Parameters:

config_dict (dict[str, any]) – Configuration updates

to_dict()[source]

Get configuration as dictionary.

Returns:

Configuration dictionary

Return type:

dict[str, any]
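
Example (an illustrative sketch using only the documented methods; the override keys shown are entries from DEFAULT_CONFIG):

>>> config = Config({"http_timeout": 20})
>>> config.get("http_timeout")
20
>>> config.set("max_retries", 5)
>>> config.update({"batch_size": 25, "parallel_workers": 2})
>>> settings = config.to_dict()
>>> print(settings["batch_size"])
25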

property http_timeout: int

HTTP request timeout in seconds.

property webdriver_timeout: int

WebDriver timeout in seconds.

property page_load_timeout: int

Page load timeout in seconds.

property max_retries: int

Maximum number of retries for failed operations.

property retry_delay: float

Delay between retries in seconds.

property screenshot_wait_time: int

Wait time after loading page before screenshot.

property webdriver_window_size: str

WebDriver window size.

property batch_size: int

Batch size for processing domains.

property parallel_workers: int

Number of parallel workers.

property user_agent: str

User agent string for HTTP requests.

property image_size: tuple

Image size for model input.

property enable_content_validation: bool

Whether content validation is enabled.

property content_safety_mode: str

Content safety mode: strict, moderate, or permissive.

property max_content_length: int

Maximum content length to download.

property sandbox_mode_required: bool

Whether sandbox mode is required for risky content.

property allowed_content_types: list

List of allowed MIME types.

property blocked_extensions: list

List of blocked file extensions.

property suspicious_url_patterns: list

List of regex patterns for suspicious URLs.

property content_length_limits: dict

Content length limits by content type.

piedomains.config.get_config()[source]

Get global configuration instance.

Returns:

Global configuration instance

Return type:

Config

piedomains.config.set_config(config)[source]

Set global configuration instance.

Parameters:

config (Config) – Configuration instance to set as global

piedomains.config.configure(**kwargs)[source]

Configure global settings.

Parameters:

**kwargs – Configuration key-value pairs
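
Example (a minimal sketch; the keys shown correspond to entries in Config.DEFAULT_CONFIG):

>>> configure(http_timeout=20, max_parallel=2)
>>> get_config().http_timeout
20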

piedomains.constants module

Constants and classification categories for piedomains package.

This module defines the core classification categories used by piedomains for domain content classification, as well as filtering constants for text processing.

The categories are based on the Shallalist categorization system, a comprehensive classification scheme originally developed for web filtering and content analysis. These categories cover the major types of web content found across the internet.

Example

Accessing classification categories:
>>> from piedomains.constants import classes, most_common_words
>>> print(f"Available categories: {len(classes)}")
>>> print(f"Example categories: {classes[:5]}")
>>> print(f"Common words to filter: {most_common_words[:5]}")
piedomains.constants.classes: list[str] = ['adv', 'alcohol', 'automobile', 'dating', 'downloads', 'drugs', 'education', 'finance', 'fortunetelling', 'forum', 'gamble', 'government', 'hobby', 'hospitals', 'imagehosting', 'isp', 'jobsearch', 'models', 'movies', 'music', 'news', 'politics', 'porn', 'radiotv', 'recreation', 'redirector', 'religion', 'science', 'searchengines', 'sex', 'shopping', 'socialnet', 'spyware', 'tracker', 'urlshortener', 'warez', 'weapons', 'webmail', 'webradio']

Complete list of website classification categories.

This list contains 41 categories used for domain content classification. Categories are based on the Shallalist system and cover major website types including commerce, media, government, adult content, and technology services.

The categories are used by both traditional ML models and LLM-based classification to provide consistent categorization across different classification methods.

Type:

List[str]

piedomains.constants.most_common_words: list[str] = ['home', 'contact', 'us', 'new', 'news', 'site', 'privacy', 'search', 'help', 'copyright', 'free', 'service', 'en', 'get', 'one', 'find', 'menu', 'account', 'next']

Common words to filter out during text preprocessing.

These words are extremely common across all website types and provide little discriminative value for classification. They are filtered out during text processing to focus on more meaningful content words.

This list includes:
  • Navigation elements (home, menu, next)
  • Generic marketing terms (free, new, get)
  • Common website sections (contact, help, privacy)
  • Linguistic articles and connectors (us, one, en)

Used by text preprocessing functions to clean content before model input.

Type:

List[str]

piedomains.constants.get_valid_categories()[source]

Get a copy of all valid classification categories.

Returns:

Complete list of valid category names for classification.

Return type:

List[str]

Example

>>> categories = get_valid_categories()
>>> if "news" in categories:
...     print("News category is available")
piedomains.constants.is_valid_category(category)[source]

Check if a category name is valid for classification.

Parameters:

category (str) – Category name to validate.

Returns:

True if category is valid, False otherwise.

Return type:

bool

Example

>>> is_valid_category("news")
True
>>> is_valid_category("invalid_category")
False
piedomains.constants.get_category_count()[source]

Get the total number of available classification categories.

Returns:

Total number of classification categories.

Return type:

int

Example

>>> count = get_category_count()
>>> print(f"Total categories available: {count}")
Total categories available: 41

piedomains.context_managers module

Context managers for resource cleanup and management.

piedomains.context_managers.webdriver_context()[source]

DEPRECATED: Use PlaywrightFetcher context manager instead.

This function is maintained for backward compatibility.

piedomains.context_managers.playwright_context()[source]

Context manager for PlaywrightFetcher instances.

Yields:

PlaywrightFetcher – Playwright fetcher instance

Ensures proper cleanup of Playwright resources.

Return type:

Generator[PlaywrightFetcher, None, None]

piedomains.context_managers.temporary_directory(suffix='', prefix='piedomains_')[source]

Context manager for temporary directories.

Parameters:
  • suffix (str) – Directory name suffix

  • prefix (str) – Directory name prefix

Yields:

str – Path to temporary directory

Return type:

Generator[str, None, None]

Ensures cleanup of temporary directories.
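
Example (a minimal usage sketch based on the documented signature; the scratch file name is a placeholder):

>>> import os
>>> with temporary_directory(prefix="piedomains_") as tmp_dir:
...     scratch = os.path.join(tmp_dir, "scratch.html")  # files created here are removed on exit
>>> # tmp_dir no longer exists after the with-block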

piedomains.context_managers.file_cleanup(*file_paths)[source]

Context manager for file cleanup.

Parameters:

*file_paths (str) – Paths to files that should be cleaned up

Return type:

Generator[None, None, None]

Ensures cleanup of specified files after context exits.

piedomains.context_managers.error_recovery(operation_name, fallback_value=None, reraise=False)[source]

Context manager for error recovery with logging.

Parameters:
  • operation_name (str) – Name of the operation for logging

  • fallback_value – Value to return on error (if not reraising)

  • reraise (bool) – Whether to reraise exceptions

Yields:

dict – Dictionary with ‘success’, ‘error’, ‘result’ keys

Return type:

Generator

piedomains.context_managers.batch_progress_tracking(total_items, operation_name='Processing')[source]

Context manager for tracking batch processing progress.

Parameters:
  • total_items (int) – Total number of items to process

  • operation_name (str) – Name of the operation

Yields:

callable – Function to update progress

Return type:

Generator

class piedomains.context_managers.ResourceManager[source]

Bases: object

Resource manager for tracking and cleaning up resources.

__init__()[source]
add_driver(driver)[source]

Add a WebDriver/fetcher instance for cleanup (deprecated).

add_fetcher(fetcher)[source]

Add a PlaywrightFetcher instance for cleanup.

add_temp_directory(path)[source]

Add a temporary directory for cleanup.

add_temp_file(path)[source]

Add a temporary file for cleanup.

cleanup_all()[source]

Clean up all tracked resources.
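
Example (an illustrative sketch using only the documented methods; the paths are hypothetical):

>>> manager = ResourceManager()
>>> manager.add_temp_file("/tmp/piedomains_example.png")     # hypothetical path
>>> manager.add_temp_directory("/tmp/piedomains_example")    # hypothetical path
>>> manager.cleanup_all()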

piedomains.fetchers module

Playwright-based page fetcher for content extraction. Supports live content fetching and archive.org historical snapshots. Unified pipeline for HTML, text extraction, and screenshots.

class piedomains.fetchers.FetchResult(url, success, html='', text='', screenshot_path='', title='', meta_description='', error='')[source]

Bases: object

Result from a single fetch operation.

url: str
success: bool
html: str = ''
text: str = ''
screenshot_path: str = ''
title: str = ''
meta_description: str = ''
error: str = ''
__init__(url, success, html='', text='', screenshot_path='', title='', meta_description='', error='')
class piedomains.fetchers.BaseFetcher[source]

Bases: object

Base class for content fetchers with security validation.

__init__()[source]

Initialize fetcher with content validator.

class piedomains.fetchers.PlaywrightFetcher(max_parallel=4)[source]

Bases: BaseFetcher

Unified Playwright fetcher for all content extraction.

__init__(max_parallel=4)[source]

Initialize Playwright fetcher.

Parameters:

max_parallel (int) – Maximum number of parallel browser contexts

async fetch_single(url, screenshot_path=None)[source]

Fetch content from a single URL.

Return type:

FetchResult

async fetch_batch(urls, cache_dir='cache')[source]

Fetch multiple URLs in parallel.

Return type:

list[FetchResult]

fetch_html(url, **kwargs)[source]

Sync wrapper for HTML fetching.

Return type:

tuple[bool, str, str]

fetch_content(url, **kwargs)[source]

Sync wrapper for content fetching (alias for fetch_single).

Return type:

FetchResult

fetch_screenshot(url, output_path, **kwargs)[source]

Sync wrapper for screenshot.

Return type:

tuple[bool, str]

fetch_both(url, output_path, **kwargs)[source]

Sync wrapper for both HTML and screenshot.

Return type:

FetchResult

class piedomains.fetchers.ArchiveFetcher(target_date, max_parallel=None)[source]

Bases: BaseFetcher

Fetcher for archive.org historical snapshots using Playwright.

__init__(target_date, max_parallel=None)[source]

Initialize archive fetcher.

Parameters:
  • target_date (str | datetime) – Target date as ‘YYYYMMDD’ string or datetime object

  • max_parallel (int) – Maximum number of parallel browser contexts (default: 2 for archive.org)

async fetch_single(url, screenshot_path=None)[source]

Fetch content from archive.org snapshot.

Return type:

FetchResult

async fetch_batch(urls, cache_dir='cache')[source]

Fetch multiple URLs from archive.org in parallel with rate limiting.

Return type:

list[FetchResult]

fetch_html(url, **kwargs)[source]

Sync wrapper for HTML fetching from archive.

Return type:

tuple[bool, str, str]

fetch_screenshot(url, output_path, **kwargs)[source]

Sync wrapper for screenshot from archive.

Return type:

tuple[bool, str]

piedomains.fetchers.get_fetcher(archive_date=None, max_parallel=4)[source]

Factory function to get appropriate fetcher.

Parameters:
  • archive_date (str | datetime | None) – If provided, returns ArchiveFetcher for this date. If None, returns PlaywrightFetcher for current content.

  • max_parallel (int) – Maximum number of parallel browser contexts

Returns:

Appropriate fetcher instance

Return type:

BaseFetcher
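
Example (a minimal sketch; fetch_content and the FetchResult fields are documented above):

>>> fetcher = get_fetcher()                                   # live content via PlaywrightFetcher
>>> result = fetcher.fetch_content("https://example.com")
>>> print(result.success, result.title)
>>> archive_fetcher = get_fetcher(archive_date="20200101")    # historical snapshots via ArchiveFetcher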

piedomains.http_client module

HTTP client with connection pooling and session management for improved performance.

class piedomains.http_client.PooledHTTPClient[source]

Bases: object

HTTP client with connection pooling and session reuse.

__init__()[source]
property session: Session

Get or create HTTP session with connection pooling.

get(url, timeout=None, **kwargs)[source]

Perform HTTP GET with retry logic and connection pooling.

Parameters:
  • url (str) – URL to fetch

  • timeout (float) – Request timeout (uses config default if None)

  • **kwargs – Additional arguments passed to requests.get

Returns:

HTTP response

Return type:

requests.Response

Raises:

requests.exceptions.RequestException – On final failure after retries

close()[source]

Close the HTTP session.

piedomains.http_client.http_client()[source]

Context manager for getting a pooled HTTP client.

Yields:

PooledHTTPClient – HTTP client with connection pooling
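
Example (a minimal sketch based on the documented get() signature):

>>> with http_client() as client:
...     response = client.get("https://example.com", timeout=10)
...     print(response.status_code)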

piedomains.http_client.get_http_client()[source]

Get the global HTTP client instance.

Returns:

Global HTTP client with connection pooling

Return type:

PooledHTTPClient

piedomains.http_client.close_global_client()[source]

Close the global HTTP client.

piedomains.logging module

piedomains.piedomain module

class piedomains.piedomain.Piedomain[source]

Bases: Base

MODELFN: str | None = 'model/shallalist'
model_file_name = 'shallalist_v5_model.tar.gz'
weights_loaded = False
img_width = 254
img_height = 254
static parse_url_to_domain(url)[source]

Extract domain name from a URL.

Parameters:

url (str) – Full URL or domain name

Returns:

Domain name extracted from URL

Return type:

str

static validate_url_or_domain(url_or_domain)[source]

Validate if input is a valid URL or domain name.

Parameters:

url_or_domain (str) – URL or domain name to validate

Returns:

True if valid URL or domain, False otherwise

Return type:

bool

static validate_domain_name(domain)[source]

Validate if a domain name is properly formatted.

Parameters:

domain (str) – Domain name to validate

Returns:

True if domain is valid, False otherwise

Return type:

bool

classmethod validate_domains(domains)[source]

Validate a list of domain names and separate valid from invalid.

Parameters:

domains (list) – List of domain names to validate

Returns:

(valid_domains, invalid_domains)

Return type:

tuple
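
Example (a minimal sketch of the documented return tuple; the inputs are placeholders):

>>> valid, invalid = Piedomain.validate_domains(["example.com", "not a domain"])
>>> print(f"{len(valid)} valid, {len(invalid)} invalid")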

classmethod validate_urls_or_domains(urls_or_domains)[source]

Validate a list of URLs or domains and separate valid from invalid.

Parameters:

urls_or_domains (list) – List of URLs or domain names to validate

Returns:

(valid_inputs, invalid_inputs, url_to_domain_map)

Return type:

tuple

classmethod text_from_html(text)[source]

Extract clean text content from HTML.

Parameters:

text (str) – Raw HTML content

Returns:

Cleaned text with unique lowercase words

Return type:

str
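
Example (an illustrative sketch; the return value is a cleaned string of unique lowercase words as described above):

>>> html = "<html><body><h1>Latest News</h1><p>Breaking stories and headlines</p></body></html>"
>>> words = Piedomain.text_from_html(html)
>>> print(words)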

classmethod data_cleanup(s)[source]

Clean and normalize text data for model input.

Parameters:

s (str) – Raw text string

Returns:

Cleaned text with English words only, no stopwords or common terms

Return type:

str

classmethod get_driver()[source]

DEPRECATED: Use PlaywrightFetcher instead. Get configured Chrome WebDriver instance for screenshots.

Returns:

Headless Chrome driver with optimized settings

Return type:

webdriver.Chrome

classmethod save_image(url_or_domain, image_dir)[source]

DEPRECATED: Use PlaywrightFetcher.fetch_screenshot instead. Save screenshot of URL or domain homepage.

Parameters:
  • url_or_domain (str) – URL or domain name to screenshot

  • image_dir (str) – Directory to save screenshot

Returns:

(success, error_message)

Return type:

tuple[bool, str]

classmethod extract_images(input, use_cache, image_dir)[source]

DEPRECATED: Use ContentProcessor.extract_image_content instead. Extract screenshots for domains.

Parameters:
  • input (list) – List of domains

  • use_cache (bool) – Whether to use cached screenshots

  • image_dir (str) – Directory to save screenshots

Returns:

(used_domain_screenshot, screenshot_errors)

Return type:

tuple[list, dict]

classmethod extract_image_tensor(offline, domains, image_dir)[source]

Convert PNG images to TensorFlow tensors for model input.

Parameters:
  • offline (bool) – Whether to process all images in directory

  • domains (list) – List of domain names to process

  • image_dir (str) – Directory containing PNG files

Returns:

Dictionary mapping domain names to image tensors

Return type:

dict

classmethod extract_htmls(urls_or_domains, use_cache, html_path)[source]

Extract HTML content from URLs or domain homepages.

Parameters:
  • urls_or_domains (list) – List of URLs or domain names

  • use_cache (bool) – Whether to use cached HTML files

  • html_path (str) – Directory to save HTML files

Returns:

Dictionary of errors encountered {domain: error_message}

Return type:

dict

classmethod extract_html_text(offline, input, html_path)[source]

Extract and clean text content from HTML files.

Parameters:
  • offline (bool) – Whether to process all HTML files in directory

  • input (list) – List of domain names to process

  • html_path (str) – Directory containing HTML files

Returns:

(domains, content) - lists of domain names and cleaned text

Return type:

tuple

classmethod load_model(model_file_name, latest=False)[source]

Load TensorFlow models and calibrators from local cache or download from server.

Parameters:
  • model_file_name (str) – Name of the model file to load

  • latest (bool) – Whether to download the latest model version

Note

Loads both text and image models plus isotonic regression calibrators. Models are cached locally after first download.
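
Example (a minimal sketch using the class attributes documented above):

>>> Piedomain.load_model(Piedomain.model_file_name)               # load cached models, download on first use
>>> Piedomain.load_model(Piedomain.model_file_name, latest=True)  # force download of the latest version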

classmethod validate_input(input, path, type)[source]

Validate input parameters for prediction functions.

Parameters:
  • input (list) – List of URLs or domain names

  • path (str) – Path to HTML or image files

  • type (str) – Input type - ‘html’ or ‘image’

Returns:

True if operating in offline mode (using local files only)

Return type:

bool

Raises:

Exception – If neither URLs/domains nor valid path provided

piedomains.utils module

Utility functions for file operations, downloads, and security.

This module provides essential utility functions for the piedomains package, including secure file downloads, tar archive extraction with path traversal protection, and configuration management.

The utilities focus on security-first implementation, particularly for handling downloaded archives and preventing common security vulnerabilities like path traversal attacks.

piedomains.utils.REPO_BASE_URL = 'https://dataverse.harvard.edu/api/access/datafile/7081895'

Base URL for model data repository.

Can be overridden via PIEDOMAINS_MODEL_URL environment variable. Defaults to Harvard Dataverse hosting the piedomains model files.

Type:

str

piedomains.utils.download_file(url, target, file_name, timeout=30)[source]

Download and extract a compressed model file from a remote repository.

This function downloads a tar.gz file from the specified URL, saves it to the target directory, extracts it using secure extraction methods, and cleans up the downloaded archive file.

Parameters:
  • url (str) – URL of the remote file to download. Should point to a valid tar.gz archive containing model data.

  • target (str) – Local directory path where the file should be downloaded and extracted. Directory will be created if it doesn’t exist.

  • file_name (str) – Name to use for the downloaded file. Should include appropriate extension (e.g., “model.tar.gz”).

  • timeout (int) – HTTP request timeout in seconds. Defaults to 30 seconds for large model files.

Returns:

True if download and extraction completed successfully, False if any error occurred during the process.

Return type:

bool

Raises:
  • requests.RequestException – If HTTP download fails (not caught, logged only).

  • tarfile.TarError – If tar extraction fails (not caught, logged only).

  • OSError – If file operations fail (not caught, logged only).

Example

>>> success = download_file(
...     url="https://example.com/model.tar.gz",
...     target="/path/to/models",
...     file_name="text_model.tar.gz"
... )
>>> if success:
...     print("Model downloaded and extracted successfully")
Security:
  • Uses safe_extract() to prevent path traversal attacks

  • Validates archive contents before extraction

  • Automatically removes downloaded archive after extraction

  • Logs all errors for security monitoring

Note

The downloaded tar.gz file is automatically deleted after extraction to save disk space. Only the extracted contents remain in the target directory.

piedomains.utils.is_within_directory(directory, target)[source]

Check if a target path is within a specified directory (security check).

This function validates that a file path is contained within a directory to prevent path traversal attacks when extracting archives. It resolves all symbolic links and relative path components before comparison.

Parameters:
  • directory (str) – The base directory path that should contain the target.

  • target (str) – The target file/directory path to validate.

Returns:

True if target is within directory, False if it would escape the directory boundary (indicating a potential path traversal attack).

Return type:

bool

Example

>>> # Safe path
>>> is_within_directory("/safe/dir", "/safe/dir/file.txt")
True
>>> # Path traversal attempt
>>> is_within_directory("/safe/dir", "/safe/dir/../../../etc/passwd")
False
>>> # Another traversal attempt
>>> is_within_directory("/safe/dir", "/safe/dir/subdir/../../../etc/passwd")
False
Security:

This function is critical for preventing path traversal attacks (also known as directory traversal or dot-dot-slash attacks) where malicious archives attempt to extract files outside the intended directory.

Note

This function uses os.path.abspath() to resolve all relative path components and symbolic links before performing the security check.

piedomains.utils.safe_extract(tar, path='.', members=None, *, numeric_owner=False)[source]

Securely extract a tar archive with path traversal protection.

This function provides a secure wrapper around tarfile.extractall() that validates all archive members to prevent path traversal attacks. It checks each member’s path before extraction to ensure it stays within the target directory.

Parameters:
  • tar (tarfile.TarFile) – Open tar file object to extract from.

  • path (str) – Directory path where archive should be extracted. Defaults to current directory (“.”).

  • members (list, optional) – Specific members to extract. If None, extracts all members. Defaults to None.

  • numeric_owner (bool) – If True, preserve numeric user/group IDs. If False, use current user. Defaults to False.

Raises:
  • SecurityError – If any archive member attempts path traversal (would extract outside the target directory).

  • tarfile.TarError – If tar extraction fails for other reasons.

  • OSError – If file system operations fail.

Return type:

None

Example

>>> import tarfile
>>> with tarfile.open("model.tar.gz", "r:gz") as tar:
...     safe_extract(tar, "/safe/extraction/dir")
Security:
  • Validates every archive member before extraction

  • Prevents path traversal attacks (e.g., “../../../etc/passwd”)

  • Logs security violations for monitoring

  • Raises exceptions rather than silently failing

Note

This function should always be used instead of tarfile.extractall() when handling archives from untrusted sources, which includes downloaded model files.

exception piedomains.utils.SecurityError[source]

Bases: Exception

Exception raised for security violations during file operations.

This exception is raised when security checks fail, particularly during archive extraction when path traversal attempts are detected.

Example

>>> try:
...     safe_extract(malicious_tar, "/safe/dir")
... except SecurityError as e:
...     logger.error(f"Security violation: {e}")
piedomains.utils.get_file_hash(file_path, algorithm='sha256')[source]

Calculate cryptographic hash of a file for integrity verification.

Parameters:
  • file_path (str) – Path to the file to hash.

  • algorithm (str) – Hash algorithm to use (‘md5’, ‘sha1’, ‘sha256’, ‘sha512’). Defaults to ‘sha256’ for security.

Returns:

Hexadecimal hash digest of the file.

Return type:

str

Example

>>> hash_value = get_file_hash("model.tar.gz", "sha256")
>>> print(f"File hash: {hash_value}")

Module contents

Piedomains: Domain content classification library.

This module provides lazy imports to avoid dependency issues when optional dependencies (like playwright) are not installed.

piedomains.__getattr__(name)[source]

Lazy import handler for piedomains modules.
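
Example (a minimal sketch, assuming DomainClassifier is among the lazily exposed package-level names; optional dependencies are only imported when the attribute is first accessed):

>>> from piedomains import DomainClassifier  # resolved via __getattr__ on first access
>>> classifier = DomainClassifier()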