piedomains package¶
Subpackages¶
- piedomains classifiers
- Domain Classification Modules
- piedomains.text module
- piedomains.image module
- piedomains.llm module
LLMConfig (attributes: provider, model, api_key, base_url, max_tokens, temperature, categories, cost_limit_usd, usage_tracking; methods: __init__(), __post_init__(), from_dict(), to_litellm_params())
get_classification_prompt(), get_multimodal_prompt(), parse_llm_response()
- piedomains.llm package
- Submodules
- piedomains.llm.config module
LLMConfig (attributes: provider, model, api_key, base_url, max_tokens, temperature, categories, cost_limit_usd, usage_tracking; methods: __init__(), __post_init__(), to_litellm_params(), from_dict())
- piedomains.llm.prompts module
- piedomains.llm.response_parser module
- Module contents
LLMConfig (attributes: provider, model, api_key, base_url, max_tokens, temperature, categories, cost_limit_usd, usage_tracking; methods: __init__(), __post_init__(), from_dict(), to_litellm_params())
get_classification_prompt(), get_multimodal_prompt(), parse_llm_response()
- piedomains.processors package
Submodules¶
piedomains.api module¶
Modern, intuitive API for piedomains domain classification.
This module provides a clean, class-based interface for domain content classification with support for text analysis, image analysis, and historical archive.org snapshots.
- class piedomains.api.DomainClassifier(cache_dir=None)[source]¶
Bases: object

Main interface for domain content classification.

Supports multiple classification approaches:
- Traditional ML: text-based, image-based, and combined classification
- Modern AI: LLM-based classification with multimodal support
- Historical analysis via archive.org snapshots
- Example (Traditional ML):
>>> classifier = DomainClassifier()
>>> results = classifier.classify(["google.com", "facebook.com"])
>>> for result in results:
...     print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")
google.com: search (0.892)
facebook.com: socialnet (0.967)

>>> # Historical analysis
>>> results = classifier.classify(["google.com"], archive_date="20200101")
>>> print(f"Archive: {results[0]['category']} from {results[0]['date_time_collected']}")
- Example (LLM-based):
>>> classifier = DomainClassifier()
>>> classifier.configure_llm(
...     provider="openai",
...     model="gpt-4o",
...     api_key="sk-...",
...     categories=["news", "shopping", "social", "tech"]
... )
>>> results = classifier.classify_by_llm(["cnn.com"])
>>> print(f"LLM: {results[0]['category']} - {results[0]['reason']}")
- Example (Separated workflow):
>>> collector = DataCollector()
>>> collection = collector.collect(["example.com"])
>>> text_results = classifier.classify_from_collection(collection, method="text")
>>> image_results = classifier.classify_from_collection(collection, method="images")
>>> # Same collected content, different classification approaches
- JSON Output Schema:
All classification methods return List[Dict] with consistent structure:
Collection Data Schema (from collect_content):

{
    "collection_id": str,        # Unique identifier for collection
    "timestamp": str,            # ISO 8601 collection timestamp
    "config": {
        "cache_dir": str,        # Cache directory path
        "archive_date": str,     # Archive.org date (YYYYMMDD) or null
        "fetcher_type": str,     # "live" or "archive"
        "max_parallel": int      # Parallel fetch limit
    },
    "domains": [                 # List of domain results
        {
            "url": str,                  # Original input URL/domain
            "domain": str,               # Parsed domain name
            "text_path": str,            # Path to HTML file (relative to cache_dir)
            "image_path": str,           # Path to screenshot (relative to cache_dir)
            "date_time_collected": str,  # ISO 8601 timestamp
            "fetch_success": bool,       # Whether data collection succeeded
            "cached": bool,              # Whether data was retrieved from cache
            "error": str,                # Error message if fetch_success is false
            "title": str,                # Page title (optional)
            "meta_description": str      # Meta description (optional)
        }
    ],
    "summary": {
        "total_domains": int,    # Total domains requested
        "successful": int,       # Successfully collected
        "failed": int            # Failed collections
    }
}
Classification Result Schema (from classify methods):

[
    {
        "url": str,                  # Original input URL/domain
        "domain": str,               # Parsed domain name
        "text_path": str,            # Path to HTML file
        "image_path": str,           # Path to screenshot
        "date_time_collected": str,  # ISO 8601 timestamp
        "model_used": str,           # Model identifier (e.g. "text/shallalist_ml")
        "category": str,             # Predicted category
        "confidence": float,         # Confidence score (0.0-1.0)
        "reason": str,               # LLM reasoning (null for ML models)
        "error": str,                # Error message if classification failed
        "raw_predictions": dict,     # Full probability distribution

        # Combined classification specific fields:
        "text_category": str,        # Text-only prediction
        "text_confidence": float,    # Text confidence
        "image_category": str,       # Image-only prediction
        "image_confidence": float    # Image confidence
    }
]
Supported Categories: adv, aggressive, alcohol, anonvpn, automobile, chatphisher, cooking, dating, downloads, drugs, education, finance, forum, gamble, government, hacking, health, hobby, homehealth, imagehosting, jobsearch, lingerie, music, news, occult, onlinemarketing, politics, porn, publicite, radiotv, recreation, religion, remotecontrol, shopping, socialnet, spyware, updatesites, urlshortener, violence, warez, weapons, webmail, webphone, webradio, webtv
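A short sketch of consuming these results (illustrative only; the field names follow the schemas above, and the per-result error check is an assumption about typical use):

from piedomains.api import DomainClassifier

classifier = DomainClassifier()
results = classifier.classify(["google.com", "badexample.invalid"])

for result in results:
    if result.get("error"):
        # Failed fetches/classifications surface an error message per the schema
        print(f"{result['domain']}: failed ({result['error']})")
    else:
        print(f"{result['domain']}: {result['category']} (confidence {result['confidence']:.3f})")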
- __init__(cache_dir=None)[source]¶
Initialize domain classifier.
- Parameters:
cache_dir (str, optional) – Directory for caching downloaded content. Defaults to “cache” in current directory.
- classify(domains, archive_date=None, use_cache=True, latest=False)[source]¶
Classify domains using combined text and image analysis.
This is the most comprehensive classification method, using both textual content and homepage screenshots for maximum accuracy.
- Parameters:
domains (list[str]) – List of domain names or URLs to classify e.g., [“google.com”, “https://facebook.com/page”]
archive_date (str or datetime, optional) – For historical analysis. Format: “YYYYMMDD” or datetime object
use_cache (bool) – Whether to reuse cached content (default: True)
latest (bool) – Whether to download latest model versions (default: False)
- Returns:
- Classification results in JSON format with fields:
url: Original URL/domain input
domain: Parsed domain name
text_path: Path to collected HTML file
image_path: Path to collected screenshot
date_time_collected: When data was collected (ISO format)
model_used: “combined/text_image_ml”
category: Best prediction (ensemble of text + image)
confidence: Confidence score (0-1)
reason: None (reasoning field for LLM models)
error: Error message if classification failed
text_category: Text-only prediction
text_confidence: Text confidence
image_category: Image-only prediction
image_confidence: Image confidence
raw_predictions: Full probability distributions
- Return type:
list[dict]
- Raises:
ValueError – If domains list is empty
Example
>>> classifier = DomainClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} ({results[0]['confidence']:.3f})")
cnn.com: news (0.876)
- classify_by_text(domains, archive_date=None, use_cache=True, latest=False)[source]¶
Classify domains using only text content analysis.
Faster than combined analysis, good for batch processing or when screenshots are not needed.
- Parameters:
domains (list[str]) – List of domain names or URLs to classify
archive_date (str or datetime, optional) – For historical analysis. Format: "YYYYMMDD" or datetime object
use_cache (bool) – Whether to reuse cached content (default: True)
latest (bool) – Whether to download latest model versions (default: False)
- Returns:
- Text classification results in JSON format with fields:
url: Original URL/domain input
domain: Parsed domain name
text_path: Path to collected HTML file
image_path: Path to collected screenshot (may be None)
date_time_collected: When data was collected (ISO format)
model_used: “text/shallalist_ml”
category: Text classification prediction
confidence: Text confidence score (0-1)
reason: None (reasoning field for LLM models)
error: Error message if classification failed
raw_predictions: Full text probability distribution
- Return type:
list[dict]
Example
>>> classifier = DomainClassifier()
>>> results = classifier.classify_by_text(["wikipedia.org"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} ({results[0]['confidence']:.3f})")
wikipedia.org: education (0.823)
- classify_by_images(domains, archive_date=None, use_cache=True, latest=False)[source]¶
Classify domains using only homepage screenshot analysis.
Good for visual content classification, especially when text content is minimal or misleading.
- Parameters:
domains (list[str]) – List of domain names or URLs to classify
archive_date (str or datetime, optional) – For historical analysis. Format: "YYYYMMDD" or datetime object
use_cache (bool) – Whether to reuse cached content (default: True)
latest (bool) – Whether to download latest model versions (default: False)
- Returns:
- Image classification results in JSON format with fields:
url: Original URL/domain input
domain: Parsed domain name
text_path: Path to collected HTML file (may be None)
image_path: Path to collected screenshot
date_time_collected: When data was collected (ISO format)
model_used: “image/shallalist_ml”
category: Image classification prediction
confidence: Image confidence score (0-1)
reason: None (reasoning field for LLM models)
error: Error message if classification failed
raw_predictions: Full image probability distribution
- Return type:
list[dict]
Example
>>> classifier = DomainClassifier()
>>> results = classifier.classify_by_images(["instagram.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} ({results[0]['confidence']:.3f})")
instagram.com: socialnet (0.912)
- configure_llm(provider, model, api_key=None, categories=None, **kwargs)[source]¶
Configure LLM for AI-powered domain classification.
- Parameters:
provider (str) – LLM provider ('openai', 'anthropic', 'google', etc.)
model (str) – Model name (e.g., 'gpt-4o', 'claude-3-5-sonnet-20241022', 'gemini-1.5-pro')
api_key (str | None) – API key for the provider (or set via environment variable)
categories (list[str] | None) – Custom classification categories
**kwargs – Additional LLMConfig parameters (temperature, max_tokens, etc.)
- Return type:
None
Example
>>> classifier = DomainClassifier()
>>> classifier.configure_llm(
...     provider="openai",
...     model="gpt-4o",
...     api_key="sk-...",
...     categories=["news", "shopping", "social", "tech"]
... )
- classify_by_llm(domains, custom_instructions=None, use_cache=True, mode='text')[source]¶
Classify domains using LLM analysis.
- Parameters:
domains (list[str]) – List of domain names or URLs to classify
custom_instructions (str, optional) – Additional instructions to include in the LLM prompt
use_cache (bool) – Whether to reuse cached content (default: True)
mode (str) – Analysis mode for classification (default: "text")
- Returns:
- LLM classification results in JSON format with fields:
url: Original URL/domain input
domain: Parsed domain name
text_path: Path to collected HTML file
image_path: Path to collected screenshot (if applicable)
date_time_collected: When data was collected (ISO format)
model_used: “text/llm_{provider}_{model}” or similar
category: LLM classification prediction
confidence: LLM confidence score (0-1)
reason: LLM reasoning explanation
error: Error message if classification failed
- Return type:
list[dict]
- Raises:
RuntimeError – If LLM not configured
Example
>>> classifier = DomainClassifier()
>>> classifier.configure_llm("openai", "gpt-4o", api_key="sk-...")
>>> results = classifier.classify_by_llm(["cnn.com", "amazon.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} - {results[0]['reason']}")
cnn.com: news - This domain contains current events and journalism content
- classify_by_llm_multimodal(domains, custom_instructions=None, use_cache=True)[source]¶
Classify domains using LLM multimodal analysis (text + screenshots).
- Parameters:
domains (list[str]) – List of domain names or URLs to classify
custom_instructions (str, optional) – Additional instructions to include in the LLM prompt
use_cache (bool) – Whether to reuse cached content (default: True)
- Returns:
Multimodal LLM classification results in JSON format
- Return type:
list[dict]
- Raises:
RuntimeError – If LLM not configured
Example
>>> classifier = DomainClassifier()
>>> classifier.configure_llm("openai", "gpt-4o", api_key="sk-...")
>>> results = classifier.classify_by_llm_multimodal(["cnn.com"])
>>> print(f"{results[0]['domain']}: {results[0]['category']} - {results[0]['reason']}")
cnn.com: news - Based on text content and visual layout typical of news websites
- get_llm_usage_stats()[source]¶
Get LLM usage statistics and cost tracking.
Example
>>> classifier = DomainClassifier()
>>> classifier.configure_llm("openai", "gpt-4o")
>>> classifier.classify_by_llm(["example.com"])
>>> stats = classifier.get_llm_usage_stats()
>>> print(f"Cost: ${stats['estimated_cost_usd']:.4f}")
- collect_content(domains, archive_date=None, collection_id=None, use_cache=True, batch_size=10)[source]¶
Collect website content for domains without performing inference.
Separates content collection from classification, enabling:
- Content reuse across multiple models
- Clear data lineage and inspection
- Reproducible analysis workflows
- Parameters:
domains (list[str]) – List of domain names or URLs to collect content for
archive_date (str or datetime, optional) – For historical analysis
collection_id (str, optional) – Identifier for this collection
use_cache (bool) – Whether to use cached content when available
batch_size (int) – Number of domains to process in parallel
- Returns:
Collection metadata with file paths for downstream inference
- Return type:
dict
Example
>>> classifier = DomainClassifier()
>>> collection = classifier.collect_content(["cnn.com", "bbc.com"])
>>> print(collection["domains"][0]["text_path"])
html/cnn.com.html
- classify_from_collection(collection_data, method='combined', output_file=None, latest=False)[source]¶
Perform inference on previously collected content.
- Parameters:
collection_data (dict) – Collection metadata returned by collect_content
method (str) – Classification method: "text", "images", or "combined" (default: "combined")
output_file (str, optional) – Path to write results to
latest (bool) – Whether to download latest model versions (default: False)
- Returns:
Classification results in JSON format
- Return type:
list[dict]
Example
>>> classifier = DomainClassifier()
>>> collection = classifier.collect_content(["cnn.com"])
>>> results = classifier.classify_from_collection(collection, method="text")
>>> print(results[0]["category"])
news
- piedomains.api.classify_domains(domains, method='combined', archive_date=None, cache_dir=None)[source]¶
Quick domain classification function.
- Parameters:
domains (list[str]) – List of domain names or URLs to classify
method (str) – Classification method: "text", "images", or "combined" (default: "combined")
archive_date (str or datetime, optional) – For historical analysis
cache_dir (str, optional) – Directory for caching downloaded content
- Returns:
Classification results in JSON format
- Return type:
list[dict]
Example
>>> results = classify_domains(["cnn.com", "github.com"])
>>> for result in results:
...     print(f"{result['domain']}: {result['category']} ({result['confidence']:.3f})")
cnn.com: news (0.876)
github.com: computers (0.892)
piedomains.archive_org_downloader module¶
Archive.org content retrieval and historical data access utilities.
This module provides functionality for accessing historical snapshots of web content through the Internet Archive’s Wayback Machine. It includes utilities for querying available snapshots within date ranges and downloading content from archived pages.
The module supports the piedomains package’s historical domain analysis capabilities by providing structured access to archived web content for training and classification on historical data.
Example
- Basic usage for getting archived content:
>>> from piedomains.archive_org_downloader import get_urls_year, download_from_archive_org
>>> urls = get_urls_year("example.com", year=2020)
>>> if urls:
...     content = download_from_archive_org(urls[0])
...     print(f"Retrieved {len(content)} characters of historical content")
- piedomains.archive_org_downloader.get_urls_year(domain, year=2014, status_code=200, limit=None)[source]¶
Retrieve all archived URLs for a domain within a specific year.
This function queries the Internet Archive’s CDX API to find all available snapshots of a domain within the specified year that returned the given HTTP status code.
- Parameters:
domain (str) – Domain name to search for (e.g., “example.com” or “https://example.com”). The function will handle both domain names and full URLs.
year (int) – Year to search within (e.g., 2020). Defaults to 2014. Must be between 1996 (first archive) and current year.
status_code (int) – HTTP status code to filter by. Only snapshots that returned this status code will be included. Defaults to 200.
limit (Optional[int]) – Maximum number of URLs to return. If None, returns all available URLs. Useful for large domains with many snapshots.
- Returns:
- List of complete Wayback Machine URLs for accessing archived content.
Each URL is formatted as ‘https://web.archive.org/web/{timestamp}/{original_url}’. Returns empty list if no snapshots are found or if an error occurs.
- Return type:
List[str]
- Raises:
ValueError – If year is outside valid range or domain is invalid.
requests.RequestException – If API request fails (logged but not raised).
Example
>>> # Get all snapshots for 2020
>>> urls = get_urls_year("cnn.com", year=2020)
>>> print(f"Found {len(urls)} snapshots")
>>> # Get limited snapshots with error handling
>>> urls = get_urls_year("example.com", year=2019, limit=10)
>>> if urls:
...     print(f"First snapshot: {urls[0]}")
... else:
...     print("No snapshots found")
Note
The function searches for snapshots with successful HTTP responses (200 by default)
Results are ordered chronologically by snapshot timestamp
Large popular domains may have thousands of snapshots per year
Use the limit parameter to avoid excessive API calls
- piedomains.archive_org_downloader.download_from_archive_org(url, timeout=30, clean_content=True)[source]¶
Download and extract text content from an archived webpage.
This function retrieves content from a Wayback Machine URL and extracts the visible text content, optionally cleaning archive-specific elements.
- Parameters:
url (str) – Complete Wayback Machine URL (from get_urls_year or similar). Should be in format ‘https://web.archive.org/web/{timestamp}/{original_url}’.
timeout (int) – HTTP request timeout in seconds. Defaults to 30 seconds as archived pages can be slow to load.
clean_content (bool) – If True, removes archive.org specific navigation and metadata elements. Defaults to True for cleaner content.
- Returns:
- Extracted text content from the archived page. Returns empty string
if download fails or no content is found.
- Return type:
str
- Raises:
ValueError – If URL is not a valid Wayback Machine URL.
requests.RequestException – If HTTP request fails (logged but not raised).
Example
>>> # Download content from archived page
>>> wayback_url = "https://web.archive.org/web/20200101120000/https://example.com"
>>> content = download_from_archive_org(wayback_url)
>>> print(f"Retrieved {len(content)} characters")
>>> # Download with custom timeout and no cleaning
>>> raw_content = download_from_archive_org(
...     wayback_url,
...     timeout=60,
...     clean_content=False
... )
Note
Only works with archive.org URLs, not live web pages
Extracted text includes all visible page content
Archive pages may load slowly due to Internet Archive infrastructure
Some archived pages may be incomplete or corrupted
- piedomains.archive_org_downloader.get_closest_snapshot(domain, target_date, status_code=200)[source]¶
Find the archived snapshot closest to a specific target date.
This function uses the Wayback Machine availability API to find the snapshot that was captured closest in time to the specified target date.
- Parameters:
domain (str) – Domain name to search for (e.g., "example.com")
target_date (str or datetime) – Target date as a "YYYYMMDD" string or datetime object
status_code (int) – HTTP status code to filter by. Defaults to 200.
- Returns:
- Wayback Machine URL of the closest snapshot if found,
None if no snapshots are available near the target date.
- Return type:
Optional[str]
- Raises:
ValueError – If target_date format is invalid or domain is invalid.
Example
>>> from datetime import datetime
>>>
>>> # Using string date
>>> url = get_closest_snapshot("cnn.com", "20200315")
>>> if url:
...     content = download_from_archive_org(url)
>>> # Using datetime object
>>> target = datetime(2019, 6, 15)
>>> url = get_closest_snapshot("example.com", target)
Note
Returns the snapshot with timestamp closest to target_date
Preference is given to snapshots after the target date if available
Uses the Wayback Machine availability API for efficient lookup
piedomains.base module¶
Base class infrastructure for model management and data loading.
This module provides the foundational base class for all machine learning models in the piedomains package. It handles model file management, automatic downloading from remote repositories, and local caching for improved performance.
The Base class serves as the foundation for all classifier implementations, providing standardized model loading, caching, and resource management capabilities.
- class piedomains.base.Base[source]¶
Bases: object

Base class for all machine learning model implementations in piedomains.
This class provides standardized functionality for model data management, including automatic downloading, caching, and loading of model files from remote repositories. All classifier classes should inherit from this base class to ensure consistent behavior.
- Class Attributes:
- MODELFN (str | None): Relative path to the model directory within the package.
Must be set by subclasses to specify their model location.
Example
Creating a custom classifier that inherits from Base:
>>> class MyClassifier(Base):
...     MODELFN = "model/my_classifier"
...
...     def __init__(self):
...         self.model_path = self.load_model_data("my_model.zip")
Note
Subclasses must define the MODELFN class attribute to specify the model directory path. The load_model_data method will create this directory and download model files as needed.
- classmethod load_model_data(file_name, latest=False)[source]¶
Load model data from local cache or download from remote repository.
This method handles the complete lifecycle of model data:
1. Checks if local model directory exists, creates if needed
2. Verifies if model files exist locally or if update is requested
3. Downloads model data from remote repository if necessary
4. Returns the local path to model data for loading
- Parameters:
file_name (str) – Name of the model archive file to load (e.g., "text_model.zip")
latest (bool) – If True, force download of the latest model version (default: False)
- Returns:
- Absolute path to the local model directory containing the downloaded
and extracted model files. Returns empty string if MODELFN is not set or if download fails.
- Return type:
str
- Raises:
OSError – If model directory cannot be created due to permission issues.
ConnectionError – If model download fails due to network issues.
Example
>>> class TextClassifier(Base):
...     MODELFN = "model/text_classifier"
...
>>> classifier = TextClassifier()
>>> model_path = classifier.load_model_data("text_model.zip")
>>> print(f"Model loaded from: {model_path}")
Model loaded from: /path/to/piedomains/model/text_classifier
>>> # Force download of latest model
>>> latest_path = classifier.load_model_data("text_model.zip", latest=True)
Note
This method only downloads if the saved_model directory doesn’t exist or if latest=True is specified
Model files are cached locally to avoid repeated downloads
Downloads happen only on first use or when explicitly requested
- classmethod get_model_info()[source]¶
Get information about the model configuration and paths.
- Returns:
Dictionary containing model information:
'class_name': Name of the classifier class
'model_fn': Model directory path (MODELFN)
'model_path': Absolute path to model directory if it exists
'has_saved_model': Whether saved_model directory exists
- Return type:
dict
Example
>>> info = TextClassifier.get_model_info()
>>> print(info)
{'class_name': 'TextClassifier', 'model_fn': 'model/text_classifier', 'model_path': '/path/to/piedomains/model/text_classifier', 'has_saved_model': True}
- classmethod __init_subclass__(**kwargs)[source]¶
Validate subclass configuration when class is defined.
- Raises:
ValueError – If MODELFN is not properly set by subclass.
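A minimal sketch of the documented validation, assuming the ValueError is raised at class-definition time as described (the subclass name here is hypothetical):

from piedomains.base import Base

try:
    class BrokenClassifier(Base):
        pass  # MODELFN deliberately omitted
except ValueError as exc:
    print(f"Subclass rejected as documented: {exc}")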
piedomains.config module¶
Configuration management for piedomains.
- class piedomains.config.Config(config_dict=None)[source]¶
Bases: object

Configuration class for piedomains settings.
- DEFAULT_CONFIG = {'allowed_content_types': ['text/html', 'application/xhtml+xml', 'application/xml', 'text/xml', 'text/plain'], 'archive_429_wait_time': 60, 'archive_cdx_rate_limit': 1.0, 'archive_max_parallel': 2, 'archive_page_delay': 0.5, 'archive_retry_on_429': True, 'batch_size': 50, 'block_media': True, 'block_resources': ['media', 'video', 'font', 'websocket', 'manifest'], 'blocked_extensions': ['.exe', '.msi', '.scr', '.bat', '.cmd', '.com', '.pif', '.vbs', '.jar', '.app', '.dmg', '.pkg', '.deb', '.rpm', '.run', '.bin', '.elf', '.so', '.dll', '.dylib'], 'content_length_limits': {'application/pdf': 52428800, 'default': 10485760, 'text/html': 5242880}, 'content_safety_mode': 'moderate', 'enable_content_validation': True, 'html_extension': '.html', 'http_timeout': 10, 'image_extension': '.png', 'image_size': (254, 254), 'log_format': '%(asctime)s - %(name)s - %(levelname)s - %(message)s', 'log_level': 'INFO', 'max_content_length': 10485760, 'max_parallel': 4, 'max_retries': 3, 'model_cache_dir': None, 'parallel_workers': 4, 'playwright_headless': True, 'playwright_timeout': 30000, 'playwright_viewport': {'height': 1024, 'width': 1280}, 'retry_delay': 1, 'sandbox_mode_required': False, 'suspicious_url_patterns': ['.*\\/[^\\/]*\\.(exe|msi|scr|bat|cmd|pif|vbs|jar)(\\?.*)?$', '.*\\.com\\/.*\\.(exe|msi|scr|bat|cmd|pif|vbs|jar)(\\?.*)?$', '.*\\/download\\/.*\\.(zip|rar|7z|tar\\.gz|tgz)(\\?.*)?$', '.*\\/attachment\\/.*', '.*[?&](download|attachment)=.*'], 'user_agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'validate_domain_extensions': False, 'webdriver_timeout': 30, 'webdriver_window_size': '1280,1024'}¶
- __init__(config_dict=None)[source]¶
Initialize configuration.
- Parameters:
config_dict (Dict[str, Any]) – Optional configuration overrides
- get(key, default=None)[source]¶
Get configuration value.
- Parameters:
key (str) – Configuration key
default – Default value if key not found
- Returns:
Configuration value
- set(key, value)[source]¶
Set configuration value.
- Parameters:
key (str) – Configuration key
value (any) – Configuration value
- piedomains.config.get_config()[source]¶
Get global configuration instance.
- Returns:
Global configuration instance
- Return type:
Config
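A brief usage sketch based on the documented get/set interface; the key names are taken from DEFAULT_CONFIG above:

from piedomains.config import Config, get_config

config = get_config()                    # global instance
print(config.get("http_timeout"))        # 10 by default
config.set("max_parallel", 8)            # runtime override

local = Config({"log_level": "DEBUG"})   # isolated instance with overrides
print(local.get("log_level"))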
piedomains.constants module¶
Constants and classification categories for piedomains package.
This module defines the core classification categories used by piedomains for domain content classification, as well as filtering constants for text processing.
The categories are based on the Shallalist categorization system, a comprehensive classification scheme originally developed for web filtering and content analysis. These categories cover the major types of web content found across the internet.
Example
- Accessing classification categories:
>>> from piedomains.constants import classes, most_common_words
>>> print(f"Available categories: {len(classes)}")
>>> print(f"Example categories: {classes[:5]}")
>>> print(f"Common words to filter: {most_common_words[:5]}")
- piedomains.constants.classes: list[str] = ['adv', 'alcohol', 'automobile', 'dating', 'downloads', 'drugs', 'education', 'finance', 'fortunetelling', 'forum', 'gamble', 'government', 'hobby', 'hospitals', 'imagehosting', 'isp', 'jobsearch', 'models', 'movies', 'music', 'news', 'politics', 'porn', 'radiotv', 'recreation', 'redirector', 'religion', 'science', 'searchengines', 'sex', 'shopping', 'socialnet', 'spyware', 'tracker', 'urlshortener', 'warez', 'weapons', 'webmail', 'webradio']¶
Complete list of website classification categories.
This list contains 39 categories used for domain content classification. Categories are based on the Shallalist system and cover major website types including commerce, media, government, adult content, and technology services.
The categories are used by both traditional ML models and LLM-based classification to provide consistent categorization across different classification methods.
- Type:
List[str]
- piedomains.constants.most_common_words: list[str] = ['home', 'contact', 'us', 'new', 'news', 'site', 'privacy', 'search', 'help', 'copyright', 'free', 'service', 'en', 'get', 'one', 'find', 'menu', 'account', 'next']¶
Common words to filter out during text preprocessing.
These words are extremely common across all website types and provide little discriminative value for classification. They are filtered out during text processing to focus on more meaningful content words.
This list includes:
- Navigation elements (home, menu, next)
- Generic marketing terms (free, new, get)
- Common website sections (contact, help, privacy)
- Linguistic articles and connectors (us, one, en)
Used by text preprocessing functions to clean content before model input.
- Type:
List[str]
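An illustrative sketch of this kind of filtering; the helper below is hypothetical, not part of the package, and stands in for the equivalent step inside the package's own preprocessing:

from piedomains.constants import most_common_words

def strip_common_words(text):
    """Drop low-signal boilerplate words before classification."""
    stopwords = set(most_common_words)
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

print(strip_common_words("Home News Contact Us Breaking election coverage"))
# -> "Breaking election coverage"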
- piedomains.constants.get_valid_categories()[source]¶
Get a copy of all valid classification categories.
- Returns:
Complete list of valid category names for classification.
- Return type:
List[str]
Example
>>> categories = get_valid_categories()
>>> if "news" in categories:
...     print("News category is available")
- piedomains.constants.is_valid_category(category)[source]¶
Check if a category name is valid for classification.
- Parameters:
category (str) – Category name to validate.
- Returns:
True if category is valid, False otherwise.
- Return type:
bool
Example
>>> is_valid_category("news")
True
>>> is_valid_category("invalid_category")
False
piedomains.context_managers module¶
Context managers for resource cleanup and management.
- piedomains.context_managers.webdriver_context()[source]¶
DEPRECATED: Use PlaywrightFetcher context manager instead.
This function is maintained for backward compatibility.
- piedomains.context_managers.playwright_context()[source]¶
Context manager for PlaywrightFetcher instances.
- Yields:
PlaywrightFetcher – Playwright fetcher instance
Ensures proper cleanup of Playwright resources.
- Return type:
- piedomains.context_managers.temporary_directory(suffix='', prefix='piedomains_')[source]¶
Context manager for temporary directories.
- Parameters:
suffix (str) – Suffix for the temporary directory name (default: '')
prefix (str) – Prefix for the temporary directory name (default: 'piedomains_')
- Yields:
str – Path to temporary directory
- Return type:
Ensures cleanup of temporary directories.
- piedomains.context_managers.file_cleanup(*file_paths)[source]¶
Context manager for file cleanup.
- Parameters:
*file_paths (str) – Paths to files that should be cleaned up
- Return type:
Ensures cleanup of specified files after context exits.
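A hedged usage sketch for the two cleanup managers above (the paths are placeholders):

from piedomains.context_managers import temporary_directory, file_cleanup

with temporary_directory(prefix="piedomains_demo_") as tmp_dir:
    print(f"Working in {tmp_dir}")  # directory is removed on exit

with file_cleanup("scratch/page.html", "scratch/page.png"):
    pass  # both files are deleted once the context exits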
- piedomains.context_managers.error_recovery(operation_name, fallback_value=None, reraise=False)[source]¶
Context manager for error recovery with logging.
- piedomains.context_managers.batch_progress_tracking(total_items, operation_name='Processing')[source]¶
Context manager for tracking batch processing progress.
piedomains.fetchers module¶
Playwright-based page fetcher for content extraction. Supports live content fetching and archive.org historical snapshots. Unified pipeline for HTML, text extraction, and screenshots.
- class piedomains.fetchers.FetchResult(url, success, html='', text='', screenshot_path='', title='', meta_description='', error='')[source]¶
Bases: object

Result from a single fetch operation.
- __init__(url, success, html='', text='', screenshot_path='', title='', meta_description='', error='')¶
- class piedomains.fetchers.BaseFetcher[source]¶
Bases: object

Base class for content fetchers with security validation.
- class piedomains.fetchers.PlaywrightFetcher(max_parallel=4)[source]¶
Bases: BaseFetcher

Unified Playwright fetcher for all content extraction.
- __init__(max_parallel=4)[source]¶
Initialize Playwright fetcher.
- Parameters:
max_parallel (int) – Maximum number of parallel browser contexts
- async fetch_single(url, screenshot_path=None)[source]¶
Fetch content from a single URL.
- Return type:
FetchResult
- fetch_content(url, **kwargs)[source]¶
Sync wrapper for content fetching (alias for fetch_single).
- Return type:
FetchResult
- class piedomains.fetchers.ArchiveFetcher(target_date, max_parallel=None)[source]¶
Bases: BaseFetcher

Fetcher for archive.org historical snapshots using Playwright.
- async fetch_single(url, screenshot_path=None)[source]¶
Fetch content from archive.org snapshot.
- Return type:
FetchResult
- async fetch_batch(urls, cache_dir='cache')[source]¶
Fetch multiple URLs from archive.org in parallel with rate limiting.
- Return type:
list[FetchResult]
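A sketch of driving the async fetchers with asyncio, based on the signatures above; real use may require Playwright browsers to be installed:

import asyncio
from piedomains.fetchers import PlaywrightFetcher

async def main():
    fetcher = PlaywrightFetcher(max_parallel=2)
    result = await fetcher.fetch_single("https://example.com", screenshot_path="example.png")
    if result.success:
        print(result.title, len(result.text))
    else:
        print(f"Fetch failed: {result.error}")

asyncio.run(main())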
piedomains.http_client module¶
HTTP client with connection pooling and session management for improved performance.
- class piedomains.http_client.PooledHTTPClient[source]¶
Bases: object

HTTP client with connection pooling and session reuse.
- property session: Session¶
Get or create HTTP session with connection pooling.
- piedomains.http_client.http_client()[source]¶
Context manager for getting a pooled HTTP client.
- Yields:
PooledHTTPClient – HTTP client with connection pooling
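A minimal sketch, assuming the pooled session exposes the standard requests.Session interface via the documented session property:

from piedomains.http_client import http_client

with http_client() as client:
    response = client.session.get("https://example.com", timeout=10)
    print(response.status_code)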
piedomains.logging module¶
piedomains.piedomain module¶
- class piedomains.piedomain.Piedomain[source]¶
Bases: Base

- model_file_name = 'shallalist_v5_model.tar.gz'¶
- weights_loaded = False¶
- img_width = 254¶
- img_height = 254¶
- static validate_url_or_domain(url_or_domain)[source]¶
Validate if input is a valid URL or domain name.
- classmethod validate_domains(domains)[source]¶
Validate a list of domain names and separate valid from invalid.
- classmethod validate_urls_or_domains(urls_or_domains)[source]¶
Validate a list of URLs or domains and separate valid from invalid.
- classmethod get_driver()[source]¶
DEPRECATED: Use PlaywrightFetcher instead. Get configured Chrome WebDriver instance for screenshots.
- Returns:
Headless Chrome driver with optimized settings
- Return type:
webdriver.Chrome
- classmethod save_image(url_or_domain, image_dir)[source]¶
DEPRECATED: Use PlaywrightFetcher.fetch_screenshot instead. Save screenshot of URL or domain homepage.
- classmethod extract_images(input, use_cache, image_dir)[source]¶
DEPRECATED: Use ContentProcessor.extract_image_content instead. Extract screenshots for domains.
- classmethod extract_image_tensor(offline, domains, image_dir)[source]¶
Convert PNG images to TensorFlow tensors for model input.
- classmethod extract_htmls(urls_or_domains, use_cache, html_path)[source]¶
Extract HTML content from URLs or domain homepages.
- classmethod extract_html_text(offline, input, html_path)[source]¶
Extract and clean text content from HTML files.
- classmethod load_model(model_file_name, latest=False)[source]¶
Load TensorFlow models and calibrators from local cache or download from server.
- Parameters:
model_file_name (str) – Name of the model archive file (e.g., 'shallalist_v5_model.tar.gz')
latest (bool) – If True, force download of the latest model version (default: False)
Note
Loads both text and image models plus isotonic regression calibrators. Models are cached locally after first download.
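A short sketch of forcing a model refresh, using the model_file_name class attribute documented above:

from piedomains.piedomain import Piedomain

Piedomain.load_model(Piedomain.model_file_name, latest=True)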
piedomains.utils module¶
Utility functions for file operations, downloads, and security.
This module provides essential utility functions for the piedomains package, including secure file downloads, tar archive extraction with path traversal protection, and configuration management.
The utilities focus on security-first implementation, particularly for handling downloaded archives and preventing common security vulnerabilities like path traversal attacks.
- piedomains.utils.REPO_BASE_URL = 'https://dataverse.harvard.edu/api/access/datafile/7081895'¶
Base URL for model data repository.
Can be overridden via PIEDOMAINS_MODEL_URL environment variable. Defaults to Harvard Dataverse hosting the piedomains model files.
- Type:
str
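A sketch of the documented override; the mirror URL is a placeholder, and the variable must be set before models are downloaded:

import os

os.environ["PIEDOMAINS_MODEL_URL"] = "https://mirror.example.org/piedomains/models"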
- piedomains.utils.download_file(url, target, file_name, timeout=30)[source]¶
Download and extract a compressed model file from a remote repository.
This function downloads a tar.gz file from the specified URL, saves it to the target directory, extracts it using secure extraction methods, and cleans up the downloaded archive file.
- Parameters:
url (str) – URL of the remote file to download. Should point to a valid tar.gz archive containing model data.
target (str) – Local directory path where the file should be downloaded and extracted. Directory will be created if it doesn’t exist.
file_name (str) – Name to use for the downloaded file. Should include appropriate extension (e.g., “model.tar.gz”).
timeout (int) – HTTP request timeout in seconds. Defaults to 30 seconds for large model files.
- Returns:
- True if download and extraction completed successfully,
False if any error occurred during the process.
- Return type:
bool
- Raises:
requests.RequestException – If HTTP download fails (not caught, logged only).
tarfile.TarError – If tar extraction fails (not caught, logged only).
OSError – If file operations fail (not caught, logged only).
Example
>>> success = download_file(
...     url="https://example.com/model.tar.gz",
...     target="/path/to/models",
...     file_name="text_model.tar.gz"
... )
>>> if success:
...     print("Model downloaded and extracted successfully")
- Security:
Uses safe_extract() to prevent path traversal attacks
Validates archive contents before extraction
Automatically removes downloaded archive after extraction
Logs all errors for security monitoring
Note
The downloaded tar.gz file is automatically deleted after extraction to save disk space. Only the extracted contents remain in the target directory.
- piedomains.utils.is_within_directory(directory, target)[source]¶
Check if a target path is within a specified directory (security check).
This function validates that a file path is contained within a directory to prevent path traversal attacks when extracting archives. It resolves all symbolic links and relative path components before comparison.
- Parameters:
directory (str) – Directory that should contain the target path
target (str) – Path to validate against the directory boundary
- Returns:
- True if target is within directory, False if it would escape
the directory boundary (indicating a potential path traversal attack).
- Return type:
bool
Example
>>> # Safe path
>>> is_within_directory("/safe/dir", "/safe/dir/file.txt")
True

>>> # Path traversal attempt
>>> is_within_directory("/safe/dir", "/safe/dir/../../../etc/passwd")
False

>>> # Another traversal attempt
>>> is_within_directory("/safe/dir", "/safe/dir/subdir/../../../etc/passwd")
False
- Security:
This function is critical for preventing path traversal attacks (also known as directory traversal or dot-dot-slash attacks) where malicious archives attempt to extract files outside the intended directory.
Note
This function uses os.path.abspath() to resolve all relative path components and symbolic links before performing the security check.
- piedomains.utils.safe_extract(tar, path='.', members=None, *, numeric_owner=False)[source]¶
Securely extract a tar archive with path traversal protection.
This function provides a secure wrapper around tarfile.extractall() that validates all archive members to prevent path traversal attacks. It checks each member’s path before extraction to ensure it stays within the target directory.
- Parameters:
tar (tarfile.TarFile) – Open tar file object to extract from.
path (str) – Directory path where archive should be extracted. Defaults to current directory (“.”).
members (list, optional) – Specific members to extract. If None, extracts all members. Defaults to None.
numeric_owner (bool) – If True, preserve numeric user/group IDs. If False, use current user. Defaults to False.
- Raises:
SecurityError – If any archive member attempts path traversal (would extract outside the target directory).
tarfile.TarError – If tar extraction fails for other reasons.
OSError – If file system operations fail.
- Return type:
None
Example
>>> import tarfile
>>> with tarfile.open("model.tar.gz", "r:gz") as tar:
...     safe_extract(tar, "/safe/extraction/dir")
- Security:
Validates every archive member before extraction
Prevents path traversal attacks (e.g., “../../../etc/passwd”)
Logs security violations for monitoring
Raises exceptions rather than silently failing
Note
This function should always be used instead of tarfile.extractall() when handling archives from untrusted sources, which includes downloaded model files.
- exception piedomains.utils.SecurityError[source]¶
Bases: Exception

Exception raised for security violations during file operations.
This exception is raised when security checks fail, particularly during archive extraction when path traversal attempts are detected.
Example
>>> try:
...     safe_extract(malicious_tar, "/safe/dir")
... except SecurityError as e:
...     logger.error(f"Security violation: {e}")
- piedomains.utils.get_file_hash(file_path, algorithm='sha256')[source]¶
Calculate cryptographic hash of a file for integrity verification.
- Parameters:
file_path (str) – Path to the file to hash
algorithm (str) – Hash algorithm to use (e.g., 'sha256', 'md5'). Defaults to 'sha256'.
- Returns:
Hexadecimal hash digest of the file.
- Return type:
str
- Raises:
FileNotFoundError – If the specified file doesn’t exist.
ValueError – If an unsupported hash algorithm is specified.
Example
>>> hash_value = get_file_hash("model.tar.gz", "sha256")
>>> print(f"File hash: {hash_value}")
Module contents¶
Piedomains: Domain content classification library.
This module provides lazy imports to avoid dependency issues when optional dependencies (like playwright) are not installed.
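A minimal end-to-end sketch using the API documented above:

from piedomains.api import classify_domains

results = classify_domains(["example.com"], method="text")
print(results[0]["category"], results[0]["confidence"])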