piedomains classifiers

Domain Classification Modules

piedomains.text module

Text-based domain classification using HTML content analysis.

class piedomains.text.TextClassifier(cache_dir=None, archive_date=None)

Bases: Base

Text-based domain content classifier.

MODELFN: str | None = 'model/shallalist'

model_file_name = 'shallalist_v5_model.tar.gz'

__init__(cache_dir=None, archive_date=None)

Initialize text classifier.

Parameters:

cache_dir (str, optional) – Directory for caching content
archive_date (str, optional) – Date for archive.org snapshots

load_models(latest=False)

Load text classification model and calibrators.

classify(domains, latest=False)

Classify domains using their cached HTML content.

Parameters:

domains (list[str]) – Domain names to classify
latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries

Example

>>> classifier = TextClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(results[0]["category"])
news

classify_from_paths(data_paths, output_file=None, latest=False)

Classify domains using HTML files from collected data paths.

Parameters:

data_paths (list[dict]) – One dict per domain, with keys such as domain and text_path, among others
output_file (str, optional) – Path where the JSON results are saved
latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> classifier = TextClassifier()
>>> data = [{"domain": "cnn.com", "text_path": "html/cnn.com.html", ...}]
>>> results = classifier.classify_from_paths(data)
>>> print(results[0]["category"])
news
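The data_paths argument is a list of per-domain records. As a minimal sketch, assuming the HTML files were saved as "<domain>.html" in one directory (as the example above suggests), such a list could be assembled like this. The helper name build_data_paths is hypothetical and not part of piedomains, and the real records may carry additional keys.

```python
from pathlib import Path


def build_data_paths(html_dir: str) -> list[dict]:
    """Hypothetical helper: map each saved <domain>.html file to a record."""
    records = []
    # Sort for a deterministic order regardless of filesystem listing order.
    for html_file in sorted(Path(html_dir).glob("*.html")):
        records.append(
            {
                # Assumption: the file stem is the domain name, e.g. "cnn.com.html".
                "domain": html_file.stem,
                "text_path": str(html_file),
            }
        )
    return records
```

The resulting list could then be passed as data_paths to classify_from_paths.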

classify_from_data(collection_data, output_file=None, latest=False)

Classify domains using collection metadata from DataCollector.

Parameters:

collection_data (dict) – Metadata dict returned by DataCollector.collect()
output_file (str, optional) – Path where the JSON results are saved
latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> from piedomains import DataCollector
>>> collector = DataCollector()
>>> data = collector.collect(["cnn.com"])
>>> classifier = TextClassifier()
>>> results = classifier.classify_from_data(data)

piedomains.image module

Image-based domain classification using homepage screenshots.

class piedomains.image.ImageClassifier(cache_dir=None, archive_date=None)

Bases: Base

Image-based domain content classifier using homepage screenshots.

MODELFN: str | None = 'model/shallalist'

model_file_name = 'shallalist_v5_model.tar.gz'

__init__(cache_dir=None, archive_date=None)

Initialize image classifier.

Parameters:

cache_dir (str, optional) – Directory for caching content
archive_date (str, optional) – Date for archive.org snapshots

load_models(latest=False)

Load image classification model.

classify(domains, latest=False)

Classify domains using their cached screenshot images.

Parameters:

domains (list[str]) – Domain names to classify
latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries

Example

>>> classifier = ImageClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(results[0]["category"])
news

classify_from_paths(data_paths, output_file=None, latest=False)

Classify domains using screenshot files from collected data paths.

Parameters:

data_paths (list[dict]) – One dict per domain, with keys such as domain and image_path, among others
output_file (str, optional) – Path where the JSON results are saved
latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> classifier = ImageClassifier()
>>> data = [{"domain": "cnn.com", "image_path": "images/cnn.com.png", ...}]
>>> results = classifier.classify_from_paths(data)
>>> print(results[0]["category"])
news

classify_from_data(collection_data, output_file=None, latest=False)

Classify domains using collection metadata from DataCollector.

Parameters:

collection_data (dict) – Metadata dict returned by DataCollector.collect()
output_file (str, optional) – Path where the JSON results are saved
latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> from piedomains import DataCollector
>>> collector = DataCollector()
>>> data = collector.collect(["cnn.com"])
>>> classifier = ImageClassifier()
>>> results = classifier.classify_from_data(data)

piedomains.llm module

LLM-based classification utilities for piedomains.

class piedomains.llm.LLMConfig(provider, model, api_key=None, base_url=None, max_tokens=500, temperature=0.1, categories=None, cost_limit_usd=10.0, usage_tracking=True)

Bases: object

Configuration for LLM-based classification.

provider: LLM provider (e.g., 'openai', 'anthropic', 'google')

model: Model name (e.g., 'gpt-4o', 'claude-3-5-sonnet-20241022', 'gemini-1.5-pro')

api_key: API key for the provider

base_url: Optional base URL for custom endpoints

max_tokens: Maximum tokens for response

temperature: Temperature for response generation

categories: List of classification categories

cost_limit_usd: Maximum cost limit in USD

usage_tracking: Whether to track API usage

__init__(provider, model, api_key=None, base_url=None, max_tokens=500, temperature=0.1, categories=None, cost_limit_usd=10.0, usage_tracking=True)

__post_init__()

Validate and set defaults after initialization.

Return type:

None

api_key: str | None = None

base_url: str | None = None

categories: list[str] | None = None

cost_limit_usd: float = 10.0

classmethod from_dict(config_dict)

Create LLMConfig from dictionary.

Return type:

LLMConfig

max_tokens: int = 500

temperature: float = 0.1

to_litellm_params()

Convert to litellm parameters.

Return type:

dict[str, Any]

usage_tracking: bool = True

provider: str

model: str
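To make the shape of this configuration object concrete, the following is a minimal sketch of a dataclass with the same documented fields and defaults. SketchLLMConfig is a hypothetical stand-in, not the piedomains class: the validation rules in __post_init__ and the choice to ignore unknown keys in from_dict are assumptions made for illustration, and the real implementation may behave differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class SketchLLMConfig:
    """Illustrative stand-in mirroring the documented LLMConfig fields."""

    provider: str
    model: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None
    max_tokens: int = 500
    temperature: float = 0.1
    categories: Optional[list[str]] = None
    cost_limit_usd: float = 10.0
    usage_tracking: bool = True

    def __post_init__(self) -> None:
        # Assumed validation rules; the library's actual checks are not documented here.
        if not self.provider or not self.model:
            raise ValueError("provider and model are required")
        if self.cost_limit_usd < 0:
            raise ValueError("cost_limit_usd must be non-negative")

    @classmethod
    def from_dict(cls, config_dict: dict[str, Any]) -> "SketchLLMConfig":
        # Assumption: unknown keys are dropped rather than raising TypeError.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in config_dict.items() if k in known})
```

Only provider and model are required; every other field falls back to the default shown in the class signature.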

piedomains.llm.get_classification_prompt(domain, content, categories, max_content_length=8000)

Generate classification prompt for text-only analysis.

Parameters:

domain (str) – Domain name to classify
content (str) – Extracted text content from the domain
categories (list[str]) – List of available categories
max_content_length (int) – Maximum length of content to include

Return type:

str

Returns:

Formatted prompt string
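The role of max_content_length is easiest to see in code. The sketch below is a hypothetical prompt builder with the same parameters; the wording of the prompt and the simple character-based truncation are assumptions, not the template piedomains actually uses.

```python
def build_classification_prompt(
    domain: str,
    content: str,
    categories: list[str],
    max_content_length: int = 8000,
) -> str:
    """Hypothetical prompt builder; not the library's actual template."""
    # Truncate long page text so the prompt size stays predictable.
    snippet = content[:max_content_length]
    category_list = ", ".join(categories)
    return (
        f"Classify the website '{domain}' into one of these categories: "
        f"{category_list}.\n\n"
        f"Page content:\n{snippet}"
    )
```

Capping the content length bounds the size of each request, which matters when a cost limit such as cost_limit_usd is in force.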

piedomains.llm.get_multimodal_prompt(domain, content=None, categories=None, has_screenshot=False, max_content_length=6000)

Generate classification prompt for multimodal analysis (text + image).

Parameters:

domain (str) – Domain name to classify
content (str | None) – Extracted text content (optional)
categories (list[str] | None) – List of available categories
has_screenshot (bool) – Whether a screenshot image is provided
max_content_length (int) – Maximum length of content to include

Return type:

str

Returns:

Formatted prompt string
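Because content, categories, and the screenshot are all optional here, a multimodal prompt has to be assembled from whichever pieces are present. The following sketch shows one way to do that; build_multimodal_prompt and its prompt wording are hypothetical and only mirror the documented parameters.

```python
from typing import Optional


def build_multimodal_prompt(
    domain: str,
    content: Optional[str] = None,
    categories: Optional[list[str]] = None,
    has_screenshot: bool = False,
    max_content_length: int = 6000,
) -> str:
    """Hypothetical builder that includes only the sections that are available."""
    parts = [f"Classify the website '{domain}'."]
    if categories:
        parts.append("Choose one category from: " + ", ".join(categories) + ".")
    if has_screenshot:
        # The image itself travels separately; the prompt only mentions it.
        parts.append("A screenshot of the homepage is attached.")
    if content:
        parts.append("Page content:\n" + content[:max_content_length])
    return "\n\n".join(parts)
```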

piedomains.llm.parse_llm_response(response_text)

Parse LLM response into structured classification result.

Parameters:

response_text (str) – Raw response text from LLM

Return type:

dict[str, Any]

Returns:

Dictionary with parsed classification data

Raises:

ValueError – If response cannot be parsed
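As a rough illustration of the parse-or-raise contract described above, the sketch below extracts a JSON object from a model reply and raises ValueError when that fails. It assumes the model was asked to answer in JSON with a "category" field; the function name parse_response_sketch, the expected format, and the required key are all assumptions rather than the library's documented behavior.

```python
import json
import re
from typing import Any


def parse_response_sketch(response_text: str) -> dict[str, Any]:
    """Hypothetical parser: extract a JSON object embedded in an LLM reply."""
    # Models often wrap JSON in prose, so take the span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    # Assumption: a usable result must name a category.
    if "category" not in parsed:
        raise ValueError("response is missing the 'category' field")
    return parsed
```

Callers can wrap the call in try/except ValueError to skip or retry domains whose replies are malformed.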