piedomains classifiers

Domain Classification Modules

piedomains.text module

Text-based domain classification using HTML content analysis.

class piedomains.text.TextClassifier(cache_dir=None, archive_date=None)

Bases: Base

Text-based domain content classifier.

MODELFN: str | None = 'model/shallalist'
model_file_name = 'shallalist_v5_model.tar.gz'

__init__(cache_dir=None, archive_date=None)

Initialize text classifier.

Parameters:
  • cache_dir (str, optional) – Directory for caching content

  • archive_date (str, optional) – Date for archive.org snapshots

load_models(latest=False)

Load text classification model and calibrators.

classify(domains, latest=False)

Classify domains using their cached HTML content.

Parameters:
  • domains (list[str]) – List of domain names to classify

  • latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries

Example

>>> classifier = TextClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(results[0]["category"])
news

classify_from_paths(data_paths, output_file=None, latest=False)

Classify domains using HTML files from collected data paths.

Parameters:
  • data_paths (list[dict]) – List of per-domain dicts, each with keys such as domain and text_path

  • output_file (str, optional) – Path where the JSON results are saved

  • latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> classifier = TextClassifier()
>>> data = [{"domain": "cnn.com", "text_path": "html/cnn.com.html", ...}]
>>> results = classifier.classify_from_paths(data)
>>> print(results[0]["category"])
news
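
As a rough sketch of how the data_paths argument might be assembled, the snippet below scans a directory of saved HTML files and builds one dict per file. Only the domain and text_path keys shown in the example above are populated. The helper name build_data_paths and the "<domain>.html" file naming are assumptions for illustration, and real collector output may carry additional keys.

```python
from pathlib import Path


def build_data_paths(html_dir):
    """Build a list of {"domain", "text_path"} dicts from saved HTML files.

    Hypothetical helper: assumes each file is named "<domain>.html".
    """
    entries = []
    for path in sorted(Path(html_dir).glob("*.html")):
        entries.append({
            "domain": path.stem,       # "cnn.com.html" -> "cnn.com"
            "text_path": str(path),    # path to the saved HTML file
        })
    return entries
```

Under these assumptions, the resulting list could be passed as classifier.classify_from_paths(build_data_paths("html")).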

classify_from_data(collection_data, output_file=None, latest=False)

Classify domains using collection metadata from DataCollector.

Parameters:
  • collection_data (dict) – Collection metadata dict from DataCollector.collect()

  • output_file (str, optional) – Path where the JSON results are saved

  • latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> from piedomains import DataCollector
>>> collector = DataCollector()
>>> data = collector.collect(["cnn.com"])
>>> classifier = TextClassifier()
>>> results = classifier.classify_from_data(data)
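
When output_file is supplied, the results are saved as JSON. A minimal sketch of reading such a file back and grouping domains by predicted category follows. It assumes the file holds a JSON list of dicts and that each dict carries domain and category keys. Only category appears in the examples above, so the domain key and the helper name group_by_category are assumptions.

```python
import json
from collections import defaultdict


def group_by_category(results_path):
    """Group domains by category from a saved results file (assumed JSON list)."""
    with open(results_path, encoding="utf-8") as fh:
        results = json.load(fh)

    grouped = defaultdict(list)
    for row in results:
        # Fall back to "unknown" when a result has no category (assumption).
        grouped[row.get("category", "unknown")].append(row.get("domain"))
    return dict(grouped)
```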

piedomains.image module

Image-based domain classification using homepage screenshots.

class piedomains.image.ImageClassifier(cache_dir=None, archive_date=None)

Bases: Base

Image-based domain content classifier using homepage screenshots.

MODELFN: str | None = 'model/shallalist'
model_file_name = 'shallalist_v5_model.tar.gz'

__init__(cache_dir=None, archive_date=None)

Initialize image classifier.

Parameters:
  • cache_dir (str, optional) – Directory for caching content

  • archive_date (str, optional) – Date for archive.org snapshots

load_models(latest=False)

Load image classification model.

classify(domains, latest=False)

Classify domains using their cached screenshot images.

Parameters:
  • domains (list[str]) – List of domain names to classify

  • latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries

Example

>>> classifier = ImageClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(results[0]["category"])
news

classify_from_paths(data_paths, output_file=None, latest=False)

Classify domains using screenshot files from collected data paths.

Parameters:
  • data_paths (list[dict]) – List of per-domain dicts, each with keys such as domain and image_path

  • output_file (str, optional) – Path where the JSON results are saved

  • latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> classifier = ImageClassifier()
>>> data = [{"domain": "cnn.com", "image_path": "images/cnn.com.png", ...}]
>>> results = classifier.classify_from_paths(data)
>>> print(results[0]["category"])
news

classify_from_data(collection_data, output_file=None, latest=False)

Classify domains using collection metadata from DataCollector.

Parameters:
  • collection_data (dict) – Collection metadata dict from DataCollector.collect()

  • output_file (str, optional) – Path where the JSON results are saved

  • latest (bool) – Whether to use the latest model version

Return type:

list[dict]

Returns:

List of classification result dictionaries (JSON format)

Example

>>> from piedomains import DataCollector
>>> collector = DataCollector()
>>> data = collector.collect(["cnn.com"])
>>> classifier = ImageClassifier()
>>> results = classifier.classify_from_data(data)

piedomains.llm module

LLM-based classification utilities for piedomains.

class piedomains.llm.LLMConfig(provider, model, api_key=None, base_url=None, max_tokens=500, temperature=0.1, categories=None, cost_limit_usd=10.0, usage_tracking=True)

Bases: object

Configuration for LLM-based classification.

provider

LLM provider (e.g., 'openai', 'anthropic', 'google')

model

Model name (e.g., 'gpt-4o', 'claude-3-5-sonnet-20241022', 'gemini-1.5-pro')

api_key

API key for the provider

base_url

Optional base URL for custom endpoints

max_tokens

Maximum tokens for response

temperature

Temperature for response generation

categories

List of classification categories

cost_limit_usd

Maximum cost in USD

usage_tracking

Whether to track API usage

__init__(provider, model, api_key=None, base_url=None, max_tokens=500, temperature=0.1, categories=None, cost_limit_usd=10.0, usage_tracking=True)

__post_init__()

Validate and set defaults after initialization.

Return type:

None

api_key: str | None = None
base_url: str | None = None
categories: list[str] | None = None
cost_limit_usd: float = 10.0

classmethod from_dict(config_dict)

Create LLMConfig from dictionary.

Return type:

LLMConfig

max_tokens: int = 500
temperature: float = 0.1

to_litellm_params()

Convert to litellm parameters.

Return type:

dict[str, Any]

usage_tracking: bool = True
provider: str
model: str
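
To make the shape of the configuration concrete, here is a minimal stand-in dataclass that mirrors the documented fields and defaults. It is not the library's implementation: the class name LLMConfigSketch, the validation in __post_init__, the key filtering in from_dict, and the "provider/model" string built in to_litellm_params are all assumptions for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class LLMConfigSketch:
    """Illustrative stand-in for LLMConfig; fields and defaults follow the docs."""

    provider: str
    model: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None
    max_tokens: int = 500
    temperature: float = 0.1
    categories: Optional[list] = None
    cost_limit_usd: float = 10.0
    usage_tracking: bool = True

    def __post_init__(self) -> None:
        # Assumed validation: reject an empty provider or model name.
        if not self.provider or not self.model:
            raise ValueError("provider and model must be non-empty")

    @classmethod
    def from_dict(cls, config_dict):
        # Assumed behavior: silently ignore keys that are not dataclass fields.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in config_dict.items() if k in known})

    def to_litellm_params(self) -> dict:
        # Assumed mapping: a "provider/model" string plus generation settings.
        # The prefix litellm expects may differ from the provider name.
        params: dict = {
            "model": f"{self.provider}/{self.model}",
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
        }
        # Optional settings are only included when they were provided.
        if self.api_key:
            params["api_key"] = self.api_key
        if self.base_url:
            params["base_url"] = self.base_url
        return params
```

With this sketch, LLMConfigSketch.from_dict({"provider": "openai", "model": "gpt-4o"}) yields a config carrying the documented defaults, such as max_tokens=500 and temperature=0.1.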

piedomains.llm.get_classification_prompt(domain, content, categories, max_content_length=8000)

Generate classification prompt for text-only analysis.

Parameters:
  • domain (str) – Domain name to classify

  • content (str) – Extracted text content from the domain

  • categories (list[str]) – List of available categories

  • max_content_length (int) – Maximum length of content to include

Return type:

str

Returns:

Formatted prompt string
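
The role of max_content_length can be illustrated with a small stand-in. The sketch below assumes the limit is measured in characters and that content is cut from the start of the text. The function name build_prompt_sketch and the prompt wording are hypothetical and do not reproduce the library's actual prompt.

```python
def build_prompt_sketch(domain, content, categories, max_content_length=8000):
    """Assemble a text-only classification prompt (illustrative wording)."""
    # Truncate long pages so the prompt stays within a predictable size.
    snippet = content[:max_content_length]
    category_list = ", ".join(categories)
    return (
        f"Classify the website '{domain}' into one of these categories: "
        f"{category_list}.\n\n"
        f"Content:\n{snippet}"
    )
```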

piedomains.llm.get_multimodal_prompt(domain, content=None, categories=None, has_screenshot=False, max_content_length=6000)

Generate classification prompt for multimodal analysis (text + image).

Parameters:
  • domain (str) – Domain name to classify

  • content (str | None) – Extracted text content (optional)

  • categories (list[str] | None) – List of available categories

  • has_screenshot (bool) – Whether a screenshot image is provided

  • max_content_length (int) – Maximum length of content to include

Return type:

str

Returns:

Formatted prompt string

piedomains.llm.parse_llm_response(response_text)

Parse LLM response into structured classification result.

Parameters:

response_text (str) – Raw response text from LLM

Return type:

dict[str, Any]

Returns:

Dictionary with parsed classification data

Raises:

ValueError – If response cannot be parsed
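
A minimal sketch of the parse-or-raise contract described above, assuming the model replies with a JSON object that may be wrapped in extra prose. The function name parse_response_sketch and the expected reply format are assumptions, since the actual response schema is not documented here.

```python
import json


def parse_response_sketch(response_text):
    """Parse the JSON object spanning the first '{' to the last '}'."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    try:
        return json.loads(response_text[start:end + 1])
    except json.JSONDecodeError as exc:
        # Re-raise with a message that names the failure explicitly.
        raise ValueError(f"response is not valid JSON: {exc}") from exc
```

Callers would wrap the call in try/except ValueError and decide whether to retry the request or record the domain as unclassified.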