piedomains classifiers
Domain Classification Modules
piedomains.text module
Text-based domain classification using HTML content analysis.
- class piedomains.text.TextClassifier(cache_dir=None, archive_date=None)
Bases: Base
Text-based domain content classifier.
- model_file_name = 'shallalist_v5_model.tar.gz'
- classify(domains, latest=False)
Classify domains using their cached HTML content.
- Parameters:
- Return type:
- Returns:
List of classification result dictionaries
Example
>>> from piedomains.text import TextClassifier
>>> classifier = TextClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(results[0]["category"])
news
- classify_from_paths(data_paths, output_file=None, latest=False)
Classify domains using HTML files from collected data paths.
- Parameters:
- Return type:
- Returns:
List of classification result dictionaries (JSON format)
Example
>>> from piedomains.text import TextClassifier
>>> classifier = TextClassifier()
>>> data = [{"domain": "cnn.com", "text_path": "html/cnn.com.html", ...}]
>>> results = classifier.classify_from_paths(data)
>>> print(results[0]["category"])
news
- classify_from_data(collection_data, output_file=None, latest=False)
Classify domains using collection metadata from DataCollector.
- Parameters:
- Return type:
- Returns:
List of classification result dictionaries (JSON format)
Example
>>> from piedomains import DataCollector
>>> from piedomains.text import TextClassifier
>>> collector = DataCollector()
>>> data = collector.collect(["cnn.com"])
>>> classifier = TextClassifier()
>>> results = classifier.classify_from_data(data)
piedomains.image module
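All three `classify*` methods return a list of result dictionaries. As a minimal sketch of post-processing such a list, the snippet below groups domains by predicted category. The sample records are hand-written for illustration, `group_by_category` is a hypothetical helper, and the presence of a `domain` key on each result is an assumption; only the `category` key appears in the examples above.

```python
from collections import defaultdict


def group_by_category(results):
    """Group domains by predicted category.

    Assumes each result dict carries "domain" and "category" keys; the
    "domain" key is an assumption, not confirmed by this reference.
    """
    grouped = defaultdict(list)
    for record in results:
        grouped[record.get("category", "unknown")].append(record.get("domain"))
    return dict(grouped)


# Hand-written sample records standing in for real classifier output.
sample_results = [
    {"domain": "cnn.com", "category": "news"},
    {"domain": "bbc.com", "category": "news"},
    {"domain": "example.org", "category": "shopping"},
]

print(group_by_category(sample_results))
```

Records without a `category` key fall into an `"unknown"` bucket rather than raising, which may or may not be what a real pipeline wants.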
Image-based domain classification using homepage screenshots.
- class piedomains.image.ImageClassifier(cache_dir=None, archive_date=None)
Bases: Base
Image-based domain content classifier using homepage screenshots.
- model_file_name = 'shallalist_v5_model.tar.gz'
- classify(domains, latest=False)
Classify domains using their cached screenshot images.
- Parameters:
- Return type:
- Returns:
List of classification result dictionaries
Example
>>> from piedomains.image import ImageClassifier
>>> classifier = ImageClassifier()
>>> results = classifier.classify(["cnn.com", "bbc.com"])
>>> print(results[0]["category"])
news
- classify_from_paths(data_paths, output_file=None, latest=False)
Classify domains using screenshot files from collected data paths.
- Parameters:
- Return type:
- Returns:
List of classification result dictionaries (JSON format)
Example
>>> from piedomains.image import ImageClassifier
>>> classifier = ImageClassifier()
>>> data = [{"domain": "cnn.com", "image_path": "images/cnn.com.png", ...}]
>>> results = classifier.classify_from_paths(data)
>>> print(results[0]["category"])
news
- classify_from_data(collection_data, output_file=None, latest=False)
Classify domains using collection metadata from DataCollector.
- Parameters:
- Return type:
- Returns:
List of classification result dictionaries (JSON format)
Example
>>> from piedomains import DataCollector
>>> from piedomains.image import ImageClassifier
>>> collector = DataCollector()
>>> data = collector.collect(["cnn.com"])
>>> classifier = ImageClassifier()
>>> results = classifier.classify_from_data(data)
piedomains.llm module
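The `classify_from_paths` example above passes records that pair a domain with a screenshot path. The following is a rough sketch of building such records from a directory of PNG files, assuming each file is named `<domain>.png`. `build_image_records` is a hypothetical helper, and it fills in only the two keys shown in the example; the fields elided there are not known and are not reproduced here.

```python
from pathlib import Path


def build_image_records(image_dir):
    """Return one record per PNG, inferring the domain from the file name.

    Assumes files are named "<domain>.png", e.g. "cnn.com.png".
    """
    records = []
    for png in sorted(Path(image_dir).glob("*.png")):
        # Path.stem drops only the final suffix, so "cnn.com.png" -> "cnn.com".
        records.append({"domain": png.stem, "image_path": str(png)})
    return records
```

Calling `build_image_records("images")` on a directory holding `cnn.com.png` would yield a single record with `domain` set to `cnn.com`. Whether that record is sufficient input for `classify_from_paths` depends on the fields the library actually requires.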
LLM-based classification utilities for piedomains.
- class piedomains.llm.LLMConfig(provider, model, api_key=None, base_url=None, max_tokens=500, temperature=0.1, categories=None, cost_limit_usd=10.0, usage_tracking=True)
Bases: object
Configuration for LLM-based classification.
- provider
LLM provider (e.g., ‘openai’, ‘anthropic’, ‘google’)
- model
Model name (e.g., ‘gpt-4o’, ‘claude-3-5-sonnet-20241022’, ‘gemini-1.5-pro’)
- api_key
API key for the provider
- base_url
Optional base URL for custom endpoints
- max_tokens
Maximum tokens for response
- temperature
Temperature for response generation
- categories
List of classification categories
- cost_limit_usd
Maximum cost limit in USD
- usage_tracking
Whether to track API usage
- __init__(provider, model, api_key=None, base_url=None, max_tokens=500, temperature=0.1, categories=None, cost_limit_usd=10.0, usage_tracking=True)
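To make the defaults above concrete, here is a small stand-in dataclass that mirrors the documented `LLMConfig` signature, together with a hypothetical budget check. `LLMConfigSketch` and `within_budget` are illustrative names, not part of piedomains, and how the library actually enforces `cost_limit_usd` is not described in this reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LLMConfigSketch:
    """Stand-in mirroring the documented LLMConfig fields and defaults."""

    provider: str
    model: str
    api_key: Optional[str] = None
    base_url: Optional[str] = None
    max_tokens: int = 500
    temperature: float = 0.1
    categories: Optional[List[str]] = None
    cost_limit_usd: float = 10.0
    usage_tracking: bool = True


def within_budget(config, spent_usd, next_call_usd):
    """Hypothetical guard: True if the next call keeps spend within the limit."""
    return spent_usd + next_call_usd <= config.cost_limit_usd


config = LLMConfigSketch(provider="openai", model="gpt-4o", cost_limit_usd=1.0)
print(within_budget(config, spent_usd=0.90, next_call_usd=0.05))
print(within_budget(config, spent_usd=0.90, next_call_usd=0.20))
```

With the real class, the equivalent construction would presumably be `LLMConfig(provider="openai", model="gpt-4o", cost_limit_usd=1.0)`, since only `provider` and `model` lack defaults in the documented signature.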
- piedomains.llm.get_classification_prompt(domain, content, categories, max_content_length=8000)
Generate classification prompt for text-only analysis.
- piedomains.llm.get_multimodal_prompt(domain, content=None, categories=None, has_screenshot=False, max_content_length=6000)
Generate classification prompt for multimodal analysis (text + image).
- Parameters:
- Return type:
- Returns:
Formatted prompt string
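Both prompt helpers take a `max_content_length` argument. A plausible reading is that it caps how much page content is placed in the prompt; the sketch below illustrates that idea under the assumption that the cap is measured in characters. `build_prompt_sketch` is a hypothetical function and its wording is invented; it is not the prompt piedomains produces.

```python
def build_prompt_sketch(domain, content, categories, max_content_length=8000):
    """Illustrative prompt builder; the prompt wording is invented here."""
    # Assumption: the limit is a character count applied by simple slicing.
    truncated = content[:max_content_length]
    category_list = ", ".join(categories)
    return (
        f"Classify the website {domain} into one of: {category_list}.\n\n"
        f"Page content:\n{truncated}"
    )


# 10,000 characters of filler content, capped at 100 for the demonstration.
prompt = build_prompt_sketch(
    "cnn.com", "x" * 10000, ["news", "shopping"], max_content_length=100
)
print(len(prompt))
```

Truncating before formatting keeps the prompt size bounded regardless of how large the fetched page is, which matters when `max_tokens` and cost limits are in play.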
- piedomains.llm.parse_llm_response(response_text)
Parse LLM response into structured classification result.
- Parameters:
response_text (str) – Raw response text from LLM
- Return type:
- Returns:
Dictionary with parsed classification data
- Raises:
ValueError – If response cannot be parsed
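As an illustration of the contract described above (raw text in, dictionary out, `ValueError` on failure), the following sketch parses a response that is assumed to contain a JSON object. `parse_response_sketch` is a hypothetical function, and the JSON format and the `confidence` key in the sample are assumptions; the actual response format expected by `parse_llm_response` is not documented here.

```python
import json


def parse_response_sketch(response_text):
    """Parse the span from the first '{' to the last '}' as a JSON object.

    Raises ValueError if no such span exists or it is not valid JSON.
    """
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    try:
        return json.loads(response_text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc


# Sample text imitating a model that wraps its JSON in extra prose.
raw = 'Sure! Here is the result:\n{"category": "news", "confidence": 0.92}'
result = parse_response_sketch(raw)
print(result["category"])
```

Slicing between the outermost braces tolerates models that add a preamble or trailing remark around the JSON, at the cost of failing when the text contains several separate objects.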