Using Archae as a Python Library¶
While Archae is primarily designed as a command-line tool, it can also be used as a Python library to programmatically extract and analyze archives in your own code.
Installation¶
To use Archae as a library, install it with pip or your preferred package manager:
uv pip install archae
Basic Usage¶
The main entry point for library usage is the ArchiveExtractor class from archae.extractor.
Simple Example¶
from pathlib import Path
from archae.extractor import ArchiveExtractor
# Create an extractor instance
extractor = ArchiveExtractor(base_dir=Path("."))
# Process an archive
archive_path = Path("my_archive.zip")
extractor.handle_file(archive_path)
# Retrieve extracted files information
tracked_files = extractor.get_tracked_files()
for file_hash, file_info in tracked_files.items():
    print(f"Hash: {file_hash}")
    print(f"Info: {file_info}")
Core Classes and Methods¶
ArchiveExtractor¶
The main class for handling archive extraction and file tracking.
Constructor¶
ArchiveExtractor(base_dir: Path | None = None)
Parameters:
base_dir (optional): The base directory for extraction operations. Defaults to the current working directory. An extracted/ subdirectory is created here to store extracted files.
Key Methods¶
handle_file(file_path: Path)¶
Process a file or archive. Recursively extracts archives and tracks all files.
Parameters:
file_path: Path to the file or archive to process
Example:
extractor.handle_file(Path("archive.tar.gz"))
get_tracked_files() -> dict[str, dict]¶
Retrieve information about all tracked files.
Returns: Dictionary mapping file hashes to their metadata including:
size: File size in bytes
type: File type (from libmagic)
type_mime: MIME type
extension: File extension
is_archive: Whether the file is an archive
extracted_size: Uncompressed size (for archives)
overall_compression_ratio: Compression ratio (for archives)
Example:
files = extractor.get_tracked_files()
for file_hash, metadata in files.items():
    print(f"{file_hash}: {metadata['size']} bytes")
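For type checking, the metadata fields listed above can be described with a TypedDict. This is only a sketch: the class name TrackedFileInfo and the helper describe are hypothetical, the field types are inferred from the field descriptions rather than taken from Archae's source, and the sample entry is made up.

```python
from typing import TypedDict


class TrackedFileInfo(TypedDict, total=False):
    """Hypothetical shape of one get_tracked_files() value; types are assumed."""

    size: int
    type: str
    type_mime: str
    extension: str
    is_archive: bool
    extracted_size: int  # documented for archives only
    overall_compression_ratio: float  # documented for archives only


def describe(file_hash: str, info: TrackedFileInfo) -> str:
    """Format one tracked file as a single summary line."""
    kind = "archive" if info.get("is_archive") else "file"
    mime = info.get("type_mime", "unknown")
    return f"{file_hash}: {kind}, {info.get('size', 0)} bytes, {mime}"


# Made-up entry for illustration; real values come from get_tracked_files().
sample: TrackedFileInfo = {
    "size": 1024,
    "type_mime": "application/zip",
    "is_archive": True,
}
print(describe("abc123", sample))
```

Using total=False keeps the archive-only fields optional, so the same type covers regular files and archives.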
get_warnings() -> list[str]¶
Retrieve accumulated warnings from the extraction process.
Returns: List of warning messages
Example:
warnings = extractor.get_warnings()
for warning in warnings:
    print(f"Warning: {warning}")
get_default_settings() -> dict¶
Get the default settings from the config module.
Returns: Dictionary of default settings
Example:
defaults = extractor.get_default_settings()
print(f"Default max archive size: {defaults['MAX_ARCHIVE_SIZE_BYTES']}")
apply_options(option_list: dict[str, str | int | float | bool])¶
Apply a dict of settings options.
Parameters:
option_list: Dictionary of options to apply
Example:
extractor.apply_options({
    "MAX_ARCHIVE_SIZE_BYTES": "5000000000",
    "MIN_ARCHIVE_RATIO": "0.01",
})
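Before calling apply_options, it can help to reject misspelled option names. The following is a minimal sketch: check_options is a hypothetical helper, not part of Archae, and it assumes that the keys returned by get_default_settings() are the valid option names. The defaults shown are placeholder values.

```python
def check_options(defaults: dict, options: dict) -> dict:
    """Return options unchanged if every key exists in defaults, else raise."""
    unknown = sorted(set(options) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown option(s): {', '.join(unknown)}")
    return options


# Placeholder defaults; in real use pass extractor.get_default_settings().
defaults = {"MAX_ARCHIVE_SIZE_BYTES": 0, "MIN_ARCHIVE_RATIO": 0.0}
options = check_options(defaults, {"MIN_ARCHIVE_RATIO": "0.01"})
# extractor.apply_options(options)  # requires an ArchiveExtractor instance
```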
Configuration¶
Archae’s behavior can be configured via the settings system. The same configuration options available in the CLI are accessible when using the library:
from archae.config import settings
# View current settings
print(settings["MAX_ARCHIVE_SIZE_BYTES"])
print(settings["MAX_TOTAL_SIZE_BYTES"])
print(settings["MIN_ARCHIVE_RATIO"])
print(settings["MIN_DISK_FREE_SPACE"])
For details on these settings and how to configure them, see the CLI Reference.
Advanced Examples¶
Processing Multiple Archives¶
from pathlib import Path
from archae.extractor import ArchiveExtractor
extractor = ArchiveExtractor(base_dir=Path("./output"))
archives = Path(".").glob("*.zip")
for archive in archives:
    print(f"Processing {archive}")
    try:
        extractor.handle_file(archive)
    except Exception as e:
        print(f"Error processing {archive}: {e}")
# Get results
tracked = extractor.get_tracked_files()
print(f"Total files tracked: {len(tracked)}")
Analyzing Extraction Results¶
from archae.extractor import ArchiveExtractor
from pathlib import Path
extractor = ArchiveExtractor()
extractor.handle_file(Path("complex_archive.tar.gz"))
tracked = extractor.get_tracked_files()
# Find all archives
archives = [
    (hash_, meta)
    for hash_, meta in tracked.items()
    if meta.get("is_archive")
]
print(f"Found {len(archives)} nested archives")
# Find files with high compression ratio (potential bombs)
for file_hash, metadata in tracked.items():
    if metadata.get("is_archive"):
        ratio = metadata.get("overall_compression_ratio", 0)
        if ratio > 100:  # 100x compression
            print(f"Highly compressed archive: {file_hash}")
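An expansion factor can also be computed directly from the size and extracted_size fields, which avoids depending on how overall_compression_ratio is scaled. A minimal sketch: expansion_factor and rank_archives are hypothetical helpers, the sample data is made up, and both fields are assumed to be byte counts.

```python
def expansion_factor(metadata: dict) -> float:
    """Extracted bytes per stored byte; 0.0 if either field is missing or zero."""
    size = metadata.get("size") or 0
    extracted = metadata.get("extracted_size") or 0
    if size <= 0 or extracted <= 0:
        return 0.0
    return extracted / size


def rank_archives(tracked: dict) -> list:
    """Return (hash, factor) pairs for archives, largest expansion first."""
    pairs = [
        (file_hash, expansion_factor(meta))
        for file_hash, meta in tracked.items()
        if meta.get("is_archive")
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up data in the documented shape of get_tracked_files().
tracked = {
    "aaa": {"size": 100, "extracted_size": 50_000, "is_archive": True},
    "bbb": {"size": 200, "extracted_size": 400, "is_archive": True},
    "ccc": {"size": 300, "is_archive": False},
}
for file_hash, factor in rank_archives(tracked):
    print(f"{file_hash}: {factor:.1f}x")
```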
Getting File Statistics¶
from archae.extractor import ArchiveExtractor
from pathlib import Path
extractor = ArchiveExtractor()
extractor.handle_file(Path("archive.zip"))
tracked = extractor.get_tracked_files()
# Calculate total extracted size
total_size = sum(f.get("size", 0) for f in tracked.values())
print(f"Total extracted size: {total_size} bytes")
# Count files by type
types = {}
for metadata in tracked.values():
    file_type = metadata.get("type", "unknown")
    types[file_type] = types.get(file_type, 0) + 1
for file_type, count in sorted(types.items(), key=lambda x: x[1], reverse=True):
    print(f"{file_type}: {count} files")
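Sizes can be broken down in the same way as counts. The sketch below sums the size field per MIME type with collections.defaultdict. The helper name size_by_mime is hypothetical and the sample data is made up to match the documented metadata shape.

```python
from collections import defaultdict


def size_by_mime(tracked: dict) -> dict:
    """Sum the 'size' field of tracked files, grouped by 'type_mime'."""
    totals = defaultdict(int)
    for metadata in tracked.values():
        totals[metadata.get("type_mime", "unknown")] += metadata.get("size", 0)
    return dict(totals)


# Made-up data; in real use pass extractor.get_tracked_files().
tracked = {
    "h1": {"size": 10, "type_mime": "text/plain"},
    "h2": {"size": 30, "type_mime": "text/plain"},
    "h3": {"size": 500, "type_mime": "application/zip"},
    "h4": {"size": 7},
}
for mime, total in sorted(size_by_mime(tracked).items()):
    print(f"{mime}: {total} bytes")
```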
Error Handling¶
The extractor includes built-in protections against archive bombs and other issues. Check warnings after processing:
from archae.extractor import ArchiveExtractor
from pathlib import Path
extractor = ArchiveExtractor()
extractor.handle_file(Path("suspicious.zip"))
# Get any warnings about skipped archives
warnings = extractor.get_warnings()
if warnings:
    print("Processing completed with warnings:")
    for warning in warnings:
        print(f"  {warning}")
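In longer-running scripts it is often convenient to route these warnings through the standard logging module instead of print. A minimal sketch: log_warnings is a hypothetical helper, the logger name is arbitrary, and each warning is converted with str() so the code makes no assumption about the warning's type.

```python
import logging

logger = logging.getLogger("archae_example")  # arbitrary logger name


def log_warnings(warnings: list) -> int:
    """Log each warning at WARNING level and return how many were logged."""
    for warning in warnings:
        logger.warning("%s", str(warning))
    return len(warnings)


# Placeholder messages; in real use pass extractor.get_warnings().
count = log_warnings(["skipped nested.zip", "skipped big.tar"])
print(f"{count} warning(s) logged")
```

Returning the count lets a caller decide whether a run with warnings should be treated as a failure.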
Notes¶
Archives are automatically detected based on file MIME type and extension
The extractor uses libmagic for accurate file type detection
By default, an extracted/ directory is created in the base directory
The extractor is not thread-safe and should process one file at a time
Warnings are accumulated during processing and can be retrieved after completion
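Because the extractor is not thread-safe, code that receives files from several threads needs to serialize its calls. One way to do that is sketched below, under assumptions: LockedExtractor is a hypothetical wrapper, not part of Archae, and FakeExtractor is a stand-in that exists only so the example runs without Archae installed.

```python
import threading
from pathlib import Path


class LockedExtractor:
    """Hypothetical wrapper: only one thread at a time may call handle_file."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._lock = threading.Lock()

    def handle_file(self, file_path: Path) -> None:
        with self._lock:
            self._extractor.handle_file(file_path)


class FakeExtractor:
    """Stand-in for ArchiveExtractor so the sketch runs on its own."""

    def __init__(self):
        self.seen = []

    def handle_file(self, file_path: Path) -> None:
        self.seen.append(file_path)


fake = FakeExtractor()
locked = LockedExtractor(fake)  # in real use, wrap an ArchiveExtractor
threads = [
    threading.Thread(target=locked.handle_file, args=(Path(f"a{i}.zip"),))
    for i in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(f"Handled {len(fake.seen)} files")
```

Reads such as get_tracked_files() would need the same lock if they can run while another thread is still extracting.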