Using Archae as a Python Library

While Archae is primarily designed as a command-line tool, it can also be used as a Python library to programmatically extract and analyze archives in your own code.

Installation

To use Archae as a library, install it with pip or your preferred package manager:

uv pip install archae

Basic Usage

The main entry point for library usage is the ArchiveExtractor class from archae.extractor.

Simple Example

from pathlib import Path
from archae.extractor import ArchiveExtractor

# Create an extractor instance
extractor = ArchiveExtractor(extract_dir=Path("."))

# Process an archive
archive_path = Path("my_archive.zip")
extractor.handle_file(archive_path)

# Retrieve extracted files information
tracked_files = extractor.get_tracked_files()
for file_hash, file_info in tracked_files.items():
    print(f"Hash: {file_hash}")
    print(f"Info: {file_info}")

Core Classes and Methods

ArchiveExtractor

The main class for handling archive extraction and file tracking.

Constructor

ArchiveExtractor(base_dir: Path | None = None)

Parameters:

  • base_dir (optional): The base directory for extraction operations. Defaults to the current working directory. An extracted/ subdirectory will be created here to store extracted files.

Key Methods

handle_file(file_path: Path)

Process a file or archive. Recursively extracts archives and tracks all files.

Parameters:

  • file_path: Path to the file or archive to process

Example:

extractor.handle_file(Path("archive.tar.gz"))
get_tracked_files() -> dict[str, dict]

Retrieve information about all tracked files.

Returns: Dictionary mapping file hashes to their metadata including:

  • size: File size in bytes

  • type: File type (from libmagic)

  • type_mime: MIME type

  • extension: File extension

  • is_archive: Whether the file is an archive

  • extracted_size: Uncompressed size (for archives)

  • overall_compression_ratio: Compression ratio (for archives)

Example:

files = extractor.get_tracked_files()
for file_hash, metadata in files.items():
    print(f"{file_hash}: {metadata['size']} bytes")
get_warnings() -> list[str]

Retrieve accumulated warnings from the extraction process.

Returns: List of warning messages

Example:

warnings = extractor.get_warnings()
for warning in warnings:
    print(f"Warning: {warning}")
get_default_settings() -> dict

Get the default settings from the config module.

Returns: Dictionary of default settings

Example:

defaults = extractor.get_default_settings()
print(f"Default max archive size: {defaults['MAX_ARCHIVE_SIZE_BYTES']}")
apply_options(option_list: dict[str, str | int | float | bool]])

Apply a dict of settings options.

Parameters:

  • option_list: Dictionary of options to apply

Example:

extractor.apply_options([
    ("MAX_ARCHIVE_SIZE_BYTES", "5000000000"),
    ("MIN_ARCHIVE_RATIO", "0.01")
])

Configuration

Archae’s behavior can be configured via the settings system. The same configuration options available in the CLI are accessible when using the library:

from archae.config import settings

# View current settings
print(settings["MAX_ARCHIVE_SIZE_BYTES"])
print(settings["MAX_TOTAL_SIZE_BYTES"])
print(settings["MIN_ARCHIVE_RATIO"])
print(settings["MIN_DISK_FREE_SPACE"])

For details on these settings and how to configure them, see the CLI Reference.

Advanced Examples

Processing Multiple Archives

from pathlib import Path
from archae.extractor import ArchiveExtractor

extractor = ArchiveExtractor(base_dir=Path("./output"))

archives = Path(".").glob("*.zip")
for archive in archives:
    print(f"Processing {archive}")
    try:
        extractor.handle_file(archive)
    except Exception as e:
        print(f"Error processing {archive}: {e}")

# Get results
tracked = extractor.get_tracked_files()
print(f"Total files tracked: {len(tracked)}")

Analyzing Extraction Results

from archae.extractor import ArchiveExtractor
from pathlib import Path

extractor = ArchiveExtractor()
extractor.handle_file(Path("complex_archive.tar.gz"))

tracked = extractor.get_tracked_files()

# Find all archives
archives = [
    (hash_, meta)
    for hash_, meta in tracked.items()
    if meta.get("is_archive")
]

print(f"Found {len(archives)} nested archives")

# Find files with high compression ratio (potential bombs)
for file_hash, metadata in tracked.items():
    if metadata.get("is_archive"):
        ratio = metadata.get("overall_compression_ratio", 0)
        if ratio > 100:  # 100x compression
            print(f"Highly compressed archive: {file_hash}")

Getting File Statistics

from archae.extractor import ArchiveExtractor
from pathlib import Path

extractor = ArchiveExtractor()
extractor.handle_file(Path("archive.zip"))

tracked = extractor.get_tracked_files()

# Calculate total extracted size
total_size = sum(f.get("size", 0) for f in tracked.values())
print(f"Total extracted size: {total_size} bytes")

# Count files by type
types = {}
for metadata in tracked.values():
    file_type = metadata.get("type", "unknown")
    types[file_type] = types.get(file_type, 0) + 1

for file_type, count in sorted(types.items(), key=lambda x: x[1], reverse=True):
    print(f"{file_type}: {count} files")

Error Handling

The extractor includes built-in protections against archive bombs and other issues. Check warnings after processing:

from archae.extractor import ArchiveExtractor
from pathlib import Path

extractor = ArchiveExtractor()
extractor.handle_file(Path("suspicious.zip"))

# Get any warnings about skipped archives
warnings = extractor.get_warnings()
if warnings:
    print("Processing completed with warnings:")
    for warning in warnings:
        print(f" {warning.warning_type} - {warning.message}")

Notes

  • Archives are automatically detected based on file MIME type and extension

  • The extractor uses libmagic for accurate file type detection

  • By default, an extracted/ directory is created in the base directory

  • The extractor is not thread-safe and should be used with one file at a time

  • Warnings are accumulated during processing and can be retrieved after completion