# Implementation Details

Technical implementation details for advanced feed fetching and analytics.

## Overview

This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0.
## Architecture

The enhanced system consists of three main components:

```
Feed URL → AdvancedFeedFetcher → FeedMetadata + Items
                    ↓
             DatabaseManager
                    ↓
              FeedAnalytics
                    ↓
              CLI Commands
```

## Core Components

### 1. Advanced Feed Fetcher

**Location:** `packages/ai_web_feeds/src/ai_web_feeds/fetcher.py` (820 lines)

A sophisticated feed fetching system that extracts exhaustive metadata from RSS/Atom/JSON feeds.
#### Key Features

##### 100+ Metadata Fields

The fetcher extracts comprehensive metadata organized in categories:

**Basic Feed Information:**
- Title, subtitle, description
- Homepage link
- Language and copyright
- Generator information

**Author/Publisher Data:**
- Author name and email
- Publisher information
- Managing editor
- Webmaster contact

**Visual Assets:**
- Feed images (URL, title, link)
- Logo and icon URLs
- Dimensions and alt text

**Technical Metadata:**
- TTL (Time To Live)
- Skip hours and skip days
- Cloud configuration
- PubSubHubbub hub URLs

**Content Statistics:**
- Total item count
- Items with full content
- Items with authors
- Items with enclosures/media
- Average title/description/content lengths
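The content statistics reduce to a single pass over the parsed items. The sketch below is illustrative only: it uses plain dicts with assumed field names (`title`, `content`, `author`, `enclosures`) rather than the real item model.

```python
from statistics import mean

def content_statistics(items):
    """Summarize parsed feed items (plain dicts for illustration)."""
    if not items:
        return {"total_items": 0}
    return {
        "total_items": len(items),
        "items_with_content": sum(1 for i in items if i.get("content")),
        "items_with_authors": sum(1 for i in items if i.get("author")),
        "items_with_enclosures": sum(1 for i in items if i.get("enclosures")),
        "avg_title_length": mean(len(i.get("title", "")) for i in items),
    }
```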
##### Three-Dimensional Quality Scoring

Each feed receives scores (0-1) across three dimensions:

**1. Completeness Score**

Measures how complete the feed metadata is:

- ✅ Has title
- ✅ Has description
- ✅ Has link
- ✅ Has language
- ✅ Has timestamps
- ✅ Has author/publisher
- ✅ Has categories
- ✅ Has image/logo

```python
# Example calculation: one point per check, normalized to 0-1
completeness = sum([
    bool(feed.title),        # 1/8
    bool(feed.description),  # 1/8
    bool(feed.link),         # 1/8
    bool(feed.language),     # 1/8
    # ... remaining four checks, one per item above
]) / 8.0
```

**2. Richness Score**
Measures content quality and depth:
- Items have content
- Content coverage percentage
- Author attribution
- Average content length
- Full content availability
- Media/images present
**3. Structure Score**
Measures feed structure quality:
- No parsing errors
- Has items
- Items have GUIDs
- Has timestamps
- Has links
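The structure checks can be scored the same way as completeness. This is a hedged sketch, not the real `_calculate_*()` helper: the equal weights and the plain-dict field names are assumptions.

```python
def structure_score(feed, items):
    """Score feed structure 0-1 from the checks listed above (equal weights assumed)."""
    has_items = bool(items)
    checks = [
        not feed.get("parse_errors"),                          # no parsing errors
        has_items,                                             # has items
        has_items and all(i.get("guid") for i in items),       # items have GUIDs
        has_items and all(i.get("published") for i in items),  # items have timestamps
        has_items and all(i.get("link") for i in items),       # items have links
    ]
    return sum(checks) / len(checks)
```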
##### Publishing Frequency Detection
Automatically analyzes item publication patterns to estimate update frequency:
| Frequency | Pattern |
|---|---|
| Hourly | New items every hour or less |
| Daily | New items published daily |
| Weekly | Weekly publication schedule |
| Monthly | Monthly updates |
| Infrequent | Longer intervals between posts |
```python
# Algorithm outline
from statistics import median

def estimate_update_frequency(items):
    if not items or len(items) < 2:
        return "unknown"
    # Calculate time between consecutive publications, in seconds
    published = sorted(item.published for item in items)
    intervals = [
        (later - earlier).total_seconds()
        for earlier, later in zip(published, published[1:])
    ]
    avg_interval = median(intervals)
    # Classify based on the typical interval
    if avg_interval < 3600:        # < 1 hour
        return "hourly"
    elif avg_interval < 86400:     # < 1 day
        return "daily"
    elif avg_interval < 604800:    # < 1 week
        return "weekly"
    elif avg_interval < 2592000:   # < ~1 month
        return "monthly"
    return "infrequent"
```

##### Extension Support
Full support for popular RSS extensions:

**iTunes Podcast Metadata:**
- Author, owner, categories
- Explicit flag
- Episode information
- Artwork URLs

**Dublin Core Metadata:**
- Contributor, coverage
- Creator, date
- Format, identifier
- Rights, source

**Media RSS:**
- Thumbnails with dimensions
- Media content
- Keywords and descriptions
- Credit information

**GeoRSS:**
- Location coordinates
- Geographic regions
- Place names
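Extension fields travel with each item in its `extra_data` JSON column. The key layout in this sketch is a guess for illustration, not the exact schema the fetcher produces:

```python
# Hypothetical extra_data payload for a podcast item; the key names
# ("itunes", "media_thumbnail", "georss_point") are illustrative only.
extra_data = {
    "itunes": {"author": "Acme Podcasts", "explicit": False},
    "media_thumbnail": {"url": "https://example.com/t.jpg", "width": 150},
    "georss_point": "37.78 -122.42",
}

# Read extension values defensively, since not every feed provides them
itunes_author = extra_data.get("itunes", {}).get("author")
lat, lon = (float(c) for c in extra_data["georss_point"].split())
```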
#### Usage Example

```python
from ai_web_feeds.fetcher import AdvancedFeedFetcher
from ai_web_feeds.storage import DatabaseManager

# Initialize
db = DatabaseManager("sqlite:///data/aiwebfeeds.db")
fetcher = AdvancedFeedFetcher()

# Fetch feed
fetch_log, metadata, items = await fetcher.fetch_feed(
    "https://example.com/feed.xml"
)

# Access quality scores
print(f"Completeness: {metadata.completeness_score:.2f}")
print(f"Richness: {metadata.richness_score:.2f}")
print(f"Structure: {metadata.structure_score:.2f}")

# Access metadata
print(f"Update frequency: {metadata.estimated_update_frequency}")
print(f"Total items: {metadata.total_items}")
print(f"Found {len(items)} items")

# Save to database
session = db.get_session()
session.add(fetch_log)
session.commit()
```

#### Conditional Requests
The fetcher supports conditional HTTP requests to reduce bandwidth:

```python
# Use ETag and Last-Modified from previous fetch
fetch_log, metadata, items = await fetcher.fetch_feed(
    url="https://example.com/feed.xml",
    etag="33a64df551425fcc55e4d42a148795d9f25f89d4",
    last_modified="Wed, 15 Nov 2023 12:00:00 GMT"
)

# Returns 304 Not Modified if the feed hasn't changed
if fetch_log.status_code == 304:
    print("Feed unchanged")
```

#### Retry Logic
Built-in exponential backoff for transient failures:

```python
# Automatic retries (configured via tenacity)
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url):
    # Will retry up to 3 times, with exponentially
    # growing waits bounded between 2s and 10s
    ...
```

### 2. Analytics Engine

**Location:** `packages/ai_web_feeds/src/ai_web_feeds/analytics.py` (600 lines)

A comprehensive analytics engine providing 8 different analytical views of feed data.
#### Generate Full Report

```python
# Export everything to JSON
report = analytics.generate_full_report()

# Save to file
import json
with open("analytics.json", "w") as f:
    json.dump(report, f, indent=2)

# Report includes all 8 analytics views
```

### 3. CLI Commands

#### Fetch Commands

**Location:** `apps/cli/ai_web_feeds/cli/commands/fetch.py` (200 lines)

##### Fetch Single Feed

```bash
ai-web-feeds fetch one <feed-id> [--metadata]
```

Fetches a single feed with optional metadata display:
```bash
# Basic fetch
ai-web-feeds fetch one openai-blog

# With detailed metadata
ai-web-feeds fetch one openai-blog --metadata
```

Features:
- Progress indicator
- Error reporting
- Quality scores display
- Metadata summary table
##### Fetch All Feeds

```bash
ai-web-feeds fetch all [--limit N] [--verified-only]
```

Batch fetch with progress tracking:

```bash
# Fetch all feeds
ai-web-feeds fetch all

# Fetch first 10 feeds
ai-web-feeds fetch all --limit 10

# Fetch only verified feeds
ai-web-feeds fetch all --verified-only
```

Features:
- Rich progress bar
- Real-time stats
- Error summary table
- Success/failure counts
#### Analytics Commands

**Location:** `apps/cli/ai_web_feeds/cli/commands/analytics.py` (400 lines)

##### Overview Dashboard

```bash
ai-web-feeds analytics overview
```

Displays a comprehensive dashboard with:
- Total counts (feeds, items, topics)
- Status distribution
- Recent activity (24h)

##### Distributions

```bash
ai-web-feeds analytics distributions [--limit N]
```

Shows distributions across:
- Source types
- Content mediums
- Topics
- Languages

##### Quality Metrics

```bash
ai-web-feeds analytics quality
```

Quality assessment with:
- Average scores
- Quality distribution
- High/low quality counts

##### Performance Tracking

```bash
ai-web-feeds analytics performance [--days N]
```

Fetch performance metrics:
- Success/failure rates
- Average durations
- Error distribution
- HTTP status codes
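Aggregating these metrics is a straightforward fold over the fetch logs. A minimal sketch, using plain dicts in place of `FeedFetchLog` rows and assumed field names (`success`, `duration`, `status_code`):

```python
from collections import Counter

def fetch_performance(logs):
    """Aggregate success rate, mean duration, and status-code counts."""
    total = len(logs)
    if not total:
        return {"success_rate": 0.0, "avg_duration_s": 0.0, "status_codes": Counter()}
    return {
        "success_rate": sum(1 for log in logs if log.get("success")) / total,
        "avg_duration_s": sum(log.get("duration", 0.0) for log in logs) / total,
        "status_codes": Counter(log.get("status_code") for log in logs),
    }
```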
##### Content Statistics

```bash
ai-web-feeds analytics content
```

Content analysis:
- Total items
- Coverage metrics
- Top categories

##### Publishing Trends

```bash
ai-web-feeds analytics trends [--days N]
```

Publishing patterns:
- Items per day
- Hourly distribution
- Weekday patterns
- Peak times

##### Feed Health

```bash
ai-web-feeds analytics health <feed-id>
```

Per-feed health report with diagnostics and recommendations.

##### Top Contributors

```bash
ai-web-feeds analytics contributors [--limit N]
```

Contributor leaderboard with verification rates.

##### Generate Report

```bash
ai-web-feeds analytics report [--output FILE]
```

Export a comprehensive JSON report.
## Database Schema

The enhanced system uses the existing database schema, making full use of its flexible JSON columns.

### FeedFetchLog Enhancements

```python
class FeedFetchLog(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Complete HTTP headers
    # - Detailed error information
    # - Item statistics
    # - Quality scores
    # - Extension metadata
```

### FeedItem Enhancements

```python
class FeedItem(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Extension metadata (iTunes, Media RSS, etc.)
    # - Multiple categories
    # - Enclosure metadata
    # - Author details
```

## Dependencies
### New Dependencies Added

#### Core Library Dependencies

File: `packages/ai_web_feeds/pyproject.toml`

```toml
dependencies = [
    # ... existing ...
    "beautifulsoup4>=4.12.0",  # NEW: HTML parsing
]
```

Purpose:
- HTML parsing for feed discovery
- Extracting feed URLs from web pages
- Parsing HTML content in feed items

#### CLI Tool Dependencies

File: `apps/cli/pyproject.toml`

```toml
dependencies = [
    # ... existing ...
    "rich>=13.7.0",  # NEW: Rich terminal output
]
```

Purpose:
- Beautiful terminal tables
- Progress bars and spinners
- Colored output and styling
- Markdown rendering in terminal
## Performance Considerations

### Conditional Requests

Reduce bandwidth and processing for unchanged feeds:

```python
# Store from previous fetch
etag = fetch_log.etag
last_modified = fetch_log.last_modified

# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    etag=etag,
    last_modified=last_modified
)

# Server returns 304 Not Modified if unchanged
if new_log.status_code == 304:
    # No processing needed
    return
```

### Retry Logic

Exponential backoff for reliability:

```python
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential
)

@retry(
    stop=stop_after_attempt(3),  # Max 3 attempts
    wait=wait_exponential(
        multiplier=1,
        min=2,   # Wait 2s after first failure
        max=10   # Wait max 10s
    )
)
async def fetch_with_retry(url):
    # Automatic retry on failure
    ...
```

### Timeouts

Prevent hanging on slow feeds:

```python
# Configurable timeout (default 30s)
fetcher = AdvancedFeedFetcher(timeout=30.0)

# Per-request timeout
fetch_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    timeout=60.0  # Override for a slow feed
)
```

## Best Practices
### Use Conditional Requests

Always pass `etag` and `last_modified` from previous fetches to reduce bandwidth:

```python
# Save from previous fetch
session.add(fetch_log)

# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=url,
    etag=fetch_log.etag,
    last_modified=fetch_log.last_modified
)
```

### Respect TTL Values

Honor the feed's TTL (Time To Live) when scheduling updates:

```python
from datetime import datetime, timedelta

if metadata.ttl:
    # Wait TTL minutes before the next fetch
    next_fetch = datetime.now() + timedelta(minutes=metadata.ttl)
```

### Monitor Health Regularly

Check feed health scores to identify issues:

```bash
# Daily health check
ai-web-feeds analytics health openai-blog

# Weekly full report
ai-web-feeds analytics report --output weekly-report.json
```

### Track Trends

Use analytics to identify patterns:

```bash
# Monthly trend analysis
ai-web-feeds analytics trends --days 30

# Quality monitoring
ai-web-feeds analytics quality
```

### Generate Periodic Reports

Export analytics for monitoring:

```bash
# Weekly reports
ai-web-feeds analytics report --output reports/week-$(date +%U).json

# Archive for historical analysis
```

## Installation
### Quick Setup Script

Use the automated setup script:

```bash
# Make executable
chmod +x setup-enhanced-features.sh

# Run setup
./setup-enhanced-features.sh
```

The script will:
- Install the core library with dependencies
- Install the CLI tool with dependencies
- Verify the installation
- Display next steps

### Manual Installation

Install each component separately:

```bash
# 1. Install core library
cd packages/ai_web_feeds
pip install -e .

# 2. Install CLI tool
cd ../../apps/cli
pip install -e .

# 3. Verify installation
ai-web-feeds --version
ai-web-feeds fetch --help
ai-web-feeds analytics --help
```

## Code Organization
```
packages/ai_web_feeds/src/ai_web_feeds/
├── fetcher.py              # AdvancedFeedFetcher class
│   ├── FeedMetadata        # Metadata container (100+ fields)
│   ├── fetch_feed()        # Main fetch method
│   ├── _extract_*()        # Extraction helpers
│   └── _calculate_*()      # Quality scoring
│
└── analytics.py            # FeedAnalytics class
    ├── get_overview_stats()
    ├── get_*_distribution()
    ├── get_quality_metrics()
    ├── get_fetch_performance_stats()
    ├── get_content_statistics()
    ├── get_publishing_trends()
    ├── get_feed_health_report()
    ├── get_top_contributors()
    └── generate_full_report()

apps/cli/ai_web_feeds/cli/commands/
├── fetch.py                # Fetch CLI commands
│   ├── fetch_one()         # Single feed fetch
│   └── fetch_all()         # Batch fetch
│
└── analytics.py            # Analytics CLI commands
    ├── show_overview()
    ├── show_distributions()
    ├── show_quality()
    ├── show_performance()
    ├── show_content()
    ├── show_trends()
    ├── show_health()
    ├── show_contributors()
    └── generate_report()
```

## Future Enhancements
Potential additions for future versions:
- Web UI dashboard with real-time metrics
- Machine learning for content classification
- Real-time monitoring with webhooks
- GraphQL API for analytics
- Advanced deduplication algorithms
- Content similarity analysis
- Multi-language NLP support
- Anomaly detection in publishing patterns
- Automated quality recommendations
## Support

For technical questions or issues:
- Review this documentation
- Check the inline code documentation
- Explore the CLI help: `ai-web-feeds --help`
- Open an issue on GitHub
## Related Documentation
- Feature Overview - High-level feature list
- Getting Started - Setup and quickstart
- Analytics Guide - Analytics usage guide