AI Web Feeds

Implementation Details

Technical implementation details for advanced feed fetching and analytics

Overview

This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0.

This is the first version of these capabilities, designed from scratch for optimal performance and extensibility.

Architecture

The enhanced system consists of three main components:

Feed URL → AdvancedFeedFetcher → FeedMetadata + Items
                  │
                  ▼
           DatabaseManager
                  │
                  ▼
           FeedAnalytics
                  │
                  ▼
           CLI Commands

Core Components

1. Advanced Feed Fetcher

Location: packages/ai_web_feeds/src/ai_web_feeds/fetcher.py (820 lines)

A sophisticated feed fetching system that extracts exhaustive metadata from RSS/Atom/JSON feeds.

Key Features

100+ Metadata Fields

The fetcher extracts comprehensive metadata organized into the following categories:

Basic Feed Information:

  • Title, subtitle, description
  • Homepage link
  • Language and copyright
  • Generator information

Author/Publisher Data:

  • Author name and email
  • Publisher information
  • Managing editor
  • Webmaster contact

Visual Assets:

  • Feed images (URL, title, link)
  • Logo and icon URLs
  • Dimensions and alt text

Technical Metadata:

  • TTL (Time To Live)
  • Skip hours and skip days
  • Cloud configuration
  • PubSubHubbub hub URLs

Content Statistics:

  • Total item count
  • Items with full content
  • Items with authors
  • Items with enclosures/media
  • Average title/description/content lengths
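
As a rough illustration, these statistics can be gathered in a single pass over the parsed items. The dictionary item shape and key names below are assumptions for the example, not the library's actual data model:

```python
def content_statistics(items):
    # One-pass rollup of the statistics listed above (illustrative sketch)
    n = len(items)
    titles = [item["title"] for item in items if item.get("title")]
    return {
        "total_items": n,
        "items_with_content": sum(1 for item in items if item.get("content")),
        "items_with_authors": sum(1 for item in items if item.get("author")),
        "items_with_enclosures": sum(1 for item in items if item.get("enclosures")),
        "avg_title_length": sum(len(t) for t in titles) / len(titles) if titles else 0.0,
    }
```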

Three-Dimensional Quality Scoring

Each feed receives scores (0-1) across three dimensions:

1. Completeness Score

Measures how complete the feed metadata is:

  • ✅ Has title
  • ✅ Has description
  • ✅ Has link
  • ✅ Has language
  • ✅ Has timestamps
  • ✅ Has author/publisher
  • ✅ Has categories
  • ✅ Has image/logo

# Example calculation
completeness = sum([
    bool(feed.title),      # 1/8
    bool(feed.description), # 1/8
    bool(feed.link),       # 1/8
    bool(feed.language),   # 1/8
    # ... etc
]) / 8.0

2. Richness Score

Measures content quality and depth:

  • Items have content
  • Content coverage percentage
  • Author attribution
  • Average content length
  • Full content availability
  • Media/images present
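
A minimal sketch of how such signals could be folded into a 0-1 score, assuming items are plain dictionaries; the actual scorer in fetcher.py weighs more signals (e.g. average content length):

```python
def richness_score(items):
    # Average three coverage ratios into a 0-1 score (illustrative only)
    if not items:
        return 0.0
    n = len(items)
    content_ratio = sum(1 for item in items if item.get("content")) / n
    author_ratio = sum(1 for item in items if item.get("author")) / n
    media_ratio = sum(1 for item in items if item.get("enclosures")) / n
    return (content_ratio + author_ratio + media_ratio) / 3.0
```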

3. Structure Score

Measures feed structure quality:

  • No parsing errors
  • Has items
  • Items have GUIDs
  • Has timestamps
  • Has links

Publishing Frequency Detection

Automatically analyzes item publication patterns to estimate update frequency:

Frequency     Pattern
Hourly        New items every hour or less
Daily         New items published daily
Weekly        Weekly publication schedule
Monthly       Monthly updates
Infrequent    Longer intervals between posts

# Algorithm outline
from statistics import median

def estimate_update_frequency(items):
    if not items or len(items) < 2:
        return "unknown"

    # Time in seconds between consecutive publications
    # (items are assumed to expose a `published` datetime)
    timestamps = sorted(item.published for item in items)
    intervals = [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    typical_interval = median(intervals)

    # Classify based on the typical interval
    if typical_interval < 3600:        # < 1 hour
        return "hourly"
    elif typical_interval < 86400:     # < 1 day
        return "daily"
    elif typical_interval < 604800:    # < 1 week
        return "weekly"
    elif typical_interval < 2592000:   # < 30 days
        return "monthly"
    return "infrequent"

Extension Support

Full support for popular RSS extensions:

iTunes Podcast Metadata:

  • Author, owner, categories
  • Explicit flag
  • Episode information
  • Artwork URLs

Dublin Core Metadata:

  • Contributor, coverage
  • Creator, date
  • Format, identifier
  • Rights, source

Media RSS:

  • Thumbnails with dimensions
  • Media content
  • Keywords and descriptions
  • Credit information

GeoRSS:

  • Location coordinates
  • Geographic regions
  • Place names
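
Extension metadata travels with each item in its extra_data JSON column (see the database schema section below). The key names here ("itunes", "media_thumbnail") are illustrative assumptions; inspect the stored extra_data for the exact layout:

```python
# Reading extension metadata back from an item's extra_data JSON column
# (key names are assumed for this example, not a confirmed schema)
extra_data = {
    "itunes": {"author": "Example Podcast", "explicit": False},
    "media_thumbnail": {"url": "https://example.com/thumb.jpg", "width": 640},
}

itunes = extra_data.get("itunes", {})
thumbnail = extra_data.get("media_thumbnail", {})
print(itunes.get("author", "unknown"))  # → Example Podcast
print(thumbnail.get("url", ""))         # falls back to "" when absent
```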

Usage Example

from ai_web_feeds.fetcher import AdvancedFeedFetcher
from ai_web_feeds.storage import DatabaseManager

# Initialize
db = DatabaseManager("sqlite:///data/aiwebfeeds.db")
fetcher = AdvancedFeedFetcher()

# Fetch feed
fetch_log, metadata, items = await fetcher.fetch_feed(
    "https://example.com/feed.xml"
)

# Access quality scores
print(f"Completeness: {metadata.completeness_score:.2f}")
print(f"Richness: {metadata.richness_score:.2f}")
print(f"Structure: {metadata.structure_score:.2f}")

# Access metadata
print(f"Update frequency: {metadata.estimated_update_frequency}")
print(f"Total items: {metadata.total_items}")
print(f"Found {len(items)} items")

# Save to database
session = db.get_session()
session.add(fetch_log)
session.commit()

Conditional Requests

The fetcher supports conditional HTTP requests to reduce bandwidth:

# Use ETag and Last-Modified from previous fetch
fetch_log, metadata, items = await fetcher.fetch_feed(
    url="https://example.com/feed.xml",
    etag="33a64df551425fcc55e4d42a148795d9f25f89d4",
    last_modified="Wed, 15 Nov 2023 12:00:00 GMT"
)

# Returns 304 Not Modified if feed hasn't changed
if fetch_log.status_code == 304:
    print("Feed unchanged")

Retry Logic

Built-in exponential backoff for transient failures:

# Automatic retries (configured via tenacity)
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url):
    # Will retry up to 3 times
    # Waits 2s, 4s, 8s between attempts
    pass

2. Analytics Engine

Location: packages/ai_web_feeds/src/ai_web_feeds/analytics.py (600 lines)

Comprehensive analytics engine providing 8 different analytical views of feed data.

Generate Full Report

# Export everything to JSON
report = analytics.generate_full_report()

# Save to file
import json
with open("analytics.json", "w") as f:
    json.dump(report, f, indent=2)

# Report includes all 8 analytics views

3. CLI Commands

Fetch Commands

Location: apps/cli/ai_web_feeds/cli/commands/fetch.py (200 lines)

Fetch Single Feed

ai-web-feeds fetch one <feed-id> [--metadata]

Fetches a single feed with optional metadata display:

# Basic fetch
ai-web-feeds fetch one openai-blog

# With detailed metadata
ai-web-feeds fetch one openai-blog --metadata

Features:

  • Progress indicator
  • Error reporting
  • Quality scores display
  • Metadata summary table

Fetch All Feeds

ai-web-feeds fetch all [--limit N] [--verified-only]

Batch fetch with progress tracking:

# Fetch all feeds
ai-web-feeds fetch all

# Fetch first 10 feeds
ai-web-feeds fetch all --limit 10

# Fetch only verified feeds
ai-web-feeds fetch all --verified-only

Features:

  • Rich progress bar
  • Real-time stats
  • Error summary table
  • Success/failure counts
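
The loop behind `fetch all` can be sketched roughly as below; the real command wraps this in a rich progress bar, real-time stats, and an error summary table, and `fetcher` is assumed to be an AdvancedFeedFetcher:

```python
import asyncio

async def fetch_all_feeds(fetcher, urls, limit=None):
    # Minimal sketch of the batch loop (illustrative, not the actual command)
    if limit is not None:
        urls = urls[:limit]
    succeeded, failures = 0, []
    for url in urls:
        try:
            await fetcher.fetch_feed(url)
            succeeded += 1
        except Exception as exc:
            # Collect failures for the error summary
            failures.append((url, str(exc)))
    return succeeded, failures
```

For example, `asyncio.run(fetch_all_feeds(fetcher, urls, limit=10))` would mirror `fetch all --limit 10`.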

Analytics Commands

Location: apps/cli/ai_web_feeds/cli/commands/analytics.py (400 lines)

Overview Dashboard

ai-web-feeds analytics overview

Displays comprehensive dashboard with:

  • Total counts (feeds, items, topics)
  • Status distribution
  • Recent activity (24h)

Distributions

ai-web-feeds analytics distributions [--limit N]

Shows distributions across:

  • Source types
  • Content mediums
  • Topics
  • Languages

Quality Metrics

ai-web-feeds analytics quality

Quality assessment with:

  • Average scores
  • Quality distribution
  • High/low quality counts

Performance Tracking

ai-web-feeds analytics performance [--days N]

Fetch performance metrics:

  • Success/failure rates
  • Average durations
  • Error distribution
  • HTTP status codes
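
A hypothetical sketch of the success-rate rollup, assuming fetch logs are dictionaries with "status_code" and "duration" keys (not the actual FeedFetchLog schema):

```python
def fetch_performance(logs):
    # Aggregate success rate and average duration over a set of fetch logs
    total = len(logs)
    if total == 0:
        return {"total": 0, "success_rate": 0.0, "avg_duration": 0.0}
    ok = sum(1 for log in logs if log["status_code"] and log["status_code"] < 400)
    avg_duration = sum(log["duration"] for log in logs) / total
    return {"total": total, "success_rate": ok / total, "avg_duration": avg_duration}
```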

Content Statistics

ai-web-feeds analytics content

Content analysis:

  • Total items
  • Coverage metrics
  • Top categories

Publishing Trends

ai-web-feeds analytics trends [--days N]

Publishing patterns:

  • Items per day
  • Hourly distribution
  • Weekday patterns
  • Peak times
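
The hourly and weekday bucketing behind this view can be sketched with a Counter over publication datetimes; the actual engine aggregates from the database, so this is illustrative only:

```python
from collections import Counter
from datetime import datetime

def publishing_patterns(timestamps):
    # Bucket publication datetimes by hour of day and weekday
    by_hour = Counter(ts.hour for ts in timestamps)
    by_weekday = Counter(ts.strftime("%A") for ts in timestamps)
    peak_hour = max(by_hour, key=by_hour.get) if by_hour else None
    return by_hour, by_weekday, peak_hour

published = [datetime(2023, 11, 13, 9), datetime(2023, 11, 13, 9), datetime(2023, 11, 14, 17)]
by_hour, by_weekday, peak = publishing_patterns(published)
# peak == 9; by_weekday["Monday"] == 2 (2023-11-13 was a Monday)
```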

Feed Health

ai-web-feeds analytics health <feed-id>

Per-feed health report with diagnostics and recommendations.

Top Contributors

ai-web-feeds analytics contributors [--limit N]

Contributor leaderboard with verification rates.

Generate Report

ai-web-feeds analytics report [--output FILE]

Export comprehensive JSON report.

Database Schema

The enhanced system reuses the existing database schema, making full use of its flexible JSON columns:

FeedFetchLog Enhancements

class FeedFetchLog(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Complete HTTP headers
    # - Detailed error information
    # - Item statistics
    # - Quality scores
    # - Extension metadata

FeedItem Enhancements

class FeedItem(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Extension metadata (iTunes, Media RSS, etc.)
    # - Multiple categories
    # - Enclosure metadata
    # - Author details

No migration required: the system leverages the existing flexible JSON columns for maximum compatibility.

Dependencies

New Dependencies Added

Core Library Dependencies

File: packages/ai_web_feeds/pyproject.toml

dependencies = [
    # ... existing ...
    "beautifulsoup4>=4.12.0",  # NEW: HTML parsing
]

Purpose:

  • HTML parsing for feed discovery
  • Extracting feed URLs from web pages
  • Parsing HTML content in feed items

CLI Tool Dependencies

File: apps/cli/pyproject.toml

dependencies = [
    # ... existing ...
    "rich>=13.7.0",  # NEW: Rich terminal output
]

Purpose:

  • Beautiful terminal tables
  • Progress bars and spinners
  • Colored output and styling
  • Markdown rendering in terminal

Performance Considerations

Conditional Requests

Reduce bandwidth and processing for unchanged feeds:

# Store from previous fetch
etag = fetch_log.etag
last_modified = fetch_log.last_modified

# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    etag=etag,
    last_modified=last_modified
)

# Server returns 304 Not Modified if unchanged
if new_log.status_code == 304:
    # No processing needed
    return

Retry Logic

Exponential backoff for reliability:

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential
)

@retry(
    stop=stop_after_attempt(3),  # Max 3 attempts
    wait=wait_exponential(
        multiplier=1,
        min=2,    # Wait 2s after first failure
        max=10    # Wait max 10s
    )
)
async def fetch_with_retry(url):
    # Automatic retry on failure
    pass

Timeouts

Prevent hanging on slow feeds:

# Configurable timeout (default 30s)
fetcher = AdvancedFeedFetcher(timeout=30.0)

# Per-request timeout
fetch_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    timeout=60.0  # Override for slow feed
)

Best Practices

Use Conditional Requests

Always pass etag and last_modified from previous fetches to reduce bandwidth:

# Save from previous fetch
session.add(fetch_log)

# Use in next fetch
new_log = await fetcher.fetch_feed(
    url=url,
    etag=fetch_log.etag,
    last_modified=fetch_log.last_modified
)

Respect TTL Values

Honor feed TTL (Time To Live) for update frequency:

from datetime import datetime, timedelta

if metadata.ttl:
    # Wait TTL minutes before the next fetch
    next_fetch = datetime.now() + timedelta(minutes=metadata.ttl)

Monitor Health Regularly

Check feed health scores to identify issues:

# Daily health check
ai-web-feeds analytics health openai-blog

# Weekly full report
ai-web-feeds analytics report --output weekly-report.json

Track Trends

Use analytics to identify patterns:

# Monthly trend analysis
ai-web-feeds analytics trends --days 30

# Quality monitoring
ai-web-feeds analytics quality

Generate Periodic Reports

Export analytics for monitoring:

# Weekly reports
ai-web-feeds analytics report --output reports/week-$(date +%U).json

# Archive for historical analysis

Installation

Quick Setup Script

Use the automated setup script:

# Make executable
chmod +x setup-enhanced-features.sh

# Run setup
./setup-enhanced-features.sh

The script will:

  1. Install core library with dependencies
  2. Install CLI tool with dependencies
  3. Verify installation
  4. Display next steps

Manual Installation

Install each component separately:

# 1. Install core library
cd packages/ai_web_feeds
pip install -e .

# 2. Install CLI tool
cd ../../apps/cli
pip install -e .

# 3. Verify installation
ai-web-feeds --version
ai-web-feeds fetch --help
ai-web-feeds analytics --help

Code Organization

packages/ai_web_feeds/src/ai_web_feeds/
├── fetcher.py          # AdvancedFeedFetcher class
│   ├── FeedMetadata    # Metadata container (100+ fields)
│   ├── fetch_feed()    # Main fetch method
│   ├── _extract_*()    # Extraction helpers
│   └── _calculate_*()  # Quality scoring

├── analytics.py        # FeedAnalytics class
│   ├── get_overview_stats()
│   ├── get_*_distribution()
│   ├── get_quality_metrics()
│   ├── get_fetch_performance_stats()
│   ├── get_content_statistics()
│   ├── get_publishing_trends()
│   ├── get_feed_health_report()
│   ├── get_top_contributors()
│   └── generate_full_report()

apps/cli/ai_web_feeds/cli/commands/
├── fetch.py            # Fetch CLI commands
│   ├── fetch_one()     # Single feed fetch
│   └── fetch_all()     # Batch fetch

└── analytics.py        # Analytics CLI commands
    ├── show_overview()
    ├── show_distributions()
    ├── show_quality()
    ├── show_performance()
    ├── show_content()
    ├── show_trends()
    ├── show_health()
    ├── show_contributors()
    └── generate_report()

Future Enhancements

Potential additions for future versions:

  • Web UI dashboard with real-time metrics
  • Machine learning for content classification
  • Real-time monitoring with webhooks
  • GraphQL API for analytics
  • Advanced deduplication algorithms
  • Content similarity analysis
  • Multi-language NLP support
  • Anomaly detection in publishing patterns
  • Automated quality recommendations

Support

For technical questions or issues:

  1. Review this documentation
  2. Check inline code documentation
  3. Explore CLI help: ai-web-feeds --help
  4. Open an issue on GitHub