# Implementation Details

Technical implementation details for advanced feed fetching and analytics.

## Overview

This document describes the technical implementation of the comprehensive feed fetching and analytics system added to AI Web Feeds in version 1.0.
## Architecture

The enhanced system consists of three main components:

```
Feed URL → AdvancedFeedFetcher → FeedMetadata + Items
                    ↓
             DatabaseManager
                    ↓
              FeedAnalytics
                    ↓
              CLI Commands
```

## Core Components

### 1. Advanced Feed Fetcher

**Location:** `packages/ai_web_feeds/src/ai_web_feeds/fetcher.py` (820 lines)

A sophisticated feed fetching system that extracts exhaustive metadata from RSS/Atom/JSON feeds.
#### Key Features

##### 100+ Metadata Fields

The fetcher extracts comprehensive metadata organized in categories:

**Basic Feed Information:**
- Title, subtitle, description
- Homepage link
- Language and copyright
- Generator information

**Author/Publisher Data:**
- Author name and email
- Publisher information
- Managing editor
- Webmaster contact

**Visual Assets:**
- Feed images (URL, title, link)
- Logo and icon URLs
- Dimensions and alt text

**Technical Metadata:**
- TTL (Time To Live)
- Skip hours and skip days
- Cloud configuration
- PubSubHubbub hub URLs

**Content Statistics:**
- Total item count
- Items with full content
- Items with authors
- Items with enclosures/media
- Average title/description/content lengths
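The content statistics reduce to a single pass over the parsed items. The sketch below is illustrative only: it uses plain dicts with assumed field names (`title`, `content`, `author`, `enclosures`) rather than the real item model.

```python
from statistics import mean

def content_statistics(items):
    """Summarize parsed feed items (plain dicts for illustration)."""
    if not items:
        return {"total_items": 0}
    return {
        "total_items": len(items),
        "items_with_content": sum(1 for i in items if i.get("content")),
        "items_with_authors": sum(1 for i in items if i.get("author")),
        "items_with_enclosures": sum(1 for i in items if i.get("enclosures")),
        "avg_title_length": mean(len(i.get("title", "")) for i in items),
    }
```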
##### Three-Dimensional Quality Scoring

Each feed receives scores (0-1) across three dimensions:

**1. Completeness Score**

Measures how complete the feed metadata is:

- ✅ Has title
- ✅ Has description
- ✅ Has link
- ✅ Has language
- ✅ Has timestamps
- ✅ Has author/publisher
- ✅ Has categories
- ✅ Has image/logo

```python
# Example calculation: one point per check, normalized to 0-1
completeness = sum([
    bool(feed.title),        # 1/8
    bool(feed.description),  # 1/8
    bool(feed.link),         # 1/8
    bool(feed.language),     # 1/8
    # ... remaining four checks, one per item above
]) / 8.0
```

**2. Richness Score**
Measures content quality and depth:
- Items have content
- Content coverage percentage
- Author attribution
- Average content length
- Full content availability
- Media/images present
**3. Structure Score**
Measures feed structure quality:
- No parsing errors
- Has items
- Items have GUIDs
- Has timestamps
- Has links
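The structure checks can be scored the same way as completeness. This is a hedged sketch, not the real `_calculate_*()` helper: the equal weights and the plain-dict field names are assumptions.

```python
def structure_score(feed, items):
    """Score feed structure 0-1 from the checks listed above (equal weights assumed)."""
    has_items = bool(items)
    checks = [
        not feed.get("parse_errors"),                          # no parsing errors
        has_items,                                             # has items
        has_items and all(i.get("guid") for i in items),       # items have GUIDs
        has_items and all(i.get("published") for i in items),  # items have timestamps
        has_items and all(i.get("link") for i in items),       # items have links
    ]
    return sum(checks) / len(checks)
```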
##### Publishing Frequency Detection
Automatically analyzes item publication patterns to estimate update frequency:
| Frequency | Pattern |
|---|---|
| Hourly | New items every hour or less |
| Daily | New items published daily |
| Weekly | Weekly publication schedule |
| Monthly | Monthly updates |
| Infrequent | Longer intervals between posts |
```python
# Algorithm outline
from statistics import median

def estimate_update_frequency(items):
    if not items or len(items) < 2:
        return "unknown"
    # Calculate time between consecutive publications, in seconds
    published = sorted(item.published for item in items)
    intervals = [
        (later - earlier).total_seconds()
        for earlier, later in zip(published, published[1:])
    ]
    avg_interval = median(intervals)
    # Classify based on the typical interval
    if avg_interval < 3600:        # < 1 hour
        return "hourly"
    elif avg_interval < 86400:     # < 1 day
        return "daily"
    elif avg_interval < 604800:    # < 1 week
        return "weekly"
    elif avg_interval < 2592000:   # < ~1 month
        return "monthly"
    return "infrequent"
```

##### Extension Support
Full support for popular RSS extensions:

**iTunes Podcast Metadata:**
- Author, owner, categories
- Explicit flag
- Episode information
- Artwork URLs

**Dublin Core Metadata:**
- Contributor, coverage
- Creator, date
- Format, identifier
- Rights, source

**Media RSS:**
- Thumbnails with dimensions
- Media content
- Keywords and descriptions
- Credit information

**GeoRSS:**
- Location coordinates
- Geographic regions
- Place names
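Extension fields travel with each item in its `extra_data` JSON column. The key layout in this sketch is a guess for illustration, not the exact schema the fetcher produces:

```python
# Hypothetical extra_data payload for a podcast item; the key names
# ("itunes", "media_thumbnail", "georss_point") are illustrative only.
extra_data = {
    "itunes": {"author": "Acme Podcasts", "explicit": False},
    "media_thumbnail": {"url": "https://example.com/t.jpg", "width": 150},
    "georss_point": "37.78 -122.42",
}

# Read extension values defensively, since not every feed provides them
itunes_author = extra_data.get("itunes", {}).get("author")
lat, lon = (float(c) for c in extra_data["georss_point"].split())
```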
#### Usage Example

```python
from ai_web_feeds.fetcher import AdvancedFeedFetcher
from ai_web_feeds.storage import DatabaseManager

# Initialize
db = DatabaseManager("sqlite:///data/aiwebfeeds.db")
fetcher = AdvancedFeedFetcher()

# Fetch feed
fetch_log, metadata, items = await fetcher.fetch_feed(
    "https://example.com/feed.xml"
)

# Access quality scores
print(f"Completeness: {metadata.completeness_score:.2f}")
print(f"Richness: {metadata.richness_score:.2f}")
print(f"Structure: {metadata.structure_score:.2f}")

# Access metadata
print(f"Update frequency: {metadata.estimated_update_frequency}")
print(f"Total items: {metadata.total_items}")
print(f"Found {len(items)} items")

# Save to database
session = db.get_session()
session.add(fetch_log)
session.commit()
```

#### Conditional Requests
The fetcher supports conditional HTTP requests to reduce bandwidth:

```python
# Use ETag and Last-Modified from previous fetch
fetch_log, metadata, items = await fetcher.fetch_feed(
    url="https://example.com/feed.xml",
    etag="33a64df551425fcc55e4d42a148795d9f25f89d4",
    last_modified="Wed, 15 Nov 2023 12:00:00 GMT"
)

# Returns 304 Not Modified if the feed hasn't changed
if fetch_log.status_code == 304:
    print("Feed unchanged")
```

#### Retry Logic
Built-in exponential backoff for transient failures:

```python
# Automatic retries (configured via tenacity)
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def fetch_with_retry(url):
    # Will retry up to 3 times, with exponentially
    # growing waits bounded between 2s and 10s
    ...
```

### 2. Analytics Engine

**Location:** `packages/ai_web_feeds/src/ai_web_feeds/analytics.py` (600 lines)

A comprehensive analytics engine providing 8 different analytical views of feed data.
#### Generate Full Report

```python
# Export everything to JSON
report = analytics.generate_full_report()

# Save to file
import json
with open("analytics.json", "w") as f:
    json.dump(report, f, indent=2)

# Report includes all 8 analytics views
```

### 3. CLI Commands

#### Fetch Commands

**Location:** `apps/cli/ai_web_feeds/cli/commands/fetch.py` (200 lines)

##### Fetch Single Feed

```bash
ai-web-feeds fetch one <feed-id> [--metadata]
```

Fetches a single feed with optional metadata display:
```bash
# Basic fetch
ai-web-feeds fetch one openai-blog

# With detailed metadata
ai-web-feeds fetch one openai-blog --metadata
```

Features:
- Progress indicator
- Error reporting
- Quality scores display
- Metadata summary table
##### Fetch All Feeds

```bash
ai-web-feeds fetch all [--limit N] [--verified-only]
```

Batch fetch with progress tracking:

```bash
# Fetch all feeds
ai-web-feeds fetch all

# Fetch first 10 feeds
ai-web-feeds fetch all --limit 10

# Fetch only verified feeds
ai-web-feeds fetch all --verified-only
```

Features:
- Rich progress bar
- Real-time stats
- Error summary table
- Success/failure counts
#### Analytics Commands

**Location:** `apps/cli/ai_web_feeds/cli/commands/analytics.py` (400 lines)

##### Overview Dashboard

```bash
ai-web-feeds analytics overview
```

Displays a comprehensive dashboard with:
- Total counts (feeds, items, topics)
- Status distribution
- Recent activity (24h)

##### Distributions

```bash
ai-web-feeds analytics distributions [--limit N]
```

Shows distributions across:
- Source types
- Content mediums
- Topics
- Languages

##### Quality Metrics

```bash
ai-web-feeds analytics quality
```

Quality assessment with:
- Average scores
- Quality distribution
- High/low quality counts

##### Performance Tracking

```bash
ai-web-feeds analytics performance [--days N]
```

Fetch performance metrics:
- Success/failure rates
- Average durations
- Error distribution
- HTTP status codes
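Aggregating these metrics is a straightforward fold over the fetch logs. A minimal sketch, using plain dicts in place of `FeedFetchLog` rows and assumed field names (`success`, `duration`, `status_code`):

```python
from collections import Counter

def fetch_performance(logs):
    """Aggregate success rate, mean duration, and status-code counts."""
    total = len(logs)
    if not total:
        return {"success_rate": 0.0, "avg_duration_s": 0.0, "status_codes": Counter()}
    return {
        "success_rate": sum(1 for log in logs if log.get("success")) / total,
        "avg_duration_s": sum(log.get("duration", 0.0) for log in logs) / total,
        "status_codes": Counter(log.get("status_code") for log in logs),
    }
```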
##### Content Statistics

```bash
ai-web-feeds analytics content
```

Content analysis:
- Total items
- Coverage metrics
- Top categories

##### Publishing Trends

```bash
ai-web-feeds analytics trends [--days N]
```

Publishing patterns:
- Items per day
- Hourly distribution
- Weekday patterns
- Peak times

##### Feed Health

```bash
ai-web-feeds analytics health <feed-id>
```

Per-feed health report with diagnostics and recommendations.

##### Top Contributors

```bash
ai-web-feeds analytics contributors [--limit N]
```

Contributor leaderboard with verification rates.

##### Generate Report

```bash
ai-web-feeds analytics report [--output FILE]
```

Export a comprehensive JSON report.
## Database Schema

The enhanced system uses the existing database schema, making full use of its flexible JSON columns.

### FeedFetchLog Enhancements

```python
class FeedFetchLog(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Complete HTTP headers
    # - Detailed error information
    # - Item statistics
    # - Quality scores
    # - Extension metadata
```

### FeedItem Enhancements

```python
class FeedItem(SQLModel, table=True):
    # ... existing fields ...

    # Enhanced usage of extra_data
    extra_data: Optional[Dict[str, Any]] = Field(
        default=None,
        sa_column=Column(JSON)
    )
    # Now stores:
    # - Extension metadata (iTunes, Media RSS, etc.)
    # - Multiple categories
    # - Enclosure metadata
    # - Author details
```

## Dependencies
### New Dependencies Added

#### Core Library Dependencies

File: `packages/ai_web_feeds/pyproject.toml`

```toml
dependencies = [
    # ... existing ...
    "beautifulsoup4>=4.12.0",  # NEW: HTML parsing
]
```

Purpose:
- HTML parsing for feed discovery
- Extracting feed URLs from web pages
- Parsing HTML content in feed items

#### CLI Tool Dependencies

File: `apps/cli/pyproject.toml`

```toml
dependencies = [
    # ... existing ...
    "rich>=13.7.0",  # NEW: Rich terminal output
]
```

Purpose:
- Beautiful terminal tables
- Progress bars and spinners
- Colored output and styling
- Markdown rendering in terminal
## Performance Considerations

### Conditional Requests

Reduce bandwidth and processing for unchanged feeds:

```python
# Store from previous fetch
etag = fetch_log.etag
last_modified = fetch_log.last_modified

# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    etag=etag,
    last_modified=last_modified
)

# Server returns 304 Not Modified if unchanged
if new_log.status_code == 304:
    # No processing needed
    return
```

### Retry Logic

Exponential backoff for reliability:

```python
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential
)

@retry(
    stop=stop_after_attempt(3),  # Max 3 attempts
    wait=wait_exponential(
        multiplier=1,
        min=2,   # Wait 2s after first failure
        max=10   # Wait max 10s
    )
)
async def fetch_with_retry(url):
    # Automatic retry on failure
    ...
```

### Timeouts

Prevent hanging on slow feeds:

```python
# Configurable timeout (default 30s)
fetcher = AdvancedFeedFetcher(timeout=30.0)

# Per-request timeout
fetch_log, metadata, items = await fetcher.fetch_feed(
    url=feed_url,
    timeout=60.0  # Override for a slow feed
)
```

## Best Practices
### Use Conditional Requests

Always pass `etag` and `last_modified` from previous fetches to reduce bandwidth:

```python
# Save from previous fetch
session.add(fetch_log)

# Use in next fetch
new_log, metadata, items = await fetcher.fetch_feed(
    url=url,
    etag=fetch_log.etag,
    last_modified=fetch_log.last_modified
)
```

### Respect TTL Values

Honor the feed's TTL (Time To Live) when scheduling updates:

```python
from datetime import datetime, timedelta

if metadata.ttl:
    # Wait TTL minutes before the next fetch
    next_fetch = datetime.now() + timedelta(minutes=metadata.ttl)
```

### Monitor Health Regularly

Check feed health scores to identify issues:

```bash
# Daily health check
ai-web-feeds analytics health openai-blog

# Weekly full report
ai-web-feeds analytics report --output weekly-report.json
```

### Track Trends

Use analytics to identify patterns:

```bash
# Monthly trend analysis
ai-web-feeds analytics trends --days 30

# Quality monitoring
ai-web-feeds analytics quality
```

### Generate Periodic Reports

Export analytics for monitoring:

```bash
# Weekly reports
ai-web-feeds analytics report --output reports/week-$(date +%U).json

# Archive for historical analysis
```

## Installation
### Quick Setup Script

Use the automated setup script:

```bash
# Make executable
chmod +x setup-enhanced-features.sh

# Run setup
./setup-enhanced-features.sh
```

The script will:
- Install the core library with dependencies
- Install the CLI tool with dependencies
- Verify the installation
- Display next steps

### Manual Installation

Install each component separately:

```bash
# 1. Install core library
cd packages/ai_web_feeds
pip install -e .

# 2. Install CLI tool
cd ../../apps/cli
pip install -e .

# 3. Verify installation
ai-web-feeds --version
ai-web-feeds fetch --help
ai-web-feeds analytics --help
```

## Code Organization
```
packages/ai_web_feeds/src/ai_web_feeds/
├── fetcher.py              # AdvancedFeedFetcher class
│   ├── FeedMetadata        # Metadata container (100+ fields)
│   ├── fetch_feed()        # Main fetch method
│   ├── _extract_*()        # Extraction helpers
│   └── _calculate_*()      # Quality scoring
│
└── analytics.py            # FeedAnalytics class
    ├── get_overview_stats()
    ├── get_*_distribution()
    ├── get_quality_metrics()
    ├── get_fetch_performance_stats()
    ├── get_content_statistics()
    ├── get_publishing_trends()
    ├── get_feed_health_report()
    ├── get_top_contributors()
    └── generate_full_report()

apps/cli/ai_web_feeds/cli/commands/
├── fetch.py                # Fetch CLI commands
│   ├── fetch_one()         # Single feed fetch
│   └── fetch_all()         # Batch fetch
│
└── analytics.py            # Analytics CLI commands
    ├── show_overview()
    ├── show_distributions()
    ├── show_quality()
    ├── show_performance()
    ├── show_content()
    ├── show_trends()
    ├── show_health()
    ├── show_contributors()
    └── generate_report()
```

## Future Enhancements
Potential additions for future versions:
- Web UI dashboard with real-time metrics
- Machine learning for content classification
- Real-time monitoring with webhooks
- GraphQL API for analytics
- Advanced deduplication algorithms
- Content similarity analysis
- Multi-language NLP support
- Anomaly detection in publishing patterns
- Automated quality recommendations
## Support

For technical questions or issues:
- Review this documentation
- Check the inline code documentation
- Explore the CLI help: `ai-web-feeds --help`
- Open an issue on GitHub
## Related Documentation
- Feature Overview - High-level feature list
- Getting Started - Setup and quickstart
- Analytics Guide - Analytics usage guide