Data Enrichment & Analytics
Comprehensive data enrichment and advanced analytics capabilities
AI Web Feeds includes comprehensive data enrichment and advanced analytics capabilities that automatically enhance feed metadata, analyze content, track quality, and provide ML-powered insights.
Key Features
1. Metadata Enrichment
Module: enrichment.metadata
Automatically discovers and enriches feed metadata:
- Auto-discovery: Extracts titles, descriptions, authors from feeds and websites
- Language Detection: Identifies feed language with confidence scores
- Platform Detection: Recognizes Reddit, Medium, Substack, GitHub, arXiv, YouTube, etc.
- Icon/Logo Discovery: Finds favicons and Open Graph images
- Feed Format Detection: Identifies RSS, Atom, JSON feeds
- Publishing Frequency: Analyzes update patterns
Example Usage:

```python
from ai_web_feeds.enrichment import MetadataEnricher

enricher = MetadataEnricher()

# Enrich a single feed
feed_data = {"url": "https://example.com/feed"}
enriched = enricher.enrich_feed_source(feed_data)
print(enriched["title"])     # Auto-discovered title
print(enriched["language"])  # Detected language
print(enriched["platform"])  # Detected platform

# Batch enrichment (parallel)
feeds = [{"url": url1}, {"url": url2}, {"url": url3}]
enriched_feeds = enricher.batch_enrich(feeds, max_workers=5)
```

2. Content Analysis
Module: enrichment.content
NLP-powered content analysis:
- Text Statistics: Word count, sentence count, paragraph count
- Readability Scoring: Flesch reading ease, reading level classification
- Keyword Extraction: Top keywords, domain-specific keywords (AI/ML)
- Named Entity Recognition: Simple capitalization-based extraction
- Sentiment Analysis: Positive/negative/neutral classification with confidence
- Topic Detection: Auto-classification into research, industry, ML, NLP, etc.
- Content Detection: Identifies code snippets and mathematical notation
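The readability score above follows the standard Flesch reading ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). As a minimal sketch of how such a score can be computed with a crude vowel-group syllable heuristic (this is illustrative, not the library's actual implementation):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels (incl. "y").
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat."), 1))  # ≈ 116.1
```

Higher scores mean easier text; long sentences and polysyllabic words drive the score down.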
Example Usage:

```python
from ai_web_feeds.enrichment import ContentAnalyzer

analyzer = ContentAnalyzer()

# Analyze text content
text = """
Machine learning models are becoming increasingly powerful.
Recent advances in transformer architectures have led to
breakthrough performance on many NLP tasks.
"""
analysis = analyzer.analyze_text(text)
print(f"Readability: {analysis.readability_score:.1f}")
print(f"Reading Level: {analysis.reading_level}")
print(f"Sentiment: {analysis.sentiment_label} ({analysis.sentiment_score:.2f})")
print(f"Top Keywords: {analysis.top_keywords[:5]}")
print(f"Detected Topics: {analysis.detected_topics}")
print(f"Has Code: {analysis.has_code}")
```

3. Quality Analysis
Module: enrichment.quality
Multi-dimensional quality scoring:
- Completeness: Required vs. optional fields
- Accuracy: URL format, title length, description quality
- Consistency: Domain matching, language code format
- Timeliness: Update freshness, staleness detection
- Validity: Data type checking, schema compliance
- Uniqueness: Duplicate detection (with context)
Quality Dimensions (with weights):
- Completeness (25%): Are required fields present?
- Accuracy (20%): Is data properly formatted?
- Consistency (15%): Do related fields match?
- Timeliness (15%): Is data up-to-date?
- Validity (15%): Does data meet type requirements?
- Uniqueness (10%): Is feed unique?
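To make the weighting concrete, here is a minimal sketch of how an overall score could be combined from the per-dimension scores listed above. The `WEIGHTS` dict mirrors the documented percentages, but the `overall_score` helper and score values are illustrative, not the library's actual API:

```python
# Weights from the quality dimensions above (sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.20,
    "consistency": 0.15,
    "timeliness": 0.15,
    "validity": 0.15,
    "uniqueness": 0.10,
}

def overall_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores (each 0-100)."""
    return sum(WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in WEIGHTS)

scores = {"completeness": 80, "accuracy": 90, "consistency": 100,
          "timeliness": 70, "validity": 100, "uniqueness": 100}
print(round(overall_score(scores), 1))  # weighted sum: 88.5
```

A missing dimension counts as 0, so incomplete assessments pull the overall score down rather than being silently ignored.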
Example Usage:

```python
from ai_web_feeds.enrichment import QualityAnalyzer

analyzer = QualityAnalyzer()

# Assess feed quality
feed_data = {
    "url": "example.com/feed",  # Missing protocol
    "title": "AI News",
    # Missing recommended fields: description, language, topics
}
score = analyzer.assess_feed_source(feed_data)
print(f"Overall Score: {score.overall_score}/100")
print(f"Completeness: {score.completeness_score}/100")
print(f"Issues Found: {len(score.issues)}")
for issue in score.issues:
    print(f"  [{issue.severity}] {issue.field}: {issue.issue}")
    if issue.auto_fixable:
        print(f"  → Can auto-fix: {issue.suggestion}")

# Auto-fix issues
fixed = analyzer.auto_fix_issues(feed_data)
print(f"Fixed URL: {fixed['url']}")  # Now has https://
```

4. Time-Series Analysis
Module: analytics.timeseries
Forecasting and temporal pattern analysis:
- Health Forecasting: Predict feed health 7+ days ahead
- Seasonality Detection: Weekly/daily posting patterns
- Trend Analysis: Increasing/decreasing/stable trends with R²
- Frequency Analysis: Publishing rates and regularity
- Peak Time Detection: Most active hours/days
Example Usage:

```python
from ai_web_feeds.analytics.timeseries import TimeSeriesAnalyzer
from ai_web_feeds import DatabaseManager

db = DatabaseManager()
with db.get_session() as session:
    analyzer = TimeSeriesAnalyzer(session)

    # Forecast health
    forecast = analyzer.forecast_health_metric("feed_123", days_ahead=14)
    print(f"Forecast (next 14 days): {forecast.forecast_values}")
    print(f"Confidence Intervals: {forecast.confidence_intervals}")
    print(f"Model RMSE: {forecast.rmse:.3f}")

    # Detect seasonality
    seasonality = analyzer.detect_seasonality("feed_123", lookback_days=90)
    if seasonality.has_seasonality:
        print(f"Seasonal Period: {seasonality.seasonal_period} hours/days")
        print(f"Seasonal Strength: {seasonality.seasonal_strength:.2f}")

    # Analyze trend
    trend = analyzer.analyze_trend("feed_123", lookback_days=90)
    print(f"Trend Direction: {trend.trend_direction}")
    print(f"Slope: {trend.slope:.4f}")
    print(f"R²: {trend.r_squared:.3f}")
```

5. Network Analysis
Module: analytics.network
Graph-based topic and feed relationship analysis:
- Topic Networks: Graph of topic relationships
- Feed Similarity Networks: Feeds connected by shared topics
- Centrality Metrics: PageRank, degree, closeness, betweenness
- Community Detection: Identify topic clusters
- Influential Topics: Rank topics by network importance
Example Usage:

```python
from ai_web_feeds.analytics.network import NetworkAnalyzer
from ai_web_feeds import DatabaseManager

db = DatabaseManager()
with db.get_session() as session:
    analyzer = NetworkAnalyzer(session)

    # Build topic network
    topic_graph = analyzer.build_topic_network()
    print(f"Topics: {topic_graph.stats['num_nodes']}")
    print(f"Relationships: {topic_graph.stats['num_edges']}")
    print(f"Density: {topic_graph.stats['density']:.3f}")

    # Find influential topics
    influential = analyzer.find_influential_topics(topic_graph, top_n=10)
    for topic in influential:
        print(f"{topic['label']}: PageRank={topic['pagerank']:.4f}")
```

6. Advanced Analytics
Module: analytics.advanced
ML-powered insights:
- Predictive Health Modeling: Linear regression forecasts
- Pattern Detection: Temporal, content, category patterns
- Similarity Computation: Jaccard similarity between feeds
- Feed Clustering: BFS-based clustering by similarity
- ML Insights Reports: Comprehensive ML analysis
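This section has no example of its own, so here is a small sketch of the Jaccard similarity measure named above, computed over feed topic sets: |A ∩ B| / |A ∪ B|. The `jaccard_similarity` helper and the topic sets are illustrative, not the library API:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

feed_a = {"machine-learning", "nlp", "transformers"}
feed_b = {"machine-learning", "nlp", "computer-vision"}
print(jaccard_similarity(feed_a, feed_b))  # 2 shared / 4 total = 0.5
```

Pairwise scores like this are what a BFS-based clustering can then traverse: feeds whose similarity exceeds a threshold are connected, and connected components become clusters.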
Integration with Data Sync
The enrichment system integrates seamlessly with data synchronization:
```python
from ai_web_feeds.data_sync import DataSyncOrchestrator
from ai_web_feeds.enrichment import MetadataEnricher, QualityAnalyzer
from ai_web_feeds import DatabaseManager
import yaml

db = DatabaseManager()

# Load and enrich feeds
with MetadataEnricher() as enricher:
    with open("data/feeds.yaml") as f:
        data = yaml.safe_load(f)

    # Enrich all feeds
    enriched_sources = enricher.batch_enrich(data["sources"])

    # Assess quality
    quality_analyzer = QualityAnalyzer()
    for feed in enriched_sources:
        score = quality_analyzer.assess_feed_source(feed)
        feed["quality_score"] = score.overall_score

# Sync to database
sync = DataSyncOrchestrator(db)
sync.full_sync()
```

Workflow Examples
Complete Feed Enrichment Pipeline
```python
from ai_web_feeds.enrichment import (
    MetadataEnricher,
    ContentAnalyzer,
    QualityAnalyzer,
)

# 1. Extract metadata
enricher = MetadataEnricher()
feed_data = {"url": "https://openai.com/blog/rss/"}
enriched = enricher.enrich_feed_source(feed_data)

# 2. Analyze content
content_analyzer = ContentAnalyzer()
content_text = "Latest advances in GPT-4 and DALL-E 3..."
content_analysis = content_analyzer.analyze_text(content_text)

# 3. Assess quality
quality_analyzer = QualityAnalyzer()
quality = quality_analyzer.assess_feed_source(enriched)

# 4. Combine results
final_feed = {
    **enriched,
    "content_analysis": {
        "readability": content_analysis.readability_score,
        "sentiment": content_analysis.sentiment_label,
        "topics": content_analysis.detected_topics,
    },
    "quality": {
        "overall_score": quality.overall_score,
        "issues_count": len(quality.issues),
    },
}
```

Health Monitoring Dashboard
```python
from ai_web_feeds.analytics.timeseries import TimeSeriesAnalyzer
from ai_web_feeds.analytics.advanced import AdvancedFeedAnalytics
from ai_web_feeds import DatabaseManager

db = DatabaseManager()
with db.get_session() as session:
    ts_analyzer = TimeSeriesAnalyzer(session)
    adv_analytics = AdvancedFeedAnalytics(session)

    feed_id = "feed_123"

    # Current health
    current_health = adv_analytics.get_current_health(feed_id)

    # Forecast for the coming week
    forecast = ts_analyzer.forecast_health_metric(feed_id, days_ahead=7)

    # Trend analysis
    trend = ts_analyzer.analyze_trend(feed_id, lookback_days=30)

    dashboard = {
        "feed_id": feed_id,
        "current_health": current_health,
        "forecast_7d": forecast.forecast_values[-1],
        "trend": trend.trend_direction,
        "status": "healthy" if current_health > 0.7 else "degraded",
    }
```

Performance Considerations
- Batch Processing: Use `batch_enrich()` for multiple feeds (parallel workers)
- Caching: Metadata enrichment results are cached in the enriched YAML
- Incremental Updates: Only re-enrich feeds older than X days
- Database Indexes: Ensure indexes on `feed_source_id`, `published_date`, and `calculated_at`
- Memory: Time-series analysis stays memory-efficient by streaming large datasets
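One way the incremental-update rule could look, assuming each feed record carries an `enriched_at` timestamp (an assumed field name, not documented above):

```python
from datetime import datetime, timedelta, timezone

def needs_reenrichment(feed: dict, max_age_days: int = 30) -> bool:
    """Re-enrich only feeds whose last enrichment is older than max_age_days."""
    enriched_at = feed.get("enriched_at")
    if enriched_at is None:
        return True  # never enriched
    return datetime.now(timezone.utc) - enriched_at > timedelta(days=max_age_days)

feeds = [
    {"url": "https://example.com/feed", "enriched_at": None},
    {"url": "https://example.org/feed",
     "enriched_at": datetime.now(timezone.utc) - timedelta(days=2)},
]
stale = [f for f in feeds if needs_reenrichment(f)]
print(len(stale))  # only the never-enriched feed qualifies
```

Filtering before calling `batch_enrich()` keeps the parallel workers busy only with feeds that actually need refreshing.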
Troubleshooting
Common Issues
Language detection fails
- Ensure the text is at least 10 characters long; langdetect needs a minimum amount of text to produce a reliable result
Metadata extraction returns empty
- Check URL accessibility; some sites block scrapers (use crawlee-python)
Quality score too low
- Use `auto_fix_issues()` to automatically fix common problems
Forecasting reports insufficient data
- A minimum of 7 data points is required; make sure health metrics are collected regularly
Best Practices
- Enrich on Import: Run enrichment when adding new feeds
- Quality Gates: Set minimum quality score threshold (e.g., 70/100)
- Regular Updates: Re-enrich metadata monthly
- Content Analysis: Run on new feed items, not all historical
- Health Monitoring: Schedule daily health metric calculations
- Network Updates: Rebuild topic network when taxonomy changes
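A minimal quality gate along the lines of the practices above. The `quality_score` field matches the integration example earlier, while the threshold value and `passes_quality_gate` helper are illustrative:

```python
MIN_QUALITY_SCORE = 70  # threshold suggested above (70/100)

def passes_quality_gate(feed: dict, threshold: int = MIN_QUALITY_SCORE) -> bool:
    """Reject feeds below the threshold or never scored at all."""
    return feed.get("quality_score", 0) >= threshold

feeds = [
    {"url": "https://example.com/a", "quality_score": 85},
    {"url": "https://example.com/b", "quality_score": 55},
    {"url": "https://example.com/c"},  # never scored -> rejected
]
accepted = [f for f in feeds if passes_quality_gate(f)]
print([f["url"] for f in accepted])  # only the 85-score feed passes
```

Applying the gate at import time keeps low-quality sources out of the database entirely, rather than filtering them in every downstream query.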
Future Enhancements
Planned features:
- Deep Learning Models: Use transformer models for better NLP
- Real-time Anomaly Detection: Alert on unusual patterns
- Automated Categorization: ML-based topic assignment
- Sentiment Trends: Track sentiment changes over time
- Duplicate Detection: Find near-duplicate feeds
- Performance Optimization: GPU acceleration for large-scale analysis
Related Documentation
- Database Architecture - Database implementation
- Database Quick Start - Get started with the database
- Python API - Full API reference
Version: 1.0
Last Updated: October 15, 2025