Data Enrichment & Analytics
Comprehensive data enrichment and advanced analytics capabilities
AI Web Feeds includes comprehensive data enrichment and advanced analytics capabilities that automatically enhance feed metadata, analyze content, track quality, and provide ML-powered insights.
Key Features
1. Metadata Enrichment
Module: enrichment.metadata
Automatically discovers and enriches feed metadata:
- Auto-discovery: Extracts titles, descriptions, authors from feeds and websites
- Language Detection: Identifies feed language with confidence scores
- Platform Detection: Recognizes Reddit, Medium, Substack, GitHub, arXiv, YouTube, etc.
- Icon/Logo Discovery: Finds favicons and Open Graph images
- Feed Format Detection: Identifies RSS, Atom, JSON feeds
- Publishing Frequency: Analyzes update patterns
Example Usage:

```python
from ai_web_feeds.enrichment import MetadataEnricher

enricher = MetadataEnricher()

# Enrich a single feed
feed_data = {"url": "https://example.com/feed"}
enriched = enricher.enrich_feed_source(feed_data)
print(enriched["title"])     # Auto-discovered title
print(enriched["language"])  # Detected language
print(enriched["platform"])  # Detected platform

# Batch enrichment (parallel)
feeds = [{"url": url1}, {"url": url2}, {"url": url3}]
enriched_feeds = enricher.batch_enrich(feeds, max_workers=5)
```

2. Content Analysis
Module: enrichment.content
NLP-powered content analysis:
- Text Statistics: Word count, sentence count, paragraph count
- Readability Scoring: Flesch reading ease, reading level classification
- Keyword Extraction: Top keywords, domain-specific keywords (AI/ML)
- Named Entity Recognition: Simple capitalization-based extraction
- Sentiment Analysis: Positive/negative/neutral classification with confidence
- Topic Detection: Auto-classification into research, industry, ML, NLP, etc.
- Content Detection: Identifies code snippets and mathematical notation
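The readability score above follows the standard Flesch reading ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). As a minimal sketch of how such a score can be computed with a crude vowel-group syllable heuristic (this is illustrative, not the library's actual implementation):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of consecutive vowels (incl. "y").
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat."), 1))  # ≈ 116.1
```

Higher scores mean easier text; long sentences and polysyllabic words drive the score down.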
Example Usage:

```python
from ai_web_feeds.enrichment import ContentAnalyzer

analyzer = ContentAnalyzer()

# Analyze text content
text = """
Machine learning models are becoming increasingly powerful.
Recent advances in transformer architectures have led to
breakthrough performance on many NLP tasks.
"""
analysis = analyzer.analyze_text(text)
print(f"Readability: {analysis.readability_score:.1f}")
print(f"Reading Level: {analysis.reading_level}")
print(f"Sentiment: {analysis.sentiment_label} ({analysis.sentiment_score:.2f})")
print(f"Top Keywords: {analysis.top_keywords[:5]}")
print(f"Detected Topics: {analysis.detected_topics}")
print(f"Has Code: {analysis.has_code}")
```

3. Quality Analysis
Module: enrichment.quality
Multi-dimensional quality scoring:
- Completeness: Required vs. optional fields
- Accuracy: URL format, title length, description quality
- Consistency: Domain matching, language code format
- Timeliness: Update freshness, staleness detection
- Validity: Data type checking, schema compliance
- Uniqueness: Duplicate detection (with context)
Quality Dimensions (with weights):
- Completeness (25%): Are required fields present?
- Accuracy (20%): Is data properly formatted?
- Consistency (15%): Do related fields match?
- Timeliness (15%): Is data up-to-date?
- Validity (15%): Does data meet type requirements?
- Uniqueness (10%): Is feed unique?
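To make the weighting concrete, here is a minimal sketch of how an overall score could be combined from the per-dimension scores listed above. The `WEIGHTS` dict mirrors the documented percentages, but the `overall_score` helper and score values are illustrative, not the library's actual API:

```python
# Weights from the quality dimensions above (sum to 1.0).
WEIGHTS = {
    "completeness": 0.25,
    "accuracy": 0.20,
    "consistency": 0.15,
    "timeliness": 0.15,
    "validity": 0.15,
    "uniqueness": 0.10,
}

def overall_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores (each 0-100)."""
    return sum(WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in WEIGHTS)

scores = {"completeness": 80, "accuracy": 90, "consistency": 100,
          "timeliness": 70, "validity": 100, "uniqueness": 100}
print(round(overall_score(scores), 1))  # weighted sum: 88.5
```

A missing dimension counts as 0, so incomplete assessments pull the overall score down rather than being silently ignored.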
Example Usage:

```python
from ai_web_feeds.enrichment import QualityAnalyzer

analyzer = QualityAnalyzer()

# Assess feed quality
feed_data = {
    "url": "example.com/feed",  # Missing protocol
    "title": "AI News",
    # Missing recommended fields: description, language, topics
}
score = analyzer.assess_feed_source(feed_data)
print(f"Overall Score: {score.overall_score}/100")
print(f"Completeness: {score.completeness_score}/100")
print(f"Issues Found: {len(score.issues)}")
for issue in score.issues:
    print(f"  [{issue.severity}] {issue.field}: {issue.issue}")
    if issue.auto_fixable:
        print(f"  → Can auto-fix: {issue.suggestion}")

# Auto-fix issues
fixed = analyzer.auto_fix_issues(feed_data)
print(f"Fixed URL: {fixed['url']}")  # Now has https://
```

4. Time-Series Analysis
Module: analytics.timeseries
Forecasting and temporal pattern analysis:
- Health Forecasting: Predict feed health 7+ days ahead
- Seasonality Detection: Weekly/daily posting patterns
- Trend Analysis: Increasing/decreasing/stable trends with R²
- Frequency Analysis: Publishing rates and regularity
- Peak Time Detection: Most active hours/days
Example Usage:

```python
from ai_web_feeds.analytics.timeseries import TimeSeriesAnalyzer
from ai_web_feeds import DatabaseManager

db = DatabaseManager()
with db.get_session() as session:
    analyzer = TimeSeriesAnalyzer(session)

    # Forecast health
    forecast = analyzer.forecast_health_metric("feed_123", days_ahead=14)
    print(f"Forecast (next 14 days): {forecast.forecast_values}")
    print(f"Confidence Intervals: {forecast.confidence_intervals}")
    print(f"Model RMSE: {forecast.rmse:.3f}")

    # Detect seasonality
    seasonality = analyzer.detect_seasonality("feed_123", lookback_days=90)
    if seasonality.has_seasonality:
        print(f"Seasonal Period: {seasonality.seasonal_period} hours/days")
        print(f"Seasonal Strength: {seasonality.seasonal_strength:.2f}")

    # Analyze trend
    trend = analyzer.analyze_trend("feed_123", lookback_days=90)
    print(f"Trend Direction: {trend.trend_direction}")
    print(f"Slope: {trend.slope:.4f}")
    print(f"R²: {trend.r_squared:.3f}")
```

5. Network Analysis
Module: analytics.network
Graph-based topic and feed relationship analysis:
- Topic Networks: Graph of topic relationships
- Feed Similarity Networks: Feeds connected by shared topics
- Centrality Metrics: PageRank, degree, closeness, betweenness
- Community Detection: Identify topic clusters
- Influential Topics: Rank topics by network importance
Example Usage:

```python
from ai_web_feeds.analytics.network import NetworkAnalyzer
from ai_web_feeds import DatabaseManager

db = DatabaseManager()
with db.get_session() as session:
    analyzer = NetworkAnalyzer(session)

    # Build topic network
    topic_graph = analyzer.build_topic_network()
    print(f"Topics: {topic_graph.stats['num_nodes']}")
    print(f"Relationships: {topic_graph.stats['num_edges']}")
    print(f"Density: {topic_graph.stats['density']:.3f}")

    # Find influential topics
    influential = analyzer.find_influential_topics(topic_graph, top_n=10)
    for topic in influential:
        print(f"{topic['label']}: PageRank={topic['pagerank']:.4f}")
```

6. Advanced Analytics
Module: analytics.advanced
ML-powered insights:
- Predictive Health Modeling: Linear regression forecasts
- Pattern Detection: Temporal, content, category patterns
- Similarity Computation: Jaccard similarity between feeds
- Feed Clustering: BFS-based clustering by similarity
- ML Insights Reports: Comprehensive ML analysis
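This section has no example of its own, so here is a small sketch of the Jaccard similarity measure named above, computed over feed topic sets: |A ∩ B| / |A ∪ B|. The `jaccard_similarity` helper and the topic sets are illustrative, not the library API:

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

feed_a = {"machine-learning", "nlp", "transformers"}
feed_b = {"machine-learning", "nlp", "computer-vision"}
print(jaccard_similarity(feed_a, feed_b))  # 2 shared / 4 total = 0.5
```

Pairwise scores like this are what a BFS-based clustering can then traverse: feeds whose similarity exceeds a threshold are connected, and connected components become clusters.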
Integration with Data Sync
The enrichment system integrates seamlessly with data synchronization:
```python
from ai_web_feeds.data_sync import DataSyncOrchestrator
from ai_web_feeds.enrichment import MetadataEnricher, QualityAnalyzer
from ai_web_feeds import DatabaseManager
import yaml

db = DatabaseManager()

# Load and enrich feeds
with MetadataEnricher() as enricher:
    with open("data/feeds.yaml") as f:
        data = yaml.safe_load(f)

    # Enrich all feeds
    enriched_sources = enricher.batch_enrich(data["sources"])

    # Assess quality
    quality_analyzer = QualityAnalyzer()
    for feed in enriched_sources:
        score = quality_analyzer.assess_feed_source(feed)
        feed["quality_score"] = score.overall_score

# Sync to database
sync = DataSyncOrchestrator(db)
sync.full_sync()
```

Workflow Examples
Complete Feed Enrichment Pipeline
```python
from ai_web_feeds.enrichment import (
    MetadataEnricher,
    ContentAnalyzer,
    QualityAnalyzer,
)

# 1. Extract metadata
enricher = MetadataEnricher()
feed_data = {"url": "https://openai.com/blog/rss/"}
enriched = enricher.enrich_feed_source(feed_data)

# 2. Analyze content
content_analyzer = ContentAnalyzer()
content_text = "Latest advances in GPT-4 and DALL-E 3..."
content_analysis = content_analyzer.analyze_text(content_text)

# 3. Assess quality
quality_analyzer = QualityAnalyzer()
quality = quality_analyzer.assess_feed_source(enriched)

# 4. Combine results
final_feed = {
    **enriched,
    "content_analysis": {
        "readability": content_analysis.readability_score,
        "sentiment": content_analysis.sentiment_label,
        "topics": content_analysis.detected_topics,
    },
    "quality": {
        "overall_score": quality.overall_score,
        "issues_count": len(quality.issues),
    },
}
```

Health Monitoring Dashboard
```python
from ai_web_feeds.analytics.timeseries import TimeSeriesAnalyzer
from ai_web_feeds.analytics.advanced import AdvancedFeedAnalytics
from ai_web_feeds import DatabaseManager

db = DatabaseManager()
with db.get_session() as session:
    ts_analyzer = TimeSeriesAnalyzer(session)
    adv_analytics = AdvancedFeedAnalytics(session)

    feed_id = "feed_123"

    # Current health
    current_health = adv_analytics.get_current_health(feed_id)

    # Forecast for the coming week
    forecast = ts_analyzer.forecast_health_metric(feed_id, days_ahead=7)

    # Trend analysis
    trend = ts_analyzer.analyze_trend(feed_id, lookback_days=30)

    dashboard = {
        "feed_id": feed_id,
        "current_health": current_health,
        "forecast_7d": forecast.forecast_values[-1],
        "trend": trend.trend_direction,
        "status": "healthy" if current_health > 0.7 else "degraded",
    }
```

Performance Considerations
- Batch Processing: Use `batch_enrich()` for multiple feeds (parallel workers)
- Caching: Metadata enrichment results are cached in the enriched YAML
- Incremental Updates: Only re-enrich feeds older than X days
- Database Indexes: Ensure indexes on `feed_source_id`, `published_date`, and `calculated_at`
- Memory: Time-series analysis stays memory-efficient by streaming large datasets
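One way the incremental-update rule could look, assuming each feed record carries an `enriched_at` timestamp (an assumed field name, not documented above):

```python
from datetime import datetime, timedelta, timezone

def needs_reenrichment(feed: dict, max_age_days: int = 30) -> bool:
    """Re-enrich only feeds whose last enrichment is older than max_age_days."""
    enriched_at = feed.get("enriched_at")
    if enriched_at is None:
        return True  # never enriched
    return datetime.now(timezone.utc) - enriched_at > timedelta(days=max_age_days)

feeds = [
    {"url": "https://example.com/feed", "enriched_at": None},
    {"url": "https://example.org/feed",
     "enriched_at": datetime.now(timezone.utc) - timedelta(days=2)},
]
stale = [f for f in feeds if needs_reenrichment(f)]
print(len(stale))  # only the never-enriched feed qualifies
```

Filtering before calling `batch_enrich()` keeps the parallel workers busy only with feeds that actually need refreshing.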
Troubleshooting
Common Issues
Language detection fails
- Ensure the text is at least 10 characters long; langdetect needs a minimum amount of text to produce a reliable result
Metadata extraction returns empty
- Check URL accessibility; some sites block scrapers (use crawlee-python)
Quality score too low
- Use `auto_fix_issues()` to automatically fix common problems
Forecasting reports insufficient data
- A minimum of 7 data points is required; make sure health metrics are collected regularly
Best Practices
- Enrich on Import: Run enrichment when adding new feeds
- Quality Gates: Set minimum quality score threshold (e.g., 70/100)
- Regular Updates: Re-enrich metadata monthly
- Content Analysis: Run on new feed items, not all historical
- Health Monitoring: Schedule daily health metric calculations
- Network Updates: Rebuild topic network when taxonomy changes
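A minimal quality gate along the lines of the practices above. The `quality_score` field matches the integration example earlier, while the threshold value and `passes_quality_gate` helper are illustrative:

```python
MIN_QUALITY_SCORE = 70  # threshold suggested above (70/100)

def passes_quality_gate(feed: dict, threshold: int = MIN_QUALITY_SCORE) -> bool:
    """Reject feeds below the threshold or never scored at all."""
    return feed.get("quality_score", 0) >= threshold

feeds = [
    {"url": "https://example.com/a", "quality_score": 85},
    {"url": "https://example.com/b", "quality_score": 55},
    {"url": "https://example.com/c"},  # never scored -> rejected
]
accepted = [f for f in feeds if passes_quality_gate(f)]
print([f["url"] for f in accepted])  # only the 85-score feed passes
```

Applying the gate at import time keeps low-quality sources out of the database entirely, rather than filtering them in every downstream query.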
Future Enhancements
Planned features:
- Deep Learning Models: Use transformer models for better NLP
- Real-time Anomaly Detection: Alert on unusual patterns
- Automated Categorization: ML-based topic assignment
- Sentiment Trends: Track sentiment changes over time
- Duplicate Detection: Find near-duplicate feeds
- Performance Optimization: GPU acceleration for large-scale analysis
Related Documentation
- Database Architecture - Database implementation
- Database Quick Start - Get started with the database
- Python API - Full API reference
Version: 1.0
Last Updated: October 15, 2025