# Entity Extraction

Named Entity Recognition and normalization using spaCy NER.

Entity Extraction identifies and tracks people, organizations, techniques, datasets, and concepts mentioned in articles using spaCy's Named Entity Recognition (NER) models.
## Overview
The entity extractor:
- Extracts entities from article text using spaCy NER
- Normalizes entity names to canonical forms (e.g., "G. Hinton" → "Geoffrey Hinton")
- Tracks entity mentions across articles with confidence scores
- Enables full-text search across entities and aliases
## Architecture

### Entity Types
Supported entity types:
- `person`: Geoffrey Hinton, Yann LeCun, Ilya Sutskever
- `organization`: OpenAI, Google Brain, Anthropic
- `technique`: Transformers, RLHF, LoRA, BERT
- `dataset`: ImageNet, COCO, WikiText-103
- `concept`: Attention mechanism, Backpropagation
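Note that spaCy's stock English models emit labels such as `PERSON`, `ORG`, and `PRODUCT`, not these five categories, so some mapping layer is implied. A minimal sketch (this mapping table is an illustrative assumption, not the project's actual one):

```python
# Hypothetical mapping from spaCy NER labels to this project's entity types.
# spaCy has no "technique" or "dataset" label, so those likely need
# rule-based handling (e.g. a gazetteer of known names) on top of NER.
SPACY_LABEL_TO_ENTITY_TYPE = {
    "PERSON": "person",
    "ORG": "organization",
    "PRODUCT": "technique",      # rough default; model names are often tagged PRODUCT
    "WORK_OF_ART": "concept",
}

def map_label(spacy_label: str):
    """Return the project entity type for a spaCy label, or None to skip the span."""
    return SPACY_LABEL_TO_ENTITY_TYPE.get(spacy_label)
```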
## Features

### Named Entity Recognition

Uses spaCy's `en_core_web_sm` model to detect entities:
```python
from ai_web_feeds.nlp import EntityExtractor

extractor = EntityExtractor()

article = {
    "id": 1,
    "title": "GPT-4 by OpenAI",
    "content": "OpenAI released GPT-4, led by Sam Altman..."
}

entities = extractor.extract_entities(article)
# Returns: [
#   {"text": "OpenAI", "type": "organization", "confidence": 0.91},
#   {"text": "GPT-4", "type": "technique", "confidence": 0.96},
#   {"text": "Sam Altman", "type": "person", "confidence": 0.89}
# ]
```

### Entity Normalization
Automatically merges near-duplicate entities using Levenshtein distance:

```python
# "Geoffrey Hintin" vs "Geoffrey Hinton" → Merged (distance = 1)
# "OpenAI" vs "Open AI" → Merged (distance = 1)
```

Abbreviations with larger edit distances (e.g. "G. Hinton" for "Geoffrey Hinton") are linked via explicit aliases instead (see `add-alias` below).

Algorithm:

1. Title-case the extracted name
2. Compare it against existing entities of the same type
3. If the Levenshtein distance is ≤ 2, reuse the existing canonical name
4. Otherwise, create a new entity
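The steps above can be sketched in plain Python. The `levenshtein` helper below is a hand-rolled edit distance (the project may well use a library instead), and title-casing is deliberately skipped here because `str.title()` mangles names like "OpenAI":

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalize_name(name: str, existing: list, max_distance: int = 2) -> str:
    """Reuse an existing canonical name within max_distance edits, else keep the new name."""
    candidate = name.strip()
    for canonical in existing:
        if levenshtein(candidate, canonical) <= max_distance:
            return canonical
    return candidate
```

For instance, `normalize_name("Open AI", ["OpenAI"])` returns `"OpenAI"` (distance 1), while a genuinely new name is kept as-is.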
### Full-Text Search

An SQLite FTS5 virtual table enables fast entity search:
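What `search-entities` does can be approximated with Python's built-in `sqlite3`; a self-contained sketch against a throwaway in-memory table (sample rows are invented; the real table is defined under Database Schema below):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE VIRTUAL TABLE entities_fts USING fts5(
        entity_id UNINDEXED, canonical_name, aliases, description
    )
""")
conn.executemany(
    "INSERT INTO entities_fts VALUES (?, ?, ?, ?)",
    [
        ("e1", "Geoffrey Hinton", "Geoff Hinton, G. Hinton", "Deep learning pioneer"),
        ("e2", "Yann LeCun", "", "CNN pioneer"),
    ],
)
# MATCH searches all indexed columns (name, aliases, description) at once
rows = conn.execute(
    "SELECT canonical_name FROM entities_fts WHERE entities_fts MATCH ?",
    ("hinton",),
).fetchall()
print(rows)  # → [('Geoffrey Hinton',)]
```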
```shell
# Search entities by name, aliases, or description
aiwebfeeds nlp search-entities "hinton"
# Returns: Geoffrey Hinton, Geoff Hinton (alias)
```

## Usage
### CLI Commands

#### Extract Entities

```shell
aiwebfeeds nlp entities
```

Options:

- `--batch-size`: Number of articles to process (default: 50)
- `--force`: Reprocess all articles

```shell
# Process 25 articles
aiwebfeeds nlp entities --batch-size 25
```

#### List Entities

```shell
# List top 10 entities by frequency
aiwebfeeds nlp list-entities --limit 10
```

#### Show Entity Details

```shell
aiwebfeeds nlp show-entity "Geoffrey Hinton"
```

Shows:
- Entity metadata (type, aliases, frequency)
- Recent article mentions
- Related entities
#### Manage Entities

**Add Alias:**

```shell
aiwebfeeds nlp add-alias "Geoffrey Hinton" "G. Hinton"
```

**Merge Duplicate Entities:**

```shell
aiwebfeeds nlp merge-entities "Geoff Hinton" "Geoffrey Hinton"
```

**Search Entities (FTS5):**

```shell
aiwebfeeds nlp search-entities "transformer attention"
```

### Python API
```python
from ai_web_feeds.nlp import EntityExtractor
from ai_web_feeds.storage import Storage

extractor = EntityExtractor()
storage = Storage()

# Extract entities
article = storage.get_article_by_id(123)
entities = extractor.extract_entities(article)

# Store entities
for entity_data in entities:
    # Normalize name
    canonical_name = extractor.normalize_entity(
        entity_data["text"],
        entity_data["type"],
        existing_entities=storage.list_all_entity_names(),
    )

    # Get or create entity
    entity = storage.get_entity_by_name(canonical_name)
    if not entity:
        entity = storage.create_entity(
            canonical_name=canonical_name,
            entity_type=entity_data["type"],
        )

    # Record mention
    storage.create_entity_mention(
        entity_id=entity.id,
        article_id=article["id"],
        confidence=entity_data["confidence"],
        extraction_method="ner_model",
        context=entity_data["context"],
    )
```

### Batch Processing
Entity extraction runs hourly via APScheduler:

```python
from ai_web_feeds.nlp.scheduler import NLPScheduler

nlp_scheduler = NLPScheduler(scheduler)
nlp_scheduler.register_jobs()
# Registers: entity extraction job (every hour)
```

## Database Schema
### `entities` Table

```sql
CREATE TABLE entities (
    id TEXT PRIMARY KEY,  -- UUID
    canonical_name TEXT NOT NULL UNIQUE,
    entity_type TEXT NOT NULL CHECK(entity_type IN ('person', 'organization', 'technique', 'dataset', 'concept')),
    aliases TEXT,         -- JSON array
    description TEXT,
    metadata TEXT,        -- JSON object
    frequency_count INTEGER DEFAULT 0,
    first_seen DATETIME DEFAULT CURRENT_TIMESTAMP,
    last_seen DATETIME,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
```

### `entity_mentions` Table
```sql
CREATE TABLE entity_mentions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entity_id TEXT NOT NULL,
    article_id INTEGER NOT NULL,
    confidence REAL NOT NULL CHECK(confidence BETWEEN 0 AND 1),
    extraction_method TEXT NOT NULL CHECK(extraction_method IN ('ner_model', 'rule_based', 'manual')),
    context TEXT,  -- Surrounding text snippet
    mentioned_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (entity_id) REFERENCES entities(id),
    FOREIGN KEY (article_id) REFERENCES feed_entries(id)
);
```

### FTS5 Virtual Table
```sql
CREATE VIRTUAL TABLE entities_fts USING fts5(
    entity_id UNINDEXED,
    canonical_name,
    aliases,
    description
);
```

## Model Installation
The first run will download the spaCy model (~13 MB):

```shell
# Manual download (optional)
uv run python -m spacy download en_core_web_sm
```

Model info:

- Name: `en_core_web_sm`
- Size: ~13 MB
- Language: English
- Accuracy: ~85% F1 on OntoNotes 5.0
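The automatic first-run download can be implemented with spaCy's programmatic `spacy.cli.download` entry point; a sketch (not necessarily the project's actual loader):

```python
def load_spacy_model(name: str = "en_core_web_sm"):
    """Load a spaCy pipeline, downloading the model package on first use."""
    import spacy  # imported lazily so this module imports even without spaCy

    try:
        return spacy.load(name)
    except OSError:
        # Model package not installed yet: download it, then retry once.
        from spacy.cli import download
        download(name)
        return spacy.load(name)
```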
## Configuration

```python
class Phase5Settings(BaseSettings):
    entity_batch_size: int = 50
    entity_cron: str = "0 * * * *"  # Every hour
    entity_confidence_threshold: float = 0.7
    spacy_model: str = "en_core_web_sm"
```

Environment variables:
```shell
PHASE5_ENTITY_BATCH_SIZE=50
PHASE5_ENTITY_CONFIDENCE_THRESHOLD=0.7
PHASE5_SPACY_MODEL=en_core_web_sm
```

## Performance
- Throughput: ~50 articles/hour
- Memory: ~200MB (spaCy model loaded)
- Storage: ~50 bytes per entity mention
## Use Cases

### Track Influential Researchers

```shell
# Find top AI researchers by mention frequency
aiwebfeeds nlp list-entities --type person --limit 20
```

### Discover Emerging Techniques

```shell
# Find recently mentioned techniques
aiwebfeeds nlp list-entities --type technique --sort recent
```

### Build Knowledge Graphs

Connect entities by co-occurrence in articles:
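Using the `entity_mentions` schema from the Database Schema section, this reduces to a self-join; a runnable sketch with trimmed tables and invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down versions of the entities / entity_mentions tables
conn.executescript("""
    CREATE TABLE entities (id TEXT PRIMARY KEY, canonical_name TEXT NOT NULL UNIQUE);
    CREATE TABLE entity_mentions (entity_id TEXT NOT NULL, article_id INTEGER NOT NULL);
    INSERT INTO entities VALUES ('e1', 'GPT-4'), ('e2', 'RLHF');
    INSERT INTO entity_mentions VALUES ('e1', 1), ('e2', 1), ('e1', 2);
""")

# Self-join: keep articles that have a mention of each requested entity
rows = conn.execute("""
    SELECT DISTINCT m1.article_id
    FROM entity_mentions m1
    JOIN entities e1 ON e1.id = m1.entity_id AND e1.canonical_name = 'GPT-4'
    JOIN entity_mentions m2 ON m2.article_id = m1.article_id
    JOIN entities e2 ON e2.id = m2.entity_id AND e2.canonical_name = 'RLHF'
""").fetchall()
print(rows)  # → [(1,)]  (article 1 mentions both; article 2 mentions only GPT-4)
```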
```python
# Articles mentioning both "GPT-4" and "RLHF"
storage.get_articles_mentioning_entities(["GPT-4", "RLHF"])
```

## Troubleshooting
### Low Extraction Accuracy

**Symptom:** Many entities are missed or incorrectly classified.

**Solutions:**

- Use a larger spaCy model: `en_core_web_lg` (~40 MB, better accuracy)
- Add domain-specific rules for AI terminology
- Manual curation: add aliases for common variations
### Duplicate Entities

**Symptom:** "Geoffrey Hinton" and "Geoff Hinton" appear as separate entities.

**Solution:**

```shell
# Merge duplicates
aiwebfeeds nlp merge-entities "Geoff Hinton" "Geoffrey Hinton"

# Add alias
aiwebfeeds nlp add-alias "Geoffrey Hinton" "Geoff Hinton"
```

### spaCy Model Not Found
**Symptom:** `OSError: Can't find model 'en_core_web_sm'`

**Solution:**

```shell
uv run python -m spacy download en_core_web_sm
```

## See Also
- Quality Scoring - Article quality assessment
- Sentiment Analysis - Sentiment classification
- Topic Modeling - Discover subtopics