
How AI Content Filtering Actually Works (And Why Keyword Filters Aren't Enough)

Most parents assume content filtering is straightforward: maintain a list of bad words, block any message that contains one. This approach has been the backbone of internet safety tools for two decades. And it's fundamentally broken.

Why Keyword Filters Fail

Keyword filtering has three fatal flaws:

1. False Positives Everywhere

Keyword filters can't distinguish context. The word "kill" appears in:

  • "I killed my presentation today" (positive)
  • "How do you kill a process in Linux?" (technical)
  • "I want to kill myself" (crisis)

A keyword filter treats all three the same. The result? Either constant false alarms that parents learn to ignore, or filters so loose they miss actual concerns.
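To see why, here is a minimal substring-based keyword filter of the kind described above. The blocked-word list and function name are illustrative, not taken from any real product.

```python
# A naive keyword filter: flag a message if any blocked word appears
# anywhere in it as a substring. Word list is illustrative only.
BLOCKED_WORDS = {"kill", "drugs", "suicide"}

def keyword_flag(message: str) -> bool:
    """Return True if any blocked word appears in the message."""
    msg = message.lower()
    return any(word in msg for word in BLOCKED_WORDS)

# All three example messages get the identical flag, context notwithstanding:
print(keyword_flag("I killed my presentation today"))      # True (positive)
print(keyword_flag("How do you kill a process in Linux?"))  # True (technical)
print(keyword_flag("I want to kill myself"))                # True (crisis)
```

The filter has no way to tell the celebratory message from the cry for help; both produce the same boolean.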

2. Trivially Easy to Bypass

Children are creative. A keyword filter blocking "drugs" won't catch:

  • "What are some substances that make you feel different?"
  • "How do people at parties get stuff to relax?"
  • Misspellings, slang, or coded language

Any child who wants to bypass a keyword filter can do so in under a minute. The filter creates an illusion of safety while providing almost none.

3. Missing the Real Concerns

The most concerning messages often don't contain any flagged keywords. "I don't want to be here anymore" contains no trigger words. "Nobody would care if I was gone" is invisible to keyword matching. These are exactly the messages that need to be caught.
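The same substring filter sketched earlier makes the gap concrete; again, the word list is illustrative.

```python
# The naive filter from before, now shown missing genuinely concerning
# messages. Word list is illustrative only.
BLOCKED_WORDS = {"suicide", "kill", "self-harm"}

def keyword_flag(message: str) -> bool:
    msg = message.lower()
    return any(word in msg for word in BLOCKED_WORDS)

# Neither message contains a trigger word, so both pass untouched:
print(keyword_flag("I don't want to be here anymore"))   # False
print(keyword_flag("Nobody would care if I was gone"))   # False
```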

How LLM-Based Classification Works

Modern content safety uses the same technology that powers the AI itself — large language models — to classify content with human-like understanding.

Here's how it works in Ori:

Step 1: Pre-classification. Before the AI generates a response to a minor's message, a separate classification model evaluates the message. This model is specifically trained to detect concerning content across multiple categories.
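The control flow of this step can be sketched as follows. The `classify` stub and the return strings are hypothetical stand-ins, not Ori's actual API; a real implementation would call the separate classification model where the stub is.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Classification:
    category: Optional[str]  # e.g. "self_harm"; None when nothing is flagged
    severity: str            # "none", "low", "medium", or "high"

def classify(message: str) -> Classification:
    """Stand-in for the separate classification model call."""
    # A real system would invoke a small, fast LLM here; this stub exists
    # only so the control flow below is runnable.
    if "kill myself" in message.lower():
        return Classification("self_harm", "high")
    return Classification(None, "none")

def handle_message(message: str) -> str:
    # The classifier runs BEFORE any response is generated.
    result = classify(message)
    if result.severity == "high":
        return "blocked"          # crisis path: no normal response
    return "generate_response"    # safe: hand off to the main model
```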

Step 2: Context evaluation. The classifier doesn't just look at individual words. It evaluates the full message in context:

  • What is the user actually asking?
  • Is this educational curiosity or intent to act?
  • What's the emotional tone?
  • Does this require adult intervention?
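One common way to make a classifier context-aware is to put exactly these questions into its prompt. The template below is an illustrative sketch of that pattern, not Ori's actual prompt.

```python
# A classifier prompt template embedding the context questions above.
# Wording and structure are illustrative assumptions.
CLASSIFIER_PROMPT = """\
You are a content-safety classifier for messages written by minors.
Evaluate the full message in context, not individual words:

1. What is the user actually asking?
2. Is this educational curiosity or intent to act?
3. What is the emotional tone?
4. Does this require adult intervention?

Answer with a category and a severity level.

Message: {message}
"""

def build_classifier_prompt(message: str) -> str:
    """Fill the user's message into the classification prompt."""
    return CLASSIFIER_PROMPT.format(message=message)
```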

Step 3: Category and severity assignment. Each flagged message is categorized (explicit content, violence, drugs, self-harm, dangerous activities, hate speech, personal information sharing) and assigned a severity level.
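The seven categories and three severity levels map naturally onto enums; the identifier names below are illustrative, not Ori's schema.

```python
from enum import Enum

class Category(Enum):
    EXPLICIT_CONTENT = "explicit_content"
    VIOLENCE = "violence"
    DRUGS = "drugs"
    SELF_HARM = "self_harm"
    DANGEROUS_ACTIVITIES = "dangerous_activities"
    HATE_SPEECH = "hate_speech"
    PERSONAL_INFORMATION = "personal_information"

class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3  # integer values so severities can be compared and sorted
```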

Step 4: Appropriate response. Based on severity:

  • Low: AI responds normally. Flag stored for parent review.
  • Medium: AI responds with age-appropriate guidance. Parent notified.
  • High: Message blocked. AI provides caring redirection and crisis resources. Parent receives instant alert.
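This severity table translates directly into a routing function; the action names here are illustrative labels, not Ori's internals.

```python
# Map a severity level to the response policy described above.
def route(severity: str) -> dict:
    if severity == "low":
        return {"ai": "respond_normally", "parent": "store_flag"}
    if severity == "medium":
        return {"ai": "respond_with_guidance", "parent": "notify"}
    if severity == "high":
        return {"ai": "block_and_redirect", "parent": "instant_alert"}
    return {"ai": "respond_normally", "parent": "none"}  # unflagged message
```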

The Cost of Getting It Right

LLM-based classification isn't free. Running every message through a classifier adds latency and compute cost. But the cost of getting it wrong — missing a genuine cry for help, or providing harmful information to a child — is infinitely higher.

Ori uses efficient, fast models specifically for classification — separate from the main AI that generates responses. This keeps the safety check fast (under half a second) and cheap (fractions of a cent per classification) while maintaining high accuracy.
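A latency budget like this can be enforced with a simple wrapper. What to do when the check is slow or errors out is a design choice the article doesn't specify; escalating for review ("failing closed") is assumed here as the conservative option.

```python
import time

BUDGET_SECONDS = 0.5  # the sub-half-second target mentioned above

def classify_with_budget(message: str, classify) -> str:
    """Run a classifier callable under a latency budget.

    Assumption: on error or budget overrun, we fail closed and
    escalate rather than skip the safety check.
    """
    start = time.monotonic()
    try:
        severity = classify(message)
    except Exception:
        return "escalate"                    # classifier error: fail closed
    if time.monotonic() - start > BUDGET_SECONDS:
        return "escalate"                    # over budget: fail closed
    return severity
```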

What Parents Should Look For

When evaluating AI safety tools for your family, ask these questions:

  1. Does it use context-aware classification or just keyword matching? If the documentation mentions "keyword filtering" or "blocked word lists," it's using outdated technology.
  2. Does it distinguish severity levels? Not every flag is equally urgent. A system that treats all flags the same will either overwhelm you with alerts or miss critical situations.
  3. Can it detect implicit meaning? Test it. A good classifier catches "I don't want to be here anymore" without the word "suicide" ever appearing.
  4. Is it transparent? Can your child see that monitoring is active? Systems that operate covertly teach children to hide rather than communicate.
  5. Does it adapt? As AI evolves, the safety system needs to evolve too. Look for systems backed by active development, not static rule sets.

The Standard Is Rising

As more families adopt AI assistants, the expectation for safety will rise with it. Keyword filters will be recognized as the digital equivalent of a "Do Not Enter" sign — technically present, functionally useless.

The future of AI safety is context-aware, severity-graded, transparent classification that respects both the child's privacy and the parent's need for oversight. That future is already here.
