The Problem with the F1 Score

For years, the F1 score has been the golden standard for evaluating AI models. It’s familiar, intuitive, and strikes a neat balance between precision and recall. But here’s the uncomfortable truth: in today’s rapidly evolving AI landscape—especially with the rise of generative AI—F1 score is dangerously incomplete.

AI systems are no longer static classifiers working in sanitized environments. They are agents acting in the wild—writing code, making API calls, or helping customers through complex decision trees. In this world, the F1 score becomes a rearview mirror: great for measuring performance on test data, but blind to the vast unknowns lurking just outside the frame.

Understanding the F1 Score

The F1 score measures how accurate a model is at detecting things it’s supposed to find—balancing two critical questions:

• Precision: When the model flags something, is it usually right?
• Recall: Does the model catch most of the things it should?
The formula: F1 = 2 × (Precision × Recall) / (Precision + Recall). A high F1 score means the model catches most harmful items AND rarely flags innocent ones—the balance matters.

F1 Score Interpretation

F1 Score Rating Interpretation
90–100% Excellent Model performs exceptionally well
80–89% Good Solid performance for most applications
70–79% Fair May need improvement for critical tasks
<70% Poor Significant improvement needed

Analogy: Think of a security guard with high precision but low recall—they only stop people when 100% certain, rarely making mistakes but missing many prohibited items. The F1 score balances this trade-off.

Coverage: The Metric We’ve Been Missing

While recall tells us how many known threats or patterns a model can detect, coverage asks a more existential question: how much of the real world can the model even recognize?
In simple terms, coverage measures the size of the model’s understanding. It’s about knowledge limits—not just how many toxic comments or jailbreak attempts the model got right, but how many variations of those patterns it could even comprehend.

Coverage vs. Recall: Key Differences

RECALL COVERAGE
"Of the harmful items in our test set, how many did we catch?" "Of all possible harmful items that could exist, how many can we recognize?"
Scope: Known, labeled examples only Scope: The entire theoretical knowledge limit

Put simply: Recall asks “How thorough were we on the exam?” Coverage asks “How much of reality did the exam cover?”

Bounded vs. Unbounded Problems

Not all AI safety and security detection problems are created equal. Understanding whether you’re dealing with a bounded or unbounded problem fundamentally changes how you should think about coverage.

BOUNDED PROBLEMS UNBOUNDED PROBLEMS
Toxicity Detection – Finite semantic patterns; variations are mostly surface-level Adversarial Prompts & Jailbreaks – Attackers actively invent new techniques; space constantly expands
Sentiment Analysis – Ways humans express like/dislike follow recognizable patterns Intent & Context – Same sentence means different things depending on situation
✓ High F1 + High Coverage achievable ⚠ High F1 can mask low coverage

The Hidden Gap: When High F1 Hides Low Coverage

Here’s the trap: you can have a model with a 94% F1 score for jailbreak detection and still miss 40% or more of actual threats. That’s not a performance win—it’s a security risk.

The data reveals a stark contrast between bounded and unbounded problems:

Problem Type F1 Score Est. Coverage
Toxicity Detection (bounded) 96% 94%
Jailbreak Detection (non-semantic) 94%+ 40–60%
Jailbreak Detection (semantic-based) 94% 70–85%

Notice that even with identical F1 scores, the coverage gap between approaches can be massive—a difference of 25-45 percentage points. This is the hidden risk that F1 alone cannot reveal.

Why Semantic Classifiers Change the Game

Enter FAISS-based semantic classifiers. Instead of just matching patterns, they measure meaning. That subtle shift allows for broader generalization and better detection of paraphrased, obfuscated, or novel attacks.
Key advantages of semantic similarity approaches:

  • Intent Detection: Detects meaning and intent rather than exact pattern matching—catches attacks that say the same thing differently
  • Generalization: Recognizes novel phrasings, paraphrases, and synonyms of known attack types without explicit training
  • High Precision Enables Breadth: Low false positive rate allows broader detection thresholds without overwhelming users with noise

FAISS vector similarity enables efficient nearest-neighbor search across high-dimensional embedding space, making it practical to deploy at scale.

The Remaining 15-30% Coverage Gap

Even with semantic approaches, a coverage gap remains. Understanding why helps us build better mitigations:

Emerging LLM Attack Characteristics Recommended Mitigations
Novel attack categories: Entirely new techniques that don’t resemble any known patterns Proxy models such as language classifiers to exclude non-semantic based attacks (Base64, morse code, etc.)
Multi-turn attacks: Sophisticated manipulations spread across conversation context Session monitoring with compound probability classification across all inputs
Adversarial evolution: Attackers continuously inventing new approaches Effective threat intelligence feeds with high frequency of model re-training

The Sample Multiplier Effect

How do we measure or estimate coverage? For bounded problems, estimates can be made—”Toxicity” for example is estimated to have a knowledge limit of ~100k samples (when using semantic similarity). For unbounded problems, repeated testing as datasets grow will eventually allow for an estimate.
Here’s where semantic approaches deliver a major efficiency advantage: dataset sample sizes yield significantly different coverage depending on the model architecture.

100K samples in FAISS can equal 500K–2M in effective coverage

FAISS Semantic Retrieval Transformer Training
5–20× effective coverage multiplier ~1× (no multiplier)
Performs lookup in continuous semantic space. A query finds its nearest neighbor — proximity equals relevance. Learns generalizable functions from training data. Gradient descent doesn't respect Euclidean neighborhoods.
100k clustered samples → 500k–2M effective coverage 100k samples → ~100k effective, possibly less

Why Does Clustering Amplify FAISS But Not Transformers?

For FAISS (clustering helps):

  • Each sample “covers” its Voronoi cell in embedding space
  • Minimizing max distance eliminates coverage gaps
  • Similar queries get similar results
  • No wasted samples on near-duplicates

For Transformers (clustering doesn’t help):

  • Models need variation, not coverage—learning requires diverse examples
  • Redundancy actually reinforces learning
  • Nearby embeddings ≠ useful gradients
  • Requires task-relevant density, not geometric coverage

Key Insight: FAISS optimizes for coverage (minimizing max distance). Transformers need density along task-relevant dimensions, which may not align with embedding similarity.

Sample Multiplier by Model Recall

FAISS semantic similarity enables higher sample utility as model recall improves. The relationship is roughly exponential:

Recall Multiplier Interpretation
0.75 1.5–2× Limited — needs verification
0.80 2–3× Moderate with spot-checks
0.85 3–5× Good reliability
0.90 5–8× High confidence zone
0.95 8–12× Excellent transfer
0.98 12–15× Near-optimal
0.99+ 15–20× Maximum multiplier

Practical example at 0.95 recall: 1,000 FAISS samples → 8,000–12,000 effective coverage, compared to just 1,000 for transformer-based approaches.
Below 0.90 recall, human verification is recommended to validate results.

Coverage as a Strategic Metric

When evaluating AI guardrails, especially those integrated into mission-critical workflows, coverage should be treated as a first-class metric. It’s not just about performance anymore—it’s about resilience. Can your model handle what it hasn’t seen before? Can it detect novel attacks, not just the ones it’s trained on?
This shift from performance metrics to capability metrics mirrors what happened in cybersecurity years ago. Detection rates weren’t enough—we needed threat intelligence, behavioral heuristics, zero-day detection. AI is now entering that same maturity curve.

The Road Ahead: Trust Through Breadth

As AI agents become more autonomous and more deeply embedded in our systems, we need more than just smart models—we need prepared ones. Models that don’t just ace the test, but understand the full syllabus of human behavior, adversarial strategy, and linguistic nuance.
F1 is a start, but it’s not the finish line. In a world where the threats are evolving faster than your training set, coverage is the only metric that can keep up.

Key Takeaways

  • F1 score alone is insufficient for unbounded problems like jailbreak detection
  • Coverage measures knowledge limits—how much of reality your model can recognize
  • Semantic approaches (FAISS) achieve 70-85% coverage vs. 40-60% for pattern matching for unbounded problems
  • The sample multiplier effect: 100K FAISS samples ≈ 500K–2M transformer samples
  • Higher model recall unlocks exponentially higher multipliers (up to 20×) when using semantic similarity
  • Treat coverage as a first-class metric for mission-critical AI deployments