Name: Netallion AI Assurance
Author: Netallion

What is Byte-Pair Encoding?

Byte-Pair Encoding (BPE) is a subword tokenization algorithm originally developed for data compression and later adopted by natural language processing systems including GPT, Claude, and other large language models. The algorithm works by iteratively finding the most frequent pair of adjacent bytes (or characters) in a corpus and merging them into a single token. After many iterations, the vocabulary contains a mix of individual characters, common character sequences, and full words.
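The merge loop described above can be sketched in a few lines. This is a minimal illustration on a toy word list, not the training code of any production tokenizer; the function names `merge_pair` and `train_bpe` and the corpus are assumptions made for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy sketch)."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair merged into one symbol.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[merge_pair(symbols, best)] += count
        vocab = new_vocab
    return merges

# Toy corpus: frequent words dominate the pair counts.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(train_bpe(corpus, 4))
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

After four merges the frequent word "low" and the frequent suffix "est" have each become a single symbol, while rarer sequences remain split into characters.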

For example, the string "ghp_a3Bf9xKmZ2" (a shortened stand-in for a GitHub personal access token) would be tokenised differently from the English word "application". The token for "application" likely exists as a single entry in the vocabulary because it appears frequently in training data. The GitHub token, however, breaks into many small, uncommon subword fragments because the characters after its prefix are essentially random. This difference in tokenization behaviour is exactly what makes BPE useful for secret detection.

How BPE Applies to Security

Traditional secret detection relies on two primary techniques: regex pattern matching and Shannon entropy analysis. Regex works well when the secret format is known in advance. Long-term AWS access key IDs start with "AKIA", GitHub tokens carry prefixes such as "ghp_" or "gho_", and Azure connection strings contain recognisable keywords. But when organisations use custom token formats, internal credential schemes, or base64-encoded secrets without a predictable prefix, regex has no pattern to match.

Shannon entropy measures the randomness of a string. High-entropy strings are more likely to be secrets than low-entropy strings. But entropy analysis alone produces excessive false positives because many legitimate strings, such as UUIDs, base64-encoded images, hashes in URLs, and minified JavaScript, also have high entropy. It also misses secrets that are relatively short or that use a restricted character set, resulting in lower entropy than the detection threshold.
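A small sketch of per-character Shannon entropy makes the short-string problem concrete. The function name `shannon_entropy` and the threshold of 4.0 bits per character are assumptions chosen for illustration; the sample string is the 12-character password used later in this article.

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy of `s` in bits per character, from its own character frequencies."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

THRESHOLD = 4.0  # hypothetical detection threshold, bits per character

password = "kX9mP2vL8nQ4"
print(round(shannon_entropy(password), 3))    # 3.585
print(shannon_entropy(password) > THRESHOLD)  # False
```

A string of n characters can never score above log2(n) bits per character under this measure, so a 12-character secret tops out at about 3.585 even when every character is distinct. Any threshold set above that value cannot flag it.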

BPE tokenization addresses both problems by operating at the subword level. When a string is tokenised with a BPE vocabulary trained on natural language, secrets fragment into many small, rare tokens while natural language text maps to fewer, common tokens. By measuring the average token frequency or the number of tokens per character, the system can distinguish secrets from non-secrets more reliably than entropy scoring alone.
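The tokens-per-character measurement can be sketched as follows. The merge list here is hand-written and hypothetical: it stands in for rules a tokenizer might learn from English text, whereas a real vocabulary would hold tens of thousands of merges. The names `MERGES`, `bpe_encode`, and `tokens_per_char` are assumptions for the example.

```python
# Hypothetical merge rules in rank order (earlier = learned earlier = more frequent).
MERGES = [
    ("o", "n"), ("i", "on"), ("a", "t"), ("at", "ion"), ("p", "l"),
    ("a", "p"), ("i", "c"), ("pl", "ic"), ("ap", "plic"), ("applic", "ation"),
]
RANKS = {pair: rank for rank, pair in enumerate(MERGES)}

def bpe_encode(text):
    """Tokenise `text` by repeatedly applying the best-ranked merge that is present."""
    symbols = list(text)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: RANKS.get(p, float("inf")))
        if best not in RANKS:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def tokens_per_char(text):
    """Fragmentation density: 1.0 means every character is its own token."""
    return len(bpe_encode(text)) / len(text) if text else 0.0

print(bpe_encode("application"))        # ['application']
print(tokens_per_char("kX9mP2vL8nQ4"))  # 1.0
```

The English word collapses to one token (about 0.09 tokens per character), while the random string finds no applicable merge and stays at one token per character.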

Why BPE Outperforms Regex and Entropy-Only Detection

The advantage is structural, not a matter of tuning. Entropy-only detection is trapped on a single dial: raise the threshold and it misses generic secrets with restricted character sets or short lengths; lower it and it drowns in UUIDs, hashes, and base64 that are just as random-looking but legitimate. BPE operates one level deeper. A true credential fragments into many rare subword tokens, while a UUID or hash tends to carry the recurring byte pairs BPE has learned to merge, so the two separate into different token profiles even when their raw entropy is identical. That is why BPE tokenization delivers substantially higher recall than entropy-only analysis while producing fewer false positives: it reads a structural difference between random-looking credentials and random-looking but legitimate content, a difference that entropy scoring cannot see.

Regex detection, while precise for known formats, has zero recall for unknown formats. If an organisation creates a custom API key format like "myco_live_" followed by 32 random characters, no regex-based scanner will detect it until someone writes a pattern for that specific format. BPE detects it immediately because the random portion fragments into rare subword tokens regardless of the prefix.
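A short sketch shows the recall gap. The two patterns below are simplified illustrations of publicly documented prefixes, not the rule set of any particular scanner; `KNOWN_PATTERNS` and `match_known` are names invented for the example, and the AWS value is the placeholder key used in AWS documentation.

```python
import re

# Simplified, illustrative patterns for two well-known credential formats.
KNOWN_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
}

def match_known(text):
    """Return the names of all known patterns that match somewhere in `text`."""
    return [name for name, pattern in KNOWN_PATTERNS.items() if pattern.search(text)]

# A documented placeholder key in a known format is caught.
print(match_known("AWS_KEY=AKIAIOSFODNN7EXAMPLE"))  # ['aws_access_key_id']

# A custom format with 32 random characters matches nothing.
custom_key = "myco_live_" + "Zk3Qp9Xv2Lm8Rt5Yw1Bn6Hc4Jd7Fs0Ga"
print(match_known("api_key=" + custom_key))  # []
```

Until a "myco_live_" pattern is added to the dictionary, the second string passes through a regex-only scanner untouched.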

The combination of all three approaches, regex, entropy, and BPE tokenization, creates a layered detection engine. Regex catches known formats with high precision and near-zero latency. Entropy provides a fast pre-filter to identify candidate strings. BPE tokenization then analyses the candidates to confirm whether they are truly secrets or benign high-entropy content. This layered approach achieves the best of all three worlds: broad coverage, high precision, and low latency.
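The layering can be sketched as a three-stage function. Everything here is an assumption made for illustration: the single regex, the entropy floor of 3.0, the fragmentation threshold of 0.6, and especially `toy_tokenize`, a greedy longest-match stand-in for a real BPE tokenizer with a two-word vocabulary.

```python
import math
import re
from collections import Counter

KNOWN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")  # simplified known-format pattern

def entropy(s):
    """Shannon entropy in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def toy_tokenize(text, vocab=("configuration", "application")):
    """Stand-in tokenizer: longest vocabulary word at each position, else one character."""
    tokens, i = [], 0
    by_length = sorted(vocab, key=len, reverse=True)
    while i < len(text):
        token = next((w for w in by_length if text.startswith(w, i)), text[i])
        tokens.append(token)
        i += len(token)
    return tokens

def classify(candidate, tokenize=toy_tokenize, entropy_floor=3.0, frag_threshold=0.6):
    # Stage 1: known formats are decided by regex alone.
    if KNOWN.search(candidate):
        return "secret (regex)"
    # Stage 2: cheap entropy pre-filter drops obviously non-random strings.
    if entropy(candidate) < entropy_floor:
        return "benign (low entropy)"
    # Stage 3: token fragmentation separates secrets from benign high-entropy text.
    fragmentation = len(tokenize(candidate)) / len(candidate)
    return "secret (bpe)" if fragmentation >= frag_threshold else "benign (bpe)"

print(classify("AKIAIOSFODNN7EXAMPLE"))       # secret (regex)
print(classify("aaaaaaaaaaaa"))               # benign (low entropy)
print(classify("configuration_application"))  # benign (bpe)
print(classify("kX9mP2vL8nQ4"))               # secret (bpe)
```

The third and fourth inputs both clear the entropy floor (about 3.51 and 3.58 bits per character), so only the tokenization stage tells them apart.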

Real-World Examples

Consider a developer who pastes a database connection string into a Slack message: Server=prod-db.internal;Database=customers;User=svc_api;Password=kX9mP2vL8nQ4. A regex scanner would catch this because the format includes recognisable keywords like "Password=". But if the same developer pastes just the password value kX9mP2vL8nQ4 in a follow-up message, regex has nothing to match. Entropy analysis might flag it, but it might also miss it because the string is only 12 characters long and uses a limited character set. BPE tokenization fragments this into rare subword tokens and flags it as a likely credential.

Another common scenario involves base64-encoded secrets embedded in configuration files or log entries. A string like eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkw is the beginning of a JSON Web Token (JWT): a base64url-encoded header followed by a truncated payload. Regex can match the JWT structure, but a partial JWT or a custom base64-encoded API key without the JWT header would be invisible to regex. BPE tokenization detects the high fragmentation of the base64 payload and flags it for review.
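Decoding the two segments of that example string shows where the recognisable structure comes from. This is a minimal sketch using only the standard library; `decode_segment` is a hypothetical helper name.

```python
import base64
import json

def decode_segment(segment):
    """Decode one base64url segment, restoring the padding that JWTs strip."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)

header_b64, payload_b64 = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkw".split(".")

print(json.loads(decode_segment(header_b64)))  # {'alg': 'HS256'}
print(decode_segment(payload_b64))             # b'{"sub":"1234567890'
```

Both segments start with "eyJ" because each encodes a JSON object beginning with an opening brace and a quoted key, which is the fixed anchor a regex can match. The payload decodes to incomplete JSON, confirming that the example is a truncated token. A base64-encoded key with no JSON header offers no such anchor.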

How Netallion AI Assurance Uses BPE Tokenization

Netallion AI Assurance's detection engine combines 467 regex patterns with BPE tokenization and 20 live verifiers. Every text surface, including Azure Monitor logs, pull requests, Slack messages, Jira tickets, and outbound AI prompts, is processed through the same engine. The BPE tokenizer runs as a second-pass analyser on content that does not match any known regex pattern, ensuring that novel and custom credential formats are caught without requiring manual pattern updates.

When a finding is generated by BPE tokenization, the system assigns a confidence score based on token fragmentation density, contextual signals (such as proximity to keywords like "key", "token", or "password"), and string length. High-confidence findings are then passed to live verifiers that test whether the credential is actually active, eliminating false positives from revoked or test credentials.
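One way such a score could be combined is a weighted sum of the three signals. This is a sketch under stated assumptions, not Netallion's scoring formula: the weights, the 32-character length cap, the keyword regex, and the function name `confidence` are all hypothetical.

```python
import re

# Hypothetical context keywords, taken from the examples named in the text.
CONTEXT_KEYWORDS = re.compile(r"key|token|password", re.IGNORECASE)

def confidence(candidate, tokens, context, weights=(0.6, 0.25, 0.15)):
    """Combine fragmentation, context, and length into a score between 0 and 1.

    `tokens` is the candidate's tokenization from whatever BPE tokenizer is in use;
    `context` is the text surrounding the candidate.
    """
    if not candidate:
        return 0.0
    w_frag, w_ctx, w_len = weights  # hypothetical weights that sum to 1
    fragmentation = min(len(tokens) / len(candidate), 1.0)
    has_context = 1.0 if CONTEXT_KEYWORDS.search(context) else 0.0
    length = min(len(candidate) / 32, 1.0)  # assumed cap: 32 chars counts as full length
    return w_frag * fragmentation + w_ctx * has_context + w_len * length

secret = "kX9mP2vL8nQ4"
fully_fragmented = list(secret)  # one token per character

print(round(confidence(secret, fully_fragmented, "Password="), 5))       # 0.90625
print(round(confidence(secret, fully_fragmented, "see you at noon"), 5))  # 0.65625
```

With the same string and the same tokenization, the nearby "Password=" keyword lifts the score by exactly the context weight. That shows how contextual signals could push a borderline finding over a verification threshold.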

BPE Tokenization for Secret Detection: Higher Recall Than Entropy-Only

