A blazing-fast Rust CLI for measuring string distance with a focus on typosquatting and homoglyph attack detection.
sqdist detects when one string is trying to impersonate another. It targets the two
main families of name-based impersonation attack:
- Typosquatting: names one keyboard slip away from a real one (`gogle`, `gooogle`, `googel` for `google`), used for malicious lookalike domains and package names.
- Homoglyph spoofing: characters that look identical but are different Unicode code points, such as `pаypal` where the `а` is Cyrillic. Indistinguishable to a human, a completely different string to a byte comparison.
The hard part is that a homoglyph spoof and an innocent typo can have the exact same edit distance, so plain Levenshtein can't separate them. The skeleton-distance axes (below) make genuine spoofs collapse toward zero distance while real typos stay near 1.0, so you can alert on spoofs without false-alarming on honest fat-finger typos.
Typical uses:
- Brand / domain monitoring: scan newly-registered domains against a brand watchlist (`microsоft.com`?).
- Supply-chain defense: check new npm / PyPI / crates package names against popular names to catch malicious lookalikes before they're installed.
- Phishing / fraud filtering: flag deceptive sender names or URLs.
# Homebrew (macOS) — prebuilt, signed, notarized universal binary
brew install ericfitz/tap/sqdist
# Cargo (any platform with the Rust toolchain) — builds from source
cargo install sqdist

You can also grab a signed .pkg installer or the universal tarball directly
from the latest release.
To check your installed version:
sqdist --version
# sqdist 0.3.0 (d430de5)

sqdist computes a panel of 10 independent axes for each pair. They appear in this
order in both the --help AXES block and the JSON output:
AXES:
equal the strings are byte-identical (bool)
levenshtein min single-char insert/delete/substitute edits
damerau like levenshtein, but an adjacent swap counts as one edit
skeleton_levenshtein levenshtein after reducing both to UTS#39 skeletons
skeleton_damerau damerau after reducing both to UTS#39 skeletons
(~0 when visually identical, incl. multi-char confusables)
uts39_confusable_count # of aligned substitutions that are UTS#39-confusable (experimental)
uts39_skeleton_delta calculated as (damerau - skeleton_damerau); edits that vanish under
skeletonization (experimental, may change)
confusable_only true when the strings differ but share an identical skeleton
script_restriction UTS#39 restriction level 0-5 (higher = more mixed-script/suspicious)
keyboard_distance mean QWERTY key distance over substitutions, 0-1 (n/a if non-ASCII)
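The difference between the levenshtein and damerau axes shows up on adjacent swaps. The following is a minimal Python sketch, not sqdist's Rust code. It assumes the optimal-string-alignment variant of Damerau (which variant sqdist implements is not stated here), and the function names are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insert/delete/substitute edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def damerau_osa(a: str, b: str) -> int:
    """Like levenshtein, but an adjacent swap counts as one edit (OSA variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition: "el" <-> "le"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("google", "googel"), damerau_osa("google", "googel"))  # 2 1
```

A swap such as googel for google costs 2 under levenshtein and 1 under damerau, while a dropped letter (gogle) costs 1 under both.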
The homoglyph model uses the official Unicode UTS #39 confusables data
(confusables.txt, v17.0.0), the same authoritative source attacker tooling
targets. Two characters are treated as confusable when they share the same
skeleton (prototype) under UTS #39 — this transitively handles confusable
chains (e.g. Greek omicron → Latin o, Cyrillic о → Latin o, so all three are
mutually confusable).
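The transitive behaviour follows from comparing prototypes rather than characters. Here is a minimal Python sketch, assuming a three-entry excerpt of the confusables table. The real table has thousands of entries, and full UTS #39 skeletons also apply NFD normalization, which is omitted here. The names PROTOTYPE, skeleton, and confusable are illustrative:

```python
# Hypothetical excerpt: each source code point maps to its prototype.
PROTOTYPE = {
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON -> Latin o
    "\u043e": "o",  # CYRILLIC SMALL LETTER O    -> Latin o
    "\u0430": "a",  # CYRILLIC SMALL LETTER A    -> Latin a
}


def skeleton(text: str) -> str:
    """Replace every code point by its prototype; unmapped ones stay as-is."""
    return "".join(PROTOTYPE.get(ch, ch) for ch in text)


def confusable(x: str, y: str) -> bool:
    """Two strings are confusable when their skeletons are identical."""
    return skeleton(x) == skeleton(y)


# Greek omicron and Cyrillic o never map to each other directly,
# but both reduce to Latin o, so they compare as confusable.
print(confusable("\u03bf", "\u043e"))  # True
```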
A homoglyph spoof and a benign typo can have identical Levenshtein/Damerau
distance. The skeleton columns collapse homoglyphs — and also handle multi-char
confusables (rn→m) that per-character approaches miss:
| Pair | lev | damerau | skeleton_lev | skeleton_dam | uts39_delta | confusable_only | Kind |
|---|---|---|---|---|---|---|---|
| paypal vs pаypal | 1 | 1 | 0 | 0 | 1 | true | homoglyph spoof |
| rnicrosoft vs microsoft | 2 | 2 | 0 | 0 | 2 | true | multi-char spoof |
| google vs gogle | 1 | 1 | 1 | 1 | 0 | false | benign typo |
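Set a threshold on a skeleton axis to alert on the spoofs. Because -t matches when the metric is <= F, -t 0 -m skeleton_damerau selects only the two spoof rows above, while -t 1 also admits the one-edit typo.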
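The comparison above can be reproduced in miniature. The sketch below is illustrative Python, not sqdist's implementation. PROTOTYPE is a hypothetical two-entry excerpt of the confusables data, and only the Levenshtein pair of axes is shown:

```python
# Hypothetical excerpt: Cyrillic a -> Latin a, and the multi-char skeleton m -> rn.
PROTOTYPE = {"\u0430": "a", "m": "rn"}


def skeleton(text: str) -> str:
    """Map each code point through the table and concatenate the results."""
    return "".join(PROTOTYPE.get(ch, ch) for ch in text)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insert/delete/substitute edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


PAIRS = [
    ("paypal", "p\u0430ypal"),      # homoglyph spoof
    ("rnicrosoft", "microsoft"),    # multi-char spoof
    ("google", "gogle"),            # benign typo
]

for a, b in PAIRS:
    raw = levenshtein(a, b)
    skel = levenshtein(skeleton(a), skeleton(b))
    print(f"{a} vs {b}: levenshtein={raw} skeleton_levenshtein={skel}")
```

The raw distance cannot tell the first and third pairs apart (both are 1), while the skeleton distance drops to 0 only for the two spoofs.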
sqdist [OPTIONS] <STRING_A> <STRING_B> # single pair
sqdist [OPTIONS] --stdin # batch: pre-paired lines
sqdist [OPTIONS] --string <S> --list <FILE> # score <S> vs each line
OPTIONS:
-t, --threshold <F> Alert when the --metric distance <= F. Single-pair:
sets exit code. Batch: filters output; exit 1 if none match.
-m, --metric <AXIS> Numeric axis for -t and --sort (default skeleton_damerau)
--fields <LIST> Comma-separated axes to show (default: all). See AXES.
--confusables <LIST> Confusable sources for skeletons: uts39,flowcrypt,digraph (default uts39)
--len-tolerance <F> Max length-difference ratio for a spoof verdict (default 0.25)
-s, --stdin Batch: read TAB/comma pairs from stdin, emit JSONL
--string <S> (with --list) the single string to compare
--list <FILE> (with --string) score <S> against each non-blank line
--sort List mode: emit most-suspicious-first (buffers)
--top <N> List mode: keep only the N closest (implies --sort)
-j, --json Emit JSON (single-pair mode)
-v, --version Print version and commit, then exit
-h, --help This help
sqdist paypal pаypal # human-readable
sqdist -j microsoft micrоsoft # JSON
sqdist -t 1 apple аpple && echo "ALERT: likely spoof"

Feed candidate↔brand pairs (one per line, tab- or comma-separated). Emits a
JSON line per pair; with -t it emits only alerts at/under the threshold:
# scan npm/pypi candidate names against your brand watchlist
generate_pairs | sqdist --stdin -t 1 > alerts.jsonl

With -t, batch modes exit with code 0 if at least one alert was emitted and 1 if none matched.
Without -t, batch modes always exit 0. This makes it easy to use in shell conditionals:
if generate_pairs | sqdist --stdin -t 1 > alerts.jsonl; then
  echo "Found suspicious matches"
else
  echo "All clear"
fi

Process-spawn overhead is ~1 ms; the distance computation itself is sub-microsecond for typical identifier-length strings, so batch mode keeps everything in one process for high throughput.
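generate_pairs in the batch examples is a placeholder for whatever produces the pairs. One possible shape for it is sketched below in Python, under two assumptions: the candidate goes in the first column and the brand in the second, and names containing a separator character are skipped rather than escaped. The function and the sample lists are hypothetical:

```python
from itertools import product


def generate_pairs(candidates, brands):
    """Yield one tab-separated candidate/brand line per combination."""
    for candidate, brand in product(candidates, brands):
        # Tabs, commas, and newlines would be read as separators downstream.
        if any(sep in name for name in (candidate, brand) for sep in "\t,\n"):
            continue
        yield f"{candidate}\t{brand}"


brands = ["paypal", "microsoft"]        # hypothetical watchlist
candidates = ["paypa1", "rnicrosoft"]   # hypothetical new package names

for line in generate_pairs(candidates, brands):
    print(line)
```

Piping this script's output into sqdist --stdin keeps the whole scan in a single sqdist process.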
Score a single name against every line of a candidates file — closer to how you'd screen registry/Artifactory package names against a known-good name:
# emit JSONL (input/match keys), most-suspicious first, top 10
sqdist --string paypal --list candidates.txt --sort --top 10
# alert-only: skeleton_damerau distance at/under the threshold
sqdist --string paypal --list candidates.txt -t 1

By default list mode streams: each matching row is written to stdout as it is scored, so memory stays flat (~2 MB) no matter how large the file is. A million-line scan costs the same as a hundred.
--sort and --top are the exception. Ranking needs every row in hand before
it can order them, so those flags buffer all kept rows in memory first. Each
buffered row costs roughly ~100 bytes of fixed overhead + ~2× the candidate's
length in bytes — about 190 bytes per row for 16-character names. Peak
resident memory then scales with the number of rows kept:
| Rows kept (≈16-char names), --sort | Peak RSS |
|---|---|
| 10,000 | ~5 MB |
| 100,000 | ~23 MB |
| 1,000,000 | ~190 MB |
Rough rule of thumb when sorting: peak_MB ≈ 2 (baseline) + rows × (100 + 2 × avg_name_len) / 1e6.
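The rule of thumb translates directly into a small helper, handy for sizing a run before adding --sort. This Python sketch evaluates the formula exactly as written above; the function name is illustrative:

```python
def peak_mb(rows: int, avg_name_len: float, baseline_mb: float = 2.0) -> float:
    """Estimated peak memory in MB when sorting `rows` buffered rows."""
    return baseline_mb + rows * (100 + 2 * avg_name_len) / 1e6


for rows in (10_000, 100_000, 1_000_000):
    print(f"{rows:>9,} rows -> ~{peak_mb(rows, 16):.1f} MB")
```

For 16-character names this gives about 3.3, 15.2, and 134 MB, somewhat below the measured peaks in the table, so treat the estimate as a floor and leave headroom.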
-t cuts this down further even when sorting: a threshold drops non-matching
rows before they're buffered, so --sort -t 1 over a million lines that
alerts on only a handful stays near the ~2 MB baseline. So: stream by default;
add -t (and optionally --top) when you want a ranked view of a very large
list without holding it all in memory. (Chunking a huge list and scanning each
piece is also fine — results are independent per line.)
| Key | Type | Meaning |
|---|---|---|
| equal | bool | the strings are byte-identical |
| levenshtein | int | min single-char insert/delete/substitute edits |
| damerau | int | like levenshtein, but an adjacent swap counts as one edit |
| skeleton_levenshtein | int | levenshtein after reducing both to UTS#39 skeletons |
| skeleton_damerau | int | damerau after reducing both to UTS#39 skeletons; ~0 when visually identical including multi-char confusables |
| uts39_confusable_count | int | # of aligned substitutions that are UTS#39-confusable (experimental) |
| uts39_skeleton_delta | int | calculated as (damerau − skeleton_damerau); edits that vanish under skeletonization (experimental) |
| confusable_only | bool | true when strings differ but share an identical skeleton; the highest-confidence spoof signal |
| script_restriction | int | UTS#39 restriction level 0–5 of the pair (max of the two strings' levels). Direction: higher = more mixed-script and more suspicious. |
| keyboard_distance | float or null | mean physical US-QWERTY key distance over the substituted positions, normalized to [0,1]; near 0 = adjacent-key fat-finger typo, near 1 = far-apart/deliberate. Direction: higher = more different/suspicious. NA (JSON null, human n/a) when either string contains a non-ASCII character (keyboard distance is undefined there); zero substitutions = 0.0 (accurate). Self-contained QWERTY table (src/keyboard.rs); no dependency. |
AxisValue has an NA variant that renders as JSON null and human n/a; keyboard_distance uses it when either string is non-ASCII.
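To make the keyboard_distance semantics concrete, here is one possible way to compute a normalized mean key distance, written in Python. It is not the table in src/keyboard.rs. The key coordinates, the row stagger, the normalization by the largest letter-to-letter distance, and the restriction to equal-length strings are all assumptions of this sketch:

```python
import math

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
STAGGER = [0.0, 0.25, 0.75]  # assumed horizontal offset of each row, in key widths

KEY_POS = {
    ch: (col + STAGGER[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}
MAX_DIST = max(math.dist(p, q) for p in KEY_POS.values() for q in KEY_POS.values())


def keyboard_distance(a: str, b: str):
    """Mean normalized key distance over substituted positions, or None for NA."""
    if not (a.isascii() and b.isascii()):
        return None  # undefined for non-ASCII input
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length strings")
    dists = [
        math.dist(KEY_POS[x], KEY_POS[y]) / MAX_DIST
        for x, y in zip(a.lower(), b.lower())
        if x != y and x in KEY_POS and y in KEY_POS  # letters only
    ]
    return sum(dists) / len(dists) if dists else 0.0


print(keyboard_distance("google", "googke"))  # l -> k, adjacent keys: small value
print(keyboard_distance("google", "googqe"))  # l -> q, far apart: close to 1
```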
Single-pair and batch (stdin) modes use keys a and b; list mode uses input and match. Batch and list modes emit JSONL.
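Downstream tooling can consume the JSONL with any JSON parser. The following Python sketch keeps only rows flagged confusable_only and copes with both key conventions. The helper names are illustrative, and the sample line is the single-pair JSON shown in this README:

```python
import json


def spoof_rows(lines):
    """Yield parsed rows whose confusable_only flag is true."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("confusable_only"):
            yield row


def pair_of(row):
    """Return the compared strings for single-pair/stdin (a, b) or list mode."""
    if "a" in row:
        return row["a"], row["b"]
    return row["input"], row["match"]


SAMPLE = (
    '{"a":"paypal","b":"p\u0430ypal","equal":false,"levenshtein":1,"damerau":1,'
    '"skeleton_levenshtein":0,"skeleton_damerau":0,"uts39_confusable_count":1,'
    '"uts39_skeleton_delta":1,"confusable_only":true,"script_restriction":4,'
    '"keyboard_distance":null}'
)

for row in spoof_rows([SAMPLE]):
    # keyboard_distance arrives as None because JSON null marks NA.
    print(pair_of(row), row["skeleton_damerau"], row["keyboard_distance"])
```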
sqdist paypal pаypal # second 'a' is Cyrillic

equal false
levenshtein 1
damerau 1
skeleton_levenshtein 0
skeleton_damerau 0
uts39_confusable_count 1
uts39_skeleton_delta 1
confusable_only true
script_restriction 4
keyboard_distance n/a
[LIKELY SPOOF] The strings differ by 1 edit, but every differing character is a homoglyph (the strings are visually identical). High likelihood of an attempt to confuse.
sqdist -j paypal pаypal # second 'a' is Cyrillic

{"a":"paypal","b":"pаypal","equal":false,"levenshtein":1,"damerau":1,"skeleton_levenshtein":0,"skeleton_damerau":0,"uts39_confusable_count":1,"uts39_skeleton_delta":1,"confusable_only":true,"script_restriction":4,"keyboard_distance":null}

The --fields flag restricts output to a comma-separated list of axis keys. The identifier keys (a/b in single-pair and stdin, input/match in list mode) are always included regardless. Axes are printed in canonical order regardless of the input order. Invalid axis names produce an error listing the valid keys.
Example: show only damerau and confusable_only:
sqdist --fields damerau,confusable_only GOOGLE GO0GLE

Output:
damerau 1
confusable_only true
[LIKELY SPOOF] The strings differ by 1 edit, but every differing character is a homoglyph (the strings are visually identical). High likelihood of an attempt to confuse.
The same axes in JSON output:
sqdist -j --fields damerau,confusable_only GOOGLE GO0GLE

Output:

{"a":"GOOGLE","b":"GO0GLE","damerau":1,"confusable_only":true}

-m/--metric <AXIS> selects which numeric axis drives -t (threshold filtering and exit code) and --sort. Default: skeleton_damerau. The two bool axes (equal, confusable_only) are not valid metric axes and produce an error listing the valid numeric keys.
sqdist -m skeleton_damerau -t 1 paypal pаypal
sqdist --string microsoft --list candidates.txt --sort -m levenshtein

In single-pair human output (not -j JSON, not batch modes), a verdict line appears below the axis table.
The verdict is one of:
- [IDENTICAL]: the two strings are identical
- [LIKELY SPOOF]: strong signal of a homoglyph attack
- [LIKELY BENIGN]: a real typo or legitimate edit
A likely spoof is signaled when:
- All differing characters are homoglyphs (confusable_only = true), OR
- Homoglyphs account for more than half the Damerau distance within a length tolerance (default --len-tolerance 0.25)
This verdict helps distinguish attacks from honest typos and appears only in single-pair human output — it's never emitted in JSON mode or batch modes.
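The verdict rule can be paraphrased in code. The Python sketch below is a reading of the two conditions above, not sqdist's source. It assumes that "homoglyphs account for" is measured by damerau minus skeleton_damerau (the uts39_skeleton_delta axis) and that the length-difference ratio is the length difference divided by the longer length; both are guesses:

```python
def verdict(a, b, damerau, skeleton_damerau, confusable_only, len_tolerance=0.25):
    """Classify a pair from its already-computed axis values."""
    if a == b:
        return "IDENTICAL"
    if confusable_only:
        return "LIKELY SPOOF"
    # Assumption: edits that vanish under skeletonization count as homoglyph edits.
    homoglyph_edits = damerau - skeleton_damerau
    # Assumption: ratio is relative to the longer of the two strings.
    len_ratio = abs(len(a) - len(b)) / max(len(a), len(b))
    if 2 * homoglyph_edits > damerau and len_ratio <= len_tolerance:
        return "LIKELY SPOOF"
    return "LIKELY BENIGN"


print(verdict("paypal", "p\u0430ypal", 1, 0, True))  # LIKELY SPOOF
print(verdict("google", "gogle", 1, 1, False))       # LIKELY BENIGN
```

Under these assumptions, a pair with two homoglyph substitutions plus one ordinary typo (damerau 3, skeleton_damerau 1) still reads as a likely spoof, because homoglyphs explain two of the three edits.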
Some multi-character visual confusables are detected via full UTS#39
skeletonization — each string is reduced to its skeleton (every code point
mapped through the confusables table and concatenated) before measuring
distance. The classic example is the letter m, whose UTS#39 skeleton is
rn, so a spoof that swaps one for the other collapses to a zero-distance match:
- rn ↔ m (rnicrosoft vs microsoft): detected
This surfaces in the skeleton_damerau field (~0 for a pure multi-char spoof)
and sets confusable_only to true, even when the strings differ in length.
UTS#39 keys its multi-character skeletons on a small set of single code points
(among ASCII letters, essentially just m → rn); it does not provide reverse
vv → w style mappings. By default sqdist uses only pure UTS#39, so these three
spoofs are reported as ordinary edits (confusable_only false):
- vv ≈ w
- cl ≈ d
- nn ≈ m (note: rn ≈ m is caught; see above)
These gaps are closable via --confusables=digraph (see Supplemental
confusables below). The digraph source adds exactly those three curated mappings.
Leetspeak substitutions (3→e, 4→a) are deliberately excluded — UTS#39
does not consider them visually confusable and sqdist follows that policy.
--confusables <LIST> selects which confusable sources are active during
skeleton computation. The value is a comma-separated list; the default is
uts39.
| Source | Description | Default |
|---|---|---|
| uts39 | Unicode UTS#39 confusables data (authoritative, ~6565 entries). Always included. | yes |
| flowcrypt | FlowCrypt idn-homographs-database single-char supplement: 477 ASCII-base look-alikes anchored to UTS#39 skeletons. MIT-licensed. | no (opt-in) |
| digraph | Three curated multi-char mappings: vv→w, cl→d, nn→rn. Closes the three most common UTS#39 multi-char gaps. | no (opt-in) |
An unknown source name is an error (exit 2).
The default is pure UTS#39 — unchanged from prior releases. The supplemental
sources are opt-in and less authoritative. In particular, cl→d has a higher
false-positive rate (clear matches dear), so use the digraph source with
that caveat in mind. UTS#39 always wins on any collision.
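The layering of sources can be pictured as merging lookup tables, with UTS#39 applied last so it wins on collisions. This is a Python sketch under stated assumptions: the uts39 table is a two-entry excerpt, the flowcrypt data is omitted, matching is greedy longest-first from left to right, and all names are illustrative:

```python
SOURCES = {
    "uts39": {"\u0430": "a", "m": "rn"},            # hypothetical excerpt
    "flowcrypt": {},                                # data omitted in this sketch
    "digraph": {"vv": "w", "cl": "d", "nn": "rn"},  # the three mappings named above
}


def build_table(enabled=("uts39",)):
    """Merge the enabled sources; UTS#39 is always included and overrides the rest."""
    unknown = [name for name in enabled if name not in SOURCES]
    if unknown:
        raise ValueError(f"unknown confusable source: {unknown[0]}")
    table = {}
    for name in enabled:
        if name != "uts39":
            table.update(SOURCES[name])
    table.update(SOURCES["uts39"])  # applied last, so it wins on any collision
    return table


def skeleton(text, table):
    """Replace the longest matching key at each position, scanning left to right."""
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:  # no key matched at this position
            out.append(text[i])
            i += 1
    return "".join(out)


both = build_table(("uts39", "digraph"))
print(skeleton("devflovv", both) == skeleton("devflow", both))  # True
print(skeleton("clear", both) == skeleton("dear", both))        # True (the false positive)
```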
The skeleton axes (skeleton_levenshtein, skeleton_damerau,
confusable_only) reflect whichever sources are enabled. Changing
--confusables changes the behavior of those three axes.
# Default (pure UTS#39): vv is not a known confusable for w
sqdist -j devflovv devflow
# → "confusable_only":false, "skeleton_damerau":2
# With digraph source: vv→w mapping is active, gap closes
sqdist -j --confusables uts39,digraph devflovv devflow
# → "confusable_only":true, "skeleton_damerau":0

The flowcrypt source is derived from the
FlowCrypt idn-homographs-database,
licensed under the MIT License. The embedded data is pinned to commit f27b783
(dated 2021-05-26, retrieved 2026-05-24).
cargo build --release # -> target/release/sqdist
cargo test # unit tests in each module's #[cfg(test)] block; currently 88 tests

The confusables table is embedded at compile time (src/confusables_data.rs,
auto-generated from confusables.txt), so no runtime data files or network
access are needed for confusables. sqdist has one compiled dependency:
unicode-security (MIT/Apache-2.0), used for the script_restriction axis.
-v/--version prints the binary version, build commit, and one data: line
per embedded confusable source (regardless of --confusables):
sqdist 0.3.0 (<sha>)
data: UTS#39 confusables.txt v17.0.0 (2025-07-22)
data: FlowCrypt idn-homographs-database @ f27b783 (retrieved 2021-05-26)
If a new Unicode version ships, regenerate the embedded UTS#39 table:
curl -sSL https://www.unicode.org/Public/security/latest/confusables.txt -o confusables.txt
python3 scripts/gen_confusables.py # emits src/confusables_data.rs

The FlowCrypt single-char supplement is regenerated from the upstream
idn-homographs-database JSON. The 19 MB source file is not committed; fetch it
when you need to update:
# Online (fetches source JSON from GitHub)
python3 scripts/gen_flowcrypt.py # emits src/flowcrypt_data.rs
# Offline / pinned (provide local copies + provenance metadata)
python3 scripts/gen_flowcrypt.py \
  --homographs path/to/homographs.json \
  --confusables path/to/confusables.txt \
  --source-commit <sha> \
  --source-date <YYYY-MM-DD>