Homoglyphs

Homoglyphs lives! This widely used Python library for handling homoglyphs is a fork of the original project maintained by orsinium.

Homoglyphs is a Python library for getting homoglyphs and converting them to ASCII.

Features

It's a smarter version of confusable_homoglyphs:

Automatic detection or manual selection of categories (aliases from ISO 15924).
Automatic or manual loading of only the needed alphabets into memory.
Conversion to ASCII.
More configurable.
More stable.

Installation

pip install homoglyphs_fork

Usage

The best way to explain something is to show how it works, so let's look at some real usage.

Importing:

import homoglyphs_fork as hg

Languages

# detect
hg.Languages.detect('w')
# {'pl', 'da', 'nl', 'fi', 'cz', 'sr', 'pt', 'it', 'en', 'es', 'sk', 'de', 'fr', 'ro'}
hg.Languages.detect('т')
# {'mk', 'ru', 'be', 'bg', 'sr'}
hg.Languages.detect('.')
# set()

# get alphabet for languages
hg.Languages.get_alphabet(['ru'])
# {'в', 'Ё', 'К', 'Т', ..., 'Р', 'З', 'Э'}

# get all languages
hg.Languages.get_all()
# {'nl', 'lt', ..., 'de', 'mk'}
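
The detection calls above behave like set-membership lookups. The following is a minimal sketch of that idea using only the standard library; `TOY_ALPHABETS`, `detect_languages`, and `toy_get_alphabet` are hypothetical names with deliberately tiny tables, not the library's own data or implementation.

```python
import string

# Toy alphabets (assumption): the real library ships much larger tables.
TOY_ALPHABETS = {
    'en': set(string.ascii_lowercase),
    'de': set(string.ascii_lowercase) | set('\u00e4\u00f6\u00fc\u00df'),
    # Cyrillic lowercase U+0430..U+044F plus U+0451
    'ru': {chr(code) for code in range(0x0430, 0x0450)} | {'\u0451'},
}


def detect_languages(char):
    """Return the codes of every toy alphabet that contains `char`."""
    return {code for code, chars in TOY_ALPHABETS.items() if char in chars}


def toy_get_alphabet(codes):
    """Return the union of the toy alphabets for the given language codes."""
    result = set()
    for code in codes:
        result |= TOY_ALPHABETS[code]
    return result


print(detect_languages('w'))       # the Latin-based toy alphabets
print(detect_languages('\u0442'))  # Cyrillic te: only the 'ru' toy alphabet
print(detect_languages('.'))       # punctuation belongs to no alphabet
```

Punctuation matches no alphabet in this sketch, which mirrors the empty set returned for '.' in the library example.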

Homoglyphs

Get homoglyphs:

# get homoglyphs (the Latin alphabet is loaded by default)
hg.Homoglyphs().get_combinations('q')
# ['q', '𝐪', '𝑞', '𝒒', '𝓆', '𝓺', '𝔮', '𝕢', '𝖖', '𝗊', '𝗾', '𝘲', '𝙦', '𝚚']

Alphabet loading:

# load alphabet on init by categories
homoglyphs = hg.Homoglyphs(categories=('LATIN', 'COMMON', 'CYRILLIC'))  # alphabet loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы', 'ꭇы', 'ꭈы', '𝐫ы', '𝑟ы', '𝒓ы', '𝓇ы', '𝓻ы', '𝔯ы', '𝕣ы', '𝖗ы', '𝗋ы', '𝗿ы', '𝘳ы', '𝙧ы', '𝚛ы']

# load alphabet on init by languages
homoglyphs = hg.Homoglyphs(languages={'ru', 'en'})  # alphabet will be loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы']

# set the alphabet manually on init (Latin letters, then Cyrillic lookalikes)
homoglyphs = hg.Homoglyphs(alphabet='abc абс')
homoglyphs.get_combinations('с')
# ['c', 'с']

# load alphabet on demand
homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)
# ^ alphabet will be loaded here for "en" language
homoglyphs.get_combinations('гы')
# ^ alphabet will be loaded here for "ru" language
# ['rы', 'гы']

You can combine categories, languages, alphabet, and strategy however you like. The strategy specifies how to handle characters that are not in the loaded alphabet:

STRATEGY_LOAD: load category for this character
STRATEGY_IGNORE: add character to result
STRATEGY_REMOVE: remove character from result
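
To make the difference between the three strategies concrete, here is a simplified sketch that applies each one to characters outside a loaded alphabet. It is not the library's code: `LOAD`, `IGNORE`, and `REMOVE` are stand-ins for the `hg.STRATEGY_*` constants, and `TOY_CATEGORIES` and `filter_text` are hypothetical names used only for illustration.

```python
# Stand-ins for hg.STRATEGY_LOAD / IGNORE / REMOVE (the real values may differ).
LOAD, IGNORE, REMOVE = 'load', 'ignore', 'remove'

# Toy category tables (assumption): the library's tables are much larger.
TOY_CATEGORIES = {
    'LATIN': set('abcdefghijklmnopqrstuvwxyz'),
    'CYRILLIC': {chr(code) for code in range(0x0430, 0x0450)},  # U+0430..U+044F
}


def filter_text(text, alphabet, strategy):
    """Apply `strategy` to every char of `text` that is outside `alphabet`.

    Returns the surviving text and the (possibly extended) alphabet.
    """
    alphabet = set(alphabet)  # copy so the caller's set is not mutated
    kept = []
    for char in text:
        if char not in alphabet:
            if strategy == REMOVE:
                continue  # drop the character from the result
            if strategy == LOAD:
                # pull in every toy category that contains the character
                for chars in TOY_CATEGORIES.values():
                    if char in chars:
                        alphabet |= chars
            # IGNORE (and LOAD) fall through: the character stays in the result
        kept.append(char)
    return ''.join(kept), alphabet


latin = TOY_CATEGORIES['LATIN']
mixed = 'a\u0433'  # Latin 'a' followed by Cyrillic ghe
print(filter_text(mixed, latin, IGNORE)[0])  # both characters kept
print(filter_text(mixed, latin, REMOVE)[0])  # only the Latin 'a' kept
print(len(filter_text(mixed, latin, LOAD)[1]) > len(latin))  # alphabet grew
```

In this sketch a character that belongs to no known category is simply kept under `LOAD`; how the library treats that case is not covered here.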

Converting glyphs to ASCII chars

homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)

# convert
homoglyphs.to_ascii('ТЕСТ')
# ['TECT']
homoglyphs.to_ascii('ХР123.')  # this is cyrillic "х" and "р"
# ['XP123.', 'XPI23.', 'XPl23.']

# by default, a string containing chars that can't be converted yields no results
homoglyphs.to_ascii('лол')
# []

# you can set a strategy that removes unconvertible non-ASCII chars from the result
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
)
homoglyphs.to_ascii('лол')
# ['o']

# you can also set the range of allowed ASCII char codes (0-128 by default):
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
    ascii_range=range(ord('a'), ord('z')),
)
homoglyphs.to_ascii('ХР123.')
# ['l']
homoglyphs.to_ascii('хр123.')
# ['xpl']
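
One detail worth checking when building an `ascii_range`: Python ranges exclude their upper bound, so `range(ord('a'), ord('z'))` stops before 'z'. The short check below uses plain Python only and assumes that the library tests whether a character's code point is a member of `ascii_range`; the variable names are illustrative.

```python
lowercase_exclusive = range(ord('a'), ord('z'))      # 97..121, stops before 'z'
lowercase_inclusive = range(ord('a'), ord('z') + 1)  # 97..122, includes 'z'

print(ord('z') in lowercase_exclusive)  # False
print(ord('z') in lowercase_inclusive)  # True
print(ord('1') in lowercase_exclusive)  # False: digits fall outside the range
```

If 'z' should be allowed, add 1 to the upper bound as in `lowercase_inclusive`.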

The to_ascii() method converts a string containing Unicode homoglyphs (characters that look like ASCII characters but are different Unicode code points) into a list of strings that use only ASCII characters. Each confusable character is replaced with its ASCII lookalikes, so one input can yield several candidate strings, and by default an input that cannot be fully converted yields an empty list. This is useful for normalizing text for security checks, comparison, or display.
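
As a rough illustration of why the result is a list, the sketch below builds every ASCII candidate by taking the Cartesian product of each character's lookalikes. It is a toy model under stated assumptions, not the library's implementation: `TOY_ASCII_LOOKALIKES` and `toy_to_ascii` are hypothetical names, the table holds only a handful of Cyrillic letters, and ASCII input is passed through unchanged (the library also offers lookalikes for some ASCII characters, as in the '1' example above).

```python
from itertools import product

# Toy lookalike table (assumption): Cyrillic code point -> ASCII lookalikes.
TOY_ASCII_LOOKALIKES = {
    '\u0422': ['T'],  # Cyrillic capital Te
    '\u0415': ['E'],  # Cyrillic capital Ie
    '\u0421': ['C'],  # Cyrillic capital Es
    '\u0425': ['X'],  # Cyrillic capital Ha
    '\u0420': ['P'],  # Cyrillic capital Er
    '\u043e': ['o'],  # Cyrillic small O
}


def toy_to_ascii(text, remove_unconvertible=False):
    """Return every ASCII rendering of `text` that the toy table allows."""
    per_char_options = []
    for char in text:
        if ord(char) < 128:
            options = [char]  # already ASCII: keep as is
        else:
            options = TOY_ASCII_LOOKALIKES.get(char, [])
        if not options:
            if remove_unconvertible:
                continue  # drop the character, like ascii_strategy=STRATEGY_REMOVE
            return []  # one unconvertible character rules out every candidate
        per_char_options.append(options)
    # one candidate string per combination of per-character choices
    return [''.join(combo) for combo in product(*per_char_options)]


print(toy_to_ascii('\u0422\u0415\u0421\u0422'))  # ['TECT']
print(toy_to_ascii('\u043b\u043e\u043b'))        # []
print(toy_to_ascii('\u043b\u043e\u043b', remove_unconvertible=True))  # ['o']
```

A character with several lookalikes multiplies the number of candidates, which is why longer inputs can return many strings.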

