Fraudulent Domains: Typosquatting and IDN Homograph Attacks

There is a class of attack that does not exploit a software flaw, but a flaw in perception: the human eye reading an address bar. The user does everything "right" (checks the domain, sees it is their bank, sees the padlock) and still ends up in the attacker's hands. In this post I walk through the real techniques used to build fraudulent domains: typosquatting, combosquatting, bitsquatting and IDN homograph attacks with Unicode characters. How they work, why your brain fails to spot them, how to find them and how to defend yourself.

Understanding the offensive side is the only way to build detection that actually works, and that same hands-on mindset is the one we apply in our Web eXploitation Expert (WXE) course, SixHack Academy's expert-level web hacking course.

Note: every domain and brand in this article is fictitious and used for educational purposes only. Any resemblance to a real entity is coincidental.

The root problem: the domain you see is not the domain you get

To a user, a domain name is a string of text read at a glance. But underneath there are two layers an attacker can manipulate without the user noticing:

The cognitive layer: the brain does not read letter by letter, it reads word shapes. rnybrand.com (with "r"+"n") is processed as "mybrand" if you do not stop and look.
The encoding layer: since the introduction of IDNs (Internationalized Domain Names), a domain can contain characters from many Unicode scripts. Dozens of glyphs in Cyrillic, Greek or Armenian are visually identical to Latin letters, but to the computer they are completely different characters.

The attacker plays with one layer or both. Let's go type by type.

1. Typosquatting

Typosquatting (also "URL hijacking") consists of registering domains that are variations of a legitimate one, betting that the user makes a typing or reading mistake. There is no Unicode involved: these are plain ASCII characters, arranged to confuse.

The five classic families

Technique	Legitimate domain	Fraudulent variant	The trick
Omission	`yourbrand.com`	`yourbrnd.com`	A missing letter
Repeated character	`connecta.com`	`conecta.com`	A doubled letter typed only once
Transposition	`example.com`	`exapmle.com`	Two letters swapped
Substitution (adjacent keys)	`yourbank.com`	`yourbamk.com`	"n"→"m", next to each other on the keyboard
Insertion	`myshop.com`	`myshopp.com`	An extra character

On top of this there is TLD-squatting: same name, different top-level domain. The user typing from memory writes .com where the company uses .org, or lands on a .co (without the "m"), historically a hotbed of typosquatting because of its resemblance to .com.
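The five families in the table above are mechanical enough to generate in a few lines. What follows is a minimal sketch, not how any particular tool does it: the function name `typo_variants` and the `ADJACENT_KEYS` map are hypothetical, and the keyboard map is deliberately partial.

```python
# Hypothetical, deliberately partial QWERTY adjacency map (an assumption).
ADJACENT_KEYS = {"n": "bm", "m": "n", "a": "qs", "e": "wr", "o": "ip"}


def typo_variants(label: str) -> dict[str, set[str]]:
    """Group single-edit typo candidates for one domain label by family."""
    families: dict[str, set[str]] = {
        name: set()
        for name in ("omission", "repeated", "transposition",
                     "substitution", "insertion")
    }
    for i, ch in enumerate(label):
        prev_same = i > 0 and label[i - 1] == ch
        next_same = i + 1 < len(label) and label[i + 1] == ch
        dropped = label[:i] + label[i + 1:]
        # Dropping half of a double letter is the "repeated character" family.
        families["repeated" if prev_same or next_same else "omission"].add(dropped)
        # Swap this character with the next one (skip swaps that change nothing).
        if i + 1 < len(label) and not next_same:
            families["transposition"].add(
                label[:i] + label[i + 1] + ch + label[i + 2:])
        # Replace the character with a neighbouring key.
        for near in ADJACENT_KEYS.get(ch, ""):
            families["substitution"].add(label[:i] + near + label[i + 1:])
        # Type the same key twice.
        families["insertion"].add(label[:i] + ch + label[i:])
    return families


for family, names in typo_variants("yourbank").items():
    print(family, sorted(names)[:3])
```

The function works on a single label, so the TLD has to be re-attached (or swapped, for TLD-squatting) by the caller.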

ASCII homoglyphs: the trick that needs no Unicode

Even before getting into Unicode, the Latin alphabet already offers pairs of characters that are confused depending on the typeface. The most famous is the lowercase "l" and the uppercase "I", which in many sans-serif fonts are indistinguishable, plus the number "1":

# Same look, different domains
myportal.com     ← legitimate (lowercase l)
myportaI.com     ← uppercase I, identical in Arial/Helvetica (DNS ignores case, so what gets registered is myportai.com)
myporta1.com     ← with the number one

rn  vs  m  →  rnybrand.com   reads as  "mybrand.com"
vv  vs  w  →  myvveb.com     reads as  "myweb.com"
cl  vs  d  →  myclata.com    reads as  "mydata.com"

These are homoglyphs (glyphs that look the same) but still within ASCII. They need no IDN or punycode: they are registrable today at any registrar. That is why they are so widely used.
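As a rough illustration, the look-alike pairs listed above can be applied mechanically to a brand label. This is only a sketch: `ASCII_LOOKALIKES` and `ascii_homoglyph_variants` are hypothetical names, and the list holds just the pairs mentioned in this section. The uppercase "I" trick is left out because it depends on how the link is displayed, not on what is registered.

```python
# (what the attacker types, what it imitates), taken from the examples above.
ASCII_LOOKALIKES = [("rn", "m"), ("vv", "w"), ("cl", "d"), ("1", "l")]


def ascii_homoglyph_variants(label: str) -> set[str]:
    """Replace one imitated letter at a time with its ASCII look-alike."""
    variants = set()
    for fake, real in ASCII_LOOKALIKES:
        start = label.find(real)
        while start != -1:
            variants.add(label[:start] + fake + label[start + len(real):])
            start = label.find(real, start + 1)
    return variants


print(sorted(ascii_homoglyph_variants("mydata")))
```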

2. Combosquatting

Combosquatting does not alter the brand name: it combines it with another word. The domain contains the real brand, intact, which makes it especially convincing because when you scan for the brand in the string, there it is.

# The real brand + a hook word
yourbank-security.com
yourbank-verification.com
login-yourbank.com
yourbank.account-update.com    ← "yourbank" is a subdomain of account-update.com (!)

The subdomain trick. Look at the last example. The real domain is account-update.com (controlled by the attacker); yourbank is just a subdomain. The user reads left to right, sees "yourbank" first and relaxes. The part that actually determines who controls the site (the registrable domain) is on the right, and almost nobody looks there.
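To make "look at the right-hand side" concrete, here is a minimal sketch that extracts the registrable domain from a hostname. It rests on a loud assumption: `MULTI_LABEL_SUFFIXES` is a hypothetical three-entry stand-in for the full Public Suffix List, which a real check should use instead.

```python
# Hypothetical mini-list of multi-label public suffixes (an assumption);
# a real implementation should load the full Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "com.br"}


def registrable_domain(hostname: str) -> str:
    """Return the part of the hostname that identifies who controls it."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])


print(registrable_domain("yourbank.account-update.com"))
```

For the subdomain example this yields `account-update.com`: the brand on the left plays no part in who owns the site.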

Large-scale academic studies have shown that combosquatting is considerably more prevalent than classic typosquatting and that many of these domains stay active for years, precisely because they contain the legitimate brand and slip past superficial reviews.

3. Bitsquatting

Bitsquatting is the most curious of all because it relies on no human error: it relies on a hardware error. A single bit flipping from 0 to 1 (or the other way) in memory because of a transient fault (cosmic rays, heat, a defective module) can turn one character of the domain into another whose ASCII code differs by exactly one bit.

# One flipped bit = a different letter
'a' = 0110 0001  (0x61)
'c' = 0110 0011  (0x63)   ← flip bit 1 of 'a' (counting from bit 0 on the right) and you get 'c'
'e' = 0110 0101  (0x65)   ← flip bit 2 of 'a' and you get 'e'

mydata.com  →  mydcta.com   (one bit in the first 'a' → 'c')
mydata.com  →  mydeta.com   (another bit in the first 'a' → 'e')

The attacker registers every variant one bit away from a heavily queried domain (a CDN, an update server, a telemetry domain) and simply waits. With billions of DNS resolutions a day, statistics work in their favor: sooner or later, a machine with a corrupted bit will ask for their domain. It is opportunistic, but it is real and documented.
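Enumerating the one-bit neighbours of a label takes little code. The sketch below assumes a lowercase ASCII label; `bitsquat_variants` and `LABEL_CHARS` are illustrative names, not part of any tool. It flips each bit of each byte and keeps only results that are still valid hostname labels.

```python
import string

# Characters allowed in a hostname label (letters, digits, hyphen).
LABEL_CHARS = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(label: str) -> set[str]:
    """Return every valid label that differs from `label` by one flipped bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):  # each bit of the byte
            flipped = chr(ord(ch) ^ (1 << bit)).lower()  # DNS ignores case
            candidate = label[:i] + flipped + label[i + 1:]
            if (flipped in LABEL_CHARS and flipped != ch
                    and not candidate.startswith("-")
                    and not candidate.endswith("-")):
                variants.add(candidate)
    return variants


print(sorted(bitsquat_variants("mydata")))
```

Lowercasing the flipped character discards case-only flips (bit 5 turns "a" into "A"), which resolve to the original domain anyway.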

4. IDN homograph attacks: the heart of the problem

This is where it gets serious. Since domains accept international characters (IDN), the attacker is no longer limited to the Latin alphabet. And it turns out Unicode contains hundreds of characters that render exactly like Latin letters, but are different code points.

The canonical example: the "а" that was not an "a"

Character	Glyph	Code point	Alphabet
Latin a	a	`U+0061`	Latin
Cyrillic а	а	`U+0430`	Cyrillic
Latin e	e	`U+0065`	Latin
Cyrillic е	е	`U+0435`	Cyrillic
Latin o	o	`U+006F`	Latin
Cyrillic о	о	`U+043E`	Cyrillic
Latin c	c	`U+0063`	Latin
Cyrillic с	с	`U+0441`	Cyrillic

The pairs in the table are visually identical in practically any font, but the computer treats them as completely different characters. That means I can register a domain that looks like a well-known brand, but where one of the letters is Cyrillic. To the name system it is a brand-new, free domain; to the user's eye it is indistinguishable from the real one.
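The difference that the eye cannot see is easy to surface with Python's standard `unicodedata` module. A small sketch (the helper name `describe` is made up for this example) that prints the code point and official Unicode name of each character:

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text]


# Latin a, Cyrillic a, Latin e, Cyrillic e (written as escapes on purpose).
for ch, code_point, name in describe("a\u0430e\u0435"):
    print(ch, code_point, name)
```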

Punycode: how it is actually encoded

DNS only understands ASCII. When a domain contains Unicode characters, it is translated into an ASCII representation called punycode, with the xn-- prefix. Here is the key defensive insight: under the hood, the fraudulent domain looks nothing like the real one.

# inspecting a homograph domain
# 'access' written with Cyrillic characters that look Latin
$ python3 -c "print('ассеѕѕ.com'.encode('idna'))"
b'xn--80ak9aa5ha.com'

# A client without IDN protections displays "access.com", but DNS resolves THIS:
$ dig +short xn--80ak9aa5ha.com
203.0.113.42   ← not the legitimate brand's IP

$ whois xn--80ak9aa5ha.com | grep -i registrar
Registrar: Cheap-Registrar-LLC   ← red flag

A proof of concept in code, so the gap between what you see and what it is becomes clear:

# python3 · same look, different code point
legit = "access.com"             # all Latin
fake  = "\u0430\u0441\u0441\u0435\u0455\u0455.com"  # Cyrillic

print(legit)                     # access.com
print(fake)                      # ассеѕѕ.com   ← looks the same
print(legit == fake)             # False        ← but it is NOT
print(fake.encode("idna"))       # b'xn--80ak9aa5ha.com'
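Going the other way is just as useful when triaging a suspicious link: given an `xn--` name, decode it and list the characters that are not ASCII. This is a sketch using only the standard library's `punycode` codec; `decode_idn` and `non_ascii_chars` are hypothetical helper names.

```python
def decode_idn(hostname: str) -> str:
    """Decode xn-- labels to Unicode; other labels pass through unchanged."""
    decoded = []
    for label in hostname.lower().split("."):
        if label.startswith("xn--"):
            label = label[4:].encode("ascii").decode("punycode")
        decoded.append(label)
    return ".".join(decoded)


def non_ascii_chars(hostname: str) -> list[str]:
    """List the code points in the decoded hostname that are not ASCII."""
    return [f"U+{ord(ch):04X}" for ch in decode_idn(hostname) if ord(ch) > 127]


print(non_ascii_chars("xn--80ak9aa5ha.com"))
```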

Mixed-script vs single-script

There are two flavors of homograph attack, and the difference matters for defense:

Mixed-script (mixing alphabets): the domain combines Latin and Cyrillic, e.g. a single Cyrillic "a" inside yourbаnk.com. Modern browsers detect this mix and, by policy, show the punycode (xn--...) instead of the pretty glyph. It is the easiest case to mitigate.
Single-script (one whole alphabet): in the famous 2017 case, the name of a very well-known tech brand was reproduced entirely in Cyrillic. Because nothing was mixed, the mixed-script heuristic never fired in some browsers, which displayed the domain in the clear, identical to the real one. This forced the major browsers to harden their IDN display rules.

# mixed-script case (a single changed letter)
# 'yourbank' with ONE Cyrillic 'a' (U+0430), the rest Latin
# (uses the third-party idna package: pip install idna)
$ python3 -c "import idna; print(idna.encode('yourbаnk.com').decode())"
xn--yourbnk-6fg.com   ← this is the domain that actually gets registered
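The two flavors can be told apart with a crude heuristic. The sketch below approximates a character's script by the first word of its Unicode name, because the standard library does not expose the real Script property. That shortcut is an assumption, it is good enough for Latin versus Cyrillic, and it is not what browsers actually implement.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Approximate each letter's script by the first word of its Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}


def classify(label: str) -> str:
    scripts = scripts_in(label)
    if scripts <= {"LATIN"}:
        return "latin-only"
    if len(scripts) == 1:
        return "single-script-non-latin"
    return "mixed-script"


print(classify("yourb\u0430nk"))                    # one Cyrillic letter
print(classify("\u0430\u0441\u0441\u0435\u0455\u0455"))  # all Cyrillic
```

The single-script result is not a verdict by itself, since legitimate Cyrillic domains fall in the same bucket; it only tells you which of the two checks applies.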

Current state (2026). The major browsers apply Unicode's confusable-script detection rules (UTS #39) and show punycode when they detect risk. But the protection is not total: it depends on the TLD, the registry's policies and the specific character combination. And outside the browser (email clients, messaging apps, terminals, PDFs, notifications) the protection is usually much weaker or nonexistent. Email is still the most exploited vector.

Reconnaissance: how these variants are generated and detected

From the offensive side (and therefore from the side of whoever has to defend a brand), the go-to tool is dnstwist. It systematically generates a broad set of permutations of a domain (typos, homoglyphs, bitsquatting, TLD combinations, homographs) and checks which ones are registered and active.

# dnstwist against your own domain
$ pip install dnstwist

# Generate variants and check which are registered (--registered)
$ dnstwist --registered --mxcheck yourdomain.com

# Illustrative, abridged output (columns and fuzzer labels vary by version and options)
fuzzer       domain                  dns_a            dns_mx
homoglyph    yourdоmain.com          203.0.113.40     mail.bad.example  ← Cyrillic 'o' + active MX
tld          yourdomain.co           203.0.113.7      -
addition     yourdomain-support.com  203.0.113.9      mail.bad.example  ← combosquat with email
tld          yourdomain.net          203.0.113.2      -

# Export to JSON to wire into a monitoring pipeline
$ dnstwist --format json yourdomain.com > variants.json

An active MX record on a variant is the most worrying signal: it means that fraudulent domain can send and receive email, that is, it is set up for phishing, not just parked.
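Filtering the JSON export for that signal could look like the sketch below. The key names (`fuzzer`, `domain`, `dns_mx`) mirror the column headers shown above, and the convention that the original domain carries a fuzzer label starting with `*` is an assumption; check both against the output of your dnstwist version. `SAMPLE` is made-up data, not real output.

```python
import json


def variants_with_mx(dnstwist_json: str) -> list[str]:
    """Return variant domains that publish at least one MX record."""
    records = json.loads(dnstwist_json)
    return sorted(
        r["domain"] for r in records
        if r.get("dns_mx") and not r.get("fuzzer", "").startswith("*")
    )


# Made-up sample in the assumed shape of the JSON export.
SAMPLE = json.dumps([
    {"fuzzer": "*original", "domain": "yourdomain.com",
     "dns_mx": ["mail.yourdomain.com"]},
    {"fuzzer": "homoglyph", "domain": "yourd\u043emain.com",
     "dns_a": ["203.0.113.40"], "dns_mx": ["mail.bad.example"]},
    {"fuzzer": "tld", "domain": "yourdomain.net", "dns_a": ["203.0.113.2"]},
])

print(variants_with_mx(SAMPLE))
```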

Inspecting a suspicious domain by hand

# bash · unmasking hidden characters
# See the real bytes of a domain (reveals non-ASCII code points)
echo -n "yourbаnk.com" | hexdump -C
# If you see bytes like d0 b0 (UTF-8 of U+0430), there is Cyrillic in disguise

# Convert to punycode to see the "real" domain (third-party idna package: pip install idna)
python3 -c "import idna; print(idna.encode('yourbаnk.com').decode())"
# xn--yourbnk-6fg.com   ← this is the domain that actually gets registered

# Monitor Certificate Transparency: did anyone issue a cert
# for a variant of my brand?
curl -s "https://crt.sh/?q=%25yourdomain%25&output=json" | \
  jq -r '.[].name_value' | sort -u
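The same Certificate Transparency check can be post-processed in Python so that your own hostnames are filtered out and only unexpected names remain. This sketch works on an already downloaded JSON payload and relies only on the `name_value` field used in the jq command above; `unexpected_names` and `SAMPLE` are illustrative.

```python
import json


def unexpected_names(payload: str, own_domain: str) -> set[str]:
    """Return certificate names that are not own_domain or one of its subdomains."""
    names = set()
    for entry in json.loads(payload):
        # One entry can list several names separated by newlines.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower().removeprefix("*.")
            if name and name != own_domain and not name.endswith("." + own_domain):
                names.add(name)
    return names


# Made-up sample in the shape of a crt.sh JSON response.
SAMPLE = json.dumps([
    {"name_value": "yourdomain.com\nwww.yourdomain.com"},
    {"name_value": "*.yourdomain-support.com"},
    {"name_value": "login.yourdomain.com.evil.example"},
])

print(sorted(unexpected_names(SAMPLE, "yourdomain.com")))
```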

The full chain of a real attack

Putting the pieces together, this is how an attacker chains these techniques in a targeted phishing campaign:

# Typical attack chain
1. RECON      → dnstwist against the target brand; pick a free,
                 credible variant (homograph or combosquat)
2. REGISTER   → register the domain at a lax registrar; add MX
                 records so it can send email
3. TLS        → get a free certificate: the padlock shows
                 up exactly as on the real site
4. CLONE      → copy the legitimate site (login included)
5. DELIVERY   → send the email from the fraudulent domain; the
                 sender "looks" identical to the real one
6. CAPTURE    → the user enters credentials; they are relayed to
                 the real site so as not to raise suspicion

The padlock does not mean "safe". This is one of the most dangerous myths. A TLS certificate only guarantees that the connection is encrypted and that the server controls the exact domain shown in the address bar, which is not necessarily the domain the user thinks they are visiting. A homograph domain with a free certificate shows exactly the same padlock as the real bank. The padlock was never a guarantee of legitimacy.

Defense: what to do from each side

As a user

Distrust links: for sensitive sites (banking, email, work), type the URL yourself or use saved bookmarks.
If a domain shows up as xn--... in the bar, it is an IDN: stop and verify it. It is not necessarily malicious, but it deserves a second look.
The padlock does not validate the owner's identity. Look at the registrable domain (the label immediately to the left of the TLD or public suffix, plus that suffix), not the start of the string.

As a company / security team

Defensive registration: register the most obvious variants of your brand (common typos, the main TLDs, and at least the homograph punycode of your name). It is cheap compared to the cost of a successful phishing campaign.
Continuous monitoring: wire dnstwist and Certificate Transparency watching (crt.sh) into a pipeline that alerts you when a variant gets registered or a certificate is issued for one.
Harden email: SPF, DKIM and above all DMARC with a p=reject policy. This does not stop look-alike domains from being registered, but it stops attackers from spoofing your exact domain, which is half the problem.
Realistic training: phishing drills that include homograph and combosquatting domains, not just the typical clumsy "click here".
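For the email item, a quick sanity check on a DMARC record fits in a few lines. The sketch below only parses a record string that you have already fetched (the standard library cannot query TXT records); `parse_dmarc` and `enforces_reject` are hypothetical helpers, and the example record is invented.

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def enforces_reject(record: str) -> bool:
    """True if the record is DMARC and its policy is p=reject."""
    tags = parse_dmarc(record)
    return tags.get("v") == "DMARC1" and tags.get("p", "").lower() == "reject"


print(enforces_reject("v=DMARC1; p=reject; rua=mailto:dmarc@yourdomain.com"))
```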

As a registry / browser

Enforce each zone's IDN policies: many registries forbid mixing scripts within a single label.
Implement Unicode's confusable-character rules (UTS #39) to show punycode when a risky mix appears.
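The idea behind confusable detection can be sketched as a "skeleton" comparison: map every character to the letter it imitates and compare the results. The `CONFUSABLES` table below is a tiny hand-written subset covering only the characters used in this article, which is an assumption made for illustration; the real data set published alongside UTS #39 is far larger, and this is not the full algorithm.

```python
# Tiny hand-written subset of confusable mappings (an assumption for this
# example): Cyrillic a, e, o, es, dze mapped to the Latin letters they imitate.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
               "\u0441": "c", "\u0455": "s"}


def skeleton(label: str) -> str:
    """Map each character to the letter it imitates."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())


def imitates(candidate: str, protected: str) -> bool:
    """True if candidate differs from protected but has the same skeleton."""
    return candidate != protected and skeleton(candidate) == skeleton(protected)


print(imitates("yourb\u0430nk", "yourbank"))
```

Unlike a script-mix check, this comparison also catches the single-script case, because it looks at what the label resembles and not at which alphabets it contains.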

Want to take your web hacking to the expert level?

If you are interested in professional-level web application exploitation, the Web eXploitation Expert (WXE) course walks you through attacking and defending real applications, with hands-on labs and the same methodology we apply in bug bounty and responsible disclosure programs.

Frequently asked questions

Is it illegal to register a domain similar to a brand?

Registering the domain is, on its own, mainly an intellectual-property matter handled through UDRP-style disputes (cybersquatting), and it is not necessarily a crime. What is clearly illegal is using it to impersonate, defraud or capture credentials. In an audit or bug bounty context, all of this work is done on your own domains or with explicit authorization.

Don't browsers already protect against homograph attacks?

They protect partially: they show punycode when they detect suspicious script mixes, but coverage depends on the TLD and the character combination, and outside the browser (email, apps, terminals) protection is much weaker. It is not a solved problem.

What is the difference between typosquatting and combosquatting?

Typosquatting alters the brand name (changed, omitted or transposed letters). Combosquatting keeps the brand intact and adds another word (yourbank-security.com). The latter tends to be more convincing because the real brand appears as-is.

Does the HTTPS padlock help detect these domains?

No. The padlock only indicates an encrypted connection to that domain. A fraudulent domain can have a valid, free certificate and show the same padlock as the legitimate site. You have to look at the domain, not the padlock.

References

dnstwist: generation and detection of permuted domains (typo, homoglyph, bitsquatting, IDN).
Unicode UTS #39: security mechanisms and confusable characters.
RFC 3492: Punycode, the encoding of IDN to ASCII.
crt.sh: Certificate Transparency log search.
DMARC.org: email authentication to prevent spoofing of your own domain.
