There is a class of attack that does not exploit a software flaw, but a flaw in perception: the human eye reading an address bar. The user does everything "right" (checks the domain, sees it is their bank, sees the padlock) and still ends up in the attacker's hands. In this post I walk through the real techniques used to build fraudulent domains: typosquatting, combosquatting, bitsquatting and IDN homograph attacks with Unicode characters. How they work, why your brain fails to spot them, how to find them and how to defend yourself.
Understanding the offensive side is the only way to build detection that actually works, and that same hands-on mindset is the one we apply in our Web eXploitation Expert (WXE) course, SixHack Academy's expert-level web hacking course.
Note: every domain and brand in this article is fictitious and used for educational purposes only. Any resemblance to a real entity is coincidental.
The root problem: the domain you see is not the real domain
To a user, a domain name is a string of text read at a glance. But underneath there are two layers an attacker can manipulate without the user noticing:
- The cognitive layer: the brain does not read letter by letter, it reads word shapes. rnybrand.com (with "r"+"n") is processed as "mybrand" if you do not stop and look.
- The encoding layer: since the introduction of IDNs (Internationalized Domain Names), a domain can contain characters from any Unicode alphabet. There are dozens of glyphs in Cyrillic, Greek or Armenian that are visually identical to Latin letters, but to the computer they are completely different characters.
The attacker plays with one layer or both. Let's go type by type.
1. Typosquatting
Typosquatting (also "URL hijacking") consists of registering domains that are variations of a legitimate one, betting that the user makes a typing or reading mistake. There is no Unicode involved: these are plain ASCII characters, arranged to confuse.
The five classic families
| Technique | Legitimate domain | Fraudulent variant | The trick |
|---|---|---|---|
| Omission | yourbrand.com | yourbrnd.com | A missing letter |
| Repeated character | connecta.com | conecta.com | One letter short in a double |
| Transposition | example.com | exapmle.com | Two letters swapped |
| Substitution (adjacent keys) | yourbank.com | yourbamk.com | "n"→"m", next to each other on the keyboard |
| Insertion | myshop.com | myshopp.com | An extra character |
On top of this there is TLD-squatting: same name, different top-level domain. The user typing from memory writes .com where the company uses .org, or lands on a .co (without the "m"), historically a hotbed of typosquatting because of its resemblance to .com.
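The five families in the table above are mechanical enough to generate in a few lines. What follows is a minimal sketch, not a replacement for a real permutation engine: `typo_variants` and `QWERTY_NEIGHBORS` are hypothetical names, the keyboard map is deliberately partial, and only the first label of a lowercase domain is mutated.

```python
# Partial QWERTY adjacency map, for illustration only (an assumption, not a standard).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "b": "vghn", "k": "jiolm", "m": "njk", "n": "bhjm",
    "o": "iklp", "r": "edft", "u": "yhji", "y": "tghu",
}

def typo_variants(domain: str) -> dict[str, set[str]]:
    """Return single-edit variants of the first label, grouped by family."""
    name, _, tld = domain.partition(".")
    families: dict[str, set[str]] = {
        "omission": set(), "repeated character": set(), "transposition": set(),
        "substitution": set(), "insertion": set(),
    }
    for i, ch in enumerate(name):
        dropped = name[:i] + name[i + 1:]
        # Omission: any single character goes missing.
        families["omission"].add(dropped)
        # Repeated character: a doubled letter collapses into one.
        if i + 1 < len(name) and ch == name[i + 1]:
            families["repeated character"].add(dropped)
        # Transposition: two neighbouring characters swap places.
        if i + 1 < len(name) and ch != name[i + 1]:
            families["transposition"].add(name[:i] + name[i + 1] + ch + name[i + 2:])
        # Substitution: the finger lands on an adjacent key.
        for near in QWERTY_NEIGHBORS.get(ch, ""):
            families["substitution"].add(name[:i] + near + name[i + 1:])
        # Insertion: the same key is typed twice.
        families["insertion"].add(name[:i + 1] + ch + name[i + 1:])
    return {
        family: {f"{v}.{tld}" for v in variants if v and v != name}
        for family, variants in families.items()
    }
```

Calling `typo_variants("yourbank.com")["substitution"]` yields, among others, the `yourbamk.com` variant from the table. A defender would feed the result into a DNS lookup to see which variants are already registered.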
ASCII homoglyphs: the trick that needs no Unicode
Even before getting into Unicode, the Latin alphabet already offers pairs of characters that are confused depending on the typeface. The most famous is the lowercase "l" and the uppercase "I", which in many sans-serif fonts are indistinguishable, plus the number "1":
    myportal.com   ← legitimate (lowercase l)
    myportaI.com   ← with uppercase I, identical in Arial/Helvetica
    myporta1.com   ← with the number one

    rn vs m  → rnybrand.com reads as "mybrand.com"
    vv vs w  → myvveb.com reads as "myweb.com"
    cl vs d  → myclata.com reads as "mydata.com"
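These are homoglyphs (glyphs that look the same) but still within ASCII. They need no IDN or punycode: they are registrable today at any registrar. That is why they are so widely used. One caveat: DNS is case-insensitive, so myportaI.com is registered, and shown in the address bar, as myportai.com. The uppercase trick only works where the text keeps its case, such as the visible text of a link or the body of an email.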
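One way to catch these ASCII look-alikes is to fold every confusable sequence into a canonical form (a "skeleton") and compare skeletons instead of raw strings. The sketch below is an assumption-laden illustration: `ASCII_CONFUSABLES`, `ascii_skeleton` and `looks_like` are hypothetical names, and the list only covers the pairs shown above.

```python
# Look-alike pairs taken from the examples above; a real list would be longer.
ASCII_CONFUSABLES = [("rn", "m"), ("vv", "w"), ("cl", "d"), ("I", "l"), ("1", "l")]

def ascii_skeleton(domain: str) -> str:
    """Collapse look-alike sequences so visually similar names compare equal."""
    skeleton = domain
    for lookalike, canonical in ASCII_CONFUSABLES:
        skeleton = skeleton.replace(lookalike, canonical)
    # Lowercase last, so the uppercase "I" is folded before case is lost.
    return skeleton.lower()

def looks_like(candidate: str, legit: str) -> bool:
    """True when candidate is a different string that reads like legit."""
    return candidate != legit and ascii_skeleton(candidate) == ascii_skeleton(legit)
```

Both sides are folded, so a legitimate name that happens to contain "rn" or "cl" is still compared fairly. The price is false positives: two unrelated names such as corner.com and comer.com share a skeleton, which is arguably the point.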
2. Combosquatting
Combosquatting does not alter the brand name: it combines it with another word. The domain contains the real brand, intact, which makes it especially convincing because when you scan for the brand in the string, there it is.
    yourbank-security.com
    yourbank-verification.com
    login-yourbank.com
    yourbank.account-update.com   ← "yourbank" is a subdomain of account-update.com (!)

In the last example, the domain that matters is account-update.com (controlled by the attacker); yourbank is just a subdomain. The user reads left to right, sees "yourbank" first and relaxes. The part that actually determines who controls the site (the registrable domain) is on the right, and almost nobody looks there.
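The "look at the right-hand side" rule can be expressed in code. The sketch below assumes a tiny hard-coded suffix set (`KNOWN_SUFFIXES`, a hypothetical stand-in); a production check would need the full Public Suffix List, because suffixes such as co.uk span more than one label. `registrable_domain` and `is_combosquat` are illustrative names.

```python
# Tiny stand-in for the Public Suffix List; a real check needs the full list.
KNOWN_SUFFIXES = {"com", "org", "net", "co", "co.uk"}

def registrable_domain(hostname: str) -> str:
    """Return the public suffix plus the one label to its left."""
    host = hostname.lower().rstrip(".")
    labels = host.split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return ".".join(labels[i - 1:])
    return host  # unknown suffix: fall back to the whole hostname

def is_combosquat(hostname: str, brand: str, legit: str) -> bool:
    """The brand appears in the hostname, but the registrable domain is not ours."""
    return brand in hostname.lower() and registrable_domain(hostname) != legit
```

With these assumptions, `registrable_domain("yourbank.account-update.com")` returns `account-update.com`, which makes explicit who actually controls the host.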
Large-scale academic studies have shown that combosquatting is considerably more prevalent than classic typosquatting and that many of these domains stay active for years, precisely because they contain the legitimate brand and slip past superficial reviews.
3. Bitsquatting
Bitsquatting is the most curious of all because it assumes no human error: it assumes a hardware error. A single bit flipping from 0 to 1 (or the other way) in memory because of a transient fault (cosmic rays, heat, a faulty RAM module) can turn one character of the domain into another whose ASCII code differs by exactly one bit.
    'a' = 0110 0001 (0x61)
    'c' = 0110 0011 (0x63)   ← flip bit 1 of 'a' and you get 'c'
    'e' = 0110 0101 (0x65)   ← flip bit 2 of 'a' and you get 'e'

    mydata.com → mydcta.com   (one bit in the 'a' → 'c')
                 mydeta.com   (another bit in the 'a' → 'e')
The attacker registers every variant one bit away from a heavily queried domain (a CDN, an update server, a telemetry domain) and simply waits. With billions of DNS resolutions a day, statistics work in their favor: sooner or later, a machine with a corrupted bit will ask for their domain. It is opportunistic, but it is real and documented.
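Enumerating the one-bit neighbours of a name is a matter of XOR-ing each character with each of the eight bit masks and keeping the results that are still valid hostname characters. A minimal sketch, assuming a lowercase input and mutating only the first label; `bitsquat_variants` and `LDH` are hypothetical names.

```python
import string

# Characters allowed in a DNS label: letters, digits, hyphen.
LDH = set(string.ascii_lowercase + string.digits + "-")

def bitsquat_variants(domain: str) -> set[str]:
    """Return every domain that differs from the input by one flipped bit."""
    name, _, tld = domain.partition(".")
    variants = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            # Discard flips that leave the hostname alphabet or only change case.
            if flipped not in LDH or flipped == ch:
                continue
            candidate = name[:i] + flipped + name[i + 1:]
            # A label may not start or end with a hyphen.
            if not candidate.startswith("-") and not candidate.endswith("-"):
                variants.add(f"{candidate}.{tld}")
    return variants
```

`bitsquat_variants("mydata.com")` contains both `mydcta.com` and `mydeta.com` from the example above. The same list serves the defender: these are the names worth registering or monitoring for a heavily queried domain.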
4. IDN homograph attacks: the heart of the problem
This is where it gets serious. Since domains accept international characters (IDN), the attacker is no longer limited to the Latin alphabet. And it turns out Unicode contains hundreds of characters that render exactly like Latin letters, but are different code points.
The canonical example: the "а" that was not an "a"
| Character | Glyph | Code point | Alphabet |
|---|---|---|---|
| Latin a | a | U+0061 | Latin |
| Cyrillic а | а | U+0430 | Cyrillic |
| Latin e | e | U+0065 | Latin |
| Cyrillic е | е | U+0435 | Cyrillic |
| Latin o | o | U+006F | Latin |
| Cyrillic о | о | U+043E | Cyrillic |
| Latin c | c | U+0063 | Latin |
| Cyrillic с | с | U+0441 | Cyrillic |
The pairs in the table are visually identical in practically any font, but the computer treats them as completely different characters. That means I can register a domain that looks like a well-known brand, but where one of the letters is Cyrillic. To the name system it is a brand-new, free domain; to the user's eye it is indistinguishable from the real one.
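The table translates directly into a detection rule: fold the known Cyrillic look-alikes back to Latin and see whether the result collides with the brand. This is only a sketch with the four pairs listed above (`CYRILLIC_TO_LATIN` and `is_homograph_of` are hypothetical names); the complete confusables data is published alongside UTS #39.

```python
# Cyrillic look-alikes from the table above, keyed by code point.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
})

def is_homograph_of(candidate: str, legit: str) -> bool:
    """True when candidate has different code points but folds to the legit name."""
    return candidate != legit and candidate.translate(CYRILLIC_TO_LATIN) == legit
```

For example, `is_homograph_of("yourb\u0430nk.com", "yourbank.com")` is true, while the genuine name compared with itself is not flagged.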
Punycode: how it is actually encoded
DNS only understands ASCII. When a domain contains Unicode characters, it is translated into an ASCII representation called punycode, with the xn-- prefix. Here is the key defensive insight: under the hood, the fraudulent domain looks nothing like the real one.
    $ python3 -c "print('ассеѕѕ.com'.encode('idna'))"
    # 'access' written with Cyrillic characters that look Latin
    b'xn--80ak9aa5ha.com'

    $ dig +short xn--80ak9aa5ha.com
    # The browser shows "access.com", but DNS resolves THIS:
    203.0.113.42   ← not the legitimate brand's IP

    $ whois xn--80ak9aa5ha.com | grep -i registrar
    Registrar: Cheap-Registrar-LLC   ← red flag
A proof of concept in code, so the gap between what you see and what it is becomes clear:
    legit = "access.com"                                 # all Latin
    fake = "\u0430\u0441\u0441\u0435\u0455\u0455.com"    # Cyrillic

    print(legit)                  # access.com
    print(fake)                   # ассеѕѕ.com ← looks the same
    print(legit == fake)          # False ← but it is NOT
    print(fake.encode("idna"))    # b'xn--80ak9aa5ha.com'
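The reverse direction is just as useful when triaging a suspicious xn-- name: decode it and list what it hides. The sketch below relies on Python's built-in `punycode` codec and the `unicodedata` module; `describe_ace` is a hypothetical helper name, and it skips the extra validation a full IDNA library would perform.

```python
import unicodedata

def describe_ace(hostname: str) -> list[tuple[str, str, str]]:
    """Decode each xn-- label and name every non-ASCII character inside it."""
    findings = []
    for label in hostname.lower().split("."):
        if not label.startswith("xn--"):
            continue
        # The built-in "punycode" codec decodes the part after the prefix.
        decoded = label[4:].encode("ascii").decode("punycode")
        for ch in decoded:
            if ord(ch) > 127:
                findings.append(
                    (label, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
                )
    return findings
```

Run against `xn--80ak9aa5ha.com`, it reports six Cyrillic letters and not a single Latin one, which is exactly the information the rendered glyphs conceal.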
Mixed-script vs single-script
There are two flavors of homograph attack, and the difference matters for defense:
- Mixed-script (mixing alphabets): the domain combines Latin and Cyrillic, e.g. a single Cyrillic "a" inside yourbаnk.com. Modern browsers detect this mix and, by policy, show the punycode (xn--...) instead of the pretty glyph. It is the easiest case to mitigate.
- Single-script (one full alphabet): the famous 2017 case, a very well-known tech brand reproduced entirely in Cyrillic. Because there is no "mix", some browsers did not apply the mixed-script heuristic and showed the domain in the clear, identical to the real one. This forced the major browsers to harden their IDN display rules.
    $ python3 -c "import idna; print(idna.encode('yourbаnk.com').decode())"
    # 'yourbank' with ONE Cyrillic 'a' (U+0430), the rest Latin
    xn--yourbnk-6fg.com   ← this is the domain that actually gets registered
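The two flavors can be told apart programmatically by collecting the scripts present in a label. Python's standard library does not expose the Unicode Script property, so this sketch approximates it with the first word of each character's Unicode name, which is an assumption that holds for Latin, Cyrillic and Greek letters but is not a full UTS #39 implementation. `scripts_in_label` and `classify_label` are hypothetical names.

```python
import unicodedata

def scripts_in_label(label: str) -> set[str]:
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # digits and hyphens carry no script information
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def classify_label(label: str) -> str:
    """Classify a single domain label by the scripts it contains."""
    scripts = scripts_in_label(label)
    if len(scripts) > 1:
        return "mixed-script"
    if scripts and scripts != {"LATIN"}:
        return "single-script non-Latin"
    return "latin-only"
```

A mixed-script result is a strong signal on its own. A single-script non-Latin result is perfectly legitimate for many real domains, so it only becomes suspicious when the label also folds to a known Latin brand.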
Reconnaissance: how these variants are generated and detected
From the offensive side (and therefore from the side of whoever has to defend a brand), the go-to tool is dnstwist. It systematically generates every permutation of a domain (typos, homoglyphs, bitsquatting, TLD combinations, homographs) and checks which ones are registered and active.
    $ pip install dnstwist

    # Generate variants and check which are registered (--registered)
    $ dnstwist --registered --mxcheck yourdomain.com

    fuzzer      domain                   dns_a          dns_mx
    homoglyph   yourdоmain.com           203.0.113.40   mail.bad.example   ← Cyrillic 'o' + active MX
    omission    yourdomain.co            203.0.113.7    -
    addition    yourdomain-support.com   203.0.113.9    mail.bad.example   ← combosquat with email
    tld         yourdomain.net           203.0.113.2    -

    # Export to JSON to wire into a monitoring pipeline
    $ dnstwist --format json yourdomain.com > variants.json
An active MX record on a variant is the most worrying signal: it means the fraudulent domain is set up to receive email (and, very likely, to send it too), that is, it is prepared for phishing, not just parked.
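That prioritization is easy to automate over the exported JSON. The sketch below assumes the export is a list of objects with `fuzzer`, `domain` and optional `dns_a` / `dns_mx` list fields, and that the original domain is tagged `*original`; check these assumptions against the output of your dnstwist version. `triage` is a hypothetical name and the sample data is made up to mirror the listing above.

```python
def triage(variants: list[dict]) -> list[tuple[str, str, str]]:
    """Rank registered variants: an MX record outranks a bare A record."""
    ranked = []
    for entry in variants:
        if entry.get("fuzzer") == "*original":
            continue  # assumed marker for the domain being protected
        fuzzer = entry.get("fuzzer", "?")
        if entry.get("dns_mx"):
            ranked.append(("high", entry["domain"], fuzzer))
        elif entry.get("dns_a"):
            ranked.append(("medium", entry["domain"], fuzzer))
    # "high" sorts before "medium" alphabetically, so plain sorting is enough.
    return sorted(ranked)

# Made-up sample in the assumed shape of the JSON export.
sample = [
    {"fuzzer": "*original", "domain": "yourdomain.com", "dns_a": ["203.0.113.1"]},
    {"fuzzer": "homoglyph", "domain": "yourd\u043emain.com",
     "dns_a": ["203.0.113.40"], "dns_mx": ["mail.bad.example"]},
    {"fuzzer": "omission", "domain": "yourdomain.co", "dns_a": ["203.0.113.7"]},
    {"fuzzer": "addition", "domain": "yourdomain-support.com",
     "dns_a": ["203.0.113.9"], "dns_mx": ["mail.bad.example"]},
    {"fuzzer": "tld", "domain": "yourdomain.net", "dns_a": ["203.0.113.2"]},
    {"fuzzer": "insertion", "domain": "yourdomainn.com"},  # not registered
]

for severity, domain, fuzzer in triage(sample):
    print(f"{severity:<7}{fuzzer:<11}{domain}")
```

In a pipeline, `sample` would be replaced by `json.load(...)` on the exported file, and the "high" rows would go straight to an alert.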
Inspecting a suspicious domain by hand
    # See the real bytes of a domain (reveals non-ASCII code points)
    echo -n "yourbаnk.com" | hexdump -C
    # If you see bytes like d0 b0 (UTF-8 of U+0430), there is Cyrillic in disguise

    # Convert to punycode to see the "real" domain
    python3 -c "import idna; print(idna.encode('yourbаnk.com').decode())"
    # xn--yourbnk-6fg.com ← this is the domain that actually gets registered

    # Monitor Certificate Transparency: did anyone issue a cert
    # for a variant of my brand?
    curl -s "https://crt.sh/?q=%25yourdomain%25&output=json" |
      jq -r '.[].name_value' | sort -u
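The Certificate Transparency query returns every name containing the brand, including your own certificates, so the output needs filtering before it is useful as an alert. A minimal sketch operating on the already-parsed JSON: it assumes each entry has a `name_value` string that may hold several newline-separated names, and `suspicious_cert_names` is a hypothetical helper.

```python
def suspicious_cert_names(entries: list[dict], brand: str, legit: str) -> list[str]:
    """Return certificate names that mention the brand outside the legit domain."""
    names = set()
    for entry in entries:
        # One entry can carry several names, separated by newlines.
        for raw in entry.get("name_value", "").splitlines():
            name = raw.strip().lower().removeprefix("*.")
            if not name or brand not in name:
                continue
            # Skip the legitimate domain and its own subdomains.
            if name == legit or name.endswith("." + legit):
                continue
            names.add(name)
    return sorted(names)
```

Anything this returns is a certificate issued for a name that contains your brand but does not belong to you, which is usually an early sign that a look-alike site is about to go live.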
The full chain of a real attack
Putting the pieces together, this is how an attacker chains these techniques in a targeted phishing campaign:
    1. RECON    → dnstwist against the target brand; pick a free, credible variant (homograph or combosquat)
    2. REGISTER → register the domain at a lax registrar; configure mail (MX, SPF) so it can send and receive email
    3. TLS      → get a free certificate: the padlock shows up exactly as on the real site
    4. CLONE    → copy the legitimate site (login included)
    5. DELIVERY → send the email from the fraudulent domain; the sender "looks" identical to the real one
    6. CAPTURE  → the user enters credentials; they are relayed to the real site so as not to raise suspicion
Defense: what to do from each side
As a user
- Distrust links: for sensitive sites (banking, email, work), type the URL yourself or use saved bookmarks.
- If a domain shows up as xn--... in the bar, it is an IDN: stop and verify it. It is not necessarily malicious, but it deserves a second look.
- The "padlock" does not validate the owner's identity. Look at the registrable domain (the part just to the left of the TLD), not the start of the string.
As a company / security team
- Defensive registration: register the most obvious variants of your brand (common typos, the main TLDs, and at least the homograph punycode of your name). It is cheap compared to the cost of a successful phishing campaign.
- Continuous monitoring: wire dnstwist and Certificate Transparency watching (crt.sh) into a pipeline that alerts you when a variant gets registered or a certificate is issued for one.
- Harden email: SPF, DKIM and above all DMARC in p=reject mode. It does not stop similar domains being registered, but it stops them from spoofing your domain, which is half the problem.
- Realistic training: phishing drills that include homograph and combosquatting domains, not just the typical clumsy "click here".
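The DMARC item in the list above is one of the few that can be verified mechanically: the policy lives in a TXT record at `_dmarc.<your domain>` as semicolon-separated tag=value pairs. The sketch below only parses a record string that has already been fetched (no DNS lookup), and `dmarc_policy` and `enforces_reject` are hypothetical names.

```python
def dmarc_policy(txt_record: str) -> dict[str, str]:
    """Split a DMARC TXT record into its tag/value pairs."""
    tags = {}
    for part in txt_record.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:  # ignore empty fragments, e.g. after a trailing semicolon
            tags[key.strip().lower()] = value.strip()
    return tags

def enforces_reject(txt_record: str) -> bool:
    """True only when the record is DMARC1 and the policy is reject."""
    tags = dmarc_policy(txt_record)
    return tags.get("v") == "DMARC1" and tags.get("p", "").lower() == "reject"
```

A record such as `v=DMARC1; p=none; rua=mailto:reports@yourdomain.example` fails the check: it collects reports but leaves spoofed mail deliverable, a common state for domains that started a DMARC rollout and never finished it.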
As a registry / browser
- Enforce each zone's IDN policies: many registries forbid mixing scripts within a single label.
- Implement Unicode's confusable-character rules (UTS #39) to show punycode when a risky mix appears.
Want to take your web hacking to the expert level?
If you are interested in professional-level web application exploitation, the Web eXploitation Expert (WXE) course walks you through attacking and defending real applications, with hands-on labs and the same methodology we apply in bug bounty and responsible disclosure programs.
Frequently asked questions
Is it illegal to register a domain similar to a brand?
Registering the domain itself falls into intellectual-property territory and UDRP-style disputes (cybersquatting), not necessarily criminal on its own. What is clearly illegal is using it to impersonate, defraud or capture credentials. In an audit or bug bounty context, all of this work is done on your own domains or with explicit authorization.
Don't browsers already protect against homograph attacks?
They protect partially: they show punycode when they detect suspicious script mixes, but coverage depends on the TLD and the character combination, and outside the browser (email, apps, terminals) protection is much weaker. It is not a solved problem.
What is the difference between typosquatting and combosquatting?
Typosquatting alters the brand name (changed, omitted or transposed letters). Combosquatting keeps the brand intact and adds another word (yourbank-security.com). The latter tends to be more convincing because the real brand appears as-is.
Does the HTTPS padlock help detect these domains?
No. The padlock only indicates an encrypted connection to that domain. A fraudulent domain can have a valid, free certificate and show the same padlock as the legitimate site. You have to look at the domain, not the padlock.
References
- dnstwist: generation and detection of permuted domains (typo, homoglyph, bitsquatting, IDN).
- Unicode UTS #39: security mechanisms and confusable characters.
- RFC 3492: Punycode, the encoding of IDN to ASCII.
- crt.sh: Certificate Transparency log search.
- DMARC.org: email authentication to prevent spoofing of your own domain.