content/posts/ctf/hkcert22/2022-11-14-hkcert-2022-base64-encryption.md
Lines changed: 23 additions & 8 deletions
Original file line number
Diff line number
Diff line change
@@ -43,9 +43,9 @@ We’re provided with an encryption script `chall.py` (written in Python), along
43
43
44
44
So how do we go about cracking this? Brute force would undoubtedly be inefficient, as we have $64! \approx 1.27 \times 10^{89}$ possible mappings to try. It would take *years* (far longer, in fact) before we made any progress! We’d also need to inspect each result to determine whether the English looks right (or automate this by checking against a word list), which would take even more time! Either way, we need to find some other approach.
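To get a feel for that number, here's a quick back-of-the-envelope sketch. The rate of one billion guesses per second is an assumption picked purely for illustration, not a measurement of any real cracker.

```python
import math

# Number of ways to permute the 64-character Base64 alphabet.
mappings = math.factorial(64)

# Hypothetical brute-force speed (an assumption for illustration only).
guesses_per_second = 10**9
seconds_per_year = 60 * 60 * 24 * 365

# Whole years needed to try every mapping at that rate.
years = mappings // (guesses_per_second * seconds_per_year)

print(f"{mappings:.2e} mappings")  # 1.27e+89 mappings
print(f"{years:.2e} years")
```

Even with a far more generous guess rate, the exponent barely moves, which is why the rest of the post focuses on shrinking the search space instead.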
45
45
46
-
## Let’s Get Cracking
46
+
## First Steps: Elimination by ASCII Range
47
47
48
-
Here’s one idea: since the plaintext is an English article, this means that most (if not all) characters are in the printable ASCII range (32-126). This means that the most significant bit (MSB) of each byte *cannot* be 1. We can use this to create a **blacklist** of mappings. For example, originally we have 64 mappings for the letter `A`. After blacklisting, we may be left with, say, 16 mappings. This drastically reduces the search space.[^extended-ascii]
48
+
Here’s one idea: since the plaintext is an English article, this means that most (if not all) characters are in the printable ASCII range (32-126). This means that the most significant bit (MSB) of each byte *cannot* be 1. We can use this to create a **blacklist** of mappings. For example, originally we have 64 possible mappings for the letter `A`. After blacklisting, we may be left with, say, 16 possible mappings. This drastically reduces the search space.[^extended-ascii]
49
49
50
50
Since Base64 simply regroups 8-bit bytes into 6-bit values, every 3 characters of ASCII are translated into 4 characters of Base64.
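To see where each byte's MSB ends up, here's a minimal sketch. The names `STANDARD` and `sextets` are my own for illustration and aren't taken from `chall.py`. It sets only the top bit of each of the 3 bytes in turn and prints the resulting four 6-bit values.

```python
import base64

# The standard (unshuffled) Base64 alphabet.
STANDARD = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def sextets(three_bytes: bytes) -> list[int]:
    """Split 3 bytes (24 bits) into the four 6-bit values Base64 encodes."""
    n = int.from_bytes(three_bytes, "big")
    return [(n >> shift) & 0b111111 for shift in (18, 12, 6, 0)]

# Set only the MSB of one byte at a time and see which sextet bit lights up.
for byte_index in range(3):
    block = bytes(0x80 if j == byte_index else 0 for j in range(3))
    print(byte_index, [f"{s:06b}" for s in sextets(block)])

# Sanity check against the standard library.
encoded = "".join(STANDARD[i] for i in sextets(b"the"))
assert encoded == base64.b64encode(b"the").decode()
```

The three MSBs land on bit `100000` of the first Base64 character, `001000` of the second, and `000010` of the third. The fourth character carries no MSB at all, so this particular check tells us nothing about it.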
51
51
@@ -62,23 +62,31 @@ def get_chars_with_mask(m):
62
62
    """Get Base64 chars which are masked with m (none if m is 0)."""
63
63
    return {c for i, c in enumerate(charset) if m and (i & m) == m}
64
64
65
+
# List the 4 Base64 positions. We'll cycle through these positions (i.e. i % 4).
65
66
msbs = [0b100000, 0b001000, 0b000010, 0b000000]
67
+
68
+
# Get impossible characters for each position.
66
69
subchars = [get_chars_with_mask(m) for m in msbs]
67
70
71
+
# Create a blacklist for each Base64 char.
72
+
# e.g. blacklist['A'] returns the set of chars which 'A' can NOT map to.
68
73
blacklist = {c: set() for c in charset}
69
74
75
+
# Loop through each char in the shuffled Base64 text.
70
76
for i, c in enumerate(txt):
71
-
    # Ignore char mappings which have 1 in corresponding msb.
77
+
    # Ignore char mappings which have '1' in corresponding msb.
72
78
    # These can't map to a printable ASCII char.
73
79
    blacklist[c] |= subchars[i % 4]
74
80
81
+
# Invert the blacklist to get a dictionary of possible mappings.
82
+
# e.g. whitelist['A'] returns the set of chars which 'A' CAN map to.
75
83
whitelist = {k: set(charset) - v for k, v in blacklist.items()}
@@ -97,9 +105,12 @@ We can do a similar thing on the low end. Again, since the smallest printable AS
97
105
def get_inverted_chars_with_mask(m):
98
106
    return {c for i, c in enumerate(charset) if ((2**6 - 1 - i) & m) == m}
99
107
100
-
subchars_not_in_ascii = [get_inverted_chars_with_mask(m) for m in in_ascii]  # chars that don't have bits set in ascii.
108
+
# chars that don't have bits set in ascii.
109
+
subchars_not_in_ascii = [get_inverted_chars_with_mask(m) for m in in_ascii]
101
110
```
102
111
112
+
## Frequency Analysis with Known Text
113
+
103
114
Another idea comes to mind. Remember the plaintext is in English? Well, with English text, some letters appear more frequently than others. The same applies to words and sequences.
104
115
105
116
{% image "assets/base64-letter-frequencies.jpg", "w-65", "Frequency of English letters. But we need to be careful with letter cases."%}
{% image "assets/b64-crypt-1gram.jpg", "", "dcode.fr frequency analysis for encrypted Base64."%}
121
132
{% endimages %}
122
133
123
-
<sup>Frequency analysis of plain vs. encrypted Base64.</sup>
134
+
<sup>Frequency analysis of plain vs. encrypted Base64. Left: CNN Lite articles. Right: Encrypted challenge text.</sup>
124
135
{.caption}
125
136
126
137
From this, we can deduce that 'w' was mapped from 'G' in the original encoding (due to the gap in frequency).
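If you'd rather script this than eyeball charts, here's a rough sketch of rank-based matching. The helper names `rank_by_frequency` and `guess_mapping` are hypothetical, and the approach assumes you have a reference corpus of ordinary Base64-encoded English to compare against. It's only trustworthy where there's a clear gap in frequency, as with 'w' and 'G' above.

```python
from collections import Counter

def rank_by_frequency(b64_text: str) -> list[str]:
    """Return the characters of b64_text, most frequent first."""
    counts = Counter(ch for ch in b64_text if ch != "=" and not ch.isspace())
    return [ch for ch, _ in counts.most_common()]

def guess_mapping(encrypted: str, reference: str) -> dict[str, str]:
    """Pair characters of equal frequency rank (encrypted -> original)."""
    return dict(zip(rank_by_frequency(encrypted), rank_by_frequency(reference)))

# Toy data: 'w' dominates the encrypted text just as 'G' dominates the reference.
toy_mapping = guess_mapping(encrypted="wwwxxy", reference="GGGVVa")
print(toy_mapping)  # {'w': 'G', 'x': 'V', 'y': 'a'}
```

Characters with near-identical counts can easily swap ranks, so treat the output as a list of candidates to check against the earlier constraints rather than a finished mapping.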
@@ -132,18 +143,22 @@ One useful option is the **bigrams/n-grams** option. We can tell dcode to analys
132
143
{% image "assets/b64-crypt-4gram.jpg", "", "dcode.fr 4-gram for encrypted Base64."%}
133
144
{% endimages %}
134
145
135
-
<sup>Frequency analysis of 4-grams in plain vs. encrypted Base64.</sup>
146
+
<sup>Frequency analysis of 4-grams in plain vs. encrypted Base64. Left: CNN Lite articles. Right: Encrypted challenge text.</sup>
136
147
{.caption}
137
148
138
149
Observe how "YoJP0H" occurs (relatively) frequently. This corresponds to "IHRoZS", which happens to be the start of the Base64 encoding for " the " (including the trailing space).
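Here's a small sketch of how one matched n-gram turns into several mappings at once. `ngram_counts` is a hypothetical helper that stands in for what dcode.fr does for us. The pairing assumes " the " starts on a 3-byte boundary, which is the alignment that produces "IHRoZS".

```python
import base64
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# " the " encodes to "IHRoZSA="; keep the first 6 characters.
plain_gram = base64.b64encode(b" the ").decode()[:6]

# The frequent encrypted 6-gram reported above.
cipher_gram = "YoJP0H"

# Pair them position by position (encrypted char -> original Base64 char).
partial_mapping = dict(zip(cipher_gram, plain_gram))
print(partial_mapping)
# {'Y': 'I', 'o': 'H', 'J': 'R', 'P': 'o', '0': 'Z', 'H': 'S'}
```

One good n-gram match pins down six characters in a single step, which is a lot more productive than matching single letters by frequency.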
139
150
151
+
## More Heuristics
152
+
140
153
Frequency analysis is useful to group letters into buckets. But using frequency analysis alone is painful. Some guesswork is needed. Here's the complete process I went through:
141
154
142
155
- Frequency Analysis: use dcode.fr to associate frequent characters.
143
156
- We can make use of our earlier constraints to eliminate wrong guesses.[^byebye-constraints]