Ancient Cryptography

Ancient Texts => D'Agapeyeff Cipher => Topic started by: Phil_The_Rodent on June 19, 2009, 02:18:15 PM

Title: Frequency Analysis data
Post by: Phil_The_Rodent on June 19, 2009, 02:18:15 PM
I'm not sure where to post this, but I was thinking about the D'Agapeyeff Cipher while gathering this data, so feel free to move it somewhere more appropriate if you think it necessary.

I was looking at D'Agapeyeff and noticed that, aside from some of the curiosities already mentioned, the share of two-digit pairs in the 80s is approximately 40% (specifically 40.82%), which is a pretty significant portion. Consider that the 5 most common letters in English (ETAOI or ETAON) account for about 44%, just 4 percentage points more than D'Agapeyeff's 80s. 40% is also a good approximation for the share of vowels, but if the 80s are vowels they're not very nicely distributed (the best ratio I could find in this case was using a diagonal encoding). That got me thinking about the solution to part 3 of Kryptos and the possibility of brute-forcing a double columnar transposition using the frequencies of vowels against consonants. I toyed in Rumkin for a few minutes with the idea of just looking at vowel distribution in a cipher and found it could quite rapidly rule out many transpositions.
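
The vowel check described above can be sketched in Python. This is only an illustration for a single columnar stage with a completely filled rectangle: the ciphertext is written back into the columns of a candidate rectangle and the vowels in each row are counted. Rows that stray far from roughly 40% vowels suggest the wrong dimensions. The function names, the 0.4 default and the plain column-wise fill are assumptions, not the exact test used in Rumkin.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def rows_for_height(ciphertext, n_rows):
    """Fill a rectangle column by column and return its rows as strings.

    Assumes a completely filled rectangle (letter count divisible by n_rows).
    """
    letters = [c for c in ciphertext.upper() if "A" <= c <= "Z"]
    if n_rows <= 0 or len(letters) % n_rows != 0:
        raise ValueError("letter count must be a multiple of n_rows")
    # Every n_rows-th letter belongs to the same row of the rectangle.
    return ["".join(letters[i::n_rows]) for i in range(n_rows)]


def vowel_counts(rows):
    """Number of vowels in each row."""
    return [sum(c in VOWELS for c in row) for row in rows]


def vowel_spread(ciphertext, n_rows, vowel_share=0.4):
    """Mean absolute deviation of per-row vowel counts from the expected count.

    Lower values mean the vowels are spread more evenly across the rows.
    """
    rows = rows_for_height(ciphertext, n_rows)
    expected = vowel_share * len(rows[0])
    counts = vowel_counts(rows)
    return sum(abs(n - expected) for n in counts) / len(counts)
```

For example, `rows_for_height("EWBAILSNOTDW", 3)` rebuilds the rows EAST, WIND, BLOW. Candidate heights with a large `vowel_spread` can be discarded before any column reordering is attempted.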

So, the data.

For this particular analysis I used Tolstoy's War and Peace from Project Gutenberg. I deleted the web address (but left the remainder of their disclaimer) and ignored numbers and other non-letter characters. Here's how it looks:
Code:
VOWELS-952248 FREQUENCY = 37.665013
CONSONANTS-1575955 FREQUENCY = 62.334987

SETS EXAMINED = 1638135

[1 VOWEL] -690502  FREQUENCY = 42.151715
[1 CNSNT] -322724  FREQUENCY = 19.700696
[2 CNSNTS]-299387  FREQUENCY = 18.276088
[3 CNSNTS]-144734  FREQUENCY = 8.835291
[2 VOWELS]-123408  FREQUENCY = 7.533445
[4 CNSNTS]-42574   FREQUENCY = 2.598931
[5 CNSNTS]-7825    FREQUENCY = 0.477677
[3 VOWELS]-4818    FREQUENCY = 0.294115

DIGRAMS
[1 VOWEL] [1 CNSNT] -279661   FREQUENCY = 17.071925
[1 CNSNT] [1 VOWEL] -271252   FREQUENCY = 16.558597
[2 CNSNTS][1 VOWEL] -254091   FREQUENCY = 15.511002
[1 VOWEL] [2 CNSNTS]-251423   FREQUENCY = 15.348134
[3 CNSNTS][1 VOWEL] -121665   FREQUENCY = 7.427048
[1 VOWEL] [3 CNSNTS]-117017   FREQUENCY = 7.143311
[1 CNSNT] [2 VOWELS]-49202    FREQUENCY = 3.003539
[2 VOWELS][2 CNSNTS]-46132    FREQUENCY = 2.816131
[2 CNSNTS][2 VOWELS]-43720    FREQUENCY = 2.66889
[2 VOWELS][1 CNSNT] -41421    FREQUENCY = 2.528548
[4 CNSNTS][1 VOWEL] -35779    FREQUENCY = 2.184131
[1 VOWEL] [4 CNSNTS]-34829    FREQUENCY = 2.126139
[2 VOWELS][3 CNSNTS]-26717    FREQUENCY = 1.630941
[3 CNSNTS][2 VOWELS]-22054    FREQUENCY = 1.346288
[2 VOWELS][4 CNSNTS]-7246     FREQUENCY = 0.442333
[4 CNSNTS][2 VOWELS]-6589     FREQUENCY = 0.402226
[5 CNSNTS][1 VOWEL] -6263     FREQUENCY = 0.382325
[1 VOWEL] [5 CNSNTS]-6151     FREQUENCY = 0.375488
[1 CNSNT] [3 VOWELS]-2000     FREQUENCY = 0.12209
[3 VOWELS][2 CNSNTS]-1749     FREQUENCY = 0.106768

TRIGRAMS
[1 VOWEL] [1 CNSNT] [1 VOWEL] -232652   FREQUENCY = 14.202266
[1 VOWEL] [2 CNSNTS][1 VOWEL] -212995   FREQUENCY = 13.002302
[1 CNSNT] [1 VOWEL] [1 CNSNT] -107219   FREQUENCY = 6.545195
[2 CNSNTS][1 VOWEL] [1 CNSNT] -99707    FREQUENCY = 6.086624
[1 VOWEL] [3 CNSNTS][1 VOWEL] -98014    FREQUENCY = 5.983275
[1 CNSNT] [1 VOWEL] [2 CNSNTS]-97573    FREQUENCY = 5.956354
[2 CNSNTS][1 VOWEL] [2 CNSNTS]-94218    FREQUENCY = 5.751548
[3 CNSNTS][1 VOWEL] [1 CNSNT] -53462    FREQUENCY = 3.263593
[1 CNSNT] [1 VOWEL] [3 CNSNTS]-47568    FREQUENCY = 2.903794
[1 VOWEL] [1 CNSNT] [2 VOWELS]-44949    FREQUENCY = 2.743916
[2 CNSNTS][1 VOWEL] [3 CNSNTS]-44900    FREQUENCY = 2.740925
[3 CNSNTS][1 VOWEL] [2 CNSNTS]-44190    FREQUENCY = 2.697583
[2 VOWELS][2 CNSNTS][1 VOWEL] -39532    FREQUENCY = 2.413235
[2 VOWELS][1 CNSNT] [1 VOWEL] -37213    FREQUENCY = 2.271671
[1 VOWEL] [2 CNSNTS][2 VOWELS]-37090    FREQUENCY = 2.264163
[1 VOWEL] [4 CNSNTS][1 VOWEL] -29204    FREQUENCY = 1.782761
[2 VOWELS][3 CNSNTS][1 VOWEL] -22866    FREQUENCY = 1.395857
[1 VOWEL] [3 CNSNTS][2 VOWELS]-18137    FREQUENCY = 1.107175
[1 CNSNT] [2 VOWELS][2 CNSNTS]-18084    FREQUENCY = 1.10394
[3 CNSNTS][1 VOWEL] [3 CNSNTS]-17804    FREQUENCY = 1.086847
[1 CNSNT] [2 VOWELS][1 CNSNT] -16870    FREQUENCY = 1.029831
[2 CNSNTS][2 VOWELS][2 CNSNTS]-16105    FREQUENCY = 0.983131
[4 CNSNTS][1 VOWEL] [1 CNSNT] -15977    FREQUENCY = 0.975318
[1 CNSNT] [1 VOWEL] [4 CNSNTS]-15595    FREQUENCY = 0.951998
[2 CNSNTS][2 VOWELS][1 CNSNT] -14441    FREQUENCY = 0.881552
[4 CNSNTS][1 VOWEL] [2 CNSNTS]-12684    FREQUENCY = 0.774296
[2 CNSNTS][1 VOWEL] [4 CNSNTS]-12536    FREQUENCY = 0.765261
[1 CNSNT] [2 VOWELS][3 CNSNTS]-10774    FREQUENCY = 0.6577
[2 CNSNTS][2 VOWELS][3 CNSNTS]-10192    FREQUENCY = 0.622172
[3 CNSNTS][2 VOWELS][2 CNSNTS]-8744     FREQUENCY = 0.533778
[3 CNSNTS][2 VOWELS][1 CNSNT] -7387     FREQUENCY = 0.45094
[2 VOWELS][2 CNSNTS][2 VOWELS]-6369     FREQUENCY = 0.388796
[2 VOWELS][4 CNSNTS][1 VOWEL] -6153     FREQUENCY = 0.375611
[4 CNSNTS][1 VOWEL] [3 CNSNTS]-5525     FREQUENCY = 0.337274
[1 VOWEL] [4 CNSNTS][2 VOWELS]-5462     FREQUENCY = 0.333428
[3 CNSNTS][1 VOWEL] [4 CNSNTS]-5092     FREQUENCY = 0.310842
[1 VOWEL] [5 CNSNTS][1 VOWEL] -4906     FREQUENCY = 0.299487
[3 CNSNTS][2 VOWELS][3 CNSNTS]-4155     FREQUENCY = 0.253642
[2 VOWELS][1 CNSNT] [2 VOWELS]-4048     FREQUENCY = 0.247111
[2 VOWELS][3 CNSNTS][2 VOWELS]-3707     FREQUENCY = 0.226294
[1 CNSNT] [2 VOWELS][4 CNSNTS]-2723     FREQUENCY = 0.166226
[1 CNSNT] [1 VOWEL] [5 CNSNTS]-2706     FREQUENCY = 0.165188
[5 CNSNTS][1 VOWEL] [1 CNSNT] -2651     FREQUENCY = 0.161831
[4 CNSNTS][2 VOWELS][2 CNSNTS]-2607     FREQUENCY = 0.159145
[2 CNSNTS][2 VOWELS][4 CNSNTS]-2396     FREQUENCY = 0.146264
[5 CNSNTS][1 VOWEL] [2 CNSNTS]-2265     FREQUENCY = 0.138267
[2 CNSNTS][1 VOWEL] [5 CNSNTS]-2256     FREQUENCY = 0.137718
[4 CNSNTS][2 VOWELS][1 CNSNT] -2069     FREQUENCY = 0.126302
[1 VOWEL] [1 CNSNT] [3 VOWELS]-1815     FREQUENCY = 0.110797
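
For anyone who wants to reproduce tallies like these, here is a minimal Python sketch of the run counting. It assumes that only the letters A-Z are kept, that Y counts as a consonant, and that runs continue across spaces and punctuation. The post does not state how word boundaries were handled, so exact counts may differ. The names `runs` and `run_ngrams` are made up for illustration.

```python
from collections import Counter
from itertools import groupby

VOWELS = set("AEIOU")  # assumption: Y is a consonant


def runs(text):
    """Split text into maximal vowel/consonant runs, e.g. ("C", 2) for "TH"."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    result = []
    for is_vowel, group in groupby(letters, key=lambda c: c in VOWELS):
        result.append(("V" if is_vowel else "C", len(list(group))))
    return result


def run_ngrams(text, n):
    """Count sequences of n consecutive runs (n=1 sets, n=2 digrams, n=3 trigrams)."""
    r = runs(text)
    return Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))


def as_percentages(counter):
    """Convert a Counter of tallies into percentages of its total."""
    total = sum(counter.values())
    return {key: 100.0 * n / total for key, n in counter.items()}
```

Calling `run_ngrams(text, 2).most_common(20)` on the full book text would give a list comparable to the DIGRAMS section.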

For those curious, here are Tolstoy's letter frequencies:
Code:
E-314818 FREQUENCY = 12.452244
T-226014 FREQUENCY = 8.939709
A-205430 FREQUENCY = 8.125534
O-192828 FREQUENCY = 7.627077
N-184154 FREQUENCY = 7.283988
I-173748 FREQUENCY = 6.872391
H-167028 FREQUENCY = 6.60659
S-162882 FREQUENCY = 6.4426
R-148052 FREQUENCY = 5.856017
D-118277 FREQUENCY = 4.678303
L-96514  FREQUENCY = 3.817494
U-65424  FREQUENCY = 2.587767
M-61642  FREQUENCY = 2.438174
C-61258  FREQUENCY = 2.422986
W-59198  FREQUENCY = 2.341505
F-54886  FREQUENCY = 2.170949
G-51315  FREQUENCY = 2.029703
Y-46264  FREQUENCY = 1.829916
P-45162  FREQUENCY = 1.786328
B-34641  FREQUENCY = 1.370183
V-26901  FREQUENCY = 1.064036
K-20415  FREQUENCY = 0.807491
X-4060   FREQUENCY = 0.160588
J-2574   FREQUENCY = 0.101811
Z-2388   FREQUENCY = 0.094454
Q-2330   FREQUENCY = 0.09216
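
A letter tally in this format can be produced with a few lines of Python. This is a sketch, assuming only A-Z are counted and case is ignored; `letter_frequencies` is a made-up name.

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, count, percent) tuples, most common letter first."""
    counts = Counter(c for c in text.upper() if "A" <= c <= "Z")
    total = sum(counts.values())
    return [(letter, n, 100.0 * n / total) for letter, n in counts.most_common()]


def format_table(rows):
    """Render the rows in the LETTER-COUNT FREQUENCY = PERCENT style used here."""
    return "\n".join(f"{letter}-{n} FREQUENCY = {pct:.6f}" for letter, n, pct in rows)
```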

Currently, I'm trying to determine if there is a way to transcribe the 2-digit pairs into a 14x14 grid that evenly distributes the 80s, though I've not ruled out double transposition (obviously).
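
To judge how evenly a given transcription spreads the 80s, one could count them per row and per column. The sketch below assumes the 196 pairs are supplied as two-character strings and written into the grid row by row; other fill orders (columns, diagonals) would need a different indexing step. The ciphertext itself is not included here and the function name is hypothetical. With 80 such pairs among 196, a perfectly even spread would be about 5.7 per row or column.

```python
def eighties_per_row_and_column(pairs, size=14):
    """Count pairs starting with '8' in each row and column of a size x size grid.

    pairs: list of two-digit strings, written into the grid row by row.
    """
    if len(pairs) != size * size:
        raise ValueError(f"expected {size * size} pairs, got {len(pairs)}")
    grid = [pairs[r * size:(r + 1) * size] for r in range(size)]
    row_counts = [sum(p.startswith("8") for p in row) for row in grid]
    col_counts = [
        sum(grid[r][c].startswith("8") for r in range(size)) for c in range(size)
    ]
    return row_counts, col_counts


def spread(counts):
    """Difference between the fullest and emptiest line; 0 or 1 is as even as possible."""
    return max(counts) - min(counts)
```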

Oh, also, based on the letter frequency chart above, here is what we should expect out of 196 characters using Tolstoy's letter frequencies (the second column lists the actual two-digit pairs and their counts, ranked from most to least frequent):

Code:
    Expected  Actual
E   24.4      81-20
T   17.5      62-17
A   15.9      75-17
O   14.9      82-17
N   14.3      85-17
I/J 13.7      64-16
H   12.9      83-15
S   12.6      74-14
R   11.5      63-12
D   9.2       91-12
L   7.5       65-11
U   5.1       84-11
M   4.8       72-9
C   4.8       92-3
W   4.6       93-2
F   4.3       94-1
G   4.0       71-1
Y   3.6       04-1
P   3.5       61-0
B   2.7       73-0
V   2.1       95-0
K   1.6       01-0
X   0.3       02-0
Z   0.2       03-0
Q   0.2       05-0
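
The expected column is just each letter's share of the War and Peace tally scaled to 196 characters. A short Python sketch of that arithmetic follows, using the letter counts posted above and merging I and J as in the table. The function and constant names are made up for illustration.

```python
# Letter counts copied from the War and Peace tally in this post.
TOLSTOY_COUNTS = {
    "E": 314818, "T": 226014, "A": 205430, "O": 192828, "N": 184154,
    "I": 173748, "H": 167028, "S": 162882, "R": 148052, "D": 118277,
    "L": 96514, "U": 65424, "M": 61642, "C": 61258, "W": 59198,
    "F": 54886, "G": 51315, "Y": 46264, "P": 45162, "B": 34641,
    "V": 26901, "K": 20415, "X": 4060, "J": 2574, "Z": 2388, "Q": 2330,
}
CIPHER_LENGTH = 196  # number of two-digit pairs in the cipher


def expected_counts(counts, length, merge=("I", "J")):
    """Scale letter counts to a text of the given length, merging one letter pair."""
    total = sum(counts.values())
    merged = dict(counts)
    a, b = merge
    merged[f"{a}/{b}"] = merged.pop(a) + merged.pop(b)
    return {letter: length * n / total for letter, n in merged.items()}


if __name__ == "__main__":
    table = expected_counts(TOLSTOY_COUNTS, CIPHER_LENGTH)
    for letter, value in sorted(table.items(), key=lambda kv: -kv[1]):
        print(f"{letter:<4}{value:.1f}")
```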