Genetic Operators Applied to Symmetric Cryptography

In this article, a symmetric-key cryptographic algorithm for text is proposed, which applies the philosophy of Genetic Algorithms, entropy and modular arithmetic. An experimental methodology is used over a deterministic system, which redistributes and modifies the parameters and phases of the genetic algorithm that directly affect its behavior, carrying out a constant evaluation with the fitness function in order to optimize the results. An independent encryption is established for the auxiliary key, using a main key, in order to increase security. The tests are performed over different text sizes, manipulating the proposed parameters and criteria to obtain their appropriate values. Finally, a comparison is presented against the cryptographic algorithms DES (Data Encryption Standard), RSA (Rivest, Shamir and Adleman) and AES (Advanced Encryption Standard), covering factors such as processing time, scalability and key size. The proposed algorithm is shown to run faster than RSA and, for texts larger than 512 bits, faster than DES.


I. Introduction
Computer security has always been the discipline responsible for the protection of data stored in a physical or logical computer system. In recent years, technological growth has been such that this security has focused on minimizing any risk to information, combating all types of vulnerabilities that may exist in a network environment. Cryptographic algorithms confront computer attacks that are increasingly complex, which forces them to evolve by adopting recent and reliable methods [1].
There are many algorithms of this type, categorized as Symmetric (private key) and Asymmetric (public key). In symmetric cryptography, a single key is used to encrypt and decrypt data, while in asymmetric cryptography two keys are used to perform these tasks: a public key to encrypt and a private key to decrypt (e.g., RSA [Rivest, Shamir and Adleman]). Encryption is based on computationally intensive mathematical functions, and decryption is usually the reverse process using the key(s) [2]. For this purpose, DES (Data Encryption Standard) uses a 64-bit key (of which 56 bits are effective), while AES (Advanced Encryption Standard) uses keys of 128, 192 and 256 bits [3].
In recent years, several investigations have been presented that relate cryptography with genetic concepts, making this field known as an alternative to solve computer security problems. B. Beegom and S. Jose [4] present an asymmetric cryptographic model based on a genetic approach and expose an efficient method in which they make use of the complexity of the DNA chains. N. Srilatha and G. Murali [5] define their work as an efficient three-level cryptographic technique, based on processes that use DNA sequences to transmit information.
Nowadays, for data encryption over the Internet, the HTTPS (Hypertext Transfer Protocol Secure) protocol uses SSL/TLS-based encryption to create a secure channel for shared data [6]. The cryptographic protocols TLS (Transport Layer Security) and SSL (Secure Sockets Layer) used by HTTPS rely on asymmetric cryptography, which uses a pair of keys for sending information and authenticates the receiver more reliably [7]. [10] and based on molecular genetics; a genetic algorithm makes use of terms specific to this field, as well as its main phases: Selection, Crossing and Mutation [11].

Selection:
Selection is the stage in which each chromosome (representing a potential solution to the problem) is evaluated against a certain fitness value, and some chromosomes are chosen to be later transformed by the crossing and/or mutation operators. In this process the number of chromosomes, genes and alleles of the genotype is kept constant [12].

Crossing:
Crossing is a genetic operator that allows information to be exchanged between two chromosomes to form a new one. For binary-chain individuals, ring, one-point, two-point and uniform crossings are often used [12]. Fig. 1 shows the offspring produced with crossing point 4 between two parents of length 5.
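As an illustration of the one-point crossing described above, the following is a minimal Python sketch. The function name `one_point_crossover` and the two parent strings are assumptions chosen for illustration; they are not the parents shown in Fig. 1.

```python
def one_point_crossover(parent_a: str, parent_b: str, point: int) -> str:
    """Return one child: the first `point` alleles of parent_a, then the rest of parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if not 0 < point < len(parent_a):
        raise ValueError("the crossing point must fall strictly inside the chromosome")
    return parent_a[:point] + parent_b[point:]


# Hypothetical parents of length 5, crossed at point 4.
child = one_point_crossover("11111", "00000", 4)
```

With these hypothetical parents the child keeps four alleles from the first parent and one from the second, so its length stays at 5.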

Mutation:
Mutation changes the value of the selected allele [12]. Fig. 2 shows an example of the mutation of a selected allele.
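For binary chromosomes, changing the value of an allele amounts to flipping a bit. A minimal sketch follows, assuming a 0-based position and a hypothetical helper name `mutate`; the example chromosome is not the one shown in Fig. 2.

```python
def mutate(chromosome: str, position: int) -> str:
    """Flip the allele (bit) at the given 0-based position."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]


# Hypothetical chromosome: the allele at position 2 changes from 1 to 0.
mutated = mutate("10110", 2)
```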

Fitness Function:
The fitness function is the quality control within the GA and plays a vital role in guiding it: it helps to explore the search space more effectively and efficiently [12].
The main idea of the GAs is to reproduce the random nature where the population of individuals adapts to their environment through natural selection, as well as the behavior of the ecosystem. Once the genetic representation of the initial population has been defined, a set of stochastic operators are applied iteratively: selection, crossing and mutation; under certain quality criteria called fitness function. The application of the GAs to optimization problems provides flexibility and adaptability, combined with the robustness and the advantages of the global search [16].
In the philosophy of genetic algorithms, a set of terms from the language of genetics has been adopted to clarify and unify the concepts used in developing this type of algorithm, as follows: Allele: each bit is called an Allele. Gene: each group of alleles is called a Gene. Chromosome: each group of genes is called a Chromosome.

B. Computer Security
Cryptographic algorithms have three main measurable characteristics: Capacity, Security and Robustness. Capacity is related to the amount of information that the algorithm can process. Security refers to the protection that data receive against possible attacks. Robustness is the resistance that the method as a whole has against external attacks [15].
There are different types of computer attacks that try to obtain unauthorized access to a network service. Among the most common is the Brute Force attack, which makes repeated and systematic attempts with possible credentials. These usually come from sets of credentials that are set by default, commonly used, or found valid in previous attacks [17].

C. Related Concepts
The work below has a strong relationship with Entropy, known as the measure of the uncertainty associated with a random variable [18], conceived as a measure of disorder, as well as of the repetition of certain combinations. It is measured in bits, where the number of information bits of each character is given by:
$$\log_2 k \quad (1)$$
where k is the total number of characters. The entropy of a random source is the expected information content of the symbols it emits, that is, the expected uncertainty of each symbol, knowing only the distribution of the symbols [19].
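To make the notion of entropy as expected information content concrete, the sketch below estimates it from symbol frequencies using the standard Shannon definition, H = -Σ p_i log2 p_i. The function name `shannon_entropy` is an assumption for illustration and is not part of the proposed algorithm.

```python
import math
from collections import Counter


def shannon_entropy(symbols) -> float:
    """Estimate the entropy, in bits per symbol, from observed symbol frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = shannon_entropy("abcd")   # four equally likely symbols
skewed = shannon_entropy("aaab")    # one dominant symbol, hence less uncertainty
```

For k equally likely symbols this estimate coincides with log2 k, so four equally frequent characters carry 2 bits each.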
As a last concept, due to its constant use in the different phases of the proposed algorithm, we have Modular Congruence or Modular Arithmetic, defined as an arithmetic system for integer equivalence classes introduced by Carl Friedrich Gauss in his book Disquisitiones Arithmeticae (1801) [20]. It is defined as follows: let a and b be any integers, and n a positive integer. If n | (a - b), we say that a and b are congruent modulo n and we write [21]:
$$a \equiv b \pmod{n} \quad (2)$$
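The definition of congruence, n | (a - b), translates directly into a remainder test. A minimal sketch, with the hypothetical helper name `is_congruent`:

```python
def is_congruent(a: int, b: int, n: int) -> bool:
    """Return True when n divides (a - b), i.e. a and b are congruent modulo n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return (a - b) % n == 0


same_class = is_congruent(17, 5, 12)       # 17 - 5 = 12 is divisible by 12
different_class = is_congruent(17, 6, 12)  # 17 - 6 = 11 is not
```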

III. Methodology
An action-research approach is adopted, whose objective is the development of a cryptographic algorithm for text based on principles of computer security and genetic algorithms. The development and the tests are carried out in MATLAB R2017a, with an academic license. The chosen methodology is experimental and is summarized in the diagram shown in Fig. 3. It builds on previous knowledge in the areas of GA and Computer Security, complemented by in-depth research and a description of the state of the art, with the aim of establishing a clear idea for the development of a symmetric cryptographic algorithm that takes advantage of the phases of the GA and other related concepts.

IV. Proposed Cryptographic Algorithm
The proposal below is framed as a deterministic system, which implies the elimination of randomness and full knowledge of the input variables. In the nomenclature used below, the parameters are variables established by the user under certain conditions, while the criteria are constant values used to optimize the algorithm and established through a range of tests. The flow diagram (Fig. 4) explains the operation of the proposed algorithm.

A. Initial Population Preparation
The process starts from an arbitrary message to be encrypted, given as a string of characters based on the Latin alphabet and modern English. It is considered random in nature since it is unknown in advance. As an example, we have the text "Hola mundo" (10 characters counting the space). Table I (annexes) shows the conversion of the entire message to American Standard Code for Information Interchange (ASCII) and then to binary, in this strict order and including the spaces.
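The conversion from text to ASCII and then to binary can be sketched as follows. The paper's implementation is in MATLAB; this Python version and the name `text_to_genes` are assumptions for illustration, with each character mapped to one gene of 8 alleles.

```python
def text_to_genes(message: str) -> list:
    """Convert each character to its ASCII code and then to an 8-bit binary gene."""
    genes = []
    for ch in message:
        code = ord(ch)
        if code > 127:
            raise ValueError(f"{ch!r} is not an ASCII character")
        genes.append(format(code, "08b"))
    return genes


initial_population = text_to_genes("Hola mundo")
```

For "Hola mundo" this yields 10 genes (80 alleles); the space is kept as the gene 00100000 (ASCII code 32).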

B. Encryption Process
Once the Initial Text has been converted, the Initial Population is taken as the starting point of the GA. A ring-type crossing of the population is carried out with a single point, which is taken from the initial conditions. One child is generated for each pair of parents (genes). Each parent crosses twice, each time with a different partner, which produces a new generation that replaces the previous one. In order to achieve a greater diversity of genes, the reversalCriterion is proposed, which reverses the order of the digits of the second parent in each crossing (or iteration), depending on the value of the criterion. Applying a crossParameter of 6 and a reversalCriterion of 3 (highlighted in bold), the offspring shown in Table I (annexes) is obtained.
The Last Generation is taken to continue with the Selection of the individuals, using the proposed Selection Equation for the Mutation, also called the Mutation Clock, taken from the definition of modular congruence (equation 2), and defined as follows: where i is the position of each allele $A_i$, with 1 ≤ i ≤ n; n is the Population Total; and k is the mutationParameter, defined in the initial conditions with 0 < k < n.
The mutation clock selects the alleles to be mutated, taking the last generation as a string of bits without counting the spaces between genes. The algorithm then continues with the Mutation of these alleles, changing their values and generating a new population. For example, the value of the mutation parameter k is arbitrarily set to 5; the resulting selection is shown (in bold) in the column Alleles selected by the Mutation Clock, in Table II (annexes). The last population is taken as the Mutated Population, which is at the same time the Population to Evaluate. It is evaluated with the proposed Fitness Function based on Entropy. This function compares groups or chains of alleles (not necessarily of the size of the original genes) to find their frequency within the population. Table III (annexes) shows the value that ffCriterion must take based on the length of the population to be evaluated.
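The ring-type crossing can be sketched as below. Several details are assumptions, since the text does not fix them: gene i is paired with gene i+1 (the last gene wraps around to the first, so each parent takes part in two crossings); the child takes the first crossParameter alleles from the first parent and the remaining ones from the second; and the second parent is reversed on every crossing whose 1-based index is a multiple of reversalCriterion. The function name `ring_crossing` is hypothetical.

```python
def ring_crossing(genes: list, cross_parameter: int, reversal_criterion: int) -> list:
    """Cross each gene with its ring neighbour, producing one child per pair."""
    n = len(genes)
    children = []
    for i in range(n):
        first = genes[i]
        second = genes[(i + 1) % n]           # ring: the last gene pairs with the first
        if (i + 1) % reversal_criterion == 0:  # ASSUMPTION: reversal rule is illustrative only
            second = second[::-1]
        children.append(first[:cross_parameter] + second[cross_parameter:])
    return children


# Hypothetical three-gene population, crossParameter 6 and reversalCriterion 3.
offspring = ring_crossing(["11110000", "00001111", "10101010"], 6, 3)
```

Under these assumptions the new generation has the same number of genes and the same gene length as the previous one, which is what allows it to replace it.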
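A possible reading of the Mutation Clock is sketched below. It assumes that the selected alleles are those whose 1-based position i is a multiple of k, that is, i ≡ 0 (mod k); this rule is an assumption for illustration and may differ from the selection equation of the paper. The function name `mutation_clock` is hypothetical.

```python
def mutation_clock(population_bits: str, k: int) -> str:
    """Flip every allele whose 1-based position i satisfies i % k == 0 (assumed rule)."""
    if k <= 0:
        raise ValueError("k must be positive")
    mutated = []
    for i, allele in enumerate(population_bits, start=1):
        if i % k == 0:
            mutated.append("1" if allele == "0" else "0")
        else:
            mutated.append(allele)
    return "".join(mutated)


# Hypothetical 10-allele population with k = 5: positions 5 and 10 are flipped.
mutated_population = mutation_clock("0000000000", 5)
```

Under this assumed rule, applying the clock twice with the same k restores the original string, which is convenient for a decryption that reverses the encryption steps.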
In some cases, the number of genes is not an integer. In these cases, the last chain, which is shorter than the ffCriterion value, is omitted and is not evaluated in the Fitness Function.
For the example case, the range of genes is from 32 to 63, since the total number of genes is 39; therefore, the population is divided in such a way that each gene has 6 alleles, given the value of ffCriterion. It is important to know the total number of alleles in the population:
$$totalAlle = 39 \cdot 8 = 312 \quad (4)$$
Next, totalAlle is divided by ffCriterion to obtain the number of genes that the Population to Evaluate will have:
$$\frac{totalAlle}{ffCriterion} = \frac{312}{6} = 52 \quad (5)$$
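The regrouping of the population into chains of ffCriterion alleles, discarding an incomplete last chain, can be sketched as follows. The function name `split_into_chains` and the all-zero test population are assumptions for illustration.

```python
def split_into_chains(population_bits: str, ff_criterion: int) -> list:
    """Split the bit string into chains of ff_criterion alleles, dropping a short last chain."""
    usable = len(population_bits) - len(population_bits) % ff_criterion
    return [population_bits[i:i + ff_criterion] for i in range(0, usable, ff_criterion)]


# 39 genes of 8 alleles give 312 alleles; with ffCriterion = 6 this yields 52 chains.
chains = split_into_chains("0" * (39 * 8), 6)
```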

C. Fitness Function Evaluation
As a result, the population to be evaluated consists of 52 genes of 6 alleles each, for a total of 312 alleles. With these data, we proceed to perform the evaluation with the Fitness Function shown in the pseudocode.
There, populationToEvaluate is the current population, divided into chains of ffCriterion size; repetitionParameter is a value given in the initial conditions that defines the maximum number of times that each chain of the Population to be Evaluated can be repeated; unique, length and find are functions of MATLAB R2017a; differentChains are all the different chains within the population to be evaluated; validChains are the chains that comply with the value of repetitionParameter; ffSatisfaction is the percentage of satisfaction of the fitness function, given by validChains divided by differentChains; and detentionCriterion is the percentage needed to deliver the optimal population (see the criteria in the section Test and Results).
To illustrate the fitness function procedure, we have the following data as input: Given the previous case, we obtain a satisfaction percentage of 75% for the Fitness Function, which means that, of all the different chains in the population to be evaluated, 75% are repeated 2 times or less, as restricted by repetitionParameter. The value of detentionCriterion, set to 80%, is greater than ffSatisfaction; for this reason, the population does not pass the assessment and the algorithm continues with a new iteration. When ffSatisfaction is greater than detentionCriterion, the iteration is considered optimal and the final population is delivered as one long chain of genes of 8 alleles each, as originally stated.
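The fitness evaluation described above can be sketched in Python as follows. The paper's version relies on the MATLAB functions unique, length and find; here `collections.Counter` plays that role. The function name `ff_satisfaction` and the input chains are hypothetical, chosen so that the result is 75% as in the example.

```python
from collections import Counter


def ff_satisfaction(chains: list, repetition_parameter: int) -> float:
    """Percentage of different chains that appear at most repetition_parameter times."""
    if not chains:
        raise ValueError("the population to evaluate is empty")
    counts = Counter(chains)
    different_chains = len(counts)
    valid_chains = sum(1 for c in counts.values() if c <= repetition_parameter)
    return 100.0 * valid_chains / different_chains


# Hypothetical population: 4 different chains, one of them repeated 3 times.
population = ["000111", "000111", "000111", "101010", "101010", "110011", "001100"]
satisfaction = ff_satisfaction(population, repetition_parameter=2)
passes = satisfaction > 80  # detentionCriterion of 80%, as in the example
```

With repetitionParameter = 2, three of the four different chains are valid, so the satisfaction is 75% and the population does not pass a detentionCriterion of 80%.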
Further, the variable maxIterationParameter is defined in the initial conditions, which limits the maximum number of times the algorithm repeats its entire process. If the number of iterations becomes equal to said parameter and an optimal population is not found, the algorithm delivers the population with the best performance.
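The stopping rule can be sketched as a loop that keeps the best population seen so far. The names `run_until_optimal`, `step` and `evaluate` are hypothetical: `step` stands for one full pass of crossing and mutation, and `evaluate` for the fitness function. The toy functions in the usage lines are placeholders, not the real operators.

```python
def run_until_optimal(population, step, evaluate, detention_criterion, max_iterations):
    """Iterate until ffSatisfaction exceeds detention_criterion or max_iterations is reached.

    Returns (population, satisfaction, iteration). If no iteration passes,
    the best-performing population and its satisfaction are returned.
    """
    best_population, best_score = None, float("-inf")
    for iteration in range(1, max_iterations + 1):
        population = step(population)
        score = evaluate(population)
        if score > best_score:
            best_population, best_score = population, score
        if score > detention_criterion:
            return population, score, iteration
    return best_population, best_score, max_iterations


# Toy stand-ins for the genetic pass and the fitness function (illustration only).
result = run_until_optimal(0, lambda p: p + 1, lambda p: p * 10.0, 35, 10)
```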

D. Decryption Process
The decryption of the ciphertext is the inverse process of the encryption, using the key; in this way the initial text is delivered securely to the receiver. Fig. 5 shows the diagram of this process.
We propose a function that computes the approximate number of total alleles $a_t$ as a function of the number of characters c of the initial text:

V. Keys Generation
The Keys of a cryptographic process are pieces of information whose main objective is to allow the encryption and decryption of data. For the present proposal we have an auxiliary key and a main one.

A. Main Key
It is created beforehand by the receiver and used by the algorithm (as a private key). It consists of 16 characters from the ASCII code (which has 95 printable characters), arranged in a table. The selection of this key depends on the receiver, who creates it under their own criteria, with the restriction that there cannot be repeated characters. For greater understanding, see Table IV (annexes), which shows the correct order in which to place the key.
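The restrictions on the main key (16 printable ASCII characters, none repeated) can be checked with a short sketch. The function name `is_valid_main_key` and the sample keys are assumptions for illustration.

```python
def is_valid_main_key(key: str) -> bool:
    """Check that the key has 16 distinct printable ASCII characters (codes 32 to 126)."""
    printable = all(32 <= ord(ch) <= 126 for ch in key)
    return len(key) == 16 and printable and len(set(key)) == 16


valid = is_valid_main_key("ABCDEFGHIJKLMNOP")     # 16 distinct printable characters
repeated = is_valid_main_key("ABCDEFGHIJKLMNOA")  # 'A' appears twice
```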

B. Auxiliary Key
It is generated in a strict order that only the algorithm knows, and it is created as follows: Based on the above, we proceed to convert each number to ASCII (from '0', ASCII code 48, to '9', ASCII code 57) and then to binary. As a result, the Auxiliary Key is obtained together with the information on the length of each of its elements (highlighted in bold).

C. Auxiliary Key Encryption
The generated Auxiliary Key is encrypted with the help of the Main Key. As a first step, each gene is separated into two parts of four alleles each. The first sub-chain is taken: its first two alleles are matched against the values in the first column of the table (which label the rows), the next two alleles are matched against the first row of the table (which labels the columns), and the sub-chain is assigned the character at the matching coordinate. For the first gene we have: gene1 = 00110101 → gene1_1 = 00 11, gene1_2 = 01 01. The character corresponding to gene1_1 = 0011 is V, since it is the point of intersection between row 00 and column 11. The process is carried out in the same way with the other genes of the auxiliary key, obtaining a pair of main-key characters for each gene. The next step is to convert each character to ASCII and then to binary code. Table V (annexes) shows the information corresponding to the first two genes. As a result, we obtain the encryptedAuxKey, which consists of 180 genes, that is, 1440 alleles.
Each digit in the criteria and parameters represents 2 characters in the auxiliary key, which are taken from the main key and then converted into 8 numbers in ASCII code, or 12 genes in the encrypted auxiliary key, that is, 96 alleles. In addition, the values representing the total number of digits of each condition in the initial auxiliary key always add 384 alleles to the encrypted auxiliary key. The minimum number of possible initial digits is 9 (crossParameter 1 digit, mutationParameter 1 digit, repetitionParameter 1 digit, maxIterationParameter 1 digit, reversalCriterion 1 digit, ffSatisfaction 1 digit, detentionCriterion 2 digits, optimalIteration 1 digit), that is, 864 alleles in the encrypted auxiliary key, which means that the key will have at least 864 + 384 = 1248 alleles. Given this value, the probability of being guessed is: The purpose of encrypting the auxiliary key is to increase security, because if it becomes the target of a Brute Force attack, the encrypted string could be discovered, but not the original key. This key goes from being a chain of 30 genes to one of 180 genes after the encryption process, which is a significant gain for the algorithm in terms of security, since its length is considerably increased.
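The table lookup for one gene can be sketched as below. It assumes that the 16 characters of the main key fill a 4×4 table row by row; the actual placement order is given in Table IV, which is not reproduced here. The function name `encrypt_gene` and the sample key are hypothetical, with the key chosen so that row 00, column 11 holds V, as in the example of the text.

```python
def encrypt_gene(gene: str, main_key: str) -> str:
    """Map an 8-allele gene to two main-key characters via a 4x4 lookup table."""
    if len(gene) != 8 or len(main_key) != 16:
        raise ValueError("expected an 8-bit gene and a 16-character main key")
    # ASSUMPTION: the main key is laid out row by row (row-major order).
    table = [main_key[r * 4:(r + 1) * 4] for r in range(4)]
    characters = []
    for half in (gene[:4], gene[4:]):
        row = int(half[:2], 2)   # first two alleles select the row
        col = int(half[2:], 2)   # next two alleles select the column
        characters.append(table[row][col])
    return "".join(characters)


# Hypothetical main key; gene1 = 00110101 as in the text.
pair = encrypt_gene("00110101", "ABCVEFGHIJKLMNOP")
```

With this hypothetical key, the first half 0011 maps to V (row 00, column 11) and the second half 0101 maps to F (row 01, column 01).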

D. About the Keys
The amount of information in bits that each key carries is based on the number of possible combinations it can take (a value that differs from the size of its representation in bits). Each parameter and criterion takes its value subject to its own restrictions (if it has any). crossParameter ranges from 1 to 7, mutationParameter from 1 to 999, repetitionParameter from 1 to 99, maxIterationParameter from 1 to 999, reversalCriterion from 1 to 999, ffSatisfaction from 1 to 99, detentionCriterion from 72 to 98 and optimalIteration from 1 to 999. Accordingly, the length of the auxiliary key is determined as follows (based on equation (1), where k is the total number of combinations):
$$\log_2(7 \cdot 999 \cdot 99 \cdot 999 \cdot 999 \cdot 99 \cdot 27 \cdot 999) \approx 60.67 \text{ bits} \quad (11)$$
Furthermore, given that each of the 16 characters in the main key has 95 possibilities (the printable characters in ASCII code), its information in bits is determined as follows:
$$\log_2(95^{16}) \approx 105.11 \text{ bits}$$
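The two key lengths can be checked numerically with the sketch below. The dictionary name `combinations` is an assumption; the counts are taken from the ranges stated above. Computed directly, the values are about 60.68 and 105.12 bits, so the figures 60.67 and 105.11 appear to be the same quantities truncated to two decimals. The last line is an additional, hypothetical refinement: since the main key cannot repeat characters, the number of ordered selections is 95·94·…·80, which gives slightly fewer bits.

```python
import math

# Number of possible values for each parameter and criterion.
combinations = {
    "crossParameter": 7,
    "mutationParameter": 999,
    "repetitionParameter": 99,
    "maxIterationParameter": 999,
    "reversalCriterion": 999,
    "ffSatisfaction": 99,
    "detentionCriterion": 27,   # 72 to 98 inclusive
    "optimalIteration": 999,
}

# log2 of the product equals the sum of the individual log2 values.
aux_key_bits = sum(math.log2(count) for count in combinations.values())

# 16 characters with 95 possibilities each.
main_key_bits = 16 * math.log2(95)

# Refinement (not in the paper): characters cannot repeat, so count ordered selections.
main_key_bits_no_repeat = math.log2(math.perm(95, 16))
```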

VI. Quality Analysis
The quality analysis covers the data generated by the developed algorithm, the tests and a discussion of them, a comparative table of speed against the DES and RSA algorithms, and finally a comparison of different factors against the AES, RSA and DES algorithms.

A. Test and Results
The tests below are carried out on the implementation of the proposed algorithm and allow conclusions to be drawn from its results. Some criteria are varied in order to optimize the encryption process with respect to security and performance (they are tuned through the tests, although they can also be set at the user's discretion). The restricted parameters are also examined within their ranges.
crossParameter: This parameter is used by the algorithm in the crossing stage. It defines the starting point for selecting the offspring between two genes, and its maximum value must not exceed the length of a gene. Since the initial population is always composed of genes of 8 alleles (length 8), this variable must lie in the range between 1 and 7.

mutationParameter:
The mutation parameter defines the alleles to be selected for modification, allowing the algorithm to diversify the population. It is suggested to set this variable to an odd value, because the genes are of length 8 (an even number). The number of mutated alleles varies according to the value of the parameter; if it is greater than the size of the population, the algorithm does not perform any mutation.
repetitionParameter: This variable defines the maximum number of times that each chain of the population to be evaluated can be repeated. Its value is established according to the performance of the algorithm; note that the longer the text, the more repeated chains are found.
reversalCriterion: It allows the digits of the second parent of each crossing (or iteration) to be inverted. Its value is selected based on the results of the tests themselves. For these, a value between 1 and 6 is taken and its performance is evaluated in 3 different cases according to the Fitness Function. Each case is iterated 4 times and the average satisfaction percentage is computed. For Case 1, the best result (97%) is obtained with reversalCriterion values of 1, 2 and 4, which implies great homogeneity and allows us to conclude that, if the initial text is short (up to 4 words or 25 characters), the value of the criterion can vary without affecting the efficiency of the algorithm. For Cases 2 (90%) and 3 (97%), the best performance is achieved with the value 3, so it is suggested that this value be used as reversalCriterion when the string is longer than 4 words.

detentionCriterion:
The detention criterion establishes the percentage of different chains that must comply with the repetitionParameter; it is used in the fitness function to decide whether an iteration passes the evaluation. Based on the tests carried out for the reversalCriterion, it is suggested to set this variable in a range between 72 and 98, since these were the minimum and maximum percentages obtained, respectively.

B. Runtime and Call Functions
Some profiling data for the algorithm in MATLAB are presented in Table VIII (annexes), which lists the number of calls, the time spent in each function and the total. For these tests, the cases described in Table VI are used. Runtime depends directly on the iterations the algorithm needs to arrive at the optimal solution; for the tests, a detentionCriterion of 98 was set, and the algorithm carried out 2, 52 and 10 iterations for Cases 1, 2 and 3 respectively. The function with the highest number of calls is Convert in all cases. This function converts ASCII code to binary code; it is used to transform the initial text into the chains that the algorithm manipulates to perform the encryption, and it is also used in the process of creating the keys.
The function that accounts for the longest runtime (a little more than half) is AlgorithmEncryption, since it performs the task of carrying out the encryption process, starting from the previously converted initial population and producing a final population ready for evaluation.

VII. Comparison
In order to establish objective conclusions about the performance of the proposed cryptographic algorithm, a comparison of encryption execution time is made against DES and RSA [22], as shown in Table IX (annexes). The text size for the proposed algorithm is taken as the bit representation of the original text, that is, the so-called Initial Population. The values of detentionCriterion and reversalCriterion were set to 97 and 3 respectively for the tests, while the values of crossParameter, mutationParameter, repetitionParameter and maxIterationParameter were set, respectively, to 6, 5, 3 and 200 for text sizes of 128 and 256 bits; to 5, 7, 5 and 100 for text sizes of 512, 1000, 2000 and 5000 bits; and to 5, 17, 10 and 100 for the text size of 10000 bits. The key length (understood as the amount of information it carries, not the size of its representation in bits) is 56 bits for DES and 22 bits for RSA. For the proposed algorithm it is 105.11 bits for the main key and 60.67 bits for the auxiliary key.
The performance against the DES and RSA algorithms is remarkable: Table IX shows the superiority in runtime over RSA, and the proposed algorithm is also faster than DES for texts larger than 512 bits. Overall, the total times of the proposed algorithm are mostly better than those of RSA encryption (except for 128 bits) and are very close to the DES encryption time.
In Table X (annexes), different features of the AES, RSA and DES algorithms [2] are exposed, to compare against the proposed Algorithm.

VIII. Conclusion
The proposed algorithm modifies the order and process of the phases of the genetic algorithms, by applying a deterministic system, leaving aside some random procedures.
When comparing the proposed algorithm against RSA and DES, satisfactory performance is evidenced in several factors, demonstrating that Genetic Algorithms are a good alternative to face problems in computer security.
The proposed algorithm manages to disrupt the information through entropy, evidenced in the fitness function as the different chains.
The length (amount of information carried) of the auxiliary key is 60.67 bits and that of the main key is 105.11 bits, surpassing the DES cryptographic algorithm in this respect and coming considerably close to AES.
The present work exposes a development based on basic concepts like GA, entropy, modular congruence and determinism, that together make an efficient cryptographic process.

IX. Considerations
The code of the developed algorithm is in a private GitHub repository, which can be made visible to those who request access through any of the contact emails. In addition, there is a demo on Heroku that performs the whole encryption process for a given text: https://symmetric-cryptography-genetic.herokuapp.com/. The principles of the proposed algorithm are expected to be applicable to the encryption of images and audio. There is also a possible application in data compression.

Annex table (partial): reversalCriterion tests, values given for Cases 1 / 2 / 3

reversalCriterion | Different chains | Valid chains | ffSatisfaction | Average ffSatisfaction
1 | 36 / 68 / 552 | 36 / 59 / 516 | 100% / 87% / 93% | 98% / 83% / 94%
1 | 33 / 57 / 569 | 32 / 45 / 536 | 97% / 79% / 94% |
1 | 32 / 56 / 547 | 32 / 44 / 511 | 100% / 79% / 93% |
1 | 34 / 65 / 546 | 32 / 58 / ... |