Click here to go to the blowfishvhdl sourceforge page:
Below is a paper I wrote when I coded up the original blowfish algorithm. I apologize in advance for any problems with this document; I wrote it up in a couple hours and never really edited it. =)
By Wesley J. Landaker <wjl@mindless.com>
Because something comparable was not readily and freely availible, I implemented the Blowfish encryption algorithm in synthesizable VHDL. My implementation is certainly not the fastest possible implementation, and it is also not the smallest.
My implementation is somewhere middle-of-the-line, where I made significant tradeoffs between speed, size, and ease of implementation. The main focus was to make an implementation that was usable, moderately compact, and would still run at an acceptable clock speed.
I chose to implement Blowfish for several reasons. The Blowfish cipher is extremely strong, is completely free, and currently does not have any public VHDL or hardware implementation of any sort readily available anywhere.
As described by Counterpane, the home of the algorithm, "Blowfish is a symmetric block cipher that can be used as a drop-in replacement for DES or IDEA. It takes a variable-length key, from 32 bits to 448 bits, making it ideal for both domestic and exportable use. Blowfish was designed in 1993 by Bruce Schneier as a fast, free alternative to existing encryption algorithms. Since then it has been analyzed considerably, and it is slowly gaining acceptance as a strong encryption algorithm. Blowfish is unpatented and license-free, and is available free for all uses." (http://www.counterpane.com/blowfish.html)
The Blowfish algorithm is block cipher consisting of a standard Fiestel network with 16 rounds, plus some initial and final encryption functions. The main feature that sets it off from other similar algorithms is that the non-linear substitution boxes (called sboxes) are key-dependent. Because of this, there is an expensive initialization step required on every key change, requiring the equivalent of 520 encryptions to initialize the sboxes. Because of this step, any Blowfish implementation requires a substantial number of finite state machines to control initialization and encryption.
In any implementation, there are several separate circuit pieces that can be identified. First is the encryption core (further referred to as 'crypt') that implements the actual Fiestel network. Second is the function F(xL) that crypt relies on for each round of the Fiestel network. Third is a generated array of sub-keys, called the parray, which is also used by crypt each round. Fourth are the four key-dependent sboxes that are read by the F(xL) function also each round. Fifth would be any control logic necessary to initialize the parray and sboxes.
As with all circuits, there are many ways to implement this algorithm. There are, however, two main paradigms for Blowfish implementation. The first is a purely combination or fully-pipelined core (which both have the same problems as discussed below) with a synchronous control. With this approach, the main problem is that since all 16 rounds will be running simultaneously, the parray and the sboxes must be able to have all of their data read at every location, every clock cycle. This translates to 1024 32-bit memory ports with 8-bit addresses, and 18 more with 5-bit addresses, which could be an extremely expensive size cost. This also means that the crypt and F(xL) hardware must be duplicated 16 times. The advantages to this approach, however, are that one data block could be processed every clock cycle, and a new key could be initialized in 520 clock cycles.
The other obvious way to implement this algorithm is the way that I have done it. That is, to have a single register that feeds back to itself through each round. First, this means that the parray and sboxes only need to have 1 memory port each--a total of 5 memory ports, compared to the 1042 required in the previously discussed implementation. Second, this also means that the only hardware that has to exist in crypt is that for a single round, and the hardware for F(xL) only has to exist once. The downside of this sort of implementation is that running optimally, one round per cycle, it can never take less than 16 clock cycles per data block, and 8320 cycles to initialize a key. In my implementation, I actually take approximately twice as long, because my sboxes and parray are implemented synchronously. For more information about the exact timing of my design, see the section on timing below.
The final large issue is the issue of supporting variable key lengths. It is viable for an implementation to hard-code a key length into the circuit if it is known that only keys of a certain size will be used. However, to implement the algorithm the most generally, it requires a very large structure of muxes or something similar to route various bits of the key to the proper places depending on the key length. In my implementation, I have added this large muxing structure, but if the key length is hard coded to a constant value before synthesis, this hardware will not be synthesized, and a considerable space savings will be achieved.
In order to test this implementation of Blowfish, I used a software version of Blowfish I have written that has been checked against some standard test vectors and the official reference implementation of Blowfish. This allowed me to run the same data through the software and VHDL implementations of Blowfish and compare the results.
I tested this implementation with several key lengths between 128 and 448, each with 2 or 3 random data blocks. Each block encrypted was then decrypted to make sure that the original data was restored correctly.
Although it would be nearly impossible to exhaustively test all combinations of inputs (448 bits of key, 4 bits of key_size, and 64 bits of data = 114688 bits!!), because of the nature of encryption algorithms (they are designed to have every output bit depend on every input bit), several tests is usually all that is necessary to determine that the system works correctly.
Interfacing with this implementation Blowfish is very straightforward. Below is a description of the inputs and outputs and how they are used.
clk : 1 - The input clock signal.
key : 448 - The encryption/decryption key. If less than a 448 bit key is desired, this signal must be padded up to 448 bits. Typically, this padding consists of all 0's.
key_size : 4 - The size of the key being used, in 32-bit words. Valid inputs are between 1 and 14. If 0 or 15 are used, key_size of 14 is assumed. For example, 4 corresponds to a 128-bit key, 14 corresponds to a 448-bit key. NOTE: If this is hard_coded before synthesis, it eliminate a huge matrix of multiplexers needed to implement key permutations of variable size.
data_in : 64 - The input data. In encryption mode, this is the plaintext. In decryption mode, this is the ciphertext. This is only read when the ready signal is asserted.
new_data : 1 - This signal should be asserted when new data is presented on the data_in port. This signal is ignored when the circuit is not asserting ready. This should not be asserted at the same time as new_key.
new_key : 1 - This signal should be asserted when a new key or key_size is presented. This signal is ignored when the circuit is not asserting ready. This should not be asserted at the same time as new_data.
encrypt : 1 - This signal toggles between encryption and decryption operation. 1 means encrypt, 0 means decrypt. This signal is only read the same cycle that new_data is asserted and read.
data_out : 64 - The output data. In encryption mode, this is the ciphertext. In decryption mode, this is the plaintext. This is only modified during key initialization and the same cycle that ready is raised after an encryption or decryption sequence.
ready : 1 - This signal is asserted when the circuit is ready to receive a new data block or a new key (or key_size). The circuit will ONLY read new_key or new_data when this signal is asserted. This signal also signifies that the data_out signal is valid if an encryption or decryption sequence was previously running.
To properly initialize the circuit, new_key and new_data must be low before the first rising edge of the clock. After the first clock cycle, the circuit will assert ready and it can then be used. If this constraint is not met, the circuit will still initialize, but can take up to 21106 clock cycles before ready is asserted. However, any time that ready is asserted after the first clock cycle, the circuit is usable.
The following timing is given as a reference. Generally, in interfacing with this implementation, it is not necessary to keep track of any timing or have any external counters--it is sufficient to assert the proper signals, then wait for the ready signal to come back on before performing the next operation.
Time from when new_key is asserted and read to when ready is asserted again (i.e. time to initialize the algorithm with a new key): 21106 clock cycles.
Time from when new_data is asserted and read to when ready is asserted again (i.e. time to process one block of data): 37 clock cycles.
(NOTE: new_key and new_data are only read when ready is asserted.)
My implementation of Blowfish has been designed to completely and correctly implement the algorithm with a focus on ease-of-design and ease-of-use without sacrificing too much speed or size. Certainly better implementations could be used and my existing circuit could definitely be optimized. However, the circuit does work as designed and follows the Blowfish algorithm specification exactly, which was my main goal.
This VHDL implementation will be cleaned up a bit with code comments and some minor documentation and licensing information and then will be released publicly under the GNU Public License, which will provide almost unlimited use of this VHDL model to the public.