I. Introduction

The SHA-2 (Secure Hash Algorithm) is a set of cryptographic hash functions designed by the National Security Agency and SHA-256 is one of these algorithms. These algorithms were defined semi-formally in documents produced by NIST. Cryptographic algorithms are good candidates for hardware implementation because they typically possess components whose performance benefits from hardware acceleration. But they are also prime candidates for formal verification because of their critical role in maintaining security and integrity in a wide variety of systems.

This case study explores the model-driven design and synthesis of a hardware accelerator for the SHA-256 cryptographic hash function. The approach is semantics-driven, starting from a reference semantics for SHA-256. The whole derivation takes place within the Haskell functional language and so the hardware accelerator produced by the ReWire compiler is formally verifiable with respect to the reference semantics using equational reasoning in Haskell.

Overview

The development in this case study proceeds along the same lines as in Case Study #1. That is, first we develop a reference semantics for SHA-256 in Haskell, show how to transform it into a ReWire implementation, and, finally, show how the resulting implementation is synthesized. The code that we make use of in this is as follows:

II. Haskell Reference Semantics for SHA-256

The Top Level

The input to the SHA-256 algorithm is assumed to be a sequence of bytes and its output is a digest of eight 32-bit words. Respectively, these are represented in Haskell as the types [Char] (i.e., a list of ASCII characters) and Oct Word32. A Haskell function implementing this algorithm, refsha256, is presented below; its type is [Char] -> Oct Word32. The function refsha256 is the composition of functions pad, sha256, and runM which are discussed in more detail below.

refsha256 :: [Char] -> Oct Word32
refsha256 = runM . sha256 . pad

pad    :: [Char] -> [Hex Word32]
sha256 :: [Hex Word32] -> M (Oct Word32)
runM   :: M a -> a

data Oct a = Oct a a a a a a a a
data Hex a = Hex a a a a a a a a a a a a a a a a

The SHA-256 algorithm first pads the input sequence so that the result has length as a multiple of 512 bits. The padded input is then parsed into 512 bit blocks, $M^{(1)},\ldots,M^{(N)}$. Each of these blocks, $M^{(i)}$, is further parsed into sixteen 32-bit words, $M_0^{(i)},\ldots,M_{15}^{(i)}$ and represented in Haskell with the type Hex Word32— thus, the type of the pad function. The code for pad (as well as for runM) is found in TestHarness.hs; their definitions would not be illuminating here.

The function sha256 implements the main loop of the SHA-256 algorithm. It takes the sequence of blocks formatted by pad (i.e., a list of type Hex Word32) as input and produces a computation of a digest (i.e., a computation of type M (Oct Word32)) as output. Here, M is a monad, which is a construction used for representing side-effects in the pure functional language Haskell. We define M and its use below.

Testing Against the NIST Standard

The NIST document defining SHA-256 provides three example hash computations against which one may test implementations of the algorithm. The transcript below summarizes the tests applied to refsha256. The tests apply refsha256 to messages msg1, msg2, and msg3; msg1 is just "abc" while the other two are too long to print easily here.

ghci> refsha256 msg1
  Oct 3128432319 2399260650 1094795486 1571693091 
      2953011619 2518121116 3021012833 4060091821
ghci> hashed1
  Oct 3128432319 2399260650 1094795486 1571693091 
      2953011619 2518121116 3021012833 4060091821
ghci> refsha256 msg2
  Oct 613247585 3523623096 3854575251 205414457 
      2738676825 1694441831 4142722516 433784513
ghci> hashed2
  Oct 613247585 3523623096 3854575251 205414457 
      2738676825 1694441831 4142722516 433784513
ghci> refsha256 msg3
  Oct 3452399196 2568289170 2174863330 2228698727 
      4051737160 2761367566 74267084 3339791568
ghci> hashed3
  Oct 3452399196 2568289170 2174863330 2228698727 
      4051737160 2761367566 74267084 3339791568 

In the above transcript, a test harness is loaded into the GHC (Glasgow Haskell Compiler) interpreter. First the refsha256 is applied, producing the hash digest calculated by the reference semantics. Then, the published hash digests from NIST, Appendix B, are displayed—these are hashed1, hashed2, and hashed3, respectively—and note that they are identical to the corresponding output of refsha256. By default, these digests are displayed in decimal format rather than the hexadecimal format in which they appear in NIST.

SHA-256 Logical Operators

SHA-256 defines a number of operations on 32-bit words called logical operators in Sections 3.2 and 4.1.2 of NIST. These are excerpted in part below:

Screenshot

Each of these functions transliterates simply into Haskell using built-in functions and data from the Haskell libraries, Data.Bit and Data.Word. Within the code listing below, .&., complement, and xor are, resp., the bitwise and, complement and exclusive or operations. The rotateR and shiftR functions are the rotate right and shift right operations.

ch :: Word32 -> Word32 -> Word32 -> Word32
ch x y z = (x .&. y) `xor` (complement x .&. z)

maj :: Word32 -> Word32 -> Word32 -> Word32
maj x y z = (x .&. y) `xor` (x .&. z) `xor` (y .&. z)

bigsigma0 :: Word32 -> Word32
bigsigma0 x = (rotateR x 2) `xor` (rotateR x 13) `xor` (rotateR x 22)

bigsigma1 :: Word32 -> Word32
bigsigma1 x = (rotateR x 6) `xor` (rotateR x 11) `xor` (rotateR x 25)

sigma0 :: Word32 -> Word32
sigma0 x = (rotateR x 7) `xor` (rotateR x 18) `xor` (shiftR x 3)

sigma1 :: Word32 -> Word32
sigma1 x = (rotateR x 17) `xor` (rotateR x 19) `xor` (shiftR x 10)

Constants and Counters

The inner loop of the SHA-256 algorithm is, in effect, a for-loop counting from $0$ to $63$. To allow expression of this in Haskell, we define a counter type Ctr below. It would be possible in the reference semantics to use the built-in Int type as a loop variable, but, anticipating the shift to hardware implementation, a discrete, finite type like Ctr makes more sense. The SHA-256 algorithm also makes use of 64 constants (these are defined in Section 4.2.2 of NIST). We index these constants via the seed function. This code appears in Counter.hs.

data Ctr = C0  | C1  | C2  | C3  | C4  | C5  | C6  | C7  |
           C8  | C9  | C10 | C11 | C12 | C13 | C14 | C15 |
           C16 | C17 | C18 | C19 | C20 | C21 | C22 | C23 |
           C24 | C25 | C26 | C27 | C28 | C29 | C30 | C31 |
           C32 | C33 | C34 | C35 | C36 | C37 | C38 | C39 |
           C40 | C41 | C42 | C43 | C44 | C45 | C46 | C47 |
           C48 | C49 | C50 | C51 | C52 | C53 | C54 | C55 |
           C56 | C57 | C58 | C59 | C60 | C61 | C62 | C63 

incCtr :: Ctr -> Ctr
incCtr C0  = C1
    ...stuff deleted...
incCtr C62 = C63
incCtr C63 = C0 

seed :: Ctr -> Word32
seed C0  = 0x428a2f98
    ...stuff deleted...
seed C63 = 0xc67178f2

Memory Layout of SHA-256

SHA-256 is an imperative algorithm, meaning that it involves assignments to registers, loops, etc. To program in imperative style in Haskell, one normally uses state monads. This section will provide a quick refresher of the fundamentals of monadic programming in Haskell for the sake of being as self-contained as possible. For readers requiring more information, please consult, for example, Liang 1995.

There are four types of storage in the algorithm:

  • Intermediate Digest: temporary storage for a digest value; of type Oct Word32.
  • Current Block: the current, formatted block; of type Hex Word32.
  • Digest: the value of the digest; of type Oct Word32.
  • Loop Counter: the current value of the loop counter register of type Ctr.

To create storable registers for each of these types, we create a monad M using the state monad transformer StateT in the code below. Notice that, for each type of register above, there is a corresponding application of the state monad transformer. This stack of StateT applications applies to the Identity monad, which has no registers at all.

type M = 
  StateT (Oct Word32)                -- Intermediate Digest
    (StateT (Hex Word32)             -- Current Block
       (StateT (Oct Word32)          -- Digest
          (StateT Ctr Identity)))    -- Loop Counter

For each register type, there are corresponding read and write operations in M for manipulating the register. These operations are prepended with get and put, respectively, and are defined below. Paying attention to the types of the operation is the best way to understand them. For example, to read the current digest value, one uses getDigest :: M (Oct Word32). This type signifies that getDigest is an M computation that returns an Oct Word32 value. To update the current value of the digest register, one uses putDigest :: Oct Word32 -> M(). For a given d :: Oct Word32, (putDigest d) :: M() is a computation that sets the current value of the digest register to d.

getIntDig :: M (Oct Word32)         -- Intermediate Digest Operations
getIntDig = get
putIntDig :: Oct Word32 -> M ()
putIntDig = put

getBlock :: M (Hex Word32)          -- Current Block Operations
getBlock = lift get
putBlock :: Hex Word32 -> M ()
putBlock = lift . put

getDigest :: M (Oct Word32)         -- Digest Operations
getDigest = lift (lift get)
putDigest :: Oct Word32 -> M ()
putDigest = lift . lift . put

getCtr :: M Ctr                     -- Loop Counter Operations
getCtr = lift (lift (lift get))
putCtr :: Ctr -> M ()
putCtr = lift . lift . lift . put

To chain together these operations, we use Haskell’s do notation. In the code below, first the current digest is read with getDigest, then the current intermediate digest is read with getIntDig, and finally they are added together component-wise and stored back in the digest register using putDigest. This intermediate operation is part of the Haskell reference semantics for SHA-256.

intermediate :: M ()
intermediate = 
   do
      Oct h1 h2 h3 h4 h5 h6 h7 h8 <- getDigest
      Oct a b c d e f g h         <- getIntDig
      putDigest (Oct (a+h1) (b+h2) (c+h3) (d+h4) 
                     (e+h5) (f+h6) (g+h7) (h+h8))

There is one more M operation to describe: return :: a -> M a. This polymorphic operation takes a value v of some type a and produces an M computation that simply returns v. The return operation produces no side effects and is, in some sense, a "do nothing" operation.

Main Function

Below is displayed the Haskell code for the main loop of the SHA-256 algorithm, which is represented by the recursive list function sha256. The sha256 function first initializes the digest and then passes the parsed input on to the mainloop function. The mainloop function is a straightforward rendering of the pseudo-code defining the main loop of SHA-256 (see Section 6.2.2 of NIST). The code for the message scheduling and compression routines can be found in Reference256.hs.

sha256 :: [Hex Word32] -> M (Oct Word32)
sha256 hws = do
                putDigest initialSHA256State
                mainloop hws
                getDigest

mainloop :: [Hex Word32] -> M ()
mainloop []             = return ()
mainloop (hw32 : hw32s) = do
                             hi_1 <- getDigest
                             putIntDig hi_1
                             putBlock hw32
                             putCtr C0
                             innerloop
                             mainloop hw32s

innerloop :: M ()
innerloop = do
               ctr <- getCtr
               s <- sched
               compress (seed ctr) s
               putCtr (incCtr ctr)
               case ctr of
                    C63 -> intermediate
                    _   -> innerloop

-- SHA-256 message scheduling and compression routines
sched    :: M Word32
compress :: Word32 -> Word32 -> M ()

III. The First Implementation

The reference semantics as a whole will not translate directly to hardware: it makes extensive use of list types in Haskell (i.e., list types) that are of potentially unbounded size. Hardware circuits, which require a finite storage footprint, cannot accommodate such types generally. However, large parts of the reference semantics may be re-used in a ReWire specification of a circuit that computes the sha256 function from the end of the previous section, albeit with a different calling convention to eliminate lists.

Hardware Accelerator for SHA-256 in Haskell

The code below presents the input and output signals for the hardware accelerator device; these are the Inp and Out types, resp. The Init input signal tells the accelerator to initialize the current block and begin a new hash computation. The Load signal instructs the accelerator to load new input into the current block register. The DigestQ signal is used to return the computed hash digest. The DigestR output signal returns the computed hash digest and the Nix output signal is the default output of the accelerator. The code below also presents the definition of the SHA-256 accelerator. The most important thing to note is that devsha256 reuses most of the code directly from the reference semantics sha256. First, devsha256 outputs the Nix signal, receives an input i and passes it to dev. The function dev handles each Inp signal as described in the previous paragraph.

data Inp = Init (Hex Word32) | Load (Hex Word32) | DigestQ
data Out = DigestR (Oct Word32) | Nix

type Device = ReacT Inp Out M ()

devsha256 :: Device
devsha256 = do 
               i <- signal Nix
               dev i

dev :: Inp -> Device
dev (Init hw32) = do
                     lift (do
                              putDigest initialSHA256State
                              hi_1 <- getDigest
                              putIntDig hi_1
                              putBlock hw32
                              putCtr C0)
                     signal Nix
                     innerloop 
dev (Load hw32) = do
                     lift (do
                              hi_1 <- getDigest
                              putIntDig hi_1
                              putBlock hw32
                              putCtr C0)
                     signal Nix
                     innerloop
dev DigestQ     = do
                     h_n <- lift getDigest 
                     i <- signal (DigestR h_n)
                     dev i

innerloop :: Device
innerloop   = do
                 ctr <- lift (do
                                 c <- getCtr
                                 s <- sched
                                 compress (seed c) s
                                 putCtr (incCtr c)
                                 return c)
                 i <- signal Nix
                 case ctr of
                      C63 -> do lift intermediate 
                                dev i
                      _   -> innerloop

Top Level for devsha256

The code below presents the top-level functions for testing the devsha256 accelerator in Haskell. Below, we discuss the calling convention for devsha256, which is encapsulated as the format function. The simdev function takes a Device and a list of Inps and computes the final Out signal produced by the device on those inputs. The code for both of the functions is found in TestHarness.hs. Importantly, the hardware accelerator may be tested in Haskell: using godev256 produces the same results as the refsha256 did in Section II.

godev256 :: [Char] -> Out
godev256 = runM . simdev devsha256 . format . pad
format    :: [Hex Word32] -> [Inp]
simdev    :: Device -> [Inp] -> M Out

To call the main function from the reference semantics, one simply applies the function sha256 to the list of input blocks. That is, ${\mathtt{sha256~ [{\mathit{M}}^{(\mathit{1})},\ldots,{\mathit{M}}^{(\mathit{N})}]}}$ computes a digest. The device version, devsha256, requires that this input be formatted differently, and this is accomplished with the function, format :: [Hex Word32] -> [Inp]. The code for format is defined in TestHarness.hs, but the action of the format function is described below:

\[\mathtt{format} [M^{(1)},\ldots,M^{(N)}] = \begin{array}[t]{l} [\mathtt{Init}~ M^{(1)}, \underbrace{\mathtt{DigestQ},\ldots,\mathtt{DigestQ}}_{\times{64}}, \\ ~\mathtt{Load}~ M^{(2)}, \underbrace{\mathtt{DigestQ},\ldots,\mathtt{DigestQ}}_{\times{64}}, \\ ~~~~~~~~~~~~~~~~~~~~~~~~\vdots \\ ~\mathtt{Load}~ M^{(N)}, \underbrace{\mathtt{DigestQ},\ldots,\mathtt{DigestQ}}_{\times{64}}, \\ ~\mathtt{DigestQ}] \end{array}\]

This calling convention is analogous to those found in publications presenting SHA-256 accelerators hand-coded in VHDL (e.g., Chaves06, Algredo13, Ting02, McEvoy06, Garcia14, Sklavos05, and Kahri15). Note that the sixty four DigestQ signals are used only as padding to wait for the inner loop within the device to finish. The final DigestQ will result in the completed hash value being returned.

Hardware Accelerator for SHA-256 in ReWire

This section highlights the transformations required of the first implementation (i.e., FirstImplementation.hs) to produce a ReWire implementation (i.e., RWFirstImplementation.hs). These changes proceed along the following lines:

  • Recasting the Primitive Types and Operations. We have, heretofore, used the Word32 type and related operations from the Haskell libraries Data.Bit and Data.Word. To produce a synthesizable implementation, we implement our own type, W32, and related operations. In this part, we see how to link a ReWire type to an implementation in VHDL using the ReWire keyword nativeVhdl.
  • Eliminating type synonyms. This mainly involves eliminating the use of M.
  • Adding a start symbol. This will involve the extrude primitive used whenever devices have internal state (i.e., use the state monad transformer).

Recasting the Primitive Types and Operations

Up to this point, we have used the Word32 type provided by the Haskell libraries, Data.Bit and Data.Word. This has been very convenient for the rapid prototyping of the SHA-256 algorithm in Haskell, because those libraries have nice syntax for 32-bit words (e.g., 0x0) and common operations (e.g., + and .&.) are already defined. In this section, we describe the conversion of the specification from Word32 to the W32 type provided in the ReWire prelude (i.e., ReWirePrelude.hs). This conversion is necessary to transform the SHA-256 specification into proper ReWire.

Constants

With the Haskell libraries Data.Bit and Data.Word, one can write hexadecimal constants for 32 bit words by prepending the constant with 0x, as in the following code we have already seen:

seed :: Ctr -> Word32
seed C0  = 0x428a2f98
seed C1  = 0x71374491
seed C2  = 0xb5c0fbcf
     ...

This syntax for constants is not available from the ReWire front end, and so each constant required by the SHA-256 algorithm must be defined as a value of type W32, as in the following code (from MetaprogrammingRW.hs).

w428a2f98 :: W32
w428a2f98 = W32 Zero One Zero Zero Zero Zero One Zero One Zero Zero Zero One Zero One Zero Zero Zero One Zero One One One One One Zero Zero One One Zero Zero Zero

While this is a bit tedious to perform these instantiations, the burden was alleviated somewhat by Haskell code to generate them — see the commented code in MetaprogrammingRW.hs. Once all the necessary constants have been defined, the Word32 constants were replaced with their corresponding W32 constants; please note the change of type declarations as well.

seed :: Ctr -> W32
seed C0  = w428a2f98
seed C1  = w71374491
seed C2  = wb5c0fbcf
     ...
Native VHDL Implementations of W32 operations

The Data.Bit and Data.Word Haskell libraries also provided implementations for standard operations: addition (+), bitwise and (.&.), bitwise exclusive or (xor), and bitwise complement (complement) as well as rotate and shift operations. These we replace, respectively, by w32Plus, w32And, w32Xor, and w32Not. The rotate and shift operations are replaced by operations from ReWirePrelude.hs. So, for example, the code for the SHA-256 logical operators now becomes the following:

ch :: W32 -> W32 -> W32 -> W32
ch x y z = (x `w32And` y) `w32Xor` (w32Not x `w32And` z)

maj :: W32 -> W32 -> W32 -> W32
maj x y z = (x `w32And` y) `w32Xor` (x `w32And` z) `w32Xor` (y `w32And` z)

bigsigma0 :: W32 -> W32
bigsigma0 x = (rotateR2 x) `w32Xor` (rotateR13 x) `w32Xor` (rotateR22 x)

bigsigma1 :: W32 -> W32
bigsigma1 x = (rotateR6 x) `w32Xor` (rotateR11 x) `w32Xor` (rotateR25 x)

sigma0 :: W32 -> W32
sigma0 x = (rotateR7 x) `w32Xor` (rotateR18 x) `w32Xor` (shiftR3 x)

sigma1 :: W32 -> W32
sigma1 x = (rotateR17 x) `w32Xor` (rotateR19 x) `w32Xor` (shiftR10 x)

The question now remains as to how we implement w32Plus, w32And, w32Xor, w32Not, and incCtr? For these operations, we choose to implement them directly in VHDL. Accomplishing this in ReWire comes in two parts: writing the type signatures of the operations using nativeVhdl and providing VHDL implementations of the operations in a separate file.

The following shows the first step (in RWFirstImplementation.hs):

w32Plus :: W32 -> W32 -> W32
{-# INLINE w32Plus #-}
w32Plus = nativeVhdl "w32Plus" w32Plus

w32Xor :: W32 -> W32 -> W32
{-# INLINE w32Xor #-}
w32Xor  = nativeVhdl "w32Xor" w32Xor

w32And :: W32 -> W32 -> W32
{-# INLINE w32And #-}
w32And  = nativeVhdl "w32And" w32And

w32Not :: W32 -> W32
{-# INLINE w32Not #-}
w32Not = nativeVhdl "w32Not" w32Not

incCtr :: Ctr -> Ctr
{-# INLINE incCtr #-}
incCtr = nativeVhdl "incCtr" incCtr

Above, the nativeVhdl keyword tells the ReWire compiler that it will find the implementations of these operations in a separate VHDL file. The second step defines each operation, as in the following file, prims.vhd:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package prims is
  pure function w32Plus (x : std_logic_vector; y : std_logic_vector) return std_logic_vector;
  pure function w32Xor (x : std_logic_vector; y : std_logic_vector) return std_logic_vector;
  pure function w32And (x : std_logic_vector; y : std_logic_vector) return std_logic_vector;
  pure function w32Not (x : std_logic_vector) return std_logic_vector;
  pure function incCtr (x : std_logic_vector) return std_logic_vector;
end prims;

package body prims is
  pure function w32Plus (x : std_logic_vector; y : std_logic_vector) return std_logic_vector is
  begin
    return (std_logic_vector(unsigned(x)+unsigned(y)));
  end w32Plus;

  pure function w32Xor (x : std_logic_vector; y : std_logic_vector) return std_logic_vector is
  begin
    return (x xor y);
  end w32Xor;

  pure function w32And (x : std_logic_vector; y : std_logic_vector) return std_logic_vector is
  begin
    return (x and y);
  end w32And;
  
  pure function w32Not (x : std_logic_vector) return std_logic_vector is
  begin
    return (not x);
  end w32Not;

  pure function incCtr (x : std_logic_vector) return std_logic_vector is
  begin
    return (std_logic_vector(unsigned(x)+1));
  end incCtr;
end prims;

Eliminating type synonyms

There are two type synonyms used in FirstImplementation.hs; these are:

type M = StateT (Oct Word32)
            (StateT (Hex Word32)
                (StateT (Oct Word32)
                    (StateT Ctr Identity)))
type Dev i o   = ReacT i o M

This means that, in RWFirstImplementation.hs, each occurrence of M is replaced by:

   StT (Oct Word32) (StT (Hex Word32) (StT (Oct Word32) (StT Ctr I)))

and each occurrence of Dev Inp Out is replaced by

   ReT Inp Out (StT (Oct Word32) (StT (Hex Word32) (StT (Oct Word32) (StT Ctr I))))

There are three small syntactic changes to note. In ReWire, the built-in type constructors for the state and reactive resumption monad transformers are StT and ReT (instead of StateT and ReacT). Also, as we have seen before, the built-in contructor for the identity monad in ReWire is I rather than Identity. Eliminating this distinction without a difference is another example of a small tweek to the language which is on our to-do list.

Adding a start symbol

The device specification in RWFirstImplementation.hs is identical to that of FirstImplementation.hs, although we affix an ` to it for ease of presentation:

devsha256' :: ReT Inp Out (StT (Oct W32) (StT (Hex W32) (StT (Oct W32) (StT Ctr I)))) ()
devsha256' = signal Nix >>= \ i -> dev i

This subsection illustrates how we produce a declaration compilable to VHDL from this. The ReWire compiler can compile device expressions of type ReT i o I t for appropriate input, output, and return types, i, o, and t. So, clearly, some function must be applied to devsha256 to produce something of the right type for the compiler. ReWire has a built-in function for doing just that called extrude.

The ReWire built-in function extrude is used to provide a device defined with StT s with an initial state of type s. For any monad m, it has type:

extrude :: ReT i o (StT s m) a -> s -> ReT i o m (a,s)

In a call (extrude dev s0), extrude, in effect, initializes the s-state to s0 in device dev. By using multiple applications of extrude, we can initialize all the states in a device specification. For devsha256', this will require four applications of extrude. Therefore, we can define the start symbol as:

start :: ReT Inp Out I (((((), Oct W32), Hex W32), Oct W32), Ctr)
start = extrude
         (extrude
           (extrude
            (extrude
                 devsha256'
                 (Oct w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000))
                 (Hex w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000))
                 (Oct w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000 w00000000))
		 C0

In the above definition, we have initialize the intermediate digest register, the current block register, the digest register, and the counter register to appropriate forms of 0. Observe that start now possesses a type that the ReWire compiler can translate to VHDL.

IV. Second Implementation: Refactoring for Implementation

Recall that the inputs to the devsha256 are 512 bits wide—i.e., a Hex Word32 value requires 512 bits ($= 16 \times 32)$ to represent. Similarly, the Oct Word32 outputs of devsha256 require 256 ($= 8 \times 32$) bits to represent. Consequently, we would need an FPGA with at least 768 (= 512+256) I/O pins in order to synthesize devsha256 directly; this is an unrealistically large number. To bring the pin count down, we can refactor the devsha256 specification to marshal and unmarshal the input and output 64 bits at a time. If we want to target a smaller FPGA, then we can refactor the devsha256' specification to marshal and unmarshal the input and output to and from the board, thereby cutting the number of required I/O pins.

Marshaling and Unmarshaling

The code below displays the new definitions of the Inp and Out data types to accomplish the marshaling and unmarshaling of data on and off the targeted board. Each load signal (i.e., either Init or Load0 through Load7) contains two 32 bit words, so that words are loaded two at a time onto the targeted board. Each output digest signal (i.e., DigestR) off loads two 32 bit words per cycle, so that it takes four digest request signals (i.e., DigestQ0 through DigestQ3) to output one digest. The calling convention for the new marshaling accelerator device changes as well, and it is defined similarly to the format function from the previous section (and can be found in TestHarness.hs). The code for the refactored device is found in RWSecondImplementation.hs.

data Inp = Init Word32 Word32  | Load0 Word32 Word32 | 
           Load1 Word32 Word32 | Load2 Word32 Word32 | 
           Load3 Word32 Word32 | Load4 Word32 Word32 | 
           Load5 Word32 Word32 | Load6 Word32 Word32 | 
           Load7 Word32 Word32 |
           DigestQ0 | DigestQ1 | DigestQ2 | DigestQ3
data Out = DigestR Word32 Word32 | Nix

The following is the new implementation, devsha256'' that marshals and unmarshals data blocks off of the circuit. The loading and digest functions (i.e., load0, digest0, etc.) can be found in the ReWire specification RWSecondImplementation.hs.

devsha256'' :: ReT Inp Out (StT (Oct W32) (StT (Hex W32) (StT (Oct W32) (StT Ctr I)))) ()

devsha256'' = do i <- signal Nix
                 dev i

dev :: Inp -> ReT Inp Out (StT (Oct W32) (StT (Hex W32) (StT (Oct W32) (StT Ctr I)))) ()
dev (Init w1 w2) = do
                      lift (do
                               putDigest initialSHA256State
                               hw <- getBlock
                               putBlock (load0 w1 w2 hw))
                      i <- signal Nix
                      dev i
dev (Load0 w1 w2) = do
                       lift (do
                                hw <- getBlock
                                putBlock (load0 w1 w2 hw))
                       i <- signal Nix
                       dev i
             ...stuff deleted...
dev (Load7 w1 w2) = do
                       lift (do
                                putCtr C0
                                hi_1 <- getDigest
                                putIntDig hi_1
                                hw <- getBlock
                                putBlock (load7 w1 w2 hw))
                       signal Nix
                       innerloop
dev DigestQ0      = do
                       h_n <- lift getDigest
                       i <- signal (digest0 h_n)
                       dev i
             ...stuff deleted...
dev DigestQ3      = do
                       h_n <- lift getDigest
                       i <- signal (digest3 h_n)
                       dev i

Synthesis

The devsha256'' accelerator has been tested in Haskell to show that it computes the same hashes as the reference semantics. Using the ReWire compiler, we can produce synthesizable VHDL and synthesize to a FPGA target — here we target the Xilinx Spartan-3E XC3S500E-4FG320 featured on the Xilinx Spartan-3E Starter Kit. Note that this board possesses sufficient I/O pins to accommodate devsha256''.

The table below presents the synthesis numbers for devsha256'' for the Xilinx ISE toolset targeting the Spartan-3E. The maximum clock rate was 60 MHz. Screenshot There are 76 cycles per output, and so the total throughput for devsha256'' is calculated as:

\[(60 \mathit{MHz} \times 512 \mathit{bits}) / 76 \mathit{cycles} = 404 \mathit{Mbps}\]

This throughput for the devsha256'' accelerator is on the low end of the performance spectrum of the hand-crafted VHDL implementations reported in Chaves06 — higher than Sklavos and Koufopavlou Sklavos05 (376 Mbps) and lower than Chaves et al. (1370 Mbps). The aforementioned prior works are not targeting the same FPGA, and so the above is something of a sort of "apples-to-oranges" comparison. The performance numbers should be taken merely as indicative that performance is reasonable for a rapidly developed prototype.