In part one of this series dedicated to Formal Verification of Ethereum 2.0, Joanne Fuller from the Consensys R&D illustrates how to locate, understand, and fix a subtle array-out-of-bound runtime error within the state transition component.
The Phase 0 Beacon Chain reference implementation describes the state machine that every Beacon node has to implement in the decentralised Ethereum 2.0 ecosystem. The reference implementation comprises three main sections: 1. State Transition, 2. Simple Serialize (SSZ) library, and 3. The Merkleise library.The state transition is the most critical component and describes how the state is to be updated upon processing of a new block.
Below is research that supports how to locate, understand, and fix a subtle array-out-of-bound runtime error within the state transition component. This error is dependent on the number of active validators as summarised below and can therefore be fixed by imposing an upper bound on the number of active validators.
Number of active validators (V) | No array-out-of-bound error? |
---|
Table 1: Summary of results.
Detecting this error would be difficult using any other strategy, e.g. code peer review or testing. In contrast, using formal verification, we can detect and fix the error, we are able to mathematically and mechanically prove the correctness of the fix which goes beyond what testing techniques can do. It is a technical blog (written in a tutorial style) with the aim to allow readers, even those not completely familiar with formal verification, to be able to understand how these techniques are useful in the blockchain context.
The Project
The project to formally verify Phase 0 of the Ethereum 2.0 Beacon Chain reference implementation using the verification-friendly language Dafny was complete earlier this year.
The primary goal was to formally verify the absence of runtime errors (under/overflows, array-out-of-bounds, division by zero) and our secondary goal (not covered in this blog) to prove functional correctness.
For a more detailed description of the whole project, you can refer to our paper Formal Verification of the Ethereum 2.0 Beacon Chain. To see the full project code, please refer to our GitHub repository: https://github.com/Consensys/eth2.0-dafny.
An Overview of the State Transition
Let’s begin with a simplified view of the Beacon Chain system (Figure 1). Time is logically divided into epochs and each epoch is then divided into a fixed number of slots. Blocks are associated with a particular slot; however a slot may be empty and therefore no block will be processed. As shown in Figure 1, the arrival of a block causes the state to change, as does the passing of an epoch boundary (even if no further blocks have arrived). This view of the system is a simplified one; please note that the BeaconState
is actually updated at every slot; but in the case of empty slots that aren’t at an epoch boundary, this update is simply an increment to the slot
field.
Now let’s consider the state_transition
; it governs how the BeaconState
is updated. The state_transition
is a complex procedure but at the highest level it comprises two distinct steps; process_slots
and then process_block
(Figure 2).
The state_transition
procedure is triggered when a new block arrives for a given slot. The first step (process_slots
) progresses the BeaconState
through any empty slots such that the state can ‘catch up’ from the last recorded block/slot to the current one. The BeaconState
is updated as each empty slot is processed (i.e. the slot field is incremented) and if the empty slot is an epoch boundary, additional processing occurs to implement, for example, justification and finalisation, the rewards and penalties, slashing, etc.
Once the state slot is equal to the slot of the block that has just arrived, the second step (process_block
) can begin. Within the block processing section further complexity arises. In particular, the process_operations
section contains many components such as processing the proposer and attester slashings, attestations, any new deposits and any voluntary exits; each component affecting its own subset of the BeaconState
fields.
Using Formal Verification to find an Array-out-of-bound Runtime Error
We are now ready to look at the details of an array-out-of-bound runtime error discovered during our project. This error serves as an important example of the benefits of formal verification and in particular, demonstrates its power when dealing with complex calculations that span multiple levels of function calls and include less familiar operations such as floor division.
Where does the error occur?
The function get_attesting_indices
is called within the process_rewards_and_penalties
, process_justification_and_finalisation
, and process_attestion
components of the state_transition
(Figure 3). In the final line of get_attesting_indices (Figure 4) the set of validator indices to be returned is determined by enumerating over committee
such that committee[i]
is included if bits[i]
is set to 1.
The bits
parameter is a Bitlist
with a maximum length of MAX_VALIDATORS_PER_COMMITTEE
(i.e. 2048). Hence, if
|committee| > MAX_VALIDATORS_PER_COMMITTEE
then an array-out-of-bound runtime error will occur.
Can this occur?
This is where we see the power of using a tool like Dafny. When this function was coded in Dafny the verifier automatically checked for runtime errors and in this case the verifier identified a potential array-out-of-bound error. (Note, I use the word 'potential' as there are instances where the verifier may not have enough information to determine correctness, or may 'time out' while trying to verify.) So we had to trace the problem and further understand the calculation of committee
. The aim was to determine whether additional information (such as pre and post conditions) would facilitate verification, or whether a bug existed and if so, specifically under what circumstances would it be triggered.
Tracing the calculation
The committee
variable is set by calling get_beacon_committee
, which in turn calls compute_committee
as shown in the Python code below (Figures 5 and 6).
Starting with the return
statement in compute_committee
we see a sequence with length end-start is returned, hence:
|compute_committee(indices, seed, index, count)|
= end
-start
= (len(indices) * (index+1)) // count
- (len(indices) * index) // count
When compute_committee
is called from get_beacon_committee
we see that:
indices
=get_active_validator_indices
seed
=get_seed
index
=(slot % SLOTS_PER_EPOCH) * committees_per_slot + index
Note that the
index
parameter inget_beacon_committee
is different from theindex
parameter incompute_committee
.
count
=committees_per_slot * SLOTS_PER_EPOCH
Making these substitutions:
end
=(len(get_active_validator_indices) * (((slot % SLOTS_PER_EPOCH) * committees_per_slot + index)+1)) // (committees_per_slot * SLOTS_PER_EPOCH)
start
=(len(get_active_validator_indices) * ((slot % SLOTS_PER_EPOCH) * committees_per_slot + index)) // (committees_per_slot * SLOTS_PER_EPOCH)
As committees_per_slot
is determined by a call to get_committees_per_slot
and is a function of the number of active validators:
committees_per_slot
=max(1, min(MAX_COMMITTEES_PER_SLOT, len(get_active_validator_indices) // SLOTS_PER_EPOCH // TARGET_COMMITTEE_SIZE))
It will be a value between 1 and 64 inclusive.
Substituting this formula for committees_per_slot
into end and start we conclude that end-start is a function of:
len(get_active_validator_indices)
,slot % SLOTS_PER_EPOCH
where 0 <=slot % SLOTS_PER_EPOCH
< 32, andindex
where 0 <=index
<committees_per_slot
(i.e. the committeeindex
parameter ofget_beacon_committee
).
Putting all of this together, when get_beacon_committee(state, data.slot, data.index)
is called from get_attesting_indices
for a state of a particular number of active validators, it can be done so for multiple (data.slot
,data.index
) combinations. Also:
as
committees_per_slot
depends on the number of active validators, the range ofindex
also depends on the number of active validators; andonce the number of active validators is sufficiently large, there will be a maximum of 32 × 64 = 2048 of these (
slot % SLOTS_PER_EPOCH
,index
) combinations to consider for a state with a particular number of active validators.
To determine which combinations of (len(get_active_validator_indices)
, slot % SLOTS_PER_EPOCH
, index
) cause the committee
length to exceed MAX_VALIDATORS_PER_COMMITTEE
we need to analyse the above formulas. The use of floor division makes this analysis more than simple algebra but we can use visualisations to help understand what is happening.
The relationship between number of active validators and committees_per_slot is shown in Figure 7. Notice that once there are 218 = 262,144 or more active validators, committees_per_slot
will reach its upper bound of 64.
The value of committees_per_slot
then determines the number of (slot % SLOTS_PER_EPOCH
,index
) combinations to consider for a state with a particular number of active validators.
In Figure 8 we plot the number of (slot % SLOTS_PER_EPOCH
,index
) combinations, as well as how many of these input combinations will produce |committee| = end - start > MAX_VALIDATORS_PER_COMMITTEE
. We observe that the number of (slot % SLOTS_PER_EPOCH
,index
) combinations reaches its maximum of 32 × 64 = 2048 at 218=262,144 active validators.
We can also observe that there are zero (slot % SLOTS_PER_EPOCH
,index
) combinations where |committee| > MAX_VALIDATORS_PER_COMMITTEE
until the number of active validators exceeds 222 = 4,194,304; as shown in the section of graph reproduced in Figure 9.
Hence, once the number of active validators exceeds 222 = 4,194,304 there are (slot % SLOTS_PER_EPOCH
,index
) combinations that cause |committee| > MAX_VALIDATORS_PER_COMMITTEE
. Further, once the number of active validators reaches 222 + 211 = 4,196,352 we can observe that all (slot % SLOTS_PER_EPOCH
,index
) combinations will result in the error.
Example: 4,194,355 active validators
When the number of active validators is 4,194,355, multiple (slot % SLOTS_PER_EPOCH
,index
) combinations result in the error. Figure 10 shows the particular combinations that result in the error (i.e. the red blocks).
Conclusion
This previously undocumented bug was difficult to detect. It required many hours of effort to model the dynamics of the problem; the analysis was quite complex due to the multiple interrelated parameter calculations, and the use of floored integer division. The ability of Dafny to initially detect the problem is significant and provides clear justification for the use of formal verification techniques in the blockchain context. The full analysis of this bug has been reported as an issue to the reference implementation github repository. The issue was confirmed by the reference implementation writers.
Further Reading
If you are interested in reading more about formal verification in the Ethereum 2.0, check out the paper by my colleague Franck Cassez discussing the work undertaken to formally verify the Eth2 Deposit Smart Contract.
For regular updates and news on the Merge, visit the Consensys Merge Knowledge Base, where you’ll also find Ethereum 2.0 archives with key milestones achieved and other essential resources for developers.