Community Update: Technical Analysis and Incident Review

Community
April 12, 2024

Dear Community,

We want to share insights into a significant incident that recently occurred on our network and how we responded to it, in order to maintain transparency and keep you informed about our continuous efforts to strengthen the system.

What happened

In the early hours of February 28, 2024, during a routine runtime upgrade on our Mainnet intended to enable the Burn and update the Substrate version, a critical migration was inadvertently missed. This migration, required for the contracts pallet to function correctly, was not included in the upgrade, leading to an unforeseen halt in block production and finalization. The oversight stemmed from a discrepancy between our test environments (Devnet, Qanet, and Testnet), where the migration had been successfully applied, and our Mainnet, where it was absent.

The heart of the issue lay in the upgrade process of our contracts pallet. Specifically, migrating the pallet's storage to version 9 was a prerequisite for the subsequent migrations; it had been applied on our test networks but not on Mainnet. As a result, the network could not proceed with block production: the contracts pallet repeatedly rejected the applied runtime because the migration required for the transition to v9 was missing, while the node's dispatcher, following its fallback strategy, kept attempting to apply the deployed runtime. The network's nodes therefore entered a re-entrant cycle, unable to resolve the inconsistency between migrations.
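
For readers familiar with Substrate, the sketch below shows where such a migration would normally be wired in. It is illustrative only: `Runtime`, `Block`, and `AllPalletsWithSystem` stand in for the runtime's own definitions, and the exact path of the contracts migration type depends on the pallet version in use.

```rust
// Illustrative sketch of how a Substrate runtime registers storage migrations
// with its Executive. Omitting the contracts migration from this tuple on one
// network, while it is present on another, is the kind of discrepancy described
// above: the pallet's storage version then lags behind what the new runtime expects.
pub type Migrations = (
    // Stepwise storage migrations shipped with the contracts pallet
    // (the exact type/path depends on the pallet version in use).
    pallet_contracts::Migration<Runtime>,
);

pub type Executive = frame_executive::Executive<
    Runtime,
    Block,
    frame_system::ChainContext<Runtime>,
    Runtime,
    AllPalletsWithSystem,
    Migrations, // custom OnRuntimeUpgrade hooks executed during a runtime upgrade
>;
```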

How did Cere Network respond?

Upon detecting the halt in block production and finalization, the Cere Network team mobilized to diagnose the issue. The immediate response showcased the strength of our community, particularly the engagement and collaboration of our validators. Our response was multifaceted, focusing on immediate resolution, communication, and long-term system resilience:

  • Diagnosis and Action: Our team identified the missed migration as the root cause, underscoring the need for a robust upgrade and migration protocol.
  • Engagement and Transparency: We maintained open lines of communication with our community and validators, ensuring they were informed and involved in the resolution process.
  • Validator Collaboration: The active involvement of our validators was crucial. Their readiness to engage and collaborate helped us navigate the incident more effectively.

How was the issue fixed?

The resolution involved a series of carefully orchestrated steps, focused on restoring network functionality while ensuring system integrity and compensating our community:

  1. Network Restoration: Mainnet was restored from a backup, ensuring the continuity of the network.
  2. Cancellation of the runtime upgrade: The extrinsic that had been scheduled to perform the runtime upgrade was cancelled through the Scheduler pallet (see the sketch following this list).
  3. Technical Fixes: The missing migration was properly integrated and deployed, and rigorous checks were instituted to prevent similar issues in the future.
  4. Compensation Mechanism: To compensate for the rewards missed during the downtime, Cere Network will temporarily double the Annual Percentage Rate (APR) for both validators and nominators. This adjustment will be in place for 20 days, allowing participants who remain staked on the Cere Network throughout this period to earn twice the normal rewards. The double-rewards period began on April 2, 2024 and will run for 20 days, until April 22, 2024 (an illustrative calculation follows this list). Please read more about the compensation mechanism here.
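
To illustrate step 2: a call scheduled through Substrate's Scheduler pallet is identified by the block it was scheduled for and its index within that block's agenda, and it can be cancelled before it executes. The snippet below is a sketch only; the block number and index are placeholders, and in practice the cancellation is dispatched via the appropriate privileged origin rather than assembled by hand.

```rust
// Illustrative only: constructing a Scheduler `cancel` call inside a Substrate
// runtime. `RuntimeCall` is the runtime's aggregated call enum; `when` and
// `index` below are placeholders, not the values used on Mainnet.
let cancel_scheduled_upgrade = RuntimeCall::Scheduler(pallet_scheduler::Call::cancel {
    when: 1_234_567, // block number the upgrade was scheduled for (placeholder)
    index: 0,        // position of the task in that block's agenda (placeholder)
});
// Dispatched with a root/governance origin, this removes the pending
// runtime-upgrade task from the Scheduler's agenda before it can run.
```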

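As a purely hypothetical illustration of what a doubled APR means over the 20-day window (the 15% base APR and the 10,000 CERE stake below are made-up example values, not official figures):

```rust
// Hypothetical illustration only: the stake and base APR are example values.
fn rewards_over(stake: f64, apr: f64, days: f64) -> f64 {
    // Simple pro-rated reward over the window, ignoring compounding.
    stake * apr * days / 365.0
}

fn main() {
    let stake = 10_000.0;   // CERE staked (example value)
    let base_apr = 0.15;    // normal APR of 15% (example value)
    let window_days = 20.0; // duration of the double-rewards period

    let normal = rewards_over(stake, base_apr, window_days);
    let doubled = rewards_over(stake, base_apr * 2.0, window_days);

    // Roughly 82 CERE at the normal rate vs. roughly 164 CERE during the window.
    println!("normal: {normal:.0} CERE, doubled: {doubled:.0} CERE");
}
```
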
What did we learn?

  1. Backup and Restore Strategy: A solid and field-tested backup and restore strategy is essential for minimizing downtime and data loss.
  2. Network Monitoring and Alerting: Enhanced monitoring and alert systems are crucial for early detection of issues.
  3. QA Procedures: Rigorous quality assurance processes are vital to prevent issues from affecting Mainnet.
  4. Community and Expert Engagement: The incident highlighted the value of engaged validators and the need for a shortlist of industry experts who can be consulted quickly.

What will we do?

In response to these learnings, we are implementing several key improvement initiatives:

  • Enhanced Observability: Strengthening our alerting systems so anomalies are detected and flagged earlier.
  • Improved Backup and Restore Automation: We've upgraded our automated backup system to ensure virtually zero block data loss should a restore become necessary. This includes a more robust restoration process alongside higher-fidelity backup snapshots, significantly improving our data protection and recovery capabilities.
  • Improved Runbooks: We've expanded our runbooks with more automated tests and cross-checks between incremental upgrade steps. This ensures new scenarios are thoroughly tested and verified with every runtime update, raising the level of scrutiny applied to each change deployed to Mainnet.
  • Community Engagement: Continuing to foster open communication with our community, including transparent post-mortem reviews.
  • Infrastructure and Testing Improvements: Minimizing the delta between Testnet and Mainnet and introducing additional runtime update testing scenarios.

Closing

The incident, while challenging, has been a catalyst for significant improvements across our network's operational, technical, and community engagement practices. We are grateful for the patience, support, and collaboration of our community and validators. Together, we have passed a very important pressure test and taken a big step toward being a fully resilient, secure, and community-driven network.

As we move forward, our commitment to learning, improvement, and transparency remains unwavering. Thank you for being part of our journey.
