Description: Router hardware issues and emergency repairs on October 13, 2023, at approximately the following times:
- 11:03PST until 11:04PST
- 15:12PST until 15:13PST
- 16:12PST until 16:24PST
- 16:28PST until 16:32PST
Root cause: At 11:03PST, the master routing engine in edge1.LBLKWA rebooted unexpectedly and reported an unusual error code. The backup routing engine then assumed control, and by 11:04PST network traffic had resumed once BGP sessions were re-established. Our NOC immediately engaged a Juniper JTAC representative to diagnose the issue. JTAC advised two actions: a firmware upgrade on both routing engines, and replacement of the faulty routing engine and control board. Notably, the error message reported by the router was abnormal and had not previously been seen by JTAC.
Resolution: Our NOC carried out the recommended measures: the firmware upgrades and the replacement of the malfunctioning routing engine and an additional control board in our core router. During this work, a series of brief outages occurred as mastership transitioned between the master and backup routing engines, momentarily dropping BGP sessions, which then re-established. Once the repairs were completed, network traffic stability and core redundancy within our edge infrastructure were restored.
We understand that this incident inconvenienced our customers, and we sincerely apologize for the disruption. We will immediately restock our spare and replacement critical hardware and continue to actively monitor for any signs of trouble.