Full Outage Report
Incident Report for TripWorks
Postmortem

Incident: FontAwesome Library Missing from Scaled Instances
Date of Incident: September 24th, 2024
Duration of Outage: 17 minutes (9:25 AM ET – 9:42 AM ET)
Impact: Users were unable to log in to the dashboard; OTA bookings were not affected.

Summary: On September 24th, 2024, TripWorks experienced a brief outage from 9:25 AM ET to 9:42 AM ET that prevented users from logging into the dashboard. The outage was caused by the failure to include the FontAwesome library in newly scaled instances, which created a dependency error for the login page. Despite the dashboard login issue, there was no impact on OTA (Online Travel Agency) bookings, which continued to be accepted by the platform without interruption.

Impact:

  • Affected Service(s): TripWorks dashboard login functionality
  • Users Impacted: All dashboard users during the outage window
  • Functional Impact: Users were unable to access the dashboard, but OTA bookings were unaffected and processed normally.

Root Cause: The FontAwesome library was not included in the build used for newly scaled instances. When a period of higher load triggered auto-scaling, instances without the library were added to the production environment, causing the login page, which depends on FontAwesome, to malfunction.

Timeline of Events:

  • 9:25 AM ET: Increased load caused the system to auto-scale, adding new instances to the production environment.
  • 9:26 AM ET: Users began reporting issues with logging into the dashboard.
  • 9:30 AM ET: Investigation identified the login page was failing due to missing FontAwesome assets.
  • 9:35 AM ET: Root cause determined: newly scaled instances did not include the FontAwesome library.
  • 9:38 AM ET: Emergency patch applied to reintroduce the library on affected instances.
  • 9:42 AM ET: Full resolution confirmed, and login functionality restored.

Resolution: The immediate resolution involved manually updating the scaled instances to include the FontAwesome library, which restored the login functionality. The team then verified that all new instances were correctly configured with the necessary assets.

Lessons Learned:

  • Monitoring Gaps: This incident highlighted a gap in our internal monitoring tools, which failed to detect that new instances were missing critical libraries. Early detection could have shortened the resolution time.
  • Configuration Management: A weakness in configuration management allowed a third-party library (FontAwesome) to be omitted from the build process for dynamically created instances.

Action Items:

  • Short-Term Fix: Ensure that the FontAwesome library is explicitly included in the build process for all new instances.
  • Long-Term Fix:
    • Review and update the build configuration process to prevent missing dependencies on any scaled instances.
    • Improve internal monitoring to detect and alert on build failures related to third-party library inclusion.
  • Monitoring Improvements: Add alerts to flag incomplete builds or missing assets during auto-scaling to detect and resolve such issues more quickly.
  • Team Training: Provide a review session on configuration management to reinforce the importance of including all dependencies in dynamic scaling scenarios.
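
As an illustration of the missing-asset check described in the action items, the following is a minimal sketch of a readiness probe that an instance could run before it starts receiving traffic. It is not the TripWorks implementation: the asset paths, the function names, and the idea of exposing the result through a readiness endpoint are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical asset list: the real paths depend on how FontAwesome is
# bundled into the build, which this report does not specify.
REQUIRED_ASSETS = [
    "fontawesome/css/all.min.css",
    "fontawesome/webfonts/fa-solid-900.woff2",
]


def missing_assets(static_root, required=REQUIRED_ASSETS):
    """Return the required asset paths that are absent or empty under static_root."""
    root = Path(static_root)
    missing = []
    for rel in required:
        path = root / rel
        # Treat a zero-byte file as missing, since a truncated copy is as
        # unusable to the login page as no file at all.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


def readiness_status(static_root, required=REQUIRED_ASSETS):
    """Return an (HTTP status code, message) pair for a readiness endpoint."""
    missing = missing_assets(static_root, required)
    if missing:
        # 503 signals "not ready", so a load balancer that honors readiness
        # checks would keep the instance out of rotation.
        return 503, "missing assets: " + ", ".join(missing)
    return 200, "ok"
```

If the auto-scaling group only routed traffic to instances whose readiness check returned 200, an instance built without the library would stay out of rotation, and the 503 message could feed the alerting described above.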

Conclusion: The outage on September 24th, though short, had a significant impact on dashboard access for users. However, OTA bookings were not affected, so core platform functionality remained intact. This incident has brought to light areas for improvement in our scaling processes and monitoring systems, and we are committed to implementing measures to prevent similar issues in the future.

Posted Sep 24, 2024 - 09:57 EDT

Resolved
The issue that was preventing users from logging in has been fully resolved. All systems are now operational, and users should be able to log in without any further issues.

Thank you for your patience during this time.
Posted Sep 24, 2024 - 09:48 EDT
Identified
We have identified the root cause of the outage that is preventing users from logging in. Our team is now working on implementing a solution to restore full access.

We will continue to provide updates as we make progress. Thank you for your patience as we work to resolve this issue.
Posted Sep 24, 2024 - 09:34 EDT
Investigating
We are currently experiencing a system-wide outage that is preventing users from logging in. Our team is actively investigating the issue and working to restore full functionality as soon as possible.
Posted Sep 24, 2024 - 09:30 EDT
This incident affected: DashBoard.