Full Outage

Incident Report for TripWorks

Postmortem

Incident: Configuration Change Without Pre-Production Validation
Date of Incident: April 2nd, 2025
Duration of Outage: 35 minutes (10:55 AM ET – 11:30 AM ET)

Impact: Interruption of platform services, including dashboard access and booking flows

Summary:
On April 2nd, 2025, TripWorks experienced a service interruption from approximately 10:55 AM ET to 11:30 AM ET. The outage affected core platform services, including access to the dashboard and booking flows. The root cause was a configuration change that was introduced to the production environment without going through the standard pre-deployment validation process.

Our engineering team quickly identified the issue and rolled back the change, restoring full platform functionality. The system was closely monitored afterward, and no further service degradation was observed.

Impact:

  • Affected Service(s): Dashboard access, booking flow functionality
  • Users Impacted: All platform users during the outage window
  • Functional Impact: Inability to access dashboard and complete bookings during the incident

Root Cause:
The outage was caused by the deployment of a configuration change that bypassed our standard staging and validation process. The change introduced unexpected behavior in core services, leading to a platform-wide disruption.

Resolution:
The immediate resolution involved rolling back the configuration change to the last known stable version. Following the rollback, all affected services resumed normal operation. Post-resolution monitoring confirmed system stability.

Lessons Learned:

  • Process Adherence: This incident revealed a gap in our deployment process that allowed a change to reach production without adequate testing.
  • Risk of Manual Changes: Bypassing standard workflows introduces avoidable risk, even for seemingly minor changes.
  • Response Efficiency: Rapid identification and rollback minimized the impact window, demonstrating the effectiveness of our response process.

Action Items:
Short-Term Fixes:

  • Enforce mandatory staging validation for all configuration changes, regardless of perceived risk.

Long-Term Improvements:

  • Pre-Deployment Validation: All configuration changes must now be tested and validated in staging before being deployed to production.
  • Deployment Safeguards: Additional checks have been implemented in our CI/CD pipelines to flag any changes lacking approval or validation.
  • Monitoring Enhancements: Expand monitoring to more proactively detect misconfigurations that impact critical services.
  • Team Training: The engineering team has reviewed this incident as part of a broader training initiative to reinforce deployment best practices.
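
As an illustration of the kind of deployment safeguard described above, the following is a minimal sketch of a pre-deployment gate that flags configuration changes lacking approval or staging validation. It is not TripWorks' actual pipeline code: the `ConfigChange` record, its fields, and the function names are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ConfigChange:
    """Hypothetical record describing one proposed configuration change."""
    change_id: str
    approved_by: Optional[str]   # reviewer who signed off, or None
    validated_in_staging: bool   # True once the change passed staging checks


def deployment_blockers(change: ConfigChange) -> List[str]:
    """Return the reasons a change must not reach production (empty if none)."""
    blockers = []
    if not change.approved_by:
        blockers.append("missing approval")
    if not change.validated_in_staging:
        blockers.append("not validated in staging")
    return blockers


def can_deploy(change: ConfigChange) -> bool:
    """A change may be deployed only when no blockers were found."""
    return not deployment_blockers(change)
```

In a pipeline, a step like this would run before the production deploy and fail the job whenever `deployment_blockers` returns a non-empty list, so that a change cannot skip staging even when it looks minor.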

Conclusion:
While the outage on April 2nd was relatively brief, it significantly impacted both operator workflows and guest booking capabilities. We take this incident seriously and have implemented changes to strengthen our validation and deployment practices. Our goal is to prevent similar issues in the future and ensure continued reliability for all TripWorks users.

Posted Apr 03, 2025 - 14:10 EDT

Resolved

The incident has been resolved.
Posted Apr 02, 2025 - 22:57 EDT

Monitoring

We experienced a full system outage between 10:55 AM ET and 11:30 AM ET, which prevented users from logging in to the platform. At this time, all systems are operational and users should be able to log in as expected. We are closely monitoring performance to ensure continued stability.

A full post-mortem will be published soon with details about the root cause, impact, and steps we’re taking to prevent similar issues in the future.

Thank you for your continued patience and understanding.
Posted Apr 02, 2025 - 12:30 EDT
This incident affected: DashBoard.