Refactoring Monitoring Systems and Covering Business Logic with Custom Metrics
Project Description
The client is a company providing access to media content with a core audience in Europe and the USA. The company operated a substantial and complex infrastructure consisting of bare-metal servers and virtual nodes in multiple data centers across Europe and Canada. The main problem the client regularly encountered was the inability to determine with high confidence whether the entire infrastructure was functioning correctly. It was also extremely difficult to pinpoint the cause of a disruption once it was detected, which led to many hours of manual analysis.
Main requests from the client:
  • Revamp the entire monitoring scheme by replacing outdated software with Datadog.
  • Ensure an availability level of no less than the current 97.6%.
  • Reduce the time required to resolve incidents.
Key Metrics
  • 26% improvement in the project owner’s overall sleep quality score. Yes, we requested statistics from his fitness bracelet.
  • Service availability raised from 97.6% to 99.8%.
  • 42% reduction in the time to resolve an incident.
Key Challenges and Results
One of the key problems we identified immediately during the audit was an ineffective monitoring system with very limited capabilities for retrospective analysis.
We conducted several interviews with company representatives to gather as much data as possible on which business indicators were most important to the client. We also conducted a full audit of the entire infrastructure and identified the most vulnerable sections of the architecture.
We ranked all potential system failure scenarios by criticality and described the corresponding metrics in code. Detailed documentation was also prepared for the client’s employees. In addition, we integrated Datadog with the client’s existing emergency notification system. As a result, the transition from the old monitoring system to the new one was seamless.
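As a minimal sketch of what ranking failure scenarios by criticality and describing them in code can look like, the Python below models a few scenarios and turns them into monitor definitions. The scenario names, queries, thresholds, tags, and helper functions are all hypothetical; the dictionary keys only follow the general shape of a Datadog monitor definition, and the client's actual definitions are not shown in this case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    name: str          # human-readable scenario name (hypothetical)
    query: str         # monitor query (hypothetical metric names and thresholds)
    criticality: int   # 1 = most critical, 5 = least critical


# Hypothetical scenarios; the real list came from the audit and interviews.
SCENARIOS = [
    FailureScenario("Edge node disk almost full",
                    "avg(last_10m):avg:system.disk.in_use{role:edge} > 0.9", 3),
    FailureScenario("Playback start failures spike",
                    "sum(last_5m):sum:playback.start.errors{*}.as_count() > 50", 1),
    FailureScenario("Transcoder queue backlog",
                    "avg(last_15m):avg:transcoder.queue.depth{*} > 200", 2),
]


def build_monitor(scenario: FailureScenario) -> dict:
    """Turn one scenario into a monitor definition (plain dict, no API call)."""
    return {
        "name": scenario.name,
        "type": "metric alert",
        "query": scenario.query,
        "priority": scenario.criticality,
        "tags": [f"criticality:p{scenario.criticality}", "managed-by:code"],
    }


def ranked_monitors(scenarios: list[FailureScenario]) -> list[dict]:
    """Return monitor definitions ordered from most to least critical."""
    ordered = sorted(scenarios, key=lambda s: s.criticality)
    return [build_monitor(s) for s in ordered]


if __name__ == "__main__":
    for monitor in ranked_monitors(SCENARIOS):
        print(monitor["priority"], monitor["name"])
```

Keeping the definitions as plain data makes them easy to review and version alongside the documentation, whichever tool eventually applies them.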
All key metrics were surfaced in separate, user-friendly dashboards, which were also described in Terraform and deployed automatically. Thanks to this, the client’s engineers could access information about the system status at any time, grouped by functional area. This significantly reduced the time needed to establish the cause of a failure.
In the next implementation phase, we developed response plans and disaster recovery plans (DRPs) for all key systems in the client's infrastructure. Each alert was automatically linked to the corresponding section of the documentation and contained comprehensive information on the actions on-duty personnel should take in an emergency.
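One way to link each alert to its section of the documentation is to generate the alert text from the same data that names the runbook. The sketch below assumes a hypothetical documentation URL layout and a hypothetical notification handle; neither comes from the client's setup.

```python
from urllib.parse import quote

# Hypothetical base URL for the response plans and DRP documentation.
RUNBOOK_BASE_URL = "https://docs.example.com/runbooks"


def runbook_url(system: str, section: str) -> str:
    """Build a link to one section of a system's runbook."""
    return f"{RUNBOOK_BASE_URL}/{quote(system)}#{quote(section)}"


def alert_message(summary: str, system: str, section: str, notify: str) -> str:
    """Compose alert text that always carries the runbook link."""
    return "\n".join([
        summary,
        "",
        f"Runbook: {runbook_url(system, section)}",
        f"Notify: {notify}",
    ])


if __name__ == "__main__":
    print(alert_message(
        summary="Origin servers unreachable from EU probes",
        system="video-cdn",              # hypothetical system name
        section="origin-unreachable",    # hypothetical runbook anchor
        notify="@oncall-video",          # hypothetical notification handle
    ))
```

Because the link is derived rather than typed by hand, an alert cannot be created without pointing at a documented response.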
Custom metrics that simulate regional video playback latency for users on different continents.
In addition to typical, purely infrastructure indicators, we developed custom metrics that covered the client's business logic. For example, we monitored average profit indicators, and probe nodes simulated user behavior, assessing critical streaming quality and content availability metrics.
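A probe of this kind can be sketched as a small script that times a simulated user action and reports the result as a custom metric. The example below is an illustration under stated assumptions, not the client's implementation: the metric name, tags, and the `fake_playback_start` stand-in are hypothetical, and it assumes a local agent accepting DogStatsD datagrams over UDP on port 8125.

```python
import socket
import time
from typing import Callable, Iterable


def measure_seconds(action: Callable[[], object]) -> float:
    """Run the action once and return how long it took, in seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def dogstatsd_line(metric: str, value: float, metric_type: str,
                   tags: Iterable[str]) -> str:
    """Format one datagram as name:value|type with optional |#tag,tag."""
    line = f"{metric}:{value:.3f}|{metric_type}"
    tag_list = ",".join(tags)
    if tag_list:
        line += f"|#{tag_list}"
    return line


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send the datagram over UDP (fire and forget; assumes a local agent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


def fake_playback_start() -> None:
    """Stand-in for requesting a stream and waiting for the first segment."""
    time.sleep(0.05)


if __name__ == "__main__":
    latency_ms = measure_seconds(fake_playback_start) * 1000
    line = dogstatsd_line(
        "probe.playback.start_latency_ms",          # hypothetical metric name
        latency_ms,
        "g",                                        # gauge
        ["probe_region:eu-west", "content:sample"],  # hypothetical tags
    )
    send_metric(line)
    print(line)
```

Tagging each measurement with the probe's region is what allows latency to be compared across continents on a single dashboard.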
Everything was fully described in code and handed over to the client along with the documentation. Thanks to a significant improvement in monitoring quality, we managed to increase service availability to 99.8% and identify several hidden issues that were unknown before implementation.
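To put the availability figures in perspective, the short calculation below converts both percentages into hours of downtime. It assumes availability is measured uniformly over a 365-day year, which the case study does not state, so the results are only indicative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service is unavailable at the given availability."""
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    before = downtime_hours_per_year(97.6)  # baseline from the client's request
    after = downtime_hours_per_year(99.8)   # level reached after the project
    print(f"{before:.1f} h -> {after:.1f} h of downtime per year")
```

Under that assumption, the change corresponds to going from roughly 210 hours to about 17.5 hours of downtime per year.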