We considered several monitoring system options. First, we evaluated the integration possibilities with Datadog. It is one of the most user-friendly solutions, offering native Azure integration and providing ready-made dashboards and mechanisms for incident alerts. However, the main drawback for the client was the somewhat opaque billing policy, especially concerning cloud infrastructure and serverless elements.
Therefore, we decided to explore a stack based on Grafana, which would provide aggregation, convenient analysis, and alerts for all key metrics and logs from the client’s infrastructure. We had to work within a tight timeframe and budget, since deploying a large, full-scale system at the startup stage would have been excessive.
We decided to use the free plan in Grafana Cloud, which we integrated with the client’s Azure infrastructure. The initial implementation plan relied on Grafana Alloy, an agent that collects all available data from the Azure Resource Graph API and sends it to Grafana Cloud for further processing. We rejected this plan: with a push architecture, all logs and metrics would be stored on the Grafana Cloud side, which would have exceeded the client’s planned budget.
We implemented a different approach: integration through a pull scheme, in which the data is physically stored in Azure and Grafana Cloud requests it only as needed. This significantly reduces traffic volumes and allows the scheme to function within the free tier. For example, when a dashboard is generated, the data is collected from Azure at the time of the request. Some metrics are still collected on a regular schedule to ensure timely responses to application and infrastructure issues.
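The difference in traffic between the two schemes can be illustrated with a small simulation. This is only a sketch under stated assumptions: `AzureStore`, the metric names, and the sample counts are hypothetical stand-ins, not the client's actual setup or any Grafana or Azure API.

```python
from dataclasses import dataclass, field


@dataclass
class AzureStore:
    """Hypothetical in-memory stand-in for data that stays on the Azure side."""
    samples: dict = field(default_factory=dict)  # metric name -> [(minute, value)]

    def write(self, metric, minute, value):
        self.samples.setdefault(metric, []).append((minute, value))

    def query(self, metric, start, end):
        # Return only the points inside the requested half-open range [start, end).
        return [(t, v) for t, v in self.samples.get(metric, []) if start <= t < end]


def push_transfer(store):
    """Push scheme: an agent ships every stored sample to the monitoring backend."""
    return sum(len(points) for points in store.samples.values())


def pull_transfer(store, panels):
    """Pull scheme: only the ranges requested by dashboard panels leave the store."""
    return sum(len(store.query(metric, start, end)) for metric, start, end in panels)


store = AzureStore()
# Assumed workload: four metrics, one sample per minute for one day.
for metric in ("cpu_percent", "http_5xx", "queue_depth", "disk_io"):
    for minute in range(1440):
        store.write(metric, minute, 0.0)

# Assumed dashboard: two panels, each showing the last hour.
panels = [("cpu_percent", 1380, 1440), ("http_5xx", 1380, 1440)]

pushed = push_transfer(store)
pulled = pull_transfer(store, panels)
print(f"push: {pushed} samples, pull: {pulled} samples")
```

With these assumed numbers, the push scheme transfers every sample of every metric, while the pull scheme transfers only what the two panels actually display. The real ratio depends on how many dashboards are opened and how often scheduled queries run.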
As a result of our work, the client received:
- Deployment of a monitoring system within a short timeframe, allowing them to focus on resolving application issues that impact clients.
- Integration of their infrastructure with Grafana Cloud.
- Transparency of the entire infrastructure in terms of metrics and logs.
- Adherence to the limited budget requirements for implementing the monitoring system.
- Minimization of monitoring system costs. The infrastructure cost for the client is close to zero.
- User-friendly dashboards that provide a complete view of the current state of the infrastructure and allow retrospective analysis of incident situations.
- Alerts and notifications to responsible parties when critical deviations occur in HTTP responses from application components and health check endpoints.
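The alerting condition from the last item can be sketched as a simple rule. This is a minimal illustration, assuming a hypothetical 5% threshold for server errors and a boolean health check result; the actual thresholds and rule definitions used in Grafana Cloud are not given here.

```python
def error_ratio(status_codes):
    """Share of responses with a 5xx status code; 0.0 for an empty window."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


def should_alert(status_codes, health_ok, max_error_ratio=0.05):
    """Fire when the health check fails or the 5xx share exceeds the threshold.

    max_error_ratio is an assumed value chosen for illustration only.
    """
    return (not health_ok) or error_ratio(status_codes) > max_error_ratio


# Assumed evaluation windows of 100 responses each.
healthy_window = [200] * 98 + [500] * 2
degraded_window = [200] * 90 + [503] * 10

print(should_alert(healthy_window, health_ok=True))
print(should_alert(degraded_window, health_ok=True))
print(should_alert(healthy_window, health_ok=False))
```

A failed health check triggers the alert regardless of the response codes, so an unreachable component is reported even when it serves no traffic.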
The implemented solution not only provided the necessary transparency at near-zero operational cost during the startup's development stage, but can also be scaled quickly as the company grows, moving to paid plans when needed. The system reduced the time spent investigating incidents threefold, freeing up significant team resources to work on improving user experience and developing new platform features.
Thanks to prompt incident response and monitoring transparency, the client was able to prevent several situations that could have led to significant losses and customer attrition. The system’s implementation helped reduce risks and ensure the platform's continuous operation, which is crucial during the startup’s growth phase.
The results of the implementation became evident within the first few weeks: the client avoided serious losses, and the company’s customers began receiving a stable and reliable service.