Implementation of a Monitoring System for a Startup

Project Description
The client is a startup whose platform facilitates communication between app developers, companies, and users. The platform helps boost sales and improves user convenience by streamlining the app installation process.

The client’s infrastructure is fully deployed in Azure and relies entirely on Azure tools. However, the developers had almost no visibility into the application during incidents: when a component failed, they learned about it only from the service’s clients.

Main Requests from the Client:

  • Conduct an audit of the infrastructure and current data sources.
  • Find a cost-effective solution that ensures transparent monitoring now and can scale in the future.
  • Configure informative dashboards.
  • Set up alerts for key metrics.
  • Organize the collection and easy analysis of application logs.

Key Metrics:

  • Dozens of major clients daily.
  • Limited budget.
  • 98.6% application availability after implementing the monitoring system and resolving identified issues.
  • 27% increase in positive client feedback upon closing deals.
  • 3x reduction in work hours spent investigating incident causes.
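
To put the availability figure in perspective, the short sketch below converts an availability percentage into hours of downtime. The 30-day measurement period and the function name are assumptions chosen for illustration; they are not figures from the project.

```python
def downtime_hours(availability_percent: float, period_hours: float) -> float:
    """Hours of unavailability implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * period_hours


# Assumption: a 30-day month (720 hours) as the measurement period.
monthly_downtime = downtime_hours(98.6, 30 * 24)
print(f"{monthly_downtime:.2f} hours")
```

Under that assumption, 98.6% availability corresponds to roughly 10 hours of downtime per 30-day month.
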
Key Challenges and Outcomes
We considered several monitoring options. First, we evaluated Datadog. It is one of the most user-friendly solutions, with native Azure integration, ready-made dashboards, and built-in incident alerting. However, its main drawback for the client was a somewhat opaque billing policy, especially for cloud infrastructure and serverless components.

We therefore explored a stack based on Grafana, which would provide aggregation, convenient analysis, and alerts for all key metrics and logs from the client’s infrastructure. We had to work within tight time and budget constraints, and deploying a large, full-scale system would have been excessive at the startup stage.

We decided to use the free plan in Grafana Cloud and integrate it with the client’s Azure infrastructure. Our initial plan relied on Grafana Alloy: an agent that collects all available data from the Azure Resource Graph API and sends it to Grafana Cloud for further processing. We rejected this plan because, with a push architecture, all logs and metrics would be stored on the Grafana Cloud side, which would have exceeded the client’s planned budget.

We implemented a different approach: a pull-based integration in which the data is physically stored in Azure and Grafana Cloud requests it only as needed. This significantly reduces traffic volumes and lets the setup operate within the free tier. For example, when a dashboard is rendered, its data is fetched from Azure at the time of the request. Some metrics are still collected on a regular schedule to ensure timely responses to application and infrastructure issues.
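
The pull idea can be illustrated with a minimal sketch. It is not the project's actual integration and does not use any Grafana or Azure API: `PullDataSource`, `render_dashboard`, `fake_fetch`, and the sample metric names are all hypothetical stand-ins. The point it shows is that the monitoring side holds no samples of its own and touches the backing store only when a dashboard asks for data.

```python
from typing import Callable, Dict, List, Tuple

# (timestamp, value) samples for one metric
Series = List[Tuple[float, float]]


class PullDataSource:
    """Holds no samples itself; asks the backing store when queried."""

    def __init__(self, fetch: Callable[[str], Series]) -> None:
        self._fetch = fetch      # hypothetical call into the store that holds the data
        self.backend_calls = 0   # counts how often the store was actually queried

    def query(self, metric: str) -> Series:
        self.backend_calls += 1
        return self._fetch(metric)


def render_dashboard(panels: List[str], source: PullDataSource) -> Dict[str, Series]:
    """Build dashboard data by querying each panel's metric at request time."""
    return {metric: source.query(metric) for metric in panels}


# Stand-in for data that stays in the client's cloud account (made-up values).
_STORE: Dict[str, Series] = {
    "http_5xx": [(0.0, 1.0), (60.0, 0.0)],
    "cpu_percent": [(0.0, 41.5), (60.0, 39.0)],
}


def fake_fetch(metric: str) -> Series:
    return _STORE.get(metric, [])


source = PullDataSource(fake_fetch)
idle_calls = source.backend_calls  # nothing is transferred before a dashboard is opened
dashboard = render_dashboard(["http_5xx", "cpu_percent"], source)
```

In this sketch the store is queried once per panel and only when the dashboard is rendered, which is the property that keeps transferred and stored volumes low in a pull scheme.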

As a result of our work, the client received:

  • Deployment of a monitoring system within a short timeframe, allowing them to focus on resolving application issues that impact clients.
  • Integration of their infrastructure with Grafana Cloud.
  • Transparency of the entire infrastructure in terms of metrics and logs.
  • Adherence to the limited budget requirements for implementing the monitoring system.
  • Minimization of monitoring system costs: the infrastructure cost for the client is close to zero.
  • User-friendly dashboards that provide a complete view of the current state of the infrastructure and allow retrospective analysis of incident situations.
  • Alerts and notifications to responsible parties when critical deviations occur in HTTP responses from application components and health check endpoints.
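
As an illustration of the last point, the sketch below shows one way an alert condition over HTTP responses from health check endpoints could be evaluated. The data structure, the endpoint path, the rule that 5xx responses and failed requests count as failures, and the 20% threshold are all assumptions made for this example; the project's actual alert rules are not described here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProbeResult:
    endpoint: str
    status_code: Optional[int]  # None when the request failed or timed out


def is_failure(result: ProbeResult) -> bool:
    """Treat missing responses and 5xx status codes as failures (assumed rule)."""
    return result.status_code is None or result.status_code >= 500


def should_alert(results: Sequence[ProbeResult], max_failure_ratio: float = 0.2) -> bool:
    """Alert when the share of failed probes exceeds the (hypothetical) threshold."""
    if not results:
        return False  # no data in the window: nothing to evaluate in this sketch
    failures = sum(1 for r in results if is_failure(r))
    return failures / len(results) > max_failure_ratio


# Hypothetical evaluation window for one health check endpoint.
window = [
    ProbeResult("/healthz", 200),
    ProbeResult("/healthz", 503),
    ProbeResult("/healthz", None),
    ProbeResult("/healthz", 200),
    ProbeResult("/healthz", 200),
]
alert = should_alert(window)
```

Using a ratio over a window rather than a single failed probe is a design choice that reduces noise from one-off network errors; the right threshold depends on the component.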

The implemented solution provided the necessary transparency at near-zero operating cost during the startup’s development stage, and it can be scaled quickly as the company grows by moving to paid plans when needed. The system cut the time spent investigating incidents by a factor of three, freeing up significant team resources for improving the user experience and developing new platform features.

Thanks to prompt incident response and transparent monitoring, the client was able to prevent several situations that could have led to significant losses and client attrition. The system helped reduce risks and support the platform’s continuous operation, which is crucial during the startup’s growth phase.

The results became evident within the first few weeks: the client avoided serious losses, and the company’s customers began receiving a stable and reliable service.