Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, the monitoring strategy should be constructed thoughtfully. Monitoring applications and systems is a usual practice that every organization follows. However, when an organization matures to practice DevOps, a general health check of the application or system is no longer enough. The approach of continuous monitoring in DevOps encourages us to do full-stack monitoring.
In traditional monitoring practice, monitoring parameters are set up in a reactive manner. In some cases, monitoring is configured without a clear purpose, and the effort that went into developing the system is not considered when monitoring it. However, “monitoring as a discipline” means ensuring the network, servers, applications, and so on are all stable, healthy, and running at peak efficiency. It means not just being able to tell that a system has crashed, but more importantly being able to predict the possibility of a crash and intervening to avoid it.
Things to monitor:
In the DevOps world, watching everything is good practice. Everything includes:
- Infrastructure monitoring,
- Web server monitoring,
- App server monitoring,
- Network connectivity,
- Application monitoring,
- Log monitoring,
- API monitoring,
- File process monitoring,
- Batch process monitoring,
- Transaction monitoring,
- SQL transactions,
- Code visibility monitoring,
- CI/CD pipeline monitoring,
- End-to-end application monitoring,
- Gathering stats from the system,
- Internal application monitoring,
- External application monitoring,
- Raising alerts before an adverse event occurs.
The system represents the server where the applications are running. The servers may reside on-premises or in the cloud, but our monitoring solution should give us visibility into our infrastructure, so that we get a clear picture of the infrastructure and network on which our application runs.
We usually set up monitoring parameters for the servers around CPU usage, disk usage, memory usage, connectivity, port status, and other OS-related services. If an adverse event pushes a metric past its threshold, we get an alert to act upon. This kind of alerting tells us that something needs fixing instead of giving us details about the root cause. Warnings are raised based on thresholds estimated with a reactive approach, and a reactive monitoring setup is not always the right system monitoring solution. In the modern world of DevOps, monitoring can be set up to collect system stats and metrics and to watch event logs, syslog files, performance data, application logs, and integrated systems. When we gather and watch all these system-related components, it becomes possible to understand how infrastructure metrics correlate to business transaction performance.
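As a rough sketch of threshold-based system checks using only the Python standard library (the metric names and threshold values here are illustrative assumptions, not a specific tool's configuration):

```python
import os
import shutil

def collect_system_metrics(path="/"):
    """Collect a few basic host metrics with the standard library only."""
    total, used, _free = shutil.disk_usage(path)
    return {
        "disk_used_pct": round(used / total * 100, 1),
        "load_avg_1m": os.getloadavg()[0],  # Unix-only 1-minute load average
    }

def check_thresholds(metrics, thresholds):
    """Return only the metrics that breached their configured threshold."""
    return {name: value for name, value in metrics.items()
            if name in thresholds and value >= thresholds[name]}

alerts = check_thresholds(collect_system_metrics(),
                          {"disk_used_pct": 90.0, "load_avg_1m": 8.0})
for name, value in alerts.items():
    print(f"ALERT: {name} = {value}")
```

A real setup would ship these samples to a time-series store and alert from there rather than printing, but the threshold comparison is the same idea.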
Metrics that need to be collected from the system:
There are important metrics that we need to obtain from the server to get a clue of how our servers perform. In general, the following metrics help us check the health of our servers.
- Requests per second: How many requests are received and processed by the target server.
- Error Rates: Error rates measure failures in the application and highlight the failure points.
- Response Time: With the average response time, we can gauge the speed of the target web application.
- Peak Response Time: The single longest response time observed.
- Uptime: How many hours the server has been up and running.
- CPU Usage: Amount of CPU time used by the applications running on the server.
- Memory Usage: Amount of memory used by the application.
- Thread Counts: Usually, an application creates threads to process requests, so it is essential to count the number of threads per process, as this is limited by the system.
- File I/O Operations: In general, there is a per-process limit on I/O operations for handling files.
- Disk Usage: Amount of disk consumed on the server by any running service or application.
- Network Bandwidth: Which services or applications consume the most network bandwidth.
- Log File Size: Web server logs may face a sudden increase in size due to an underlying application malfunction.
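To make a few of these metrics concrete, here is a minimal sketch that derives error rate and response-time figures from parsed access-log records (the sample records are invented for illustration):

```python
import statistics

# Hypothetical parsed access-log records: (HTTP status, response time in ms)
records = [(200, 120), (200, 95), (500, 310), (200, 150), (404, 80)]

# Error rate: share of requests that failed with a server error
error_rate_pct = 100 * sum(1 for status, _ in records if status >= 500) / len(records)
# Average and peak response time over the window
avg_response_ms = statistics.mean(ms for _, ms in records)
peak_response_ms = max(ms for _, ms in records)

print(f"error rate: {error_rate_pct:.1f}%")     # 1 of 5 requests failed -> 20.0%
print(f"avg response: {avg_response_ms:.0f} ms")
print(f"peak response: {peak_response_ms} ms")
```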
Application monitoring traditionally happens only in production. Development occurs without a monitoring plan, so the production team sets up monitoring based on the logs and stats observed from application behavior. In this setup, however, the production team lacks visibility inside the application, and the monitoring scope is limited to adverse events raised by the services. In the modern technology world, application monitoring starts from the development stage, so monitoring parameters can be set at the code level for complete visibility. Monitoring tools such as AppDynamics, Datadog, and Prometheus give us more insight into applications through agents. The agents embedded in the code collect data and metrics at each stage: web, application, and database. With this data fed into an underlying system, we can see the flow of transactions.
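As a rough illustration of what agent-based, code-level instrumentation does (not the actual mechanism of AppDynamics, Datadog, or Prometheus), a timing decorator can capture per-call metrics from inside the application:

```python
import functools
import time

def traced(fn):
    """Minimal sketch of agent-style instrumentation: time every call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # A real agent would ship this span to a collector, not stdout
            print(f"trace: {fn.__name__} took {elapsed_ms:.2f} ms")
    return wrapper

@traced
def handle_request(order_id):  # hypothetical request handler
    return {"order": order_id, "status": "ok"}

result = handle_request(42)
```

Real agents do this transparently (bytecode patching, middleware hooks) across web, application, and database layers, but the principle of wrapping calls to record timing is the same.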
Metrics that need to be collected from the application:
The following are the types of data that can be stored and analyzed by application monitoring:
- HTTP request rates, response times, and success rates.
- Dependency (HTTP & SQL) call rates, response times, and success rates.
- Exception traces from both server and client.
- Diagnostic log traces.
- Page view counts, user and session counts, browser load times, exceptions.
- Response times measured against success rates.
- Server performance counters.
- Custom client and server telemetry.
- Segmentation by client location, browser version, OS version, server instance, and custom dimensions.
- Availability test results.
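Custom telemetry with segmentation dimensions can be emitted as structured events; a minimal sketch (the event schema and field names are assumptions, not a specific vendor's format):

```python
import json
import time

def emit_telemetry(metric, value, **dimensions):
    """Emit one structured telemetry event as a JSON line.
    Printed to stdout here; a real agent would ship it to a collector."""
    event = {"metric": metric, "value": value, "ts": time.time(), **dimensions}
    print(json.dumps(event, sort_keys=True))
    return event

emit_telemetry("page_load_ms", 834,
               client_location="eu-west",
               browser_version="Firefox 128",
               server_instance="web-02")
```

Because every event carries its dimensions, the backend can later slice the same metric by location, browser, OS, or server instance.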
Application Performance Monitoring:
Service availability monitoring is excellent, but response latency is also essential when we plan the next level of monitoring. When volume increases in the system, performance degradation is always possible, so we need to visualize the performance of the system with the collected stats and metrics. If there is any deviation in production, we need to identify the bottleneck to improve the system. Latency can be introduced at any level: in the code, operations, web/front-end, database, or network. In this case, end-to-end monitoring of application performance is best practice.
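Average latency alone hides degradation under load, so percentile latency (e.g. p95) is commonly tracked alongside it; a minimal sketch using a nearest-rank percentile over invented sample data:

```python
def percentile(values, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(values)
    k = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical response times (ms) collected over one monitoring window
latencies = [88, 92, 95, 99, 101, 105, 110, 118, 130, 640]

print("p50:", percentile(latencies, 50))  # typical request: 101 ms
print("p95:", percentile(latencies, 95))  # tail latency exposes the outlier: 640 ms
```

The p50 looks healthy while the p95 reveals the slow tail, which is exactly the deviation that end-to-end monitoring needs to surface.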
Network monitoring is often treated as out of scope by the operational support team. Of course, network monitoring is part of network management, but to build a full monitoring stack it is good practice to bring it under one roof. From an issue-analysis perspective, the network is the last layer to examine when no relevant information is found in other traces, so we need to understand how our network is managed and monitored.
Network monitoring happens using both software and hardware. At the bare minimum, checks usually happen using ping, SNMP, ICMP, and logs. To get complete visibility into network management, it is good practice to:
- Execute scripts via an agent on the device to collect detailed information.
- Track IP SLA between the devices in the network infrastructure.
- Analyze bandwidth utilization and traffic using NetFlow.
- Collect packet dumps using a network tap.
- Check device performance.
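Beyond ping and SNMP, a basic reachability probe can measure TCP connect latency to a host and port; a stdlib-only sketch (the target address and port below are placeholders):

```python
import socket
import time

def tcp_probe(host, port, timeout=2.0):
    """Return TCP connect latency in ms, or None if the port is unreachable."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return round((time.perf_counter() - start) * 1000, 2)
    except OSError:
        return None

latency = tcp_probe("127.0.0.1", 8080)  # placeholder target
print("unreachable" if latency is None else f"connect latency: {latency} ms")
```

Running such probes on a schedule from several network vantage points gives a simple latency/availability series without any dedicated hardware.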
Metrics that need to be collected from the database:
- Resource usage,
- Disk I/O,
- Cache and buffer pool usage,
- Negative error codes,
- Queries that delay response,
- Number of threads established by the application and other services,
- Idle threads and running threads,
- Connection errors caused by server errors,
- Failed connections,
- Table growth,
- Index efficiency,
- Partition growth,
- State of stored procedures and functions,
- Database trigger status.
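A few of these database metrics can be sampled directly with SQL; a sketch using Python's built-in sqlite3 and an invented table (real engines such as MySQL or PostgreSQL expose far richer system views for the same purpose):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(9.5,), (12.0,), (3.25,)])

# Table growth: sample the row count per table over time
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Index visibility: list the secondary indexes available on this table
indexes = [row[1] for row in conn.execute("PRAGMA index_list('orders')")]

print(f"orders rows: {row_count}, indexes: {indexes}")
```

Polling counts like these on a schedule and charting the deltas is the simplest way to watch table and partition growth.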
I believe we can collect some more metrics by thinking from a support perspective and visualizing events from the CI/CD pipeline, such as:
- The trigger did not fire when a code commit happened,
- Something aborted while packaging the code,
- The automated unit test did not trigger,
- Unit test script malfunction,
- Security scan failed,
- Code integration did not pass before the integration test ran,
- Deployment failure due to technical challenges.
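These pipeline events can be tracked like any other metric; a minimal sketch that flags the failed stages of one hypothetical pipeline run (stage names and results are invented):

```python
# Hypothetical result of one CI/CD pipeline run: (stage, succeeded)
pipeline_run = [
    ("commit-trigger", True),
    ("package", True),
    ("unit-tests", False),        # e.g. the automated unit test did not pass
    ("security-scan", True),
    ("integration-tests", False),
    ("deploy", False),
]

failed_stages = [stage for stage, ok in pipeline_run if not ok]
if failed_stages:
    print(f"ALERT: pipeline failed at: {', '.join(failed_stages)}")
```

Counting such failures per stage over many runs shows which step of the pipeline breaks most often and deserves attention first.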