Operation
This part describes tools to administer and operate systems. It is aimed not at small setups but at medium to large systems with many hosts running different services that work together. When something fails in such a system, the cause is often hard to find because it may not be directly linked to the failing part.
Architecture
To manage this you have to analyze a lot of runtime information. An architecture built from different tools can help with that:
Parts
At first this looks complex, but start by looking at each area on its own:
- Status Monitoring: This is what classical monitoring systems do. They check the current status of a host or service and alert if something is not working correctly.
- Metrics Data: Metric data is a collection of many numerical values over time that shows how the system behaves. It can be analyzed to find bottlenecks, overload or other problems.
- Log Events: Nearly every process writes log files with textual information about problems or errors. These log events can be analyzed further to show changes over time and to compare events from multiple logs that occurred at the same time.
- Management: This is what an operator does. Tools here help to keep an inventory and assist with manual tasks and maintenance work.
- Deployment: New code, data or architecture changes are deployed with a lot of testing and tool support so that, where possible, no outage occurs. A deployment to production should normally not happen while there are problems in the staging system.
All five parts basically work on their own and should be set up separately on different hosts. They can then be connected as shown in the graph, so each area can be replaced with a better solution in the future, or resized and extended on demand.
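To make the log events area more concrete, here is a minimal sketch of comparing events from multiple logs over time. It assumes a log line format of `<ISO timestamp> <LEVEL> <message>`; the function name `errors_per_minute` and the source names are made up for illustration, and a real setup would use a dedicated log analysis tool instead.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(logs):
    """Count ERROR lines per minute and per source.

    `logs` maps a source name to a list of lines in the assumed
    format "<ISO timestamp> <LEVEL> <message>" (an assumption, not a standard).
    """
    counts = Counter()
    for source, lines in logs.items():
        for line in lines:
            parts = line.split(" ", 2)
            if len(parts) < 3:
                continue  # skip lines that do not match the assumed format
            timestamp, level, _message = parts
            if level != "ERROR":
                continue
            # Truncate the timestamp to the minute to build time buckets.
            minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
            counts[(minute.isoformat(), source)] += 1
    return counts
```

Grouping by minute and source makes it easy to see whether errors in different services appear in the same time window, which can hint at a shared cause.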
Attention
This page and all its subpages reflect my opinion of how things should work. It is based on my knowledge and the current state of technology, and all of it may change again over the next years.
So please treat this concept and the descriptions as a hint, but decide for yourself what you need and how much you will invest in each IT operations area.
All together
In the end you can connect all of the above systems together, meaning:
- Status Monitoring -> Metrics Data
  - collect data from metrics where possible
  - reduce the load on the system
- Metrics Data -> Log Events
  - show events as annotations in Grafana reports
- Management -> Status/Metrics/Events
  - gather information for analysis
  - set up systems on changes and additions
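The first connection, status monitoring reading from metrics data, can be sketched as follows. This is only an illustration under assumptions: the function name `status_from_metric`, the status names and the thresholds are hypothetical and not taken from any particular monitoring system. The idea is that the status check evaluates samples the metrics system has already collected instead of probing the host a second time.

```python
from statistics import mean


def status_from_metric(samples, warn, crit):
    """Derive a status from metric samples that were already collected.

    `samples` is a list of recent values (for example CPU utilisation
    between 0 and 1); `warn` and `crit` are assumed threshold values.
    """
    if not samples:
        return "UNKNOWN"  # no data is not the same as a healthy system
    average = mean(samples)
    if average >= crit:
        return "CRITICAL"
    if average >= warn:
        return "WARNING"
    return "OK"
```

For example, `status_from_metric([0.85, 0.9], warn=0.8, crit=0.95)` yields `"WARNING"` without running any additional check on the monitored host.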