Comments on Lusis: "Monitoring Sucks - Watch your language"

----------------------------------------

This sounds similar to the thought process of the guys who built LogicMonitor.

-- Cevin, 2012-07-11 20:40

----------------------------------------

At infochimps we've been iterating around this too, but using the term 'facts' instead of 'context'.

* metrics -- name + number + timestamp: < web.api.api_node_5.req_rate | 234 | 20110908123456 >
* logs -- "83.240.154.3 - - [07/Jun/2008:20:37:11 +0000] "GET /faq HTTP/1.1" 200 569"
* facts -- { :group => "web.api", :node => "api_node_5", :req_rate => 234, :req_200 => 213, :req_404 => 21, :timestamp => "20110908123456" }

Logs are *unstructured records* with *pre-determined semantics*. Their limitations are well recognized.

Metrics are *structured values* with *arbitrary semantics*. Their lightweight nature makes the view layer trivial, but limits what you can ask of them.

Facts are *structured records* with *arbitrary semantics*. The writing agent can inject anything it wants; it's up to the view layer to interpret.
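The metric/log/fact distinction above can be sketched in Python. This is a minimal illustration, not anyone's actual implementation: the field names follow the infochimps example, and `fact_to_metrics` is a hypothetical "decorator" that flattens a structured fact into individual name/value/timestamp metrics.

```python
# A "fact": a structured record with arbitrary, writer-defined fields.
fact = {
    "group": "web.api",
    "node": "api_node_5",
    "req_rate": 234,
    "req_200": 213,
    "req_404": 21,
    "timestamp": "20110908123456",
}

def fact_to_metrics(fact, dimensions=("group", "node", "timestamp")):
    """Flatten a fact into (name, value, timestamp) metric tuples.

    Dimension fields build the metric name prefix; the remaining
    numeric fields each become one metric value.
    """
    prefix = ".".join(str(fact[d]) for d in ("group", "node") if d in fact)
    ts = fact.get("timestamp")
    return [
        (f"{prefix}.{key}", value, ts)
        for key, value in fact.items()
        if key not in dimensions and isinstance(value, (int, float))
    ]

for metric in fact_to_metrics(fact):
    # prints three (name, value, timestamp) tuples, one per numeric field
    print(metric)
```

One fact thus carries everything the writing agent knew at once, while a metric view can still be derived on demand.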
I lean more and more towards shipping around facts and having decorators turn them into metrics where necessary.

-- @mrflip (http://infochimps.org), 2011-10-24 23:16

----------------------------------------

Found the IRC logs at https://github.com/monitoringsucks

-- Aleksey Tsalolikhin (http://www.verticalsysadmin.com), 2011-08-15 17:10

----------------------------------------

Do you have logs of the meeting, please?

-- Aleksey Tsalolikhin (http://www.verticalsysadmin.com), 2011-08-15 16:37

----------------------------------------

I think that the distinction between metrics and context has been confused due to historical concerns with data storage and data access, both of which can be solved more easily now.

The storage system for monitoring data should store data. Data becomes a datapoint, from our perspective, when it gets a timestamp, typically representing the time the data was collected.

A datapoint's storage system should not be limited by concepts that favor efficiency over flexibility (although the schema should be sane); I'm looking at RRD and, to a lesser extent, Whisper. A relational database or a flexible NoSQL equivalent would be better. Form should follow function.

Let's think of a datapoint as a SQL database row: created_at timestamp, component_id, numerical metric (or more than one?), string data. Log-like info and metrics in the same datapoint, or just a metric in the datapoint, or just a log line. The datapoint's related component id could be anything at all, and we would be completely free from the host/service model.
A use-agnostic datapoint storage system would be naturally useful to analysts of all types of data: ops would start with it, then finance developers would pick it up, then environmental analysts would use it. Let's not limit ourselves to numbers on the storage side.

The definition of events in the summary conflicts with this idea. At the data storage level, the concept of an "event" is not applicable: it is just data, 1s and 0s, and any meaning that data has should be decided by a component at another level.

Put it all in one place and let the storage engine deal with it, whatever the cost. The cost is certainly higher in both processing and disk space, but it can be kept within sane limits; look at Zabbix's storage system.

With such a data model, reporting and correlation (say, if everything is in a SQL database) are easy, and any application can use the data for any purpose: correlation, graphing, alerting, business data analysis, etc.

So, I'm proposing two things:

1) Recognize the datapoint as a primitive:
   - timestamp (mandatory)
   - component (mandatory)
   - numerical data (optional)
   - string data (optional)
   (Numerical and string data can coexist, with interpretation left up to the controller.)

2) A storage system able to handle this flexibility. I would suggest a relational database for this; like I said, the cost would be higher, but I think it's justified by the benefits.
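The datapoint primitive proposed above could be sketched roughly as follows. This is a hypothetical schema with made-up table, column, and component names, using SQLite in memory purely for brevity; a production store would differ.

```python
import sqlite3

# Hypothetical schema for the proposed datapoint primitive:
# timestamp and component are mandatory; numeric and string
# data are optional and may coexist in one row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE datapoints (
        created_at   TEXT NOT NULL,   -- timestamp (mandatory)
        component_id TEXT NOT NULL,   -- component (mandatory)
        num_value    REAL,            -- numerical data (optional)
        str_value    TEXT             -- string data (optional)
    )
""")

# A pure metric, a pure log line, and a mixed datapoint all fit.
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?, ?)",
    [
        ("2011-07-22T16:00:00Z", "web01.cpu", 42.5, None),
        ("2011-07-22T16:00:01Z", "web01.apache",
         None, "GET /faq HTTP/1.1 200"),
        ("2011-07-22T16:00:02Z", "web01.apache",
         200.0, "GET /faq HTTP/1.1"),
    ],
)

# Any application can query the same store for any purpose.
rows = conn.execute(
    "SELECT component_id, num_value FROM datapoints "
    "WHERE num_value IS NOT NULL"
).fetchall()
print(rows)  # [('web01.cpu', 42.5), ('web01.apache', 200.0)]
```

Note that the component id here is just an opaque string, so nothing ties the row to a host/service model.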
For writes, a special collector/proxy (like statsd or a collectd output plugin) would be helpful; for reads, well, it's SQL, and that can be dealt with later.

-- Don Johnson (http://twitter.com/arthabaska), 2011-07-22 16:00

----------------------------------------

Didn't have a chance to participate in the discussion online, but I have been following the progress from afar. Thoughts below are strictly IMHO.

Re metric: I think a metric should be numeric only, int or float. Boolean is technically a subset of int (true == 1, false == 0).

Re context being human-readable attributes: I think it's not necessary. Human readability is part of how monitoring data are presented to humans; it has nothing to do with the monitoring system and events.

Re resource as source of metric: strongly disagree. First, an event could be related to multiple resources ("ping latency from host A to host B is N" is related to host A, host B, and the network which connects A and B; if A and B are connected over a WAN, it could relate to a whole set of resources in the path between A and B). Second, inventing "virtual" resources to avoid having events without resources (like your "whole business" resource) sounds like a hack.

My view of how a monitoring system should be built: events (which are a numeric value plus a hierarchical descriptor, as in Graphite) all go into a datastore; they can be queried on demand or consumed as a realtime stream; everything else (alerting, event correlation, trending, UI, etc.) consists of independent services that operate on data from the events datastore.
All services independently analyze data, generate new events, and put them back into the datastore, to be picked up by other services, and so on.

Hope to blog about this some time soon.

-- Dmitriy (http://www.somic.org), 2011-07-22 12:19

----------------------------------------

Zabbix:
Metric = Item. Items in Zabbix represent data points. They can be lots of things, such as measurements (disk usage percentages), text (hostname, log file entry, etc.), booleans (port state), or a step of accessing a webpage (response time, status code).

Context = Value, be it percentage, text, or other.

Event = Trigger.

Resource = Zabbix doesn't do this natively, but it's possible. Its primary focus is item x on server y. You could work around it: a custom key in the Zabbix agent on the monitoring server triggers a local script that calls the Zabbix agents on multiple servers, grabs their values, and does whatever with them. Zabbix provides tools for independently 'getting' values from agents.

Action = Action. Based on trigger(s) firing, certain actions can be automated. Zabbix supports two types of operation: messages and remote commands. Messages go out via e-mail, SMS, or whatever is defined. A remote command is a call to run a command on whatever server running a Zabbix agent you specify; e.g., if Apache dies, you could tell the Zabbix agent on the box to run "sudo service httpd restart", run a custom script on another box to start Apache there, or whatever takes your fancy.

Collection = Zabbix supports both push and pull models, primarily pull. Active checks can be defined, which means an agent will push values to the central server. I've not messed about with the latter, so I can't swear to its effectiveness.

Event Processing = Covered within triggers. Zabbix strongly favours the numerical types but can cover string searches. Triggers can fire at various severity levels as required (and actions can be fine-tuned based on severity). You can specify that certain triggers are dependent on other triggers, e.g.
don't bother alerting about Apache being down if the server isn't responding to pings.

Presentation = Zabbix has its own graphing engine. Data is stored in a MySQL database and queried live to bring you your information. You specify per item how long accurate data is stored and how long trend data is stored. Graphing is mostly accurate, but zoomed-out views can be odd, as values are averaged over a period of time: a graph displaying an hour might tell you the max was 500, where a graph showing you five minutes might show you actually peaked at 2000+. It also seems to be fairly CPU-intensive (mostly MySQL-bound).

Analytics = Not so strong. You can create custom graphs of whatever combination of metrics you fancy, but it often feels limited. You can define availability reports, top-100 triggers, and some comparisons, but the options are minimal.

-- Twirrim (http://twitter.com/Twirrim), 2011-07-22 01:30

----------------------------------------

Thank YOU, John, for all of the coordination, your work on it, and this post.

-- Jeff Blaine (http://www.kickflop.net/), 2011-07-21 23:24