Enhance Administrative Protocol with Counters

Registered by Scott C. Lemon

It would be nice if the administrative protocol, through either the existing "status" request or a new one, returned a list of the functions with counters for the total number of jobs submitted and the total number serviced. The current values are gauges. With counters we could see the total work done and calculate jobs/second by polling.
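As a rough illustration of the polling idea, the sketch below derives a jobs/second rate from two samples of a cumulative counter. The function name and its arguments are hypothetical; nothing here is part of the current administrative protocol.

```python
def jobs_per_second(prev_count, prev_time, curr_count, curr_time):
    """Average job rate between two polls of a cumulative counter.

    Counts are totals since the server started; times are in seconds.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("polls must be taken at increasing times")
    if curr_count < prev_count:
        # The counter went backwards, so the server was probably restarted
        # between polls. Only the jobs counted since the restart are known,
        # which makes this a lower bound on the real rate.
        return curr_count / elapsed
    return (curr_count - prev_count) / elapsed
```

A gauge cannot support this calculation, because jobs that start and finish between two polls never appear in either sample.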

Blueprint information

Status: Not started
Approver: None
Priority: Undefined
Drafter: None
Direction: Needs approval
Assignee: None
Definition: New
Series goal: None
Implementation: Unknown
Milestone target: None

Whiteboard

I would like to expand on Scott's request. Currently, the only real way to get any visibility into what a job server is doing is for the clients and the workers to independently log the job requests they send and receive. What happens inside the job server is visible only through extremely low-resolution gauges. A single server can be doing many jobs per second while only a few jobs ever show up in the gauges.

What is needed is a combination of gauges and counters, depending on the type of data being represented. At the very least, we need cumulative gauges that actually represent what the server has been doing over the last n seconds or minutes. Even just a set of counters per function holding the total number of jobs requested, completed and failed since restart would be amazing to have.
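The minimal version of this request, three totals per function since restart, could be modelled along the following lines. This is only a sketch: the class, the registry and the `on_job_*` hooks are hypothetical names, not existing job server code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FunctionCounters:
    """Totals for one function since the server last started."""
    requested: int = 0
    completed: int = 0
    failed: int = 0

    @property
    def outstanding(self) -> int:
        # Derived gauge: jobs accepted but not yet finished either way.
        return self.requested - self.completed - self.failed


# Hypothetical registry keyed by function name.
counters = defaultdict(FunctionCounters)


def on_job_submitted(function: str) -> None:
    counters[function].requested += 1


def on_job_completed(function: str) -> None:
    counters[function].completed += 1


def on_job_failed(function: str) -> None:
    counters[function].failed += 1
```

Because the totals only ever increase, a gauge such as the number of outstanding jobs can always be derived from them, while the reverse is not possible.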

The long list of useful data would be:

Global server stats since the last restart:
* Existing "status" output (useful for debugging immediate problems on the server itself)
* Total available workers (gauge)
* Total bytes read (counter)
* Total bytes written (counter)
* rusage (gauge)
* Current connections (gauge)
* Total connections (counter)

Per-worker counters since the last restart:
* Total bytes read (counter)
* Total bytes written (counter)
* Total jobs requested (counter)
* Total jobs completed (counter)
* Total jobs failed (counter)

Per-function counters since the last restart:
* Total jobs requested (counter)
* Total jobs completed (counter)
* Total jobs failed (counter)
* Total payload in bytes (counter)

Having the ability to output all of this in some machine-readable format (tabular, JSON, YAML, XML, etc.) would also be very worthwhile.
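To make the output idea concrete, here is a small sketch that renders a snapshot of per-function counters either as JSON or as tab-separated rows. The field names, the column order and the `render_stats` helper are assumptions made for illustration, not a proposed wire format.

```python
import json

# Assumed column order for the tabular layout.
COLUMNS = ("requested", "completed", "failed", "payload_bytes")


def render_stats(stats, fmt="json"):
    """Render {function_name: {column: total}} in the requested format."""
    if fmt == "json":
        return json.dumps(stats, sort_keys=True)
    if fmt == "tabular":
        lines = []
        for name in sorted(stats):
            # One row per function: the name followed by each counter.
            fields = [name] + [str(stats[name][col]) for col in COLUMNS]
            lines.append("\t".join(fields))
        return "\n".join(lines)
    raise ValueError("unsupported format: " + fmt)
```

For example, `render_stats({"resize": {"requested": 3, "completed": 1, "failed": 1, "payload_bytes": 2048}}, "tabular")` yields a single row with the function name followed by the four totals, which is easy to parse from a monitoring script.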

