Enhance Administrative Protocol with Counters
It would be nice if the administrative protocol, specifically the "status" request or a new one, returned a list of the functions with counters for the total number of jobs submitted and the total number serviced. The current values are gauges, so they only show the state at the moment of the request. With counters we could see the total work done and calculate jobs/second by polling.
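As a rough illustration of the polling idea, the sketch below derives jobs/second from two samples of a "serviced" counter. The CounterSample type and its field names are hypothetical; nothing here reflects an existing admin command or its output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One poll of a function's counters (hypothetical field names)."""
    timestamp: float  # seconds, e.g. from time.time()
    submitted: int    # total jobs submitted since server start
    serviced: int     # total jobs serviced since server start


def jobs_per_second(earlier: CounterSample, later: CounterSample) -> float:
    """Average service rate between two polls of the same function."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    delta = later.serviced - earlier.serviced
    if delta < 0:
        # The counter went backwards, so assume the server restarted and
        # count only the work done since the restart.
        delta = later.serviced
    return delta / elapsed


first = CounterSample(timestamp=0.0, submitted=100, serviced=90)
second = CounterSample(timestamp=10.0, submitted=700, serviced=590)
rate = jobs_per_second(first, second)
```

A gauge sampled at the same two instants could read zero both times even though 500 jobs were serviced in between, which is the gap counters would close.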
Blueprint information
- Status:
- Not started
- Approver:
- None
- Priority:
- Undefined
- Drafter:
- None
- Direction:
- Needs approval
- Assignee:
- None
- Definition:
- New
- Series goal:
- None
- Implementation:
- Unknown
- Milestone target:
- None
- Started by
- Completed by
Whiteboard
I would like to expand on Scott's request. Currently, the only real way to have any visibility into what a job server is doing is for the clients and the workers to independently log job requests sent and received. What happens inside the job server is visible only through extremely low-resolution gauges. A single server can be doing many jobs per second while only a few jobs ever show up in the gauges.
What is needed is a combination of gauges and counters, depending on the type of data being represented. At the very least, there should be cumulative gauges that actually represent what the server has been doing over the last n seconds or minutes. Even just a set of counters per job holding the total number of jobs requested, completed, and failed since restart would be amazing to have.
The long list of useful data would be:
Global Server stats since the last restart:
* Existing "status" output (useful for debugging immediate problems on the server itself)
* Total Available Workers (gauge)
* Total bytes read (counter)
* Total bytes written (counter)
* rusage (gauge)
* current connections (gauge)
* total connections (counter)
Per Worker counters since the last restart:
* Total bytes read (counter)
* Total bytes written (counter)
* total jobs requested (counter)
* total jobs completed (counter)
* total jobs failed (counter)
Per Function counters since the last restart:
* total jobs requested (counter)
* total jobs completed (counter)
* total jobs failed (counter)
* total payload in bytes (counter)
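To make the per-function list concrete, here is a minimal sketch of how such counters might be kept in memory. The class and method names (FunctionCounters, FunctionStats, job_requested, and so on) are invented for illustration and are not part of any existing server code.

```python
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class FunctionCounters:
    """Monotonic counters for one function, reset only on restart."""
    jobs_requested: int = 0
    jobs_completed: int = 0
    jobs_failed: int = 0
    payload_bytes: int = 0


class FunctionStats:
    """Hypothetical registry mapping function name to its counters."""

    def __init__(self) -> None:
        self._by_function = defaultdict(FunctionCounters)

    def job_requested(self, function: str, payload: bytes) -> None:
        counters = self._by_function[function]
        counters.jobs_requested += 1
        counters.payload_bytes += len(payload)

    def job_completed(self, function: str) -> None:
        self._by_function[function].jobs_completed += 1

    def job_failed(self, function: str) -> None:
        self._by_function[function].jobs_failed += 1

    def snapshot(self) -> dict:
        """Return plain dicts so the result is easy to serialize."""
        return {name: asdict(c) for name, c in self._by_function.items()}


stats = FunctionStats()
stats.job_requested("resize", b"abcd")
stats.job_requested("resize", b"xy")
stats.job_completed("resize")
stats.job_failed("resize")
snapshot = stats.snapshot()
```

Because the values only ever increase, a client that polls snapshot() twice can subtract the results to get the work done in that interval.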
Having the ability to output all of this in some machine-readable format (tabular, JSON, YAML, XML, etc.) would also be very worthwhile.
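One possible shape for that output, sketched under the assumption that the per-function counters are available as a plain dictionary. The field names and the tab-separated layout are made up for this example and do not describe an existing response format.

```python
import json

# Hypothetical column order for the tabular form.
FIELDS = ("jobs_requested", "jobs_completed", "jobs_failed", "payload_bytes")


def render_json(stats: dict) -> str:
    """Serialize the counters as one JSON object keyed by function name."""
    return json.dumps(stats, sort_keys=True)


def render_tabular(stats: dict) -> str:
    """Serialize the counters as tab-separated rows with a header line."""
    lines = ["\t".join(("function",) + FIELDS)]
    for name in sorted(stats):
        row = [name] + [str(stats[name][field]) for field in FIELDS]
        lines.append("\t".join(row))
    return "\n".join(lines)


example = {
    "resize": {
        "jobs_requested": 2,
        "jobs_completed": 1,
        "jobs_failed": 1,
        "payload_bytes": 6,
    }
}
as_json = render_json(example)
as_table = render_tabular(example)
```

The tabular form is easy to read with line-oriented tools, while the JSON form carries the field names with every record; offering both from the same data costs little.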