Collecting GPU freeze debugging data with apport

Registered by Chris Halse Rogers on 2010-04-12

* What can we do better for -intel GPU freezes?
  - These inevitably will have lots of duplicates, which reduced their usefulness in Lucid. What information do we need to match duplicates? Can we get the kernel to help?
* Can we get the same sort of information via radeontool & avivotool for ati cards?
* Can we get the same sort of information for nouveau?
* How can we make mmiotracing as painless as possible for users, and as useful as possible for developers?

Blueprint information

Status: Complete
Approver: Bryce Harrington
Priority: Low
Drafter: Chris Halse Rogers
Direction: Needs approval
Assignee: Bryce Harrington
Definition: Approved
Series goal: Accepted for maverick
Implementation: Implemented
Milestone target: None
Started by: Bryce Harrington on 2011-04-07
Completed by: Bryce Harrington on 2011-04-07

Whiteboard

Work items:
[sconklin] Investigate whether drm.debug=0x04 has an unacceptable overhead: DONE
[sconklin] Can drm.debug be changed at runtime? -> in e.g. /sys/module/drm/debug_level: DONE
[pitti] Can we have apport detect when a GPU hang did not produce debug info, then offer to turn debugging on for the user (whatever the kernel team comes up with, or adding “drm.debug=0x06” to the kernel command line), prompt to reboot, and send a bug report the next time a hang occurs: DEFERRED
[raof] Review triage notes - do triagers have enough info to go straight upstream with everything upstream needs? (Also look at previously upstreamed X freeze bugs to see what other info typically gets asked for.): DONE
[raof] Ask upstreams - what is the minimum debugging information they need? Can it be turned on at runtime?: DONE
[sconklin] Make the intel kernel module always print: possible outputs, connected outputs, detected modes, selected mode: POSTPONED
[pitti] Add apport.hookutils.add_drm_info(): DONE
[raof] Call apport.hookutils.add_drm_info() from compiz, X, kernel hooks: DEFERRED

Work items for maverick-alpha-2:
[raof] Quick way of detecting accurately whether the GPU has actually hung: DONE
[raof] Determine list of files desirable for debugging from /sys: DONE
[raof] Re-enable apport hook in Maverick for intel GPU hangs after ensuring the hook tags bugs appropriately: DONE

2010-06-09 - raof:
* Files for add_drm_info hook:

/sys/class/drm is a symlink farm of directories, some of which depend on the driver in question (particularly, the ttm/ directory only exists for radeon and nouveau). The files that would be most commonly useful reside in directories of the form card$NUM-$CONNECTOR-$CONNUM (eg: card0-DVI-D-1, card0-LVDS-1).

The files in these directories differ depending on the connector type. The hook should grab every file in each /sys/class/drm/card$NUM-$CONNECTOR-$CONNUM directory except “uevent”. It need not collect anything from subdirectories of card$NUM-$CONNECTOR-$CONNUM.
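
The collection rule above could be sketched roughly as follows. This is not the actual apport.hookutils.add_drm_info() implementation; the function name collect_connector_files, the regular expression, and the shape of the returned dictionary are assumptions made for illustration. Contents are kept as raw bytes because some connector files (such as edid) are binary.

```python
import os
import re

# Assumed pattern for connector directories such as card0-DVI-D-1 or card0-LVDS-1.
CONNECTOR_DIR = re.compile(r"^card\d+-.+-\d+$")


def collect_connector_files(drm_root="/sys/class/drm"):
    """Return {"<connector>/<file>": raw bytes} for every connector directory.

    Hypothetical helper: skips "uevent" and does not descend into
    subdirectories, as described in the note above.
    """
    report = {}
    for entry in sorted(os.listdir(drm_root)):
        if not CONNECTOR_DIR.match(entry):
            continue
        connector_path = os.path.join(drm_root, entry)
        if not os.path.isdir(connector_path):
            continue
        for name in sorted(os.listdir(connector_path)):
            path = os.path.join(connector_path, name)
            # Ignore "uevent" and anything that is not a plain file
            # (subdirectories, symlinks to directories).
            if name == "uevent" or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    report[f"{entry}/{name}"] = f.read()
            except OSError:
                # Some sysfs files cannot be read; skip them quietly.
                continue
    return report
```

Calling collect_connector_files() with no arguments reads the live /sys/class/drm tree; passing another directory makes the sketch easy to try against a copy of that tree.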

2010-06-21 - sconklin:

drm.debug=0x04 should have little performance impact - the additional debug output is in paths having to do with hardware initialization, monitor probing, and other mode setting actions which only happen when booting, adding and removing displays, and suspend/resume.

drm.debug can be changed while the system is running, e.g. (as root):
echo 4 > /sys/module/drm/parameters/debug
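
A hook or helper script could do the same thing from Python. The sketch below is only an illustration: the function names get_drm_debug and set_drm_debug are hypothetical, the only path used is the one given above, and it assumes the parameter file holds a plain decimal integer. Writing to it requires root.

```python
DRM_DEBUG_PARAM = "/sys/module/drm/parameters/debug"


def get_drm_debug(param_path=DRM_DEBUG_PARAM):
    """Read the current drm.debug value (assumed to be a decimal integer)."""
    with open(param_path) as f:
        return int(f.read().strip())


def set_drm_debug(level, param_path=DRM_DEBUG_PARAM):
    """Write a new drm.debug value; needs root on a real system."""
    with open(param_path, "w") as f:
        f.write(f"{level}\n")
```

For example, set_drm_debug(0x04) would have the same effect as the echo command, and the previous value can be saved with get_drm_debug() first so it can be restored afterwards.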

bryce 2010-07-22: RAOF, what was the outcome of the discussion with upstream about required debug information? Are they happy with what we're collecting for this, or is more needed?

RAOF 2010-09-27: Deferring remaining WI. Triager notes can be updated during the freeze, and add_drm_info has been added to all but the kernel apport hooks.

bryce 2011-04-07: For natty, gpu lockup bugs appear to be hit a lot more often with -intel than -ati or -nouveau. There are a lot of reasons for this. I spent some time this cycle optimizing the intel hook to prevent dupes, and to ensure the right information is being captured for upstream. I think due to this tool we were able to solve a number of bugs that we would not have been able to without it. Going forward, we still need hooks for -ati and -nouveau but given what we saw in natty, the -intel gpu hook is the highest priority.

bryce 2011-04-07: Retargeting to oneiric for follow up work.
1. "Apport detecting if gpu hang didn't include debug info." Moved to desktop-o-xorg-tools-and-processes. Fwiw, upstream never asks us to enable debugging with these bugs so this may be low priority.
2. "kernel module always print: possible outputs" Moved to desktop-o-xorg-tools-and-processes.
3. "apport.hookutils.add_drm_info()." Actually, in practice this adds a lot to the length of a bug report, but we don't use it very often. We can also get most of this info from xrandr and Xorg.0.log; I don't think it's proven to be that useful unfortunately. Maybe would be better if it attached the info as a separate file?

With those three deferred items moved to a new blueprint for follow up, I think we can consider this blueprint complete.
