CI kernels causing many "Illegal Instruction"s

Bug #859473 reported by Spring Zhang
This bug affects 1 person
Affects               Status        Importance  Assigned to         Milestone
LAVA Validation Lab   Invalid       Undecided   Dave Pigott
Linaro CI             Fix Released  High        Deepti B. Kalakeri
Obsolete LAVA Test    Invalid       Undecided   Unassigned

Bug Description

Recently there have been lots of validation failures for kernels from
CI.

These failures manifest as "Illegal Instruction" errors all over the logs
(not just during tests, but booting as well.)

It seems that the kernel is causing a lot of SIGILLs, but we don't know
why yet.

To reproduce
------------

Combining

http://ci.linaro.org/kernel_hwpack/omap3/hwpack_linaro-omap3_20111007-0028_armel_supported.tar.gz
http://snapshots.linaro.org/11.05-daily/linaro-nano/20111005/1/images/tar/nano-n-tar-20111005-1.tar.gz

with linaro-media-create (--rootfs ext2) should give you an image that shows the problem.

Information wanted
------------------

There are currently two things that we would like more information on to try
and narrow down the cause of the problem.

1) A core file from a crashing application

To get this

  Create the image as described above
  Boot into it
  Run "ulimit -c 1024"
  Run "hwclock"
  Watch it crash with "Illegal Instruction (core dumped)"
  Get that core from the filesystem.

2) Test a defconfig build

If this succeeds then it suggests a problem with the way that the kernel is
being built on ci.linaro.org.

To do this

  Get the tip of Linus' tree
  Build with omap2plus_defconfig
  Boot the resulting kernel

If that shows no SIGILL problems, then try a cross-build from an x86 host.

If neither of those show problems then we can concentrate on
ci.linaro.org.
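
A minimal sketch of the defconfig test in step 2, written as a Python helper that only assembles the `make` command lines and does not run them. `ARCH=arm` and the `omap2plus_defconfig` target come from the kernel build system; the `arm-linux-gnueabi-` prefix, the job count, the `uImage`/`modules` targets and the function name are assumptions for illustration and need adjusting to the local toolchain and bootloader.

```python
def defconfig_build_commands(cross_compile=None, jobs=4):
    """Return the make invocations for an omap2plus_defconfig build.

    cross_compile: toolchain prefix such as "arm-linux-gnueabi-" when
    building on an x86 host (assumed prefix); None for a native ARM build.
    """
    base = ["make", "ARCH=arm"]
    if cross_compile:
        base.append("CROSS_COMPILE=" + cross_compile)
    return [
        # Generate .config from the stock defconfig.
        base + ["omap2plus_defconfig"],
        # Build the kernel image and modules (targets are an assumption).
        base + ["-j%d" % jobs, "uImage", "modules"],
    ]
```

Each returned list can be passed to `subprocess.check_call` from the top of the kernel tree, once without a prefix on an ARM host and once with a prefix on an x86 host, to compare the native build and the cross-build.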

Revision history for this message
Alexander Sack (asac) wrote :

looking at our linux-ci bundlestream http://validation.linaro.org/lava-server/dashboard/streams/anonymous/ci-linux/bundles/

this seems to have started around the time when the bundles got real names ... feels likely that this is caused by some lab/infrastructure/lava-test code changes rolled out at that time.

Alexander Sack (asac) wrote :

adding lava-lab project because the lab could potentially be rolled back? also for release the lab state is the real blocker.

Zygmunt Krynicki (zyga) wrote :

I believe that smem is built with the toolchain right on the test device. I would redirect this to the toolchain working group.

Alexander Sack (asac) wrote :

Please subscribe kernel and toolchain folks and get their input.

Fathi Boudra (fboudra) wrote :

@Deepak, Michael,
Please, could you take a look at this bug and give your input?

Paul Larson (pwlars) wrote :

This wouldn't have anything to do with the naming of links, but it could have something to do with the master images. As Zygmunt said, if anything has to be compiled, it's compiled on the master image and installed before booting the device. Could be that some of the newer boards have a master image based on a newer linaro image that could have a more recent toolchain, or, could be that it's just a kernel bug in the ci kernels that are being tested, or maybe the way they are being built? I don't think I've seen anything like this in the daily image testing.

Michael Hudson-Doyle (mwhudson) wrote :

I don't know what's going on here, but why is the lab infrastructure being suspected? All the illegal instructions are in the image being tested, so isn't it likely to be a problem with the image being tested?

David Zinman (dzinman)
Changed in linaro-ci:
milestone: none → 2011.10
James Westby (james-w)
Changed in linaro-ci:
status: New → Confirmed
James Westby (james-w) wrote :

Hi,

I'm seeing more recent tests in the same stream that work, but e.g. linus is failing.

It's failing across all the tests, which suggests a systemic problem.

Some ideas:

   1) bad kernel commit causing the problem
   2) It's not actually running the tests on an ARM machine :-)
   3) Faulty hardware

1 looks unlikely because we have the same revision of the kernel failing and passing
in different runs.

3 looks unlikely because sometimes the same board will fail and sometimes it will
pass (I haven't looked carefully enough to find an instance where it fails and then
passes, so it could be that a number of boards are failing.)

The toolchain looks unlikely to be implicated unless we are varying the toolchain
used to build the kernel, but even then we currently have arm-soc passing and linus
failing, and presumably they would be built with the same toolchain.

Therefore I don't have a clear idea of where the problem lies currently.

Thanks,

James

Changed in linaro-ci:
importance: Undecided → High
James Westby (james-w) wrote :

<mru> the first thing I'd do is get a core dump and find out what instruction is crashing
<mru> also check kernel log
<mru> sometimes there's useful information there

James Westby (james-w) wrote :

We clearly need some more information to decide what is at fault here.

Is it possible to run the test "by hand" remotely? If so, would someone be able
to do that? If not could Dave pull one of the boards to do so?

Thanks,

James

Fathi Boudra (fboudra)
Changed in lava-lab:
assignee: nobody → Dave Pigott (dpigott)
Paul Larson (pwlars) wrote :

> 1 looks unlikely because we have the same revision of the kernel failing and passing
> in different runs.
Can you explain this further? Do you have an example you can point to of this?

#2 is not happening for certain.

Debugging this further should be pretty straightforward. We can take one of the failing ones (if you have a favorite one, let me know), take a machine offline, and manually submit the job just up to the deployment point. So it won't reboot, won't submit results, or anything. We can then connect to that machine and run the tests by hand, gather any debug info we care about, etc.

James Westby (james-w) wrote : Re: [Bug 859473] Re: "Illegal instruction" hints when run stream and posixtestsuite

On Thu, 06 Oct 2011 20:21:42 -0000, Paul Larson <email address hidden> wrote:
> > 1 looks unlikely because we have the same revision of the kernel failing and passing
> > in different runs.
> Can you explain this further? Do you have an example you can point to of this?

linus

9c1f8594df4814ebfd6822ca3c9444fb3445888d passed on beaglexm01
d93dc5c4478c1fd5de85a3e8aece9aad7bbae044 passed on beaglexm04
d93dc5c4478c1fd5de85a3e8aece9aad7bbae044 passed on beaglexm01
d942e43b58dc27b36305bcd374a74f7cc15183a3 failed on panda01
d942e43b58dc27b36305bcd374a74f7cc15183a3 failed on beaglexm03
78bbd284e85f1af56a9fa30760c019357c2a1b4b failed on beaglexm01
78bbd284e85f1af56a9fa30760c019357c2a1b4b failed on panda08
b172e38e435a158cc84169d5b9127a8dd8d21e76 failed on panda09
b172e38e435a158cc84169d5b9127a8dd8d21e76 passed on beaglexm02
f9d81f61c84aca693bc353dfef4b8c36c2e5e1b5 failed on beaglexm02
f9d81f61c84aca693bc353dfef4b8c36c2e5e1b5 failed on panda11
f9d81f61c84aca693bc353dfef4b8c36c2e5e1b5 failed on beaglexm03
f9d81f61c84aca693bc353dfef4b8c36c2e5e1b5 failed on panda12
858b1814b89d043a3866299c258ccdc27eb2538c failed on beaglexm03
858b1814b89d043a3866299c258ccdc27eb2538c failed on panda06
815d405ceff0d6964683f033e18b9b23a88fba87 failed on beaglexm02
815d405ceff0d6964683f033e18b9b23a88fba87 failed on panda17
a102a9ece5489e1718cd7543aa079082450ac3a2 failed on beaglexm04
a102a9ece5489e1718cd7543aa079082450ac3a2 failed on panda24

so apparently no failures before d942e43b58 and no passes after b172e38.
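
As a rough sketch, the listing above can be tallied mechanically rather than by eye. The snippet below parses lines of the form `<revision> passed|failed on <board>` and reports the first revision with a failure, the last revision with a pass, and any revision that both passed and failed. It assumes the lines are in chronological order as posted; the function names are made up for illustration, and only the first ten result lines are included.

```python
import re
from collections import OrderedDict

# First ten linus results from the listing above (assumed chronological).
RESULTS = """\
9c1f8594df4814ebfd6822ca3c9444fb3445888d passed on beaglexm01
d93dc5c4478c1fd5de85a3e8aece9aad7bbae044 passed on beaglexm04
d93dc5c4478c1fd5de85a3e8aece9aad7bbae044 passed on beaglexm01
d942e43b58dc27b36305bcd374a74f7cc15183a3 failed on panda01
d942e43b58dc27b36305bcd374a74f7cc15183a3 failed on beaglexm03
78bbd284e85f1af56a9fa30760c019357c2a1b4b failed on beaglexm01
78bbd284e85f1af56a9fa30760c019357c2a1b4b failed on panda08
b172e38e435a158cc84169d5b9127a8dd8d21e76 failed on panda09
b172e38e435a158cc84169d5b9127a8dd8d21e76 passed on beaglexm02
f9d81f61c84aca693bc353dfef4b8c36c2e5e1b5 failed on beaglexm02
"""

LINE = re.compile(r"^([0-9a-f]{40}) (passed|failed) on (\S+)$")


def parse_results(text):
    """Return (revision, outcome, board) tuples; skip lines that do not match."""
    runs = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            runs.append(match.groups())
    return runs


def outcomes_by_revision(runs):
    """Map each revision, in first-seen order, to the set of outcomes seen."""
    seen = OrderedDict()
    for revision, outcome, _board in runs:
        seen.setdefault(revision, set()).add(outcome)
    return seen


def first_failing(seen):
    """First revision (in listing order) with at least one failure."""
    return next((r for r, o in seen.items() if "failed" in o), None)


def last_passing(seen):
    """Last revision (in listing order) with at least one pass."""
    passing = [r for r, o in seen.items() if "passed" in o]
    return passing[-1] if passing else None


def mixed(seen):
    """Revisions that both passed and failed, i.e. non-deterministic ones."""
    return [r for r, o in seen.items() if o == {"passed", "failed"}]
```

Lines such as "failed to deploy on ..." do not match the pattern and are skipped on purpose, since they are not SIGILL failures.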

arm-soc/for-next

bdb424537d00b036a695763c71068acce487b099 passed on beaglexm03
bdb424537d00b036a695763c71068acce487b099 failed on panda18
bdb424537d00b036a695763c71068acce487b099 passed on beaglexm01
bdb424537d00b036a695763c71068acce487b099 failed on beaglexm03
bdb424537d00b036a695763c71068acce487b099 failed to deploy on panda13
bdb424537d00b036a695763c71068acce487b099 passed on beaglexm01
bdb424537d00b036a695763c71068acce487b099 passed on panda10
bdb424537d00b036a695763c71068acce487b099 passed on beaglexm01
bdb424537d00b036a695763c71068acce487b099 passed on beaglexm04
bdb424537d00b036a695763c71068acce487b099 passed on panda19
bdb424537d00b036a695763c71068acce487b099 passed on panda03
bdb424537d00b036a695763c71068acce487b099 failed on beaglexm03 for other
    reasons
bdb424537d00b036a695763c71068acce487b099 passed on panda14
bdb424537d00b036a695763c71068acce487b099 passed on beaglexm02
bdb424537d00b036a695763c71068acce487b099 passed on panda07
bdb424537d00b036a695763c71068acce487b099 passed on beaglexm01
08db4132dbf24693d6af5b1fdde859c3838feefb passed on panda22
dfa690cd787c72468f8d633c0be81330e4466c08 passed on panda07
dfa690cd787c72468f8d633c0be81330e4466c08 passed on beaglexm02
dfa690cd787c72468f8d633c0be81330e4466c08 passed on panda13
dfa690cd787c72468f8d633c0be81330e4466c08 passed on panda14
dfa690cd787c72468f8d633c0be81330e4466c08 passed on beaglexm04
dfa690cd787c72468f8d633c0be81330e4466c08 passed on panda21
dfa690cd787c72468f8d633c0be81330e4466c08 passed on beaglexm01
dfa690cd787c72468f8d633c0be81330e4466c08 passed on panda14

so there are two failures with that one revision, panda18 and beaglexm03
neither of which have a pass afterwards.

This data ...


Dave Pigott (dpigott) wrote : Re: "Illegal instruction" hints when run stream and posixtestsuite

This really seems unlikely to be anything to do with the master image and how current or otherwise it is. The beagle master images were deployed in late June, and the pandas were deployed from an image built on Tuesday this week.

Dave Pigott (dpigott) wrote :

I'm going to run this against an old and new image. Can someone point me at a JSON file that I can submit to test it out?

James Westby (james-w) wrote : Re: [Bug 859473] Re: "Illegal instruction" hints when run stream and posixtestsuite

On Fri, 07 Oct 2011 09:13:51 -0000, Dave Pigott <email address hidden> wrote:
> I'm going to run this against an old and new image. Can someone point me
> at a JSON file that I can submit to test it out?

http://validation.linaro.org/lava-server/scheduler/job/2348

is a job that failed with Illegal Instructions.

Thanks,

James

James Westby (james-w) wrote : Re: "Illegal instruction" hints when run stream and posixtestsuite

Hi,

There's also a suggestion that the kernel under test isn't built with
CONFIG_NEON. Given that the failures are not consistent, if that was the
case it would suggest that using the defconfig doesn't reliably
determine the value of all config parameters.

Thanks,

James

Dave Pigott (dpigott) wrote :

Wow. OK. So the first occurrence of the "illegal instruction" message is just towards the end of the boot of the newly deployed image - before we've even got to the point of running any tests. This would point very firmly at a problem with the image itself. A lot of the tests thereafter also pump out the "illegal instruction" message.

It never occurs during the boot of the root master image.

James' suggestion that it's the CONFIG_NEON switch bears further investigation.

James Westby (james-w) wrote : Re: [Bug 859473] Re: "Illegal instruction" hints when run stream and posixtestsuite

On Fri, 07 Oct 2011 13:58:08 -0000, Dave Pigott <email address hidden> wrote:
> Wow. OK. So the first occurrence of the "illegal instruction" message is
> just towards the end of the boot of the newly deployed image - before
> we've even got to the point of running any tests. This would point very
> firmly at a problem with the image itself. A lot of the tests thereafter
> also pump out the "illegal instruction" message.

Indeed. It would still be nice to get some gdb information as apparently
that will give us some more clues.

I don't know what info we would want though.

Thanks,

James

Mans Rullgard (mansr) wrote : Re: "Illegal instruction" hints when run stream and posixtestsuite

The output of the gdb commands "x/i $pc" and "info registers" with a core dump from any failing command should provide enough information to proceed. Debug symbols are not required for this to be useful.

A kernel log (dmesg) after a failure might also contain clues.
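
A minimal sketch of collecting that output non-interactively, using gdb's `--batch` and `-ex` options. The helper only builds the argument list; the function name, the `/sbin/hwclock` path and the core file name `core` in the usage note are assumptions, not something confirmed in this thread.

```python
def gdb_triage_command(executable, core):
    """Build a non-interactive gdb invocation that prints the faulting
    instruction and the register state from a core file."""
    return [
        "gdb", "--batch",
        "-ex", "x/i $pc",         # disassemble the instruction at the PC
        "-ex", "info registers",  # dump the register contents
        executable, core,
    ]
```

For example, `subprocess.check_output(gdb_triage_command("/sbin/hwclock", "core"))` would capture both outputs in one go, provided gdb itself starts on the target. Passing the arguments as a list avoids the shell expanding `$pc`.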

James Westby (james-w) wrote :

I can't see that omap2plus_defconfig ends up with anything but CONFIG_NEON=y (despite it
not directly specifying that in the file.) Therefore I don't know that we would end up with
a kernel build without NEON.

I've also downloaded one of the kernel packages in question and the /boot/config-blah-blah
has CONFIG_NEON=y.

Therefore either the kernel is being built without CONFIG_NEON despite it having CONFIG_NEON=y
in that file (unlikely I think,) or CONFIG_NEON isn't the issue.
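
A small sketch of the check described above: looking up one option in the text of a kernel config file. The function name is hypothetical and the sample values in the usage note are illustrative only; they are not taken from the CI kernel's actual config.

```python
def kconfig_value(config_text, option):
    """Return the value of `option` in a kernel .config-style text.

    Returns the string after '=' (e.g. "y" or "m"), "n" when the option
    appears as '# CONFIG_X is not set', or None when it is absent.
    """
    name = option if option.startswith("CONFIG_") else "CONFIG_" + option
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
        if line == "# %s is not set" % name:
            return "n"
    return None
```

With a config text containing `CONFIG_NEON=y`, `kconfig_value(text, "NEON")` returns `"y"`. A return value of `None` means the option was never written to the file, which is different from it being explicitly disabled.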

Thanks,

James

James Westby (james-w) wrote :

I've reproduced this locally:

Just combining

http://ci.linaro.org/kernel_hwpack/omap3/hwpack_linaro-omap3_20111007-0028_armel_supported.tar.gz
http://snapshots.linaro.org/11.05-daily/linaro-nano/20111005/1/images/tar/nano-n-tar-20111005-1.tar.gz

with linaro-media-create (--rootfs ext2) should give you an image that shows the problem.

Unfortunately my SD card seems to have died, so I can't investigate further.

Running many things would hit SIGILL. I installed gdb in order to try and get the
info Mans requested, but simply trying to start gdb got a SEGV.

Mans looked at the core:

<mru> gdb died reading a byte from a null pointer fwiw
<mru> it's crashing in one of the strtol family by the looks of it

This may be related to the SIGILL problems, but there is a chance that it isn't.

I'm still unsure where the problem is, but I think the current evidence points to the
kernel, either the code itself, or something about the way that it is built under CI.

Anyone else have any good ideas?

Thanks,

James

summary: - "Illegal instruction" hints when run stream and posixtestsuite
+ CI kernels causing many "Illegal Instruction"s
James Westby (james-w) wrote :

> in d942e43 but not in 9c1f859 in linus

That set of patches doesn't look to have anything interesting (TPM/zorro/docs/S390)
which suggests that it wasn't caused by a change to linux in
that timeframe, so either:

  1) It's not a linux change at all
  2) It was something earlier where the non-determinism meant that it didn't show up as soon as it was introduced
  3) I'm wrong about the above

If it's 1) then it may point to a change on ci.linaro.org around that time, but it's
not clear why arm-soc for-next hasn't shown this problem for such a long run
of builds now.

We seem to have eliminated LAVA entirely from the equation by reproducing
with just the hwpack and rootfs.

Now let's see if we can eliminate CI. Would someone who knows how to do such things
build linus tip with omap2plus_defconfig and test it on an OMAP?

If that works, does it make a difference if you cross-build from an x86 host?

Thanks,

James

Changed in lava-test:
status: New → Invalid
Changed in lava-lab:
status: New → Invalid
James Westby (james-w) wrote :

Hi,

Mans has found what was causing gdb to crash:

<mru> it's crashing in get_linux_version()
<mru> in gdb/arm-linux-nat.c
<mru> on the 3rd strtoul call
<mru> probably totally hating the fact version is reported as 3.1

so it's unrelated to the SIGILL issues.
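
To illustrate the suspected failure mode only: a parser that assumes three numeric components will trip over a release reported as "3.1". The sketch below is not gdb's code (gdb's `get_linux_version()` is C); it is a hypothetical Python helper showing one tolerant way to parse such strings, treating a missing third component as 0.

```python
import re

# Major and minor are required; the third component is optional.
_VERSION = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_kernel_version(release):
    """Parse a kernel release string into (major, minor, patch).

    A missing third component, as in "3.1", is treated as 0 instead of
    being assumed to exist.
    """
    match = _VERSION.match(release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)
```

Anything after the numeric prefix, such as a local version suffix, is ignored by this sketch.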

So that means that getting a core from a SIGILL would still be interesting.

If someone can combine the hwpack and rootfs I referenced before,
boot it

  ulimit -c 1024
  hwclock

then it should dump a core that may give us some more info.

Getting a fixed gdb on to the system would also be possible so that
the commands could be run directly, rather than working on the core.

Thanks,

James

James Westby (james-w)
description: updated
Mans Rullgard (mansr) wrote :
Michael Hudson-Doyle (mwhudson) wrote :

I've tried to reproduce this, but l-m-c is failing for me:

$ sudo linaro-media-create --rootfs ext2 --mmc /dev/sde --dev panda --hwpack ~/Downloads/hwpack_linaro-omap3_20111007-0028_armel_supported.tar.gz --binary ~/Downloads/nano-n-tar-20111005-1.tar.gz --hwpack-force-yes
...

Writing inode tables: done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 25 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
cp: cannot stat `/tmp/tmpKxd3zn/binary/usr/lib/u-boot/omap4_panda/u-boot.bin': No such file or directory
Traceback (most recent call last):
  File "/usr/bin/linaro-media-create", line 171, in <module>
    args.is_live, args.is_lowmem, args.consoles)
  File "/usr/lib/pymodules/python2.7/linaro_image_tools/media_create/boards.py", line 743, in populate_boot
    proc.wait()
  File "/usr/lib/pymodules/python2.7/linaro_image_tools/cmd_runner.py", line 100, in wait
    raise SubcommandNonZeroReturnValue(self._my_args, returncode)
linaro_image_tools.cmd_runner.SubcommandNonZeroReturnValue: Sub process "['cp', '-v', '/tmp/tmpKxd3zn/binary/usr/lib/u-boot/omap4_panda/u-boot.bin', '/tmp/tmpKxd3zn/boot-disc']" returned a non-zero value: 1

This is exactly like bug 842421, but that was reported to be a hwpack that was too new for the l-m-c, and I get this with lp:linaro-image-tools tip (as well as the version from the ppa, and a few other versions I tried). Boo! :(

Ulrich Weigand (uweigand) wrote :

I have opened bug 871901 to track the GDB crash issue.

Mattias Backman (mabac) wrote :

Michael:

I think all you need is to change the --dev option to beagle:

$ sudo linaro-media-create --rootfs ext2 --mmc /dev/sde --dev beagle --hwpack ~/Downloads/hwpack_linaro-omap3_20111007-0028_armel_supported.tar.gz --binary ~/Downloads/nano-n-tar-20111005-1.tar.gz --hwpack-force-yes

It's a V1 hwpack so l-m-c will use the hard coded paths and this hwpack does not contain the Panda binaries.

Alexander Sack (asac) wrote :

mabac: are you saying our latest lmc cannot use old hwpacks?

Alexander Sack (asac) wrote :

on this bug: how about thumb2? is that enabled in our kernels from CI?

Deepti B. Kalakeri (deeptik) wrote :

No, thumb2 is not enabled.

Thanks!!!
Deepti.

James Westby (james-w) wrote : Re: [Bug 859473] Re: CI kernels causing many "Illegal Instruction"s

On Sun, 16 Oct 2011 10:04:08 -0000, Alexander Sack <email address hidden> wrote:
> mabac: are you saying our latest lmc cannot use old hwpacks?

Nope, he's saying that a v1 hwpack for beagle won't work for a panda.

The same is true of a v2 hwpack, but the failure mode will be different
(trying to use the wrong bootloader/kernel at boot time.)

Thanks,

James

Alexander Sack (asac) wrote :

james: thanks for clarifying; rereading with more care explains that.

As with bug 860556, this is probably a "thumb2 not enabled in defconfig" bug ... we need a way to enable just thumb2 for all our CI builds until we have a rootfs without thumb2, and then see if these go away.

Michael Hudson-Doyle (mwhudson) wrote :

Hah, if it's missing thumb2 support I'm surprised it works as much as it does...

James Westby (james-w) wrote :

On Mon, 17 Oct 2011 21:07:43 -0000, Michael Hudson-Doyle <email address hidden> wrote:
> Hah, if it's missing thumb2 support I'm surprised it works as much as it
> does...

Well, nico says that thumb2 userspace on non-thumb2 kernel should work,
so we've caught a regression here.

We should be testing primarily in the configurations we care about, but
other configurations like this should be tested too if we wish to
support them.

asac asked in a private mail thread about how to enable thumb2 on an
otherwise plain defconfig.

Thanks,

James

Nicolas Pitre (npitre) wrote :

On Mon, 17 Oct 2011, James Westby wrote:

> asac asked in a private mail thread about how to enable thumb2 on an
> otherwise plain defconfig.

Two things:

1) If you want your kernel to support Thumb2 user space binaries, you
   must have CONFIG_ARM_THUMB=y in your kernel config. This is probably
   set as our user space has been compiled to Thumb2 for quite a while.

2) If you want your kernel itself to be a Thumb2 binary, then you must
   have CONFIG_THUMB2_KERNEL=y. This is however available only for
   ARMv7 and above. For example, in the OMAP case, you can't support
   OMAP2 with a Thumb2 kernel, therefore the CONFIG_THUMB2_KERNEL option
   will be visible only if OMAP2 support is configured out (OMAP3 +
   OMAP4 is fine).

Also it is not necessary to match the kernel and user space "thumbness".
However there was a bug in the kernel compiled for ARM mode when a Thumb
mode user space was used with it when that kernel was configured for
both ARMv6 and ARMv7 (e.g. omap2plus_defconfig). The fix was available
in linux-linaro-3.0 but this was missing from linux-linaro-3.1 until a
few hours ago.

Michael Hudson-Doyle (mwhudson) wrote :

Ah wow, that's all very interesting. Thanks for clearing things up for me.

Scott Bambrough (scottb) wrote :

Some notes from the LT's on Thumb2 kernels (CONFIG_THUMB2_KERNEL=y)

TI: The LT tested it when Dave Martin sent out the call, it blows chunks for us in power management patches on tracking -->

''From the crash dump, the faulting code seems to be:

         f503 7182 add.w r1, r3, #260 ; 0x104
** e851 0f00 ldrex r0, [r1] **
         f100 0001 add.w r0, r0, #1

r1 is 0x53555151, which looks more likely to be garbage than to be a real, but misaligned, address. I guess we'll need to figure out where that value is coming from...''

It's stuck at the moment.

Freescale: The LT is currently having some difficulties building the kernel completely with Thumb2 instructions. The major reason is the suspend/resume code which is very SoC specific and low level, uses ARM instructions which they need to convert into thumb2 compatible firstly. It's currently on their wish list.

ARM: The Versatile Express builds have this set and it seems to work.

STE: Need to investigate. I expect problems.

Samsung: Unknown at this time.

James Westby (james-w) wrote :

Hi,

Fixing this is blocked on finding out how to non-interactively enable
CONFIG_ARM_THUMB on an otherwise plain defconfig, assuming that
we in fact want to be testing in that configuration.

If we don't want to be testing in that configuration, then this bug
is not a linaro-ci bug, it's a kernel bug, as it's a regression there and
needs to be fixed in all the trees being tested.
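
One possible way to do the non-interactive part, sketched under assumptions: rewrite the generated `.config` text so the option is forced on, then let Kconfig re-resolve dependencies (for example with `make oldconfig`) before building. The helper name is hypothetical; the kernel tree also ships a `scripts/config` helper that serves a similar purpose.

```python
def set_kconfig_option(config_text, option, value="y"):
    """Return config_text with CONFIG_<option> forced to `value`.

    Any existing assignment or '# ... is not set' line for the option is
    replaced in place; otherwise the assignment is appended at the end.
    """
    name = option if option.startswith("CONFIG_") else "CONFIG_" + option
    new_line = "%s=%s" % (name, value)
    out, replaced = [], False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(name + "=") or stripped == "# %s is not set" % name:
            # Keep a single assignment and drop any duplicates.
            if not replaced:
                out.append(new_line)
                replaced = True
            continue
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"
```

This only edits text. Whether the option survives depends on its Kconfig dependencies being satisfied, so the resulting `.config` should be checked again after Kconfig has processed it.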

Thanks,

James

Dave Martin (dave-martin-arm) wrote :

On Tue, Oct 18, 2011 at 12:34:33PM -0000, Scott Bambrough wrote:
> Some notes from the LT's on Thumb2 kernels (CONFIG_THUMB2_KERNEL=y)
>
> TI: The LT tested it when Dave Martin sent out the call, it blows chunks
> for us in power management patches on tracking -->
>
> ''From the crash dump, the faulting code seems to be:
>
> f503 7182 add.w r1, r3, #260 ; 0x104
> ** e851 0f00 ldrex r0, [r1] **
> f100 0001 add.w r0, r0, #1
>
> r1 is 0x53555151, which looks more likely to be garbage than to be a
> real, but misaligned, address. I guess we'll need to figure out where
> that value is coming from...''
>
> It's stuck at the moment.
>
> Freescale: The LT is currently having some difficulties building the
> kernel completely with Thumb2 instructions. The major reason is the
> suspend/resume code which is very SoC specific and low level, uses ARM
> instructions which they need to convert into thumb2 compatible firstly.
> It's currently on their wish list.

omap3 had some problems of this sort. To avoid needless churn, we
simply kept some of the affected code as ARM for now.

See
https://wiki.linaro.org/WorkingGroups/Kernel/Thumb2Guide#Firmware_Interactions
for an example of how this was implemented.

It is likely that most or all of the affected code can be made Thumb-2
compatible, but building selected snippets as ARM is a useful first
step, and still allows the rest of the kernel to be built in Thumb-2.

If you have a log from make -k, that would also be useful for
identifying how to fix the build failures.

Cheers
---Dave

Changed in linaro-ci:
milestone: 2011.10 → 2011.11
Данило Шеган (danilo) wrote :

Deepti confirmed last week before going on vacation that it was enough to set the THUMB2 config option and that this solved the problem. Please reopen the bug if you are still seeing the problem.

Changed in linaro-ci:
status: Confirmed → Fix Released
assignee: nobody → Deepti B. Kalakeri (deeptik)