Program crash messages

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)


Segfault, Bus Error, Abort, and such

tl;dr:

  • segfault means the kernel says: there is something at that address, but your process may not access it
  • bus error means the kernel says: that address doesn't even exist - anymore, or at all
  • (both are kernel responses to hardware exceptions, raised by the CPU and its memory management unit)
  • pointer bugs can lead to either segfault or bus error
...note that specific bugs are biased to cause one or the other, due to the likelihood of hitting existing versus non-existing addresses
(and various things can influence that likelihood, e.g. 32-bit address spaces usually being mostly or fully mapped, 64-bit not)


  • abort() means code itself says "okay, continuing running is a Bad Idea, let's stop now"
usually based on a test that should never fail.
if you look from a distance it's much like an exit(). The largest practical differences:
abort implies dumping core, so that you can debug this
abort avoids calling exit handlers
...and the earlier this happens, the more meaningful debugging of the dumped core is. Hence the explicit test and abort.
a fairly common case is failed memory allocation (as signalled by code that actually checks the result; not checking often means a segfault very soon after, particularly when the returned null pointer is dereferenced)



Segfault

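Segmentation refers to memory segmentation, an older way of dividing memory into protected segments. The name stuck for the more general idea that a process's memory is fenced off from other processes and from the kernel.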

A segmentation fault (segfault) signals that the requested memory location exists, but the process is not allowed to do what it is trying.


Which is often one of:

  • the address isn't part of the requesting process's currently mapped space, e.g.
    • a null pointer dereference, because most OSes don't map the very start of memory to any process (mostly for this special case)
    • buffer overflow when it gets to memory outside the mapped space
    • a stack overflow can cause it (though other errors may be more likely, because depending on setup it may trample all of the heap before it does)
  • attempt to write to read-only memory


A segfault is raised by hardware that supports memory protection (the MMU in most modern computers), which is caught by the kernel.

The kernel then decides what to do, which in Linux amounts to sending a signal (SIGSEGV) to the originating process, where the default signal action terminates that program.


(core dumped) means it dumped process memory to a file for debugging purposes (the word core is historic, referring to magnetic core memory. It seems it stuck because this is a nicely specific term)
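
The path from bad access to SIGSEGV can be watched from a safe distance by crashing a throwaway child process. This is a minimal sketch, assuming a POSIX system and CPython: the child reads address 0 through ctypes, and the parent looks at how the child died. The names CRASHING_CHILD, run_null_dereference and describe are made up for this example.

```python
import signal
import subprocess
import sys

# Child program: read from address 0 via ctypes, which is normally unmapped.
CRASHING_CHILD = "import ctypes; ctypes.string_at(0)"

def run_null_dereference():
    """Run the crashing child and return its status as seen by the parent."""
    result = subprocess.run([sys.executable, "-c", CRASHING_CHILD],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode

def describe(returncode):
    """On POSIX, subprocess reports death-by-signal as a negative signal number."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode

if __name__ == "__main__":
    print(describe(run_null_dereference()))
```

The shell does the same bookkeeping when it prints "Segmentation fault": it sees that the child was terminated by a signal and reports which one. Depending on system settings, the child may also leave a core file behind.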

Bus error

Means the processor / memory subsystem cannot even attempt to access the memory it was asked to access.

Also raised by hardware, received by the kernel, and on Linux handled by sending the process SIGBUS, triggering the default signal handler.


Possible causes include:

  • address does not make sense, in that it cannot possibly be there (outside of mappable addresses)
e.g. using a random number as a pointer has a decent chance of being this or a segfault
  • IO
    • device became unavailable (verify)
    • device has to report that something is unavailable, e.g. a RAID controller refusing access to a broken drive (e.g. search for combination with "rejecting I/O to offline device")
...or ran out of space, e.g. when mmapping on a ram disk (verify)
  • address fails the platform's alignment requirements
larger-than-byte units often have to be aligned to their size, e.g. a 16-bit value to an address that is a multiple of 2 bytes
Less likely on x86 style platforms than others (x86 is more lenient around misalignments than others)
Theoretically rare anyway, as compilers tend to pad data that is not ideally aligned.
  • cannot page in the backing memory (verify)
e.g.
a broken swap device? (verify)
accessing a memory-mapped file that was removed (though in some cases the OS/filesystem may keep the storage around until it is not used, making this impossible)
executing a binary image that was removed (similar note as above)
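
One of the "cannot page in the backing memory" cases is fairly easy to provoke on purpose. This is a minimal sketch, assuming Linux and CPython, and it uses truncation rather than removal of the mapped file (a related variant, chosen here because it fails reliably): the child maps one page of a temporary file, shrinks the file to zero bytes, then touches the page. The names BUS_ERROR_CHILD and run_truncated_mmap are made up for this example.

```python
import signal
import subprocess
import sys

# Child program: map one page of a temporary file, shrink the file to
# nothing, then touch the page that no longer has anything behind it.
BUS_ERROR_CHILD = """
import mmap, os, tempfile
fd, path = tempfile.mkstemp()
os.unlink(path)                      # the open descriptor keeps the file usable
os.ftruncate(fd, mmap.PAGESIZE)      # make the file exactly one page long
view = mmap.mmap(fd, mmap.PAGESIZE)  # shared read/write mapping of that page
os.ftruncate(fd, 0)                  # the backing storage goes away
view[0]                              # reading the page now faults
"""

def run_truncated_mmap():
    """Run the child and return its status; negative means killed by a signal."""
    result = subprocess.run([sys.executable, "-c", BUS_ERROR_CHILD],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode

if __name__ == "__main__":
    code = run_truncated_mmap()
    if code < 0:
        print("child killed by", signal.Signals(-code).name)
    else:
        print("child exited with status", code)
```

The mapping itself is still valid here, so this is not a permission problem. The kernel simply has nothing to put behind the page, which is why it answers with SIGBUS rather than SIGSEGV.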



In comparison to a segfault:

  • similar in that it is about the address
and having a mangled or random-valued pointer value could lead to either
  • similar in that both are raised by the underlying hardware, that the OS sends the originating process a signal, and that the default (kernel-supplied) signal handler kills that originating process.


  • differs in that a segfault means the request is valid in a mechanical way, but the requesting process may not do this operation

Aborted (core dumped)

This message comes from the default signal handler (verify) for an incoming SIGABRT.


The signal is usually the result of a call to abort(), which stops the process as soon as possible (without calling exit handlers (verify)). This is typically the process itself intentionally crashing as early as it can, which is done for two good reasons:

  • the sooner you do, the more meaningful the core dump is to figuring out what went wrong
  • the sooner you do, the less likely you are to go on to do nonsensical things to data (and potentially write corrupted data to persistent storage)

Ideally, this is only seen during debugging, but the latter reason is why you'd leave this in.

The likeliest sources are the process itself asking for this via a failed assert(), from your own code or from runtime checking in libraries, e.g. glibc noticing double free()s, malloc() noticing overflow corruption, etc.
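
The difference between aborting and exiting normally can be checked with a small experiment. This is a minimal sketch, assuming a POSIX system and CPython, where os.abort() and atexit stand in for C's abort() and atexit(). The names CHILD_TEMPLATE and run_child are made up for this example.

```python
import signal
import subprocess
import sys

# Child program: register an exit handler, then end in the requested way.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("exit handler ran", flush=True))
{ending}
"""

def run_child(ending):
    """Run the child with the given last statement; return (status, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_TEMPLATE.format(ending=ending)],
        capture_output=True, text=True)
    return result.returncode, result.stdout.strip()

if __name__ == "__main__":
    # A normal exit runs the registered handler before the process ends.
    print(run_child("sys.exit(3)"))
    # abort raises SIGABRT; the process dies without running the handler.
    code, output = run_child("os.abort()")
    print(signal.Signals(-code).name, repr(output))
```

The child that exits normally reports its own status and has run its handler. The child that aborts is reported as killed by SIGABRT and prints nothing, because the handler never got a chance to run.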


On core dumps

A process core dump contains (most/all?(verify)) writeable segments specific to the process, which basically means the data segment and stack segment.


A core dump uses ELF format, though it seems to be a bit of a de facto thing wider than the ELF standard.


By default it does not contain the text segment, which contains the code, which is why, when debugging, you also have to tell the debugger which executable was being used.

It wouldn't be executable even then, since it's missing some details (entry point, CPU state).
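
Because a core dump is an ELF file, you can tell one apart from an executable by looking at its header. This is a minimal sketch, assuming the standard ELF header layout (4 magic bytes, the byte-order marker at index 5 of the identification block, and the 16-bit e_type field at offset 16, where the value 4 means core file). The names elf_type and is_core_file are made up for this example.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ET_CORE = 4  # e_type value that the ELF specification assigns to core files

def elf_type(header):
    """Return the e_type field of an ELF header, or None if this is not ELF."""
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return None
    # Index 5 (EI_DATA) says how multi-byte fields are stored: 1 little, 2 big.
    byte_order = {1: "<", 2: ">"}.get(header[5])
    if byte_order is None:
        return None
    # e_type is the 16-bit field right after the 16-byte identification block.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type

def is_core_file(path):
    """Check whether the file at path has an ELF header marked as a core file."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_CORE
```

Calling is_core_file() on a dumped core should return True, and on the executable that produced it False, since executables carry a different e_type.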


Illegal instruction (core dumped)

Illegal instruction means the CPU got an instruction it does not support.

(the CPU signals this via what in the kernel becomes SIGILL - which has further uses[1])


It can happen when executable code becomes corrupted.


More commonly, though, it comes from programs compiled with optimizations for one specific CPU generation within the wider platform it is part of.

Most programs are compiled to avoid this ever happening, by being conservative about what they will be run on, which is what compilers and build setups default to.


But when you e.g. compile for instructions that were recently introduced, and omit fallbacks (e.g. via intrinsics), and run it on an older CPU, you'll get this.


For example, some recent TensorFlow builds just assume your CPU has AVX instructions, which didn't exist in any CPUs from before 2011[2] and still don't in some lower-end x86 CPUs (Pentium, Celeron, Atom).
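
Before blaming a build, you can check what the CPU actually advertises. This is a minimal sketch, assuming Linux, where /proc/cpuinfo lists feature flags on a line named "flags" (x86) or "Features" (ARM). The names cpu_flags and supports are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # x86 kernels call the line "flags"; ARM kernels call it "Features".
        if sep and key.strip() in ("flags", "Features"):
            flags.update(value.split())
    return flags

def supports(feature, path="/proc/cpuinfo"):
    """Check whether the running machine lists the given feature flag."""
    with open(path) as f:
        return feature in cpu_flags(f.read())

if __name__ == "__main__":
    print("avx listed:", supports("avx"))
```

If supports("avx") comes back False on a machine where an AVX-only build dies with "Illegal instruction", that is a strong hint about the cause.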