Program crash messages
Segfault, Bus Error, Abort, and such
- segfault means the kernel says: there is something at that address, but your process may not access it
- bus error means the kernel says: that address doesn't even exist - anymore, or at all
- (both are kernel responses to hardware signals from the memory controller)
- pointer bugs can lead to either segfault or bus error
- ...note that specific bugs are biased to cause one or the other, due to the likelihood of hitting existing versus non-existing addresses
- (and various things can influence that likelihood, e.g. 32-bit address spaces usually being mostly or fully mapped, 64-bit ones not)
- abort() means code itself says "okay, continuing running is a Bad Idea, let's stop now"
- usually based on a test that should never fail.
- if you look from a distance it's much like an exit(). The largest practical differences:
- abort implies dumping core, so that you can debug this
- abort avoids calling exit handlers
- ...and the earlier this happens, the more meaningful debugging of the dumped core is. Hence the explicit test and abort.
- a fairly common case is a failed memory allocation (as signalled by code that actually checks the result; not checking often means a segfault very soon after, particularly when dereferencing null)
A segfault signals that while the requested memory location exists, the process is not allowed to do what it is trying. (segmentation refers to the fact that processes are segmented from each other)
Cause is often one of:
- Writing to read-only memory
- the address is not part of the requesting process's currently mapped space
- including the null pointer dereference, because most OSes don't map it to processes
- buffer overflow when it gets to memory outside the mapped space
- a stack overflow can cause it (though other errors may be more likely, because depending on setup it may trample all of the heap before it does)
A segfault is raised by hardware that supports memory protection (most modern computers), which notifies the kernel of the violation. On Linux the kernel then sends the process a signal (SIGSEGV), and the default signal handler quits the program.
(core dumped) means it dumped process memory to a file for debugging purposes (the word core is historic, referring to magnetic core memory. It seems it stuck because this is a nicely specific term)
A bus error means the processor / memory subsystem cannot even attempt to access the memory it was asked to access.
Like a segfault it is raised by hardware and received by the kernel, which on Linux handles it by sending the process SIGBUS, triggering the default signal handler.
Possible causes include:
- address does not make sense, in that it cannot possibly be there
- a pointer that was uninitialised/non-zeroed (and so contains whatever the memory it gets mapped to contained previously) has a decent chance of being this or a segfault
- device became unavailable (verify)
- a device has to report that something is unavailable, e.g. a RAID controller refusing access to a broken drive (e.g. search for this in combination with "rejecting I/O to offline device")
- cannot page in the backing memory (verify)
- a broken swap device? (verify)
- accessing a memory-mapped file that was removed (though in some cases the OS/filesystem may keep the storage around until it is not used, making this impossible)
- executing a binary image that was removed (similar note as above)
- address fails the platform's alignment requirements
- larger-than-byte units often have to be aligned to their size, e.g. a 16-bit value to an address that is a multiple of 2 bytes
- Less likely on x86 style platforms than others (x86 is more lenient around misalignments than others)
- Theoretically rare anyway, as compilers tend to pad data that is not ideally aligned.
- memory mapped IO where the backing device is not currently available
- ...or ran out of space, e.g. when mmapping on a ram disk (verify)
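The padding mentioned in the alignment item can be observed from Python. This is a small sketch, assuming CPython's ctypes and struct modules; the structure name Record and its fields are made up for the illustration. ctypes lays out a structure the way the platform's C compiler would, while the struct module's "=" mode packs fields with no padding at all.

```python
import ctypes
import struct

class Record(ctypes.Structure):
    # One byte followed by a 32-bit integer, laid out like a C struct.
    _fields_ = [("tag", ctypes.c_char), ("value", ctypes.c_int32)]

int_alignment = ctypes.alignment(ctypes.c_int32)
value_offset = Record.value.offset      # where the compiler-style layout puts it
natural_size = ctypes.sizeof(Record)    # includes any padding
unpadded_size = struct.calcsize("=ci")  # same two fields, no padding

print("int32 alignment requirement:", int_alignment)
print("offset of 'value' in Record:", value_offset)
print("size with padding:", natural_size, "/ without padding:", unpadded_size)
```

On most platforms the integer ends up at offset 4 rather than 1. Data read from a packed buffer (a file format, a network packet) at its unpadded offset is where misaligned accesses tend to come from.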
In comparison to a segfault:
- similar in that it is about the address
- and having a mangled or random-valued pointer value could lead to either
- similar in that both are raised by the underlying hardware, that the OS sends the originating process a signal, and that the default (kernel-supplied) signal handler kills that originating process.
- differs in that a segfault means the request is valid in a mechanical way, but the requesting process may not do this operation
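A bus error does not need a bad pointer. In the sketch below (assuming Linux and CPython; CHILD and run_child are made-up names) the address is valid and mapped, but the file behind the mapping is truncated, so there is nothing left to page in. It also shows the note about removed files: unlinking the file alone does no harm, because the storage stays around while the file is still open.

```python
import signal
import subprocess
import sys

CHILD = """
import mmap, os, tempfile
fd, path = tempfile.mkstemp()
os.unlink(path)                    # removing the name alone is harmless
os.ftruncate(fd, mmap.PAGESIZE)    # give the file one page of content
mm = mmap.mmap(fd, mmap.PAGESIZE)
first = mm[0]                      # fine: the page is backed by the file
os.ftruncate(fd, 0)                # now the backing storage is gone
mm[0]                              # the access that is expected to fault
"""

def run_child(code):
    """Run `code` in a fresh interpreter and return its return code."""
    return subprocess.run([sys.executable, "-c", code]).returncode

rc = run_child(CHILD)
if rc < 0:
    print("child was killed by", signal.Signals(-rc).name)
else:
    print("child exited normally with code", rc)
```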
Aborted (core dumped)
The direct cause is that the process received a SIGABRT.
The default action for that signal is to stop the process (without calling exit handlers(verify)) and dump core, mostly with the intent of producing a core dump that is as meaningful as possible.
The signal usually comes from the process itself: calling abort() raises SIGABRT, typically after a check that should never fail has failed.
Intentionally stopping/crashing as soon as possible is done for two good reasons:
- the sooner you do, the more meaningful the core dump is to figuring out what is wrong
- the sooner you do, the less likely it is you wrote corrupted data to persistent storage (or other systems that do)
In practice, many aborts come from libraries that notice a serious runtime problem, e.g. glibc noticing double free()s, malloc() noticing overflow corruption, etc.
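The difference between exiting and aborting can be observed from a parent process. This sketch assumes a Unix-like system and uses Python's atexit as a stand-in for C exit handlers, which is an analogy rather than the same mechanism. The child programs and the helper run_child are made up for the illustration.

```python
import signal
import subprocess
import sys

EXIT_CHILD = """
import atexit, sys
atexit.register(lambda: print("exit handler ran"))
sys.exit(3)
"""

ABORT_CHILD = """
import atexit, os
atexit.register(lambda: print("exit handler ran"))
os.abort()
"""

def run_child(code):
    """Run `code` in a fresh interpreter; return (return code, stdout)."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()

exit_rc, exit_out = run_child(EXIT_CHILD)
abort_rc, abort_out = run_child(ABORT_CHILD)

print("exit: ", exit_rc, repr(exit_out))
# A negative return code means the child was killed by that signal.
print("abort:", signal.Signals(-abort_rc).name, repr(abort_out))
```

The exiting child runs its handler and reports its exit code. The aborting child is killed by SIGABRT before any handler gets a chance to run.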
On core dumps
A process core dump contains (most/all?(verify)) writeable segments specific to the process, which basically means the data segment and stack segment.
A core dump uses ELF format, though it seems to be a bit of a de facto thing, wider than the ELF standard.
By default it does not contain the text segment, which contains the code, which is why when debugging you also have to tell the debugger what executable was being used.
It wouldn't be executable even then, since it's missing some details (entry point, CPU state).
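Whether a core was dumped at all is reported to the parent process in the wait status, which is what a shell looks at before it adds "(core dumped)" to its message. A minimal sketch, assuming a Unix-like system; wait_status_of and describe are made-up helper names.

```python
import os
import signal

def wait_status_of(child_fn):
    """Fork, run child_fn in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: if child_fn returns or raises instead of dying, leave
        # immediately without running the parent's cleanup code.
        try:
            child_fn()
        finally:
            os._exit(0)
    _, status = os.waitpid(pid, 0)
    return status

def describe(status):
    """Build a shell-like message from a wait status."""
    if os.WIFSIGNALED(status):
        message = signal.strsignal(os.WTERMSIG(status)) or "Unknown signal"
        if os.WCOREDUMP(status):
            message += " (core dumped)"
        return message
    return "exited with code %d" % os.WEXITSTATUS(status)

status = wait_status_of(os.abort)
print(describe(status))
```

Whether the suffix appears depends on system settings such as the core size limit (ulimit -c), so the same crash can print with or without it.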
Illegal instruction (core dumped)
Illegal instruction means the CPU got an instruction it did not support.
(the CPU signals this via what in the kernel becomes SIGILL - which has further uses)
Most applications are compiled to avoid this ever happening, by being conservative about which CPU features they assume, which is what most compilers and build setups default to.
So this usually seems to happen on relatively custom software or relatively special builds for recent CPUs.
For example, recent tensorflow builds assume your CPU has AVX, which didn't exist in any CPUs from before 2011 and still doesn't in some lower-end x86 CPUs (Pentium, Celeron, Atom).
And neural stuff is slow enough on a fast CPU that you probably wouldn't want to run it on an Atom, so it's not unreasonable, though an error message would be nice.
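The kind of error message wished for here could come from checking CPU features before loading the optimized code. The following is a sketch only, assuming x86 Linux, where /proc/cpuinfo has a "flags" line (other architectures name it differently). parse_cpu_flags, require_flags and the SAMPLE text are made up for the illustration and are not how any particular library does it.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def require_flags(needed, cpuinfo_text):
    """Raise a readable error if any needed CPU feature is missing."""
    missing = sorted(set(needed) - parse_cpu_flags(cpuinfo_text))
    if missing:
        raise RuntimeError("CPU lacks required features: " + ", ".join(missing))

# Made-up sample resembling an older CPU without AVX.
SAMPLE = "processor\t: 0\nflags\t\t: fpu sse sse2 ssse3\n"

try:
    require_flags(["avx"], SAMPLE)
except RuntimeError as err:
    print(err)
```

On a real system the text would come from reading /proc/cpuinfo, e.g. require_flags(["avx"], open("/proc/cpuinfo").read()).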