Debugging
What-does-it-spit-out debugging
print statements
...While various IDEs let you hook into the execution environment, step through lines one at a time, and see what happens in memory, and that is sometimes necessary...
...for many bugs, just printing out what happens at the crucial steps is simple, effective, and almost universally available.
If this is something that might regress, consider doing it via logging instead: it makes it easier to filter the output, to check it after the fact, and it is handier when debugging is more likely to happen via a helpdesk than by the programmer who wrote the thing.
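For example, a minimal sketch of the same print-style debugging via Python's standard logging module (the function and messages are hypothetical, purely for illustration):

import logging

logging.basicConfig(level=logging.DEBUG)    # real code usually configures this once, at startup
log = logging.getLogger(__name__)

def parse_record(line):                     # hypothetical function, just for illustration
    log.debug("parsing %r", line)           # like a print(), but filterable by level and logger
    fields = line.split(";")
    log.debug("got %d fields", len(fields))
    return fields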
checking that invariants are actually invariant
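A minimal sketch of what that can look like in code (the function and the specific checks are hypothetical, purely for illustration): assertions state what should always hold, so you notice the moment it doesn't.

def withdraw(balance, amount):
    assert amount >= 0, "invariant: withdrawals are never negative"
    new_balance = balance - amount
    assert new_balance <= balance, "invariant: withdrawing never increases the balance"
    return new_balance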
strace
A tracer of system calls: it reports the syscalls a program makes, printing them to stderr as they happen, with no instrumentation or alteration of your program necessary.
For example, file-related operations will use fstat and probably open, so you can see which files are being accessed, and you can also see reads and writes, sockets opening and closing, and such.
Good for debugging when programs don't say enough about what they're doing, and decent for characterizing the workload of things that seem slow.
strace can run a command for you, or you can attach to a running process via its PID (-p)
For things that fork off, you'll want -f to follow that.
To compare time spent in each syscall, use -c.
You can filter which syscalls are reported by their names, or with a few predefined categories:
- file: any file-related calls (access, stat, read, write, etc.)
- desc: file descriptor related
- process: process management, e.g. fork, wait, exec
- network
- signal
- ipc
- memory
For example:
# only mention open() calls
strace -eopen,openat ls

# see whether there are a bulk of stat()s done (for how many, use -c)
strace -estat,fstat,lstat,newfstatat find /tmp

# see what programs are invoked
strace -f -eprocess service apache2 status

# summarize time spent in each syscall
strace -c ls -l /proc

# comparing e.g. ls on small and large dirs
strace -c ls -l /data/largedir
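And, as mentioned above, you can attach to something that is already running (the PID here is just a placeholder), following any children it forks:

# attach to an already-running process and its children
strace -f -p 1234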
For example, say you notice that ls -l seems to do a lot of lgetxattr()s. Running with -c would show that for large directories getdents dwarfs everything else, and that the lgetxattr calls are nearly free, given that you've just lstat()ed the same entries.
Logging
System logging
Windows Event Log
Live debugging
From the IDE
Terms
- Step into: Go into the called function's workings. (does the same as 'over' if not a call)
- Step over: Perform evaluation/call, but don't go into it, and don't display ("I trust this, give me the next thing at the current level")
- Step out: "I've seen enough of this function's guts, go (back) to its caller."
Note: while 'over' may suggest skipping code execution, none of these options skip execution; 'over' just skips showing it in the debugger.
Debug and performance inspection tools
ltrace
An execution shim that shows calls into shared libraries.
https://en.wikipedia.org/wiki/Ltrace
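For a rough idea of use (command here is just an example):

# show the library calls a command makes
ltrace ls
# or just a summary of counts and time per call
ltrace -c ls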
pcstat
Page Cache statistics
For individual files, shows how much of their contents is currently in the page cache.
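Usage is roughly the following (the path is just an example):

pcstat /var/log/syslog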
See also:
echo l > /proc/sysrq-trigger
...gives the backtrace for what's on each CPU.
Writes it to the kernel log (visible via dmesg), which often also ends up in something like kern.log, syslog, or messages.
kernel profiler sources
kprobes
(kernel functions)
kernel tracepoints
dtrace
userspace profiler sources
uprobes
(userspace functions)
USDT (Userland Statically Defined Tracing)
A way of embedding dtrace probes into an app
LTTng userspace
profiler tools / frontends
(Note that some basic things can be gotten from /proc)
dtrace (solaris, also freebsd, linux, osx, smartos)
Very cool tools originating on Solaris.
The rest are ports, some close to the Solaris version, some further away.
perf (linux)
a.k.a. perf_events, since 2.6 kernels.
It basically abstracts away the hardware specifics when doing performance counting/measurement.
It also introduced a few CLI tools that are easier to use than many that came before.
https://perf.wiki.kernel.org/index.php/Tutorial
http://www.brendangregg.com/perf.html
https://www.brendangregg.com/blog/2014-07-03/perf-counting.html
https://github.com/brendangregg/perf-tools.git
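For a rough idea of basic use (the traced command here is just an example):

# counter summary for a command (cycles, instructions, branch misses, etc.)
perf stat ls -l /proc

# sample stacks ~99 times per second while the command runs, then browse the result
perf record -F 99 -g -- ls -l /proc
perf report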
systemtap (linux)
systemtap seems to imitate dtrace decently
https://en.wikipedia.org/wiki/SystemTap
https://sourceware.org/systemtap/
https://sourceware.org/systemtap/tutorial.pdf
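A minimal example in the style of its tutorial, mostly to check that systemtap works at all (it needs matching kernel debug info / headers installed):

stap -e 'probe begin { printf("hello world\n"); exit() }'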
ptrace
ftrace
BPF (linux, bsd, more?)
extended Berkeley Packet Filter (eBPF, often just BPF) originated in network packet filtering, but grew so flexible that it is also very useful for system tracing (and never got a new name).
It compiles user requests into sandboxed bytecode to be run in the kernel, which makes it a mix of safe, flexible, and fast.
See also:
- BPF Compiler Collection (BCC) makes it easier to use these mechanisms, and to bolt scripts on top of them. Has various examples.
- bpftrace bolts a scripting language on top of BCC and more (see the one-liner below)
- (compiled at runtime, so it requires kernel headers and a compiler)
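For example, a bpftrace one-liner in the style of its documentation (assuming a kernel recent enough to have these tracepoints):

# print which processes open which files
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'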
LTTng
OProfile
https://en.wikipedia.org/wiki/OProfile
Semi-sorted
Instruments (osx)
https://en.wikipedia.org/wiki/Instruments_(software)
Xperf (windows)
https://blogs.msdn.microsoft.com/ntdebugging/2008/04/03/windows-performance-toolkit-xperf/
flame graphs
For context (though I think they came later): Firefox and Chrome have Flame Charts, which basically just show the stack over time. Since this is usually shown at a scale useful to find code that slows down the framerate, and JS is single-threaded, this tends to look fairly clean.
CPU flame graphs instead are primarily geared to show how common a particular stack is.
While it does not show a timeline (the x axis is sorted alphabetically), or really collect the amount of time spent, the number of times the samples mention a specific function is still a good indication of where most time is spent (particularly when things are unusually slow, which often means one function dominates).
It's regularly used with statistical profiling, which takes samples frequently (though ideally not in step with anything else). For specific functions this is more approximate than cycle-accurate profiling, but it can be a more real-world-accurate view of the overall execution, because the profiling itself has minimal impact (on CPU, but also on cache).
The file format is basically just lines of (verify)
stackframe;stackframe;stackframe count
Notes:
- the count is technically optional, but aggregating whenever you can makes these files a lot shorter.
- a second count column is used for the differential variant
- variants use some annotation on the function name
- See flamegraph.pl's comments for more detail
It was popularized by Brendan Gregg, who specializes in digging into the lower levels of systems, so there are various relevant tools for kernel use (like syscalls or network efficiency), basically scripts that use dtrace, perf, SystemTap, etc.
Various runtimes can also output things useful for these.
There is sometimes value in producing them from code, e.g. getting a better idea of database use with entries like stack;SQL milliseconds
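For example, the commonly documented pipeline around perf and the FlameGraph scripts looks roughly like this (./myprogram is a placeholder; the two perl scripts come from the original FlameGraph repository):

# sample stacks of a command at 99Hz
perf record -F 99 -g -- ./myprogram

# fold the samples into 'stack;stack;stack count' lines, then render an SVG
perf script | ./stackcollapse-perf.pl > out.folded
./flamegraph.pl out.folded > out.svg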
Variations on the theme:
- Off-CPU flame graphs - intercept mainly the syscalls related to why something is not scheduled on a CPU, such as general scheduling and IO
- Hot/Cold flame graphs - combine CPU and Off-CPU, to get an indication of how a program is scheduled
- Memory flame graphs - intercept mainly the calls related to memory allocation
- differential flame graphs - mainly meant to indicate performance regressions/improvements; takes two CPU flame graphs and shows changes in time spent
- since color is based on newstackvalue-oldstackvalue, this has issues like renames making things disappear
The original tool is [1], and see also [2], and there are various that imitate it, like [https://www.npmjs.com/package/stackvis stackvis].
http://www.brendangregg.com/flamegraphs.html
Unsorted
http://dtrace.org/blogs/brendan/