                         Basic Debugging HOWTO

   Original Author: Peter Elmer <[email protected]>
   Version: $Id: HOWTO-Basic-Debugging,v 1.10 2001/08/08 21:04:36 elmer Exp $

   This document contains notes on how to do "Basic Debugging" of
   problems seen while running BaBar executables. Here "Basic Debugging"
   is defined as obtaining the basic information necessary to file a
   useful Remedy problem report (while not necessarily solving the
   problem oneself). It includes short sections on debuggers, types of
   crashes, the use of Framework Actions, hung jobs and filing a useful
   problem report.

   ______________________________________________________________________

   Table of Contents:

   1. Introduction
   2. Reporting problems
      2.1 Where to post a problem report
      2.2 What information is useful?
   3. Types of crashes
   4. Debuggers
      4.1 core files
      4.2 dbx (Sun)
      4.3 ladebug (OSF)
      4.4 gdb (Linux)
      4.5 Sun Workshop (Sun)
      4.6 Other misc. notes
   5. Obtaining information from a traceback (an example)
   6. Hung jobs
      6.1 attaching processes in the debugger
      6.2 pstack on SunOS5
   7. Useful Framework features and tricks
      7.1 NameAction
      7.2 Skipping events
   8. Searching the release for code fragments
   9. Hypernews
   10. Remedy
   11. Misc. Tricks
      11.1 Putting the core file someplace besides the current directory
      11.2 "watchmalloc" on Sun for finding memory management problems
      11.3 Truss and strace
      11.4 MALLOC_CHECK_ on Linux
   12. Sources of further information
   13. Credits

   ______________________________________________________________________

1. Introduction

   The BaBar code base consists of many millions of lines of code, and a
   large number of executables are in general use on (as I write this)
   three different platforms. Almost all of us will at times come up
   against various types of crashes or hangs while running executables
   with which we are not 100% familiar.

   The purpose of this HOWTO is to explain the basic debugging
   techniques that would be used by a relatively expert user who has a
   problem while running an unfamiliar executable. It is hoped that
   after reading this HOWTO even users who do not consider themselves
   expert will find that they have the basic tools in hand for (a)
   putting together a useful problem report for the expert on the
   problematic code and (b) figuring out who that expert would be or
   where to send the report (and perhaps even for resolving the problem
   themselves).

   A large fraction of the notes in this document apply to debugging
   any piece of code with which you may work, but some parts are
   clearly BaBar specific. The emphasis is on debugging problems that
   come from the executables themselves rather than "external"
   problems, so there is no explicit discussion of Objectivity, Root or
   infrastructure problems (afs, nfs, network, etc.).

2. Reporting problems

   Before diving into any technical details of how to pull together
   information for a problem report, this section describes what
   information is useful and where to post it.

   2.1 Where to post a problem report

   In BaBar, we use Remedy to report and track problems in our code.
   However, before posting a Remedy report, it is useful (and
   recommended) to search to see if that problem has already appeared
   either in Remedy or in Hypernews (HN). Often problem reports first
   appear in HN groups as people try to understand if a particular
   problem is due to user error or the code itself. Both HN and Remedy
   are described in a later section.
   A useful strategy for proceeding when you have a problem while
   running some BaBar executable (such as a crash) is:

   o Verify that...
     - you can make the crash happen again
     - you are running standard code, i.e. part of a BaBar release
     - you are running on a correctly installed base
     - if you have new code in the executable, you have established,
       with reasonable diligence, that your code is not the site of
       the crash
     - you have a good build

   ...To Finish...

   2.2 What information is useful?

   When posting a question to HN or a bug report to Remedy it is
   important to remember that the people reading the problem report
   don't really know what you are doing (there are hundreds of people
   doing many different things in many different ways in BaBar), so
   providing a basic set of information can really simplify things.
   The goal should be to provide enough information that the exact
   situation can be recognized and/or duplicated if necessary. A basic
   set of useful information would include:

   o Executable being run and tcl script

   o Release and platform
     - What code release were you using? Which platform?

   o Tags checked out
     - Were there any additional tags checked out in the test release,
       or did you make any modifications?

   o Environment variables set
     - Did you have any special environment variables set which would
       affect the running of the job? (Or perhaps while the executable
       was being compiled/linked, for example in Beta.)

   o Patch files, if any, e.g. BearPatches.tcl

   o Type of crash
     - See section 3 below.

   o Traceback
     - See sections 4 and 5 below.

   o Input type, collection, Objectivity or Kanga
     - If relevant. If you see the same result with multiple
       collections, that is also useful information.

   o Event number, begin-of-job or end-of-job?
     - In which event did the crash happen? Or did it happen at the
       beginning or end of the job?

   All of this information may not be necessary in every case, but you
   will likely be asked for more of the information from this list if
   your problem is not recognized by the expert.

3. Types of crashes

   The first thing to note is that when your program crashes, the last
   (or one of the last) lines in your log file from that job will
   contain some printout like (for example) "Segmentation fault". What
   this tells you right away is what type of "signal" was sent to your
   program when it tried to do something that caused the crash. A
   general discussion of "signals" is beyond the scope of this HOWTO,
   but as this is the first clue one usually has as to what has
   happened, this section describes the most common ones and the sorts
   of things that may cause them. This is not intended to be
   exhaustive, but instead to point out that the "signal" sent already
   contains some useful information.

   By way of introduction: in Unix, the interface between a program and
   the operating system (the kernel) is meant to mimic common hardware
   interfaces. For example, when a piece of hardware, say a disk
   controller, needs the attention of the CPU, it can issue a special
   electrical signal called an interrupt. The processor is diverted, as
   a result, from its thread of execution so that it can attend to the
   disk drive. The processor can mask interrupts when it is doing
   critical things, and there are many kinds of interrupts (signals)
   that can occur. The interface between your program and the operating
   system follows this metaphor. When something important happens, as
   detected by the kernel, your program will be delivered an interrupt
   (a signal) that your program may process.
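   To make the idea of "processing" a signal concrete, here is a
   minimal toy sketch (not BaBar code; the file name, the handler and
   the deliberate NULL dereference are invented purely for
   illustration) of a program that installs its own handler for the
   segmentation-fault signal instead of being terminated by it:

      sunprompt> cat handler.c
      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>

      void handler(int sig)
      {
        /* called by the system instead of the default action */
        printf("caught signal %d\n", sig);
        exit(1);
      }

      main()
      {
        int* bunk = 0;
        signal(SIGSEGV, handler);  /* register the handler */
        printf("%d\n", *bunk);     /* deliberate NULL dereference */
      }
      sunprompt> cc handler.c
      sunprompt> ./a.out
      caught signal 11

   If the signal(...) line is removed, the default action applies and
   the program is simply terminated (dumping core), producing the kind
   of message described for each signal below.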
   There is a predetermined set of these signals (see below and
   /usr/include/sys/signal.h or a POSIX guide) and you may associate a
   piece of code to execute upon the receipt of any of the signals. If
   you have no "signal handler" installed to "catch" the delivery of a
   signal, your program is terminated. A number of common signals are
   used to indicate severe program faults, as described next. Note that
   there is some variation in the printout seen with platform and
   compiler. Below I also include the common abbreviations (e.g.
   "SEGV"). So without further ado:

   o SEGV - A segmentation violation or segmentation fault typically
     means that something is trying to access memory that it shouldn't
     be accessing. One common example of this is trying to access
     memory through a NULL pointer, for example:

        sunprompt> cat main.c
        #include <iostream.h>
        main()
        {
          int* bunk(0);
          cout << *bunk << endl;
        }
        sunprompt> CC main.c
        sunprompt> ./a.out
        Segmentation fault (core dumped)

   o ABRT - asserts are one common source of the "abort" signal, for
     example:

        sunprompt> cat main.c
        #include <assert.h>
        main()
        {
          int i=0;
          assert(i!=0);
        }
        sunprompt> CC main.c
        sunprompt> ./a.out
        Assertion failed: i!=0, file main.c, line 5
        Abort (core dumped)

     Note that the actual assertion which failed and its location are
     also printed.

   o FPE - A "Floating Point Error" usually indicates a numerical
     problem such as a division by zero or an overflow. One example
     would be:

        osfprompt> cat main.c
        main()
        {
          float a = 1.;
          float b = 0.;
          float c = a/b;
        }
        osfprompt> g++ main.c
        osfprompt> ./a.out
        Floating exception (core dumped)

   o ILL - If you receive a signal like this ("Illegal Instruction"),
     it means that, while running, your program has tried to execute a
     machine "instruction" which does not exist. This can happen for a
     variety of reasons, including:

     (a) a memory overwrite that happens to overwrite part of the
         program stored in memory. This may result in the program
         trying, for example, to execute data as if it were a machine
         instruction.
     (b) an attempt to take an executable compiled on one platform and
         use it on another, for example on an earlier version of the
         same chip.
     (c) a truncated or corrupted executable is loaded for execution.
     (d) incomplete recompilation of source code, i.e. you changed one
         C++ class and didn't recompile all other code affected by
         that change.

   o BUS - A "Bus Error" may come, for example, from accessing
     unaligned data (e.g. trying to access a 4 byte integer with a
     pointer to the middle of it). What this means will vary from
     platform to platform. (I haven't come up with a good example of
     this one yet.) A "Bus Error" can also often indicate a memory
     overwrite, e.g. somebody wrote a number where a pointer is kept.
     _Often_ caused by going past the end of an array and into the
     system pointers at the start of the next memory block.

4. Debuggers

   The previous section described at a basic level the types of signals
   that may occur. If our programs were only 10 lines long, the
   "signal" would be enough to understand the origin of any crash by
   simple code inspection. Our executables are clearly much larger than
   this, so the next step is to determine what the code was doing when
   the crash occurred. The easiest way to do this is to use a debugger.

   4.1 core files

   After a crash, you should find a "core" file in the directory where
   the program was being executed. This is a memory image of the
   process when it crashed and can be used to figure out what the
   process was doing when the crash occurred.
   If your job crashes and you do not find a "core" file, your shell
   settings are probably set to avoid core dumps:

   - If you use tcsh:
     o type 'limit coredumpsize'. The value is probably set to '0'.
     o type 'limit coredumpsize unlimited' to remove the limit.

   - If you use bash:
     o type 'ulimit -c'. The value is probably set to '0'.
     o type 'ulimit -c unlimited' to remove the limit.

   You will have to rerun your executable, and after it crashes again
   you should find a core file.

   To check that the core file does in fact come from the executable
   you were running, type 'file core'. It will tell you which
   executable the core comes from as well as some other information.
   (Not all systems do this, but our three supported systems Sun,
   Linux2 and OSF do.)

   See section 11.1 for details about handling very large core files.

   4.2 dbx (Sun)

   The non-GUI debugger available on Sun is called 'dbx'. As an example
   of using the debugger to examine a core file, here is a slightly
   modified version of the program which caused the SEGV above:

      sunprompt> cat main.c
      #include <iostream.h>
      void foo()
      {
        int* bunk(0);
        cout << *bunk << endl;
      }
      main(int argc, char *argv[])
      {
        foo();
      }
      sunprompt> CC -g main.c
      sunprompt> ./a.out
      Segmentation fault (core dumped)
      sunprompt> dbx a.out core
      Reading symbolic information for a.out
      core file header read successfully
      Reading symbolic information for rtld /usr/lib/ld.so.1
      dbx: program is not active
      Reading symbolic information for libm.so.1
      Reading symbolic information for libC.so.5
      Reading symbolic information for libw.so.1
      Reading symbolic information for libc.so.1
      Reading symbolic information for libdl.so.1
      Reading symbolic information for libc_psr.so.1
      program terminated by signal SEGV (no mapping at the fault address)
      Current function is foo
          4       cout << *bunk << endl;
      (dbx) where
      =>[1] foo(), line 4 in "main.c"
        [2] main(argc = 1, argv = 0xeffff2fc), line 9 in "main.c"
      (dbx) quit

   Compilation/linking of a real BaBar executable is clearly done
   differently than in this toy example, but a traceback is obtained in
   the same way. For a crash in BetaApp, for example, one would type:

      sunprompt> dbx BetaApp core

   and then 'where' as above. For more details on interpreting the
   traceback, see section 5. For more information on commands type
   'help' at the dbx prompt.

   The dbx debugger on Sun has one quirk with templates: if you see a
   lot of complaints about them when you type 'where', simply type
   'where' a second time and they will go away (leaving you with a
   clean traceback).

   4.3 ladebug (OSF)

   On OSF a debugger called 'dbx' exists, but it is not for use with
   C++. The C++ debugger is called 'ladebug' (it can also be used for
   Fortran and C). As for dbx, it is launched to examine a core file
   with:

      osfprompt> ladebug your.exe core

   where "your.exe" is the name of your executable. The command 'where'
   will work as for dbx. For more information on commands, see the man
   page ('man ladebug') or type 'help' at the ladebug prompt.

   4.4 gdb (Linux)

   The most common debugger used on Linux is 'gdb'. The syntax for
   examining a core file is:

      linuxprompt> gdb your.exe core

   Some basic information is available from the man page ('man gdb'),
   but for detailed further information on gdb on Linux type
   'info gdb'. The command 'where' will work; equivalent alternatives
   in gdb are 'backtrace' or 'bt'.

   4.5 Sun Workshop (Sun)

   There is a product called Workshop on the Suns. It provides a
   graphical interface to debugging tools and it lives in
   /opt/SUNWspro/bin/workshop. Executing this command brings up a small
   list of icons.
   The one with a bug under a red barred-circle starts the debugger.
   The one that looks like a histogram is a performance analyzer. There
   are help boxes and web pages you can call up.

   The debugger can analyze memory usage patterns (see the "Windows"
   pull-down) as well as performance data. The basic pattern for the
   latter is to locate the menu with the "Collector" item and launch it
   (this menu is in the debugger). Select what you want the Collector
   to monitor, then begin execution under the debugger. When the
   program finishes, launch the Analyzer, which will show you timing
   distributions. You tell the Analyzer the name of the "experiment"
   that you set in the Collector window. Note that the Analyzer has
   trouble running on a display that is fed by a channel established
   with ssh.

   The Workshop debugger is really just a facade on top of dbx. You can
   use it to *learn* about dbx by asking it to show the dbx window. You
   drive the debugger with its GUI and watch what dbx commands come
   out. Or, you can just do whizbang stuff with the dbx window when the
   GUI is not sufficient.

   4.6 Other misc. notes

   ...To Finish...

   o debugging symbols, optimization
   o core files versus running in the debugger itself

5. Obtaining information from a traceback (an example)

   Here is an example traceback obtained from a crash of Elf (Remedy
   report 3455):

      t@1 (l@1) signal SEGV (no mapping at the fault address) in
      MapChipNode::nChannels (optimized)
      =>[1] MapChipNode::nChannels(this = (nil)) (optimized), at 0x133a9d4
        [2] L1DTsfTCDigiModule::convert(this = ???, list = ???, tc = ???) (optimized), at 0xcc80b0
        [3] L1DTsfTCDigiModule::event(this = ???, anEvent = ???) (optimized), at 0xcc7ec8
        [4] APPSequence::event(this = ???, anEvent = ???) (optimized), at 0x1562af4
        [5] APPSequence::event(this = ???, anEvent = ???) (optimized), at 0x1562af4
        [6] APPSequence::event(this = ???, anEvent = ???) (optimized), at 0x1562af4
        [7] AppFramework::event(this = ???, anEvent = ???) (optimized), at 0x155a570
        [8] OepFrameworkDriver::processTransition(this = 0xefffe840, tr = 0xed300008), at 0x134fc8c
        [9] OepFrameworkDriver::processDatum(this = 0xefffe840, datum = 0x49b2f18), at 0x134d618
        [10] OepFrameworkDriver::readLoop(this = 0xefffe840), at 0x134cb44
        [11] OepFrameworkDriver::init(this = 0xefffe840, interp = 0x31b03c8), at 0x134c684
        [12] OepFTclMain::run(this = 0xefffe840, argc = 1, argv = 0x31ab4d4), at 0x134a6f4
        [13] main(0x2, 0xefffea04, 0x1a50fd8, 0x31ab4d0, 0x31b0340, 0x0), at 0x869ec0

   The first thing that appears in the traceback is the "signal"
   received (SEGV in this case, see section 3) and then the entire
   series of "frames", or functions called. The lowest numbered one is
   the one in which the problem occurred, and you can see the entire
   sequence of calls all the way back up to the "main" routine of the
   program. Several things can be seen immediately:

   o This crash is a SEGV, and from frame "[1]" we can see that the
     nChannels member function of a "MapChipNode" object has been
     called through a null pointer (note that this = nil).

   o The Framework module that was running at the time this happened
     was L1DTsfTCDigiModule (follow the traceback up until you get to
     the first entry of the form Xxxx::event; this will likely be the
     guilty module if event processing was going on).

6. Hung jobs

   Occasionally you may find that you run an executable and it "hangs",
   i.e. it seems to stop making progress processing events (or whatever
   it should be doing) and stays in one place without any further
   output. How to debug this?
   One first clue as to what is happening is whether the process is
   taking CPU or not (i.e. look with 'top' or something similar on the
   machine where the job is running). If it is taking lots of CPU
   without seemingly progressing, it may be caught in an infinite loop.
   If the job doesn't take CPU, it may be waiting for something. For
   more information as to what is going on, there are a couple of
   possibilities:

   6.1 attaching processes in the debugger

   All of the debuggers that we currently use (see section 4 above)
   also allow one to "attach" to an already running process to see what
   it is doing. First determine the process ID for your running
   executable and then do (on Linux for example):

      linuxprompt> gdb your.exe <PID>

   where "your.exe" is the name of the executable you are running
   (including the path if necessary) and <PID> is the process ID of the
   running process. Analogous things are available for the other
   debuggers (see the man pages for those). You need to do this on the
   machine where the job is running (clearly) and will in general only
   be able to attach processes which belong to you. Once you are in the
   debugger, you can type 'where' or anything else you might do to get
   more information about what is going on.

   6.2 pstack on SunOS5

   On the Solaris machines there is a useful command which will provide
   a traceback-like list of function calls for a running executable.
   Simply determine the process ID of the running executable with 'top'
   or 'ps' and then do:

      sunprompt> /usr/proc/bin/pstack <PID>

   where <PID> is the process ID.

7. Useful Framework features and tricks

   7.1 NameAction

   The Framework has a variety of "Actions"; one of the most useful
   ones for debugging is "NameAction". Consider the following
   situation, for example: occasionally a memory overwrite can render a
   traceback useless:

      Thread received signal SEGV
      pc address 0x100000000 is invalid; substituting RA
      stopped at [ 0x100000001]
      (ladebug) where
      >0  0x100000001

   In this case it is clear that a memory overwrite has happened: the
   program counter (pc, which normally contains the memory address of
   the next machine instruction to execute) has an unrealistic value.
   To determine which Framework module was running when this happened,
   add the following to your tcl file just before the 'ev begin':

      action enable NameAction

   The Action prints a message just before and after each module
   processes an event:

      <...>
      APPExecutable: SvtEffMon: before processing an event...
      APPExecutable: SvtEffMon: after processing an event...
      APPExecutable: DchOprMon: before processing an event...
      Thread received signal SEGV
      pc address 0x100000000 is invalid; substituting RA
      stopped at [ 0x100000001]
      (ladebug) where
      >0  0x100000001

   and which module was running at the time becomes clear. The output
   is a bit, well, voluminous given the number of modules typically
   running, so this is clearly useful only for debugging and not for
   routine running!

   In addition to cases like this particular example, NameAction can be
   useful for determining the origin of printout seen in the log file,
   verifying that the beginJob(..) or event(...) function of a
   particular module is being called, etc.

   This Action should be included in most standard BaBar executables;
   if it seems to be missing, check that you have something like the
   following in your AppUserBuild.cc:

      #include "RecoUtils/AppActionName.hh"
      //...
      theFramework->actions()->append(new AppActionName);

   7.2 Skipping events

   Occasionally a crash will happen well into an input file or
   collection, and it is very useful for debugging to skip all events
   prior to the one in which the crash occurs. One way to do this is to
   use the fact that the module EvtCounter (which appears in most of
   the BaBar executables) is a Filter module. Suppose you want to skip
   to the 851st (sequential) event in an input collection; simply add
   the following to your tcl file:

      module talk EvtCounter
        skipStart set 1
        skipStop set 850
      exit

   and then do an 'ev begin -nev 851' to process only the 851st event
   (or 'ev begin' if you want to process all events starting from the
   851st one).

   Skipping events is useful in many cases to examine a problematic
   event multiple times. In certain cases like memory overwrites, the
   crash may not appear unless the previous events are also processed,
   so skipping events may not help.

   A faster method of doing this, if you are reading from Objectivity,
   is the following:

      module talk BdbEventInput
        first set 851
      exit

   You should note in this case that EvtCounter only sees events
   starting at 851 and therefore only starts counting there.

8. Searching the release for code fragments

   Often in the course of debugging, one will see either (a) a message
   in the log file or (b) a function call in the traceback that comes
   from someplace in the BaBar code base. It is often useful to figure
   out which piece of code is the source of the message or function
   call. The easiest way to do this is to use 'srtglimpse', which is a
   BaBar-specific wrapper around the general 'glimpse' utility. It
   prints lines from the BaBar code base which contain a given string.
   See the man page for 'srtglimpse' or type 'srtglimpse -h' for more
   information; here is one example:

      prompt> srtglimpse -H 8.6.2 'foo'

   will print:

      Aslund/src/fcstubs.F:        character*3 foo
      Aslund/src/fcstubs.F:        equivalence (i, foo)
      Aslund/src/fcstubs.F:        cdirtag = foo
      AssocTools/testAssocTools.cc:   anotherString = new RWCString("foober");
      <...>

   (You might want to pipe the output to 'more' or 'less' if there will
   be a lot of it.)

   One thing to note is that 'srtglimpse' uses an "index" which is made
   in advance (while the release is being built) in order to make the
   search much faster. We make this index only for regular full builds
   (not for lettered ones), so if you are interested in release 8.6.2a
   you will have to specify '-H 8.6.2' as in the example above.

9. Hypernews

   ...To Finish...

   o The main HN page (listing all HN groups) is:

        http://babar-hn.slac.stanford.edu:5090/HyperNews/index/index.html

     and that of the prelimbugs HN group is:

        http://babar-hn.slac.stanford.edu:5090/HyperNews/get/prelimbugs.html

10. Remedy

   ...To Finish...

   o The URL for the Remedy problem tracking system is:

        www.slac.stanford.edu/BFROOT/www/Computing/Environment/Remedy/www/remedy1.html

11. Misc. Tricks

   11.1 Putting the core file someplace besides the current directory

   Core files can be very big. If the directory where you run your
   executable (and where the core file would be dumped) is someplace
   where you have a limited quota (e.g. an afs volume or your home
   directory on many systems), you may find that you do not have enough
   space for your core file there. One trick which works on Sun and OSF
   (but not on Linux) is to use a soft link to put the core file
   someplace else:

      sunprompt> touch $SOMEPLACE/core
      sunprompt> ln -s $SOMEPLACE/core core

   where "$SOMEPLACE" here indicates some (presumably bigger) scratch
   space where you have write permission.
   Note also that dumping large core files to networked filesystems can
   take a long time, so there is some speed advantage if "$SOMEPLACE"
   is a disk local to the machine where the job is being run.

   If you are _really_ constrained for space, another possibility (that
   may allow you to at least get a traceback) is to limit your core
   file size such that it _will_ be truncated, i.e. in tcsh for
   example:

      sunprompt> limit coredumpsize 5M

   and hope that there will be enough to get the traceback. Your
   mileage may vary doing this. (You could truncate it too much.)

   11.2 "watchmalloc" on Sun for finding memory management problems

   On Sun there is a method for finding certain types of memory
   management problems such as deleting the same object twice, use of
   uninitialized memory, etc. This uses a special shared library which
   replaces the default one; you can use it by setting an environment
   variable (in tcsh for example):

      sunprompt> setenv LD_PRELOAD watchmalloc.so.1

   Important: set this _only_ before running an executable
   interactively for debugging, and make sure you 'unsetenv LD_PRELOAD'
   before (a) compiling/linking something or (b) submitting anything to
   the batch system. Both the batch system and the compiler have been
   seen to behave strangely if this is set. I have even seen a case
   where someone could not make the 'diff' command work properly with
   LD_PRELOAD set like this.

   So what does watchmalloc do? Many memory corruption problems do not
   actually cause a crash when the corruption happens, but only
   manifest themselves (as a crash) downstream in some other piece of
   code which is unlucky enough to need to use the corrupted memory.
   "watchmalloc" makes debugging easier by causing the crash to happen
   immediately when the corruption happens. So if (a) you think your
   problem may be due to memory corruption as described above and (b)
   the actual traceback you find from the crash leads you to believe
   that the corruption is happening elsewhere, the strategy for using
   watchmalloc is:

   o compile code and link your executable
   o setenv LD_PRELOAD watchmalloc.so.1
   o run your executable _interactively_
   o if it crashes, examine the core file: the traceback should give
     you some indication of where the original corruption happened
   o make sure to 'unsetenv LD_PRELOAD' before doing anything else,
     otherwise you may see strange behavior as noted above

   For more information type 'man watchmalloc' on Sun.

   11.3 Truss and strace

   Sometimes programs fail because they fail to obtain expected
   resources, e.g. they fail to locate a file they need. Such a program
   may not crash, but may just exit saying "Duh" or some other useless
   thing. A great example occurs when you install stripped, licensed
   code that you can not put in a debugger, but which does not run when
   you first install it: your guts tell you it is installed wrong, but
   how?

   There is a program called 'truss' on the Sun. Linux has a similar
   program called 'strace'. (I am not aware of something similar on
   OSF; a "strace" program exists there, but does something different.)
   These programs can attach to a running program or can run a program
   under them, e.g.

      sunprompt> truss myProg -x paramX

   As the program executes, 'truss' and 'strace' watch for all system
   calls and print them out, along with their arguments. System calls
   include opening and closing files, memory requests, socket
   manipulations, etc. Quite often, you can strace an aberrant program
   and see a system call just before the program exits or fails that is
   obviously related to, or even the cause of, the program failure.
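   For example, to log the system calls to a file or to attach to a
   process that is already running (both 'strace' and 'truss' accept
   '-o' to name an output file and '-p' to attach to a process ID;
   'myProg' is the same placeholder as above and 'trace.out' is an
   invented log file name):

      linuxprompt> strace -o trace.out myProg -x paramX
      linuxprompt> strace -p <PID>
      sunprompt> truss -o trace.out -p <PID>

   A missing file then typically shows up in the strace log as a failed
   open() call shortly before the program gives up, along the lines of:

      open("/some/config/file.dat", O_RDONLY) = -1 ENOENT (No such file or directory)

   (truss prints the same call in a slightly different format, and the
   path shown here is of course just an illustration.)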
   11.4 MALLOC_CHECK_ on Linux

   This is an alternative approach to watchmalloc on Sun. If you set
   this environment variable to 2, the program will abort when certain
   misuses of memory are detected. Setting it to 1 will only print a
   message when these are detected.

12. Sources of further information

   ...To Finish...

13. Credits

   (Please add yourself if you correct, improve or add to the above.)

   Original     Peter Elmer <[email protected]>        17-Jan-2000
   Additions    Ed Frank <[email protected]>           10-Apr-2000
   Additions    Stephen J. Gowdy <[email protected]>   10-Apr-2000
   Misc.        Bob Jacobsen <[email protected]>       21-Apr-2000