Tuesday, November 9, 2010

Core & Dump @ Solaris (2)


pstack, coreadm and symbol tables
Where do symbol names come from?
In ELF files, symbols reside in two sections: .symtab and .dynsym.

On recent versions of Solaris, there is a new section, .SUNW_ldynsym, but for the purpose of this article it is identical to .dynsym, so I'll keep it simple and not talk about it.

Both sections are essentially tables that map a name to a value; here we are interested in function names, so the value is the function's address. When pstack unwinds the stack (starting from the values of the $pc and $fp/$sp registers, which come from a special NOTE segment of the core file), it goes through the symbol tables of all files involved and, for each frame, finds the symbol with the closest value at or below the frame's address.
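The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not pstack's actual implementation: the function name resolve and the table layout are made up for this example, and the sample values are the ones that appear in the elfdump output later in this article. The check against the symbol size is my assumption about why a frame ends up as ???????? rather than as a large offset from some unrelated symbol.

```python
import bisect

# (start address, size, name); values taken from the elfdump output in this article.
# strlen's file value 0x25860 is already rebased to libc's load address 0xFECC0000.
SYMBOLS = sorted([
    (0x08050990, 0x1A, "main"),
    (0xFECE5860, 0x45, "strlen"),
])
STARTS = [start for start, _size, _name in SYMBOLS]


def resolve(addr):
    """Return 'name + offset' for addr, or '????????' if no symbol covers it."""
    # Index of the last symbol whose start address is <= addr.
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return "????????"
    start, size, name = SYMBOLS[i]
    # Assumption: an address past the end of the symbol is reported as unknown.
    if addr >= start + size:
        return "????????"
    return "%s + %x" % (name, addr - start)


print(resolve(0xFECE586C))  # strlen + c
print(resolve(0x080509A2))  # main + 12
print(resolve(0x08050969))  # ????????
```

The last call returns question marks because the table, like the .dynsym of a stripped executable, has no entry covering that address.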

For example, suppose we have this core file:

$ pstack core
core 'core' of 7719: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 ???????? (0, 8047b30, 8047a84, 80508bd, 1, 8047a90)
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
The address fece586c belongs to libc.so.1, as can be seen from the pmap(1) output:

$ pmap core
core 'core' of 7719: ./a.out
08046000 8K rwx-- [ stack ]
08050000 4K r-x--
08060000 4K rwx--
08061000 128K rwx-- [ heap ]
>>>FECC0000 760K r-x-- /lib/libc.so.1 <<<
FED8E000 32K rw--- /lib/libc.so.1
FED96000 8K rw--- /lib/libc.so.1
...
It is in the code segment (the r-x-- permissions give that away) of /lib/libc.so.1.

Looking at libc.so.1 with elfdump, we can see that the global function strlen starts at offset 0x25860:

$ elfdump -s /usr/lib/libc.so.1 | grep strlen
[2603] 0x00025860 0x00000045 FUNC GLOB D 37 .text strlen
So in our deceased process it would reside at 0xFECC0000 (the base address of libc.so.1 in memory) + 0x25860 = 0xFECE5860. Hence 0xfece586c is 0xFECE5860 + 0xc, which is strlen + 0xc.
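The same arithmetic, written out as a small Python sketch. The mapping list is copied by hand from the pmap output above (the 4K text mapping had no name in that output, so it is left empty here). The helper name find_mapping is hypothetical. Adding the symbol value directly to the mapping's base assumes the library's text segment is mapped from virtual address 0 of the file, which is what the calculation above assumes as well.

```python
# (base address, size in bytes, permissions, object name); from the pmap output above.
MAPPINGS = [
    (0x08050000, 4 * 1024, "r-x--", ""),
    (0xFECC0000, 760 * 1024, "r-x--", "/lib/libc.so.1"),
    (0xFED8E000, 32 * 1024, "rw---", "/lib/libc.so.1"),
]


def find_mapping(addr):
    """Return (base, perms, name) of the mapping containing addr, or None."""
    for base, size, perms, name in MAPPINGS:
        if base <= addr < base + size:
            return base, perms, name
    return None


pc = 0xFECE586C                    # top frame from the pstack output
base, perms, name = find_mapping(pc)
strlen_value = 0x25860             # symbol value reported by elfdump -s
strlen_addr = base + strlen_value  # address of strlen in the dead process

print(name, perms)                         # /lib/libc.so.1 r-x--
print(hex(strlen_addr))                    # 0xfece5860
print("strlen + %x" % (pc - strlen_addr))  # strlen + c
```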

Symbol tables
As you can see in the above example, not all symbols were found. In this case, address 0x08050969 was not mapped to any symbol. That address belongs to the a.out code segment starting at 0x08050000, and that's all we can tell. Yet another symbol from the same segment is visible: the frame at 0x080509a2 resolved to main.

The difference arises because the two symbols live in different symbol tables, and an executable file is required to have only one of them: .dynsym (strictly speaking, that probably applies to dynamic executables only, but since Solaris 10 strongly discourages static linking, we almost always deal with dynamic executables and shared libraries). The .dynsym section is used by the run-time linker (ld.so.1(1)) and contains the global names that the program "exports" to or "imports" from libraries; a call to "main" is resolved at run time by looking up the name "main" in the .dynsym section and jumping to the address associated with the symbol found. Since this information is absolutely necessary at run time, the .dynsym section always resides in a loadable segment and is always part of the process memory image (and thus of a core file).

On the other hand, the .symtab section, which contains all symbols, including local ones, is useful mostly when linking relocatable object files (*.o). References inside one file can be resolved at compile time using offsets, so static functions do not need a name at run time; they are called directly using an offset from the current position. This is why the .symtab section does not belong to a loadable segment and does not contribute to the process's memory image in any way. And this is why it [used to be] customary to remove the symbol table from final executables (using strip(1), for example) to save space and make the life of support engineers harder.

In our case, ./a.out was indeed stripped:

$ elfdump -c a.out | grep symtab
$ elfdump -c a.out | grep dynsym
Section Header[4]: sh_name: .dynsym
It does have .dynsym, but no .symtab. By the way, the main symbol is indeed present in .dynsym and has address 0x08050990:

$ elfdump -s -N .dynsym a.out | grep main
[28] 0x08050990 0x0000001a FUNC GLOB D 0 .text main
Loadable objects (executables and shared libraries)
Let's recompile a.out, this time without stripping it, and see how that helps:

$ CC510 a.cc
$ ./a.out
Segmentation Fault (core dumped)
$ pstack core
core 'core' of 11761: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 __1cDfoo6F_i_ (0, 8047b30, 8047a84, 80508bd, 1, 8047a90) + 19
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
We can now see the name __1cDfoo6F_i_ (the mangled name of int foo()) instead of ????????, but where does pstack get this information? __1cDfoo6F_i_ is not present in .dynsym, so there was no information about this name in the memory image of the process when it died:

$ strings core | grep __1cDfoo6F_i_
pstack(1) is smarter than that: it finds out which program generated the core file, locates it, and uses its .symtab (if present, of course) to map symbols. Here's an excerpt from proc(1):

Some of the proc tools can need to derive the name of the
executable corresponding to the process which dumped core or
the names of shared libraries associated with the process.
These files are needed, for example, to provide symbol table
information for pstack(1). If the proc tool in question is
unable to locate the needed executable or shared library,
some symbol information is unavailable for display.
Let's delete a.out and see what happens:

$ rm a.out
$ pstack core
core 'core' of 11761: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 ???????? (0, 8047b30, 8047a84, 80508bd, 1, 8047a90)
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
We immediately got our ???'s back.

So pstack uses the core file as well as the executable and libraries in order to print readable names in the stack trace.
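To make the dependency on the files explicit, here is a rough Python model of the behaviour observed above. It is not how pstack is implemented: the symbol_sources and name_for helpers are invented for this sketch, and the order in which the tables are consulted is my assumption. The symbol values are the ones shown by elfdump in this article.

```python
# .dynsym is part of the process image, so it is always available from the core.
DYNSYM = {"main": (0x08050990, 0x1A)}
# .symtab exists only in the (unstripped) file on disk; it also holds local symbols.
SYMTAB = {"main": (0x08050990, 0x1A), "__1cDfoo6F_i_": (0x08050950, 0x34)}


def symbol_sources(executable_present):
    """Tables that can be consulted for a.out in this hypothetical model."""
    return [SYMTAB, DYNSYM] if executable_present else [DYNSYM]


def name_for(addr, executable_present):
    """Return 'name + offset' for addr, or '????????' if nothing covers it."""
    for table in symbol_sources(executable_present):
        for name, (start, size) in table.items():
            if start <= addr < start + size:
                return "%s + %x" % (name, addr - start)
    return "????????"


print(name_for(0x08050969, executable_present=True))   # __1cDfoo6F_i_ + 19
print(name_for(0x08050969, executable_present=False))  # ????????
print(name_for(0x080509A2, executable_present=False))  # main + 12
```

With the executable gone, only the global names from .dynsym survive, which matches the two pstack runs above.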

Core file contents
If you have to send your core file to another person for inspection, you put them at a disadvantage: that person might not have your executable, and even the system libraries might be slightly different. If pstack goes looking for the address-to-symbol mapping there, it might end up printing wrong symbol names and question marks, making the core file more harmful than helpful.

There is a way to embed symbol tables into the core file: the coreadm(1M) command. It lets you specify what kind of content the system should put into a core file, and it can even direct the system to pull .symtab from the executable and shared libraries:

# coreadm -I default+symtab
(Run this as root.)
More information on coreadm can be found in its man page: coreadm(1M).
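As a rough illustration of what a content specification such as default+symtab means, the sketch below expands it into a set of content tokens. Treat the details as assumptions: the expand_content helper is invented for this example, and the list of tokens behind "default" is what I recall from the Solaris 10 coreadm(1M) man page, so check the man page on your release. The special values all and none are not handled.

```python
import re

# Assumption: the set that "default" stands for, as recalled from the
# Solaris 10 coreadm(1M) man page. Verify against your system.
DEFAULT = {"stack", "heap", "shm", "ism", "dism", "text", "data",
           "rodata", "anon", "shanon", "ctf"}


def expand_content(spec):
    """Expand a spec such as 'default+symtab' into a set of content tokens."""
    content = set()
    # Each token is optionally preceded by '+' (add) or '-' (remove);
    # the first token has an implicit '+'.
    for sign, token in re.findall(r"([+-]?)([a-z]+)", spec):
        items = DEFAULT if token == "default" else {token}
        if sign == "-":
            content -= items
        else:
            content |= items
    return content


print("symtab" in expand_content("default"))         # False
print("symtab" in expand_content("default+symtab"))  # True
```

Under that assumption, the default content does not include symtab, which is why it has to be requested explicitly.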

Side note: in fact, symbol tables of libc.so.1 and ld.so.1 were present in my core file even without "symtab" content requested as can be seen by elfdump -c core; seems to be an undocumented, but useful feature.

Let's turn .symtab inclusion on and see if it helps:

$ su -
# coreadm -I default+symtab
# exit
$ ./a.out
Segmentation Fault (core dumped)
$ rm a.out
$ pstack core
core 'core' of 13604: ./a.out
fece586c strlen (8050ada, 8047a38, fed91c20, 0) + c
fed40814 printf (8050ad8, 0) + a8
08050969 __1cDfoo6F_i_ (0, 8047b30, 8047a84, 80508bd, 1, 8047a90) + 19
080509a2 main (1, 8047a90, 8047a98, fed93e40) + 12
080508bd _start (1, 8047b98, 0, 8047ba0, 8047bdc, 8047be7) + 7d
The core file now contains several symbol tables, one per load object:

$ elfdump -c core | grep symtab
Section Header[1]: sh_name: .symtab
Section Header[3]: sh_name: .symtab
Section Header[6]: sh_name: .symtab
Section Header[8]: sh_name: .symtab
Section Header[10]: sh_name: .symtab
Section Header[12]: sh_name: .symtab
and one of them has the definition of our int foo() function, which starts at 0x08050950:

$ elfdump -s core | grep foo
[56] 0x08050950 0x00000034 FUNC LOCL D 0 __1cDfoo6F_i_
How to prevent ??? from appearing in a stack trace?
Use pstack on the same machine
First and foremost, you can avoid many problems by first using pstack on the same machine where core file was generated. This will ensure that pstack uses the same binary and libraries as the process that generated core. Otherwise, you might end up looking at wrong symbols or (in the best case) a lot of question marks.

Don't strip binaries
In Solaris, it is no longer customary to strip binaries (see http://blogs.sun.com/ali/entry/which_solaris_files_are_stripped). The space savings are questionable and the performance of an unstripped binary does not suffer, so why make life difficult for those who will debug it?

Don't delete binaries
By default, Solaris does not include .symtab in core files (except for libc.so.1 and ld.so.1, as I mentioned earlier, but that is not relevant here, where we talk about user executables and libraries). So if you delete or move an executable or library after the core file was generated, pstack won't be able to find its .symtab and thus won't map addresses to local function names.

In other words, unless you've changed core file contents with coreadm(1M), don't delete your binaries before you have a chance to inspect core file. They are needed.

Use coreadm
Most of the problems above can be eliminated with one blow:

# coreadm -I default+symtab
This tells the system to pull .symtab sections from every binary involved in the process and put them into the core file. You no longer need the binaries to see names instead of numbers in the stack trace.


