Bugzilla – Bug 4777
Excessive memory usage for SEG modules
Last modified: 2006-11-01 12:04:43
You need to log in before you can comment on or make changes to this bug.
We've found that at apparently random times, the WS-GRAM Scheduler Event Generator daemon for LSF starts to quickly use more and more memory (GBs) until our login node is unusable. We've deployed the WS-GRAM provided by TeraGrid (Globus 4.0.1) on our lonestar system. Lonestar is a Intel em64t running 64bit Redhat 4.0 with LSF is the batch scheduler. The Globus container is running on a login node while LSF servers are running on an administrative node. The LSF log files are NFS mounted to the login nodes. We recently started debugging this problem on an alternate login node by turning on debugging messages via SEG_LSF_DEBUG environment variable. Unfortunately, these log files aren't giving us information that helps us find the problem. We also examined the LSF SEG code and nothing glaring stood out. The only thing we noticed is a malloc() for state->event_idx_path that isn't free()d, but we're not even sure this malloc() is ever called. How do you suggest we debug this? We can give you access to our alternative login node to debug and the SEG LSF log file is available there for you to see.
Created an attachment (id=1079) [details] New LSF SEG package I think I've located the problem (it also may occur in other SEG modules). Can you try building and installing this package and see if it helps? joe
The LSF SEG package was installed, but the seg lsf daemon still chews up memory top - 16:02:28 up 1 day, 23:46, 5 users, load average: 0.61, 0.59, 0.45 Tasks: 329 total, 1 running, 326 sleeping, 1 stopped, 1 zombie Cpu(s): 0.2% us, 0.2% sy, 0.0% ni, 99.5% id, 0.1% wa, 0.0% hi, 0.0% si Mem: 12336528k total, 12306408k used, 30120k free, 1976k buffers Swap: 8289520k total, 152k used, 8289368k free, 10233248k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 22725 globust 16 0 257m 72m 13m S 0 0.6 0:29.00 java 22789 globust 16 0 31384 2108 1632 S 0 0.0 0:00.00 globus-schedule 22825 globust 16 0 31384 2056 1632 S 0 0.0 0:00.00 globus-schedule 23250 globust 15 0 62464 4620 1068 S 0 0.0 0:00.05 csh 23947 globust 15 0 77924 21m 2208 S 0 0.2 0:00.45 gdb 22806 globust 16 0 1754m 1.7g 1684 T 0 14.3 7:09.14 globus-schedule 23987 globust 19 0 0 0 0 Z 0 0.0 0:00.00 gdb <defunct> [root@lslogin2 var]# grep increas seg_lsf.log.24574 seg_lsf.log.22806:1161204629.065505000::[INFO] 1161204629.065508000::globus_l_lsf_increase_buffer() called seg_lsf.log.22806:1161204629.065517000::[INFO] 1161204629.065519000::globus_l_lsf_increase_buffer() exits w/success lslogin2% gdb GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". (gdb) attach 22806 Attaching to process 22806 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/libexec/globus-scheduler-event-generator...done. Using host libthread_db library "/lib64/tls/libthread_db.so.1". Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_scheduler_event_generator_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_scheduler_event_generator_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gram_protocol_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gram_protocol_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_io_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_io_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_xio_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_xio_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libgssapi_error_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libgssapi_error_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gss_assist_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gss_assist_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gssapi_gsi_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gssapi_gsi_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_proxy_core_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_proxy_core_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_credential_gcc64dbg.so.0...done. Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_credential_gcc64dbg.so.0 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_callback_gcc64dbg.so.0...done. Reading symbols from /lib64/tls/libc.so.6...done. Loaded symbols for /lib64/tls/libc.so.6 Reading symbols from /lib64/ld-linux-x86-64.so.2...done. Loaded symbols for /lib64/ld-linux-x86-64.so.2 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_seg_lsf_gcc64dbg.so.0...done. Loaded symbols for /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_seg_lsf_gcc64dbg.so.0 0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6 (gdb) where #0 0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6 #2 0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6 #3 0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6 #4 0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6 #5 0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf (fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192 #6 0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at seg_lsf_module.c:505 #7 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #8 0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184 (gdb) where #0 0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6 #2 0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6 #3 0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6 #4 0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6 #5 0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf (fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192 #6 0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at seg_lsf_module.c:505 #7 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #8 0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184 (gdb) where #0 0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6 #2 0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6 #3 0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6 #4 0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6 #5 0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf (fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192 #6 0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at seg_lsf_module.c:505 #7 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #8 0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184 (gdb) where #0 0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6 #2 0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6 3 0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6 #4 0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6 #5 0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf (fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192 #6 0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at seg_lsf_module.c:505 #7 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #8 0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184 (gdb)
(In reply to comment #0) > How do you suggest we debug this? We can give you access to our alternative > login node to debug and the SEG LSF log file is available there for you to see. > That would help. My lonestar account is tg457177. Please give details about getting to the alternate node. joe
ssh (non-gsi) to lslogin2.tacc.utexas.edu. Note: The gsissh configuration may not be up to dated. [globust@lslogin2 grid-security]# grep tg457177 /etc/passwd tg457177:x:457177:80243:Joseph Bester:/home/teragrid/tg457177:/bin/csh
Created an attachment (id=1094) [details] New LSF SEG package Here's another leak fix (based on a fix for a similar problem with the PBS SEG module). joe
We just built the patch, install it our test node, and have started testing it by running some of inca ws gram test cases over the weekend. I will let you know how it goes on Monday.
Ok the daemon is not eating memory but it is eating CPU cycle with the new patch. top - 13:11:03 up 3 days, 20:54, 7 users, load average: 0.56, 0.54, 0.38 Tasks: 333 total, 2 running, 331 sleeping, 0 stopped, 0 zombie Cpu(s): 17.1% us, 11.3% sy, 0.0% ni, 71.6% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 12336528k total, 12304260k used, 32268k free, 256k buffers Swap: 8289520k total, 152k used, 8289368k free, 11977808k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 17586 globust 25 0 31548 2376 1688 R 100 0.0 9:08.43 globus-schedule 17512 globust 16 0 264m 94m 14m S 12 0.8 4:30.09 java 17570 globust 16 0 31788 2492 1632 S 0 0.0 0:00.10 globus-schedule 17595 globust 16 0 31384 2068 1632 S 0 0.0 0:00.01 globus-schedule [root@lslogin2 libexec]# ps -ef | grep lsf | grep globust globust 17586 17512 8 11:32 ? 00:08:02 /data/TG/64bit/globus-wsrf-4.0.1-r3/libexec/globus-scheduler-event-generator -s lsf -t 1161202518 [root@lslogin2 libexec]# gdb GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". (gdb) attach 17586 Attaching to process 17586 (gdb) where #0 0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6 #2 0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6 #3 0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6 #4 0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6 #5 0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf (fmt=0x2aaaac357407 "%s") at seg_lsf_module.c:190 #6 0x00002aaaac355e2a in globus_l_lsf_clean_buffer (state=0x509440) at seg_lsf_module.c:722 #7 0x00002aaaac355282 in globus_l_lsf_read_callback (user_arg=0x509440) at seg_lsf_module.c:525 #8 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #9 0x0000000000401e7e in main (argc=5, argv=0x7ffffffb5398) at main.c:184
I gave the SEG LSF daemon a "kill -1" and it stopped running at 100%, but three hours later it happened again. Here is what gdb trace is showing: top - 16:26:24 up 4 days, 10 min, 7 users, load average: 0.21, 0.12, 0.09 Tasks: 345 total, 1 running, 344 sleeping, 0 stopped, 0 zombie top - 16:37:44 up 4 days, 21 min, 7 users, load average: 0.88, 0.37, 0.18 Tasks: 340 total, 2 running, 338 sleeping, 0 stopped, 0 zombie Cpu(s): 13.0% us, 12.4% sy, 0.0% ni, 74.6% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 12336528k total, 1253444k used, 11083084k free, 11032k buffers Swap: 8289520k total, 152k used, 8289368k free, 867120k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 19161 globust 25 0 31420 2156 1676 R 100 0.0 1:30.56 globus-schedule lslogin2% gdb GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". (gdb) attach 19161 Attaching to process 19161 (gdb) where #0 0x00000035e98b9272 in __read_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865262 in _IO_file_read_internal () from /lib64/tls/libc.so.6 #2 0x00000035e986453d in _IO_new_file_underflow () from /lib64/tls/libc.so.6 #3 0x00000035e9865762 in _IO_file_xsgetn_internal () from /lib64/tls/libc.so.6 #4 0x00000035e985b141 in fread () from /lib64/tls/libc.so.6 #5 0x00002aaaac3550c3 in globus_l_lsf_read_callback (user_arg=0x5092e0) at seg_lsf_module.c:500 #6 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #7 0x0000000000401e7e in main (argc=5, argv=0x7fffffb4e528) at main.c:184
Created an attachment (id=1103) [details] New LSF SEG package Here's another patch, that seems to have the features of not taking all memory and also not taking all CPU. Let me know if there are any other system resources I should also be nice about :) joe
We downed and installed the third patch, but the SEG lsf daemon is still having a going to 100%. top - 18:37:18 up 7 days, 2:21, 3 users, load average: 1.05, 1.01, 0.94 Tasks: 331 total, 2 running, 329 sleeping, 0 stopped, 0 zombie Cpu(s): 13.4% us, 12.2% sy, 0.0% ni, 74.5% id, 0.0% wa, 0.0% hi, 0.0% si Mem: 12336528k total, 3828648k used, 8507880k free, 37856k buffers Swap: 8289520k total, 152k used, 8289368k free, 1748300k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 29084 globust 25 0 31548 2324 1680 R 100 0.0 104:52.09 globus-schedule 29009 globust 17 0 264m 93m 13m S 0 0.8 4:25.54 java 29067 globust 16 0 32068 2804 1672 S 0 0.0 0:00.00 globus-schedule 29091 globust 16 0 31380 2076 1632 S 0 0.0 0:00.01 globus-schedule [root@lslogin2 init.d]# gdb GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh) Copyright 2004 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". (gdb) attach 29084 Attaching to process 29084 Reading symbols from /import/data2/TG/64bit/globus-wsrf-4.0.1-r3/libexec/globus-scheduler-event-generator...done. (gdb) where #0 0x00000035e98b9272 in __read_nocancel () from /lib64/tls/libc.so.6 #1 0x00000035e9865262 in _IO_file_read_internal () from /lib64/tls/libc.so.6 #2 0x00000035e986453d in _IO_new_file_underflow () from /lib64/tls/libc.so.6 #3 0x00000035e9865762 in _IO_file_xsgetn_internal () from /lib64/tls/libc.so.6 #4 0x00000035e985b141 in fread () from /lib64/tls/libc.so.6 #5 0x00002aaaac3550c3 in globus_l_lsf_read_callback (user_arg=0x509240) at seg_lsf_module.c:500 #6 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #7 0x0000000000401e7e in main (argc=5, argv=0x7fffffb333d8) at main.c:184 (gdb) quit [root@lslogin2 lib]# ls -lt | more total 153160 -rw-r--r-- 1 globust G-1130 51486 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.a -rwxr-xr-x 1 globust G-1130 3178 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.la lrwxrwxrwx 1 globust G-1130 35 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.so -> libglobus_seg_lsf_gcc64dbg.so.0.0.0 lrwxrwxrwx 1 globust G-1130 35 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.so.0 -> libglobus_seg_lsf_gcc64dbg.so.0.0.0 -rwxr-xr-x 1 globust G-1130 43727 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.so.0.0.0
(gdb) break seg_lsf_module.c:500 Breakpoint 1 at 0x2aaaac355080: file seg_lsf_module.c, line 500. (gdb) c Continuing. Breakpoint 1, globus_l_lsf_read_callback (user_arg=0x509240) at seg_lsf_module.c:500 500 rc = fread(state->buffer + state->buffer_point + state->buffer_valid, (gdb) print rc $4 = 0 (gdb) print state->buffer_valid $5 = 0 (gdb) print feof(state->fp) $6 = 0 (gdb) print state->fp $7 = (FILE *) 0x50a3d0 (gdb) step 503 SEGLsfDebug(SEG_LSF_DEBUG_TRACE, (gdb) step 506 if (rc < max_to_read) (gdb) step 508 if (feof(state->fp)) (gdb) step 520 state->buffer_valid += rc; (gdb) step 523 rc = globus_l_lsf_parse_events(state); (gdb) c Continuing. Breakpoint 1, globus_l_lsf_read_callback (user_arg=0x509240) at seg_lsf_module.c:500 500 rc = fread(state->buffer + state->buffer_point + state->buffer_valid, (gdb) print eof_hit $10 = 0 (gdb) print state->fp $11 = (FILE *) 0x50a3d0 (gdb)
It took all day, but I finally got it to misbehave in the way you are seeing and have a good idea of what is going wrong (different this time). I will work out a solution and try another run tomorrow. joe
Joe when the SEG LSF daemon hangs and does not respond to globusrun-ws then gdb shows the following: (gdb) where #0 0x00000035e98bf8f5 in __select_nocancel () from /lib64/tls/libc.so.6 #1 0x00002aaaaae0f85c in globus_l_xio_system_poll (user_args=0x0) at globus_xio_system_select.c:2145 #2 0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630, space=0) at globus_callback_nothreads.c:1424 #3 0x0000000000401e7e in main (argc=5, argv=0x7ffffff076f8) at main.c:184 (gdb) Is this what you are seeing?
What I saw was a loop in globus_l_lsf_read_callback occurring because of a file read error (probably because the log file being parsed was on NFS and was renamed because of an LSF log rotation). The fix is more complicated than I thought because the way the LSF log rotation happens changed a bit between the version the SEG module was developed (LSF 5.1) and what you have at TACC (6.2). I'm working on an updated version of the module which will handle the situation where LSF log rotation happens while the SEG is running. I think part of my difficulty in locating the problem was that, because of the logging changes in LSF, the older log records weren't being parsed like I expected from the SEG, so it only happened for me after the log file was rotated. joe
I've got a new version of the SEG module running on lslogin2 again. It is honoring the starting timestamp (for job state changes which occurred when the SEG was not monitoring them). Hopefully there will be a log rotation in the next few hours so I can verify that it won't react badly to that. joe
Created an attachment (id=1111) [details] New LSF SEG package I ran the SEG with this module over the weekend without troubles. There were several LSF log rotations that occurred during that time, so I feel pretty confident about this one.
Joe, SEG LSF ran over night without any problems on lslogin2. I informed Doru at NCSA about this patch for their LSF system. Could this patch help PBS, SGE, or Loadleveler SEG's? Or is it a LSF only issue? I am closing this ticket. Thanks again, David Carver
I've made similar fixes for the memory leak problems for the fork, pbs, and condor SEG modules that are part of the toolkit distribution and committed these all to CVS in the 4.0 branch and trunk. These all contained the 1st and 2nd leak problems I patched in the LSF module, but not the other issues. I'm not sure of where the other SEG modules come from, but they likely have the same leaks (though they may not leak at the same rate as you were seeing). The fixes for those would be - replace the first argument to globus_callback_register_oneshot with NULL - add a call to the increase_buffer function in the clean_buffer within the if statement - in increase buffer, only increment buffer_length in the if statement where the resizing actually occurs. I've put the update source packages on the ftp site and added advisories for these to http://www.globus.org/toolkit/advisories.html joe
This resolved the problem for TeraGrid Ticket 130032.