Bug 4777 - Excessive memory usage for SEG modules
: Excessive memory usage for SEG modules
Status: RESOLVED FIXED
: GRAM
wsrf scheduler interface
: 4.0.1
: Other Linux
: P1 critical
: 4.0.4
Assigned To:
: http://www.tacc.utexas.edu
:
:
:
  Show dependency treegraph
 
Reported: 2006-10-13 16:00 by
Modified: 2006-11-01 12:04 (History)


Attachments
New LSF SEG package (174.35 KB, application/x-tar)
2006-10-17 09:12, Joe Bester
Details
New LSF SEG package (174.32 KB, application/x-tar)
2006-10-20 11:04, Joe Bester
Details
New LSF SEG package (174.33 KB, application/x-tar)
2006-10-23 12:37, Joe Bester
Details
New LSF SEG package (175.24 KB, application/x-tar)
2006-10-30 07:54, Joe Bester
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2006-10-13 16:00:34
We've found that at apparently random times, the WS-GRAM Scheduler Event
Generator daemon for LSF starts to quickly use more and more memory (GBs) until
our login node is unusable.

We've deployed the WS-GRAM provided by TeraGrid (Globus 4.0.1) on our lonestar
system. Lonestar is a Intel em64t running 64bit Redhat 4.0 with LSF is the
batch scheduler.  The Globus container is running on a login node while LSF
servers are running on an administrative node. The LSF log files are NFS
mounted to the login nodes. 

We recently started debugging this problem on an alternate login node by
turning on debugging messages via SEG_LSF_DEBUG environment variable.
Unfortunately, these log files aren't giving us information that helps us find
the problem. We also examined the LSF SEG code and nothing glaring stood out.
The only thing we noticed is a malloc() for state->event_idx_path that isn't
free()d, but we're not even sure this malloc() is ever called.  

How do you suggest we debug this? We can give you access to our alternative
login node to debug and the SEG LSF log file is available there for you to see.
------- Comment #1 From 2006-10-17 09:12:49 -------
Created an attachment (id=1079) [details]
New LSF SEG package

I think I've located the problem (it also may occur in other SEG modules). Can
you try building and installing this package and see if it helps?

joe
------- Comment #2 From 2006-10-18 16:26:25 -------
The LSF SEG package was installed, but the seg lsf daemon still chews up memory


top - 16:02:28 up 1 day, 23:46,  5 users,  load average: 0.61, 0.59, 0.45
Tasks: 329 total,   1 running, 326 sleeping,   1 stopped,   1 zombie
Cpu(s):  0.2% us,  0.2% sy,  0.0% ni, 99.5% id,  0.1% wa,  0.0% hi,  0.0% si
Mem:  12336528k total, 12306408k used,    30120k free,     1976k buffers
Swap:  8289520k total,      152k used,  8289368k free, 10233248k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22725 globust   16   0  257m  72m  13m S    0  0.6   0:29.00 java
22789 globust   16   0 31384 2108 1632 S    0  0.0   0:00.00 globus-schedule
22825 globust   16   0 31384 2056 1632 S    0  0.0   0:00.00 globus-schedule
23250 globust   15   0 62464 4620 1068 S    0  0.0   0:00.05 csh
23947 globust   15   0 77924  21m 2208 S    0  0.2   0:00.45 gdb
22806 globust   16   0 1754m 1.7g 1684 T    0 14.3   7:09.14 globus-schedule
23987 globust   19   0     0    0    0 Z    0  0.0   0:00.00 gdb <defunct>

[root@lslogin2 var]# grep increas  seg_lsf.log.24574

seg_lsf.log.22806:1161204629.065505000::[INFO]
1161204629.065508000::globus_l_lsf_increase_buffer() called
seg_lsf.log.22806:1161204629.065517000::[INFO]
1161204629.065519000::globus_l_lsf_increase_buffer() exits w/success


lslogin2% gdb
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) attach 22806
Attaching to process 22806
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/libexec/globus-scheduler-event-generator...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_scheduler_event_generator_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_scheduler_event_generator_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gram_protocol_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gram_protocol_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_io_gcc64dbg.so.0...done.
Loaded symbols for /data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_io_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_xio_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_xio_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libgssapi_error_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libgssapi_error_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gss_assist_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gss_assist_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gssapi_gsi_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gssapi_gsi_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_proxy_core_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_proxy_core_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_credential_gcc64dbg.so.0...done.
Loaded symbols for
/data/TG/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_credential_gcc64dbg.so.0
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_gsi_callback_gcc64dbg.so.0...done.
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_seg_lsf_gcc64dbg.so.0...done.
Loaded symbols for
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/lib/libglobus_seg_lsf_gcc64dbg.so.0
0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6

(gdb) where
#0  0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6
#2  0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6
#3  0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6
#4  0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6
#5  0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf
(fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192
#6  0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at
seg_lsf_module.c:505
#7  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#8  0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184
(gdb) where
#0  0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6
#2  0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6
#3  0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6
#4  0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6
#5  0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf
(fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192
#6  0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at
seg_lsf_module.c:505
#7  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#8  0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184
(gdb) where
#0  0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6
#2  0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6
#3  0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6
#4  0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6
#5  0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf
(fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192
#6  0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at
seg_lsf_module.c:505
#7  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#8  0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184
(gdb) where
#0  0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6
#2  0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6
3  0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6
#4  0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6
#5  0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf
(fmt=0x2aaaac357487 "%s") at seg_lsf_module.c:192
#6  0x00002aaaac355127 in globus_l_lsf_read_callback (user_arg=0x509450) at
seg_lsf_module.c:505
#7  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#8  0x0000000000401e7e in main (argc=5, argv=0x7fffffd18b68) at main.c:184
(gdb)
------- Comment #3 From 2006-10-19 10:20:45 -------
(In reply to comment #0)
> How do you suggest we debug this? We can give you access to our alternative
> login node to debug and the SEG LSF log file is available there for you to see.
> 


That would help. My lonestar account is tg457177. Please give details about
getting to the alternate node.

joe
------- Comment #4 From 2006-10-19 12:18:07 -------
ssh (non-gsi) to lslogin2.tacc.utexas.edu. Note: The gsissh configuration may
not be up to dated.

[globust@lslogin2 grid-security]# grep tg457177 /etc/passwd
tg457177:x:457177:80243:Joseph Bester:/home/teragrid/tg457177:/bin/csh
------- Comment #5 From 2006-10-20 11:04:23 -------
Created an attachment (id=1094) [details]
New LSF SEG package

Here's another leak fix (based on a fix for a similar problem with the PBS SEG
module).

joe
------- Comment #6 From 2006-10-20 12:01:51 -------
We just built the patch, install it our test node, and have started testing it
by running some of inca ws gram test cases over the weekend.    I will let you
know how it goes on Monday.
------- Comment #7 From 2006-10-20 13:13:05 -------
Ok the daemon is not eating memory but it is eating CPU cycle with the new
patch.

top - 13:11:03 up 3 days, 20:54,  7 users,  load average: 0.56, 0.54, 0.38
Tasks: 333 total,   2 running, 331 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.1% us, 11.3% sy,  0.0% ni, 71.6% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  12336528k total, 12304260k used,    32268k free,      256k buffers
Swap:  8289520k total,      152k used,  8289368k free, 11977808k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
17586 globust   25   0 31548 2376 1688 R  100  0.0   9:08.43 globus-schedule    
17512 globust   16   0  264m  94m  14m S   12  0.8   4:30.09 java               
17570 globust   16   0 31788 2492 1632 S    0  0.0   0:00.10 globus-schedule    
17595 globust   16   0 31384 2068 1632 S    0  0.0   0:00.01 globus-schedule    

[root@lslogin2 libexec]# ps -ef | grep lsf | grep globust
globust  17586 17512  8 11:32 ?        00:08:02
/data/TG/64bit/globus-wsrf-4.0.1-r3/libexec/globus-scheduler-event-generator -s
lsf -t 1161202518

[root@lslogin2 libexec]# gdb
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) attach  17586
Attaching to process 17586

(gdb) where
#0  0x00000035e98b9302 in __write_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865328 in _IO_new_file_write () from /lib64/tls/libc.so.6
#2  0x00000035e9865517 in _IO_new_file_xsputn () from /lib64/tls/libc.so.6
#3  0x00000035e983f40f in buffered_vfprintf () from /lib64/tls/libc.so.6
#4  0x00000035e983f599 in vfprintf () from /lib64/tls/libc.so.6
#5  0x00002aaaac354041 in globus_i_SEG_LSF_debug_time_printf
(fmt=0x2aaaac357407 "%s") at seg_lsf_module.c:190
#6  0x00002aaaac355e2a in globus_l_lsf_clean_buffer (state=0x509440) at
seg_lsf_module.c:722
#7  0x00002aaaac355282 in globus_l_lsf_read_callback (user_arg=0x509440) at
seg_lsf_module.c:525
#8  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#9  0x0000000000401e7e in main (argc=5, argv=0x7ffffffb5398) at main.c:184
------- Comment #8 From 2006-10-20 16:47:02 -------
I gave the SEG LSF daemon a "kill -1" and it stopped running at 100%, but three
hours later it happened again.  Here is what gdb trace is showing:

top - 16:26:24 up 4 days, 10 min,  7 users,  load average: 0.21, 0.12, 0.09
Tasks: 345 total,   1 running, 344 sleeping,   0 stopped,   0 zombie
top - 16:37:44 up 4 days, 21 min,  7 users,  load average: 0.88, 0.37, 0.18
Tasks: 340 total,   2 running, 338 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.0% us, 12.4% sy,  0.0% ni, 74.6% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  12336528k total,  1253444k used, 11083084k free,    11032k buffers
Swap:  8289520k total,      152k used,  8289368k free,   867120k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
19161 globust   25   0 31420 2156 1676 R  100  0.0   1:30.56 globus-schedule    

lslogin2% gdb
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) attach 19161
Attaching to process 19161
(gdb) where
#0  0x00000035e98b9272 in __read_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865262 in _IO_file_read_internal () from /lib64/tls/libc.so.6
#2  0x00000035e986453d in _IO_new_file_underflow () from /lib64/tls/libc.so.6
#3  0x00000035e9865762 in _IO_file_xsgetn_internal () from /lib64/tls/libc.so.6
#4  0x00000035e985b141 in fread () from /lib64/tls/libc.so.6
#5  0x00002aaaac3550c3 in globus_l_lsf_read_callback (user_arg=0x5092e0) at
seg_lsf_module.c:500
#6  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#7  0x0000000000401e7e in main (argc=5, argv=0x7fffffb4e528) at main.c:184
------- Comment #9 From 2006-10-23 12:37:58 -------
Created an attachment (id=1103) [details]
New LSF SEG package

Here's another patch, that seems to have the features of not taking all memory
and also not taking all CPU. Let me know if there are any other system
resources I should also be nice about :)

joe
------- Comment #10 From 2006-10-23 18:49:13 -------
We downed and installed the third patch, but the SEG lsf daemon is still having
a 
going to 100%.

top - 18:37:18 up 7 days,  2:21,  3 users,  load average: 1.05, 1.01, 0.94
Tasks: 331 total,   2 running, 329 sleeping,   0 stopped,   0 zombie
Cpu(s): 13.4% us, 12.2% sy,  0.0% ni, 74.5% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:  12336528k total,  3828648k used,  8507880k free,    37856k buffers
Swap:  8289520k total,      152k used,  8289368k free,  1748300k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
29084 globust   25   0 31548 2324 1680 R  100  0.0 104:52.09 globus-schedule    
29009 globust   17   0  264m  93m  13m S    0  0.8   4:25.54 java               
29067 globust   16   0 32068 2804 1672 S    0  0.0   0:00.00 globus-schedule    
29091 globust   16   0 31380 2076 1632 S    0  0.0   0:00.01 globus-schedule    

[root@lslogin2 init.d]# gdb
GNU gdb Red Hat Linux (6.3.0.0-1.132.EL4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
(gdb) attach 29084
Attaching to process 29084
Reading symbols from
/import/data2/TG/64bit/globus-wsrf-4.0.1-r3/libexec/globus-scheduler-event-generator...done.

(gdb) where
#0  0x00000035e98b9272 in __read_nocancel () from /lib64/tls/libc.so.6
#1  0x00000035e9865262 in _IO_file_read_internal () from /lib64/tls/libc.so.6
#2  0x00000035e986453d in _IO_new_file_underflow () from /lib64/tls/libc.so.6
#3  0x00000035e9865762 in _IO_file_xsgetn_internal () from /lib64/tls/libc.so.6
#4  0x00000035e985b141 in fread () from /lib64/tls/libc.so.6
#5  0x00002aaaac3550c3 in globus_l_lsf_read_callback (user_arg=0x509240) at
seg_lsf_module.c:500
#6  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424
#7  0x0000000000401e7e in main (argc=5, argv=0x7fffffb333d8) at main.c:184
(gdb) quit

[root@lslogin2 lib]# ls -lt | more
total 153160
-rw-r--r--  1 globust G-1130   51486 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.a
-rwxr-xr-x  1 globust G-1130    3178 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.la
lrwxrwxrwx  1 globust G-1130      35 Oct 23 15:57 libglobus_seg_lsf_gcc64dbg.so
-> libglobus_seg_lsf_gcc64dbg.so.0.0.0
lrwxrwxrwx  1 globust G-1130      35 Oct 23 15:57
libglobus_seg_lsf_gcc64dbg.so.0 -> libglobus_seg_lsf_gcc64dbg.so.0.0.0
-rwxr-xr-x  1 globust G-1130   43727 Oct 23 15:57
libglobus_seg_lsf_gcc64dbg.so.0.0.0
------- Comment #11 From 2006-10-23 20:09:26 -------
(gdb) break  seg_lsf_module.c:500
Breakpoint 1 at 0x2aaaac355080: file seg_lsf_module.c, line 500.
(gdb) c
Continuing.

Breakpoint 1, globus_l_lsf_read_callback (user_arg=0x509240) at
seg_lsf_module.c:500
500             rc = fread(state->buffer + state->buffer_point +
state->buffer_valid,
(gdb) print rc
$4 = 0
(gdb) print  state->buffer_valid 
$5 = 0
(gdb) print feof(state->fp)
$6 = 0
(gdb) print state->fp
$7 = (FILE *) 0x50a3d0
(gdb) step
503             SEGLsfDebug(SEG_LSF_DEBUG_TRACE,
(gdb) step
506             if (rc < max_to_read)
(gdb) step
508                 if (feof(state->fp))
(gdb) step
520             state->buffer_valid += rc;
(gdb) step
523             rc = globus_l_lsf_parse_events(state);
(gdb) c
Continuing.

Breakpoint 1, globus_l_lsf_read_callback (user_arg=0x509240) at
seg_lsf_module.c:500
500             rc = fread(state->buffer + state->buffer_point +
state->buffer_valid,

(gdb) print eof_hit
$10 = 0

(gdb) print  state->fp
$11 = (FILE *) 0x50a3d0
(gdb) 
------- Comment #12 From 2006-10-25 16:48:40 -------
It took all day, but I finally got it to misbehave in the way you are seeing
and have a good idea of what is going wrong (different this time). I will work
out a solution and try another run tomorrow.

joe
------- Comment #13 From 2006-10-26 14:23:58 -------
Joe when the SEG LSF daemon hangs and does not respond to globusrun-ws then
gdb shows the following:

(gdb) where

#0  0x00000035e98bf8f5 in __select_nocancel () from /lib64/tls/libc.so.6

#1  0x00002aaaaae0f85c in globus_l_xio_system_poll (user_args=0x0) at
globus_xio_system_select.c:2145

#2  0x00002aaaabd0e373 in globus_callback_space_poll (timestop=0x503630,
space=0) at globus_callback_nothreads.c:1424

#3  0x0000000000401e7e in main (argc=5, argv=0x7ffffff076f8) at main.c:184

(gdb)

Is this what you are seeing?
------- Comment #14 From 2006-10-26 15:25:05 -------
What I saw was a loop in globus_l_lsf_read_callback occurring because of a file
read error (probably because the log file being parsed was on NFS and was
renamed because of an LSF log rotation).

The fix is more complicated than I thought because the way the LSF log rotation
happens changed a bit between the version the SEG module was developed (LSF
5.1) and what you have at TACC (6.2). I'm working on an updated version of the
module which will handle the situation where LSF log rotation happens while the
SEG is running.

I think part of my difficulty in locating the problem was that, because of the
logging changes in LSF, the older log records weren't being parsed like I
expected from the SEG, so it only happened for me after the log file was
rotated.

joe
------- Comment #15 From 2006-10-27 14:12:02 -------
I've got a new version of the SEG module running on lslogin2 again. It is
honoring the starting timestamp (for job state changes which occurred when the
SEG was not monitoring them). Hopefully there will be a log rotation in the
next few hours so I can verify that it won't react badly to that.

joe
------- Comment #16 From 2006-10-30 07:54:41 -------
Created an attachment (id=1111) [details]
New LSF SEG package

I ran the SEG with this module over the weekend without troubles. There were
several LSF log rotations that occurred during that time, so I feel pretty
confident about this one.
------- Comment #17 From 2006-10-31 10:36:42 -------
Joe,
SEG LSF ran over night without any problems on lslogin2.
I informed Doru at NCSA about this patch for their LSF system.
Could this patch help PBS, SGE, or Loadleveler SEG's?  Or is it a LSF only
issue?
I am closing this ticket.
Thanks again,
David Carver
------- Comment #18 From 2006-10-31 11:22:06 -------
I've made similar fixes for the memory leak problems for the fork, pbs, and
condor SEG modules that are part of the toolkit distribution and committed
these all to CVS in the 4.0 branch and trunk. These all contained the 1st and
2nd leak problems I patched in the LSF module, but not the other issues. I'm
not sure of where the other SEG modules come from, but they likely have the
same leaks (though they may not leak at the same rate as you were seeing). The
fixes for those would be
- replace the first argument to globus_callback_register_oneshot with NULL
- add a call to the increase_buffer function in the clean_buffer within the if
statement
- in increase buffer, only increment buffer_length in the if statement where
the resizing actually occurs.

I've put the update source packages on the ftp site and added advisories for
these to http://www.globus.org/toolkit/advisories.html


joe
------- Comment #19 From 2006-11-01 12:04:43 -------
This resolved the problem for TeraGrid Ticket 130032.