Bug 1501

Summary: memory leak in ftp_client when using connection caching and credential delegation
Product: GridFTP Reporter: Andrei Hutanu <hutanu@zib.de>
Component: GridFTPAssignee: Joseph M Link <link@mcs.anl.gov>
Status: RESOLVED FIXED    
Severity: major CC: allcock@mcs.anl.gov, dangulo@cs.uchicago.edu, keahey@mcs.anl.gov
Priority: P2    
Version: 3.2.0   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Program reproducing the bug
added flush calls.. no effect
valgrind output
fixes leak in flush_url_state
fix leak when using same host for source and dest

Description From 2004-01-19 09:22:16
I am writing a small method that copies multiple files between two GridFTP
servers using the globus_ftp_client library version 1.10
In order to be efficient I was trying to use connection caching.
When the caching is activated there are around 200Kb in leaks 
for around 20 files copied. Any thoughts ?

Here is some code :

   <handles init with credential delegation>

   globus_mutex_init(&(my_task.mutex), GLOBUS_NULL);
   globus_cond_init(&(my_task.cond), GLOBUS_NULL);
   my_task.operation_error = (char*) globus_libc_malloc(sizeof(char)*3000);
for (int i = 0; i < size / 2; i++) {
    if (i == 0) {
      globus_ftp_client_handle_cache_url_state (&ftp_handle, URLs[2*i]);
      globus_ftp_client_handle_cache_url_state (&ftp_handle, URLs[2*i + 1]);
    }

    my_task.operation_done = 0;
    

#ifdef DEBUG
    printf("Copying from %s to %s\n", URLs[2*i], URLs[2*i + 1]);
#endif

    result = globus_ftp_client_third_party_transfer(&ftp_handle),
						    URLs[2*i],
                                                    &source_ftp_attr,
						    URLs[2*i + 1],
						    &dest_ftp_attr,
						    NULL,
						    my_delete_callback_ts,
						    (void*) &my_task);
    if (result != GLOBUS_SUCCESS) {
      retmsg = (char*)
globus_libc_malloc(strlen(globus_object_printable_to_string(globus_error_get(result)))
+ 100);
      assert(retmsg);
      globus_libc_sprintf(retmsg, "300 Copy error: %s\n",
			  globus_object_printable_to_string(globus_error_get(result)));
      goto return_retmsg;
    }

    globus_mutex_lock(&(my_task.mutex));
  
    while(!my_task.operation_done)
      globus_cond_wait(&(my_task.cond), &(my_task.mutex));

    globus_mutex_unlock(&(my_task.mutex));

    if (strcmp(my_task.operation_error, "200 OK") != 0) {
      retmsg = strdup(my_task.operation_error);
      assert(retmsg);
      goto return_retmsg;
    }

  }
  
  retmsg = strdup("200 OK");

 return_retmsg:


  globus_ftp_client_operationattr_destroy(&dest_ftp_attr);
  
  globus_ftp_client_operationattr_destroy(&source_ftp_attr);
  
  globus_ftp_client_handle_destroy (&ftp_handle);

  globus_ftp_client_handleattr_destroy(&ftp_handleattr);

  globus_libc_free(my_task.operation_error);
  globus_cond_destroy(&(my_task.cond));
  globus_mutex_destroy(&(my_task.mutex));

  return retmsg;
------- Comment #1 From 2004-01-27 02:32:42 -------
Can you try using:

globus_result_t
globus_ftp_client_handleattr_set_cache_all(
    globus_ftp_client_handleattr_t *     attr,
    globus_bool_t                        cache_all)

instead of

globus_ftp_client_handle_cache_url_state()

to see if that narrows this down?  If it still occurrs, it would be useful if 
you could provide a simplified test program that reproduces the problem.

Thank you,
Joe 
------- Comment #2 From 2004-07-13 10:05:01 -------
Created an attachment (id=409) [details]
Program reproducing the bug

This should reproduce the problem. It's not the cleanest code but 
the leak occurs in the loop starting at line 208.

Andrei
------- Comment #3 From 2004-07-13 11:39:08 -------
I am not seeing any runtime leaks (other than the couple caused by your 
strdups).  All other memory is still being referenced.  The majority of this 
memory will be freed when you call globus_module_deactivate.

If this is to be a long running application, then you will need to manage the 
number of cached urls, as they do not 'timeout' themselves.  You can still use 
globus_ftp_client_handleattr_set_cache_all() to cause everything to 
automatically be cached, but you'll need a max-cached and a timeout mechanism 
(look at globus_callback.h for periodic and delayed callbacks).  Simply call 
globus_ftp_client_handle_flush_url_state() to stop caching a specific url.  The 
cache mechanism only matches on scheme, host, and port.  So, you'll need to 
strip all paths from the urls you transfer before adding them to your 
cache 'manager'.

I just noticed a bug in the globus_ftp_client_handle_flush_url_state() call 
that could cause unpredictable behavior.  Before using it, in 
globus_ftp_client_handle.c, at line 1710 change 'parsed_url' to 'searcher'.

Let me know if this resolves your problem.
Joe
------- Comment #4 From 2004-07-22 05:31:26 -------
Created an attachment (id=412) [details]
added flush calls.. no effect

globus_ftp_client_handle_flush_url_state doesn't help.
The multiple_copy method still eats up memory and the connections to the server

are being maintained.

Andrei
------- Comment #5 From 2004-07-22 11:49:18 -------
Can you confirm that you applied that change to the flush_url_state() function 
that I mentioned in the previous note.

Joe
------- Comment #6 From 2004-07-22 12:25:14 -------
Also, a quick note. When you call flush_url_state() or handle_destroy(), the 
cached objects are destroyed asynchronously in the back ground.  It is not 
likely that you'll see the memory drop for a few seconds as the destruction is 
done gracefully by sending the ftp quit, waiting for the response, etc.  If 
this is a non-threaded build, then you wont see _any_ decrease until the next 
time the event system is polled, which in your app only happens when you call 
globus_cond_wait()... and since you're not explicity waiting for those 
background events (you cant), your app may actually complete before those 
things complete their shutdown.  If you add a globus_module_deactivate_all() 
just before you main() exits, you will guarantee that all resources are 
reclaimed before that call returns.

In any case, if you still think you're seeing leaks, can you run your app under 
valgrind or some other leak detecting tool?  As I've said, I do not see any 
leaks when I run the code.

Thanks,
Joe
------- Comment #7 From 2004-07-23 06:09:44 -------
Created an attachment (id=413) [details]
valgrind output

Attached the valgrind report. Memory leaks are only a few. It seeems
that the majority of the allocated memory is still reachable.
Unfortunatelly, 
 globus_ftp_client_handle_flush_url_state(&(the_struct->ftp_handle),
					       URLs[i]);
 on each URL used in a transfer set followed by the
  handle destroy 
 globus_ftp_client_handle_destroy (&(the_struct->ftp_handle));

doesn't free that memory.
module_deactivate cannot be the only solution to free the memory allocated by
the operations since this is only a part of a multi-threaded application that
uses
the ftp module in other places.

You probably don't see the leaks because you are not giving the right
parameters to the application (list of source, destination URL pairs) -- my
fault sorry.

Something like (use more URL's to see the leaky behavior even with top):

./a.out
gsiftp://litchi.zib.de/data/gridlab/viz_input/short_herrmann_pi2/ADM-BS_Gx_3D_diagonal.xg
gsiftp://litchi.zib.de/tmp/aa/ADM-BS_Gx_3D_diagonal.xg
gsiftp://litchi.zib.de/data/gridlab/viz_input/short_herrmann_pi2/ADM-BS_Gx_maximum.xg
gsiftp://litchi.zib.de/tmp/aa/ADM-BS_Gx_maximum.xg
gsiftp://litchi.zib.de/data/gridlab/viz_input/short_herrmann_pi2/ADM-BS_Gx_minimum.xg
gsiftp://litchi.zib.de/tmp/aa/ADM-BS_Gx_minimum.xg
gsiftp://litchi.zib.de/data/gridlab/viz_input/short_herrmann_pi2/ADM-BS_Gx_norm1.xg
gsiftp://litchi.zib.de/tmp/aa/ADM-BS_Gx_norm1.xg
gsiftp://litchi.zib.de/data/gridlab/viz_input/short_herrmann_pi2/ADM-BS_Gx_norm2.xg
gsiftp://litchi.zib.de/tmp/aa/ADM-BS_Gx_norm2.xg

the build is multithreaded.

Andrei
------- Comment #8 From 2004-07-23 12:12:51 -------
Created an attachment (id=414) [details]
fixes leak in flush_url_state

Ok, now I see why you're seeing the leak and not me.  You're using the same
host/port for the source and destination.  This causes the cache to act
differently than it normally would since it only allows one entry per host/port
pair.  It's also a rare case since there is no point (other than testing ;-))
in doing a transfer between two servers on the same machine.

I will track down and fix this leak.  If you'd like to test this theory, you
can change the source hostname to the actual ip in every url you use and verify
that the leak is gone.

Also, your use of flush_state flushed out another leak that is fixed by the
attached patch.

Joe
------- Comment #9 From 2004-07-23 12:54:52 -------
Created an attachment (id=415) [details]
fix leak when using same host for source and dest


This should fix everything.  An update package will be made available on our
advisories page soon.