Apache 1.3.33 Tomcat 5.0.30 mod_jk 1.2.14 Solaris 9 Hi, There seems to be a problem in supporting Failover of a failed Tomcat server on Solaris 9 with mod_jk 1.2.14. When I simulate the failed Tomcat server by pulling the network cable, the worker will do all the appropriate socket_timeouts and retries, but when that fails, as it should, another worker is not chosen, and the failed worker is not put in_error. Here is a snippet from the log: [trace] service::jk_lb_worker.c (551): enter [trace] get_most_suitable_worker::jk_lb_worker.c (453): enter [debug] get_most_suitable_worker::jk_lb_worker.c (539): found best worker (giraffe) using by request method [trace] get_most_suitable_worker::jk_lb_worker.c (543): exit [debug] service::jk_lb_worker.c (587): service worker=giraffe jvm_route=giraffe [trace] ajp_service::jk_ajp_common.c (1630): enter [debug] ajp_service::jk_ajp_common.c (1670): processing with 3 retries [info] ajp_service::jk_ajp_common.c (1749): Sending request to tomcat failed, recoverable operation attempt=1 [info] ajp_service::jk_ajp_common.c (1749): Sending request to tomcat failed, recoverable operation attempt=2 [info] ajp_service::jk_ajp_common.c (1749): Sending request to tomcat failed, recoverable operation attempt=3 [error] ajp_service::jk_ajp_common.c (1758): Error connecting to tomcat. Tomcat is probably not started or is listening on the wrong port. worker=giraffe failed [trace] ajp_service::jk_ajp_common.c (1768): exit <log ends> The ajp_service::jk_ajp_common.c method is called by the service::jk_lb_worker.c method, but there are no more log messages after ajp_service returns. We should at least see an exit trace log for the service method. I put in logging statements in the code, and found the offending lines. From jk_lb_worker.c ln. 603-607: service_stat = end->service(end, s, l, &is_service_error); /* IT IS ONE OF THESE TWO LINES THAT CAUSES THE THREAD TO DIE OR HANG */ rec->s->readed += end->rd; rec->s->transferred += end->wr; end->done(&end, l); A logging message directly after the end->service method is called is seen, but one right before the end->done method is called is not. In any case, if I comment out the two lines that update the shared memory, everything works as expected.
Created attachment 16117 [details] A full log file showing one failed request
I would like to see your workers.properties and the Jk directives in your httpd configuration. Also: Could you please check, if the failing Apache child processes are still there, or if they died? The log chows the PID of the relevant process directly behind the date. If the process is still there, you can use /usr/proc/bin/pstack PID to write a thrad dump. That way we can confirm, in which function the hanging process sits. You could attach the pstack. If the process is not there any more, it might have crashed. In case you succeed in dumping a core file of the process, you can again use pstack on that core file. Finally it might be interesting to use truss (or if you are already familiar with it dtrace) to find out, if the error is related to some system call. Configure apache to only use very few children (like startng 2 servers and having sparemax and min also equals to 2). Start the server via truss -f -o some_output_file -w all -r all -v all \ /usr/local/apache/bin/apachectl start and then redo your tests (apache will be a little slow). The file some_output_file contains information about all system calls, about bytes read and written, signals received, errnos etc. To have a somewhat easier test case: In the log you attached one can see, that the working request for the html produces to image requests going to two apache processes in parallel which both fail. I ould be a simpler retest to only use single requests, not something in parallel, e.g. not automatically reloading embedded objevcts via the browser.
Created attachment 16119 [details] workers.properties
Created attachment 16120 [details] Mod_jk configuration file
The pstack returned nothing, and I confirmed by watching TOP that the process is created then dies when the http request is over. I am having trouble running truss, I'll get back to you with that when I can.
Config is pretty basic and looks OK. The process dead is strange and I don't see an immediate reason for it. It would be good, if you could reproduce it with without parallel requests to simplify the situation. Did you build apache and mod_jk yourself? Have they been compiled using the same compiler and the same CFLAGS?
Sorry for all this. We were building on a seperate machine using the gcc compiler, because our primary Sun build machine wasnt working at the time. Now it is and I recompiled on it, and everything is hunkey-dorey now.
FYI: The compiler we used was the Forte from Sun, the compiler flag is: CFLAGS="-mt -fast -xarch=v8plusa -xtarget=ultra2 -xO5 -KPIC" LDFLAGS="-L/lib -R/usr/local/lib -R/usr/lib -R/lib -lpthread"
Most likely this is a duplicate of 36525 and fixed in the already released 1.2.15. Since it was a 64Bit alignment bug, some compilers might show the bug, others not.