Hi Am 12.08.2013 um 14:02 schrieb Ralph Böhme <rb netafp com>:
Am 12.08.2013 um 11:32 schrieb Martyn Russell <martyn lanedo com>:On 11/08/13 09:24, Ralph Böhme wrote:As I'm only really at the beginning of an in depth analysis, I can't say for sure that the hangs I see are the cause of this, but knowing there seems to exist a fundamental design flaw in tracker-extra-pdf, I'm asking for thoughts on this.So we use the parent/child set up because some PDFs take a REALLY long time to process and we have a 10 second window for them to be indexed.Yes, I know. My test data contains many, many very large PDFs (up 1000 pages).After that we kill the child process and return.Understood. Just for references, this happens in src/tracker-extract/tracker-extract-pdf.c:extract_content().We did this because we didn't want to kill the tracker-extract process all the time.I just stumbled across this one, didn't notice before. A 20 seconds watchdog timeout setup in src/tracker-extract/tracker-controller.c:metadata_data_new().In reality, this is actually what tracker-extract was built to do, so ...It seems in reality the 20 s watchdog timer is setup too early! Afair it is not setup before the actual extraction is started, but already at the moment the extraction event is received. Now if several extraction requests a received and all of these are files (e.g. PDFs) that take long to extract, the watchdog timers for requests still on the pending queue are already expiring, I see this sequence of events in the debug log: 12 Aug 2013, 11:55:49: Tracker: <--- [12|0] handle_method_call_get_metadata_fast (uri:'file:///tank/test/PDF/Security/Network%20Security%20With%20Openssl.pdf', mime:'application/pdf', index_fd:0) Extraction requests 1-11 are still being processes, so it takes some time before this one gets it shot here: 12 Aug 2013, 11:56:06: Tracker: Dispatching 'file:///tank/test/PDF/Security/Network%20Security%20With%20Openssl.pdf' in main thread Unfortunately this one takes more then 4 seconds, so the watchdog killer steps in: 12 Aug 2013, 11:56:10: Tracker-Critical **: Extraction task for 'file:///tank/test/PDF/Security/Network%20Security%20With%20Openssl.pdf' went rogue and took more than 20 seconds. Forcing exit. ...and tracker-extract gets restarted. Increasing WATCHDOG_TIMEOUT from 20 to a large value will workaround this, but a proper fix would be to setup the timer in src/tracker-extract/tracker-extract.c:dispatch_task_cb() or similar.Afaict, the right design would involve an exec() in the child and using some other IPC channel. I'll happily volunteer.Yea, so we are actually calling exit() in the child. See: extract_content_child_process()What has the exit() call to do with this? Afaict that's completely unrelated to the issue of using fork() in programs using glib.Thoughts?Are you sure it isn't a difficult PDF taking too long?Yes, heres a SBT from a stuck process: ...
now here's another one, different cause this time:
-----------------  lwp# 2 / thread# 2  --------------------
 fe6d4749 lwp_park (0, 0, 0)
 fe6cc93b mutex_lock_impl (fe773098, 0, 0, fe671657, 80c8000, 1) + 166
 fe6cca7e mutex_lock (fe773098) + 16
 fe670fb5 malloc   (2c, fe772000, fddee368, fe6cbb23, fe7730a4, 1000000) + 2d
 fe39d565 g_malloc (2c, fe772000, fddee398, fe6cbb23, fe7730a4, 0) + 25
 fe3e2d0f _g_gnulib_printf_parse (fcff4a6c, fddee420, fddee410, fe773098, fe773098, 0) + 2f
 fe3e36a0 _g_gnulib_vasnprintf (0, fddee46c, fcff4a6c, fddee58c, fddee480, 6) + 30
 fe3e4df3 _g_gnulib_vasprintf (fddee4bc, fcff4a6c, fddee58c, fddee4ac, fe447474, 80) + 33
 fe3de8a4 g_vasprintf (fddee4bc, fcff4a6c, fddee58c, 2c, fde1a000, fe772000) + 34
 fe3b7962 g_strdup_vprintf (fcff4a6c, fddee58c, 0, fe6cbcc9, fe7730a4, 1000000) + 32
 fe39ecaa g_logv   (fcff4fa9, 80, fcff4a6c, fddee58c, fdce0240) + 3a
 fe39f022 g_log    (fcff4fa9, 80, fcff4a6c, fe6c34ee, fe777000, 0) + 32
 fcff2728 extract_content_child_cleanup (12, 0, fddee6b0, fe6ae447, 12, fe772000) + 38
 fe6d47d5 __sighndlr (12, 0, fddee6b0, fcff26f0) + 15
 fe6c7dae call_user_handler (12) + 2d2
 fe6c8007 sigacthandler (12, 0, fddee6b0) + ee
 --- called from signal handler with signal 18 (SIGCLD) ---
 fe671033 _malloc_unlocked (5) + 43
 fe670fc0 malloc   (5, fe27cac2, fddeec88, fe3dd554, 884e088, 8085800) + 38
 fe39d565 g_malloc (5, 0, 4, fe3cf3ac, 884e088, fe27cac1) + 25
 fe3b756b g_strdup (819a800, 0, fddeecd8, fe23403c, 857d950, 0) + 3b
 fe246a88 _g_dbus_method_invocation_new (819a800, 8105320, 813fa50, 8110288, 80c4da8, 8085800) + 188
 fe22c588 validate_and_maybe_schedule_method_call (1, 0, 80bfdd0, 80c2070, 80bff98, 8083058) + e8
 fe22d361 on_worker_message_received (808ab10, 8091c30, 8085800, fddeee08, fe447474, 807e590) + 4b1
 fe242c0e _g_dbus_worker_do_read_cb (8088828, 8090618, 808ab10, fe6cd084, 80819b0, 0) + 72e
 fe1d5515 g_simple_async_result_complete (8090618, 0, fddeee88, 8087358, 85b0ac0) + b5
 fe1d557d complete_in_idle_cb (8090618, fe447474, fddeeef8, fe447474, fe3df359) + 1d
 fe393c71 g_idle_dispatch (85b0ac0, fe1d5560, 8090618, fddeeed0, 8081ab0, 1) + 21
 fe397285 g_main_context_dispatch (808dca8, 0, 8081ab0, 1, 3, 0) + 135
 fe397648 g_main_context_iterate.isra.21 (1, 808dca8, 0, fe397aaa, fe2a189c, 8088488) + 218
 fe397b23 g_main_loop_run (80884b8, 807db80, 0, fe6cd275, fe447474, 807db80) + 83
 fe24070c gdbus_shared_thread_func (8088488, 807db80, 208, fe6d37e9, 0, 0) + 2c
 fe3c0442 g_thread_proxy (807db80, fe772000, fddeefe8, fe6d43f9) + 72
 fe6d444c _thrp_setup (fdce0240) + 9d
 fe6d46f0 _lwp_start (fdce0240, 0, 0, 0, 0, 0)
-----------------  lwp# 3 / thread# 3  --------------------
Looks like we're using malloc() in a signal handler. That must be l.469 in tracker-extract-pdf.c:
        g_debug ("Parent:   Zombie %d reaped", child_pid);
I wonder though why this is evaluated, as my tracker log setting are only "detailed":
root beast:~# tracker-control --get-log-verbosity
Komponenten:
  Store    : detailed
  Extract  : detailed
  Writeback: detailed
Indizierer (Nur jene, die in der Konfiguration aufgelistet sind):
  Files    : detailed
root beast:~# 
I'll run another test with the log statements removed and report back.
-r
-- 
Ralph Böhme <rb netafp com>
Netatalk Developer | Support | Services
Curslacker Deich 254, 21039 Hamburg, Germany
http://www.netafp.com/
Attachment:
smime.p7s
Description: S/MIME cryptographic signature