From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx]
On behalf of Daniel Hänßgen
Sent: Friday, 15 August 2014 11:08
To: Smila project developer mailing list
Subject: [smila-dev] Problems with RecoverableExceptions
Hello everyone,
we implemented a crawler able to walk through an Atlassian Confluence instance. When we ran into an error we threw a RecoverableException, because those errors seemed to be only temporary and most of them were time-outs
when the Confluence server wasn’t able to answer fast enough.
Using this method we found some strange behavior that might not be intended:
First of all: when the number of retries is reached, the whole job is terminated (not just the affected record), leaving the Solr index in an inconsistent state. (Some data is indexed, some is not, and when using the delta worker,
data from the previous run was deleted that actually wasn’t meant to be deleted!)
Your job is terminated (“FAILED”) because it is defined and running in “runOnce” mode.
The most important aspects of this are:
· For “runOnce” Jobs, there’s only one Workflow Run for the whole Job Run
· A Workflow Run fails if one of its tasks fails
· A Job Run fails if there is no successful Workflow Run
(see http://wiki.eclipse.org/SMILA/Documentation/JobDefinitions and other job manager pages for details)
We circumvented this problem by always catching the error and dropping the record immediately. But this can’t be the final solution. Sometimes just a time-out occurs and a retry would be much appreciated!
If your crawler worker knows that the error is caused by a timeout and the task will succeed later on, you could catch the timeout in the worker and let the job
management repeat the task by throwing an org.eclipse.smila.taskworker.PostponeTaskException. This finishes the task with result POSTPONE, which causes a retry of the task.
This wasn’t really well documented in the wiki, POSTPONE is explained here:
http://wiki.eclipse.org/SMILA/Documentation/TaskManager
and I also added it here:
https://wiki.eclipse.org/SMILA/Documentation/HowTo/How_to_write_a_Worker#Exception_Handling_and_Logging
If you don’t want to repeat it endlessly that way, your worker could add (and check) a counter property/parameter to the task.
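A minimal sketch of how that could look, assuming the Worker/TaskContext API described in the How-to page above; the task property accessors (getTask().getProperties()), the property name “confluenceFetchRetries”, the PostponeTaskException constructor and the fetchFromConfluence() helper are illustrative assumptions, not the actual crawler code:

import java.net.SocketTimeoutException;
import java.util.Map;

import org.eclipse.smila.taskworker.PostponeTaskException;
import org.eclipse.smila.taskworker.TaskContext;
import org.eclipse.smila.taskworker.Worker;

public class ConfluencePageFetcherWorker implements Worker {

  // hypothetical property name used to count how often this task was postponed
  private static final String RETRY_COUNT_PROPERTY = "confluenceFetchRetries";
  private static final int MAX_RETRIES = 5;

  @Override
  public String getName() {
    return "confluencePageFetcher";
  }

  @Override
  public void perform(final TaskContext taskContext) throws Exception {
    try {
      fetchFromConfluence(taskContext); // the actual crawl/fetch logic
    } catch (final SocketTimeoutException e) {
      // assumption: task properties are available as a mutable Map<String, String>
      final Map<String, String> properties = taskContext.getTask().getProperties();
      final String value = properties.get(RETRY_COUNT_PROPERTY);
      final int retries = (value == null) ? 0 : Integer.parseInt(value);
      if (retries < MAX_RETRIES) {
        properties.put(RETRY_COUNT_PROPERTY, String.valueOf(retries + 1));
        // finish the task with result POSTPONE so the task manager retries it later
        throw new PostponeTaskException("Confluence did not answer in time, postponing task");
      }
      // retries exhausted: log and drop/skip this record instead of failing the whole runOnce job run
    }
  }

  private void fetchFromConfluence(final TaskContext taskContext) throws Exception {
    // fetch the page object from Confluence and write the result records here
  }
}

Whether the incremented property actually survives the postpone depends on how the task manager stores and redelivers the task, so that part would need to be verified against the real implementation.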
What we also noticed was that after the job was marked FAILED, some errors appeared in the log:
2014-08-05 09:53:58,040 WARN [pool-4-thread-7 ] taskworker.DefaultTaskLogFactory - Task df1acc5f-940b-49fc-8d0b-67f0c4ad0561: Task 'df1acc5f-940b-49fc-8d0b-67f0c4ad0561' for
job 'crawlConfluence' and run '20140805-095231322337' is unknown, maybe already finished or workflow run was canceled.
org.eclipse.smila.jobmanager.exceptions.IllegalJobStateException: Task 'df1acc5f-940b-49fc-8d0b-67f0c4ad0561' for job 'crawlConfluence' and run '20140805-095231322337' is unknown, maybe already finished or workflow run
was canceled.
After searching we found out that these errors may be caused by other workers that were still working while the actual job had already failed. So the log is their way of saying: “we noticed that the job has failed”. Is that
correct?
Due to the kind of (asynchronous) processing in job management, a worker doesn’t know anything about the state of a job run.
So the worker just finishes its task, even if the job run is already gone (FAILED, CANCELED).
And when the worker finishes the task and sends its answer to the task/job manager, the job management can’t find any information about that task anymore, because the job run is gone, which causes the cleanup
of the whole job run state data (which is stored in ZooKeeper). That’s the reason for the exception.
But the really troubling errors were these:
2014-08-05 09:53:51,657 ERROR [pool-4-thread-1 ] taskworker.DefaultTaskLogFactory - Task 2910cefa-4a02-48ff-b4b3-c2666f0b854d: Error while executing task 2910cefa-4a02-48ff-b4b3-c2666f0b854d
in worker
com.eccenca.importing.confluence.worker.ConfluenceObjectFetcherWorker@6481c861: Object with id 'pageBucket/257543c8-b090-4f34-848a-2e63b0863b1c0' does not exist in store 'temp'.
org.eclipse.smila.objectstore.NoSuchObjectException: Object with id 'pageBucket/257543c8-b090-4f34-848a-2e63b0863b1c0' does not exist in store 'temp'.
All of a sudden some records were missing, leaving the objectstore in an inconsistent state. And when we restarted the job, those errors occurred again. So there is some cleanup missing.
All temp data from transient buckets should be cleaned up if a job run finishes.
But I observed a (maybe) similar behaviour which I thought might be a bug: tasks are not cleaned up from task management after a job run fails. Didn’t find the time to have a closer
look yet...
Our questions: Is there a way (or is there something planned) to have those RecoverableExceptions not cause a brutal failure? Something like: “Drop after n retries”. And is the last problem described concerning the
objectstore maybe a bug?
Feel free to enter a bugzilla issue. It would be great if you could reproduce it in a way that uses built-in SMILA workers, so we have a chance to reproduce it, too.
BTW, any plans to contribute your worker to SMILA? ;)
Best regards,
Andreas