Re: [smila-dev] Problems with RecoverableExceptions

Hi Andreas,

 

Okay, I see. But to better understand the whole workflow, let me ask again: is it intentional that the task ID changes within one worker? I thought (and the documentation says so) that a worker works on one specific task, and when that task is finished another worker is used to perform the next task. (And of course multiple tasks are processed in parallel by multiple workers, but the number of tasks currently being processed equals the number of workers currently performing.)

 

Kind Regards

Daniel

 

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Andreas Weber
Sent: Wednesday, August 20, 2014 3:17 PM
To: Smila project developer mailing list
Subject: Re: [smila-dev] Problems with RecoverableExceptions

 

Hi Daniel,

 

You shouldn't use the objectstore for storing intermediate worker states.

As I said, I would recommend implementing it on the task level.

 

You could implement your crawler so that the number of records per task is reduced or made configurable via crawler parameters (see the SMILA file crawler parameters maxFilesPerBulk/minFilesPerBulk/directoriesPerBulk).

And a timeout shouldn't happen that often, so it should be no problem to repeat the whole task if one record gets a timeout.

(If such timeouts really do happen very often then, without knowing the details, I would suspect that something is wrong with the general idea/design.)
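As an illustration of the idea (plain Java only, no SMILA API; splitIntoBulks, recordKeys and maxRecordsPerBulk are made-up names, not existing crawler code), the crawler could cut the collected record keys into bulks of a configurable size, so that repeating one task redoes only a small amount of work:

import java.util.ArrayList;
import java.util.List;

public final class BulkSplitter {

  /**
   * Splits the record keys collected by a crawl into bulks of at most
   * maxRecordsPerBulk entries (assumes maxRecordsPerBulk > 0); one bulk then becomes one task.
   */
  public static <T> List<List<T>> splitIntoBulks(final List<T> recordKeys, final int maxRecordsPerBulk) {
    final List<List<T>> bulks = new ArrayList<>();
    for (int i = 0; i < recordKeys.size(); i += maxRecordsPerBulk) {
      final int end = Math.min(i + maxRecordsPerBulk, recordKeys.size());
      bulks.add(new ArrayList<>(recordKeys.subList(i, end)));
    }
    return bulks;
  }
}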

 

Best regards,

Andreas

 

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Daniel Hänßgen
Sent: Wednesday, August 20, 2014 11:05 AM
To: Smila project developer mailing list
Subject: Re: [smila-dev] Problems with RecoverableExceptions

 

Hi Andreas,

 

Not sure whether the following problems have anything to do with the objectstore getting into an inconsistent state, but:

 

I tried to build a "drop after n retries" feature for our Confluence crawler and wanted to use the object store for it, because the counters of a task are not fine-grained enough (every record needs its own counter) and have to be declared up front.

So I created an object store for each worker, whose name consists of the task ID, the worker name and the suffix "-tries". And then something happened that I didn't expect:

The task ID changed during a perform() run of a worker. Here is more detail:

2014-08-20 10:39:59,867   worker.ConfluenceObjectFetcherWorker               INFO  [3-thread-2] –

WorkerName: ConfluenceObjectFetcherWorker Storename: 9b84876a-e93f-4d4f-a728-69bb9ec206daConfluenceObjectFetcherWorker-tries Current TaskId: 9b84876a-e93f-4d4f-a728-69bb9ec206da

2014-08-20 10:39:59,984   worker.ConfluenceObjectFetcherWorker               INFO  [3-thread-1] –

WorkerName: ConfluenceObjectFetcherWorker Storename: 9b84876a-e93f-4d4f-a728-69bb9ec206daConfluenceObjectFetcherWorker-tries Current TaskId: 54996208-f4e3-4a1a-90c1-93fceddfde43

 

The store name is created as described above: it is concatenated at the beginning of the worker's perform() method and stored in an instance variable of the worker object. The "Current TaskId" is returned from calling taskContext.getTask().getTaskId() at that very moment.
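For illustration, roughly what this looks like (a reconstruction, not the original source; the logger, field names and the TaskContext import are guesses):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.eclipse.smila.taskworker.TaskContext;

public class ConfluenceObjectFetcherWorker /* implements the SMILA Worker interface */ {

  private final Log _log = LogFactory.getLog(getClass());

  /** Instance variable of the worker object holding the concatenated store name. */
  private String _storeName;

  public String getName() {
    return "ConfluenceObjectFetcherWorker";
  }

  public void perform(final TaskContext taskContext) throws Exception {
    // concatenated once at the beginning of perform() ...
    _storeName = taskContext.getTask().getTaskId() + getName() + "-tries";

    // ... while "Current TaskId" is read from the TaskContext at the moment of logging:
    _log.info("WorkerName: " + getName() + " Storename: " + _storeName
        + " Current TaskId: " + taskContext.getTask().getTaskId());

    // ... fetching the Confluence objects and counting tries per record happens here ...
  }
}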

Is it correct that the task ID can change during a worker run? If so, I don't really have a way to clean up those object stores in which I store the number of tries per record…

 

Kind Regards,

Daniel

 

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Andreas Weber
Sent: Friday, August 15, 2014 4:49 PM
To: Smila project developer mailing list
Subject: Re: [smila-dev] Problems with RecoverableExceptions

 

Hi Daniel,

 

comments inline below.

 

Best regards,

Andreas

 

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Daniel Hänßgen
Sent: Friday, August 15, 2014 11:08 AM
To: Smila project developer mailing list
Subject: [smila-dev] Problems with RecoverableExceptions

 

Hello everyone,

 

We implemented a crawler that can walk through an Atlassian Confluence instance. When we ran into an error we threw a RecoverableException, because those errors seemed to be only temporary; most of them were timeouts when the Confluence server wasn't able to answer quickly enough.

Using this approach we found some strange behavior that might not be intended:

First of all: when the number of retries is reached, the whole job is terminated (not just the affected record), leaving the Solr index in an inconsistent state. (Some data is indexed, some is not, and when using the delta worker, data from the previous run was deleted that actually wasn't meant to be deleted!)

 

Your job is terminated ("FAILED") because it is defined/running in "runOnce" mode.

The most important aspects of this are:

· For "runOnce" jobs, there is only one Workflow Run for the whole Job Run

· A Workflow Run fails if one of its tasks fails

· A Job Run fails if there is no successful Workflow Run

(see http://wiki.eclipse.org/SMILA/Documentation/JobDefinitions and other job manager pages for details)

 

We circumvented this problem by always catching the error and dropping the record immediately. But this can't be the final solution; sometimes just a timeout occurs and a retry would be much appreciated!

 

      If your crawler worker knows that the error is caused by a timeout and the task will succeed later on, you could catch the timeout in the worker and let the job management repeat the task by throwing an org.eclipse.smila.taskworker.PostponeTaskException. This will finish the task with result POSTPONE, which will cause a retry of the task.

             This wasn't really well documented in the wiki; POSTPONE is explained here: http://wiki.eclipse.org/SMILA/Documentation/TaskManager and I also added it here: https://wiki.eclipse.org/SMILA/Documentation/HowTo/How_to_write_a_Worker#Exception_Handling_and_Logging

             If you don't want to repeat it endlessly that way, your worker could add (and check) a counter property/parameter to the task.
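             A minimal sketch of that (assuming a worker with a perform(TaskContext) method and the TaskContext type from the same org.eclipse.smila.taskworker package; fetchFromConfluence() and the exact PostponeTaskException constructor are illustrative assumptions, not the real crawler code):

import java.net.SocketTimeoutException;

import org.eclipse.smila.taskworker.PostponeTaskException;
import org.eclipse.smila.taskworker.TaskContext;

public class ConfluenceObjectFetcherWorker /* implements the SMILA Worker interface */ {

  public String getName() {
    return "ConfluenceObjectFetcherWorker";
  }

  public void perform(final TaskContext taskContext) throws Exception {
    try {
      fetchFromConfluence(taskContext); // the actual crawling work (placeholder)
    } catch (final SocketTimeoutException e) {
      // Finishing the task with result POSTPONE lets the task management deliver
      // the same task again later instead of counting it as a (recoverable) failure.
      throw new PostponeTaskException("Confluence did not answer in time, task will be retried");
    }
  }

  private void fetchFromConfluence(final TaskContext taskContext) throws SocketTimeoutException {
    // calls the Confluence REST API and writes the result records to the output bulk
  }
}

             To avoid endless retries that way, perform() could additionally read and increment a per-task retry counter (as suggested above) and drop the affected records once the limit is reached.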

 

What we also noticed was that after the job was marked FAILED, some errors appeared in the log:

2014-08-05 09:53:58,040 WARN  [pool-4-thread-7                              ] taskworker.DefaultTaskLogFactory              - Task df1acc5f-940b-49fc-8d0b-67f0c4ad0561: Task 'df1acc5f-940b-49fc-8d0b-67f0c4ad0561' for job 'crawlConfluence' and run '20140805-095231322337' is unknown, maybe already finished or workflow run was canceled.

org.eclipse.smila.jobmanager.exceptions.IllegalJobStateException: Task 'df1acc5f-940b-49fc-8d0b-67f0c4ad0561' for job 'crawlConfluence' and run '20140805-095231322337' is unknown, maybe already finished or workflow run was canceled.

 

After some searching we found out that these errors may be caused by other workers still working while the actual job has already failed. So the log is their way of saying: "we recognized that the job has failed". Is that correct?

 

             Due to the kind of (asynchronous) processing in job management, a worker doesn’t know anything about the state of a job run.

             So the worker just finishes its task, even if the job run is already gone (FAILED, CANCELED).

             And when the worker finishes the task and sends its answer to the task/job manager, the job management can't find any information about that task anymore, because the job run is gone, which triggers the cleanup of the whole job run state data (stored in ZooKeeper). That's the reason for the exception.

 

But the really troubling errors were these:

2014-08-05 09:53:51,657 ERROR [pool-4-thread-1                              ] taskworker.DefaultTaskLogFactory              - Task 2910cefa-4a02-48ff-b4b3-c2666f0b854d: Error while executing task 2910cefa-4a02-48ff-b4b3-c2666f0b854d in worker com.eccenca.importing.confluence.worker.ConfluenceObjectFetcherWorker@6481c861: Object with id 'pageBucket/257543c8-b090-4f34-848a-2e63b0863b1c0' does not exist in store 'temp'.

org.eclipse.smila.objectstore.NoSuchObjectException: Object with id 'pageBucket/257543c8-b090-4f34-848a-2e63b0863b1c0' does not exist in store 'temp'.

 

All of a sudden some records were missing, leaving the objectstore in an inconsistent state. And when we restarted the job, those errors occurred again. So some cleanup seems to be missing.

 

             All temp data from transient buckets should be cleaned up if a job run finishes.

             But I observed a (maybe) similar behaviour, which I thought might be a bug: tasks are not cleaned up from task management after a job run fails. I didn't find the time to take a closer look yet...

 

Our questions: Is there a way (or is something planned) to keep those RecoverableExceptions from causing a brutal failure? Something like "drop after n retries". And might the last problem described concerning the objectstore be a bug?

 

             Feel free to enter a Bugzilla issue. It would be great if you could reproduce it in a way that uses built-in SMILA workers, so we have a chance to reproduce it, too.

 

             BTW, any plans to contribute your worker to SMILA?   ;)

 

             Best regards,

               Andreas

 

 

_______________________________________________
smila-dev mailing list
smila-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit
https://dev.eclipse.org/mailman/listinfo/smila-dev