Eclipse Community Forums
EOFException in CrawlThread [message #652876] Mon, 07 February 2011 08:01
Andrej Rosenheinrich
Messages: 22
Registered: August 2010
Junior Member
Hi again,

Looking at the log files, we regularly see the following exception:

2011-02-04 21:00:24,326 ERROR [Thread-39 ] impl.CrawlThread - Error while processing record with Id whatever of dataSourceId
org.eclipse.smila.connectivity.framework.CrawlerException: org.eclipse.smila.connectivity.framework.CrawlerException: java.io.EOFException
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:361)
at org.eclipse.smila.connectivity.framework.util.internal.DataReferenceImpl.getRecord(DataReferenceImpl.java:100)
at org.eclipse.smila.connectivity.framework.impl.CrawlThread.processDataReferences(CrawlThread.java:352)
at org.eclipse.smila.connectivity.framework.impl.CrawlThread.run(CrawlThread.java:235)
Caused by: org.eclipse.smila.connectivity.framework.CrawlerException: java.io.EOFException
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.deserializeIndexDocument(WebCrawler.java:830)
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getRecord(WebCrawler.java:577)
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.getMObject(WebCrawler.java:359)
... 3 more
Caused by: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2553)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1296)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
at java.util.ArrayList.readObject(ArrayList.java:593)
at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1848)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1946)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1870)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1752)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1328)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:350)
at org.eclipse.smila.connectivity.framework.crawler.web.WebCrawler.deserializeIndexDocument(WebCrawler.java:828)


This exception appears when crawling with one thread as well as with multiple threads. The crawls run for several hours using the same dataSourceId, so if the config file were wrong, the exception should be thrown from the beginning and more often. Is there any explanation for this behavior? Is this a problem with SMILA or with the underlying operating system?

Thanks,
Andrej
Re: EOFException in CrawlThread [message #652905 is a reply to message #652876] Mon, 07 February 2011 09:39
Daniel Stucky
Messages: 35
Registered: July 2009
Member
Hi,

I have no real idea what the problem is here, but there may be an issue with the caching of the crawled files. An MD5 hash of the URL is used as the file name. Perhaps there is a problem when these files are overwritten (possibly with changed content), since you said you run several crawls of the same data source. Or there may be a conflict when two threads read from and write to the same file.
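
To illustrate the suspected failure mode: a minimal sketch, assuming the cache files are plain Java-serialized objects and that a reader can see a file before its writer has finished. The class and method names below are hypothetical and not part of SMILA. Cutting off the end of a serialized ArrayList makes ObjectInputStream fail with java.io.EOFException, the same exception type as in the trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical demo class, not SMILA code.
class TruncatedCacheDemo {

    /** Serializes a value to bytes, standing in for a cache file's content. */
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Returns true if deserializing the given bytes ends in an EOFException. */
    static boolean failsWithEof(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> links = new ArrayList<>(
                Arrays.asList("http://example.org/a", "http://example.org/b"));
        byte[] complete = serialize(links);
        // Simulate a reader that sees the file before the last byte was written.
        byte[] partial = Arrays.copyOf(complete, complete.length - 1);

        System.out.println("complete data fails with EOF: " + failsWithEof(complete));
        System.out.println("partial data fails with EOF:  " + failsWithEof(partial));
    }
}
```

If this turns out to be the cause, a common way to avoid it is to write each file under a temporary name and rename it into place only once it is complete. Whether that fits the crawler's cache handling is an open question.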

You should add some debug output to the de-/serializeIndexDocument methods of class WebCrawler to see which files cause this problem.
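
Such debug output could look roughly like this. It is only a sketch: readCached is a hypothetical stand-in for the deserialize step, not the actual WebCrawler method, and java.util.logging is used here just to keep the example self-contained (SMILA's own logging setup may differ). The idea is to record the file path, its size, and the current thread, so a failing read can be traced back to a concrete cache file.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not SMILA code.
class CacheReadDiagnostics {

    private static final Logger LOG =
            Logger.getLogger(CacheReadDiagnostics.class.getName());

    /** Reads one serialized object and reports which file failed, if any. */
    static Object readCached(File file) throws IOException, ClassNotFoundException {
        String context = file.getAbsolutePath()
                + " (" + file.length() + " bytes, thread "
                + Thread.currentThread().getName() + ")";
        LOG.fine("Deserializing " + context);
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return in.readObject();
        } catch (IOException e) {
            // Keep the original exception as the cause, but add the file details.
            LOG.log(Level.SEVERE, "Failed to deserialize " + context, e);
            throw new IOException("Failed to deserialize " + context, e);
        }
    }
}
```

Logging the file size is deliberate: a size of zero, or one much smaller than that of neighbouring cache files, would point towards an incomplete write.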

Daniel
Re: EOFException in CrawlThread [message #653403 is a reply to message #652905] Wed, 09 February 2011 11:31
Andrej Rosenheinrich
Messages: 22
Registered: August 2010
Junior Member
Just running with the DEBUG option doesn't give any more information, unfortunately. As you suggested, I'll try to add some output to the de-/serializeIndexDocument methods of class WebCrawler. But what makes me wonder is the line "ERROR [Thread-39 ] impl.CrawlThread - Error while processing record with Id whatever of dataSourceId".
Here "whatever" is the name of the config file for the crawl, so shouldn't it appear as the dataSourceId rather than as the record Id? Is this error message misleading, or is it possible that those values sometimes get mixed up?
Re: EOFException in CrawlThread [message #654128 is a reply to message #653403] Mon, 14 February 2011 04:47
Daniel Stucky
Messages: 35
Registered: July 2009
Member
Yes, that exception message was screwed up. I just fixed it so that it includes both the record id and the data source id. Hopefully the corrected error message helps you track down your problem.

Daniel