Serialized Java objects remains on disk after deserializing [message #655711]
Tue, 22 February 2011 12:05
Registered: August 2010
The web crawler consists of two threads. The first one downloads the content and post-processes it. The result is written to disk as a serialized Java object in the directory workspace/.metadata/.plugins/org.eclipse.smila.connectivity.framework.crawler.web.
The second thread reads the serialized objects and processes them further to add them to the blackboard. After the second thread has deserialized an object, it does not remove the file, so the directory swells with unnecessary data.
To prevent the swelling, one line can be added to the deserializeIndexDocument method of the WebCrawler class: after the object has been deserialized, the file can be deleted.
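A minimal sketch of the idea in plain Java, not the actual SMILA code: the class name and the deserializeAndDelete method are hypothetical, and it assumes that each file holds exactly one serialized object and that no other reader still needs the file.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.nio.file.Files;

// Hypothetical helper, not part of SMILA.
class DeserializeAndDelete {

    /** Reads one serialized object from the file, then removes the file. */
    static Object deserializeAndDelete(File file)
            throws IOException, ClassNotFoundException {
        Object result;
        // try-with-resources closes the stream before the delete below.
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            result = in.readObject();
        }
        // Delete only after the stream is closed; an open handle can
        // prevent deletion on some platforms (for example Windows).
        Files.deleteIfExists(file.toPath());
        return result;
    }
}
```

If reading fails, the exception leaves the method before the delete, so a file that could not be deserialized stays on disk for inspection.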
Re: Serialized Java objects remains on disk after deserializing [message #656247 is a reply to message #655897]
Thu, 24 February 2011 16:44
Originally posted by: juergen.schumacher.attensity.com
On 23.02.2011 at 09:22, Andrej Rosenheinrich wrote:
> Can someone confirm if those thoughts are right? Or would it cause
> side effects to delete those files?
Sorry, I'm not very familiar with the Web Crawler, so I'm not sure. From reading the code, I could imagine situations where a URL is visited twice during crawling and the second visit happens before the first visit is completely processed, so the file would not be recreated and the processing of the first visit would delete it before the second visit can read it. Or something like this. On the other hand, this sounds like dubious behaviour anyway (:
Daniel, Tom, do you know anything about this?