Eclipse Community Forums
Flushing records too late [message #660190] Thu, 17 March 2011 09:50
SMILANewBee
Hello,

The web crawler crawls some pages, and the CrawlThread collects them into a record buffer. When the record buffer is full, it is flushed only once the next record is processed. If the crawler waits a long time for new pages, the buffer is not flushed during that wait.

This could easily be handled by moving the flush statement to the top of "processDataReference".

Have we understood this problem correctly?
Re: Flushing records too late [message #660259 is a reply to message #660190] Thu, 17 March 2011 15:17
Daniel Stucky
Hi,

I'm not sure which flush statement you want to move.
- flushRecords() is called in run() after the CrawlThread has been stopped, to flush any remaining records. It is also called within checkForFlush() if the buffer size or the time limit is reached.
- checkForFlush() is called in processDataReferences() before a dataReference is processed.

The waiting you described occurs within run() when
dataReferences = _crawler.getNext();
is called. So I don't think you will gain anything by moving any of the flush methods.
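
To make the control flow concrete, here is a simplified model of the loop described above. It is only a sketch and not the SMILA source: the class CrawlLoopSketch, the END marker, the BlockingQueue standing in for _crawler, and the String records are all assumptions made for illustration. Only the method names run(), processDataReferences(), checkForFlush() and flushRecords() come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Simplified, hypothetical model of the CrawlThread flow; not the SMILA source. */
final class CrawlLoopSketch {
    /** Hypothetical end-of-crawl marker, compared by identity. */
    static final String[] END = new String[0];

    /** Stands in for _crawler; take() blocks the way getNext() is described to. */
    final BlockingQueue<String[]> crawler = new LinkedBlockingQueue<>();
    final List<String> buffer = new ArrayList<>();
    final List<List<String>> flushed = new ArrayList<>();

    private final int bufferSize;
    private final long flushIntervalNanos;
    private long lastFlush = System.nanoTime();

    CrawlLoopSketch(int bufferSize, long flushIntervalMillis) {
        this.bufferSize = bufferSize;
        this.flushIntervalNanos = flushIntervalMillis * 1_000_000L;
    }

    void run() throws InterruptedException {
        while (true) {
            // The thread waits here; nothing below runs until data arrives.
            String[] dataReferences = crawler.take();
            if (dataReferences == END) {
                break;
            }
            processDataReferences(dataReferences);
        }
        flushRecords(); // flush whatever is left once crawling has stopped
    }

    void processDataReferences(String[] dataReferences) {
        for (String ref : dataReferences) {
            checkForFlush(); // reached only when there is a reference to process
            buffer.add(ref);
        }
    }

    void checkForFlush() {
        boolean full = buffer.size() >= bufferSize;
        boolean timedOut = System.nanoTime() - lastFlush >= flushIntervalNanos;
        if (full || timedOut) {
            flushRecords();
        }
    }

    void flushRecords() {
        if (!buffer.isEmpty()) {
            flushed.add(new ArrayList<>(buffer));
            buffer.clear();
        }
        lastFlush = System.nanoTime();
    }
}
```

In this model a full buffer stays unflushed until the next reference arrives, and an empty array passes through processDataReferences() without ever reaching checkForFlush(), so moving the check inside that method does not help while the thread is waiting.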

To ensure that records are buffered for at most "flushinterval" seconds, we would have to implement it in another way, using a separate thread to trigger the flush once the time has elapsed.
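
One possible shape for such a separate flush thread, as a minimal sketch only: it uses the JDK's ScheduledExecutorService rather than anything from SMILA, and the class TimedFlushBuffer, its constructor parameters and the sink callback are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical record buffer with a timer-driven flush; not SMILA code. */
final class TimedFlushBuffer<T> implements AutoCloseable {
    private final List<T> buffer = new ArrayList<>();
    private final int maxSize;
    private final Consumer<List<T>> sink;
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    TimedFlushBuffer(int maxSize, long flushIntervalMillis, Consumer<List<T>> sink) {
        this.maxSize = maxSize;
        this.sink = sink;
        // The timer thread flushes periodically, even while the crawl
        // thread is blocked waiting for new pages.
        timer.scheduleAtFixedRate(this::flush, flushIntervalMillis,
                flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called by the crawl thread for every new record. */
    synchronized void add(T record) {
        buffer.add(record);
        if (buffer.size() >= maxSize) {
            flush();
        }
    }

    /** Hands all buffered records to the sink; callable from either thread. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        sink.accept(batch);
    }

    /** Stops the timer and flushes whatever is still buffered. */
    @Override
    public void close() {
        timer.shutdown();
        flush();
    }
}
```

Both add() and flush() are synchronized because the crawl thread and the timer thread share the buffer. Note that scheduleAtFixedRate stops rescheduling a task once it throws, so a real sink would need to handle its own exceptions.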
Re: Flushing records too late [message #660357 is a reply to message #660190] Fri, 18 March 2011 05:34
SMILANewBee
Hello,

I meant the second case, where "checkForFlush" is called in the "processDataReferences" method.

You are right. We have modified the "WebCrawler" class in such a way that the crawler returns an empty array if nothing was crawled in an iteration. So in this case no flush is performed.

But your idea with the thread is great, because it makes sure that the flush interval is met independently of the crawling.