Flushing records too late [message #660190] Thu, 17 March 2011 05:50
Eclipse UserFriend
Hello,

The web crawler crawls some pages and the CrawlThread collects them in a record buffer. When the record buffer is full, it is flushed only after a new record is processed. If the crawler then waits a long time for new pages, the buffer is not flushed during that wait.

This could be handled easily by moving the flush statement to the top of "processDataReference".

Have we understood this problem correctly?
Re: Flushing records too late [message #660259 is a reply to message #660190] Thu, 17 March 2011 11:17
Eclipse UserFriend
Hi,

I'm not sure which flush statement you want to move.
- flushRecords() is called in run() after the CrawlThread has been stopped, to flush any remaining records. It is also called within checkForFlush() if the buffer size or the time limit is reached.
- checkForFlush() is called in processDataReferences() before a dataReference is processed.

The waiting you described occurs within run() when
dataReferences = _crawler.getNext();
is called. So I don't think you will gain anything by moving any of the flush methods.
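
To illustrate why moving the check does not help, here is a minimal sketch of the buffering logic as described above. It is a simplified, hypothetical model and not the actual SMILA source: the class name BufferedCrawlLoop, the String records and the explicit "now" timestamps are assumptions made only to keep the example self-contained and deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the buffering described above (not SMILA code). */
class BufferedCrawlLoop {
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> flushed = new ArrayList<>();
    private final int maxBufferSize;
    private final long flushIntervalMillis;
    private long lastFlush;

    BufferedCrawlLoop(int maxBufferSize, long flushIntervalMillis, long now) {
        this.maxBufferSize = maxBufferSize;
        this.flushIntervalMillis = flushIntervalMillis;
        this.lastFlush = now;
    }

    /** Called once per batch from the crawler; an empty batch never reaches checkForFlush(). */
    void processDataReferences(String[] dataReferences, long now) {
        for (String ref : dataReferences) {
            checkForFlush(now); // the only place where the size/time limits are evaluated
            buffer.add(ref);
        }
    }

    /** Flushes if the buffer size or the time limit is reached. */
    void checkForFlush(long now) {
        if (buffer.size() >= maxBufferSize || now - lastFlush >= flushIntervalMillis) {
            flushRecords(now);
        }
    }

    /** Hands the buffered records over (here: just remembers them) and resets the timer. */
    void flushRecords(long now) {
        if (!buffer.isEmpty()) {
            flushed.add(new ArrayList<>(buffer));
            buffer.clear();
        }
        lastFlush = now;
    }

    int buffered() { return buffer.size(); }

    int flushCount() { return flushed.size(); }
}
```

In this model, records added at time 100 stay in the buffer no matter how long the crawler blocks or how many empty batches it returns. They are only flushed when the next non-empty batch arrives, regardless of where inside processDataReferences() the check sits.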

To ensure that records are buffered for at most "flushinterval" seconds, we would have to implement it differently, using a separate thread to trigger the flush once the time has elapsed.
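
A rough sketch of what such a separate flush thread could look like, using the standard java.util.concurrent scheduler. Everything here is an assumption for illustration: TimedRecordBuffer, its add() method and the Consumer sink are hypothetical names, not existing SMILA classes, and the real CrawlThread would need its own synchronization and error handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical record buffer that a timer thread flushes independently of the crawler. */
class TimedRecordBuffer implements AutoCloseable {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBufferSize;
    private final Consumer<List<String>> sink;
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    TimedRecordBuffer(int maxBufferSize, long flushIntervalMillis,
                      Consumer<List<String>> sink) {
        this.maxBufferSize = maxBufferSize;
        this.sink = sink;
        // The timer fires even while the crawl thread blocks waiting for pages.
        // Note: if flushRecords() throws, the scheduler cancels further runs.
        timer.scheduleAtFixedRate(this::flushRecords,
            flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called by the crawl thread for each new record. */
    synchronized void add(String record) {
        buffer.add(record);
        if (buffer.size() >= maxBufferSize) {
            flushRecords();
        }
    }

    /** Called by both the timer thread and the crawl thread, hence synchronized. */
    synchronized void flushRecords() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Stops the timer and flushes any remaining records. */
    @Override
    public void close() {
        timer.shutdownNow();
        flushRecords();
    }
}
```

With a fixed-rate timer like this, a record should wait roughly one flush interval at most, apart from scheduling delays, whether or not the crawler delivers anything in the meantime. The size-based flush in add() is kept so that a fast crawl still flushes as soon as the buffer is full.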
Re: Flushing records too late [message #660357 is a reply to message #660190] Fri, 18 March 2011 01:34
Eclipse UserFriend
Hello,

I meant the second case, where "checkForFlush" is called in the "processDataReferences" method.

You are right. We have modified the "WebCrawler" class so that the crawler returns an empty array if nothing was crawled in an iteration. So in this case no flush is performed.

But your idea with the thread is great, because it makes sure that the flush interval is met independently of the crawling.