Eclipse Community Forums
Content type filter don't work to avoid unnecessary downloads [message #654608] Wed, 16 February 2011 08:36
SMILANewBee
Hello,

we noticed that the crawler downloads all content from the web. If a website contains a link to a Flash video, that video is downloaded. Whether the content is processed further depends on the filters: if the content type of Flash videos is filtered out, the video is not persisted. However, it has still been downloaded, so a lot of unnecessary content is transferred.

To prevent these downloads, the content type would have to be predicted before downloading. One way is to examine the extension of the URL: if the URL ends with flv, it can be sorted out. Another way is to ask the server for the content type of the resource. If the content type does not match the expected one, the web crawler ignores the link and no content is downloaded.
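
The extension check described above could look roughly like the following. This is only a minimal sketch in Python, not SMILA code: the function name has_blocked_extension and the blocklist are hypothetical, and only the flv extension comes from this post.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical blocklist. Only "flv" is mentioned in this thread;
# ".swf" is added purely as an illustration.
BLOCKED_EXTENSIONS = {".flv", ".swf"}


def has_blocked_extension(url, blocked=BLOCKED_EXTENSIONS):
    """Return True if the path of the URL ends with a blocked extension."""
    # Look only at the path so that query strings and fragments are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in blocked
```

A crawler would call this on every extracted link before queuing it. Links without an extension, such as http://example.com/watch?v=123, pass the check, so this alone cannot catch everything.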

What do you think about our suggestions?
Re: Content type filter don't work to avoid unnecessary downloads [message #655123 is a reply to message #654608] Fri, 18 February 2011 07:26
Daniel Stucky
Hi,

according to the documentation at http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler it should already be possible to set filters on, for example, content types.
Unfortunately, I do not know whether these filters are applied before or after downloading. Of course it makes much more sense not to download resources you want to filter out (except for HTML pages, which you need in order to get links for site traversal).

Have you already tried these filters?

Otherwise we have to ask Tom for more input on the Crawler features and limitations.

Daniel
Re: Content type filter don't work to avoid unnecessary downloads [message #655721 is a reply to message #655123] Tue, 22 February 2011 07:50
SMILANewBee
The content type filters are applied after the content has been downloaded. The problem is that the crawler cannot know the content type before it requests the resource. Some content can be identified by its file extension, but not all links have one.

One solution is to send a test request to the server, so that the server answers not with the full content but only with the headers. The crawler can then filter out the link if the content type is not the expected one. If the content type is the expected one, a second request is made that causes the server to send the content.
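
In HTTP, a header-only test request of this kind is a HEAD request. The two-step approach might be sketched as follows in Python, using only the standard library. This is an illustration under assumptions, not how the SMILA crawler works: the names ALLOWED_TYPES, media_type, is_wanted and fetch_if_wanted are hypothetical, and the allowed types are an example.

```python
import urllib.request

# Hypothetical allowlist of content types the crawler wants to keep.
ALLOWED_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and lower-case the rest."""
    if not content_type_header:
        return ""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_wanted(content_type_header, allowed=ALLOWED_TYPES):
    """Decide from the Content-Type header alone whether to download."""
    return media_type(content_type_header) in allowed


def fetch_if_wanted(url, allowed=ALLOWED_TYPES, timeout=10):
    """Send HEAD first; send GET only if the content type is wanted."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
    if not is_wanted(content_type, allowed):
        return None  # skipped, the body was never transferred
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

The trade-off is one extra round trip for every link that is kept. Some servers also reject HEAD requests or report an inaccurate content type, so a real crawler would probably need a fallback for those cases.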