Eclipse Community Forums
Content type filter don't work to avoid unnecessary downloads [message #654608] Wed, 16 February 2011 13:36
SMILANewBee
Hello,

We noticed that the web crawler downloads all content it finds. If a website contains a link to a Flash video, that video is downloaded. Whether the content is processed further depends on the filters: if the content type of Flash videos is filtered out, the video is not persisted. It is still downloaded, however, so a lot of unnecessary content is transferred.

To prevent these downloads, the crawler would have to predict the content type before fetching. One option is to examine the extension of the URL: if the URL ends with flv, it can be discarded. Another option is to ask the server for the content type of the resource. If the content type does not match an expected one, the web crawler ignores the link and no content is downloaded.
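
A minimal sketch of the extension check, written in Python for brevity rather than as SMILA code. The name `has_skipped_extension` and the extension list are made up for illustration; they are not part of the SMILA crawler configuration.

```python
from urllib.parse import urlparse
import posixpath

# Hypothetical blocklist; the extensions are illustrative only.
SKIPPED_EXTENSIONS = {".flv", ".swf", ".avi", ".mp4"}


def has_skipped_extension(url, skipped=SKIPPED_EXTENSIONS):
    """Return True if the URL path ends with an extension we do not want to download."""
    # urlparse separates the path from the query string and fragment,
    # so "video.flv?session=1" is still recognised.
    path = urlparse(url).path
    _, ext = posixpath.splitext(path)
    return ext.lower() in skipped
```

Links without an extension (for example `http://example.com/watch?v=123`) pass this check unchanged, so it can only be a cheap first filter.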

What do you think about our suggestions?
Re: Content type filter don't work to avoid unnecessary downloads [message #655123 is a reply to message #654608] Fri, 18 February 2011 12:26
Daniel Stucky
Hi,

According to the documentation at http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler it should already be possible to set filters, for example on content types.
Unfortunately, I do not know whether these filters are applied before or after downloading. Of course, it makes much more sense not to download resources you want to filter out (except for HTML pages, which you need in order to get links for site traversal).

Did you already try these filters?

Otherwise we have to ask Tom for more input on the Crawler features and limitations.

Daniel
Re: Content type filter don't work to avoid unnecessary downloads [message #655721 is a reply to message #655123] Tue, 22 February 2011 12:50
SMILANewBee
The content type filters are applied after the content has been downloaded. The problem is that the crawler cannot know the content type before crawling. Some content can be identified by its file extension, but not all links have one.

One solution is to send a test request to the server first, so that the server answers with only the headers rather than the full content (an HTTP HEAD request). The crawler can then discard the link if the content type is not an expected one. If the content type is expected, a second request is made that asks the server to send the content.
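
The two-request idea could look roughly like this. It is a sketch in Python using only the standard library, not the SMILA crawler's implementation; `ALLOWED_TYPES`, `media_type` and `fetch_if_allowed` are hypothetical names chosen for the example.

```python
import urllib.request

# Hypothetical allow-list; SMILA's real filter configuration is not modelled here.
ALLOWED_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    if not content_type_header:
        return None
    return content_type_header.split(";", 1)[0].strip().lower()


def fetch_if_allowed(url, allowed=ALLOWED_TYPES, timeout=10):
    """Send HEAD first; issue the GET only if the announced type is allowed."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        announced = media_type(resp.headers.get("Content-Type"))
    if announced not in allowed:
        # Skipped: only headers were transferred, no body.
        return None
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

The price is one extra round trip for every link that is accepted. The sketch also does not handle servers that reject HEAD or omit the Content-Type header: in the first case `urlopen` raises an error, in the second the link is simply skipped.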