Eclipse Community Forums: SeMantic Information Logistics Architecture (SMILA) » Content type filter don't work to avoid unnecessary downloads


Content type filter don't work to avoid unnecessary downloads [message #654608]

Wed, 16 February 2011 13:36

SMILANewBee

Messages: 42
Registered: August 2010

Member

Hello,

we noticed that the web crawler downloads all content it finds. If a website contains a link to a Flash video, that video is downloaded. Whether the content is processed further depends on the filters: if the content type of Flash videos is filtered out, the video is not persisted. However, it has still been downloaded, so a lot of unnecessary content is transferred.

To prevent this, the content type would have to be predicted before downloading. One way is to examine the extension of the URL: if the URL ends with flv, it can be sorted out. Another way is to ask the server for the content type of the resource. If the content type does not match an expected one, the web crawler ignores the link and no content is downloaded.
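
A minimal sketch of the extension check suggested above, written in Python for brevity rather than in SMILA's Java. The blocklist and the name `has_blocked_extension` are illustrative assumptions and not part of the SMILA web crawler.

```python
from posixpath import splitext
from urllib.parse import urlparse

# Hypothetical blocklist; the post only mentions "flv".
BLOCKED_EXTENSIONS = {".flv"}


def has_blocked_extension(url, blocked=BLOCKED_EXTENSIONS):
    """Return True if the URL path ends with a blocked file extension."""
    path = urlparse(url).path      # ignores the query string and fragment
    _, ext = splitext(path)        # ext is "" when the path has no extension
    return ext.lower() in blocked


links = [
    "http://example.com/index.html",
    "http://example.com/media/clip.flv?autoplay=1",
    "http://example.com/download",  # no extension: cannot be decided here
]

# Only links that pass the extension check would be handed to the downloader.
to_fetch = [url for url in links if not has_blocked_extension(url)]
```

A link without an extension passes this check unchanged, so the check can only reduce unnecessary downloads, not rule them out.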

What do you think about our suggestions?


Re: Content type filter don't work to avoid unnecessary downloads [message #655123 is a reply to message #654608]

Fri, 18 February 2011 12:26

Daniel Stucky

Messages: 35
Registered: July 2009

Member

Hi,

according to the documentation at http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler it should already be possible to set filters on, for example, content types.
Unfortunately, I do not know whether these filters are applied before or after downloading. Of course, it makes much more sense not to download resources you want to filter out (except for HTML pages, which you need in order to get links for site traversal).

Did you already check out these filters?

Otherwise we have to ask Tom for more input on the Crawler features and limitations.

Daniel


Re: Content type filter don't work to avoid unnecessary downloads [message #655721 is a reply to message #655123]

Tue, 22 February 2011 12:50

SMILANewBee

Messages: 42
Registered: August 2010

Member

The content type filters are applied after the content has been downloaded. The problem is that the crawler cannot know the content type before fetching the resource. Some content can be identified by its file extension, but not all links have one.

One solution is to send a test request to the server first (an HTTP HEAD request), so that the server answers only with the headers and not with the full content. The crawler can then filter out the resource if its content type is not an expected one. If the content type is expected, a second request is made that asks the server to send the content.
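
The two-request idea could look roughly like the following sketch, again in Python rather than SMILA's Java. The allowlist, the function names, and the decision to skip resources that announce no content type are all assumptions made for illustration; none of this is taken from the SMILA crawler.

```python
import urllib.request

# Hypothetical allowlist of media types the crawler wants to keep.
ALLOWED_TYPES = {"text/html", "application/pdf"}


def media_type(content_type_header):
    """Return the bare media type from a Content-Type header value."""
    if not content_type_header:
        return None
    # "text/html; charset=UTF-8" -> "text/html"
    return content_type_header.split(";", 1)[0].strip().lower()


def is_allowed(content_type_header, allowed=ALLOWED_TYPES):
    """True if the announced media type is in the allowlist."""
    return media_type(content_type_header) in allowed


def fetch_if_allowed(url, allowed=ALLOWED_TYPES, timeout=10):
    """Send HEAD first; send GET only if the content type is allowed."""
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
    if not is_allowed(content_type, allowed):
        return None  # skipped: the body was never requested
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

This costs one extra round trip per link, and it relies on the server answering HEAD requests with the same Content-Type it would send for GET, which not every server does.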
