Eclipse Community Forums
SeMantic Information Logistics Architecture (SMILA) » Storing all crawled data?
Storing all crawled data? [message #648480] Thu, 13 January 2011 06:18
SMILANewBee
We have noticed that the web crawler also fetches images and other content from the web besides plain HTML pages. This content is downloaded but never stored in the binary storage. Is this behaviour intentional?
Re: Storing all crawled data? [message #648538 is a reply to message #648480] Thu, 13 January 2011 13:00
Igor Novakovic
Hi,

generally, the web crawler fetches all objects that are linked from a
previously fetched web page. If you want to restrict that, you can set
appropriate filters in the web crawler's configuration file
(DataSourceConnectionConfig).
For more details on configuring the web crawler, see
http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler
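
For illustration, an exclusion filter in the DataSourceConnectionConfig
could look roughly like the sketch below. The element and attribute names
(Filters, Filter, Type, WorkType, Value) are given from memory and should
be checked against the wiki page above; treat this as an assumption, not
as a verbatim excerpt from the documentation:

  <!-- Hypothetical sketch: crawl only below the seed path and drop image URLs -->
  <Filters>
    <Filter Type="BeginningPath" WorkType="Select"
            Value="http://www.example.org/docs/"/>
    <Filter Type="RegExp" WorkType="Unselect"
            Value=".*\.(gif|jpe?g|png)$"/>
  </Filters>

With such an "Unselect" filter in place, matching URLs should be excluded
before they reach the persistence layer.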

Cheers
Igor


On 13.01.2011 07:18, SMILANewBee wrote:
> We have noticed that the web crawler also fetches images and other
> content from the web besides plain HTML pages. This content is
> downloaded but never stored in the binary storage. Is this behaviour
> intentional?
Re: Storing all crawled data? [message #648547 is a reply to message #648538] Thu, 13 January 2011 13:30
SMILANewBee
That is correct, but our question is: why does the web crawler download all content if that content is then refused and never stored in the binary storage?

If I specify an exclude rule on the content type in the web.xml, the rule has no effect on the binary storage, so from the outside it looks as if the rule is ignored. The rule is probably not ignored internally, but that is how it appears.
Re: Storing all crawled data? [message #648558 is a reply to message #648547] Thu, 13 January 2011 14:09
Igor Novakovic
> That is correct, but our question is: why does the web crawler download
> all content if that content is then refused and never stored in the
> binary storage?
You mean that the content is first downloaded and then discarded by the
web crawler, instead of not being downloaded at all in the first place?
If so, then this is a bug and we need to open a Bugzilla issue to track
it. Or did I misunderstand something?



> If I specify an exclude rule on the content type in the web.xml, the
> rule has no effect on the binary storage, so from the outside it looks
> as if the rule is ignored. The rule is probably not ignored internally,
> but that is how it appears.
I'm not sure I understand what you mean.
If you specify an exclude filter, then all objects that match the rule in
that filter should not even reach SMILA's persistence layer.
Otherwise, if an object has not been filtered out, it should be
persisted in both the record store and the binary store.

Cheers
Igor
Re: Storing all crawled data? [message #648568 is a reply to message #648558] Thu, 13 January 2011 14:43
SMILANewBee
You understood it correctly. I ran a crawl and then looked at the binary storage directory in the SMILA working directory. I used the Linux "file" command to determine the file types; there was no image file, although the URL of an image was crawled.

My understanding of the exclude filter is the same as yours: if I exclude images, they should not be downloaded at all, but they are downloaded anyway :(.
Re: Storing all crawled data? [message #648842 is a reply to message #648568] Fri, 14 January 2011 16:53
Igor Novakovic
> You understood it correctly. I ran a crawl and then looked at the
> binary storage directory in the SMILA working directory. I used the
> Linux "file" command to determine the file types; there was no image
> file, although the URL of an image was crawled.

I've just checked this, so we have a clear bug here.
For some reason, not all downloaded resources (that have not been
filtered out) are persisted as records. Perhaps some kind of hard-coded
web crawler filtering is taking place there.
Anyway, we need to examine this, so I've opened a new issue for it:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=334396


>
> My understanding of the exclude filter is the same as yours: if I
> exclude images, they should not be downloaded at all, but they are
> downloaded anyway :(.

Are you sure that the filtered-out resources are downloaded anyway?
How did you verify that?

Cheers
Igor
Re: Storing all crawled data? [message #649274 is a reply to message #648842] Tue, 18 January 2011 14:35
SMILANewBee
I'm not 100% sure, but I noticed that the content is serialized as Java objects into the directory CONFIGURATION_DIRECTORY/workspace/.metadata/.plugins/org.eclipse.smila.connectivity.framework.crawler.web.

In the class "WebCrawler", in the run method of the "CrawlingProducerThread", all documents are serialized to disk in the aforementioned directory. I think it is possible that the images are also serialized to that directory and processed further (in this case, refused).
Re: Storing all crawled data? [message #653533 is a reply to message #648480] Thu, 10 February 2011 07:53
SMILANewBee
We have analysed the record flow in more detail. It seems that all content is downloaded, but in the WebSiteIterator some content is discarded. This happens in the method "indexDocs", which contains the following line:
 if (fetcherOutput.getParse() != null) { 


The fetcher output is produced by the fetcher. If the content type is, for example, "image", the content is set in the FetcherOutput but no Parse object is created. So the aforementioned "if" fails, no document is set, and null is returned. As a result, no record can be constructed from this "IndexDocument" in the "run" method of "CrawlingProducerThread".
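
To make this concrete, here is a small, self-contained toy model of the
flow described above. The names FetcherOutput, getParse and indexDocs are
taken from the posts; everything else is hypothetical and only illustrates
why non-parsable content such as images ends up producing no record:

  // Toy model (hypothetical types) of the behaviour described above:
  // every resource is fetched, but only resources that yield a Parse
  // result become documents; images yield null and hence no record.
  public class CrawlFlowSketch {

      static class Parse {
          final String text;
          Parse(String text) { this.text = text; }
      }

      static class FetcherOutput {
          final byte[] content; // always set: the resource was downloaded
          final Parse parse;    // null for non-parsable types such as images
          FetcherOutput(byte[] content, Parse parse) {
              this.content = content;
              this.parse = parse;
          }
          Parse getParse() { return parse; }
      }

      // Mirrors the check in indexDocs() quoted above.
      static String indexDoc(FetcherOutput fetcherOutput) {
          if (fetcherOutput.getParse() != null) {
              return fetcherOutput.getParse().text; // parsed content becomes a document
          }
          return null; // image: downloaded, but no document and hence no record
      }

      public static void main(String[] args) {
          FetcherOutput html  = new FetcherOutput("<html>...</html>".getBytes(),
                                                  new Parse("parsed text"));
          FetcherOutput image = new FetcherOutput(new byte[] { 1 }, null);
          System.out.println(indexDoc(html));  // prints "parsed text"
          System.out.println(indexDoc(image)); // prints "null"
      }
  }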
Re: Storing all crawled data? [message #654130 is a reply to message #653533] Mon, 14 February 2011 09:54
Daniel Stucky
Please take a look at https://bugs.eclipse.org/bugs/show_bug.cgi?id=334396.

There I have described the cause of the current behavior. Unfortunately, I don't know of any quick fix for this, and I don't know the design decisions behind the ParserManager and the Parsers.

Perhaps Thomas Menzel can give us some additional information here?