Eclipse Community Forums
Home » Eclipse Projects » SeMantic Information Logistics Architecture (SMILA) » Storing all crawled data?
Storing all crawled data? [message #648480] Thu, 13 January 2011 06:18
SMILANewBee
Messages: 42
Registered: August 2010
Member
We have noticed that the web crawler also fetches images and other content from the web besides plain HTML pages. This content is downloaded but never stored in the binary storage. Is this behaviour intentional?
Re: Storing all crawled data? [message #648538 is a reply to message #648480] Thu, 13 January 2011 13:00
Igor Novakovic
Messages: 53
Registered: July 2009
Member
Hi,

generally, the web crawler fetches all objects that are linked from a
previously fetched web page. If you want to restrict that, you can set
appropriate filters in the web crawler's configuration file
(DataSourceConnectionConfig).
For more details on configuring the web crawler, see
http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler
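As a rough sketch of what such an exclude filter can look like: the fragment below is an assumption reconstructed from memory of the wiki page above, not copied from it, so the element and attribute names (`Filters`, `Filter`, `Type`, `WorkType`) should be verified against the linked documentation before use.

```xml
<!-- Hypothetical DataSourceConnectionConfig fragment (names unverified):
     "Unselect" is assumed to mean "exclude everything matching the pattern". -->
<Filters>
  <Filter Type="RegExp" WorkType="Unselect" Value=".*\.(gif|jpe?g|png)$"/>
</Filters>
```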

Cheers
Igor


On 13.01.2011 07:18, SMILANewBee wrote:
> We have noticed that the web crawler also fetches images and other
> content from the web besides plain HTML pages. This content is
> downloaded but never stored in the binary storage. Is this behaviour
> intentional?
Re: Storing all crawled data? [message #648547 is a reply to message #648538] Thu, 13 January 2011 13:30
SMILANewBee
That is correct, but our question is: why does the web crawler download all content at all if it is then discarded and never stored in the Binary Storage?

If I specify an exclude rule on the content type in the web.xml, the rule has no effect on the Binary Storage, so the rule appears to be ignored. It is certainly not ignored internally, but from the outside it looks that way.
Re: Storing all crawled data? [message #648558 is a reply to message #648547] Thu, 13 January 2011 14:09
Igor Novakovic
> That is correct, but our question is: why does the web crawler download
> all content at all if it is then discarded and never stored in the Binary Storage?
You mean that the content is first downloaded and then discarded by the
web crawler, instead of not being downloaded at all in the first place?
If so, then this is a bug and we have to open a Bugzilla issue to trace
it. Or did I misunderstand something?



> If I specify an exclude rule on the content type in the web.xml, the
> rule has no effect on the Binary Storage, so the rule appears to be
> ignored. It is certainly not ignored internally, but from the outside
> it looks that way.
I'm not sure I understand what you mean.
If you specify an exclude filter, then all objects that match the rule
in that filter should not even reach SMILA's persistence layer.
Otherwise, if the object has not been filtered out, it should be
persisted in both the record and the binary store.

Cheers
Igor
Re: Storing all crawled data? [message #648568 is a reply to message #648558] Thu, 13 January 2011 14:43
SMILANewBee
You understood it correctly. I ran a crawl and inspected the binary storage directory in the working directory of SMILA. I used the Linux "file" command to determine the file types. There was no image file, although the URL of an image was crawled.

My understanding of the exclude filter is the same as yours: if I exclude images, they should not be downloaded, but they are downloaded anyway :(.
Re: Storing all crawled data? [message #648842 is a reply to message #648568] Fri, 14 January 2011 16:53
Igor Novakovic
> You understood it correctly. I ran a crawl and inspected the binary
> storage directory in the working directory of SMILA. I used the Linux
> "file" command to determine the file types. There was no image file,
> although the URL of an image was crawled.

I've just checked this, and we have a clear bug here.
For some reason, not all downloaded resources (that have not been
filtered out) are persisted as records. Perhaps some kind of hard-coded
web crawler filtering is taking place there.
Anyway, we need to examine it, so I've opened a new issue for this:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=334396


>
> My understanding of the exclude filter is the same as yours: if I
> exclude images, they should not be downloaded, but they are downloaded
> anyway :(.

Are you sure that the filtered-out resources are downloaded anyway?
How did you verify that?

Cheers
Igor
Re: Storing all crawled data? [message #649274 is a reply to message #648842] Tue, 18 January 2011 14:35
SMILANewBee
I'm not 100% sure, but I noticed that the content is serialized as Java objects into the directory CONFIGURATION_DIRECTORY/workspace/.metadata/.plugins/org.eclipse.smila.connectivity.framework.crawler.web.

In the run method of the "CrawlingProducerThread" in the "WebCrawler" class, all documents are serialized to disk in the aforementioned directory. I think it is possible that the images are serialized to that directory and then processed further (and, in this case, discarded).
Re: Storing all crawled data? [message #653533 is a reply to message #648480] Thu, 10 February 2011 07:53
SMILANewBee
We have analysed the record flow in more detail. It seems that all content is downloaded, but in the WebSiteIterator some content is discarded. This happens in the method "indexDocs", which contains the following line:

 if (fetcherOutput.getParse() != null) { 

The fetcher output is produced by the fetcher. If the content type is, for example, an image, the content is set in the FetcherOutput, but no Parse object is. So the aforementioned "if" check fails, which means that no document is set and null is returned. Consequently, no record can be constructed from this "IndexDocument" in the "run" method of "CrawlingProducerThread".
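The flow described in this post can be sketched with simplified stand-ins for the crawler classes. Note that FetcherOutput, Parse, and indexDoc below are hypothetical models built from the description in this thread, not the real SMILA sources:

```java
import java.util.Optional;

// Simplified stand-in: result of parsing a downloaded resource.
class Parse {
    final String text;
    Parse(String text) { this.text = text; }
}

// Simplified stand-in: the fetcher's output. The raw content is always
// present (the resource WAS downloaded); the Parse is null whenever no
// parser handled the content type (e.g. image/png).
class FetcherOutput {
    final byte[] content;
    final Parse parse;
    FetcherOutput(byte[] content, Parse parse) {
        this.content = content;
        this.parse = parse;
    }
    Parse getParse() { return parse; }
}

public class IndexDocsSketch {
    // Mirrors the "if (fetcherOutput.getParse() != null)" check:
    // content without a Parse yields no document, hence no record,
    // even though the bytes were already fetched.
    public static Optional<String> indexDoc(FetcherOutput fetcherOutput) {
        if (fetcherOutput.getParse() != null) {
            return Optional.of(fetcherOutput.getParse().text);
        }
        return Optional.empty(); // downloaded, but silently dropped
    }

    public static void main(String[] args) {
        FetcherOutput html = new FetcherOutput(new byte[]{1}, new Parse("parsed html"));
        FetcherOutput image = new FetcherOutput(new byte[]{2}, null); // no parser matched
        System.out.println(indexDoc(html).isPresent());  // prints true
        System.out.println(indexDoc(image).isPresent()); // prints false
    }
}
```

This would explain the observed behaviour: the download happens before the Parse check, so excluded-by-parser content costs bandwidth but never reaches the record or binary store.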
Re: Storing all crawled data? [message #654130 is a reply to message #653533] Mon, 14 February 2011 09:54
Daniel Stucky
Messages: 35
Registered: July 2009
Member
Please take a look at https://bugs.eclipse.org/bugs/show_bug.cgi?id=334396.

There I described the cause of the current behavior. Unfortunately, I don't know of any quick fix for this, and I don't know the design decisions behind the ParserManager and the Parsers.

Perhaps Thomas Menzel can give us some additional information here?