Eclipse Community Forums
Crawler - Configuration and behavior [message #551806] Mon, 09 August 2010 11:39
Andrej Rosenheinrich
Messages: 22
Registered: August 2010
Junior Member
Hello,

I'm new to SMILA. It is an impressive tool, but unfortunately also a bit tricky to understand ;)

At the moment I am trying to understand the behavior of a crawler and how to configure it. I understand the attributes I can gather and what a collected record would look like. What I don't get yet is where "under the hood" the actual crawling is done, and what possibilities I have to change that without changing the actual implementation. For example, when crawling a website, can I configure the format of the content? Is content necessarily an attachment? Can I get the full HTML code of a website as content, or for instance just the text between certain tags, or only the text and no HTML code at all?

Are there easy answers to those questions, or is there a more specific description of crawlers than what's in the wiki? Would you be interested in comments on the wiki, by the way?

Thanks in advance!
Andrej
Re: Crawler - Configuration and behavior [message #551918 is a reply to message #551806] Tue, 10 August 2010 03:46
Daniel Stucky
Messages: 35
Registered: July 2009
Member
Hi Andrej,

Thanks for your interest in SMILA.

A crawler's main purpose is to provide data "as is" from a specific data source. In the case of the WebCrawler, this means that it starts crawling at a given start URL (called the seed) and returns the resource specified by this URL. You also get access to the information available in the HTTP header of the web server's response. So far this is standard HTTP functionality.

Further steps depend on the mime-type of the resource:
- If the resource is an HTML document, the Crawler extracts all links to other resources (documents, images, etc.) and follows those links according to your configuration. In addition, information provided in META tags is extracted (e.g. the content encoding of the HTML document). The HTML document itself remains unmodified.
- If the resource is not an HTML document, no further steps are taken.
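
To make the two cases above concrete, here is a minimal Python sketch. It is not SMILA's WebCrawler code; `LinkCollector` and `inspect_resource` are hypothetical names, and the set of tags treated as links is an assumption. It only shows the idea: links and META information are read from an HTML resource, the body is handed back untouched, and non-HTML resources are passed through without inspection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects link targets and META information; never alters the document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and attrs.get("href"):
            # Resolve relative links against the URL of the crawled page.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("img", "script") and attrs.get("src"):
            self.links.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("http-equiv")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]


def inspect_resource(url, mime_type, body):
    """Return (links, meta, body); body is always returned unmodified."""
    if mime_type != "text/html":
        # Assumption for this sketch: anything but HTML is passed through as is.
        return [], {}, body
    collector = LinkCollector(url)
    collector.feed(body)
    collector.close()
    return collector.links, collector.meta, body
```

Whether a collected link is actually followed would then be decided by the filters and limits in the crawler configuration, which this sketch leaves out.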

Within your Crawler configuration you can specify how to map the possible information provided for a crawled object to record attributes and/or attachments. Note that not all information may be available for every crawled object.

Whether you store the content as a record attribute or as an attachment depends on what you are crawling. If you know that you can only receive HTML documents, it's OK to use an attribute. However, if you do not provide rules to filter out everything else, a Crawler run may also return images, PDFs and so on. For binary data you have to use an attachment.
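
As a rough illustration of that rule of thumb, the following Python sketch puts textual content into an attribute and everything else into an attachment. The record layout (a plain dict with `attributes` and `attachments`), the names `Url`, `MimeType` and `Content`, and the list of mime types treated as text are all assumptions made for this example, not SMILA's record model or configuration syntax.

```python
# Mime type prefixes treated as text in this sketch (an assumption).
TEXT_TYPES = ("text/", "application/xml", "application/xhtml+xml")


def choose_storage(mime_type):
    """Return 'attribute' for textual content and 'attachment' for everything else."""
    base = mime_type.split(";", 1)[0].strip().lower()
    return "attribute" if base.startswith(TEXT_TYPES) else "attachment"


def build_record(url, mime_type, content):
    """Build a hypothetical record; binary content only ever lands in attachments."""
    record = {"attributes": {"Url": url, "MimeType": mime_type}, "attachments": {}}
    if choose_storage(mime_type) == "attribute":
        if isinstance(content, bytes):
            content = content.decode("utf-8", errors="replace")
        record["attributes"]["Content"] = content
    else:
        record["attachments"]["Content"] = content
    return record
```

If a crawl is not restricted to HTML, always using an attachment avoids having to make this decision at all.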

What the Crawler does not provide is content extraction. You cannot specify in the Crawler configuration that only a specific part of the HTML, like a section, paragraph or page, should be returned. For that, you have to configure a BPEL pipeline that processes the data provided by a Crawler. A BPEL pipeline gives you much more flexibility and control over what to do with the data than would be possible within the Crawler configuration. It also allows reuse: for example, the processing of an HTML document may be identical whether it comes from a WebCrawler or a FilesystemCrawler.


For more information about Crawlers and the configuration options check out the documentation in our wiki:

http://wiki.eclipse.org/SMILA/Documentation/ConnectivityFramework
http://wiki.eclipse.org/SMILA/Documentation/Crawler
http://wiki.eclipse.org/SMILA/Documentation/Web_Crawler

I hope this helps!

Bye,
Daniel
Re: Crawler - Configuration and behavior [message #551928 is a reply to message #551918] Tue, 10 August 2010 04:29
Andrej Rosenheinrich
Messages: 22
Registered: August 2010
Junior Member
Hi Daniel,

Thanks for your fast answer. I still have some questions left.

You wrote: "A crawler's main purpose is to provide data "as is" from a specific data source. In case of the WebCrawler it means that it starts the crawling on a given start URL (called Seed) and returns the resource specified by this URL." What is the "as is" resource of a URL? Is it the complete HTML code of the page? Or just the text, meaning the whole page minus the HTML code? What would the returned content be?

Is there any documentation of how the different mime types are handled? I couldn't find it in the wiki.

Thanks!

Greets
Andrej
Re: Crawler - Configuration and behavior [message #551944 is a reply to message #551928] Tue, 10 August 2010 05:06
Daniel Stucky
Messages: 35
Registered: July 2009
Member
Hi Andrej,

With "as is" I meant the unmodified content as sent by the web server. In the case of an HTML document, it is the complete markup (HTML tags + text).

I don't think that there is an example on the wiki, but take a look at the default pipelines that are shipped with SMILA (SMILA.application\configuration\org.eclipse.smila.processing.bpel\pipelines). The addpipeline.bpel contains conditions that check whether a mimetype attribute is set and select alternative processing for HTML/XML and plain text content. For HTML/XML, the HtmlToTextPipelet is called in order to extract the plain text from the content (removing all markup). I guess that this is the functionality you were looking for.
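
To illustrate the idea outside of BPEL, here is a minimal Python sketch of the same kind of branching: check whether a mime type is set, strip the markup for HTML/XML, and pass plain text through. It is not the code of the HtmlToTextPipelet or of addpipeline.bpel. The names `TextExtractor`, `html_to_text` and `process`, the record keys `MimeType`, `Content` and `Text`, and the use of Python's HTML parser for XML are assumptions made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Return the text of a document with all markup removed."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return " ".join(extractor.parts)


def process(record):
    """Pick a processing branch from the record's mime type, if one is set."""
    mime_type = (record.get("MimeType") or "").split(";", 1)[0].strip().lower()
    if mime_type in ("text/html", "text/xml", "application/xml"):
        record["Text"] = html_to_text(record["Content"])
    elif mime_type == "text/plain":
        record["Text"] = record["Content"]
    # Records without a recognized mime type are left unchanged.
    return record
```

Extracting only the text between certain tags would be a variation of `TextExtractor` that starts collecting when the wanted tag opens and stops when it closes.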


Bye,
Daniel
Re: Crawler - Configuration and behavior [message #552166 is a reply to message #551944] Wed, 11 August 2010 03:20
Andrej Rosenheinrich
Messages: 22
Registered: August 2010
Junior Member
Hi Daniel,

Yes, that was the explanation I was looking for. Thanks a lot. On my journey through the depths of the SMILA crawler I came across a few more questions. At what point are filters and limits checked: when I start crawling, or after crawling the seed? What I am trying to do at the moment is to configure the crawler to crawl just the site I pass as seed, nothing more. When setting <CrawlingModel Type="MaxDepth" Value="1"/>, the crawler stops immediately without crawling anything at all, because it claims that the maximum depth was exceeded. When setting the value to 2 it crawls, but obviously more than just the seed. What is the (probably very simple) configuration for my needs?
Does "Maximum depth exceeded!" automatically mean a crawler state of "aborted", or what is the condition for reaching the state "finished"?

[Updated on: Wed, 11 August 2010 03:38]

Re: Crawler - Configuration and behavior [message #552183 is a reply to message #552166] Wed, 11 August 2010 03:54
Daniel Stucky
Messages: 35
Registered: July 2009
Member
Hi, this sounds like a bug or some misconfiguration. Of course it should be possible to crawl just the seed URL without following any links.

The state "aborted" is only entered if the crawl process was stopped by the user or because of an internal error. In case of an error there should be a log entry in the SMILA.log file. You should also see an exception when using JConsole.

Could you please open a new Bugzilla entry and attach your web crawler configuration?

Bye,
Daniel