Re: Crawler - Configuration and behavior [message #564787 is a reply to message #564749]
Tue, 10 August 2010 07:46
| Daniel Stucky
Registered: July 2009
Thanks for your interest in SMILA.
A crawler's main purpose is to provide data "as is" from a specific data source. For the WebCrawler this means that it starts crawling at a given start URL (called the seed) and returns the resource specified by this URL. You also get access to the information in the HTTP header that the web server sends along with the resource. So far this is standard HTTP functionality.
Further steps depend on the mime-type of the resource:
- If the resource is an HTML document, the Crawler extracts all links to other resources (documents, images, etc.) and follows those links according to your configuration. In addition, information provided in META tags is extracted (e.g. the content encoding of the HTML document). The HTML document itself remains unmodified.
- If the resource is not an HTML document, no further steps are taken.
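To make the two cases above more concrete, here is a rough Python sketch of the idea. This is not SMILA code: the class and function names (LinkAndMetaExtractor, process_resource) are made up for illustration, and it assumes the body has already been decoded to a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndMetaExtractor(HTMLParser):
    """Hypothetical helper: collects outgoing links and META information."""

    # Which attribute points at another resource, per tag.
    LINK_ATTRS = {"a": "href", "img": "src", "link": "href",
                  "script": "src", "frame": "src", "iframe": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # absolute URLs found in the document
        self.meta = {}   # META name / http-equiv -> content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        link_attr = self.LINK_ATTRS.get(tag)
        if link_attr and attrs.get(link_attr):
            # Resolve relative links against the URL of the document.
            self.links.append(urljoin(self.base_url, attrs[link_attr]))
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("http-equiv")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]
            elif attrs.get("charset"):
                self.meta["charset"] = attrs["charset"]


def process_resource(url, mime_type, body):
    """Return (links, meta). Only HTML is inspected; the body is never modified."""
    if mime_type.split(";")[0].strip().lower() != "text/html":
        return [], {}  # not HTML: no further steps
    parser = LinkAndMetaExtractor(url)
    parser.feed(body)
    parser.close()
    return parser.links, parser.meta
```

Whether the collected links are actually followed would then depend on the filter rules in your configuration, which this sketch leaves out.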
Within your Crawler configuration you can specify how to map the possible information provided for a crawled object to record attributes and/or attachments. Note that not all information may be available for every crawled object.
Whether you store the content as a record attribute or as an attachment depends on what you are crawling. If you know that you will only receive HTML documents, it's OK to use an attribute. However, if you do not provide rules to filter out everything else, a Crawler run may also return images, PDFs and so on. For binary data you have to use an attachment.
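The attribute vs. attachment decision can be pictured roughly like this. Again, this is only a sketch in Python and not the SMILA record API: the dictionary layout, the key names (Url, MimeType, Content) and the function build_record are assumptions made for the example.

```python
def build_record(url, mime_type, content, encoding="utf-8"):
    """Hypothetical mapping of a crawled object to a record.

    content is the raw body as bytes. Text (here: HTML) goes into an
    attribute, everything else is kept as a binary attachment.
    """
    record = {
        "attributes": {"Url": url, "MimeType": mime_type},
        "attachments": {},
    }
    base_type = mime_type.split(";")[0].strip().lower()
    if base_type == "text/html":
        # HTML can be represented as a string, so an attribute is fine.
        record["attributes"]["Content"] = content.decode(encoding, errors="replace")
    else:
        # Images, PDFs, etc. are binary and must stay as bytes.
        record["attachments"]["Content"] = content
    return record
```

In a real configuration you choose one mapping up front rather than per object, so if you cannot rule out binary resources, the attachment is the safe choice.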
What the Crawler does not provide is content extraction. So you cannot specify in the Crawler configuration that only a specific part of the HTML, like a section, paragraph or page, should be returned. For that you have to configure a BPEL pipeline that processes the data provided by a Crawler. A BPEL pipeline gives you much more flexibility and control over what to do with the data than would be possible within the Crawler configuration. It also allows reuse: for example, the processing of an HTML document may be identical whether it comes from a WebCrawler or a FilesystemCrawler.
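To illustrate the kind of processing that belongs in the pipeline rather than in the Crawler, here is a small Python sketch that pulls only the paragraph text out of an HTML document. It is not a BPEL pipeline or a SMILA pipelet; ParagraphExtractor, extract_paragraphs and the "Content" key are hypothetical names for the example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical processing step: collects the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def _flush(self):
        # Normalize whitespace and store the paragraph if it has any text.
        text = " ".join("".join(self._buf).split())
        if self._in_p and text:
            self.paragraphs.append(text)
        self._buf = []
        self._in_p = False


def extract_paragraphs(record):
    """Works on any record with HTML in 'Content', whatever crawler produced it."""
    parser = ParagraphExtractor()
    parser.feed(record["Content"])
    parser.close()
    return parser.paragraphs
```

The step only looks at the record, not at where it came from, which is the reuse point: the same function can handle HTML delivered by a web crawl or read from the file system.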
For more information about Crawlers and the configuration options check out the documentation in our wiki:
I hope this helps!