How to crawl a web page and save the resulting html pages? [message #892550] Thu, 28 June 2012 11:16
Eclipse UserFriend
What is the easiest way to crawl a website and save the resulting HTML pages in a local folder? I tried the sample web crawl job from the documentation, but it doesn't seem to produce any output. Maybe I need to define a fetcher? But how would I add one to the example (source code below)?

Edit: The forum prevents me from posting links before I have 25 messages, so I cannot include the link or the JSON; for me it is the second Google hit for "sample web crawl job smila".

[Updated on: Thu, 28 June 2012 11:17] by Moderator

Re: How to crawl a web page and save the resulting html pages? [message #892695 is a reply to message #892550] Fri, 29 June 2012 05:16
Eclipse UserFriend
Hi Konrad,

did you try the "5 Minutes to Success" http://wiki.eclipse.org/SMILA/Documentation_for_5_Minutes_to_Success ? With this example you should be able to crawl a website and create an index from the crawled records.

Please note that the crawled websites are not persisted in the objectstore for later use; they are only stored temporarily until they are indexed. If you want to keep the crawled websites, you have to modify your workflow somehow.

One option, for example, is to add a new pipelet to the "AddPipeline" that stores the records somewhere. You could use the SMILA record store, or save them into a database or to the filesystem. That's up to you.

Please also note that the sample jobs make use of delta indexing. If you index the same websites multiple times, delta indexing will probably filter out most pages because they were already crawled and did not change. You may want to disable delta indexing while you are experimenting with SMILA, or delete the delta indexing entries before you start your crawl job.

I hope this helps.

Bye,
Daniel
Re: How to crawl a web page and save the resulting html pages? [message #892698 is a reply to message #892695] Fri, 29 June 2012 05:20
Eclipse UserFriend
I just realized that there is already a FileWriterPipelet available which might do the job: http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.FileWriterPipelet
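
To clarify how it works: the pipelet writes the content of a configured attachment to the file path found in a configured attribute. So every record reaching the pipelet needs an attribute holding the target path and an attachment holding the content. Just as a sketch (the names "output" and "Content" are only examples and have to match your pipelet configuration; some earlier step in the pipeline has to fill the path attribute):

{
  "_recordid": "web:<some-id>",
  "output": "/tmp/crawl/page.html"
}

plus an attachment named "Content" that contains the fetched HTML.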
Re: How to crawl a web page and save the resulting html pages? [message #892720 is a reply to message #892698] Fri, 29 June 2012 06:58
Eclipse UserFriend
OK, I have now created a crawling workflow with delta indexing disabled. Is the following correct? (I just took the webCrawling workflow and removed the delta indexing part.)

{
  "name": "webCrawlingNoDeltaIndexing",
  "modes": [
    "runOnce"
  ],
  "startAction": {
    "worker": "webCrawler",
    "input": {
      "linksToCrawl": "linksToCrawlBucket"
    },
    "output": {
      "linksToCrawl": "linksToCrawlBucket",
      "crawledRecords": "crawledLinksBucket"
    }
  },
  "actions": [
    {
      "worker": "webExtractor",
      "input": {
        "compounds": "compoundLinksBucket"
      },
      "output": {
        "files": "fetchedLinksBucket"
      }
    },
    {
      "worker": "webFetcher",
      "input": {
        "linksToFetch": "updatedLinksBucket"
      },
      "output": {
        "fetchedLinks": "fetchedLinksBucket"
      }
    },
    {
      "worker": "updatePusher",
      "input": {
        "recordsToPush": "fetchedLinksBucket"
      }
    }
  ]
}
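
For completeness, this is how I intend to register the workflow and then run a job that uses it, if I read the JobManager REST API correctly (assuming the default port 8080; the job definition file referenced below still has to be written):

curl -X POST --data-binary @webCrawlingNoDeltaIndexing.json http://localhost:8080/smila/jobmanager/workflows/
curl -X POST --data-binary @crawlWebJob.json http://localhost:8080/smila/jobmanager/jobs/
curl -X POST http://localhost:8080/smila/jobmanager/jobs/crawlWebJob/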

[Updated on: Fri, 29 June 2012 06:58] by Moderator

Re: How to crawl a web page and save the resulting html pages? [message #892726 is a reply to message #892720] Fri, 29 June 2012 07:31
Eclipse UserFriend
Looks good. In your crawl job you should also set "deltaImportStrategy" to "disabled".
Re: How to crawl a web page and save the resulting html pages? [message #892728 is a reply to message #892720] Fri, 29 June 2012 07:31
Eclipse UserFriend
Hi Konrad,

instead of changing the workflow it may be easier to just switch delta
indexing off in the job definition by setting an additional parameter:

"parameters":{
...
"deltaImportStrategy":"disabled",
...

(see http://wiki.eclipse.org/SMILA/Documentation/Importing/DeltaCheck)
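
A complete job definition could then look roughly like this. This is only a sketch: apart from "deltaImportStrategy", the parameter names and values are copied from the sample web crawl job and shortened (I left out things like the filters and mapping sections), so please check them against the wiki page:

{
  "name": "crawlWebJob",
  "workflow": "webCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "web",
    "startUrl": "http://wiki.eclipse.org/SMILA",
    "jobToPushTo": "indexUpdate",
    "deltaImportStrategy": "disabled"
  }
}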

Best regards,
Andreas


On 29.06.2012 12:58, Konrad Höffner wrote:
> Ok I now created a crawling workflow with disabled delta indexing, is
> the following correct (I just modified the webCrawling workflow and
> removed the delta indexing part) ?:
>
> [...]
Re: How to crawl a web page and save the resulting html pages? [message #892767 is a reply to message #892698] Fri, 29 June 2012 10:28
Eclipse UserFriend
Daniel Stucky wrote on Fri, 29 June 2012 05:20
I just realized that there is already a FileWriterPipelet available which might do the job: [...]


Thanks! I am now adding the pipelet to the AddPipeline; am I doing it the right way below? I took the definition of the AddPipeline and modified it as follows.

[...]
<process name="AddPipelineSaveToFile" [...]
[...]
 <sequence name="AddPipelineSaveToFile">
[...]
</extensionActivity>

<extensionActivity>
  <proc:invokePipelet name="writeFile">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" />
    <proc:variables input="request" />
    <proc:configuration>
      <rec:Val key="pathAttribute">output</rec:Val>
      <rec:Val key="contentAttachment">Content</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

 <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process" variable="request" />
 </sequence>
</process>

[Updated on: Fri, 29 June 2012 10:28] by Moderator
