How to crawl a web page and save the resulting html pages? [message #892550] Thu, 28 June 2012 11:16
Eclipse UserFriend
What is the easiest way to crawl a website and save the resulting HTML pages in a local folder? I tried the sample web crawl job from the documentation, but it doesn't seem to produce any output. Maybe I need to define a fetcher? But how would I add one to the example (source code below)?

Edit: The forum prevents me from posting links before I have 25 messages, so I cannot include the link or the JSON; for me it is the second Google hit for "sample web crawl job smila".

[Updated on: Thu, 28 June 2012 11:17] by Moderator

Re: How to crawl a web page and save the resulting html pages? [message #892695 is a reply to message #892550] Fri, 29 June 2012 05:16
Eclipse UserFriend
Hi Konrad,

did you try the "5 Minutes to Success" http://wiki.eclipse.org/SMILA/Documentation_for_5_Minutes_to_Success ? With this example you should be able to crawl a website and create an index from the crawled records.

Please note that the crawled websites are not persisted in the objectstore for later use; they are only stored temporarily until they are indexed. If you want to keep the crawled websites, you have to modify your workflow somehow.

One option, for example, is to add a new pipelet to the "AddPipeline" that stores the records somewhere. You could use the SMILA record store, or save them into a database or to the filesystem. That's up to you.

Please also note that the sample jobs make use of delta indexing. If you index the same websites multiple times, delta indexing will probably filter out most pages because they were already crawled and did not change. You may want to disable delta indexing while you are experimenting with SMILA, or delete the delta indexing entries before you start your crawl job.

I hope this helps.

Bye,
Daniel
Re: How to crawl a web page and save the resulting html pages? [message #892698 is a reply to message #892695] Fri, 29 June 2012 05:20
Eclipse UserFriend
I just realized that there is already a FileWriterPipelet available which might do the job: http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.FileWriterPipelet
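
To clarify how it works: the pipelet writes the content of a configured attachment to the file path found in a configured attribute. So every record reaching the pipelet needs an attribute holding the target path and an attachment holding the content. Just as a sketch (the names "output" and "Content" are only examples and have to match your pipelet configuration; some earlier step in the pipeline has to fill the path attribute):

{
  "_recordid": "web:<some-id>",
  "output": "/tmp/crawl/page.html"
}

plus an attachment named "Content" that contains the fetched HTML.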
Re: How to crawl a web page and save the resulting html pages? [message #892720 is a reply to message #892698] Fri, 29 June 2012 06:58
Eclipse UserFriend
OK, I have now created a crawling workflow with delta indexing disabled. Is the following correct? (I just took the webCrawling workflow and removed the delta indexing part.)

{
  "name": "webCrawlingNoDeltaIndexing",
  "modes": [
    "runOnce"
  ],
  "startAction": {
    "worker": "webCrawler",
    "input": {
      "linksToCrawl": "linksToCrawlBucket"
    },
    "output": {
      "linksToCrawl": "linksToCrawlBucket",
      "crawledRecords": "crawledLinksBucket"
    }
  },
  "actions": [
    {
      "worker": "webExtractor",
      "input": {
        "compounds": "compoundLinksBucket"
      },
      "output": {
        "files": "fetchedLinksBucket"
      }
    },
    {
      "worker": "webFetcher",
      "input": {
        "linksToFetch": "updatedLinksBucket"
      },
      "output": {
        "fetchedLinks": "fetchedLinksBucket"
      }
    },
    {
      "worker": "updatePusher",
      "input": {
        "recordsToPush": "fetchedLinksBucket"
      }
    }
  ]
}
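
For completeness, this is how I intend to register the workflow and then run a job that uses it, if I read the JobManager REST API correctly (assuming the default port 8080; the job definition file referenced below still has to be written):

curl -X POST --data-binary @webCrawlingNoDeltaIndexing.json http://localhost:8080/smila/jobmanager/workflows/
curl -X POST --data-binary @crawlWebJob.json http://localhost:8080/smila/jobmanager/jobs/
curl -X POST http://localhost:8080/smila/jobmanager/jobs/crawlWebJob/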

[Updated on: Fri, 29 June 2012 06:58] by Moderator

Re: How to crawl a web page and save the resulting html pages? [message #892726 is a reply to message #892720] Fri, 29 June 2012 07:31
Eclipse UserFriend
Looks good. In your crawl job you should also set "deltaImportStrategy" to "disabled".
Re: How to crawl a web page and save the resulting html pages? [message #892728 is a reply to message #892720] Fri, 29 June 2012 07:31
Eclipse UserFriend
Hi Konrad,

instead of changing the workflow it may be easier to just switch delta
indexing off in the job definition by setting an additional parameter:

"parameters":{
...
"deltaImportStrategy":"disabled",
...

(see http://wiki.eclipse.org/SMILA/Documentation/Importing/DeltaCheck)
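
A complete job definition could then look roughly like this. This is only a sketch: apart from "deltaImportStrategy", the parameter names and values are copied from the sample web crawl job and shortened (I left out things like the filters and mapping sections), so please check them against the wiki page:

{
  "name": "crawlWebJob",
  "workflow": "webCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "web",
    "startUrl": "http://wiki.eclipse.org/SMILA",
    "jobToPushTo": "indexUpdate",
    "deltaImportStrategy": "disabled"
  }
}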

Best regards,
Andreas


On 29.06.2012 12:58, Konrad Höffner wrote:
> Ok I now created a crawling workflow with disabled delta indexing, is
> the following correct (I just modified the webCrawling workflow and
> removed the delta indexing part) ?:
>
> [...]
Re: How to crawl a web page and save the resulting html pages? [message #892767 is a reply to message #892698] Fri, 29 June 2012 10:28
Eclipse UserFriend
Daniel Stucky wrote on Fri, 29 June 2012 05:20
I just realized that there is already a FileWriterPipelet available which might do the job: [...]


Thanks! I am now adding the pipelet to the AddPipeline; am I doing it the right way below? I took the definition of the AddPipeline and modified it as follows.

[...]
<process name="AddPipelineSaveToFile" [...]
[...]
 <sequence name="AddPipelineSaveToFile">
[...]
</extensionActivity>

<extensionActivity>
  <proc:invokePipelet name="writeFile">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" />
    <proc:variables input="request" />
    <proc:configuration>
      <rec:Val key="pathAttribute">output</rec:Val>
      <rec:Val key="contentAttachment">Content</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

 <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process" variable="request" />
 </sequence>
</process>

[Updated on: Fri, 29 June 2012 10:28] by Moderator
