Eclipse Community Forums
How to crawl a web page and save the resulting html pages? [message #892550] Thu, 28 June 2012 15:16
Konrad Höffner (Junior Member; Messages: 5; Registered: June 2012)
What is the easiest way to crawl a website and save the HTML pages in a local folder? I tried the sample web crawl job from the documentation, but it doesn't seem to produce any output. Maybe I need to define a fetcher? But how can I add one to the example (source code below)?

Edit: The forum prevents me from posting links before I have 25 messages, so I cannot include the link or the JSON; for me it is the second Google hit for "sample web crawl job smila".

Re: How to crawl a web page and save the resulting html pages? [message #892695 is a reply to message #892550] Fri, 29 June 2012 09:16
Daniel Stucky (Member; Messages: 35; Registered: July 2009)
Hi Konrad,

Did you try the "5 Minutes to Success" guide (http://wiki.eclipse.org/SMILA/Documentation_for_5_Minutes_to_Success)? With this example you should be able to crawl a website and create an index from the crawled records.

Please note that the crawled websites are not persisted in the objectstore for later use; they are only stored temporarily until they have been indexed. If you want to keep the crawled websites, you have to modify your workflow accordingly.

One option, for example, is to add a new pipelet to the "AddPipeline" that stores the records somewhere. You could use the SMILA record store, or save them to a database or to the filesystem; that's up to you.

Please also note that the sample jobs make use of delta indexing. If you index the same websites multiple times, delta indexing will probably filter out most pages because they have already been crawled and did not change. You may want to disable delta indexing while you are experimenting with SMILA, or simply delete the delta indexing entries before you start your crawl job.

I hope this helps.

Bye,
Daniel
Re: How to crawl a web page and save the resulting html pages? [message #892698 is a reply to message #892695] Fri, 29 June 2012 09:20
Daniel Stucky (Member; Messages: 35; Registered: July 2009)
I just realized that there is already a FileWriterPipelet available which might do the job: http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets#org.eclipse.smila.processing.pipelets.FileWriterPipelet
Re: How to crawl a web page and save the resulting html pages? [message #892720 is a reply to message #892698] Fri, 29 June 2012 10:58
Konrad Höffner (Junior Member; Messages: 5; Registered: June 2012)
OK, I have now created a crawling workflow with delta indexing disabled. Is the following correct? (I just modified the webCrawling workflow and removed the delta indexing part.)

{
  "name": "webCrawlingNoDeltaIndexing",
  "modes": [
    "runOnce"
  ],
  "startAction": {
    "worker": "webCrawler",
    "input": {
      "linksToCrawl": "linksToCrawlBucket"
    },
    "output": {
      "linksToCrawl": "linksToCrawlBucket",
      "crawledRecords": "crawledLinksBucket"
    }
  },
  "actions": [
    {
      "worker": "webExtractor",
      "input": {
        "compounds": "compoundLinksBucket"
      },
      "output": {
        "files": "fetchedLinksBucket"
      }
    },
    {
      "worker": "webFetcher",
      "input": {
        "linksToFetch": "updatedLinksBucket"
      },
      "output": {
        "fetchedLinks": "fetchedLinksBucket"
      }
    },
    {
      "worker": "updatePusher",
      "input": {
        "recordsToPush": "fetchedLinksBucket"
      }
    }
  ]
}

Re: How to crawl a web page and save the resulting html pages? [message #892726 is a reply to message #892720] Fri, 29 June 2012 11:31
Daniel Stucky (Member; Messages: 35; Registered: July 2009)
Looks good. In your crawl job you should also set "deltaImportStrategy" to "disabled".
Re: How to crawl a web page and save the resulting html pages? [message #892728 is a reply to message #892720] Fri, 29 June 2012 11:31
Andreas Weber (Junior Member; Messages: 24; Registered: July 2009)
Hi Konrad,

Instead of changing the workflow, it may be easier to just switch delta indexing off in the job definition by setting an additional parameter:

"parameters":{
...
"deltaImportStrategy":"disabled",
...

(see http://wiki.eclipse.org/SMILA/Documentation/Importing/DeltaCheck)
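
For reference, a complete crawl job definition with this parameter could then look roughly like the sketch below. Apart from "deltaImportStrategy", the parameter names simply follow the 5 Minutes to Success example; the job name, data source, target job and start URL are only placeholders and may need adjusting for your setup:

{
  "name": "crawlWebNoDelta",
  "workflow": "webCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "web",
    "startUrl": "http://wiki.eclipse.org/SMILA",
    "jobToPushTo": "indexUpdate",
    "deltaImportStrategy": "disabled"
  }
}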

Best regards,
Andreas
Re: How to crawl a web page and save the resulting html pages? [message #892767 is a reply to message #892698] Fri, 29 June 2012 14:28
Konrad Höffner (Junior Member; Messages: 5; Registered: June 2012)
Daniel Stucky wrote on Fri, 29 June 2012 05:20:
I just realized that there is already a FileWriterPipelet available which might do the job: [...]


Thanks! I am now adding the pipelet to the AddPipeline; am I doing it the right way below? I took the definition of the AddPipeline and modified it as follows.

[...]
<process name="AddPipelineSaveToFile" [...]
[...]
 <sequence name="AddPipelineSaveToFile">
[...]
</extensionActivity>

<extensionActivity>
  <proc:invokePipelet name="writeFile">
    <proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" />
    <proc:variables input="request" />
    <proc:configuration>
      <rec:Val key="pathAttribute">output</rec:Val>
      <rec:Val key="contentAttachment">Content</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

 <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process" variable="request" />
 </sequence>
</process>
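
If I understand the configuration correctly, every record reaching the "writeFile" step then needs an attribute named "output" holding the target file path and an attachment named "Content" with the HTML, i.e. roughly a record like this (just a sketch; the record id and path are made-up examples):

{
  "_recordid": "web:http://example.org/index.html",
  "output": "/tmp/crawl/index.html"
}

with the page bytes attached as "Content".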
