How to crawl a web page and save the resulting html pages? [message #892550]
Thu, 28 June 2012 11:16
Eclipse User
What is the easiest way to crawl a website and save the HTML pages in some local folder? I tried the sample web crawl job from the documentation, but it doesn't seem to produce any output. Maybe I need to define a fetcher? But how can I add one to the example (source code below)?
Edit: The forum prevents me from posting links before I have 25 messages, so I cannot include the link and the JSON; for me it is the second Google hit for "sample web crawl job smila".
[Updated on: Thu, 28 June 2012 11:17] by Moderator
Re: How to crawl a web page and save the resulting html pages? [message #892720 is a reply to message #892698]
Fri, 29 June 2012 06:58
Eclipse User
OK, I have now created a crawling workflow with delta indexing disabled. Is the following correct? (I just modified the webCrawling workflow and removed the delta indexing part.)
{
  "name": "webCrawlingNoDeltaIndexing",
  "modes": [
    "runOnce"
  ],
  "startAction": {
    "worker": "webCrawler",
    "input": {
      "linksToCrawl": "linksToCrawlBucket"
    },
    "output": {
      "linksToCrawl": "linksToCrawlBucket",
      "crawledRecords": "crawledLinksBucket"
    }
  },
  "actions": [
    {
      "worker": "webExtractor",
      "input": {
        "compounds": "compoundLinksBucket"
      },
      "output": {
        "files": "fetchedLinksBucket"
      }
    },
    {
      "worker": "webFetcher",
      "input": {
        "linksToFetch": "updatedLinksBucket"
      },
      "output": {
        "fetchedLinks": "fetchedLinksBucket"
      }
    },
    {
      "worker": "updatePusher",
      "input": {
        "recordsToPush": "fetchedLinksBucket"
      }
    }
  ]
}
[Updated on: Fri, 29 June 2012 06:58] by Moderator
Re: How to crawl a web page and save the resulting html pages? [message #892728 is a reply to message #892720]
Fri, 29 June 2012 07:31
Eclipse User
Hi Conrad,
instead of changing the workflow, it may be easier to just switch delta indexing off in the job definition by setting an additional parameter:
"parameters": {
  ...
  "deltaImportStrategy": "disabled",
  ...
(see http://wiki.eclipse.org/SMILA/Documentation/Importing/DeltaCheck)
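For illustration, the parameter goes into the "parameters" block of the job definition that references the workflow. A rough sketch (the job name and all values besides "deltaImportStrategy" are placeholders, not taken from your setup):

```json
{
  "name": "myCrawlJob",
  "workflow": "webCrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "web",
    "deltaImportStrategy": "disabled"
  }
}
```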
Best regards,
Andreas
On 29.06.2012 12:58, Konrad Höffner wrote:
> Ok I now created a crawling workflow with disabled delta indexing, is
> the following correct (I just modified the webCrawling workflow and
> removed the delta indexing part) ?:
> [...]
Re: How to crawl a web page and save the resulting html pages? [message #892767 is a reply to message #892698]
Fri, 29 June 2012 10:28
Eclipse User
Daniel Stucky wrote on Fri, 29 June 2012 05:20: I just realized that there is already a FileWriterPipelet available which might do the job: [...]
Thanks! I am now adding the pipelet to the AddPipeline; am I doing it the right way below? I took the definition of the AddPipeline and modified it as follows.
[...]
<process name="AddPipelineSaveToFile" [...]
[...]
  <sequence name="AddPipelineSaveToFile">
    [...]
    </extensionActivity>
    <extensionActivity>
      <proc:invokePipelet name="writeFile">
        <proc:pipelet class="org.eclipse.smila.processing.pipelets.FileWriterPipelet" />
        <proc:variables input="request" />
        <proc:configuration>
          <rec:Val key="pathAttribute">output</rec:Val>
          <rec:Val key="contentAttachment">Content</rec:Val>
        </proc:configuration>
      </proc:invokePipelet>
    </extensionActivity>
    <reply name="end" partnerLink="Pipeline" portType="proc:ProcessorPortType" operation="process" variable="request" />
  </sequence>
</process>
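For what it's worth, with that configuration the pipelet would read the target file path from the record attribute "output" and the page content from the attachment "Content". A record entering the pipelet would therefore need to look roughly like this (the record id and path are made up; the actual record layout depends on your crawl setup):

```json
{
  "_recordid": "web:example",
  "output": "/tmp/crawl/example.html"
}
```

plus an attachment named "Content" holding the fetched HTML bytes.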
[Updated on: Fri, 29 June 2012 10:28] by Moderator