Hi,
On 28.08.2012 13:02, Corinth, Rene wrote:
I’m working for PT-DLR (http://www.pt-dlr.de/), and we manage a lot of
websites. We now want to replace our current search engine with Smila.
By default Smila indexes http://wiki.eclipse.org/SMILA/, and it’s easy
to change the startURL in jobs.json.
Now my problem: I want to give Smila more than one website (e.g.
url1.com + url2.com), and the sites should be indexed independently of
each other.
You can add more crawl job definitions, one for each web site.
Either add them to the jobs.json configuration file, or POST them to
/smila/jobmanager/jobs.
Another possibility, which does this in a single job, is described at
http://wiki.eclipse.org/SMILA/Documentation/Importing/CrawlingMultipleStartURLs.
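As a rough illustration of the first option (one crawl job per site), here is a minimal Python sketch that builds one job definition per start URL and can POST each one to /smila/jobmanager/jobs. The host and port, the job names, and the field names inside the definition ("workflow", "parameters", "startUrl", "dataSource") are assumptions modelled on a typical default setup; please copy the real structure from the crawl job already present in your jobs.json.

```python
import json
import urllib.request

# Assumption: SMILA runs locally on port 8080; adjust to your installation.
JOBS_ENDPOINT = "http://localhost:8080/smila/jobmanager/jobs"

# Hypothetical job names mapped to the start URL of each site.
SITES = {
    "crawlUrl1": "http://url1.com/",
    "crawlUrl2": "http://url2.com/",
}


def make_crawl_job(name, start_url):
    """Build one crawl job definition for a single site.

    The field names are assumptions; mirror the existing crawl job
    in your own jobs.json rather than trusting this layout.
    """
    return {
        "name": name,
        "workflow": "webCrawling",
        "parameters": {
            "dataSource": name,
            "startUrl": start_url,
        },
    }


def post_job(job, endpoint=JOBS_ENDPOINT):
    """POST a single job definition as JSON and return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# One independent job definition per site.
jobs = [make_crawl_job(name, url) for name, url in SITES.items()]

if __name__ == "__main__":
    # Print the definitions; call post_job(job) once SMILA is running.
    for job in jobs:
        print(json.dumps(job, indent=2))
```

Because each site gets its own job name, the jobs can be started and stopped separately, which should give you the independent indexing you asked for.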
In addition, if I implement a search form on one of the websites, it
should show only content from that site. For example, if I search for
something on url1.com, only results from url1.com should be shown.
For each crawled page you could extract the domain part of the URL
into a new attribute and then in the search request add a filter to
restrict the result to those pages with the required domain
attribute value.
On adding attributes to the index, see http://wiki.eclipse.org/SMILA/Documentation/Solr_3.5
On filtering, see http://wiki.eclipse.org/SMILA/Documentation/Search#Query_Parameters
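To make the idea concrete, here is a small Python sketch of the two steps: deriving a domain value from a page URL, and building a search request that filters on it. This only illustrates the logic; inside SMILA the extraction would have to happen in the import pipeline, not in a standalone script. The attribute name "Domain" is hypothetical, and the filter layout ("attribute" plus "oneOf") is an assumption that should be checked against the Search documentation linked above.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_search_request(query, domain, attribute="Domain"):
    """Build a search request restricted to records of one domain.

    'Domain' is a hypothetical attribute name, and the filter structure
    is an assumption; verify both against your index configuration.
    """
    return {
        "query": query,
        "filter": [{"attribute": attribute, "oneOf": [domain]}],
    }


if __name__ == "__main__":
    page_url = "http://www.url1.com/news/index.html"
    print(domain_of(page_url))
    print(build_search_request("funding", domain_of(page_url)))
```

The search form on url1.com would then always send its own domain as the filter value, so hits from url2.com never appear in its results.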
Does anybody know where I could find some tutorials for my case, or
can anybody give me some hints?
Sorry, there is currently no complete tutorial on this.
Cheers,
Juergen.