|Re: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ|
The robots.txt is respected in the trunk version, since we just recently introduced this feature.
See http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web#User-Agent_and_robots.txt for a short overview.
But inline tags like <!--noindex--> won’t be respected.
Thanks for the link, I’ll have a look at it :)
Hope this helps.
One tip for your “BMWI” problem:
You could either try the boilerpipe pipelet (see http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe), which should be able to extract only the relevant content from your web pages, or you could introduce a pipelet (or a script that is executed by the ScriptPipelet) that cuts out the parts you’re sure you won’t need.
But I guess the boilerpipe approach is the most promising for your problem. Give it a try and check whether the results improve. (And if you like, give us some feedback.)
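To illustrate the second option (cutting out the parts you don't need), here is a minimal standalone sketch of the string handling only. It is not SMILA pipelet code and uses no SMILA API; the function name `strip_noindex` and the closing marker `<!--/noindex-->` are assumptions made for the example, so adjust the pattern to whatever markers your pages really use.

```python
import re

# Assumption: unwanted regions are wrapped as
#   <!--noindex--> ... <!--/noindex-->
# The closing marker is hypothetical; the thread only mentions <!--noindex-->.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Remove every region enclosed in the assumed noindex markers."""
    return NOINDEX_BLOCK.sub("", html)


page = "<p>Article text</p><!--noindex--><div>Copyright BMWI</div><!--/noindex-->"
cleaned = strip_noindex(page)
```

The same regular expression idea should carry over to a script run before indexing, so that a copyright footer on every page no longer produces a match for every document.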
Andreas, thank you very much. It really took a long time to get past this step; at first I thought the problem was me ;)
There is still a lot to do, but step by step ;)
If somebody wants to see the progress:
But I still have some questions. Does SMILA respect the robots.txt and the <!--noindex--> tag?
It seems not... if I search for “BMWI” I get a lot of matches because the copyright notice is on every single page. Maybe someone has an idea?