Hi René,
 
The robots.txt file is respected in the trunk version; we introduced this feature only recently.
See http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web#User-Agent_and_robots.txt for a short overview.
 
But inline tags like <!--noindex--> won’t be respected.
 
Thanks for the link, I’ll have a look at it :)
 
Hope this helps.
 
One tip for your “BMWI” problem:
You could either try the boilerpipe pipelet (see http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe), which should be able to extract only the relevant content from your web pages, or you could introduce a pipelet (or a script executed by the ScriptPipelet) that cuts out the parts you’re sure you won’t need.
 
But I guess the boilerpipe approach is the most promising one for your problem. Give it a try and check whether the results improve. (And if you like, give us some feedback.)
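
For the second option, here is a minimal sketch of the kind of cutting I mean. It is plain Python, not SMILA code, and it does not use the pipelet or ScriptPipelet API; it only illustrates the idea of dropping marked regions from the HTML before it gets indexed. The closing marker <!--/noindex-->, the function name strip_noindex and the sample page are assumptions for illustration, so adjust them to whatever your pages really contain.

```python
import re

# Assumption: the unwanted regions are wrapped in a marker pair
# <!--noindex--> ... <!--/noindex-->. The closing marker is hypothetical;
# adapt the pattern to the markup your pages really use.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_noindex(html: str) -> str:
    """Return html with every marked noindex region removed."""
    return NOINDEX_BLOCK.sub("", html)


# Made-up sample page: the footer carries the copyright notice
# that would otherwise match on every single page.
page = (
    "<html><body>"
    "<p>Article text about SMILA.</p>"
    "<!--noindex--><div id='footer'>Copyright BMWI</div><!--/noindex-->"
    "</body></html>"
)

cleaned = strip_noindex(page)
print(cleaned)
```

The same logic could be ported to a pipelet or a script that runs before indexing. If your pages have no such markers, boilerpipe is probably the easier route.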
 
Bye,
Andreas
 
 
From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Corinth, Rene
Sent: Friday, 5 October 2012 19:56
To: smila-dev@xxxxxxxxxxx
Subject: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ
 
That rocks!
 
Andreas, thank you very much. It really took a long time to get past this step; at first I thought the problem was me ;)
 
There is still a lot to do, but step by step ;)
 
If anybody wants to see the progress:
 
http://www.theseus-programm.de/de/75_smila.php 
 
But I still have some questions. Does SMILA respect the robots.txt and the <!--noindex--> tag?
It seems not... if I search for “BMWI” I get a lot of matches because the copyright notice is on every single page. Maybe someone has an idea?
 
Cheers René