Re: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ

Hi René,


robots.txt is respected in the trunk version; we introduced this feature only recently.

See http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web#User-Agent_and_robots.txt for a short overview.


But inline tags like <!--noindex--> won’t be respected.


Thanks for the link, I’ll have a look at it :)


Hope this helps.


One tip for your “BMWI” problem:

You could either try the boilerpipe pipelet (see http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe), which should be able to extract only the relevant content from your web pages, or you could introduce a pipelet (or a script that is executed by the ScriptPipelet) that cuts out the parts you’re sure you won’t need.


But I guess the boilerpipe approach could be the most promising for your problem. Have a try and check if the results improve. (And if you like, give us some feedback.)
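
To illustrate the second option (cutting out the parts you don't need), here is a minimal standalone sketch in Python. It is not SMILA pipelet code and uses no SMILA API; it only shows the kind of text filtering such a pipelet or script would do. It assumes the pages wrap the unwanted block between <!--noindex--> and a closing <!--/noindex--> comment; the closing marker and the function name strip_noindex are assumptions for illustration only.

```python
import re

# Assumption: unwanted sections are wrapped in <!--noindex--> ... <!--/noindex-->.
# The closing marker is hypothetical; adapt the pattern to what the pages really use.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Return html with every complete noindex block removed."""
    return NOINDEX_BLOCK.sub("", html)


page = (
    "<p>Article text</p>"
    "<!--noindex--><div>Copyright BMWI</div><!--/noindex-->"
)
print(strip_noindex(page))
```

A block without a closing marker is left untouched, so a malformed page would not lose everything after the opening comment.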






From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Corinth, Rene
Sent: Friday, 5 October 2012 19:56
To: smila-dev@xxxxxxxxxxx
Subject: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ


That rocks!


Andreas, thank you very much. It really took a long time to get past this step; at first I thought the problem was me ;)


There is still a lot to do, but step by step ;)


If somebody wants to see the progress:




But I still have some questions. Does SMILA respect the robots.txt and the <!--noindex--> tag?

It seems not... If I search for “BMWI” I get a lot of matches because the copyright notice is on every single page. Maybe someone has an idea?


Cheers René