Re: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ

Hi René,

 

Yes, robots.txt is respected in the trunk version; we introduced this feature only recently.

See http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web#User-Agent_and_robots.txt for a short overview.

 

But inline tags like <!--noindex--> won’t be respected.

 

Thanks for the link, I’ll have a look at it :)

 

Hope this helps.

 

One tip for your “BMWI” problem:

You could try the boilerpipe pipelet (see http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe), which should be able to extract only the relevant content from your web pages. Alternatively, you could introduce a pipelet (or a script executed by the ScriptPipelet) that cuts out the parts you are sure you won’t need.
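
To illustrate the "cut out" option, here is a minimal sketch in plain Python. It is not SMILA pipelet code and does not use any SMILA API; it only shows the kind of text transformation such a pipelet or script would apply. It assumes the unwanted regions are wrapped in a pair of <!--noindex--> and <!--/noindex--> comments. The closing marker and the function name strip_noindex are assumptions made for this example, so adapt them to whatever your pages actually contain.

```python
import re

# Assumption: unwanted regions are wrapped in <!--noindex--> ... <!--/noindex-->.
# The closing marker is a hypothetical convention, not something SMILA defines.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.IGNORECASE | re.DOTALL,  # markers in any case, regions spanning lines
)


def strip_noindex(html: str) -> str:
    """Return html with every noindex-delimited region removed."""
    return NOINDEX_BLOCK.sub("", html)


page = (
    "<p>Article text</p>"
    "<!--noindex--><div>Copyright BMWI</div><!--/noindex-->"
    "<p>More text</p>"
)
print(strip_noindex(page))
```

If the footer is not marked up this way, the same idea works with any pattern that reliably identifies it, for example a fixed copyright line.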

 

But I guess the boilerpipe approach is the most promising for your problem. Give it a try and check whether the results improve. (And if you like, give us some feedback.)

 

Bye,

Andreas

 

 

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Corinth, Rene
Sent: Friday, 5 October 2012 19:56
To: smila-dev@xxxxxxxxxxx
Subject: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ

 

That rocks!

 

Andreas, thank you very much. It really took a long time to get past this step; at first I thought the problem was me ;)

 

There is still a lot to do, but step by step ;)

 

If somebody wants to see the progress:

 

http://www.theseus-programm.de/de/75_smila.php

 

But I still have some questions. Does SMILA respect robots.txt and the <!--noindex--> tag?

It seems not... If I search for “BMWI”, I get a lot of matches because the copyright notice is on every single page. Maybe someone has an idea?

 

Cheers René

