Re: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ

From: Andreas Schank <andreas.schank@xxxxxxxxxxx>
Date: Mon, 8 Oct 2012 08:59:52 +0200

Hi René,

robots.txt is respected in the trunk version; we introduced this feature only recently.

See http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web#User-Agent_and_robots.txt for a short overview.
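To illustrate what respecting robots.txt means for a crawler in general, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not SMILA code and says nothing about how the SMILA web crawler implements the check; the robots.txt content, the user-agent name "ExampleCrawler" and the example.org URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleCrawler" and the URLs below are placeholders.
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "http://example.org/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "http://example.org/private/a.html"))
```

A crawler would run such a check before fetching each URL and skip the ones that are disallowed.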

But inline tags like <!--noindex--> won’t be respected.

Thanks for the link, I’ll have a look at it :)

Hope this helps.

One tip for your “BMWI” problem:

You could try the boilerpipe pipelet (see http://wiki.eclipse.org/SMILA/Documentation/Bundle_org.eclipse.smila.processing.pipelets.boilerpipe), which should be able to extract only the relevant content from your web pages. Alternatively, you could introduce a pipelet (or a script executed by the ScriptPipelet) that cuts out the parts you are sure you won’t need.

But I guess the boilerpipe approach is the most promising one for your problem. Give it a try and check whether the results improve. (And if you like, give us some feedback.)
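For the second option (cutting out unwanted parts yourself), the core logic could look roughly like the sketch below. It is plain Python rather than a SMILA pipelet or a ScriptPipelet script, and how to wire it into a SMILA workflow is not shown. It assumes that the regions to drop are wrapped in `<!--noindex-->` ... `<!--/noindex-->` comment pairs; the closing marker and the function name `strip_noindex` are assumptions made for this example.

```python
import re

# Assumption: unwanted regions are wrapped in <!--noindex--> ... <!--/noindex-->.
# DOTALL lets a region span several lines; the non-greedy ".*?" stops at the
# first closing marker, so separate regions are removed independently.
NOINDEX_RE = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Remove every noindex-marked region from the given HTML text."""
    return NOINDEX_RE.sub("", html)


# Hypothetical page fragment: article text followed by a marked copyright footer.
page = "<p>Article text</p><!--noindex--><p>Copyright BMWi</p><!--/noindex-->"
print(strip_noindex(page))
```

Running this kind of cleanup before the content is indexed would keep a footer that appears on every page out of the search results, provided the pages actually carry such markers.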

Bye,

Andreas

From: smila-dev-bounces@xxxxxxxxxxx [mailto:smila-dev-bounces@xxxxxxxxxxx] On Behalf Of Corinth, Rene
Sent: Friday, 5 October 2012 19:56
To: smila-dev@xxxxxxxxxxx
Subject: [smila-dev] Ac2i8hax4G5oncFQQb+ufJxSD7APjQAEgUUQ

That rocks!

Andreas, thank you very much. It took a really long time to get past this step; at first I thought the problem was me ;)

There is still a lot to do, but step by step ;)

If somebody wants to see the progress:

http://www.theseus-programm.de/de/75_smila.php

But I still have some questions. Does SMILA respect robots.txt and the <!--noindex--> tag?

It seems not... If I search for “BMWI”, I get a lot of matches because the copyright notice is on every single page. Maybe someone has an idea?

Cheers René
