
elasticsearch-river-web's People

Contributors

codelibsbuild, johtani, keiichiw, marevol


elasticsearch-river-web's Issues

HTTP Proxy support

Support crawling content through an HTTP proxy.
An example configuration:

curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
...
        "proxy" : {
          "host" : "proxy.server.com",
          "port" : 8080
        },

Strange behavior between robot and myindex indices when crawling a large site

I have started a crawl which I am monitoring right now, with about 10 threads and a 200 ms interval. The thing is, the robot index has 25k documents, which is way more than my crawl index at about 9,200 documents. I know the robot index handles the crawling URL list, but it still seems strange; is pushing the data into ES slow?

By the way, CPU usage peaks at 80-90%.

Facet issue

{
  "query" : { "query_string" : { "query" : "ck3" } },
  "facets" : {
    "tags" : { "terms" : { "fields" : ["metakey", "metaprod", "metasol", "metares"] } }
  }
}

'Printers and Media' is a single value in the meta tag's content attribute.
When I run this query, the counts are split per word, so for the value 'Printers and Media' I get:

{
  "term" : "printers",
  "count" : 280
},
{
  "term" : "and",
  "count" : 300
},
{
  "term" : "media",
  "count" : 100
}

But I need:

{
  "term" : "Printers and Media",
  "count" : 200
}

What changes do I need to make to the query to get this? Please suggest.

Thanks in advance.
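The per-word counts happen because the terms facet runs over the analyzed tokens of those fields. Mapping the meta fields as not_analyzed makes the facet return whole values; note this has to be done on a fresh index before crawling, since an existing field's mapping cannot be changed in place. A minimal sketch, assuming a type name of metatype and the field names from the query above:

curl -XPUT 'localhost:9200/myindex/metatype/_mapping' -d '{
  "metatype" : {
    "properties" : {
      "metakey"  : { "type" : "string", "index" : "not_analyzed" },
      "metaprod" : { "type" : "string", "index" : "not_analyzed" },
      "metasol"  : { "type" : "string", "index" : "not_analyzed" },
      "metares"  : { "type" : "string", "index" : "not_analyzed" }
    }
  }
}'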

Unable to crawl from directory

I am unable to crawl PDF files located on another web server. The scenario is that the PDF folder is accessible through FTP but not HTTP. I have to give the full URL of each PDF file (e.g. http://xyz.com/pdf/search.pdf), but I want to crawl from the folder itself. How can I crawl these files through Elasticsearch?

Thanks in advance.

NoClassSettingsException[Failed to load class with value [web]]

Hi there,

I've just created a brand new CentOS 6 VM, installed ElasticSearch v1.0.0RC2 and elasticsearch-river-web v1.1.0 as per the instructions.

I then have gone to setup my crawl by running the following:

# create robot
curl -XPUT 'http://localhost:9200:443/robot/'

# Create Index
curl -XPUT "http://localhost:9200:443/compassion_uat/"

# create the duplicate mapping index
curl -XPUT "http://localhost:9200:443/compassion_uat/compassion_web/_mapping/" -d '
{
  "compassion_web" : {
    "dynamic_templates" : [
      {
        "url" : {
          "match" : "url",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "method" : {
          "match" : "method",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "charSet" : {
          "match" : "charSet",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      },
      {
        "mimeType" : {
          "match" : "mimeType",
          "mapping" : {
            "type" : "string",
            "store" : "yes",
            "index" : "not_analyzed"
          }
        }
      }
    ]
  }
}
'


# create the crawler
curl -XPUT 'http://localhost:9200:443/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
                "incremental" : true,
        "overwrite" : true,
                "robotsTxt" : false,
                "userAgent" : "bingbot",
        "target" : [        
                  {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/2 * * * * ?"
    }
}
'

After doing this I cannot see any documents appearing in the index, so I have looked at the _river index and can see the following error:

NoClassSettingsException[Failed to load class with value [web]]; nested: ClassNotFoundException[web];

Have I missed a step?

Thanks,
Tim.
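This exception usually means Elasticsearch could not find a river type named "web", i.e. the river-web plugin is not loaded on the node that runs the river. A sketch of the usual checks for Elasticsearch 1.x (the version number is only an example and must match your Elasticsearch release):

# list the plugins installed on this node
$ES_HOME/bin/plugin --list

# if river-web is missing, install it and restart the node
$ES_HOME/bin/plugin --install org.codelibs/elasticsearch-river-web/1.1.0
sudo service elasticsearch restart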

Crawling Authenticated Sites

Hi,

Is there a way to crawl authenticated websites using this plugin? If yes, can you guide me on how to achieve it? The site could be either NTLM authenticated or might use forms-based authentication.
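The crawl settings have had an authentications block for this, but I am not certain of the exact field names in every version; the sketch below is a hypothetical shape (the scheme, scope, and credential keys are assumptions to verify against the plugin README for your release):

"crawl" : {
...
    "authentications" : [
      {
        "scope" : { "scheme" : "NTLM", "host" : "intranet.example.com", "port" : 80 },
        "credentials" : { "username" : "user", "password" : "secret", "domain" : "EXAMPLE", "workstation" : "PC01" }
      }
    ],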

How to install the latest source code?

I am trying to update the river to the latest version using

/bin/plugin --install codelibs/elasticsearch-river-web

to be sure I have the latest version (otherwise it installs from the Maven repository, which I believe is not the latest master version).
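To run the current master rather than a released artifact, one approach is to build the plugin zip locally with Maven and install it from a file URL; a sketch assuming Elasticsearch 1.x plugin-manager syntax and that the zip is produced under target/releases/ (the exact file name and path depend on the version you build):

git clone https://github.com/codelibs/elasticsearch-river-web.git
cd elasticsearch-river-web
mvn clean package

# install the freshly built zip (adjust the file name to the version you built)
$ES_HOME/bin/plugin --url file:///path/to/elasticsearch-river-web/target/releases/elasticsearch-river-web-<version>.zip --install river-web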

Possible enhancement request: Duplicates

I have a few situations where the same document is indexed twice because it has two different parentUrls. Is it possible to prevent this? It would be nice if I could provide duplicate exclusion rules. For example, if the md5 of properties body + title + language is the same for an existing document, ignore it.

I realize this would increase the indexing time as you would basically have to do a search first but is something like this possible? Or is there a recommended approach for managing this common situation? Maybe I'm missing an option in ElasticSearch itself.
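The river assigns its own document IDs, so this is not a plugin option, but at the Elasticsearch level one common workaround is to derive the document _id from a hash of the deduplication key so that re-indexing the same content overwrites instead of duplicating. A minimal sketch outside the river, with hypothetical index, type, and field values:

# derive a deterministic _id from body + title + language (hypothetical values)
BODY="...page body..."; TITLE="Page title"; LANG="en"
DOC_ID=$(printf '%s' "${BODY}${TITLE}${LANG}" | md5sum | awk '{print $1}')

# indexing with that _id makes a second copy of the same content overwrite the first
curl -XPUT "localhost:9200/mycrawlindex/page/${DOC_ID}" -d "{
  \"title\" : \"${TITLE}\",
  \"lang\" : \"${LANG}\"
}"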

meta tags

I don't see a way to grab and map certain meta tags to a property from the crawled html page. Is that an implemented feature?

Cannot create index?

Hi there,

I've been using this plugin now for a few weeks with no issues (I'm running version 1.0.1) until I decided a few days ago to remove all my indexes and create new ones again from scratch.

Unfortunately now I can't seem to create my crawler indexes. I run the appropriate cURL command to create the index and I receive the {"ok":true...} JSON response, but when I try to query the index I receive an IndexMissingException.

The process I'm following is as follows:

a. Install robot index (as per instructions):

curl -XPUT '192.168.1.26:9200/robot/'

b. I then attempt to create an index using:

curl -XPUT '192.168.1.26:9200/_river/my_web/_meta' -d "{
    \"type\" : \"web\",
    \"crawl\" : {
        \"index\" : \"compassion_test\",
        \"url\" : [\"http://uat.compassiondev.net.au/\"],
        \"includeFilter\" : [\"http://uat.compassiondev.net.au/.*\"],
        \"maxDepth\" : 3,
        \"maxAccessCount\" : 100,
        \"numOfThread\" : 5,
        \"interval\" : 1000,
        \"overwrite\" : true,
        \"target\" : [
          {
            \"pattern\" : {
              \"url\" : \"http://uat.compassiondev.net.au/.*\",
              \"mimeType\" : \"text/html\"
            },
            \"properties\" : {
              \"title\" : {
                \"text\" : \"title\"
              },
              \"body\" : {
                \"text\" : \"div#page_content\",
                \"trimSpaces\" : true
              }
            }
          }
        ]
    }
}"

I receive the following json response:

{"ok":true,"_index":"_river","_type":"my_web","_id":"_meta","_version":1}

But the index doesn't seem to exist (I receive the exception mentioned above)...

Is there something that I've missed? Any help would be greatly appreciated. Thanks!
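For what it's worth, the PUT to _river/my_web/_meta only stores the river configuration in the _river index; the data index named in crawl.index is a separate index, so querying it before it exists gives an IndexMissingException. Creating it explicitly, as the README's setup steps do, avoids that. A minimal sketch:

curl -XPUT '192.168.1.26:9200/compassion_test/'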

How to index text inside <div> tags

Hi,

Can anyone help me with indexing text between particular <div> tags, something like:

<div data-canvas-width="125.304" data-font-name="g_font_580_0" data-angle="0" style="font-size: 24px; font-family: sans-serif; left: 64px; top: 172px; transform: rotate(0deg) scale(1.00243, 1); transform-origin: 0% 0% 0px;" dir="ltr">Automotive</div>

This is to index some content in pdf files as per my requirement.

Thanks In Advance,
Srinivas
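The text value of a property is a CSS (jsoup-style) selector, so an attribute selector can target that particular div; a minimal sketch, using a hypothetical property name and keying off the data-font-name attribute from the example above:

"properties" : {
  "category" : {
    "text" : "div[data-font-name=g_font_580_0]",
    "trimSpaces" : true
  }
}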

Meta tag crawling issue.

I want to crawl the meta tag 'name' and 'content' attribute values. I am able to crawl the head tag's text and HTML, but not the meta tag attributes alone. Can anyone help me out?

Thanks in advance.

Praveen.

Duplicated URLs

When I checked my URL list, I saw that the same URLs are indexed with different _ids. The pages are the same. I have set:

"maxDepth": 7,
"maxAccessCount": 500,
"numOfThread": 10,
"interval": 200,
"incremental": true,
"overwrite": true,

The fresh install using the tutorial from the README file doesn't work

I am using the latest version of ES as of today with a fresh install. The installation of the river worked fine as well. However, the scraping (or crawling) never starts. I followed the instructions in the README, but no luck. Of course I was careful with the cron job; I also used "0 0 * * *?" (which, I think, means start the crawling right now). I had some luck with the Yahoo example, but only 5-6 links were extracted. I have tested scraping with different URLs. I can't see what is going on (which page is being crawled and so on); I only get "scheduled". Here is the log from the Yahoo example. After this output, the river stops. I have had no luck using the river to crawl other sites. Any hints?

[2014-04-21 19:00:29,600][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/comics/
[2014-04-21 19:00:29,833][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:30,580][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:30,764][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/abc-news/
[2014-04-21 19:00:31,712][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:36,438][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/originals/
[2014-04-21 19:00:36,455][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/yahoo_news_photos/
[2014-04-21 19:00:36,457][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/katie-couric-aereo-tv-supreme-court-212342689.html
[2014-04-21 19:00:37,284][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:37,531][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:39,906][INFO ][cluster.metadata ] [Black Bolt] [webindex] update_mapping yahoo_com
[2014-04-21 19:00:41,241][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/boston-marathon-bombing/
[2014-04-21 19:00:41,247][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/u-says-examining-toxic-chemical-syria-172200500.html
[2014-04-21 19:00:42,402][INFO ][org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer] No scraping rule.
[2014-04-21 19:00:43,176][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/syria-elections-set-june-3-amid-civil-war-180035620.html
[2014-04-21 19:00:43,255][INFO ][cluster.metadata ] [Black Bolt] [webindex] update_mapping yahoo_com
[2014-04-21 19:00:46,463][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/boy-scouts-shutdown-troop-for-refusing-to-banish-gay-scoutmaster-171244503.html
[2014-04-21 19:00:47,523][INFO ][org.seasar.robot.helper.impl.LogHelperImpl] Crawling URL: http://news.yahoo.com/obama-plans-clemency-for-hundreds-of-drug-offenders--162714911.html
No scraping rule.
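The repeated "No scraping rule." lines mean that the crawled URL did not match any target.pattern in the river config, so that page is fetched but nothing is extracted or indexed. A minimal sketch of a target whose pattern covers the Yahoo pages from the log (the selectors are only examples):

"target" : [
  {
    "pattern" : {
      "url" : "http://news.yahoo.com/.*",
      "mimeType" : "text/html"
    },
    "properties" : {
      "title" : { "text" : "title" },
      "body" : { "text" : "body", "trimSpaces" : true }
    }
  }
]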

How to set crawl URLs dynamically

Thank you as always. Please let me ask my question (originally in Japanese).

Is it possible to register a specific river in advance, then have another application register the URLs to crawl into robot/queue so that those specific URLs are crawled dynamically?

Unable to load the river-web plugin using Eclipse, getting the following error

Caused by: org.elasticsearch.common.inject.CreationException: Guice creation errors:

  1. Error injecting constructor, org.seasar.framework.exception.ResourceNotFoundRuntimeException: [ESSR0055]app.dicon
    at org.codelibs.elasticsearch.web.service.S2ContainerService.(Unknown Source)
    while locating org.codelibs.elasticsearch.web.service.S2ContainerService

1 error
at org.elasticsearch.common.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:344)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:178)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:59)

Updated schedule doesn't take effect

When I update the river config with a new schedule time, the river doesn't change to the later time and the schedule doesn't start. However, if I restart ES, it starts automatically.

"schedule": {
"cron": "0 14 4 * * ?"
}
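A river re-reads its _meta when it is registered, so one way to apply a new schedule without restarting Elasticsearch is to unregister and re-register the river. A sketch, assuming the river is named my_web and that the full original config body is re-sent with the new cron value:

# unregister the existing river
curl -XDELETE 'localhost:9200/_river/my_web/'

# re-register it with the updated schedule
curl -XPUT 'localhost:9200/_river/my_web/_meta' -d '{
    "type" : "web",
    "crawl" : { ... },
    "schedule" : { "cron" : "0 14 4 * * ?" }
}'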

Narrow Your Search Feature in Elasticsearch

Hi,

How can I implement the "Narrow your search" feature that GSA has? I tried the completion suggester and am getting the following error:
index: metaindex
shard: 4
reason: BroadcastShardOperationFailedException[[metaindex][4] ]; nested: ElasticsearchException[failed to execute suggest]; nested: ElasticsearchException[Field [body] is not a completion suggest field];

Can anyone please help me?

Thanks in advance.
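The error says the completion suggester was pointed at body, which is a plain string field; the completion suggester only works on a field mapped with type completion and populated at index time (so the crawler or a reindex step would have to fill it). A minimal sketch of such a mapping and a suggest request, with example index, type, and field names:

curl -XPUT 'localhost:9200/metaindex/page/_mapping' -d '{
  "page" : {
    "properties" : {
      "suggest" : { "type" : "completion" }
    }
  }
}'

curl -XPOST 'localhost:9200/metaindex/_suggest' -d '{
  "narrow" : {
    "text" : "man",
    "completion" : { "field" : "suggest" }
  }
}'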

Not all pages being crawled

Hi @marevol,

Thanks for the help over the last few days - it is really appreciated!

I have managed to get the crawling working across my two sites, however I'm noticing that not all the pages are being crawled, which is quite strange.

There are pages within my primary navigation that are being skipped altogether, even though they appear right beside pages that are being crawled.

I have left the crawler running overnight, but it still hasn't discovered these pages.

I created the crawler (after setting up the other indexes) by:

curl -XPUT 'http://localhost:9200/_river/compassion_web/_meta' -d '
{
    "type" : "web",
    "crawl" : {
        "index" : "compassion_uat",
        "url" : ["https://compassionau.custhelp.com/ci/sitemap/", "http://uat.compassiondev.net.au/"],
        "includeFilter" : ["https://compassionau.custhelp.com/.*", "http://uat.compassiondev.net.au/.*"],
        "maxDepth" : 30,
        "maxAccessCount" : 1000,
        "numOfThread" : 10,
        "interval" : 1000,
                "incremental" : true,
        "overwrite" : true,
                "userAgent" : "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Elasticsearch River Web/1.1.0",
        "target" : [        
                  {
            "pattern" : {
              "url" : "https://compassionau.custhelp.com/app/answers/detail/a_id/[0-9]*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1#rn_Summary"
              },
              "body" : {
                "text" : "div#rn_AnswerText",
                "trimSpaces" : true
              }
            }
          },
                    {
            "pattern" : {
              "url" : "http://uat.compassiondev.net.au/.*",
              "mimeType" : "text/html"
            },
            "properties" : {
              "title" : {
                "text" : "h1",
                                "trimSpaces": true
              },
              "body" : {
                "text" : "div#main",
                "trimSpaces" : true
              }
            }
          }
        ]
    },
    "schedule" : {
        "cron" : "*/15 * * * * ?"
    }
}
'

Is there something I could have missed?

I notice that in one of your examples on the main wiki page you add

"menus" : {
                "text" : "ul.nav-list li a",
                "isArray" : true
              }

To the "properties" for one of your sites, could that have something to do with it, or is it unrelated?

One time crawling

If there is no "cron" property, the crawler starts immediately (actually, 1 minute later) and is then unregistered after the crawl is completed.
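A minimal sketch of such a one-shot registration, i.e. the same river configuration but with the whole schedule block omitted (index name, URL, and selectors are examples):

curl -XPUT 'localhost:9200/_river/one_shot_web/_meta' -d '{
    "type" : "web",
    "crawl" : {
        "index" : "webindex",
        "url" : ["http://www.example.com/"],
        "includeFilter" : ["http://www.example.com/.*"],
        "maxDepth" : 2,
        "maxAccessCount" : 100,
        "numOfThread" : 3,
        "interval" : 1000,
        "target" : [
          {
            "pattern" : { "url" : "http://www.example.com/.*", "mimeType" : "text/html" },
            "properties" : {
              "title" : { "text" : "title" },
              "body" : { "text" : "body", "trimSpaces" : true }
            }
          }
        ]
    }
}'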

It exceeds maxAccessCount

The crawler exceeds the maxAccessCount defined in the config. For example, I limited it to 500 pages, but it crawled about 770 web pages. Is this a bug or intended behavior?

How to prioritize the search results based on type

I have many types in my index, for example a type for downloads, for resources, for main ASPX pages, and for case studies. When the user types 'manuals', 'user manuals', or 'pdfs', the search should prioritize results so that PDFs and manuals are shown first. If the keyword is something like 'downloads', the results should come from the downloads type. Can anyone help?

Thanks in advance.
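One general Elasticsearch approach (not specific to this plugin) is to boost matches from the preferred types with should clauses on _type inside a bool query; a minimal sketch with example index, type, and boost values:

curl -XPOST 'localhost:9200/myindex/_search' -d '{
  "query" : {
    "bool" : {
      "must" : [
        { "query_string" : { "query" : "user manuals" } }
      ],
      "should" : [
        { "term" : { "_type" : { "value" : "downloads", "boost" : 3.0 } } },
        { "term" : { "_type" : { "value" : "casestudies", "boost" : 1.5 } } }
      ]
    }
  }
}'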

Duplicated contents of different URLs

When the same content appears under different URLs (for example, content that belongs to two or more categories), is there any setting that merges those URLs and stores the content as a single document?

Specify crawled URLs in properties

If you set the "isChildUrl" property to true, the URLs given in the property value are crawled as child URLs.

"crawl" : {
...
    "target" : [
      {
...
        "properties" : {
          "childUrl" : {
            "value" : ["http://fess.codelibs.org/","http://fess.codelibs.org/ja/"],
            "isArray" : true,
            "isChildUrl" : true
          },

Script support

This feature rewrites a stored value before indexing, so you can transform crawled data with an MVEL script.

How can I get the src attribute of the image tag?

For example, I have this HTML:

<div id="thumb">
  <a href="#"><img src="http://example.com/test.jpg" /></a>
</div>

I want to get the src of the image and save it.

I tried this way but it did not work.

{
"pattern": {
 .....        
  "properties":{
    "image_src": {
      "attr": "img[src]",
      "args": ["div#thumb a"],
      "trimSpaces": true
    }
  }
}

Can anyone tell me how I can get the src?

Thank you.

How to check the crawl status

Let me ask my question in Japanese. Is it possible to get the status (crawling in progress, crawling completed, etc.) of a configured Web River? Thank you for looking into this.

Getting junk content when reading the indexed pdf files through river-web

I have indexed PDF documents using river-web and it is showing junk content, not even HTML. For reference:
%PDF-1.4 % 379 0 obj <> endobj 405 0 obj <>/Filter/FlateDecode/ID[<0C462C2110F28740BFF2829978DDE58A><16AE43F1074EB84F96B5E99D3DB91FB7>]/Index[379 38]/Info 378 0 R/Length 125/Prev 330619/Root 380 0 R/Size 417/Type/XRef/W[1 3 1]>>stream ... [the remainder of the indexed content is raw, unreadable PDF stream data] ...

Can you please help me out?

Thanks in advance.

Question: Boilerpipe?

Does the crawler use boilerpipe? I have seen that it is in the dependencies, however I couldn't see where it is used. What is the reason for including it?

Thanks
