
elasticsearch-river-csv's People

Contributors

bryant1410, espnsti, vtajzich, xxbedy


elasticsearch-river-csv's Issues

Insertion Stops After 25000

I am working with elasticsearch-river-csv. I was able to insert only up to 25k records into ES; my total record count is 100k.
Operating Environment
Windows 8.1
ES 1.7
Log File

The last entry is Going to execute new bulk composed of 1000 actions; after this it only prints that it is waiting for the next poll time.

My query:
PUT /_river/my_csv_river/_meta
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/hariganesh/Downloads/CSV",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"fields" : [
"column1",
"column2",
"column3"
],
"first_line_is_header" : "false",
"field_separator" : ",",
"escape_character" : ";",
"quote_character" : """,
"field_id" : "id",
"field_id_include" : "false",
"field_timestamp" : "imported_at",
"concurrent_requests" : "1",
"charset" : "UTF-8",
"script_before_all": "/path/to/before_all.sh",
"script_after_all": "/path/to/after_all.sh",
"script_before_file": "/path/to/before_file.sh",
"script_after_file": "/path/to/after_file.sh"
},
"index" : {
"index" : "my_csv_data",
"type" : "csv_type",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}
Input CSV file:
8848488,Harinath,"A, B, C,D"
8848489,Hari,"E,F,G,H"

Can you tell me what the problem might be?

script_before_file

Is there something special I need to do to get this to print out to the screen what the file is doing?

I'm getting an error and I am not able to tell if it's because it's not setting a variable or....

Data type is always String

I have a problem: the CSV River ignores the mapping settings and just inserts String values. I noticed that this behavior is probably caused by the line of code at https://github.com/AgileWorksOrg/elasticsearch-river-csv/blob/master/src/main/groovy/org/agileworks/elasticsearch/river/csv/OpenCSVFileProcessor.groovy#L78

builder.field((String) fieldName, line[position])

It's not really clear to me how I can either define which data type should be used, or ensure that the existing mapping is adhered to.

Can you give me some hints? Is this possible at all?
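Not part of the original question, but a minimal hedged sketch of one approach: create the index with an explicit mapping before the river first runs, assuming Elasticsearch coerces numeric strings into numerically mapped fields. The index, type, and field names below are hypothetical.

curl -XPUT localhost:9200/my_csv_data -d '
{
    "mappings" : {
        "csv_type" : {
            "properties" : {
                "column1" : { "type" : "integer" },
                "column2" : { "type" : "string" },
                "column3" : { "type" : "double" }
            }
        }
    }
}'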

Incomplete Data

First of all, thank you for the very useful plugin.

I installed version 2.0.0 and I'm using it to import a CSV file composed of 495089 entries into Elasticsearch 1.0.0. The curl I'm issuing in order to import the data is the following:

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/tmp",
        "filename_pattern" : ".*\\.csv$",
        "poll":"5m",
        "first_line_is_header":"true",
        "field_separator" : ";",
        "escape_character" : "\n",
        "quote_character" : "\""
    },
    "index" : {
        "index" : "my_index",
        "type" : "item",
        "bulk_size" : 100,
        "bulk_threshold" : 10
    }
}'

After the execution of the curl, river-csv indicates that it processed the whole file (with the whole 495089 records). However, Elasticsearch contains only a portion of the data that varies slightly when I redo the whole import process from scratch. For instance, after my last attempt to import the data, Elasticsearch contains only 114237 records out of the original 495089 ones. Is there something wrong that I'm doing and that I'm not aware of?

How do I add multiple CSV files?

I want to add multiple CSV files as different indices. How do I do it if all the csv files are present in the same /root/csv folder path?
How do I add multiple CSV files to the same index as different types?
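A hedged sketch, not taken from this issue, of the pattern other issues in this list use: one river per CSV family, each with its own filename_pattern. For different indices, point each river at its own index; for the second question, point a second river at the same index with a different type. The river names, patterns, and index/type names below are hypothetical.

curl -XPUT localhost:9200/_river/products_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/root/csv",
        "filename_pattern" : "^products.*\\.csv$",
        "first_line_is_header" : "true"
    },
    "index" : {
        "index" : "products_index",
        "type" : "product"
    }
}'

curl -XPUT localhost:9200/_river/orders_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/root/csv",
        "filename_pattern" : "^orders.*\\.csv$",
        "first_line_is_header" : "true"
    },
    "index" : {
        "index" : "products_index",
        "type" : "order"
    }
}'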

csv file is not importing into my index

I am using ElasticSearch 1.4.4 and csvriver plugin version 2.1.2.

My csv-river stops at

Processing File myfile.csv; it stops there and does not proceed further. Can you please guide me on what the problem may be?

Index is not created

I am using the following command to import data from a CSV file.

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/testuser/river_test",
"filename_pattern" : ".*.csv$",
"first_line_is_header" : "true",
"field_separator" : ",",
"charset" : "UTF-8"
},
"index" : {
"index" : "myvar_model",
"type" : "csv_type",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}'

But the index is not created. I searched for the documentation but couldn't find anything. I am using Elasticsearch version 1.2.4 and plugin version 2.1.0.

Mappings update

How do I update the mappings I have created in the river _meta? I changed the order of the columns in the mappings and restarted ES as well, but I still see the old order in the logs:

Using configuration: org.agileworks.elasticsearch.river.csv.Configuration(/usr/local/data, .*.csv, false, [ekv_raw, ekv_flight, event_id, cookie_id, dpId, vertical, activity_group, activity, eventDateTime, departureDate, returnDate, origin, destination, destination_country_code, destination_state, destination_city, carrier, cabinClassGroup, currency, duration, travelers, airFare, bookedDate], 1m, ztmp_inventory_tool_sample, invData, 10000, , ", ,, 10, 5, id, null, null, null, null, null, UTF-8)

Any suggestions on how to clear the river settings and recreate the mappings?

Imported Data is Different from Original

Why is the imported data different from the original? I have indexed the same data directly from the database into ES via NEST and everything was okay. I changed the separator to a semicolon.

This is inside the csv file
69377;Decortie Labirent Kitaplık - Venge;Decortie Labirent Kitaplık - Venge;;/labirent-kitaplik-venge-dcm063/p/69377;http://image.x.com/ProductImages/156767/Bevidea-mobilya-kitaplik-dcm063-1_250x250.jpg;139.00;Decortie;Renk=Venge|Ürün Çeşidi=Kitaplıklar|Kullanım Alanı=Ev|Ürün Grupları=Mobilya|Marka=Decortie|Malzeme=Yonga Levha;Mobilyalar;Kitaplıklar ;;;;;;Mobilyalar/Kitaplıklar ;DCM063;139.00;1;2012-10-24 14:48:13.013;;False;True;False;307992;985;False;labirent-kitaplik-venge-dcm063;0;Mobilyalar | Kitaplıklar | | | | | | ;0;.0;219446;

and this is the indexed data

"_source": {

"��P\u0000r\u0000o\u0000d\u0000u\u0000c\u0000t\u0000N\u0000u\u0000m\u0000b\u0000e\u0000r\u0000;\u0000T\u0000i\u0000t\u0000l\u0000e\u0000;\u0000D\u0000e\u0000s\u0000c\u0000r\u0000i\u0000p\u0000t\u0000i\u0000o\u0000n\u0000;\u0000S\u0000h\u0000o\u0000r\u0000t\u0000;\u0000D\u0000e\u0000e\u0000p\u0000l\u0000i\u0000n\u0000k\u0000;\u0000I\u0000m\u0000a\u0000g\u0000e\u0000U\u0000R\u0000L\u0000;\u0000P\u0000r\u0000i\u0000c\u0000e\u0000;\u0000B\u0000r\u0000a\u0000n\u0000d\u0000;\u0000A\u0000t\u0000t\u0000r\u0000i\u0000b\u0000u\u0000t\u0000e\u0000s\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00001\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00002\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00003\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00004\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00005\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00006\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00007\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u0000P\u0000a\u0000t\u0000h\u0000;\u0000E\u0000A\u0000N\u0000;\u0000O\u0000l\u0000d\u0000P\u0000r\u0000i\u0000c\u0000e\u0000;\u0000S\u0000a\u0000l\u0000e\u0000s\u0000R\u0000a\u0000n\u0000k\u0000i\u0000n\u0000g\u0000;\u0000C\u0000r\u0000e\u0000a\u0000t\u0000e\u0000d\u0000O\u0000n\u0000U\u0000t\u0000c\u0000;\u0000C\u0000a\u0000m\u0000p\u0000a\u0000i\u0000g\u0000n\u0000E\u0000n\u0000d\u0000O\u0000n\u0000U\u0000t\u0000c\u0000;\u0000I\u0000s\u0000F\u0000a\u0000s\u0000t\u0000S\u0000h\u0000i\u0000p\u0000p\u0000i\u0000n\u0000g\u0000;\u0000I\u0000s\u0000F\u0000r\u0000e\u0000e\u0000S\u0000h\u0000i\u0000p\u0000p\u0000i\u0000n\u0000g\u0000;\u0000I\u0000s\u0000O\u0000u\u0000t\u0000l\u0000e\u0000t\u0000;\u0000P\u0000r\u0000o\u0000d\u0000c\u0000u\u0000t\u0000v\u0000a\u0000r\u0000i\u0000a\u0000n\u0000t\u0000I\u0000d\u0000;\u0000S\u0000t\u0000o\u0000c\u0000k\u0000Q\u0000u\u0000a\u0000n\u0000t\u0000i\u0000t\u0000y\u0000;\u0000O\u0000u\u0000t\u0000O\u0000f\u0000S\u0000t\u0000o\u0000c\u0000k\u0000;\u0000S\u0000e\u0000N\u0000a\u0000m\u0000e\u0000;\u0000C\u0000a\u0000m\u0000p\u0000a\u0000i\u0000g\u0000n\u0000I\u0000d\u0000;\u0000A\u0000l\u0000l\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000i\u0000e\u0000s\u0000;\u0000D\u0000i\u0000s\u0000c\u0000o\u0000u\u0000n\u0000t\u0000;\u0000A\u0000v\u0000g\u0000R\u0000a\u0000t\u0000i\u0000n\u0000g\u0000;\u0000D\u0000i\u0000s\u0000p\u0000l\u0000a\u0000y\u0000O\u0000r\u0000d\u0000e\u0000r\u0000;\u0000": "\u00009\u00002\u00003\u00006\u00008\u0000;\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000 \u0000D\u0000i\u0000a\u0000g\u0000o\u0000n\u0000 \u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u00001\u0001;\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000 \u0000D\u0000i\u0000a\u0000g\u0000o\u0000n\u0000 \u0000D\u0000u\u0000_\u0001 
\u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u00001\u0001;\u0000;\u0000/\u0000d\u0000i\u0000a\u0000g\u0000o\u0000n\u0000-\u0000d\u0000u\u0000s\u0000-\u0000b\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u0000i\u0000-\u0000y\u0000y\u0000a\u00000\u00007\u00008\u0000/\u0000p\u0000/\u00009\u00002\u00003\u00006\u00008\u0000;\u0000h\u0000t\u0000t\u0000p\u0000:\u0000/\u0000/\u0000i\u0000m\u0000a\u0000g\u0000e\u0000.\u0000e\u0000v\u0000i\u0000d\u0000e\u0000a\u0000.\u0000c\u0000o\u0000m\u0000/\u0000P\u0000r\u0000o\u0000d\u0000u\u0000c\u0000t\u0000I\u0000m\u0000a\u0000g\u0000e\u0000s\u0000/\u00001\u00003\u00003\u00004\u00001\u00008\u0000/\u0000t\u0000h\u0000u\u0000m\u0000b\u0000s\u0000/\u0000B\u0000y\u0000y\u0000a\u00000\u00007\u00008\u0000_\u00001\u0000_\u00002\u00005\u00000\u0000p\u0000x\u0000.\u0000j\u0000p\u0000g\u0000;\u00006\u00009\u00009\u0000.\u00000\u00000\u0000;\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000;\u0000M\u0000a\u0000r\u0000k\u0000a\u0000=\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000|\u0000�\u0000r\u0000�\u0000n\u0000 \u0000G\u0000r\u0000u\u0000p\u0000l\u0000a\u0000r\u00001\u0001=\u0000T\u0000e\u0000s\u0000i\u0000s\u0000a\u0000t\u0000|\u0000K\u0000u\u0000l\u0000l\u0000a\u0000n\u00001\u0001m\u0000 \u0000A\u0000l\u0000a\u0000n\u00001\u0001=\u0000B\u0000a\u0000n\u0000y\u0000o\u0000|\u0000�\u0000r\u0000�\u0000n\u0000 \u0000�\u0000e\u0000_\u0001i\u0000d\u0000i\u0000=\u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u00001\u0001;\u0000B\u0000a\u0000n\u0000y\u0000o\u0000;\u0000D\u0000u\u0000_\u0001 \u0000T\u0000a\u0000k\u00001\u0001m\u0000l\u0000a\u0000r\u00001\u0001 \u0000v\u0000e\u0000 \u0000A\u0000k\u0000s\u0000e\u0000s\u0000u\u0000a\u0000r\u0000l\u0000a\u0000r\u00001\u0001;\u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000l\u0000a\u0000r\u00001\u0001;\u0000;\u0000;\u0000;\u0000;\u0000B\u0000a\u0000n\u0000y\u0000o\u0000/\u0000D\u0000u\u0000_\u0001 \u0000T\u0000a\u0000k\u00001\u0001m\u0000l\u0000a\u0000r\u00001\u0001 \u0000v\u0000e\u0000 \u0000A\u0000k\u0000s\u0000e\u0000s\u0000u\u0000a\u0000r\u0000l\u0000a\u0000r\u00001\u0001/\u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000l\u0000a\u0000r\u00001\u0001;\u0000Y\u0000Y\u0000A\u00000\u00007\u00008\u0000;\u00006\u00009\u00009\u0000.\u00000\u00000\u0000;\u00000\u0000;\u00002\u00000\u00001\u00002\u0000-\u00000\u00007\u0000-\u00002\u00000\u0000 \u00001\u00000\u0000:\u00004\u00002\u0000:\u00003\u00004\u0000.\u00003\u00003\u00007\u0000;\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u00003\u00003\u00000\u00009\u00008\u00003\u0000;\u00002\u00000\u00000\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u0000d\u0000i\u0000a\u0000g\u0000o\u0000n\u0000-\u0000d\u0000u\u0000s\u0000-\u0000b\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u0000i\u0000-\u0000y\u0000y\u0000a\u00000\u00007\u00008\u0000;\u00000\u0000;\u0000B\u0000a\u0000n\u0000y\u0000o\u0000 \u0000|\u0000 \u0000D\u0000u\u0000_\u0001 \u0000T\u0000a\u0000k\u00001\u0001m\u0000l\u0000a\u0000r\u00001\u0001 \u0000v\u0000e\u0000 \u0000A\u0000k\u0000s\u0000e\u0000s\u0000u\u0000a\u0000r\u0000l\u0000a\u0000r\u00001\u0001 \u0000|\u0000 \u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000l\u0000a\u0000r\u00001\u0001 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000 \u0000|\u0000 
\u0000;\u00000\u0000;\u0000.\u00000\u0000;\u00002\u00001\u00002\u00006\u00002\u00009\u0000;\u0000"

CSV River strange behavior

Hi all,

I have recently replaced my bulk import mechanism (PHP and the bulk API) with the CSV river. What I have noticed so far is strange behavior that shows up after a certain index size (around 10,000,000 docs and ~1.5 GB on disk). When the index is small everything works fine; I have set bulk_size=1000, concurrent_requests=4 and bulk_threshold=10. After a couple of hours, when the index has grown, the whole process slows down and the import of .csv files becomes really slow. I checked the Elasticsearch log files and figured out that the execution cycle (polling time) of the import is interrupted. For instance, here is what I get from the logs:

logs

[2015-02-23 20:08:55,135][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb7eed2bbe9.csv.processing
[2015-02-23 20:08:55,136][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb7eed2bbe9.csv.processing, processed lines 2300
[2015-02-23 20:08:55,137][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb81de7e37f.csv
[2015-02-23 20:08:55,146][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:09:52,079][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-02-23 20:09:54,170][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:09:54,286][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:10:41,762][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:10:41,911][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:10:52,411][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:11:37,582][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:11:37,758][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb81de7e37f.csv.processing
[2015-02-23 20:11:37,759][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb81de7e37f.csv.processing, processed lines 2985
[2015-02-23 20:11:37,759][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb807bf351c.csv
[2015-02-23 20:11:37,765][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:12:02,830][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:12:30,479][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:12:30,536][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:13:03,132][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-02-23 20:13:24,458][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:13:24,581][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:14:03,423][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:14:12,914][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:14:13,010][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb807bf351c.csv.processing
[2015-02-23 20:14:13,010][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb807bf351c.csv.processing, processed lines 2924
[2015-02-23 20:14:13,011][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb7eb509a30.csv
[2015-02-23 20:14:13,032][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:15:11,204][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:15:11,311][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:15:13,741][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]

As you can see, the intervals between cycles are irregular: one cycle ends at 2015-02-23 20:13:24 and the next starts at 2015-02-23 20:14:12. Below you can find the CSV river and index settings.

CSV River

{
    "type": "csv",
    "csv_file": {
        "folder": "/vagrant/CSV/",
        "filename_pattern": ".*\.csv$",
        "poll": "1m",
        "fields": [
            "serverId",
            "duration",
            "requestTime",
            "responseTime",
            "statementType",
            "isRealQuery",
            "queryFailed",
            "sqlQuery",
            "transactionId",
            "clientName",
            "serverName",
            "serverUniqueName",
            "affectedTables",
            "queryError",
            "canonCommandType",
            "canonicalId"
        ],
        "first_line_is_header": "false",
        "concurrent_requests": "4",
        "charset": "UTF-8"
    },
    "index": {
        "index": "maxweb",
        "type": "queries",
        "bulk_size": 1000,
        "bulk_threshold": 10
    }
}

Index mapping

{
    "mappings": {
        "queries": {
            "transform": {
                "script": "ctx._source['affectedTables'] = ctx._source['affectedTables']?.tokenize(',')",
                "lang": "groovy"
            },
            "_all": {
                "enabled": false
            },
            "_source": {
                "compress": false
            },
            "properties": {
                "affectedTables": {
                    "type": "string",
                    "index": "not_analyzed",
                    "copy_to": [
                        "suggest_tables"
                    ]
                },
                "canonCommandType": {
                    "type": "integer",
                    "index": "no"
                },
                "canonicalId": {
                    "type": "string",
                    "index": "no"
                },
                "clientName": {
                    "type": "string",
                    "index": "no"
                },
                "duration": {
                    "type": "double",
                    "doc_values": true
                },
                "isRealQuery": {
                    "type": "string"
                },
                "queryError": {
                    "type": "string",
                    "index": "no"
                },
                "queryFailed": {
                    "type": "boolean"
                },
                "requestTime": {
                    "type": "double",
                    "doc_values": true
                },
                "responseTime": {
                    "type": "double",
                    "index": "no"
                },
                "serverId": {
                    "type": "long",
                    "doc_values": true
                },
                "serverName": {
                    "type": "string",
                    "index": "no"
                },
                "serverUniqueName": {
                    "type": "string",
                    "index": "no"
                },
                "sqlQuery": {
                    "type": "string",
                    "norms": {
                        "enabled": false
                    }
                },
                "statementType": {
                    "type": "integer",
                    "doc_values": true
                },
                "suggest_tables": {
                    "type": "completion",
                    "analyzer": "simple",
                    "payloads": false,
                    "preserve_separators": true,
                    "preserve_position_increments": true,
                    "max_input_length": 50
                }
            }
        }
    }
}

elasticsearch.yml

index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 512mb
indices.fielddata.cache.size: 20%
indices.cache.filter.size: 20%
indices.memory.index_buffer_size: 40%
index.merge.scheduler.max_thread_count : 1
bootstrap.mlockall: true

/etc/sysconfig/elasticsearch

MAX_LOCKED_MEMORY=unlimited
MAX_OPEN_FILES=65535
ES_JAVA_OPTS=-server
ES_HEAP_SIZE=512m

index status

{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "indices": {
        "maxweb": {
            "index": {
                "primary_size_in_bytes": 3413521092,
                "size_in_bytes": 3413521092
            },
            "translog": {
                "operations": 4423
            },
            "docs": {
                "num_docs": 17886624,
                "max_doc": 17886624,
                "deleted_docs": 0
            },
            "merges": {
                "current": 0,
                "current_docs": 0,
                "current_size_in_bytes": 0,
                "total": 29,
                "total_time_in_millis": 28204,
                "total_docs": 262490,
                "total_size_in_bytes": 60109517
            },
            "refresh": {
                "total": 158,
                "total_time_in_millis": 15612
            },
            "flush": {
                "total": 3,
                "total_time_in_millis": 23029
            },
            "shards": {
                "0": [{
                    "routing": {
                        "state": "STARTED",
                        "primary": true,
                        "node": "oYGAJctoTTmSU1wD021byA",
                        "relocating_node": null,
                        "shard": 0,
                        "index": "maxweb"
                    },
                    "state": "STARTED",
                    "index": {
                        "size_in_bytes": 3413521092
                    },
                    "translog": {
                        "id": 1424570700536,
                        "operations": 4423
                    },
                    "docs": {
                        "num_docs": 17886624,
                        "max_doc": 17886624,
                        "deleted_docs": 0
                    },
                    "merges": {
                        "current": 0,
                        "current_docs": 0,
                        "current_size_in_bytes": 0,
                        "total": 29,
                        "total_time_in_millis": 28204,
                        "total_docs": 262490,
                        "total_size_in_bytes": 60109517
                    },
                    "refresh": {
                        "total": 158,
                        "total_time_in_millis": 15612
                    },
                    "flush": {
                        "total": 3,
                        "total_time_in_millis": 23029
                    }
                }]
            }
        }
    }
}

index stats

{
         "primaries": {
            "docs": {
               "count": 17890687,
               "deleted": 0
            },
            "store": {
               "size_in_bytes": 3416809219,
               "throttle_time_in_millis": 669
            },
            "indexing": {
               "index_total": 76407,
               "index_time_in_millis": 9773626,
               "index_current": 2,
               "delete_total": 0,
               "delete_time_in_millis": 0,
               "delete_current": 0,
               "noop_update_total": 0,
               "is_throttled": false,
               "throttle_time_in_millis": 0
            },
            "get": {
               "total": 0,
               "time_in_millis": 0,
               "exists_total": 0,
               "exists_time_in_millis": 0,
               "missing_total": 0,
               "missing_time_in_millis": 0,
               "current": 0
            },
            "search": {
               "open_contexts": 0,
               "query_total": 0,
               "query_time_in_millis": 0,
               "query_current": 0,
               "fetch_total": 0,
               "fetch_time_in_millis": 0,
               "fetch_current": 0
            },
            "merges": {
               "current": 0,
               "current_docs": 0,
               "current_size_in_bytes": 0,
               "total": 31,
               "total_time_in_millis": 29507,
               "total_docs": 281907,
               "total_size_in_bytes": 64514039
            },
            "refresh": {
               "total": 165,
               "total_time_in_millis": 16431
            },
            "flush": {
               "total": 3,
               "total_time_in_millis": 23029
            },
            "warmer": {
               "current": 0,
               "total": 368,
               "total_time_in_millis": 119
            },
            "filter_cache": {
               "memory_size_in_bytes": 0,
               "evictions": 0
            },
            "id_cache": {
               "memory_size_in_bytes": 0
            },
            "fielddata": {
               "memory_size_in_bytes": 0,
               "evictions": 0
            },
            "percolate": {
               "total": 0,
               "time_in_millis": 0,
               "current": 0,
               "memory_size_in_bytes": -1,
               "memory_size": "-1b",
               "queries": 0
            },
            "completion": {
               "size_in_bytes": 32864
            },
            "segments": {
               "count": 26,
               "memory_in_bytes": 9095212,
               "index_writer_memory_in_bytes": 324016,
               "index_writer_max_memory_in_bytes": 103887667,
               "version_map_memory_in_bytes": 22792,
               "fixed_bit_set_memory_in_bytes": 0
            },
            "translog": {
               "operations": 8246,
               "size_in_bytes": 4267399
            },
            "suggest": {
               "total": 0,
               "time_in_millis": 0,
               "current": 0
            },
            "query_cache": {
               "memory_size_in_bytes": 0,
               "evictions": 0,
               "hit_count": 0,
               "miss_count": 0
            }
         },
         "total": {
            "docs": {
               "count": 17890687,
               "deleted": 0
            },
            "store": {
               "size_in_bytes": 3416809219,
               "throttle_time_in_millis": 669
            },
            "indexing": {
               "index_total": 76407,
               "index_time_in_millis": 9773626,
               "index_current": 2,
               "delete_total": 0,
               "delete_time_in_millis": 0,
               "delete_current": 0,
               "noop_update_total": 0,
               "is_throttled": false,
               "throttle_time_in_millis": 0
            },
            "get": {
               "total": 0,
               "time_in_millis": 0,
               "exists_total": 0,
               "exists_time_in_millis": 0,
               "missing_total": 0,
               "missing_time_in_millis": 0,
               "current": 0
            },
            "search": {
               "open_contexts": 0,
               "query_total": 0,
               "query_time_in_millis": 0,
               "query_current": 0,
               "fetch_total": 0,
               "fetch_time_in_millis": 0,
               "fetch_current": 0
            },
            "merges": {
               "current": 0,
               "current_docs": 0,
               "current_size_in_bytes": 0,
               "total": 31,
               "total_time_in_millis": 29507,
               "total_docs": 281907,
               "total_size_in_bytes": 64514039
            },
            "refresh": {
               "total": 165,
               "total_time_in_millis": 16431
            },
            "flush": {
               "total": 3,
               "total_time_in_millis": 23029
            },
            "warmer": {
               "current": 0,
               "total": 368,
               "total_time_in_millis": 119
            },
            "filter_cache": {
               "memory_size_in_bytes": 0,
               "evictions": 0
            },
            "id_cache": {
               "memory_size_in_bytes": 0
            },
            "fielddata": {
               "memory_size_in_bytes": 0,
               "evictions": 0
            },
            "percolate": {
               "total": 0,
               "time_in_millis": 0,
               "current": 0,
               "memory_size_in_bytes": -1,
               "memory_size": "-1b",
               "queries": 0
            },
            "completion": {
               "size_in_bytes": 32864
            },
            "segments": {
               "count": 26,
               "memory_in_bytes": 9095212,
               "index_writer_memory_in_bytes": 324016,
               "index_writer_max_memory_in_bytes": 103887667,
               "version_map_memory_in_bytes": 22792,
               "fixed_bit_set_memory_in_bytes": 0
            },
            "translog": {
               "operations": 8246,
               "size_in_bytes": 4267399
            },
            "suggest": {
               "total": 0,
               "time_in_millis": 0,
               "current": 0
            },
            "query_cache": {
               "memory_size_in_bytes": 0,
               "evictions": 0,
               "hit_count": 0,
               "miss_count": 0
            }
         }
      }

*** Is store.throttle_time_in_millis: 669 considered an important factor? I am asking since I use doc_values in my mapping, so maybe I am pushing my little VM too hard :)

Finally I did notice some high I/O traffic with iotop

[iotop screenshot from 2015-02-23 22:48:22 not reproduced here]

Here is the sys info

Vagrant
OS: CentOS release 6.6
RAM: 2GB
CPU: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (2 cores)

Thanks a lot for your time

Regards,
Alex

The proxylab team | http://www.proxylab.io/

Disable polling / run on demand

This is awesome, thank you!

Is there a way to disable polling entirely, and instead trigger the river by an API call?

I'm guessing I can set poll to a long time period, but it feels a bit of a cheat ;-)
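Not a documented feature, just a hedged workaround sketch: the logs in other issues here show the river scanning its folder as soon as the _meta document is created, so a very long poll combined with deleting and recreating the river can act as a manual trigger. The "8760h" value assumes the poll setting accepts Elasticsearch-style time units; the river name and folder are placeholders.

curl -XDELETE localhost:9200/_river/my_csv_river
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/tmp",
        "first_line_is_header" : "true",
        "poll" : "8760h"
    }
}'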

Not importing data

Running ES 1.7 and River-CSV 2.2.1

I'm trying to pull in a large number of CSV files but I'm having zero luck.

Using the command
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '

{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp/csv",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"first_line_is_header":"true"
}
}'
{"_index":"_river","_type":"my_csv_river","_id":"_meta","_version":1,"created":true}

I see the index created in ES, but nothing happens after that. All I see in the ES log is this:

[2015-08-31 08:59:20,753][INFO ][cluster.metadata ] [EL-PROD-01] [_river] update_mapping my_csv_river
[2015-08-31 08:59:20,993][INFO ][cluster.metadata ] [EL-PROD-01] [_river] update_mapping my_csv_river

Ideas?

Getting java.lang.ArrayIndexOutOfBoundsException Error after a while

I am processing large CSV files with the river to index them into ES. After a few bulks, I got the following error:

[2014-01-28 08:08:15,712][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:16,171][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:16,455][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:17,061][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:21,019][WARN ][river.csv                ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:22,108][ERROR][river.csv                ] 
java.lang.ArrayIndexOutOfBoundsException

This may be related to my CSV, which is mostly Chinese text; there could be a few characters that aren't UTF-8 compliant. However, is there logic that could be implemented to prevent this from making everything fail?

Thanks !

Glitches with two double quotes

Hi folks,

I get very strange behavior when I use a double quote to escape a double quote (e.g. ""content"").

These are the CSV cells involved in my issue:
cell one :"371620ZV001 (DTJZL MDPB) - UDYAQNIANP VBRHVEXHOWIC EO HVWQBZ : QEF 2013-12-30 22:30 VN AQHH FRQ WMABP LF GSXYPMA OQ RTIQAGUOELL ""T:\HTTUMNC\"":ZT TZXGDJ B?ADEHW YWMLSQUR TIQ BHONLSUACNH SH MAB : 1 1 WQLZQO(R) NRVX QM VL FOJLVYTLYA",

cell two : UTDRXSPA < PFSB Q3906181,

cell tree: 08/01/2014 15:20:24 ZQOMOE/OLHIRL (QSQEPX, EHBVJA RABFH): 08/01/2014 15:19:57 FWUSUQ/URNOXR (WYWQVD, QNHBPG XGHLN) ** HECMMCJJ NQOQQVIJ / JZAAEHTU ** : YMMJPWU, RL LQD A : 5334645 Z FVH YXHQCO XR : 07-01-14 11:01 XHXYOW QI SRQ : YLBD SGJEYQ : 371620 CAUYGHB : UQXVX NZ MZQHF TV MIPNXGME THGFVPBDVLFVT",

result in ES:
"371620ZV001 (DTJZL MDPB) - UDYAQNIANP VBRHVEXHOWIC EO HVWQBZ : QEF 2013-12-30 22:30 VN AQHH FRQ WMABP LF GSXYPMA OQ RTIQAGUOELL "T:\HTTUMNC\"":ZT TZXGDJ B?ADEHW YWMLSQUR TIQ BHONLSUACNH SH MAB : 1 1 WQLZQO(R) NRVX QM VL FOJLVYTLYA",UTDRXSPA < PFSB Q3906181,08/01/2014 15:20:24 ZQOMOE/OLHIRL (QSQEPX,

I suppose this block is the culprit in my glitch, because two double quotes are still present in the ES field:
""T:\HTTUMNC\""

Could someone help me solve this?

BR

Nicolas.

Install file for 2.0.0 missing 'v'? Prevents download/install

When I try:

plugin -install xxBedy/elasticsearch-river-csv/2.0.0

I get:

-> Installing xxBedy/elasticsearch-river-csv/2.0.0...
Trying http://download.elasticsearch.org/xxBedy/elasticsearch-river-csv/elasticsearch-river-csv-2.0.0.zip...
Trying http://search.maven.org/remotecontent?filepath=xxBedy/elasticsearch-river-csv/2.0.0/elasticsearch-river-csv-2.0.0.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/xxBedy/elasticsearch-river-csv/2.0.0/elasticsearch-river-csv-2.0.0.zip...
Trying https://github.com/xxBedy/elasticsearch-river-csv/archive/v2.0.0.zip...
Trying https://github.com/xxBedy/elasticsearch-river-csv/archive/master.zip...
Failed to install xxBedy/elasticsearch-river-csv/2.0.0, reason: failed to download out of all possible locations..., use -verbose to get detailed information

I think it's because the file for 2.0.0 is named "2.0.0.zip" while the downloader is looking for "v2.0.0.zip"?

How to use the plugin in the client

I have two machines:

One has the Elasticsearch service and the elasticsearch-river-csv plugin installed (the server), while the other has neither and serves as a client.

I usually run curl -X commands from the client.

But when I run the sample from the readme on the client, the CSV cannot be imported successfully.

ps1: the sample runs successfully on the server
ps2: log (only two lines received)

[2015-01-09 03:16:18,999][INFO ][cluster.metadata ] [Big Man] [river2] creating index, cause [auto(index api)], shards [5]/[1], mappings []
[2015-01-09 03:16:19,461][INFO ][cluster.metadata ] [Big Man] [river2] update_mapping my_csv_river

Is this expected because I have not installed the plugin on the client?
What else should I do? Can I use curl -X from the client?

Thank you for your attention and your good app.

==sample tested on the server (OK)================================
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp",
"first_line_is_header":"true"
}
}'
==sample tested on the client (failed)================================
curl -XPUT localhost:9200/river2/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/aqua/tmp",
"first_line_is_header":"true"
}
}'

Notification for csv indexed

Suppose I have configured a csv-river for a folder, and new CSV files are uploaded to it from time to time. How can I get the status of a given CSV file, i.e. whether it has been indexed or not?

Do I have to use the following options:
"script_before_file": "/path/to/before_file.sh",
"script_after_file": "/path/to/after_file.sh"

I need to update my database once a file has been indexed. Do I have to write a shell script to do that?
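Not from the original issue: a hedged sketch of such an after-file hook, modeled on the sample hook script shown in a later issue in this list. It assumes the river passes the processed file name as the script arguments; the database command is purely hypothetical.

#!/bin/sh
# Hypothetical after_file hook: "$*" is assumed to carry the processed file name.
echo "finished indexing: $*"
# Hypothetical example of recording the status in a database:
# mysql mydb -e "UPDATE import_log SET indexed = 1 WHERE filename = '$1'"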

Disable the poll parameter

By default the poll parameter is set to 60 minutes. I want to disable it so that I can trigger csv-river manually to index the data. Is that possible?

Trying to add 2 rivers, first one works but second one doesn't

Hello !

First of all, I'd like to thank you for this great plugin!

I'm having an issue when adding a second river to my ES: the second one does not start.

Here is my first river :

curl -XPUT localhost:9200/_river/club/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/public",
        "filename_pattern" : "^CLUB.*\\.txt$",
        "poll":"5m",
        "fields" : [
            "idclub",
            "shortname",
            "name",
            "decoupe",
            "region",
            "ce",
            "data1",
            "data2",
            "data3",
            "data4",
            "address1",
            "address2",
            "route",
            "cp",
            "ville",
            "tel1",
            "tel2",
            "email"
        ],
        "first_line_is_header" : "false",
        "field_separator" : "\t",
        "escape_character" : "#",
        "quote_character" : "\\\\",
        "field_id" : "idclub"
    },
    "index" : {
        "index" : "clubs",
        "type" : "club",
        "bulk_size" : 10000,
        "bulk_threshold" : 1
    }
}'

When adding this, I can see in the ES logs:

[2014-03-12 10:09:37,300][DEBUG][action.index             ] [Silver Surfer] Sending mapping updated to master: index [_river] type [club]
[2014-03-12 10:09:37,348][INFO ][river.csv                ] [Silver Surfer] [csv][club] starting csv stream
[2014-03-12 10:09:37,355][INFO ][river.csv                ] [Silver Surfer] [csv][club] Using configuration: org.elasticsearch.river.csv.Configuration(/public, ^CLUB.*\.txt$, false, [idclub, shortname, name, decoupe, region, ce, data1, data2, data3, data4, address1, address2, route, cp, ville, tel1, tel2, email], 5m, ffa_clubs, club, 10000, #, \,    , 0, 1, idclub)
[2014-03-12 10:09:37,355][INFO ][river.csv                ] [Silver Surfer] [csv][club] Going to process files {}
[2014-03-12 10:09:37,355][INFO ][river.csv                ] [Silver Surfer] [csv][club] next run waiting for 5m
[2014-03-12 10:09:37,355][DEBUG][action.index             ] [Silver Surfer] Sending mapping updated to master: index [_river] type [club]

And here is my second river definition :

curl -XPUT localhost:9200/_river/licence/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/public",
        "filename_pattern" : "^LICENCE.*\\.txt$",
        "poll":"5m",
        "fields" : [
            "id",
            "lastname",
            "firstname",
            "birthdate",
            "categorie",
            "gender",
            "club",
            "firstLicenceIn",
            "LicenceUntil",
            "data1",
            "data2",
            "data3",
            "nationality",
            "data4",
            "data5",
            "firstLicenceDate",
            "lastLicenceDate",
            "data6",
            "data7",
            "data8",
            "data9",
            "data10",
            "data11",
            "ON",
            "licenceType"
        ],
        "first_line_is_header" : "false",
        "field_separator" : "\t",
        "escape_character" : "#",
        "quote_character" : "\\\\",
        "field_id" : "id"
    },
    "index" : {
        "index" : "licencies",
        "type" : "licencie",
        "bulk_size" : 10000,
        "bulk_threshold" : 1
    }
}'

When adding this one, in the ES log I can read:

[2014-03-12 10:10:13,083][DEBUG][action.index             ] [Silver Surfer] Sending mapping updated to master: index [_river] type [licence]
[2014-03-12 10:10:13,129][DEBUG][action.index             ] [Silver Surfer] Sending mapping updated to master: index [_river] type [licence]

and in ES _river data : NoClassSettingsException[Failed to load class with value [csv]]; nested: ClassNotFoundException[csv];

Am I doing something wrong ? Or is there an issue when trying to add a second river ?

TAB as field_separator

I am trying to import a TSV file with tab-separated fields, but I can't figure out how to make this work.

I have tried setting field_separator to "TAB", "tab", "\t", and "" (a literal TAB typed inside quotes), but the import always puts all the fields into one single Elasticsearch field (the first one).

The documentation should include something about this, as tab-separated flat files are quite common.
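For what it is worth, another issue further down this list (the two-rivers one) passes a tab inside the JSON body as "field_separator" : "\t". A hedged minimal sketch along those lines (the river name, folder, and pattern are placeholders):

curl -XPUT localhost:9200/_river/my_tsv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/tmp",
        "filename_pattern" : ".*\\.tsv$",
        "first_line_is_header" : "true",
        "field_separator" : "\t"
    }
}'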

Import of null or empty values from CSV

We have a CSV log with some IP addresses in it. We would like to store them as the ip type in Elasticsearch, to make them searchable by range. Unfortunately, sometimes one of the IPs is missing from a CSV line. The plugin then tries to insert an empty string ("") into the field, and this fails because the empty string is not recognized as a valid IP.

For this field it would be nice if it were not indexed at all when it is empty. Other empty fields, such as numerical ones, would also benefit from this change.

CSV Not importing on server

I'm on Ubuntu 12.04.4 LTS

Updated the system

sudo apt-get update && sudo apt-get upgrade

I installed elasticsearch using the debian file.

It seems to be installed correctly.

elasticsearch.yml file

changed cluster.name
changed node.name

I built river-csv with:

Clone elasticsearch-river-csv-source: git clone https://github.com/xxBedy/elasticsearch-river-csv.git

Installed maven

apt-get install maven

Ran maven

mvn clean package

Installed plugin

bash /usr/share/elasticsearch/bin/plugin -url file:/path_to_csv_river_repository/target/release/elasticsearch-river-csv.zip -install elasticsearch-river-csv

checked if plugin installed correctly

bash plugin -l
Installed plugins:
- river-csv

This is what it installed:

elasticsearch-river-csv-2.0.0.jar groovy-all-2.2.1.jar opencsv-2.3.jar

Location of above

/usr/share/elasticsearch/plugins/river-csv

the test csv file is located at /root/csv/demofile.csv

Restarted elastic search

service elasticsearch reboot

Ran

curl -XPUT localhost:9200/_river/cj/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/root/csv",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"first_line_is_header":"true"
}
}'

It works on my local machine (osx) but not on the server.

Ran

curl -XGET "http://localhost:9200/_search" -d'
{
"query": {
"match_all": {}
}
}'

Output

{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"_river","_type":"cj","_id":"_meta","_score":1.0, "_source" :
{
"type" : "csv",
"csv_file" : {
"folder" : "/root/csv",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"first_line_is_header":"true"
},
"index" : {
"index" : "my_index",
"type" : "item",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}},{"_index":"_river","_type":"cj","_id":"_status","_score":1.0, "_source" : {"node":{"id":"6cl4wAj4TcWE0UdN2ayhRg","name":"Sunstreak","transport_address":"inet[/IPADDRESSISHERE:9300]"}}}]}}

My thoughts

I don't think it's seeing the csv file. I thought it might be a permissions issue, so I chmod 777 the csv file.

It's on a digitalocean server with 1gb ram. The CSV file is 10mb. I've tried with smaller csv files that only have a few lines, but same issue.

Log file

Nothing interesting is going on inside it.

Autocomplete

Is there any way to use this plugin for autocomplete (http://www.elasticsearch.org/blog/you-complete-me/, http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html)?
According to the documentation, it would be necessary to duplicate the fields to be used for autocomplete into a separate field of the type called completion, e.g. for the field name:

{
  "mappings": {
    "hotel" : {
      "properties" : {
        "name" : { "type" : "string" },
        "city" : { "type" : "string" },
        "name_suggest" : {
          "type" :     "completion"
        }
      }
    }
  }
}

I tried manually changing the type used by this river, but it was changed back on the next import cycle.

Is field type mapping possible ?

I might have missed it, but it looks like all fields coming in through the river-csv are strings.

I have used the following minimal PUT request :

{
    "type" : "csv",
    "csv_file" : {
        "folder" : "C:\\localdata\\COUNTER3",
        "first_line_is_header":"true"
    }
}

Can the [fields] parameter be used to specify a field type mapping too?

Or should I specify the mapping on the index with the Elasticsearch mapping api ?
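A hedged sketch of the second option, using the Elasticsearch mapping API on the target index before the river runs; the index, type, and field names here are hypothetical:

curl -XPUT localhost:9200/my_index/item/_mapping -d '
{
    "item" : {
        "properties" : {
            "downloads" : { "type" : "integer" },
            "report_month" : { "type" : "date" }
        }
    }
}'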

Error on import

While importing a huge CSV file I ran into an error in the logs.

[2014-03-26 04:24:56,924][ERROR][org.agileworks.elasticsearch.river.csv.CSVRiver] 22
java.lang.ArrayIndexOutOfBoundsException: 22
    at org.codehaus.groovy.runtime.dgmimpl.arrays.ObjectArrayGetAtMetaMethod$MyPojoMetaMethodSite.call(ObjectArrayGetAtMetaMethod.java:57)
    at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.process(OpenCSVFileProcessor.groovy:44)
    at org.agileworks.elasticsearch.river.csv.CSVConnector.processAllFiles(CSVConnector.groovy:42)
    at org.agileworks.elasticsearch.river.csv.CSVConnector.run(CSVConnector.groovy:19)
    at java.lang.Thread.run(Thread.java:744)

It freezes the import process at this error and doesn't go on to the next csv file.

Parent-child relationship

The Mapping

"files":
{
"size" : {
"index" : "no",
"type" : "integer"
},
"filename" : {
"index" : "no",
"type" : "string"
},
"file_id" : {
"index" : "no",
"type" : "string"
}
}
},
"lines" : {
"_routing" : {
"required" : true
},
"properties" : {
"index_ref" : {
"index" : "no",
"type" : "integer"
},
"line_data" : {
"index" : "no",
"type" : "string"
},
"keyword" : {
"index" : "not_analyzed",
"store" : true,
"type" : "string"
}
},
"_parent" : {
"type" : "files"
}
}
}

Data sample (one file)

files

size, filename, file_id
417, some_path, the_id

lines

index_ref, line_data, _parent, keyword
0, some_data1, the_id, the_key_1
1, some_data2, the_id, the_key_2
2, some_data3, the_id, the_key_3
3, some_data4, the_id, the_key_4
...

Curl commands

curl -X PUT localhost:9200/my_index
curl -X PUT localhost:9200/my_index/file/_mapping -d ...
curl -X PUT localhost:9200/my_index/line/_mapping -d ...

curl -XPUT localhost:9200/_river/file_csv_river/_meta -d '
{
"type": "csv",
"csv_file": {
"folder": "/tmp",
"filename_pattern": ".*.md$",
"poll": "5m",
"first_line_is_header": "true",
"field_separator": ",",
"escape_character": "\n",
"field_id": "file_id",
"quote_character": "'",
"charset": "UTF-8"
},
"index": {
"index": "my_index",
"type": "file",
"bulk_size": 1000,
"bulk_threshold": 0
}
}'

curl -XPUT localhost:9200/_river/line_csv_river/_meta -d '
{
"type": "csv",
"csv_file": {
"folder": "/tmp",
"filename_pattern": ".*.lines$",
"poll": "5m",
"first_line_is_header": "true",
"field_separator": ",",
"escape_character": "\n",
"quote_character": "'",
"charset": "UTF-8"
},
"index": {
"index": "my_index",
"type": "line",
"bulk_size": 1000,
"bulk_threshold": 0
}
}'

The error

The insertion for 'files' works fine (no error, all the files are inserted).
However when I insert the 'lines':

org.elasticsearch.action.RoutingMissingException: routing is required for [...]

Since I want my 'lines' to be children of 'files', according to the file_id (which I specify as _parent in every lines csv), how do I do that?

Importing CSV doesn't create any elastic search documents..

Could you post a sample CSV and the _river configuration for it? I was able to create a river but the documents are not getting created. It looks like the river is processing the CSV file and appends .processing.imported to the end of the file name, but I cannot find the contents of that CSV in my Elasticsearch.

I'm using elastic search 0.20.5

Thanks

[Question] Numbers

When I try to import numbers, csv-river adds them as strings. Is there a way to import them as numeric values?

Greetings

Feature request

This is probably outside the scope of the river, but I'd like to put the idea out there anyways.

I would like to be able to run a bash command before/after each CSV file import.

make it work with ES 1.3.x

Current version throws exception:

Caused by: groovy.lang.GroovyRuntimeException: Conflicting module versions. Module [groovy-all is loaded in version 2.3.2 and you are trying to load version 2.2.1
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$DefaultModuleListener.onModule(MetaClassRegistryImpl.java:509)
    at org.codehaus.groovy.runtime.m12n.ExtensionModuleScanner.scanExtensionModuleFromProperties(ExtensionModuleScanner.java:78)
    at org.codehaus.groovy.runtime.m12n.ExtensionModuleScanner.scanExtensionModuleFromMetaInf(ExtensionModuleScanner.java:72)
    at org.codehaus.groovy.runtime.m12n.ExtensionModuleScanner.scanClasspathModules(ExtensionModuleScanner.java:54)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.<init>(MetaClassRegistryImpl.java:110)
    at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.<init>(MetaClassRegistryImpl.java:71)
    at groovy.lang.GroovySystem.<clinit>(GroovySystem.java:33)
    ... 34 more

Support for subfolders while polling

Does csv-river support subfolders while polling?

My Folder structure for csv files is as follows:

ParentFolder
  subfolder1
    a.csv
  subfolder2
    b.csv

I can configure a river for ParentFolder. If I put a CSV file into subfolder2, will the river configuration pick it up automatically, or do I have to configure a river for each subfolder?

Possibility to overwrite the existing document

Hi,

First of all, thank you for the plugin. I am using it and have run into an issue: the plugin creates a random id. In our scenario, we want to use a field as the id so that the existing document is replaced when the id already exists.

I would like to request that this feature be added to the plugin.

The workaround I have so far is to use this code instead:

currentRequest.add(Requests.indexRequest(indexName).type(typeName).create(false).source(builder));

Best Regards,
Gluay
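A hedged sketch of a possible alternative without patching the code, assuming the field_id option seen in other configurations in this list designates the CSV column used as the document _id (so re-importing a row with the same id overwrites the existing document); the column name "id" and the folder are placeholders.

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/tmp",
        "first_line_is_header" : "true",
        "field_id" : "id"
    }
}'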

No csv_file configuration found

Hey all, I'm trying to load a CSV file using the read.me tutorial word for word. Using the "minimal curl" section as an example, I run:

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "test.csv" : {
        "folder" : "/root",
        "first_line_is_header":"true"
    }
}'

The response I get on the command line is:

{"_index":"_river","_type":"my_csv_river","_id":"_meta","_version":11,"created":false}

The /root/test.csv file contains:

barcode,date
1234567890,2014-05-01
2345678901,2014-05-02
3456789012,2014-05-03

I am trying to view the results of this load in Kibana, which tells me:

1) Error injecting constructor, org.agileworks.elasticsearch.river.csv.ConfigurationException:
No csv_file configuration found. See read.me (https://github.com/xxBedy/elasticsearch-river-csv)
at org.agileworks.elasticsearch.river.csv.CSVRiver.<init>(Unknown Source)
while locating org.agileworks.elasticsearch.river.csv.CSVRiver
while locating org.elasticsearch.river.River

I'm not sure which "csv_file configuration" this message is referring to. It says to see the read.me, but the readme doesn't say anything about this.

Maybe I am just missing something totally obvious. Can anybody help ?

Thanks,
-Nick

other potentially relevant information:

java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.04.2)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

bin/elasticsearch -v
Version: 1.2.2, Build: 9902f08/2014-07-09T12:02:32Z, JVM: 1.7.0_51
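For comparison, the minimal configuration shown in another issue in this list keys the file settings under csv_file rather than under the file name. A hedged sketch adapted to the folder above:

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
    "type" : "csv",
    "csv_file" : {
        "folder" : "/root",
        "first_line_is_header" : "true"
    }
}'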

"script_before_file" doesn't work

I tried to use river-csv with my Elasticsearch server; here is my configuration:
curl -XPUT localhost:9200/_river/postarea/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp",
"filename_pattern" : "myfile.csv",
"first_line_is_header":"true",
"poll" : "1m",
"charset" : "UTF-8",
"script_before_file" : "/tmp/before_file.sh",
"script_before_all" : "/tmp/before_file.sh"
},
"index" : {
"index" : "myindex",
"type" : "mytype",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}'

And here is my /tmp/before_file.sh:

!/bin/sh

echo "greetings from shell before all, will process $*"

curl -XDELETE 'localhost:9200/myindex'

touch hex.xxx

Unfortunately, neither 'script_before_file' nor 'script_before_all' gets executed before myfile.csv is imported.

Anything wrong with my configuration?

upload hangs after about 470000 rows

Hi,

I am using elasticsearch-1.3.2-1.noarch on a 2 node cluster
And the ALL.zip from http://fec.gov/disclosurep/PDownload.do

And the following curl statement to upload:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/u01/app/div/temp",
"first_line_is_header":"true"
},
"index" : {
"index":"contributions",
"bulk_size":100000,
"bulk_threshold":10,
"type":"csv_type"
}
}'
The unzipped file has about 5M rows.
After about 470,000 rows it stops and seems to hang.
But the java process is using 1 cpu:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20180 elastics 20 0 2044m 1.0g 22m S 99.6 34.4 15:40.89 java

Is this because of the analysis of the columns?
How can I improve this?

Regards Hans-Peter

Drop bad records but continue

I have a date field in my mapping, and when the CSV file has a record that doesn't match it, the whole river stops.
It would be great if failed records could be written to a .bad file and the original processing could continue.

Not supporting geo_location type

If I have latitude/longitude in my CSV file, they are treated as String instead of a geo_location type. I am not sure whether I am doing something wrong or whether the plugin doesn't support this feature.

Installation

I just can't seem to get this river installed.

Running this:
bin/plugin -install xxBedy/elasticsearch-river-csv/1.0.1

Returns this:
-> Installing xxBedy/elasticsearch-river-csv/1.0.1...
Trying http://download.elasticsearch.org/xxBedy/elasticsearch-river-csv/elasticsearch-river-csv-1.0.1.zip...
Trying http://search.maven.org/remotecontent?filepath=xxBedy/elasticsearch-river-csv/1.0.1/elasticsearch-river-csv-1.0.1.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/xxBedy/elasticsearch-river-csv/1.0.1/elasticsearch-river-csv-1.0.1.zip...
Trying https://github.com/xxBedy/elasticsearch-river-csv/zipball/v1.0.1... (assuming site plugin)
Failed to install xxBedy/elasticsearch-river-csv/1.0.1, reason: failed to download out of all possible locations..., use -verbose to get detailed information

I proceeded to clone the repository and tried to install from it, but the CLI then said it was assuming this plugin was a site plugin and aborted the installation.

Then I tried to just move the files into the plugin directory and ran the command to start indexing my CSV. Nothing really happens though.

Any ideas?

java.lang.ArrayIndexOutOfBoundsException: 1 in elasticsearch running

[2014-12-30 10:03:35,887][ERROR][org.agileworks.elasticsearch.river.csv.CSVRiver] [Lacuna] [csv][my_csv_river] Error has occured during processing file 'PDUserDeviceDataTable.csv.processing' , skipping line: '[9999249575";"968";"cisco_user1";"00:12:F3:1C:02:6A";"2";"484";"0";"27.6";"48.4";"1419836497";"20.0";"46.0";"15.0";"56.0";"1000";"500";"12.8396621";"77.6616388]' and continue in processing
java.lang.ArrayIndexOutOfBoundsException: 1
at org.codehaus.groovy.runtime.BytecodeInterface8.objectArrayGet(BytecodeInterface8.java:360)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.processDataLine(OpenCSVFileProcessor.groovy:72)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.this$2$processDataLine(OpenCSVFileProcessor.groovy)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor$this$2$processDataLine.callCurrent(Unknown Source)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.process(OpenCSVFileProcessor.groovy:49)
at org.agileworks.elasticsearch.river.csv.CSVConnector.processAllFiles(CSVConnector.groovy:47)
at org.agileworks.elasticsearch.river.csv.CSVConnector.run(CSVConnector.groovy:20)
at java.lang.Thread.run(Thread.java:745)
the curl command used to create the index

Use the following command to create an index.

curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/paqs/Downloads/kibana/dec",
"filename_pattern" : ".*.csv$",
"poll":"1m",
"fields" : [
"Sno",
"userld",
"userName",
"deviceld",
"deviceCurrentMode",
"co2Level",
"dustLevel",
"temperature",
"relativeHumidity",
"timeStamp",
"tempLow",
"tempHigh",
"rhLow",
"rhHigh",
"dust",
"pollution",
"latitude",
"longitude"
],
"first_line_is_header" : "false",
"field_separator" : ",",
"escape_character" : "",
"quote_character" : """,
"field_id" : "id",
"field_timestamp" : "imported_at",
"concurrent_requests" : "1",
"charset" : "UTF-8",
"script_before_file": "/home/paqs/Downloads/kibana/dec/before_file.sh",
"script_after_file": "/home/paqs/Downloads/kibana/dec/after_file.sh",
"script_before_all": "/home/paqs/Downloads/kibana/dec/before_all.sh",
"script_after_all": "/home/paqs/Downloads/kibana/dec/after_all.sh"
},
"index" : {
"index" : "decdevicedata",
"type" : "alert",
"bulk_size" : 1000,
"bulk_threshold" : 10
}
}'

#4 Create a mapping

curl -XPUT http://localhost:9200/decdevicedata -d '
{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"alert" : {
"properties" : {
"Sno": {"type" : "integer"},
"co2Level" : {"type" : "integer"},
"deviceCurrentMode" : {"type" : "integer"},
"deviceld" : {"type" : "string"},
"dust" : {"type" : "integer"},
"dustLevel" : {"type" : "integer"},
"latitude": {"type" : "integer"},
"longitude": {"type" : "integer"},
"pollution" : {"type" : "integer"},
"relativeHumidity" : {"type" : "float"},
"rhLow": {"type" : "float"},
"rhHigh": {"type" : "float"},
"temperature": {"type" : "float"},
"tempLow": {"type" : "float"},
"tempHigh": {"type" : "float"},
"timeStamp" : {"type" : "date", "ignore_malformed" : true, "format" : "dateOptionalTime"},
"userld" : {"type" : "integer"},
"userName" : {"type" : "string"}

        }
    }
}

}'

event on start or end polling data

Hi,

I love the river so far; it works quite well.

I need some way to fire a custom query at Elasticsearch before an import starts. Do you think that's possible somehow?

The reason why:

I want to delete all documents containing a specific id in a certain field; the CSV then feeds new/up-to-date documents into the index.

Get the idea? It's not really an issue, just a question.
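A hedged sketch of one way to do this with the script_before_all hook that other configurations in this list use; the index name, field, and id value are hypothetical, and the delete-by-query endpoint assumes an Elasticsearch 1.x cluster.

#!/bin/sh
# Hypothetical before_all hook: remove every document whose feed_id matches
# the batch about to be re-imported (index, field, and value are placeholders).
curl -XDELETE 'localhost:9200/myindex/_query' -d '
{
    "query" : { "term" : { "feed_id" : "my-special-id" } }
}'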
