agileworksorg / elasticsearch-river-csv
CSV river for ElasticSearch
License: Apache License 2.0
I am working with elasticsearch-river-csv. I was only able to insert about 25k records into ES; my total record count is 100k.
Operating Environment
Windows 8.1
ES 1.7
Log File
My query:
PUT /_river/my_csv_river/_meta
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/hariganesh/Downloads/CSV",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"fields" : [
"column1",
"column2",
"column3"
],
"first_line_is_header" : "false",
"field_separator" : ",",
"escape_character" : ";",
"quote_character" : """,
"field_id" : "id",
"field_id_include" : "false",
"field_timestamp" : "imported_at",
"concurrent_requests" : "1",
"charset" : "UTF-8",
"script_before_all": "/path/to/before_all.sh",
"script_after_all": "/path/to/after_all.sh",
"script_before_file": "/path/to/before_file.sh",
"script_after_file": "/path/to/after_file.sh"
},
"index" : {
"index" : "my_csv_data",
"type" : "csv_type",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}
Input CSV file:
8848488,Harinath,"A, B, C,D"
8848489,Hari,"E,F,G,H"
Can you tell me what the problem might be?
Is there something special I need to do to get it to print to the screen what it is doing with the file?
I'm getting an error, and I can't tell whether it's because a variable isn't being set or something else.
I have a problem that the CSV River ignores the mapping settings, and just inserts String values. I noticed that this behavior is probably caused by the code line found at https://github.com/AgileWorksOrg/elasticsearch-river-csv/blob/master/src/main/groovy/org/agileworks/elasticsearch/river/csv/OpenCSVFileProcessor.groovy#L78
builder.field((String) fieldName, line[position])
It's not clear to me how I can either define which data type should be used, or ensure that the mapping is adhered to.
Can you give me some hints? Is this possible at all?
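Since the line above emits every column value as a string, one workaround (a sketch under assumptions, not a feature of the plugin) is to define an explicit mapping up front and let Elasticsearch coerce parseable numeric strings, or to post-process the documents yourself. A minimal illustration of the latter idea in Python; the column names and type table here are hypothetical:

```python
# Hypothetical post-processing sketch: the river hands Elasticsearch
# string-only fields, so coerce them to typed values first.
def coerce_row(row, types):
    """Convert each string value in `row` according to `types`
    (unlisted columns stay strings)."""
    casts = {"integer": int, "double": float,
             "boolean": lambda v: v.lower() == "true",
             "string": str}
    return {col: casts[types.get(col, "string")](val)
            for col, val in row.items()}

row = {"id": "8848488", "price": "139.00", "active": "True"}
types = {"id": "integer", "price": "double", "active": "boolean"}
print(coerce_row(row, types))  # {'id': 8848488, 'price': 139.0, 'active': True}
```

Defining the target mapping before creating the river is the lighter-weight option when the string values are already parseable as the mapped types.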
First of all, thank you for the very useful plugin.
I installed version 2.0.0 and I'm using it to import a CSV file composed of 495089 entries into Elasticsearch 1.0.0. The curl I'm issuing in order to import the data is the following:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp",
"filename_pattern" : ".*\\.csv$",
"poll":"5m",
"first_line_is_header":"true",
"field_separator" : ";",
"escape_character" : "\n",
"quote_character" : "\""
},
"index" : {
"index" : "my_index",
"type" : "item",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}'
After the execution of the curl, river-csv indicates that it processed the whole file (with the whole 495089 records). However, Elasticsearch contains only a portion of the data that varies slightly when I redo the whole import process from scratch. For instance, after my last attempt to import the data, Elasticsearch contains only 114237 records out of the original 495089 ones. Is there something wrong that I'm doing and that I'm not aware of?
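One possible explanation (an assumption, not confirmed by the logs): "processed lines" counts what was sent, while individual bulk items can still fail silently, and if a field_id column with duplicate values were configured, each duplicate would overwrite the previous document. A quick sanity check comparing total rows to distinct ids, sketched in Python with hypothetical column names:

```python
import csv

def row_and_unique_counts(lines, id_column, delimiter=";"):
    """Return (total data rows, distinct values of id_column).
    If the two differ, duplicate ids would make the document count
    in ES lower than the number of lines processed."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    ids = [row[id_column] for row in reader]
    return len(ids), len(set(ids))

sample = ["id;name", "1;a", "2;b", "1;c"]   # hypothetical data
print(row_and_unique_counts(sample, "id"))  # (3, 2)
```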
I want to add multiple CSV files as different indices. How do I do it if all the csv files are present in the same /root/csv folder path?
How do I add multiple CSV files to the same index as different types?
I am using ElasticSearch 1.4.4 and csvriver plugin version 2.1.2.
My csv river stops at
"Processing file myfile.csv" — it stays there and does not proceed further. Can you please guide me on what the problem may be?
I am using the following command to import data from a CSV file.
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/testuser/river_test",
"filename_pattern" : ".*.csv$",
"first_line_is_header" : "true",
"field_separator" : ",",
"charset" : "UTF-8"
},
"index" : {
"index" : "myvar_model",
"type" : "csv_type",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}'
But the index is not created. I searched for documentation but couldn't find anything. I am using Elasticsearch version 1.2.4 and plugin version 2.1.0.
How do I update the mappings I have created in the river _meta? I changed the order of the columns in the mappings and restarted ES as well, but I still see the old order in the logs:
Using configuration: org.agileworks.elasticsearch.river.csv.Configuration(/usr/local/data, .*.csv, false, [ekv_raw, ekv_flight, event_id, cookie_id, dpId, vertical, activity_group, activity, eventDateTime, departureDate, returnDate, origin, destination, destination_country_code, destination_state, destination_city, carrier, cabinClassGroup, currency, duration, travelers, airFare, bookedDate], 1m, ztmp_inventory_tool_sample, invData, 10000, , ", ,, 10, 5, id, null, null, null, null, null, UTF-8)
Any suggestions on how to clear the river settings and recreate the mappings?
Why is the imported data different from the original? I have indexed the same data directly from the database into ES via NEST and everything was okay. I changed the separator to a semicolon.
This is inside the CSV file:
69377;Decortie Labirent Kitaplık - Venge;Decortie Labirent Kitaplık - Venge;;/labirent-kitaplik-venge-dcm063/p/69377;http://image.x.com/ProductImages/156767/Bevidea-mobilya-kitaplik-dcm063-1_250x250.jpg;139.00;Decortie;Renk=Venge|Ürün Çeşidi=Kitaplıklar|Kullanım Alanı=Ev|Ürün Grupları=Mobilya|Marka=Decortie|Malzeme=Yonga Levha;Mobilyalar;Kitaplıklar ;;;;;;Mobilyalar/Kitaplıklar ;DCM063;139.00;1;2012-10-24 14:48:13.013;;False;True;False;307992;985;False;labirent-kitaplik-venge-dcm063;0;Mobilyalar | Kitaplıklar | | | | | | ;0;.0;219446;
and this is the indexed data
"_source": {
"��P\u0000r\u0000o\u0000d\u0000u\u0000c\u0000t\u0000N\u0000u\u0000m\u0000b\u0000e\u0000r\u0000;\u0000T\u0000i\u0000t\u0000l\u0000e\u0000;\u0000D\u0000e\u0000s\u0000c\u0000r\u0000i\u0000p\u0000t\u0000i\u0000o\u0000n\u0000;\u0000S\u0000h\u0000o\u0000r\u0000t\u0000;\u0000D\u0000e\u0000e\u0000p\u0000l\u0000i\u0000n\u0000k\u0000;\u0000I\u0000m\u0000a\u0000g\u0000e\u0000U\u0000R\u0000L\u0000;\u0000P\u0000r\u0000i\u0000c\u0000e\u0000;\u0000B\u0000r\u0000a\u0000n\u0000d\u0000;\u0000A\u0000t\u0000t\u0000r\u0000i\u0000b\u0000u\u0000t\u0000e\u0000s\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00001\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00002\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00003\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00004\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00005\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00006\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u00007\u0000;\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000y\u0000P\u0000a\u0000t\u0000h\u0000;\u0000E\u0000A\u0000N\u0000;\u0000O\u0000l\u0000d\u0000P\u0000r\u0000i\u0000c\u0000e\u0000;\u0000S\u0000a\u0000l\u0000e\u0000s\u0000R\u0000a\u0000n\u0000k\u0000i\u0000n\u0000g\u0000;\u0000C\u0000r\u0000e\u0000a\u0000t\u0000e\u0000d\u0000O\u0000n\u0000U\u0000t\u0000c\u0000;\u0000C\u0000a\u0000m\u0000p\u0000a\u0000i\u0000g\u0000n\u0000E\u0000n\u0000d\u0000O\u0000n\u0000U\u0000t\u0000c\u0000;\u0000I\u0000s\u0000F\u0000a\u0000s\u0000t\u0000S\u0000h\u0000i\u0000p\u0000p\u0000i\u0000n\u0000g\u0000;\u0000I\u0000s\u0000F\u0000r\u0000e\u0000e\u0000S\u0000h\u0000i\u0000p\u0000p\u0000i\u0000n\u0000g\u0000;\u0000I\u0000s\u0000O\u0000u\u0000t\u0000l\u0000e\u0000t\u0000;\u0000P\u0000r\u0000o\u0000d\u0000c\u0000u\u0000t\u0000v\u0000a\u0000r\u0000i\u0000a\u0000n\u0000t\u0000I\u0000d\u0000;\u0000S\u0000t\u0000o\u0000c\u0000k\u0000Q\u0000u\u0000a\u0000n\u0000t\u0000i\u0000t\
u0000y\u0000;\u0000O\u0000u\u0000t\u0000O\u0000f\u0000S\u0000t\u0000o\u0000c\u0000k\u0000;\u0000S\u0000e\u0000N\u0000a\u0000m\u0000e\u0000;\u0000C\u0000a\u0000m\u0000p\u0000a\u0000i\u0000g\u0000n\u0000I\u0000d\u0000;\u0000A\u0000l\u0000l\u0000C\u0000a\u0000t\u0000e\u0000g\u0000o\u0000r\u0000i\u0000e\u0000s\u0000;\u0000D\u0000i\u0000s\u0000c\u0000o\u0000u\u0000n\u0000t\u0000;\u0000A\u0000v\u0000g\u0000R\u0000a\u0000t\u0000i\u0000n\u0000g\u0000;\u0000D\u0000i\u0000s\u0000p\u0000l\u0000a\u0000y\u0000O\u0000r\u0000d\u0000e\u0000r\u0000;\u0000": "\u00009\u00002\u00003\u00006\u00008\u0000;\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000 \u0000D\u0000i\u0000a\u0000g\u0000o\u0000n\u0000 \u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u00001\u0001;\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000 \u0000D\u0000i\u0000a\u0000g\u0000o\u0000n\u0000 \u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u00001\u0001;\u0000;\u0000/\u0000d\u0000i\u0000a\u0000g\u0000o\u0000n\u0000-\u0000d\u0000u\u0000s\u0000-\u0000b\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u0000i\u0000-\u0000y\u0000y\u0000a\u00000\u00007\u00008\u0000/\u0000p\u0000/\u00009\u00002\u00003\u00006\u00008\u0000;\u0000h\u0000t\u0000t\u0000p\u0000:\u0000/\u0000/\u0000i\u0000m\u0000a\u0000g\u0000e\u0000.\u0000e\u0000v\u0000i\u0000d\u0000e\u0000a\u0000.\u0000c\u0000o\u0000m\u0000/\u0000P\u0000r\u0000o\u0000d\u0000u\u0000c\u0000t\u0000I\u0000m\u0000a\u0000g\u0000e\u0000s\u0000/\u00001\u00003\u00003\u00004\u00001\u00008\u0000/\u0000t\u0000h\u0000u\u0000m\u0000b\u0000s\u0000/\u0000B\u0000y\u0000y\u0000a\u00000\u00007\u00008\u0000_\u00001\u0000_\u00002\u00005\u00000\u0000p\u0000x\u0000.\u0000j\u0000p\u0000g\u0000;\u00006\u00009\u00009\u0000.\u00000\u00000\u0000;\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000;\u0000M\u0000a\u0000r\u0000k\u0000a\u0000=\u0000A\u0000r\u0000t\u0000e\u0000m\u0000a\u0000|\u0000�\u0000r\u0000�\u0000n\u0000 
\u0000G\u0000r\u0000u\u0000p\u0000l\u0000a\u0000r\u00001\u0001=\u0000T\u0000e\u0000s\u0000i\u0000s\u0000a\u0000t\u0000|\u0000K\u0000u\u0000l\u0000l\u0000a\u0000n\u00001\u0001m\u0000 \u0000A\u0000l\u0000a\u0000n\u00001\u0001=\u0000B\u0000a\u0000n\u0000y\u0000o\u0000|\u0000�\u0000r\u0000�\u0000n\u0000 \u0000�\u0000e\u0000_\u0001i\u0000d\u0000i\u0000=\u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u00001\u0001;\u0000B\u0000a\u0000n\u0000y\u0000o\u0000;\u0000D\u0000u\u0000_\u0001 \u0000T\u0000a\u0000k\u00001\u0001m\u0000l\u0000a\u0000r\u00001\u0001 \u0000v\u0000e\u0000 \u0000A\u0000k\u0000s\u0000e\u0000s\u0000u\u0000a\u0000r\u0000l\u0000a\u0000r\u00001\u0001;\u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000l\u0000a\u0000r\u00001\u0001;\u0000;\u0000;\u0000;\u0000;\u0000B\u0000a\u0000n\u0000y\u0000o\u0000/\u0000D\u0000u\u0000_\u0001 \u0000T\u0000a\u0000k\u00001\u0001m\u0000l\u0000a\u0000r\u00001\u0001 \u0000v\u0000e\u0000 \u0000A\u0000k\u0000s\u0000e\u0000s\u0000u\u0000a\u0000r\u0000l\u0000a\u0000r\u00001\u0001/\u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000l\u0000a\u0000r\u00001\u0001;\u0000Y\u0000Y\u0000A\u00000\u00007\u00008\u0000;\u00006\u00009\u00009\u0000.\u00000\u00000\u0000;\u00000\u0000;\u00002\u00000\u00001\u00002\u0000-\u00000\u00007\u0000-\u00002\u00000\u0000 \u00001\u00000\u0000:\u00004\u00002\u0000:\u00003\u00004\u0000.\u00003\u00003\u00007\u0000;\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u00003\u00003\u00000\u00009\u00008\u00003\u0000;\u00002\u00000\u00000\u0000;\u0000F\u0000a\u0000l\u0000s\u0000e\u0000;\u0000d\u0000i\u0000a\u0000g\u0000o\u0000n\u0000-\u0000d\u0000u\u0000s\u0000-\u0000b\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000s\u0000i\u0000-\u0000y\u0000y\u0000a\u00000\u00007\u00008\u0000;\u00000\u0000;\u0000B\u0000a\u0000n\u0000y\u0000o\u0000 
\u0000|\u0000 \u0000D\u0000u\u0000_\u0001 \u0000T\u0000a\u0000k\u00001\u0001m\u0000l\u0000a\u0000r\u00001\u0001 \u0000v\u0000e\u0000 \u0000A\u0000k\u0000s\u0000e\u0000s\u0000u\u0000a\u0000r\u0000l\u0000a\u0000r\u00001\u0001 \u0000|\u0000 \u0000D\u0000u\u0000_\u0001 \u0000B\u0000a\u0000t\u0000a\u0000r\u0000y\u0000a\u0000l\u0000a\u0000r\u00001\u0001 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000 \u0000|\u0000 \u0000;\u00000\u0000;\u0000.\u00000\u0000;\u00002\u00001\u00002\u00006\u00002\u00009\u0000;\u0000"
Hi all,
I have recently replaced my bulk import mechanism (PHP and the bulk API) with river-csv. What I have noticed so far is strange behavior that shows up once the index reaches a certain size (around 10,000,000 docs and ~1.5 GB on disk). When the index is small everything works fine; I have set bulk_size=1000, concurrent_requests=4 and bulk_threshold=10. After a couple of hours, when the index becomes bigger, the whole process slows down and the import of .csv files becomes really slow. I checked the Elasticsearch log files and found that the execution cycle (polling time) of the import is interrupted. For instance, here is what I get from the logs:
[2015-02-23 20:08:55,135][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb7eed2bbe9.csv.processing
[2015-02-23 20:08:55,136][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb7eed2bbe9.csv.processing, processed lines 2300
[2015-02-23 20:08:55,137][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb81de7e37f.csv
[2015-02-23 20:08:55,146][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:09:52,079][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-02-23 20:09:54,170][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:09:54,286][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:10:41,762][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:10:41,911][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:10:52,411][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:11:37,582][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:11:37,758][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb81de7e37f.csv.processing
[2015-02-23 20:11:37,759][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb81de7e37f.csv.processing, processed lines 2985
[2015-02-23 20:11:37,759][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb807bf351c.csv
[2015-02-23 20:11:37,765][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:12:02,830][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:12:30,479][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:12:30,536][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:13:03,132][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: [SocketTimeoutException[Read timed out]]
[2015-02-23 20:13:24,458][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:13:24,581][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:14:03,423][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
[2015-02-23 20:14:12,914][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:14:13,010][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File has been processed 54eb807bf351c.csv.processing
[2015-02-23 20:14:13,010][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] File 54eb807bf351c.csv.processing, processed lines 2924
[2015-02-23 20:14:13,011][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Processing file 54eb7eb509a30.csv
[2015-02-23 20:14:13,032][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:15:11,204][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Executed bulk composed of 1000 actions
[2015-02-23 20:15:11,311][INFO ][org.agileworks.elasticsearch.river.csv.CSVRiver] [Domina] [csv][maxweb] Going to execute new bulk composed of 1000 actions
[2015-02-23 20:15:13,741][ERROR][marvel.agent.exporter ] [Domina] error sending data to [http://[0:0:0:0:0:0:0:0]:9200/.marvel-2015.02.23/_bulk]: SocketTimeoutException[Read timed out]
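The slowdown can be quantified straight from these timestamps; a small helper for computing the gap between two log lines (the log format is taken from the lines above, everything else is illustrative):

```python
from datetime import datetime

def seconds_between(start_line, end_line):
    """Seconds elapsed between two log lines whose timestamps look like
    [2015-02-23 20:13:24,581]..."""
    fmt = "%Y-%m-%d %H:%M:%S,%f"
    parse = lambda line: datetime.strptime(line[1:24], fmt)
    return (parse(end_line) - parse(start_line)).total_seconds()

start = "[2015-02-23 20:13:24,581][INFO ] Going to execute new bulk composed of 1000 actions"
end = "[2015-02-23 20:14:12,914][INFO ] Executed bulk composed of 1000 actions"
print(seconds_between(start, end))  # 48.333 -> roughly 48 s for a 1000-doc bulk
```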
As you can see, the intervals between cycles are irregular: one cycle ends at 2015-02-23 20:13:24 and the next starts at 2015-02-23 20:14:12. Below you can find the CSV river and index settings:
{
"type": "csv",
"csv_file": {
"folder": "/vagrant/CSV/",
"filename_pattern": ".*\.csv$",
"poll": "1m",
"fields": [
"serverId",
"duration",
"requestTime",
"responseTime",
"statementType",
"isRealQuery",
"queryFailed",
"sqlQuery",
"transactionId",
"clientName",
"serverName",
"serverUniqueName",
"affectedTables",
"queryError",
"canonCommandType",
"canonicalId"
],
"first_line_is_header": "false",
"concurrent_requests": "4",
"charset": "UTF-8"
},
"index": {
"index": "maxweb",
"type": "queries",
"bulk_size": 1000,
"bulk_threshold": 10
}
}
{
"mappings": {
"queries": {
"transform": {
"script": "ctx._source['affectedTables'] = ctx._source['affectedTables']?.tokenize(',')",
"lang": "groovy"
},
"_all": {
"enabled": false
},
"_source": {
"compress": false
},
"properties": {
"affectedTables": {
"type": "string",
"index": "not_analyzed",
"copy_to": [
"suggest_tables"
]
},
"canonCommandType": {
"type": "integer",
"index": "no"
},
"canonicalId": {
"type": "string",
"index": "no"
},
"clientName": {
"type": "string",
"index": "no"
},
"duration": {
"type": "double",
"doc_values": true
},
"isRealQuery": {
"type": "string"
},
"queryError": {
"type": "string",
"index": "no"
},
"queryFailed": {
"type": "boolean"
},
"requestTime": {
"type": "double",
"doc_values": true
},
"responseTime": {
"type": "double",
"index": "no"
},
"serverId": {
"type": "long",
"doc_values": true
},
"serverName": {
"type": "string",
"index": "no"
},
"serverUniqueName": {
"type": "string",
"index": "no"
},
"sqlQuery": {
"type": "string",
"norms": {
"enabled": false
}
},
"statementType": {
"type": "integer",
"doc_values": true
},
"suggest_tables": {
"type": "completion",
"analyzer": "simple",
"payloads": false,
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
}
}
}
}
}
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 512mb
indices.fielddata.cache.size: 20%
indices.cache.filter.size: 20%
indices.memory.index_buffer_size: 40%
index.merge.scheduler.max_thread_count : 1
bootstrap.mlockall: true
MAX_LOCKED_MEMORY=unlimited
MAX_OPEN_FILES=65535
ES_JAVA_OPTS=-server
ES_HEAP_SIZE=512m
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"indices": {
"maxweb": {
"index": {
"primary_size_in_bytes": 3413521092,
"size_in_bytes": 3413521092
},
"translog": {
"operations": 4423
},
"docs": {
"num_docs": 17886624,
"max_doc": 17886624,
"deleted_docs": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 29,
"total_time_in_millis": 28204,
"total_docs": 262490,
"total_size_in_bytes": 60109517
},
"refresh": {
"total": 158,
"total_time_in_millis": 15612
},
"flush": {
"total": 3,
"total_time_in_millis": 23029
},
"shards": {
"0": [{
"routing": {
"state": "STARTED",
"primary": true,
"node": "oYGAJctoTTmSU1wD021byA",
"relocating_node": null,
"shard": 0,
"index": "maxweb"
},
"state": "STARTED",
"index": {
"size_in_bytes": 3413521092
},
"translog": {
"id": 1424570700536,
"operations": 4423
},
"docs": {
"num_docs": 17886624,
"max_doc": 17886624,
"deleted_docs": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 29,
"total_time_in_millis": 28204,
"total_docs": 262490,
"total_size_in_bytes": 60109517
},
"refresh": {
"total": 158,
"total_time_in_millis": 15612
},
"flush": {
"total": 3,
"total_time_in_millis": 23029
}
}]
}
}
}
}
{
"primaries": {
"docs": {
"count": 17890687,
"deleted": 0
},
"store": {
"size_in_bytes": 3416809219,
"throttle_time_in_millis": 669
},
"indexing": {
"index_total": 76407,
"index_time_in_millis": 9773626,
"index_current": 2,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 0,
"query_time_in_millis": 0,
"query_current": 0,
"fetch_total": 0,
"fetch_time_in_millis": 0,
"fetch_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 31,
"total_time_in_millis": 29507,
"total_docs": 281907,
"total_size_in_bytes": 64514039
},
"refresh": {
"total": 165,
"total_time_in_millis": 16431
},
"flush": {
"total": 3,
"total_time_in_millis": 23029
},
"warmer": {
"current": 0,
"total": 368,
"total_time_in_millis": 119
},
"filter_cache": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"id_cache": {
"memory_size_in_bytes": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"percolate": {
"total": 0,
"time_in_millis": 0,
"current": 0,
"memory_size_in_bytes": -1,
"memory_size": "-1b",
"queries": 0
},
"completion": {
"size_in_bytes": 32864
},
"segments": {
"count": 26,
"memory_in_bytes": 9095212,
"index_writer_memory_in_bytes": 324016,
"index_writer_max_memory_in_bytes": 103887667,
"version_map_memory_in_bytes": 22792,
"fixed_bit_set_memory_in_bytes": 0
},
"translog": {
"operations": 8246,
"size_in_bytes": 4267399
},
"suggest": {
"total": 0,
"time_in_millis": 0,
"current": 0
},
"query_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
}
},
"total": {
"docs": {
"count": 17890687,
"deleted": 0
},
"store": {
"size_in_bytes": 3416809219,
"throttle_time_in_millis": 669
},
"indexing": {
"index_total": 76407,
"index_time_in_millis": 9773626,
"index_current": 2,
"delete_total": 0,
"delete_time_in_millis": 0,
"delete_current": 0,
"noop_update_total": 0,
"is_throttled": false,
"throttle_time_in_millis": 0
},
"get": {
"total": 0,
"time_in_millis": 0,
"exists_total": 0,
"exists_time_in_millis": 0,
"missing_total": 0,
"missing_time_in_millis": 0,
"current": 0
},
"search": {
"open_contexts": 0,
"query_total": 0,
"query_time_in_millis": 0,
"query_current": 0,
"fetch_total": 0,
"fetch_time_in_millis": 0,
"fetch_current": 0
},
"merges": {
"current": 0,
"current_docs": 0,
"current_size_in_bytes": 0,
"total": 31,
"total_time_in_millis": 29507,
"total_docs": 281907,
"total_size_in_bytes": 64514039
},
"refresh": {
"total": 165,
"total_time_in_millis": 16431
},
"flush": {
"total": 3,
"total_time_in_millis": 23029
},
"warmer": {
"current": 0,
"total": 368,
"total_time_in_millis": 119
},
"filter_cache": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"id_cache": {
"memory_size_in_bytes": 0
},
"fielddata": {
"memory_size_in_bytes": 0,
"evictions": 0
},
"percolate": {
"total": 0,
"time_in_millis": 0,
"current": 0,
"memory_size_in_bytes": -1,
"memory_size": "-1b",
"queries": 0
},
"completion": {
"size_in_bytes": 32864
},
"segments": {
"count": 26,
"memory_in_bytes": 9095212,
"index_writer_memory_in_bytes": 324016,
"index_writer_max_memory_in_bytes": 103887667,
"version_map_memory_in_bytes": 22792,
"fixed_bit_set_memory_in_bytes": 0
},
"translog": {
"operations": 8246,
"size_in_bytes": 4267399
},
"suggest": {
"total": 0,
"time_in_millis": 0,
"current": 0
},
"query_cache": {
"memory_size_in_bytes": 0,
"evictions": 0,
"hit_count": 0,
"miss_count": 0
}
}
}
*** Is store.throttle_time_in_millis: 669 considered an important factor? I am asking since I use doc_values in my mapping, so maybe I am pushing my little VM too hard :)
Vagrant
OS: CentOS release 6.6
RAM: 2GB
CPU: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz (2 cores)
Thanks a lot for your time
Regards,
Alex
The proxylab team | http://www.proxylab.io/
This is awesome, thank you!
Is there a way to disable polling entirely, and instead trigger the river by an API call?
I'm guessing I can set poll to a long time period, but that feels like a bit of a cheat ;-)
Running ES 1.7 and River-CSV 2.2.1
I'm trying to pull in a large number of CSV files but I'm having zero luck.
Using the command
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp/csv",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"first_line_is_header":"true"
}
}'
{"_index":"_river","_type":"my_csv_river","_id":"_meta","_version":1,"created":true}
I see the index created in ES, but nothing happens after that. All I see in the ES log is this:
[2015-08-31 08:59:20,753][INFO ][cluster.metadata ] [EL-PROD-01] [_river] update_mapping my_csv_river
[2015-08-31 08:59:20,993][INFO ][cluster.metadata ] [EL-PROD-01] [_river] update_mapping my_csv_river
Ideas?
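For what it's worth (the logs above never show the river actually starting, so this is just one cheap thing to rule out): the filename_pattern with the unescaped dot does match ordinary .csv names, but it behaves differently from the README's .*\\.csv$ form on edge cases. The difference is easy to verify locally:

```python
import re

# The pattern as written in the river definition (unescaped dot) vs the
# stricter README form. Both match ordinary .csv names; only the loose
# one also matches names like "datacsv".
loose, strict = r".*.csv$", r".*\.csv$"
for name in ["data.csv", "datacsv", "notes.txt"]:
    print(name, bool(re.match(loose, name)), bool(re.match(strict, name)))
# data.csv True True
# datacsv True False
# notes.txt False False
```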
I am processing large CSV files with the river to index them into ES. After a few bulks, I got the following error:
[2014-01-28 08:08:15,712][WARN ][river.csv ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:16,171][WARN ][river.csv ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:16,455][WARN ][river.csv ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:17,061][WARN ][river.csv ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:21,019][WARN ][river.csv ] [Stratosfire] [csv][csv] ongoing bulk, [10] crossed threshold [10], waiting
[2014-01-28 08:08:22,108][ERROR][river.csv ]
java.lang.ArrayIndexOutOfBoundsException
This may be related to my CSV, which is mostly Chinese text; there could be a few characters that aren't UTF-8 compliant. However, is there logic that could be implemented to prevent this from failing everything?
Thanks!
Hi folks,
I have very strange behavior when I use a double quote to escape a double quote (e.g. ""content"").
These are my CSV's cells involved in the issue:
cell one: "371620ZV001 (DTJZL MDPB) - UDYAQNIANP VBRHVEXHOWIC EO HVWQBZ : QEF 2013-12-30 22:30 VN AQHH FRQ WMABP LF GSXYPMA OQ RTIQAGUOELL ""T:\HTTUMNC\"":ZT TZXGDJ B?ADEHW YWMLSQUR TIQ BHONLSUACNH SH MAB : 1 1 WQLZQO(R) NRVX QM VL FOJLVYTLYA",
cell two: UTDRXSPA < PFSB Q3906181,
cell three: 08/01/2014 15:20:24 ZQOMOE/OLHIRL (QSQEPX, EHBVJA RABFH): 08/01/2014 15:19:57 FWUSUQ/URNOXR (WYWQVD, QNHBPG XGHLN) ** HECMMCJJ NQOQQVIJ / JZAAEHTU ** : YMMJPWU, RL LQD A : 5334645 Z FVH YXHQCO XR : 07-01-14 11:01 XHXYOW QI SRQ : YLBD SGJEYQ : 371620 CAUYGHB : UQXVX NZ MZQHF TV MIPNXGME THGFVPBDVLFVT",
result in ES:
"371620ZV001 (DTJZL MDPB) - UDYAQNIANP VBRHVEXHOWIC EO HVWQBZ : QEF 2013-12-30 22:30 VN AQHH FRQ WMABP LF GSXYPMA OQ RTIQAGUOELL "T:\HTTUMNC\"":ZT TZXGDJ B?ADEHW YWMLSQUR TIQ BHONLSUACNH SH MAB : 1 1 WQLZQO(R) NRVX QM VL FOJLVYTLYA",UTDRXSPA < PFSB Q3906181,08/01/2014 15:20:24 ZQOMOE/OLHIRL (QSQEPX,
I suppose this block is the culprit, because two double quotes are still present in the ES field:
""T:\HTTUMNC\""
Could someone help me solve this?
BR
Nicolas.
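As a point of comparison (this uses Python's csv module, not the OpenCSV parser the plugin embeds, so treat it as an illustration of the expected behaviour): a standards-following parser with doubled-quote handling collapses "" inside a quoted field to a single ", so the "" surviving in ES suggests the configured escape_character/quote_character combination is bypassing that rule.

```python
import csv, io

# RFC 4180-style doubling: "" inside a quoted field means one literal ".
line = '8848488,"He said ""hello"" twice",done\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['8848488', 'He said "hello" twice', 'done']
```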
When I try:
plugin -install xxBedy/elasticsearch-river-csv/2.0.0
I get:
-> Installing xxBedy/elasticsearch-river-csv/2.0.0...
Trying http://download.elasticsearch.org/xxBedy/elasticsearch-river-csv/elasticsearch-river-csv-2.0.0.zip...
Trying http://search.maven.org/remotecontent?filepath=xxBedy/elasticsearch-river-csv/2.0.0/elasticsearch-river-csv-2.0.0.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/xxBedy/elasticsearch-river-csv/2.0.0/elasticsearch-river-csv-2.0.0.zip...
Trying https://github.com/xxBedy/elasticsearch-river-csv/archive/v2.0.0.zip...
Trying https://github.com/xxBedy/elasticsearch-river-csv/archive/master.zip...
Failed to install xxBedy/elasticsearch-river-csv/2.0.0, reason: failed to download out of all possible locations..., use -verbose to get detailed information
I think it's because the file name for 2.0.0 is "2.0.0.zip" while the downloader is looking for "v2.0.0.zip"?
This plugin doesn't work with Elasticsearch 1.1.1, right?
I have two machines:
one has the Elasticsearch service and the elasticsearch-river-csv plugin installed (the server), while the other does not and serves as a client.
I usually run curl -X on the client.
But when I run the sample from the README on the client, the CSV cannot be imported successfully.
ps1: the sample runs successfully on the server.
ps2: log (only two lines received): [2015-01-09 03:16:18,999][INFO ][cluster.metadata ] [Big Man] [river2] creating index, cause [auto(index api)], shards [5]/[1], mappings []
[2015-01-09 03:16:19,461][INFO ][cluster.metadata ] [Big Man] [river2] update_mapping my_csv_river
Is this because I have not installed the plugin on the client?
What else should I do? Can I use curl -X on the client?
Thank you for your attention and your good app.
==sample tested on the server (OK)================================
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp",
"first_line_is_header":"true"
}
}'
==sample tested on the client (failed)================================
curl -XPUT localhost:9200/river2/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/aqua/tmp",
"first_line_is_header":"true"
}
}'
Suppose I have configured a csv-river for a folder, and new CSV data is uploaded there at some point. How can I get the status of whether a given CSV file has been indexed or not?
Should I use the following options:
"script_before_file": "/path/to/before_file.sh",
"script_after_file": "/path/to/after_file.sh"
I have to update my database once a file is indexed, so should I write a shell script to do that?
By default the poll parameter is set to 60 minutes. I want to disable that parameter so that I can manually trigger the csv-river to index the data. Is that possible?
Hello !
First of all, I'd like to thank you for this great plugin!
I'm having an issue when adding a second river to my ES; the second one does not start.
Here is my first river:
curl -XPUT localhost:9200/_river/club/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/public",
"filename_pattern" : "^CLUB.*\\.txt$",
"poll":"5m",
"fields" : [
"idclub",
"shortname",
"name",
"decoupe",
"region",
"ce",
"data1",
"data2",
"data3",
"data4",
"address1",
"address2",
"route",
"cp",
"ville",
"tel1",
"tel2",
"email"
],
"first_line_is_header" : "false",
"field_separator" : "\t",
"escape_character" : "#",
"quote_character" : "\\\\",
"field_id" : "idclub"
},
"index" : {
"index" : "clubs",
"type" : "club",
"bulk_size" : 10000,
"bulk_threshold" : 1
}
}'
When adding this, I can see in the ES logs:
[2014-03-12 10:09:37,300][DEBUG][action.index ] [Silver Surfer] Sending mapping updated to master: index [_river] type [club]
[2014-03-12 10:09:37,348][INFO ][river.csv ] [Silver Surfer] [csv][club] starting csv stream
[2014-03-12 10:09:37,355][INFO ][river.csv ] [Silver Surfer] [csv][club] Using configuration: org.elasticsearch.river.csv.Configuration(/public, ^CLUB.*\.txt$, false, [idclub, shortname, name, decoupe, region, ce, data1, data2, data3, data4, address1, address2, route, cp, ville, tel1, tel2, email], 5m, ffa_clubs, club, 10000, #, \, , 0, 1, idclub)
[2014-03-12 10:09:37,355][INFO ][river.csv ] [Silver Surfer] [csv][club] Going to process files {}
[2014-03-12 10:09:37,355][INFO ][river.csv ] [Silver Surfer] [csv][club] next run waiting for 5m
[2014-03-12 10:09:37,355][DEBUG][action.index ] [Silver Surfer] Sending mapping updated to master: index [_river] type [club]
And here is my second river definition:
curl -XPUT localhost:9200/_river/licence/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/public",
"filename_pattern" : "^LICENCE.*\\.txt$",
"poll":"5m",
"fields" : [
"id",
"lastname",
"firstname",
"birthdate",
"categorie",
"gender",
"club",
"firstLicenceIn",
"LicenceUntil",
"data1",
"data2",
"data3",
"nationality",
"data4",
"data5",
"firstLicenceDate",
"lastLicenceDate",
"data6",
"data7",
"data8",
"data9",
"data10",
"data11",
"ON",
"licenceType"
],
"first_line_is_header" : "false",
"field_separator" : "\t",
"escape_character" : "#",
"quote_character" : "\\\\",
"field_id" : "id"
},
"index" : {
"index" : "licencies",
"type" : "licencie",
"bulk_size" : 10000,
"bulk_threshold" : 1
}
}'
When adding this one, I can read in the ES log:
[2014-03-12 10:10:13,083][DEBUG][action.index ] [Silver Surfer] Sending mapping updated to master: index [_river] type [licence]
[2014-03-12 10:10:13,129][DEBUG][action.index ] [Silver Surfer] Sending mapping updated to master: index [_river] type [licence]
and in ES _river data : NoClassSettingsException[Failed to load class with value [csv]]; nested: ClassNotFoundException[csv];
Am I doing something wrong? Or is there an issue when trying to add a second river?
I am trying to import a TSV file with tab-separated fields, but I can't figure out how to make this work.
I have tried setting field_separator to "TAB", "tab", "\t", and a literal tab typed inside quotes, but I always get an import where all the fields end up in one single Elasticsearch field (the first one).
The documentation should say something about this, as tab-separated flat files are quite common.
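For what it's worth, the river definitions elsewhere in this thread that do split on tabs pass the two-character JSON escape `\t`, which the JSON parser turns into a real tab character. A minimal sketch of a TSV river (folder and pattern are placeholders):

```shell
# Inside a single-quoted curl body, the backslash-t pair reaches ES
# verbatim and is decoded by the JSON parser into a tab. Passing the
# word "TAB" or a literal tab keystroke does not survive that round trip.
curl -XPUT localhost:9200/_river/my_tsv_river/_meta -d '
{
  "type" : "csv",
  "csv_file" : {
    "folder" : "/tmp",
    "filename_pattern" : ".*\\.tsv$",
    "first_line_is_header" : "true",
    "field_separator" : "\t"
  }
}'
```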
Ability to specify parent document.
Hi. As far as I know, an Elasticsearch river instance is a singleton, but in a cluster, where should the CSV files be put? In every node's directory?
We have a CSV log with some IP addresses in it. We would like to store them as the ip type in Elasticsearch, to make them searchable by range. Unfortunately, sometimes one of the IPs is missing from a CSV line. The plugin then tries to insert an empty string ("") into the field, and this fails because the empty string is not recognized as a valid IP.
It would be nice if such a field were not indexed at all when it is empty. Other empty fields, like numerical ones, might also benefit from this change.
sudo apt-get update && sudo apt-get upgrade
It seems to be installed correctly.
changed cluster.name
changed node.name
Clone elasticsearch-river-csv-source: git clone https://github.com/xxBedy/elasticsearch-river-csv.git
apt-get install maven
mvn clean package
bash /usr/share/elasticsearch/bin/plugin -url file:/path_to_csv_river_repository/target/release/elasticsearch-river-csv.zip -install elasticsearch-river-csv
bash plugin -l
Installed plugins:
- river-csv
elasticsearch-river-csv-2.0.0.jar groovy-all-2.2.1.jar opencsv-2.3.jar
/usr/share/elasticsearch/plugins/river-csv
service elasticsearch reboot
curl -XPUT localhost:9200/_river/cj/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/root/csv",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"first_line_is_header":"true"
}
}'
curl -XGET "http://localhost:9200/_search" -d'
{
"query": {
"match_all": {}
}
}'
{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"_river","_type":"cj","_id":"_meta","_score":1.0, "_source" :
{
"type" : "csv",
"csv_file" : {
"folder" : "/root/csv",
"filename_pattern" : ".*.csv$",
"poll":"5m",
"first_line_is_header":"true"
},
"index" : {
"index" : "my_index",
"type" : "item",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}},{"_index":"_river","_type":"cj","_id":"_status","_score":1.0, "_source" : {"node":{"id":"6cl4wAj4TcWE0UdN2ayhRg","name":"Sunstreak","transport_address":"inet[/IPADDRESSISHERE:9300]"}}}]}}
I don't think it's seeing the CSV file. I thought it might be a permissions issue, so I chmod 777'd the CSV file.
It's on a DigitalOcean server with 1 GB RAM. The CSV file is 10 MB. I've tried smaller CSV files that only have a few lines, but I get the same issue.
Nothing interesting going on inside it.
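One way to see what the river thinks it is doing is to fetch its `_status` document directly; the search response above already returns the same document as the second hit:

```shell
# Fetch the river's status document; "cj" is the river name used above.
curl -XGET 'localhost:9200/_river/cj/_status?pretty'
```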
Is there any way to use this plugin for autocomplete (http://www.elasticsearch.org/blog/you-complete-me/, http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html)?
According to the documentation, it would be necessary to duplicate the fields to be used for autocomplete into a separate field of the type called completion, e.g. for the field name:
{
"mappings": {
"hotel" : {
"properties" : {
"name" : { "type" : "string" },
"city" : { "type" : "string" },
"name_suggest" : {
"type" : "completion"
}
}
}
}
}
I tried manually changing the type used by this river, but it was changed back on the next import cycle.
I might have missed it, but it looks like all fields coming in through the river-csv are strings.
I have used the following minimal PUT request :
{
"type" : "csv",
"csv_file" : {
"folder" : "C:\\localdata\\COUNTER3",
"first_line_is_header":"true"
}
}
Can the [fields] parameter be used to specify a field type mapping too?
Or should I specify the mapping on the index with the Elasticsearch mapping API?
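One workaround worth trying (an assumption, not confirmed plugin behaviour) is to create the index with an explicit mapping before defining the river; Elasticsearch 1.x coerces numeric strings such as "42" into numerically mapped fields, so the string values the river sends should still index with the right type. Index, type, and field names below are placeholders:

```shell
# Hypothetical sketch: create index + mapping first, then the river.
# With an explicit mapping, ES coerces "42" into an integer field even
# though the river submits every value as a string.
curl -XPUT localhost:9200/counter_data -d '
{
  "mappings" : {
    "csv_type" : {
      "properties" : {
        "count" : { "type" : "integer" },
        "title" : { "type" : "string" }
      }
    }
  }
}'
```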
While importing a huge CSV file I ran into an error in the logs.
[2014-03-26 04:24:56,924][ERROR][org.agileworks.elasticsearch.river.csv.CSVRiver] 22
java.lang.ArrayIndexOutOfBoundsException: 22
at org.codehaus.groovy.runtime.dgmimpl.arrays.ObjectArrayGetAtMetaMethod$MyPojoMetaMethodSite.call(ObjectArrayGetAtMetaMethod.java:57)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.process(OpenCSVFileProcessor.groovy:44)
at org.agileworks.elasticsearch.river.csv.CSVConnector.processAllFiles(CSVConnector.groovy:42)
at org.agileworks.elasticsearch.river.csv.CSVConnector.run(CSVConnector.groovy:19)
at java.lang.Thread.run(Thread.java:744)
It freezes the import process at this error and doesn't go on to the next csv file.
"files":
{
"size" : {
"index" : "no",
"type" : "integer"
},
"filename" : {
"index" : "no",
"type" : "string"
},
"file_id" : {
"index" : "no",
"type" : "string"
}
}
},
"lines" : {
"_routing" : {
"required" : true
},
"properties" : {
"index_ref" : {
"index" : "no",
"type" : "integer"
},
"line_data" : {
"index" : "no",
"type" : "string"
},
"keyword" : {
"index" : "not_analyzed",
"store" : true,
"type" : "string"
}
},
"_parent" : {
"type" : "files"
}
}
}
size, filename, file_id
417, some_path, the_id
index_ref, line_data, _parent, keyword
0, some_data1, the_id, the_key_1
1, some_data2, the_id, the_key_2
2, some_data3, the_id, the_key_3
3, some_data4, the_id, the_key_4
...
curl -X PUT localhost:9200/my_index
curl -X PUT localhost:9200/my_index/file/_mapping -d ...
curl -X PUT localhost:9200/my_index/line/_mapping -d ...
curl -XPUT localhost:9200/_river/file_csv_river/_meta -d '
{
"type": "csv",
"csv_file": {
"folder": "/tmp",
"filename_pattern": ".*.md$",
"poll": "5m",
"first_line_is_header": "true",
"field_separator": ",",
"escape_character": "\n",
"field_id": "file_id",
"quote_character": "'",
"charset": "UTF-8"
},
"index": {
"index": "my_index",
"type": "file",
"bulk_size": 1000,
"bulk_threshold": 0
}
}'
curl -XPUT localhost:9200/_river/line_csv_river/_meta -d '
{
"type": "csv",
"csv_file": {
"folder": "/tmp",
"filename_pattern": ".*.lines$",
"poll": "5m",
"first_line_is_header": "true",
"field_separator": ",",
"escape_character": "\n",
"quote_character": "'",
"charset": "UTF-8"
},
"index": {
"index": "my_index",
"type": "line",
"bulk_size": 1000,
"bulk_threshold": 0
}
}'
The insertion for 'files' works fine (no errors, all the files are inserted).
However, when I insert the 'lines', I get:
org.elasticsearch.action.RoutingMissingException: routing is required for [...]
I want my 'lines' to be children of 'files', keyed by file_id (which I specify as _parent in every row of the lines CSV). How do I do that?
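For comparison, indexing a child document by hand in Elasticsearch 1.x requires passing the parent id in the URL so the document is routed to the parent's shard. The river has no documented way to pass this per-document parameter, which is consistent with the RoutingMissingException above. Names and ids here come from the sample CSVs:

```shell
# Manual child insert: ?parent=... supplies both the _parent link and
# the routing value. The CSV river cannot set this per row, as far as
# this thread shows, so children fail with RoutingMissingException.
curl -XPUT 'localhost:9200/my_index/line/0?parent=the_id' -d '
{
  "index_ref" : 0,
  "line_data" : "some_data1",
  "keyword" : "the_key_1"
}'
```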
The license file does not contain the year and name of the copyright owner.
There is only the predefined line:
Copyright [yyyy] [name of copyright owner]
Could you post a sample CSV and the _river definition for that CSV? I was able to create a river, but the documents are not getting created. It looks like the river is processing the CSV file and appends .processing.imported to the file name, but I cannot find the contents of that CSV in my Elasticsearch.
I'm using Elasticsearch 0.20.5.
Thanks
When I try to import numbers, the CSV river adds them as strings. Is there a way to add them as number values?
Greetings
In case of an invalid request, the error message is not clear to the user.
See #26.
The error message needs to be updated with specific advice.
This is probably outside the scope of the river, but I'd like to put the idea out there anyway.
I would like to be able to run a bash command before/after each CSV file import.
Current version throws exception:
Caused by: groovy.lang.GroovyRuntimeException: Conflicting module versions. Module [groovy-all is loaded in version 2.3.2 and you are trying to load version 2.2.1
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl$DefaultModuleListener.onModule(MetaClassRegistryImpl.java:509)
at org.codehaus.groovy.runtime.m12n.ExtensionModuleScanner.scanExtensionModuleFromProperties(ExtensionModuleScanner.java:78)
at org.codehaus.groovy.runtime.m12n.ExtensionModuleScanner.scanExtensionModuleFromMetaInf(ExtensionModuleScanner.java:72)
at org.codehaus.groovy.runtime.m12n.ExtensionModuleScanner.scanClasspathModules(ExtensionModuleScanner.java:54)
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.<init>(MetaClassRegistryImpl.java:110)
at org.codehaus.groovy.runtime.metaclass.MetaClassRegistryImpl.<init>(MetaClassRegistryImpl.java:71)
at groovy.lang.GroovySystem.<clinit>(GroovySystem.java:33)
... 34 more
Will the CSV river support subfolders while polling?
My folder structure for CSV files is as follows:
ParentFolder
subfolder1
a.csv
subfolder2
b.csv
I can configure a river for ParentFolder. Suppose I put a CSV file into subfolder2: will the river configuration pick it up automatically, or do I have to configure a river for each subfolder?
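Nothing in this thread shows the river descending into subfolders, so a workaround sketch is to periodically flatten new CSVs into the single watched folder, e.g. from cron. Directory names are placeholders:

```shell
#!/bin/sh
# Workaround sketch (assumption: the river watches one flat folder):
# move CSVs out of subfolders into the folder the river is configured on.
flatten_csvs() {
    src="$1"       # parent folder containing the subfolders
    watched="$2"   # folder the river polls
    find "$src" -mindepth 2 -name '*.csv' -exec mv {} "$watched/" \;
}
```

A crontab entry such as `*/5 * * * * /usr/local/bin/flatten_csvs.sh` (hypothetical path) would keep the watched folder fed.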
Hello,
I am using elasticsearch-1.3.2-1.noarch on a 2-node cluster,
with the ALL.zip from http://fec.gov/disclosurep/PDownload.do split into separate files per state.
I have created one index with a type per state, and I load state by state.
When starting multiple loads at once, not all files reach the .processing.imported state, although all lines have been loaded.
Regards, Hans-Peter
Hi,
First of all, thank you for the plugin. I am using it and have encountered an issue: the plugin creates a random id. In our scenario, we want to use a field as the id so that an existing document is replaced if the id already exists.
I would like to request to add this feature in the plugin.
The workaround I have so far is to use this code instead:
currentRequest.add(Requests.indexRequest(indexName).type(typeName).create(false).source(builder));
Best Regards,
Gluay
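For what it's worth, several river definitions in this thread pass a `field_id` option that appears to select a CSV column as the document id, so re-imports with the same id would overwrite instead of duplicating. A sketch, assuming that is what `field_id` does:

```shell
# Assumption based on other configs in this thread: "field_id" names the
# CSV column whose value becomes the document _id. Paths are placeholders.
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
  "type" : "csv",
  "csv_file" : {
    "folder" : "/tmp",
    "fields" : [ "id", "name" ],
    "field_id" : "id"
  }
}'
```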
Hey all, I'm trying to load a CSV file following the README tutorial word for word. Using the "minimal curl" section as an example, I run:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"test.csv" : {
"folder" : "/root",
"first_line_is_header":"true"
}
}'
The response I get on the command line is:
{"_index":"_river","_type":"my_csv_river","_id":"_meta","_version":11,"created":false}
The /root/test.csv file contains:
barcode,date
1234567890,2014-05-01
2345678901,2014-05-02
3456789012,2014-05-03
I am trying to view the results of this load in Kibana, which tells me:
1) Error injecting constructor, org.agileworks.elasticsearch.river.csv.ConfigurationException:
No csv_file configuration found. See read.me (https://github.com/xxBedy/elasticsearch-river-csv)
at org.agileworks.elasticsearch.river.csv.CSVRiver.<init>(Unknown Source)
while locating org.agileworks.elasticsearch.river.csv.CSVRiver
while locating org.elasticsearch.river.River
I'm not sure which "csv_file configuration" this message is referring to. It says to see the README, but the README doesn't say anything about this.
Maybe I am just missing something totally obvious. Can anybody help?
Thanks,
-Nick
other potentially relevant information:
java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.04.2)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)
bin/elasticsearch -v
Version: 1.2.2, Build: 9902f08/2014-07-09T12:02:32Z, JVM: 1.7.0_51
I tried to use river-csv with my Elasticsearch server; here is my configuration:
curl -XPUT localhost:9200/_river/postarea/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/tmp",
"filename_pattern" : "myfile.csv",
"first_line_is_header":"true",
"poll" : "1m",
"charset" : "UTF-8",
"script_before_file" : "/tmp/before_file.sh",
"script_before_all" : "/tmp/before_file.sh"
},
"index" : {
"index" : "myindex",
"type" : "mytype",
"bulk_size" : 100,
"bulk_threshold" : 10
}
}'
And here is my /tmp/before_file.sh:
echo "greetings from shell before all, will process $*"
curl -XDELETE 'localhost:9200/myindex'
touch hex.xxx
Unfortunately, neither 'script_before_file' nor 'script_before_all' gets executed before myfile.csv is imported.
Is anything wrong with my configuration?
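One assumption worth checking: if the plugin launches hook scripts as external processes, a script without a shebang line or without the executable bit would fail silently. The `before_file.sh` quoted above has no shebang. A quick check function (hypothetical, not part of the plugin):

```shell
#!/bin/sh
# Sanity-check a hook script: it should be executable and start with a
# shebang (e.g. #!/bin/sh) so the OS knows how to run it.
check_hook() {
    script="$1"
    [ -x "$script" ] || { echo "not executable: $script"; return 1; }
    head -c 2 "$script" | grep -q '#!' || { echo "missing shebang: $script"; return 1; }
    echo "ok: $script"
}
```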
Hi,
I am using elasticsearch-1.3.2-1.noarch on a 2-node cluster,
with the ALL.zip from http://fec.gov/disclosurep/PDownload.do,
and the following curl statement to upload:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/u01/app/div/temp",
"first_line_is_header":"true"
},
"index" : {
"index":"contributions",
"bulk_size":100000,
"bulk_threshold":10,
"type":"csv_type"
}
}'
The unzipped file has about 5M rows.
After about 470,000 rows it stops and seems to hang,
but the Java process is using one full CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20180 elastics 20 0 2044m 1.0g 22m S 99.6 34.4 15:40.89 java
Is this because of the analyzing of the columns?
How can I improve this?
Regards Hans-Peter
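Not a confirmed diagnosis, but a bulk_size of 100000 is very large for a heap of roughly 1 GB (the top output shows ~1 GB resident), and smaller bulks are the usual first thing to try. A sketch with more conservative values; the numbers are guesses to tune, not documented recommendations:

```shell
# Same river, smaller bulks: each bulk request stays small enough that
# building and acknowledging it does not monopolize heap and CPU.
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
  "type" : "csv",
  "csv_file" : {
    "folder" : "/u01/app/div/temp",
    "first_line_is_header" : "true"
  },
  "index" : {
    "index" : "contributions",
    "type" : "csv_type",
    "bulk_size" : 1000,
    "bulk_threshold" : 10
  }
}'
```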
I have a date field in my mapping, and when the CSV file has a record that doesn't match it, the whole river stops.
It would be great if failed records could be written to a .bad file and the original processing could continue.
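Until the river does this itself, a workaround sketch is to pre-validate files via the `script_before_file` hook, splitting rows with malformed dates into a `.bad` file before import. The column position and date shape are placeholders for whatever the real mapping expects:

```shell
#!/bin/sh
# Workaround sketch (not plugin functionality): split rows whose 3rd
# column is not a YYYY-MM-DD date into <file>.bad, keeping good rows in
# <file>.ok for the river to import. Column index is hypothetical.
split_bad_rows() {
    in="$1"
    awk -F',' -v ok="$in.ok" -v bad="$in.bad" '
        $3 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { print > ok; next }
        { print > bad }
    ' "$in"
}
```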
If I have latitude/longitude in my CSV file, they are treated as Strings instead of the geo_point type. I'm not sure whether I am doing something wrong or the plugin doesn't support this feature.
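The river emits whatever string is in each column, so one approach (an assumption, not confirmed plugin behaviour) is to put a single "lat,lon" column in the CSV and map that field as geo_point before the river runs, since Elasticsearch 1.x accepts a "lat,lon" string for geo_point fields:

```shell
# Hypothetical sketch: map a single "location" column as geo_point.
# The CSV column would then hold values like "12.8396621,77.6616388".
curl -XPUT localhost:9200/my_geo_index -d '
{
  "mappings" : {
    "csv_type" : {
      "properties" : {
        "location" : { "type" : "geo_point" }
      }
    }
  }
}'
```

Two separate latitude/longitude columns, as in the river definition below, cannot form a geo_point without combining them first.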
I just can't seem to get this river installed.
Running this:
bin/plugin -install xxBedy/elasticsearch-river-csv/1.0.1
Returns this:
-> Installing xxBedy/elasticsearch-river-csv/1.0.1...
Trying http://download.elasticsearch.org/xxBedy/elasticsearch-river-csv/elasticsearch-river-csv-1.0.1.zip...
Trying http://search.maven.org/remotecontent?filepath=xxBedy/elasticsearch-river-csv/1.0.1/elasticsearch-river-csv-1.0.1.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/xxBedy/elasticsearch-river-csv/1.0.1/elasticsearch-river-csv-1.0.1.zip...
Trying https://github.com/xxBedy/elasticsearch-river-csv/zipball/v1.0.1... (assuming site plugin)
Failed to install xxBedy/elasticsearch-river-csv/1.0.1, reason: failed to download out of all possible locations..., use -verbose to get detailed information
I proceeded to clone the repository and tried to install from it, but the CLI then said it was assuming this plugin was a site plugin and aborted the installation.
Then I tried to just move the files into the plugin directory and ran the command to start indexing my CSV. Nothing really happens, though.
Any ideas?
[2014-12-30 10:03:35,887][ERROR][org.agileworks.elasticsearch.river.csv.CSVRiver] [Lacuna] [csv][my_csv_river] Error has occured during processing file 'PDUserDeviceDataTable.csv.processing' , skipping line: '[9999249575";"968";"cisco_user1";"00:12:F3:1C:02:6A";"2";"484";"0";"27.6";"48.4";"1419836497";"20.0";"46.0";"15.0";"56.0";"1000";"500";"12.8396621";"77.6616388]' and continue in processing
java.lang.ArrayIndexOutOfBoundsException: 1
at org.codehaus.groovy.runtime.BytecodeInterface8.objectArrayGet(BytecodeInterface8.java:360)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.processDataLine(OpenCSVFileProcessor.groovy:72)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.this$2$processDataLine(OpenCSVFileProcessor.groovy)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor$this$2$processDataLine.callCurrent(Unknown Source)
at org.agileworks.elasticsearch.river.csv.OpenCSVFileProcessor.process(OpenCSVFileProcessor.groovy:49)
at org.agileworks.elasticsearch.river.csv.CSVConnector.processAllFiles(CSVConnector.groovy:47)
at org.agileworks.elasticsearch.river.csv.CSVConnector.run(CSVConnector.groovy:20)
at java.lang.Thread.run(Thread.java:745)
Here is the curl command used to create the river:
curl -XPUT localhost:9200/_river/my_csv_river/_meta -d '
{
"type" : "csv",
"csv_file" : {
"folder" : "/home/paqs/Downloads/kibana/dec",
"filename_pattern" : ".*.csv$",
"poll":"1m",
"fields" : [
"Sno",
"userld",
"userName",
"deviceld",
"deviceCurrentMode",
"co2Level",
"dustLevel",
"temperature",
"relativeHumidity",
"timeStamp",
"tempLow",
"tempHigh",
"rhLow",
"rhHigh",
"dust",
"pollution",
"latitude",
"longitude"
],
"first_line_is_header" : "false",
"field_separator" : ",",
"escape_character" : "",
"quote_character" : "\"",
"field_id" : "id",
"field_timestamp" : "imported_at",
"concurrent_requests" : "1",
"charset" : "UTF-8",
"script_before_file": "/home/paqs/Downloads/kibana/dec/before_file.sh",
"script_after_file": "/home/paqs/Downloads/kibana/dec/after_file.sh",
"script_before_all": "/home/paqs/Downloads/kibana/dec/before_all.sh",
"script_after_all": "/home/paqs/Downloads/kibana/dec/after_all.sh"
},
"index" : {
"index" : "decdevicedata",
"type" : "alert",
"bulk_size" : 1000,
"bulk_threshold" : 10
}
}'
#4 Create a mapping
curl -XPUT http://localhost:9200/decdevicedata -d '
{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"alert" : {
"properties" : {
"Sno": {"type" : "integer"},
"co2Level" : {"type" : "integer"},
"deviceCurrentMode" : {"type" : "integer"},
"deviceld" : {"type" : "string"},
"dust" : {"type" : "integer"},
"dustLevel" : {"type" : "integer"},
"latitude": {"type" : "integer"},
"longitude": {"type" : "integer"},
"pollution" : {"type" : "integer"},
"relativeHumidity" : {"type" : "float"},
"rhLow": {"type" : "float"},
"rhHigh": {"type" : "float"},
"temperature": {"type" : "float"},
"tempLow": {"type" : "float"},
"tempHigh": {"type" : "float"},
"timeStamp" : {"type" : "date", "ignore_malformed" : true, "format" : "dateOptionalTime"},
"userld" : {"type" : "integer"},
"userName" : {"type" : "string"}
}
}
}
}'
Hi,
I love the river so far; it works quite well.
I need some way to fire a custom query at Elasticsearch before an import starts. Do you think that's possible somehow?
The reason why: I want to delete all documents containing a special id in a certain field, and then have the CSV feed new, up-to-date documents into the index.
Get the idea? It's actually not a big issue.
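Assuming the `script_before_all` hook shown in other configurations in this thread is available, a `before_all.sh` could issue exactly that cleanup. A sketch using the Elasticsearch 1.x delete-by-query API; index, type, field, and value are placeholders:

```shell
#!/bin/sh
# Hypothetical before_all.sh: wired in via "script_before_all".
# Deletes every document whose source_id matches, so the subsequent
# import replaces that feed's documents. _query delete-by-query exists
# in Elasticsearch 1.x (it was removed from core in 2.0).
curl -XDELETE 'localhost:9200/my_index/my_type/_query' -d '
{
  "query" : { "term" : { "source_id" : "feed-42" } }
}'
```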