hurence / logisland Goto Github PK

Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.

Home Page: https://logisland.github.io

License: Other

Scala 3.58% Shell 0.56% Java 39.57% Makefile 0.05% Python 30.77% Roff 23.68% HTML 0.26% CSS 0.10% JavaScript 0.76% XSLT 0.55% Dockerfile 0.05% Clojure 0.07%

analytics big-data cassandra complex-event-processing elasticsearch influxdb kafka kafka-streams pattern-recognition solr spark stream-processing

logisland's People

Contributors

Stargazers

Watchers

logisland's Issues

write a LogIsland NIFI MQtt tutorial

move docker monolithic container to Docker Compose one

kafka.common.OffsetOutOfRangeException

Testing on ... use case (usr log & parser), it crashes after a while with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4279.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4279.0 (TID 4279, localhost): kafka.common.OffsetOutOfRangeException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86) at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184) at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193) at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:282) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:920) at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918) at com.hurence.logisland.job.LogParserJob$$anonfun$main$2.apply(LogParserJob.scala:100) at com.hurence.logisland.job.LogParserJob$$anonfun$main$2.apply(LogParserJob.scala:98) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50) at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: kafka.common.OffsetOutOfRangeException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86) at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184) at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193) at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:282) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) ... 3 more

add field auto extractor processor

lot of unstructured String records may contain structured information like that could be automatically inferred in a processor

json blocs
key/value fields in the form of "this un unstrctured fields with fieldA=valueA and some other stuff fieldB=valueB"

add a Documentation generator for plugins

add HDFS burner component

this processor takes all records and send them to HDFS. parameters are

partitioning strategy
compression level
hdfs block size
output format (with serializer) => Avro, CSV, Parquet, ORC ...

suppress the verbosity of logs in yarn mode

there are too many useless log lines

add processor documentation generation

add a remote debugging tutorial

handle partitioning with hostid

store consumed offsets in Kafka instead of Zookeeper

can be Zookeeper, HBase, ES, Couchbase

much better thant current chekpointing

add type checking for SplitText component

EventIndexerJob [IndexAlreadyExistsException]

The creation is inside a foreach partition, so multiple node receive the non existence of an index at roughly the same time and each one try to create an Index, resulting into an 'IndexAlreadyExistsException'.

add adapter for Nifi plugins

logisland plugins should be usable into Nifi Dataflow

ElasticsearchEventIndexer, bulkLoad function write a confusing log

in the afterBulk function, logger.info(response.buildFailureMessage()) is called, writing a confusing message that can lead the reader to think that there has been a problem during bulk processing. The message is the following: 'Bulk processor failed: failure in bulk execution:' Even when there is no errors...

add plugin directory to class path

avoid putting plugins in lib folder

generify MultilineSplitBloc component

add a config file to auto start parser and indexer jobs

typo in logisland-docs\_static\logisland-architecture.png

I guess it should be "while they appear" rather than "while the appear"

deploy artefacts to maven central

use sonatype account

typo on architecture diagram on README

"while they appear" is "while the appear"

SplitText and Multiline processor should return Record with only raw_content field if there's no REGEX match

can be an optional behavior

add a global logisland.properties

to set the configuration of all external services (Kafka, Spark, HDFS, ...)

add a full integrated test for components

2 levels

Docker container with all the components
Embedded Kafka server + embeded Elasticsaerch

Write a tutorial on how to write a plugin

add a check for in/out topics parameters validity

if input or output topic is not set in yaml conf, NullPointerException is thrown

add an autoscaler daemon

Log-island should handle all the scalability burden in background

autoscale kafka partition
manage spark executor-cores and memory in an elastic way

add Nifi EL support

Expression language is really powerful to express programatic values for fields

add a RESTful API for components live update

a REST API will help to monitor and update components properties for parsers, processors and engines.

design the API with Raml
implement it with VertX
or for embedding into Ambari view just implement with JAX-RS (https://github.com/mulesoft/raml-for-jax-rs)

POST component/<COMPONENT_ID>/statuts?state=RUNNING
POST component/<COMPONENT_ID>/statuts?state=PAUSE
GET component/<COMPONENT_ID>/statuts
GET component/<COMPONENT_ID>/metrics
GET component/<COMPONENT_ID>/configuration
POST component/<COMPONENT_ID>/configuration?<PARAM_NAME>=<PARAM_VALUE>
PUT component
...

hurence / logisland Goto Github PK

logisland's People

Contributors

Stargazers

Watchers

Forkers

logisland's Issues

Recommend Projects

Recommend Topics

Recommend Org