h4nek / flink-rc
A reservoir computing library for Apache Flink framework
License: Apache License 2.0
A more general version of the readCsvFile method that would allow reading columns directly into a List of Doubles (instead of individual Doubles in a Tuple).
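A minimal sketch of the per-line parsing such a method would need, written in plain Java and independent of Flink. The class name `CsvRowParser`, the method `parseRow`, and the choice to fail on non-numeric fields are assumptions made for illustration; they are not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of flink-rc): parses one CSV line into a list of doubles. */
final class CsvRowParser {
    private CsvRowParser() {}

    /**
     * Splits the line on the given delimiter and parses every field as a double.
     * Throws NumberFormatException if any field is empty or not numeric.
     */
    static List<Double> parseRow(String line, String delimiter) {
        List<Double> values = new ArrayList<>();
        // Pattern.quote makes delimiters such as "|" or "." match literally.
        for (String field : line.split(Pattern.quote(delimiter), -1)) {
            values.add(Double.parseDouble(field.trim()));
        }
        return values;
    }
}
```

Inside a Flink job, a function like this could presumably be applied in a map over lines read as plain text, which avoids fixing the number of columns in a Tuple type.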
How large can the matrix W be?
Spectral radius computation:
Bias column in W_in corresponding to the first coordinate of u: should it be implicit, should u be modified, or should W_in be modified (a different instantiation of its first column)?
W not symmetric in CRJ? (due to unidirectional cycle) Do we want a jump from every node instead?
-> A: Symmetric (just jumps) can be more efficient for spectral radius computation, etc. But we can keep other topologies in code as alternatives.
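For the symmetric case mentioned in the answer, the spectral radius can be estimated with power iteration. The sketch below is one possible approach, not the method used in the project; the class name `SpectralRadiusEstimator`, the fixed seed, and the dense `double[][]` representation are assumptions. For non-symmetric W (e.g. a unidirectional cycle) convergence of this estimate is not guaranteed in general.

```java
import java.util.Random;

/** Hypothetical helper (not part of flink-rc): estimates the spectral radius of a square matrix. */
final class SpectralRadiusEstimator {
    private SpectralRadiusEstimator() {}

    /**
     * Runs power iteration and returns the last ratio ||W x|| / ||x||.
     * For symmetric W this ratio converges to the spectral radius.
     */
    static double estimate(double[][] w, int iterations) {
        int n = w.length;
        // Fixed-seed random start vector: reproducible, and almost surely
        // not orthogonal to the dominant eigenvectors.
        Random random = new Random(42L);
        double[] x = new double[n];
        for (int i = 0; i < n; i++) {
            x[i] = random.nextGaussian();
        }
        scale(x, 1.0 / norm(x));

        double estimate = 0.0;
        for (int it = 0; it < iterations; it++) {
            double[] y = multiply(w, x);
            double yNorm = norm(y);
            if (yNorm == 0.0) {
                return 0.0; // the iterate collapsed to zero (e.g. zero or nilpotent matrix)
            }
            estimate = yNorm; // x has unit norm, so ||W x|| / ||x|| = ||W x||
            scale(y, 1.0 / yNorm);
            x = y;
        }
        return estimate;
    }

    private static double[] multiply(double[][] a, double[] v) {
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < v.length; j++) {
                result[i] += a[i][j] * v[j];
            }
        }
        return result;
    }

    private static double norm(double[] v) {
        double sum = 0.0;
        for (double value : v) {
            sum += value * value;
        }
        return Math.sqrt(sum);
    }

    private static void scale(double[] v, double factor) {
        for (int i = 0; i < v.length; i++) {
            v[i] *= factor;
        }
    }
}
```

Each iteration costs one matrix-vector product, so the cost grows with the square of the reservoir size for a dense W and much more slowly for sparse topologies.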
Reorder the MSE definition? It differs from the least-squares formulation we use to define the problem; it is used to check correctness after the training phase... (link)
MSE -> NRMSE (consider)?
->A: We can have NRMSE instead (more common in the ML community), but be consistent throughout the thesis.
-- Possibly consider later.
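To make the difference concrete, here is a small sketch of both metrics in plain Java. NRMSE has several definitions in use; this one divides the MSE by the variance of the targets before taking the square root, which is an assumption rather than the definition chosen for the thesis. The class and method names are hypothetical.

```java
/** Hypothetical helper (not part of flink-rc): MSE and one variant of NRMSE. */
final class ErrorMetrics {
    private ErrorMetrics() {}

    /** Mean squared error between predictions and targets of equal length. */
    static double mse(double[] predicted, double[] target) {
        if (predicted.length != target.length || target.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < target.length; i++) {
            double diff = predicted[i] - target[i];
            sum += diff * diff;
        }
        return sum / target.length;
    }

    /**
     * NRMSE = sqrt(MSE / variance of the targets).
     * The result is NaN or infinite if all targets are equal (zero variance).
     */
    static double nrmse(double[] predicted, double[] target) {
        double error = mse(predicted, target);
        double mean = 0.0;
        for (double t : target) {
            mean += t;
        }
        mean /= target.length;
        double variance = 0.0;
        for (double t : target) {
            variance += (t - mean) * (t - mean);
        }
        variance /= target.length;
        return Math.sqrt(error / variance);
    }
}
```

With this normalization, a model that always predicts the mean of the targets scores exactly 1, which makes values easier to compare across datasets than raw MSE.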
Cite the whole paragraph - after the text, comma?
->A: We can have a citation at the beginning and make it apparent that the rest is from the same source. Or cite (reference) multiple times.
Citing nonacademic sources (OPT skripta, ...)
->A: Deal with it later; they can be used for now.
We have parallelism == 1 to avoid outputting to multiple directories (possibly other problems?)
-> probably the reason why one stream is completely processed before another?
Problems with sharing parameters in a parallel environment -> Operator State is probably needed?
Even with Operator State, elements arrive in a weird way.
Stop (close) the example after no input arrives within some time range? Otherwise we can't test two approaches consecutively.
Using line search, we can have an optimal step size for each iteration.
Might be too expensive in practice.
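For a least-squares objective the optimal step along the negative gradient has a closed form, so the extra cost per iteration can be seen directly. The sketch below assumes the objective f(w) = 0.5 * ||Xw - y||^2 (no division by the number of samples) and a dense in-memory X; the class and method names are hypothetical. With gradient g = X^T (Xw - y), the minimizing step size is alpha = (g^T g) / ||X g||^2.

```java
/** Hypothetical helper (not part of flink-rc): gradient descent with exact line search. */
final class ExactLineSearch {
    private ExactLineSearch() {}

    /** Gradient of 0.5 * ||Xw - y||^2, i.e. X^T (Xw - y). */
    static double[] gradient(double[][] x, double[] y, double[] w) {
        double[] residual = multiply(x, w);
        for (int i = 0; i < residual.length; i++) {
            residual[i] -= y[i];
        }
        double[] g = new double[w.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < w.length; j++) {
                g[j] += x[i][j] * residual[i];
            }
        }
        return g;
    }

    /** Step size minimizing the objective along -g: (g^T g) / ||X g||^2. */
    static double optimalStep(double[][] x, double[] g) {
        double[] xg = multiply(x, g); // the extra matrix-vector product per iteration
        double denominator = dot(xg, xg);
        return denominator == 0.0 ? 0.0 : dot(g, g) / denominator;
    }

    /** One gradient descent step using the exact step size. */
    static double[] step(double[][] x, double[] y, double[] w) {
        double[] g = gradient(x, y, w);
        double alpha = optimalStep(x, g);
        double[] next = new double[w.length];
        for (int j = 0; j < w.length; j++) {
            next[j] = w[j] - alpha * g[j];
        }
        return next;
    }

    private static double[] multiply(double[][] a, double[] v) {
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < v.length; j++) {
                result[i] += a[i][j] * v[j];
            }
        }
        return result;
    }

    private static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

The step size needs a second pass over the data (the product X g) in every iteration, which is where the expense noted above comes from.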
Should the whole project be public all the time?
->A: Private
Type of License of the Project (Apache License 2.0 - open source?)?
->A: Choose something open source (later)
How to integrate new LM methods into Flink?
See notes for the basic formula.
Testing data can be same as for LR (Glaciers).
Random matrices
Flink State (?)
What should be included?
Stream vector multiplications... (?)
Constant column (bias) in W_in (?)
Normalize the (input) data (-1;1)
Indexed I/O of reservoir
Concatenation phase, and general methods connecting the reservoir to readout.
Add EJML impl. for comparison?
Initialize non-serializable fields through open().
How to test the correctness?
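For the input normalization item in the list above, a minimal min-max scaling sketch in plain Java. It assumes the minimum and maximum of the data are known up front (as in a batch setting) and maps onto the closed interval [-1, 1]; in a streaming setting the bounds would have to be supplied or tracked some other way. The class name is hypothetical.

```java
/** Hypothetical helper (not part of flink-rc): rescales values linearly to [-1, 1]. */
final class MinMaxScaler {
    private MinMaxScaler() {}

    /** Returns a new array with x mapped to 2 * (x - min) / (max - min) - 1. */
    static double[] scale(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] scaled = new double[values.length];
        double range = max - min;
        for (int i = 0; i < values.length; i++) {
            // Constant input has no range; map it to the middle of the interval.
            scaled[i] = range == 0.0 ? 0.0 : 2.0 * (values[i] - min) / range - 1.0;
        }
        return scaled;
    }
}
```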
A class that provides easy creation of different individual example apps.
It should run as much common code as possible. Execute the Flink job.
Could be quite useful, as it should ease creating and modifying examples and provide a clearer solution that is easier to grasp.
For the whole RC (different from the existing LM examples), using DataStream
Another one using DataSet for training (third could be DataSet only)
Reading data (CSV) -> How to deal with different datatypes?
Make instance for each example. Ideally using DataStream for everything.
Create some Unit tests or simple examples (integration tests)...
Test for different step size (regularization factor) parameter values
Take the whole dataset for LM training for graphs
Implement the Weather App example (linear, quadratic predictions; compare output with R (MSE))
Implement Glacier Meltdown example
(same as Weather App (no need to compare with R) & plot the predictions (Matlab))
An app that processes the data online - in real time
Unit testing of individual methods (fit, predict)
Testing
GD training without dividing by the number of samples? (impossible for DataStreams)
Multiple options to implement the creation of Linear Model (Linear Regression training):
Computing MSE:
Keyed vs Non-Keyed Stream:
Could PINV be done by using DataSet or DataStream collection?
Inputlength parameter necessary (LinearRegressionPrimitive)?
Criteria for stopping the iterative method (# of steps, convergence threshold) (?)
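The two stopping criteria named in the last item (a step limit and a convergence threshold) can be combined in one loop. The sketch below is generic and not tied to the project's gradient descent code; `IterationControl`, `Result`, and the use of the largest coordinate change as the convergence measure are assumptions.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper (not part of flink-rc): runs an update rule until a stopping criterion is met. */
final class IterationControl {
    private IterationControl() {}

    /** Final point of an iterative run and the number of update steps performed. */
    static final class Result {
        final double[] point;
        final int steps;

        Result(double[] point, int steps) {
            this.point = point;
            this.steps = steps;
        }
    }

    /**
     * Applies the update repeatedly and stops when either maxSteps updates were made
     * or the largest coordinate change in one update drops below the tolerance.
     */
    static Result iterate(double[] start, UnaryOperator<double[]> update,
                          int maxSteps, double tolerance) {
        double[] current = start.clone();
        int steps = 0;
        while (steps < maxSteps) {
            double[] next = update.apply(current);
            steps++;
            double change = 0.0;
            for (int i = 0; i < current.length; i++) {
                change = Math.max(change, Math.abs(next[i] - current[i]));
            }
            current = next;
            if (change < tolerance) {
                break; // converged before reaching the step limit
            }
        }
        return new Result(current, steps);
    }
}
```

The step limit guarantees termination even when the threshold is never reached, for example when the step size is too large and the iterates oscillate.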
A FileNotFoundException or similar sometimes occurs, probably due to a race condition when creating and reading the files used for predictions in an example app. We use multiple threads to start writing and predicting at the same time, but that is only for the purposes of the example. Normally we could create a serial job as well.
Implement a method for DataStream API that is analogous to readCsvFile in DataSet API?
Implementing a Linear Regression where the model is trained and improved even during the prediction/testing phase (the two phases are not separated but interconnected instead).
Optimal alpha parameters should be passed as sets of rules in a stream that broadcasts them to every processed element of the input stream.
They should probably be stored using an operator state (CheckpointedState).
One way to make methods faster might be to use Java 8 Streams instead of regular loops, etc. This should be measured, since streams are not always faster than plain loops.
Have all DataSet (and primitive array...) methods in batch directory, and possibly implement "online" LR training with streams in streaming directory (merge with Continuous LR?).
LM Tests
The desired features of the implementation:
Flink Timestamps
Flink State
Flink Checkpointing
Non-mandatory (extra)
Parallelism (>1)
Scalability - running on remote clusters?
Rework Gradient Descent to be more efficient (streams?)
Gradient Descent Learning Rate Getting Smaller With Time (e.g. step decay)
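A minimal sketch of a step decay schedule for the last item, in plain Java. The rate is multiplied by a constant factor after every fixed number of steps; the class name, parameter names, and the example values in the usage note are assumptions.

```java
/** Hypothetical helper (not part of flink-rc): step decay schedule for a learning rate. */
final class StepDecay {
    private StepDecay() {}

    /**
     * Returns initialRate * dropFactor^floor(step / stepsPerDrop) for a non-negative step.
     * Integer division performs the floor.
     */
    static double learningRate(double initialRate, double dropFactor, int stepsPerDrop, int step) {
        if (step < 0 || stepsPerDrop <= 0) {
            throw new IllegalArgumentException("step must be >= 0 and stepsPerDrop > 0");
        }
        return initialRate * Math.pow(dropFactor, step / stepsPerDrop);
    }
}
```

For example, with an initial rate of 0.1, a drop factor of 0.5, and a drop every 10 steps, steps 0 to 9 use 0.1 and steps 10 to 19 use 0.05.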
Write a JavaDoc for every class and method. Also check for commenting parts of code.
Try writing the code documentation before writing the code, as a note of what should be done.
Incorporate Flink's timestamps & watermarks mechanism. Describe it in the thesis.
Keying by timestamps?
Unrealistic to key on equality if they are too exact.
Unless we're guaranteed the same timestamp. E.g. every hour.
Otherwise we need an explicit ID to key by.
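If records are known to arrive roughly once per hour but with slightly different exact timestamps, one conceivable workaround is to key by the timestamp truncated to the hour. This is only a sketch of that idea, not something the notes above settle on; the class and method names are hypothetical, and timestamps are assumed to be epoch milliseconds.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper (not part of flink-rc): derives coarse keys from exact timestamps. */
final class TimestampKeys {
    private TimestampKeys() {}

    /** Maps an epoch-millisecond timestamp to the start of its (UTC) hour. */
    static long hourKey(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .truncatedTo(ChronoUnit.HOURS)
                .toEpochMilli();
    }
}
```

Two records within the same hour then share a key, while an explicit ID remains the safer choice when more than one record can fall into the same hour.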
Test the correctness of Gradient Descent implementation, used in the "online" LR...
The results may vary based on:
We'll have to decide what type of DataStream (or DataSet) we accept.
For the RC part...
For the LM part:
Input:
Tuple (2):
<Integer, ...> -- our records will have some sort of ID (Integer) | basic, but might be more general?
<Timestamp, ...> -- time-dependent records (or Date class)
<..., List<Double>> -- our input will be a vector of values
List<Double> -- our records will have implicit timestamps (processed by Flink) and an explicit vector of values
LabeledVector -- a vector (of double values) supporting vector operations, with a double label
Collection needs to have elements of the same type, so it doesn't make sense to use it, unless we choose a general type that all elements have in common.
(We can't simply use the Number class as it doesn't support arithmetic operations. Therefore we'd have to overload the methods for all the different subclasses.)
List<ChosenType>
where ChosenType (working name) would be our custom class that can contain (as its field) an instance of any chosen class (serves as universal container) - List for input vectors, anything for other (meta)data of each record...