h4nek / flink-rc Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 0.0 536 KB

A reservoir computing library for Apache Flink framework

License: Apache License 2.0

Java 100.00%

apache-flink java-8 linear-regression maven reservoir-computing

flink-rc's People

Contributors

Watchers

flink-rc's Issues

Generic method for reading CSV files

A more general version of the readCsvFile method that would allow to read columns into a List of Doubles directly (instead of individual Doubles in a Tuple).

Questions for next meeting #2

How large can the matrix W be?
Spectral radius computation:
- from a certain size (N_x = 2000+), the computation never finishes (timeout) for dense matrices; works for sparse... although slow
  -> A: Not very large, around 1000s.
Bias column in W_in corresponding to first coordinate of u... Implicit, should u be modified, should W_in be modified (different instantiation of its first column)?
W not symmetric in CRJ? (due to unidirectional cycle) Do we want a jump from every node instead?
-> A: Symmetric (just jumps) can be more efficient for spectral radius computation, etc. But we can keep other topologies in code as alternatives.
Reorder the MSE definition? - different from the Least Squares method we're using to define the problem. Is used to check the correctness after the training phase... (link)
MSE -> NRMSE (consider)?
->A: We can have NRMSE instead (more used in ML community), but be consistent throughout thesis.
-- Possibly consider later.
Cite the whole paragraph - after the text, comma?
->A: We can have a citation at the beginning and make it apparent that the rest is from the same source. Or cite (reference) multiple times.
Citing nonacademic sources (OPT skripta, ...)
->A: Care about it later, can be used for now..

Questions for next meeting #3

u(t) part of x(t) in our case (RC)?
-> yes, added after reservoir output is produced? ([1 u(t) x(t)])
Should the project be portable (relative paths)?
->A: Yes, (use getcwd?) relative paths...
Tests for the whole RC and readout only (not reservoir only)?
->A: Yes, no need to test "just reservoir".
Indexed data for reservoir probably (same as for readout)? I.e. (idx, u(t)) -> (idx, x(t)). In all cases (training, testing, predicting -> here it might not be necessary).
->A: Indexes should be derived from timestamps

Weather App Example Problems

We have parallelism == 1 to avoid outputting to multiple directories (possibly other problems?)

-> probably the reason why one stream is completely processed before another?
Problems with sharing parameters in parallel environment -> Operator State is probably needed?
Even with Operator State, elements arrive in a weird way.
- Sometimes the alpha vector is updated with a value from the fit() function and sometimes it is not for the whole stream (predict() using alphaInit instead)
Stop (close) the example after no input comes in some time range? Otherwise we can't test two approaches consecutively.

Gradient Descent - Optimize step size

Using line search, we can have an optimal step size for each iteration.

Might be too expensive in practice.

Questions for next meeting

Should the whole project be public all the time?
->A: Private
Type of Licence of the Project (Apache License 2.0 - open source?)?
->A: Choose something open source (later)
How to integrate new LM methods into Flink?
- Modify the DataStream class
- Have custom read methods that create the desired type of stream
- A "wrapper" class of LM utilities (as is probably done in FlinkML)
- Create custom operators for the methods (see https://ci.apache.org/projects/flink/flink-docs-release-1.1/internals/add_operator.html)
  ->A: Probably just a utility class of separate methods that accept the streams explicitly as parameters...

Implementing simple RC model

See notes for the basic formula.
Testing data can be same as for LR (Glaciers).

Random matrices
Flink State (?)
What should be included?
Stream vector multiplications... (?)
Constant column (bias) in W_in (?)
Normalize the (input) data (-1;1)
Indexed I/O of reservoir
Concatenation phase, and general methods connecting the reservoir to readout.

Add EJML impl. for comparison?

Initialize nonserializable fields through open..
How to test the correctness?

Higher-level example

A class that provides easy creation of different individual example apps.
It should run as much common code as possible. Execute the Flink job.

Can be quite useful as it should ease examples creation/modification and provide a clearer solution, easier to grasp.

For the whole RC (different from the existing LM examples), using DataStream
Another one using DataSet for training (third could be DataSet only)
Reading data (CSV) -> How to deal with different datatypes?
- Read each line as String[] parse each field, possibly with custom (user-defined) parsers.
Make instance for each example. Ideally using DataStream for everything.

Testing

Create some Unit tests or simple examples (integration tests)...

Test for different step size (regularization factor) parameter values
Take the whole dataset for LM training for graphs
Implement the Weather App example (linear, quadratic predictions; compare output with R (MSE))
- Use the record's date as a key?
Implement Glacier Meltdown example
(same as Weather App (not need to compare with R) & plot the predictions (Matlab))
An app that processes the data online - in real time
Unit testing of individual methods (fit, predict)

Testing

RC Correctness (Examples for the whole RC -off of higher-level example)
Streaming ability (handling real-time data, timestamps)
Distributed? (running on a cluster, possibly introducing parallelism)
Finally All together (Real-time real examples, training e.g. on history, then predicting future)

LM Implementation

GD training without dividing by the number of samples? (impossible for DataStreams)

Multiple options to implement the creation of Linear Model (Linear Regression training):

Accept an array (matrix) of training data (non-Flink solution)
Create a DataSet from the training data (offline?)
Use part of a DataStream to train the model (online?)

Computing MSE:

For each record separately (just a squared error)
For the whole stream
- Compute an estimate along the way (during training)
For a couple of subsequent records (mini batch)

Keyed vs Non-Keyed Stream:

Non-Keyed has a parallelism of 1
Keyed Stream can have a state
- Operator State (e.g. Broadcast State) is shared between all keys (non-keyed state) (https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/state.html#using-managed-operator-state)

Could PINV be done by using DataSet or DataStream collection?

Inputlength parameter necessary (LinearRegressionPrimitive)?

Criteria for stopping the iterative method (# of steps, convergence threshold) (?)

Questions for next meeting #4

Predictive vs generative model? Will we generally just be predicting based on older data (output column same as the input)? Ability to specify time delay (e.g. predictions for y(t) = u(t+7)...)?
-> Primarily time-based predictions of the same data.. (can have more input columns...); Yes, time-ahead as a parameter for the library.
The (y-axis) scales for training and testing plot are different. Should we require the same?
How should the plot for reservoirs comparison look? individual points? (scatter) or some more complex surface/mesh area?
MSE on unscaled data only?
-> Better only for scaled I/O, for interexample comparison.
LR not used for time-steps ahead prediction? It fits non-linear curve, when x(t) = y(t-1)...
-> Low-priority, we don't compare RC and LR on the same settings. Just demonstrate LR functionality, which we already did..
Leaky integration extension?
Should we make an example that demonstrates distributive (cluster) execution, fault-tolerance, and other Flink features?
Should the final project only contain most up-to-date files, or everything..? (possibility of 2 branches - 1 full, 1 light)
Supervisor address? In thesis spec.; Hope the assignment is OK, without signatures? (others have as well)
MSEs for LR examples - keep unnormalized?

LM Examples - File I/O Exception sometimes occurs

A FileNotFoundException or similar sometimes occurs probably due to a race condition when creating and reading the files used for predictions in some example app. We use multiple threads to start writing and predicting at the same time, but it's just for the example purposes. Normally we could create a serial job as well.

readCsvFile for DataStream

Implement a method for DataStream API that is analogical to readCsvFile in DataSet API?

Implementing Continuous (Incremental) LR

Implementing a Linear Regression where the model is trained and improved even during the prediction/testing phase (the two phases are not separated but interconnected instead).
Optimal alpha parameters should be passed as sets of rules in a stream that broadcasts them to every processed element of the input stream.
They should probably be stored using an operator state (CheckpointedState).

Faster methods using Java streams

One way to make methods faster should be by using Java 8 Streams instead of regular loops, etc.

LM Reorganization

Have all DataSet (and primitive array...) methods in batch directory, and possibly implement "online" LR training with streams in streaming directory (merge with Continuous LR?).

LM Tests

Training (always a DataSet)
Testing
- batch
- streaming

Features

The desired features of the implementation:

Flink Timestamps
- How should we use the timestamps, key by them w/ tolerance? What to do if we don't have them (have some default creation logic?)
- Probably not part of the library but useful mostly for the examples.
Flink State
- Operator State (CheckpointedState) vs Keyed State (requires keyed streams - what should be the key?)
Flink Checkpointing
- Gives an ability to recover in case of system failure (fault tolerance).

Non-mandatory (extra)

Parallelism (>1)
- We should need keyed streams to make it work.
Scalability - running on remote clusters?
Rework Gradient Descent to be more efficient (streams?)
Gradient Descent Learning Rate Getting Smaller With Time (e.g. steps decay)

Write JavaDoc

Write a JavaDoc for every class and method. Also check for commenting parts of code.
Try writing the code documentation before writing the code, as a note of what should be done.

Flink Timestamps

Incorporate the Flink's timestamps & watermarks mechanism. Describe it in thesis.

Keying by timestamps?
Unrealistic to key on equality if they are too exact.
Unless we're guaranteed the same timestamp. E.g. every hour.

Otherwise we need an explicit ID to key by.

Gradient Descent Tests

Test the correctness of Gradient Descent implementation, used in the "online" LR...

The results may vary based on:

Using Batch (one matrix/vector computation (iteration) for the whole dataset) vs Stochastic (one iteration for each sample) Gradient Descent
Doing multiple iterations of the whole dataset or just 1 (makes more sense for Batch GD)
Dividing by the number of samples (2* # of samples -- for convenience)
Taking the norm instead of the norm squared (this could actually be harmful, as the function then becomes less sensitive and less smooth (|x| vs x^2))
Setting learning rate too big (diverges) or too small (never converges close enough)
Choosing Alpha with the best MSE instead of the last Alpha
Shuffling the dataset first
Modifying the input dataset before computing (e.g. some graphs show an exponential dependency; when the numbers are big but with small variance (years), it can also cause problems)

Supported type of stream elements?

We'll have to decide what type of DataStream (or DataSet) we accept.

For the RC part...

For the LM part:
Input:

Tuple(2):
- <Integer, ...> -- our records will have some sort of ID (Integer) | basic, but might be more general?
- <Timestamp, ...> -- time dependent records (or Date class)
- <..., List<Double>> -- our input will be a vector of values
List<Double> -- our records will have implicit timestamps (processed by Flink) and an explicit vector of values
LabeledVector -- A vector (of double values) supporting vector operations, with a double label

❕ Collection needs to have elements of the same type, so it doesn't make sense to use it, unless we choose a general type that all elements have in common.
(We can't simply use the Number class as it doesn't support arithmetic operations. Therefore we'd have to overload the methods for all the different subclasses.)

Generalization option

List<ChosenType>
where ChosenType (working name) would be our custom class that can contain (as its field) an instance of any chosen class (serves as universal container) - List for input vectors, anything for other (meta)data of each record...

Thesis

Reformat vectors to be in bold
Figures to have titles (e.g. for ToC)

h4nek / flink-rc Goto Github PK

flink-rc's People

Contributors

Watchers

flink-rc's Issues

Generalization option

Recommend Projects

Recommend Topics

Recommend Org