Coder Social home page Coder Social logo

flink-rc's People

Contributors

h4nek avatar

Watchers

 avatar

flink-rc's Issues

Generic method for reading CSV files

A more general version of the readCsvFile method that would allow to read columns into a List of Doubles directly (instead of individual Doubles in a Tuple).

Questions for next meeting #2

  • How large can the matrix W be?
    Spectral radius computation:

    • from a certain size (N_x = 2000+), the computation never finishes (timeout) for dense matrices; works for sparse... although slow
      -> A: Not very large, around 1000s.
  • Bias column in W_in corresponding to first coordinate of u... Implicit, should u be modified, should W_in be modified (different instantiation of its first column)?

  • W not symmetric in CRJ? (due to unidirectional cycle) Do we want a jump from every node instead?
    -> A: Symmetric (just jumps) can be more efficient for spectral radius computation, etc. But we can keep other topologies in code as alternatives.

  • Reorder the MSE definition? - different from the Least Squares method we're using to define the problem. Is used to check the correctness after the training phase... (link)

  • MSE -> NRMSE (consider)?
    ->A: We can have NRMSE instead (more used in ML community), but be consistent throughout thesis.
    -- Possibly consider later.

  • Cite the whole paragraph - after the text, comma?
    ->A: We can have a citation at the beginning and make it apparent that the rest is from the same source. Or cite (reference) multiple times.

  • Citing nonacademic sources (OPT skripta, ...)
    ->A: Care about it later, can be used for now..

Questions for next meeting #3

  • u(t) part of x(t) in our case (RC)?
    -> yes, added after reservoir output is produced? ([1 u(t) x(t)])
  • Should the project be portable (relative paths)?
    ->A: Yes, (use getcwd?) relative paths...
  • Tests for the whole RC and readout only (not reservoir only)?
    ->A: Yes, no need to test "just reservoir".
  • Indexed data for reservoir probably (same as for readout)? I.e. (idx, u(t)) -> (idx, x(t)). In all cases (training, testing, predicting -> here it might not be necessary).
    ->A: Indexes should be derived from timestamps

Weather App Example Problems

  • We have parallelism == 1 to avoid outputting to multiple directories (possibly other problems?)

    -> probably the reason why one stream is completely processed before another?

  • Problems with sharing parameters in parallel environment -> Operator State is probably needed?

  • Even with Operator State, elements arrive in a weird way.

    • Sometimes the alpha vector is updated with a value from the fit() function and sometimes it is not for the whole stream (predict() using alphaInit instead)
  • Stop (close) the example after no input comes in some time range? Otherwise we can't test two approaches consecutively.

Questions for next meeting

  • Should the whole project be public all the time?
    ->A: Private

  • Type of Licence of the Project (Apache License 2.0 - open source?)?
    ->A: Choose something open source (later)

  • How to integrate new LM methods into Flink?

Implementing simple RC model

See notes for the basic formula.
Testing data can be same as for LR (Glaciers).

  • Random matrices

  • Flink State (?)

  • What should be included?

  • Stream vector multiplications... (?)

  • Constant column (bias) in W_in (?)

  • Normalize the (input) data (-1;1)

  • Indexed I/O of reservoir

  • Concatenation phase, and general methods connecting the reservoir to readout.

Add EJML impl. for comparison?

Initialize nonserializable fields through open..
How to test the correctness?

Higher-level example

A class that provides easy creation of different individual example apps.
It should run as much common code as possible. Execute the Flink job.

Can be quite useful as it should ease examples creation/modification and provide a clearer solution, easier to grasp.

  • For the whole RC (different from the existing LM examples), using DataStream

  • Another one using DataSet for training (third could be DataSet only)

  • Reading data (CSV) -> How to deal with different datatypes?

    • Read each line as String[] parse each field, possibly with custom (user-defined) parsers.
  • Make instance for each example. Ideally using DataStream for everything.

Testing

Create some Unit tests or simple examples (integration tests)...

  • Test for different step size (regularization factor) parameter values

  • Take the whole dataset for LM training for graphs

  • Implement the Weather App example (linear, quadratic predictions; compare output with R (MSE))

    • Use the record's date as a key?
  • Implement Glacier Meltdown example
    (same as Weather App (not need to compare with R) & plot the predictions (Matlab))

  • An app that processes the data online - in real time

  • Unit testing of individual methods (fit, predict)

Testing

  • RC Correctness (Examples for the whole RC -off of higher-level example)
  • Streaming ability (handling real-time data, timestamps)
  • Distributed? (running on a cluster, possibly introducing parallelism)
  • Finally All together (Real-time real examples, training e.g. on history, then predicting future)

LM Implementation

GD training without dividing by the number of samples? (impossible for DataStreams)

Multiple options to implement the creation of Linear Model (Linear Regression training):

  • Accept an array (matrix) of training data (non-Flink solution)
  • Create a DataSet from the training data (offline?)
  • Use part of a DataStream to train the model (online?)

Computing MSE:

  • For each record separately (just a squared error)
  • For the whole stream
    • Compute an estimate along the way (during training)
  • For a couple of subsequent records (mini batch)

Keyed vs Non-Keyed Stream:

Could PINV be done by using DataSet or DataStream collection?

Inputlength parameter necessary (LinearRegressionPrimitive)?

Criteria for stopping the iterative method (# of steps, convergence threshold) (?)

Questions for next meeting #4

  • Predictive vs generative model? Will we generally just be predicting based on older data (output column same as the input)? Ability to specify time delay (e.g. predictions for y(t) = u(t+7)...)?
    -> Primarily time-based predictions of the same data.. (can have more input columns...); Yes, time-ahead as a parameter for the library.
  • The (y-axis) scales for training and testing plot are different. Should we require the same?
  • How should the plot for reservoirs comparison look? individual points? (scatter) or some more complex surface/mesh area?
  • MSE on unscaled data only?
    -> Better only for scaled I/O, for interexample comparison.
  • LR not used for time-steps ahead prediction? It fits non-linear curve, when x(t) = y(t-1)...
    -> Low-priority, we don't compare RC and LR on the same settings. Just demonstrate LR functionality, which we already did..
  • Leaky integration extension?
  • Should we make an example that demonstrates distributive (cluster) execution, fault-tolerance, and other Flink features?
  • Should the final project only contain most up-to-date files, or everything..? (possibility of 2 branches - 1 full, 1 light)
  • Supervisor address? In thesis spec.; Hope the assignment is OK, without signatures? (others have as well)
  • MSEs for LR examples - keep unnormalized?

LM Examples - File I/O Exception sometimes occurs

A FileNotFoundException or similar sometimes occurs probably due to a race condition when creating and reading the files used for predictions in some example app. We use multiple threads to start writing and predicting at the same time, but it's just for the example purposes. Normally we could create a serial job as well.

Implementing Continuous (Incremental) LR

Implementing a Linear Regression where the model is trained and improved even during the prediction/testing phase (the two phases are not separated but interconnected instead).
Optimal alpha parameters should be passed as sets of rules in a stream that broadcasts them to every processed element of the input stream.
They should probably be stored using an operator state (CheckpointedState).

LM Reorganization

Have all DataSet (and primitive array...) methods in batch directory, and possibly implement "online" LR training with streams in streaming directory (merge with Continuous LR?).

LM Tests

  • Training (always a DataSet)
  • Testing
    • batch
    • streaming

Features

The desired features of the implementation:

  • Flink Timestamps

    • How should we use the timestamps, key by them w/ tolerance? What to do if we don't have them (have some default creation logic?)
    • Probably not part of the library but useful mostly for the examples.
  • Flink State

    • Operator State (CheckpointedState) vs Keyed State (requires keyed streams - what should be the key?)
  • Flink Checkpointing

    • Gives an ability to recover in case of system failure (fault tolerance).

Non-mandatory (extra)

  • Parallelism (>1)

    • We should need keyed streams to make it work.
  • Scalability - running on remote clusters?

  • Rework Gradient Descent to be more efficient (streams?)

  • Gradient Descent Learning Rate Getting Smaller With Time (e.g. steps decay)

Write JavaDoc

Write a JavaDoc for every class and method. Also check for commenting parts of code.
Try writing the code documentation before writing the code, as a note of what should be done.

Flink Timestamps

Incorporate the Flink's timestamps & watermarks mechanism. Describe it in thesis.

Keying by timestamps?
Unrealistic to key on equality if they are too exact.
Unless we're guaranteed the same timestamp. E.g. every hour.

Otherwise we need an explicit ID to key by.

Gradient Descent Tests

Test the correctness of Gradient Descent implementation, used in the "online" LR...

The results may vary based on:

  • Using Batch (one matrix/vector computation (iteration) for the whole dataset) vs Stochastic (one iteration for each sample) Gradient Descent
  • Doing multiple iterations of the whole dataset or just 1 (makes more sense for Batch GD)
  • Dividing by the number of samples (2* # of samples -- for convenience)
  • Taking the norm instead of the norm squared (this could actually be harmful, as the function then becomes less sensitive and less smooth (|x| vs x^2))
  • Setting learning rate too big (diverges) or too small (never converges close enough)
  • Choosing Alpha with the best MSE instead of the last Alpha
  • Shuffling the dataset first
  • Modifying the input dataset before computing (e.g. some graphs show an exponential dependency; when the numbers are big but with small variance (years), it can also cause problems)

Supported type of stream elements?

We'll have to decide what type of DataStream (or DataSet) we accept.

For the RC part...

For the LM part:
Input:

  • Tuple(2):
    • <Integer, ...> -- our records will have some sort of ID (Integer) | basic, but might be more general?
    • <Timestamp, ...> -- time dependent records (or Date class)
    • <..., List<Double>> -- our input will be a vector of values
  • List<Double> -- our records will have implicit timestamps (processed by Flink) and an explicit vector of values
  • LabeledVector -- A vector (of double values) supporting vector operations, with a double label

โ• Collection needs to have elements of the same type, so it doesn't make sense to use it, unless we choose a general type that all elements have in common.
(We can't simply use the Number class as it doesn't support arithmetic operations. Therefore we'd have to overload the methods for all the different subclasses.)

Generalization option

List<ChosenType>
where ChosenType (working name) would be our custom class that can contain (as its field) an instance of any chosen class (serves as universal container) - List for input vectors, anything for other (meta)data of each record...

Thesis

  • Reformat vectors to be in bold
  • Figures to have titles (e.g. for ToC)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.