thomasmbury / deep-early-warnings-pnas
Repository to accompany the publication 'Deep learning for early warning signals of tipping points', PNAS (2021)
License: Other
Hi Dr. Bury,
I stumbled across this GitHub repository and found the analysis and results you present most exciting!
Thanks for sharing this excellent research in your public repository! While, all in all, I found it nicely organized and documented, I still struggled to follow a few parts, so I thought I'd pass along some quick notes on potential stumbling blocks. As this may be a work in progress anyway, perhaps these notes can help you further improve the reproducibility of your analysis for other readers, who will no doubt be interested in following in your footsteps. Apologies if I've misunderstood or misconstrued anything in my notes below; I'd definitely welcome your clarification on these issues!
It would be nice to document the overall workflow of the project more clearly, from generation of the training data and model training through to generation of the evaluation data, the final evaluation, and figure creation.
In particular, the repository sometimes includes 'intermediate' data objects, like the data used to generate the figures, that would be regenerated in a from-scratch reproduction, while at other times it provides no way to access other core 'intermediate' objects, like the training data library (see below). (For instance, I don't see any reference to where the "best_model" created by dl_train/DL_training.py is used in any other script -- presumably that trained model is used in generating the various data/ml_preds in each of the test_models and test_empirical sub-directories, but I couldn't find where that happens...)
Consider archiving the training data. While the provided code can in principle regenerate it, doing so takes some time and may not be fully reproducible. When I attempted this, a zip of the entire output directory generated for 500,000 time series weighed in at 5.6 GB -- not tiny, but well within the range that can be archived freely on a public repository (e.g. the CERN-backed database Zenodo will archive files of up to 50 GB with a permanent DOI).
(As an aside, I had to change the file mode from rb to r in some of the text file parsers -- perhaps encoding issues due to Windows/Unix differences?) The training script refers to a hard-coded absolute path to an archive that doesn't exist. It would make more sense to align the scripts into a clean workflow: either have the training script refer directly to the relative path training_data/output as generated by the run_job.sh script there, or have run_job.sh generate the zip archive at the relative path assumed in dl_train/DL_training.py.
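As an illustration of the second option, run_job.sh could end with a packaging step like the one below. The archive path training_data/training_output.zip is my own placeholder, not the path the repository actually uses; it would need to match whatever dl_train/DL_training.py expects.

```shell
# Hypothetical packaging step for the end of run_job.sh.
# The archive path is an assumption and should be made to match
# whatever dl_train/DL_training.py actually reads.
mkdir -p training_data/output   # run_job.sh will already have populated this
python -m zipfile -c training_data/training_output.zip training_data/output
```

Using the stdlib zipfile CLI keeps the step free of any dependency on a system zip utility.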
It's also a bit unclear how batches are meant to be handled. I believe the intent is to read in all files regardless of which batch they were generated in. For groups.csv and out_labels.csv, this presumably means the per-batch files need to be stacked in the same order each time, but the code doesn't appear to handle this -- presumably it implements only single-batch logic?
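One deterministic way to stack the per-batch files might look like the sketch below. The batch1/, batch2/, ... directory layout and the function name are my assumptions, not taken from the repository.

```python
# Sketch: stack a per-batch CSV (e.g. groups.csv or out_labels.csv) across
# all batch directories in a stable, numeric order. Directory naming
# (batch1, batch2, ...) is an assumption about the generated output layout.
import csv
import re
from pathlib import Path

def stack_batches(output_dir, filename):
    """Concatenate rows of `filename` from every batch directory, in batch order."""
    batch_dirs = sorted(
        (d for d in Path(output_dir).iterdir()
         if d.is_dir() and d.name.startswith("batch")),
        # Numeric sort so batch2 comes before batch10 (lexicographic would not)
        key=lambda d: int(re.sub(r"\D", "", d.name) or 0),
    )
    rows = []
    for d in batch_dirs:
        with open(d / filename, newline="") as f:
            rows.extend(csv.reader(f))
    return rows
```

Sorting numerically rather than lexicographically is the key point: it guarantees the same stacking order on every run, so row indices in the stacked labels stay aligned with the corresponding time-series files.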
Perhaps more importantly, it looks like the training script provided does not use the hyperparameters reported in the paper. For instance, the code appears to use hyperparameters corresponding to 1 CNN layer and no LSTM layers, while the architecture reported for the results, if I've followed it correctly, used 1 CNN layer and 2 LSTM layers.
Lastly, although I stuck with installing dependencies from your requirements.txt, I could not get tensorflow to recognize the val_acc monitor or metric, and could only get the code to execute with that commented out (see commit diffs in my fork). Possibly this is a Keras naming change: newer TensorFlow versions expect val_accuracy rather than val_acc.
This is perhaps the most important bit, and also where I'm currently stuck, which may just be my unfamiliarity with certain aspects of the toolset.
One of the most obvious applications of this work would be to run the trained classifier on other data. I think the archive should include a copy of the trained classifier (the best_model_1_1.pkl folder) created by your dl_train scripts. Ideally, the repository would also include a python script that takes data in the required format and uses the trained classifier to report the classification probabilities -- exactly how this step is done is still entirely unclear to me, though presumably it is involved in generating the data in the various ml_preds directories. Being able to reproduce and evaluate the performance of the trained classifier on arbitrary time-series residuals is at least as important as generating the training data or training the classifier, but the provided code seems most opaque on this point.
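As a sketch of what such a script might contain: the function below prepares a residual series for a Keras classifier. The 500-point input length, z-score normalization, and left-padding are my guesses about the required input format, not details taken from the repository, and the commented lines only indicate where the saved model would be loaded and applied.

```python
# Sketch of a classification helper. The target length, normalization, and
# padding convention are assumptions about the classifier's expected input.
import numpy as np

def prepare_residuals(resids, target_len=500):
    """Normalize a residual series and shape it for a Keras CNN/LSTM classifier."""
    x = np.asarray(resids, dtype=float)
    x = (x - x.mean()) / x.std()  # z-score normalization
    if len(x) < target_len:
        # Left-pad with zeros so the most recent points sit at the end
        x = np.concatenate([np.zeros(target_len - len(x)), x])
    else:
        x = x[-target_len:]  # keep only the most recent points
    return x.reshape(1, target_len, 1)  # (batch, time, channels)

# Hypothetical usage, with the saved-model path taken from the repo layout:
# model = tf.keras.models.load_model("best_model_1_1.pkl")
# probs = model.predict(prepare_residuals(my_residual_series))
```

Even a short script along these lines, committed next to the trained model, would make the evaluation pipeline reproducible end to end.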
A few packages I needed were missing from requirements.txt (and were not automatically pulled in as upstream dependencies when running pip install -r requirements.txt):
scikit-learn
pandas
matplotlib
plotly
kaleido
These packages are probably not very sensitive to version, but pinned versions would still be good to have. Better yet, provide a comprehensive requirements.txt listing the versions of all packages installed in a fresh virtualenv that regenerates these results.
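Capturing that is a one-liner from inside the virtualenv used to produce the results:

```shell
# Record the exact versions of every installed package
python -m pip freeze > requirements.txt
```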
Additionally, it would be relatively straightforward to include an installation script for auto-07p; I have added one to my fork.