In this Google Summer of Code 2020 project, I developed a web-based system that helps users compare new time-series analysis algorithms to a collection of over 7700 existing algorithms, implemented in the hctsa package. The website takes a new time-series analysis algorithm (as Python code) from the user and computes its outputs across a dataset of 1000 diverse time series. It then analyzes the correlation between the output of the user's algorithm and each feature in the hctsa feature library, and presents a range of intuitive visualizations that show the best-matching features. This output helps the user understand connections between their method and the existing interdisciplinary time-series analysis literature, and therefore assess whether their algorithm genuinely advances that literature.
Here is an example of the website functionality I developed from scratch in this GSoC project:
Because the project had to be developed from scratch, I broke the development process down into three phases:
- First phase - Backend / logic development

  I developed a series of functions to enable successful execution of the user's code, and to perform systematic comparison of its output to that of existing algorithms:
  - Read the user's code as a string to check for malicious code before execution.
  - Pass a diverse time-series dataset through the user's function to generate a long feature vector.
  - Compute the Spearman correlation coefficient between the computed feature vector and every individual hctsa feature, then sort and store all of the relevant information (feature name, keywords, p-value, correlation coefficient).
  - Structure the results for rendering in a dynamic table and interactive plots.
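The comparison step above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the function names (`average_ranks`, `spearman`, `rank_matches`), the dictionary layout of the feature library, and the choice to sort by absolute correlation are all assumptions made for the example. It also omits the p-value; in practice a library routine such as `scipy.stats.spearmanr` returns both the coefficient and the p-value.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rank_matches(user_vector, library, top_n=12):
    """Rank library features by |Spearman rho| against the user's outputs.

    `library` is a hypothetical dict mapping feature name -> list of outputs,
    one value per time series, in the same order as `user_vector`.
    """
    results = []
    for name, outputs in library.items():
        rho = spearman(user_vector, outputs)
        if rho == rho:  # NaN != NaN, so this skips undefined correlations
            results.append((name, rho))
    # Assumption: strong negative matches are as interesting as positive ones.
    results.sort(key=lambda item: abs(item[1]), reverse=True)
    return results[:top_n]
```

Calling `rank_matches(user_vector, library)` returns a list of `(feature name, rho)` pairs, which could then be joined with keywords and p-values before being passed to the results table.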
- Second phase - Front-end development

  In this phase, I focused on the front end that the user interacts with. I implemented a range of functionality, including:
  - Pages for the website, including 'Home', 'How-it-works', 'Contact', 'Preloader', 'Result', 'Syntax error', 'Timeout Error', and '404 Not found'.
  - An interactive results table (functionality shown in the gif below), which allows users to:
  - Visualization of the top 12 results as interactive scatter plots (as visualized in the gif below), which enables users to:
  - Visualization of pairwise relationships between each of the top 12 matches as a correlation heatmap, reordered using linkage clustering.
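As a rough illustration of how a correlation heatmap can be reordered with linkage clustering, the sketch below performs average-linkage agglomerative clustering in plain Python and returns the resulting leaf order. The function names, the choice of average linkage, and the distance `1 - |correlation|` are assumptions for this example; the text does not say which linkage method or distance the website uses. With SciPy available, `scipy.cluster.hierarchy.linkage` together with `leaves_list` would be the more usual route.

```python
def cluster_order(corr):
    """Return a row/column order from average-linkage clustering.

    `corr` is a square, symmetric list of lists of correlation coefficients.
    The distance between two features is taken as 1 - |correlation|
    (an assumption made for this sketch).
    """
    n = len(corr)
    dist = [[1.0 - abs(corr[i][j]) for j in range(n)] for i in range(n)]
    # Each cluster is an ordered list of leaf (feature) indices.
    clusters = [[i] for i in range(n)]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters.
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average_distance(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = clusters[p] + clusters[q]
        # Delete q first (q > p) so that index p stays valid.
        del clusters[q]
        clusters[p] = merged
    return clusters[0]


def reorder(corr, order):
    """Apply the same permutation to the rows and columns of the matrix."""
    return [[corr[i][j] for j in order] for i in order]
```

Passing `reorder(corr, cluster_order(corr))` to a plotting routine places strongly correlated features next to each other, so that blocks of related features become visible along the diagonal.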
- Third phase - Running the user's code securely and with error handling

  This was one of the major challenges, as executing custom user code on a server could compromise the system. To run the user's code safely, we:
  - Used RestrictedPython to run the user's code in a restricted environment.
  - Allowed the user to import only specific modules that are relevant to scientific data analysis, thereby disabling functionality related to accessing or modifying the system.
  - Restricted built-in functions such as exec or eval that could be used to harm the system.
  - Added a timeout limit so that the system is protected from algorithms that fall into an infinite loop.
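The ideas in this phase (an import allow-list, blocked built-ins, and a timeout) can be illustrated with a small standard-library sketch. This is not the RestrictedPython setup the project uses, and it is not a real sandbox: it only screens the source with `ast` and runs it in a separate process with a time limit. The allow-list, the block-list, and the function names `screen_source` and `run_with_timeout` are hypothetical.

```python
import ast
import subprocess
import sys

# Hypothetical lists; the project's actual allow-list is not given in the text.
ALLOWED_MODULES = {"math", "statistics", "numpy", "scipy"}
BLOCKED_NAMES = {"exec", "eval", "open", "compile", "__import__"}


def screen_source(source):
    """Return a list of problems found in the user's source (empty if none)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have module=None and are rejected as "".
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_MODULES:
                problems.append(f"import not allowed: {module}")
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            problems.append(f"name not allowed: {node.id}")
    return problems


def run_with_timeout(source, seconds=5.0):
    """Screen the source, then run it in a child process with a time limit.

    Returns a (status, output) pair where status is one of
    "rejected", "timeout", "error", or "ok".
    """
    problems = screen_source(source)
    if problems:
        return "rejected", "; ".join(problems)
    try:
        done = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=seconds,  # the child is killed if it runs longer
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    return ("ok" if done.returncode == 0 else "error"), done.stdout
```

The three non-"ok" statuses map naturally onto the website's 'Syntax error' and 'Timeout Error' pages. Name-based screening like this is easy to bypass, which is one reason a dedicated tool such as RestrictedPython is a better fit for the real system.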
These are the weekly reports that I submitted to INCF during the GSoC period:
Although all the requirements of this project as outlined in the GSoC proposal have been completed, this project represents the important initial steps in the full development of CompEngine-Features. After the official GSoC period, I plan to contribute to this further development by:
- Adding an explore mode in which the user can compare already existing features.
- Adding a nested results table, so that clicking on any result leads to another results table of similar features.
- Implementing additional visualizations, including a network visualization.