TEAM2018

This is the repo for the 2018 TEAM sprint

This year's topic will be:

our projects

... for the broadest definition of 'projects'. This includes our regular projects, eStep, Flagship, RSD, EU projects, etc.

Due to the fast growth of the eScience Center over the last year, it is no longer practical to jointly work on a single project in the team sprint. That's why we have decided to broaden the scope and go for several projects instead. We'll still schedule everything together in a sprint week (with standups, presentations, etc.) to keep the team building aspect. It's important to get to know your colleagues, especially since we are growing so fast.

Do you have an idea for a sprint?

Do you need the expertise of others for a week? Is your project not moving fast enough? Do you need to adapt an existing tool to a new project? This is your opportunity to submit an idea for a sprint!

If you have an idea, please add it to the ideas folder here:

https://github.com/NLeSC/TEAM2018/tree/master/ideas

There is a template.md you can start from.

Sprint ideas can be about any of the projects we do, provided they:

  • have a well-defined goal.
  • clearly state what expertise is needed.
  • contain enough work for 3-5 people for 4 days.
  • specify which project they relate to (this can also be eStep, Flagship, EU, etc.).

Dates

The sprint dates for this year are:

  • 25-28 June
  • 24-27 September
  • 26-29 November

Which project are we writing these hours on?

The most frequently asked question during team sprints is: on which project do I write the hours? For this year the rules are pretty clear: the hours go to the project that is the topic of your sprint (of course there will always be exceptions, such as multi-project sprints, eStep sprints, etc.).

This does mean that it is good to involve your coordinator (and maybe the PI) when you write a sprint proposal.

Since we expect the sprints to have well-defined goals, you are basically doing a month of work in four days' time, with added expertise you may not have yourself. So it should be pretty easy to convince everyone of the benefits ;-)

The first sprint

The topics for the first sprint are selected and can be found here:

https://github.com/NLeSC/TEAM2018/blob/master/june/overview.md

The second sprint

The schedule for the second sprint can be found here:

https://github.com/NLeSC/TEAM2018/blob/master/september/schedule.md

The third sprint

The schedule for the third sprint can be found here:

https://github.com/NLeSC/TEAM2018/blob/master/november/schedule.md

team2018's People

Contributors

jmaassen, romulogoncalves, a3nne, c-martinez, jspaaks, ipelupessy, arnikz, bpmweel, ridderl, lourensveen, maartenvm, nielsdrost, sverhoeven


team2018's Issues

NLeSC social media impact analysis

Analyze the impact of the Netherlands eScience Center on Twitter. Create data visualizations showing tweets, retweets, mentions and hashtags over time. Link the data to relevant events.
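As a starting point, here is a minimal sketch of such a visualization with pandas and matplotlib; the CSV file name and its column names are hypothetical placeholders for whatever the Twitter export actually contains:

```python
# Minimal sketch: plot weekly counts of tweets, retweets and mentions over time.
# The file "nlesc_tweets.csv" and its columns "created_at", "is_retweet" and
# "mentions_nlesc" are hypothetical placeholders for the real Twitter export.
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("nlesc_tweets.csv", parse_dates=["created_at"])
tweets = tweets.set_index("created_at").sort_index()

weekly = pd.DataFrame({
    "tweets": tweets.resample("W").size(),
    "retweets": tweets["is_retweet"].resample("W").sum(),
    "mentions": tweets["mentions_nlesc"].resample("W").sum(),
})

weekly.plot(title="NLeSC Twitter activity per week")
plt.ylabel("count")
plt.savefig("twitter_activity.png")
```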

Case Law

We need a demo of the CaseLaw project, together with a short description of what the demo will show.

@dafnevk

OMUSE wrapper for DALES

In the Cloud-resolving modeling project we are planning to write a software paper on the OMUSE interface for the Dutch Atmospheric Large Eddy Simulation (DALES). This MPI-parallel Fortran code is exposed as a Python object through this interface and can be manipulated programmatically and dynamically, e.g. from within a Jupyter Notebook. This should make setting up test cases for the program much easier and enable the application of external forcings on the system. It would be nice to demonstrate this value by re-creating dynamically forced test cases, such as the cold air outbreak, from within a Python script.
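Purely as an illustration of what "manipulating the model from a Python script" could look like; the import path and the parameter/forcing method names below are assumptions, not the actual OMUSE/DALES API:

```python
# Illustrative sketch only: drive a DALES run through an OMUSE-style interface
# and apply an external forcing while the model runs. The module path and the
# set_tendency call are assumed for illustration; evolve_model and the `|`
# units notation follow the general AMUSE/OMUSE conventions.
from omuse.community.dales.interface import Dales  # assumed module path
from omuse.units import units                      # assumed units module

dales = Dales(number_of_workers=4)       # start the MPI-parallel Fortran code

time = 0.0 | units.s
end_time = 6.0 | units.hour
while time < end_time:
    # hypothetical call applying a large-scale temperature tendency (forcing)
    dales.set_tendency("temperature", -1.5e-5 | units.K / units.s)
    time += 10.0 | units.minute
    dales.evolve_model(time)             # advance the model to the new time

dales.stop()
```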

EYRA benchmark platform MVP

The Nijmegen Diagnostic Image Analysis Group has created the grand-challenge.org website for hosting benchmark challenges in the medical imaging domain. We want to extend this platform to become a multi/cross-domain benchmark challenge platform. In a previous sprint, a demo version of the grand-challenge.org site was deployed on an HPC cloud server. During this sprint, we will update this instance to the latest version and start extending it to suit our needs. We will implement a full REST API, add community-related functionality, and start working on a new React-based user interface. If time permits, we will implement a demo challenge as well.
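grand-challenge.org is a Django application, so the REST API would plausibly be built with Django REST Framework. A hedged sketch of one possible endpoint follows; the `Challenge` model, its import path and its fields are placeholders, not the real grand-challenge code:

```python
# Hypothetical sketch of a read-only REST endpoint with Django REST Framework.
# The Challenge model, its import path and its field names are placeholders.
from rest_framework import routers, serializers, viewsets

from grandchallenge.challenges.models import Challenge  # assumed import path


class ChallengeSerializer(serializers.ModelSerializer):
    class Meta:
        model = Challenge
        fields = ["id", "title", "description", "creator"]  # assumed fields


class ChallengeViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Challenge.objects.all()
    serializer_class = ChallengeSerializer


router = routers.DefaultRouter()
router.register("challenges", ChallengeViewSet)
# the project's urls.py would then include router.urls under e.g. /api/
```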

3De-e-Chem

3De-e-Chem demo

For the project http://3d-e-chem.github.io/ multiple KNIME workflows and nodes were written. They are described in two papers: http://dx.doi.org/10.1021/acs.jcim.6b00686 and http://dx.doi.org/10.1002/cmdc.201700754.

A Vagrant virtual machine (https://3d-e-chem.github.io/3D-e-Chem-VM/) has been made with KNIME, the workflows and nodes installed in it.

For the eScience symposium 2017 we made a couple of screencasts.
For a demo I would like to use the screencasts and write a storyboard that will lead the presenter through the virtual machine, opening and running several workflows.

AMUSE

@ipelupessy it seems there was a suggestion to work on AMUSE during the November sprint; could you provide us with some more information?

UncertaintyViz

Can someone give details about this demo? To whom should we assign it?

TICCLAT

The TICCLAT project is about extending TICCL, software that does OCR post-correction and/or spelling correction and/or word normalization based on the word forms it sees in the corpus. In this project we want to run a number of experiments to evaluate the performance of different configurations of TICCL. The sprint will be about setting up the pipeline/infrastructure to run these experiments. We will focus on the task of OCR post-correction and have a dataset available. Hopefully, we'll be able to run the baseline experiment by the end of the sprint.

Together with @egpbos
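A rough sketch of how such a configuration sweep could be set up; the `ticcl-pipeline` command, its flags, the file names and the scoring function are placeholders, since the real invocation depends on how TICCL is installed:

```python
# Sketch of a parameter sweep over TICCL configurations for OCR post-correction.
# The command name, its flags, the input/gold file names and the scoring
# function are placeholders for illustration only.
import csv
import itertools
import subprocess


def word_accuracy(candidate_file, gold_file):
    """Fraction of whitespace-separated tokens that match the gold standard."""
    with open(candidate_file) as c, open(gold_file) as g:
        cand, gold = c.read().split(), g.read().split()
    matches = sum(a == b for a, b in zip(cand, gold))
    return matches / max(len(gold), 1)


ld_values = [1, 2, 3]       # hypothetical Levenshtein-distance settings
freq_cutoffs = [5, 10]      # hypothetical corpus-frequency cutoffs

results = []
for ld, cutoff in itertools.product(ld_values, freq_cutoffs):
    out_file = f"corrected_ld{ld}_f{cutoff}.txt"
    subprocess.run(
        ["ticcl-pipeline", "--ld", str(ld), "--freq-cutoff", str(cutoff),
         "--input", "ocr_corpus.txt", "--output", out_file],
        check=True,
    )  # placeholder command; substitute the actual TICCL invocation here
    results.append({"ld": ld, "cutoff": cutoff,
                    "accuracy": word_accuracy(out_file, "gold_standard.txt")})

with open("experiment_results.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["ld", "cutoff", "accuracy"])
    writer.writeheader()
    writer.writerows(results)
```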

Paper: Exascale literature study

@stijnh Jason has suggested that you will be working on a paper during the sprint. Could you give us a bit more information about it: a short title (for the issue subject) and what it will be about?

GGIR: An R package for multi-day high resolution accelerometer data analysis

Title:
GGIR: An R package for multi-day high resolution accelerometer data analysis

Abstract:
The R package GGIR converts multi-day high-resolution raw data from wearable movement sensors into insightful reports for researchers investigating human daily physical activity and sleep. The package includes a range of literature-supported methods to process, clean and analyse the data, and provides day-by-day as well as weekly estimates of physical activity and sleep parameters. In addition to the separate functions for the different steps, the package also comes with a shell function that enables the user to process a set of input files and produce CSV summary reports with a single function call, which is ideal for users less proficient in R.

Editor:
Me (Vincent)

Relation with NLeSC:
Substantial parts of the code were developed as part of projects we did with the University of Exeter and University College London.

Sprint objective:
I recently drafted this paper together with three domain scientists. The text itself is already fairly mature, but I could use some help with:

  1. Create a (short) demonstration video of how the software works; for example, an intro with a demo of how data is collected, followed by a screen-capture summary of how to work with the software. Such a video could be a nice special feature of the paper and an extension to the existing documentation materials.
  2. Brainstorm about how to best profile the software and present the profiling results.
  3. I will bring an example movement sensor, so that one or two team members can record and analyse their own movement and sleep during the sprint week.
  4. General improvements to the text, but this is not a major concern.
  5. Get the paper ready for submission to SoftwareX, and circulate it for a final round of feedback.

Number of engineers needed:
My estimate is that this work can be done with a fairly small team of engineers (1 or 2 in addition to myself)

Crowd simulation + Monte Carlo methods

Goal: In order to complete an existing manuscript, we need to validate the method by simulation. Sonja + 1 or 2 people with a background in analytics/statistics would be enough.

Background: We have data with estimates of concert visitors' positions, obtained using Wi-Fi technology and their smartphones in the (then) Amsterdam Arena. The data come with a lot of errors and uncertainties. We have proposed a method for estimating the crowd density from these data. The advantage of this method over related work is that the estimates become more accurate as the crowd size increases. This has been shown theoretically as a proof of concept, but we do not have enough video data to confirm it experimentally. Therefore, the idea is to validate the method with simulations.

What needs to be done (a toy simulation sketch follows the list):

  1. Use a state-of-the-art crowd simulator to simulate a concert crowd at various increasing crowd densities.
  2. Use the actual Wi-Fi location-estimation data to model the probability distributions of the errors in the estimates.
  3. Draw samples from those distributions to introduce errors/uncertainties into the simulated data over time.
  4. Apply the proposed method to estimate the crowd density.
  5. Compare the crowd density obtained with the method to the density given by step 1.
  6. Check whether the relative difference in step 5 becomes smaller as the crowd size increases. If it does (it should!), we have completed the paper :).
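To make step 6 concrete, here is a toy Monte Carlo version of this loop, with a uniform synthetic crowd standing in for the simulator output, Gaussian noise standing in for the modelled Wi-Fi errors, and a simple counting estimator standing in for the proposed method (all of these are simplifications of the real setup):

```python
# Toy validation loop: simulate a crowd, perturb positions with Wi-Fi-like
# errors, estimate the density in a central cell from the noisy positions,
# and check whether the relative error shrinks as the crowd grows.
# All numbers and the estimator itself are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(42)
error_sigma = 5.0        # std. dev. of the Wi-Fi position error in m (assumed)
cell_area = 20.0 * 20.0  # central 20 m x 20 m evaluation cell

for n_people in [1_000, 10_000, 100_000]:
    # Step 1: "simulate" a crowd uniformly on a 100 m x 100 m floor.
    positions = rng.uniform(0.0, 100.0, size=(n_people, 2))

    # Steps 2-3: perturb the positions with errors drawn from a noise model.
    noisy = positions + rng.normal(0.0, error_sigma, size=positions.shape)

    # Step 4: estimate the density in the central cell from the noisy data
    # (a stand-in for the proposed density-estimation method).
    in_cell = np.all((noisy >= 40.0) & (noisy <= 60.0), axis=1)
    estimated_density = in_cell.sum() / cell_area

    # Steps 5-6: compare with the true density in that cell.
    truly_in_cell = np.all((positions >= 40.0) & (positions <= 60.0), axis=1)
    true_density = truly_in_cell.sum() / cell_area
    rel_error = abs(estimated_density - true_density) / true_density
    print(f"n = {n_people:>7d}   relative error = {rel_error:.3f}")
```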

Allelic Variant Explorer

Allelic Variant Explorer demo

The Allelic Variation Explorer (AVE) is a web application to visualize (clustered) single-nucleotide variants across genomes.

There is a Docker image with the application and a sample dataset at https://github.com/nlesc-ave/ave-demo

This Docker image shows that the application works; to create a proper demo, a scientific storyboard has to be written.

The story I have in mind is to take a gene which encodes some visual characteristic (color, shape, stem size, etc.) of a tomato and show that different tomato strains look different. Show the visual differences as pictures, combined with the clustering of the genomes in the explorer.

This would require some literature searching, and if the gene is not included in the sample dataset, a new dataset must be constructed.
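To give a feel for the clustering part, here is a toy example that clusters a handful of genomes by their variant profiles with SciPy; the genome names and the variant matrix below are invented:

```python
# Toy illustration: hierarchically cluster genomes by their single-nucleotide
# variant profiles. Rows are genomes, columns are variant positions
# (1 = alternate allele present). The data are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

genomes = ["strain-A", "strain-B", "strain-C", "strain-D"]
variants = np.array([
    [0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1],
])

distances = pdist(variants, metric="hamming")   # pairwise variant distances
tree = linkage(distances, method="average")     # hierarchical clustering
clusters = fcluster(tree, t=2, criterion="maxclust")

for genome, cluster in zip(genomes, clusters):
    print(f"{genome}: cluster {cluster}")
```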

ReciPy

@jvdzwaan could you elaborate a bit on what ReciPy can do and what we could do during the Sprint?
Who else has worked with ReciPy?
What is the best label for it: Software dev or Soft/Meth Paper?

Paper 3

@ridderl just for the record, could you give a title and a two-line summary of the paper?

SPOT demo

Sprint Name

SPOT-If-I...

Team leader

Faruk

Target project

IDARK / SPOT

Expertise required

  • Ability to use a web browser, keyboard and mouse

Size of team

3 - 5

Description

SPOT is a generic visual data analytics tool for multi-dimensional datasets. Although it was primarily developed for the High Energy Physics project IDARK, it is generic software that can be used in other domains where the dataset is complex. Currently, we have a demo using the famous Titanic dataset (see http://spot.esciencecenter.nl).

Goals

  • The main goal will be to focus on finding a nice (scientific) use case and an interesting dataset for a demo
  • The demo will be used in external presentations and in the SPOT workshops we are planning to organize once the materials are ready
  • Update the demo web site
  • Dockerize the demo
  • Identify missing features to be used in certain domains

Fair eWaterCycle I

The outcome of the EOSCPilot Hydro project could also lead to a paper (and a demo)

The main point is how to make an entire scientific software pipeline FAIR.

Includes Cylc, CWL, Docker, Singularity, OneData, and Notebooks.

Main author tbd.

EOSCPfL

The outcome of the EOSC Pilot for LOFAR project could also lead to a paper (and a demo).

The goal of this project is to unlock the LOFAR Long Term Archive (LTA). It contains more than 28 PB of LOFAR observations stored as visibility datasets, with almost zero scientific output so far. Almost all astronomical science starts with sky images, so these datasets have to be calibrated and imaged. This is very labour intensive: there are a lot of steps in processing uncalibrated visibility datasets into a sky image that can be used for publication. That is a main reason why the LTA is hardly used, and that is a waste of taxpayers' money. We want to bridge this gap by automating the processing and taking care of 70% of the work of the astronomer. By selecting an observation from a web portal and starting the processing in just a few mouse clicks, a reasonably good sky image will be produced. For an astronomer this should be enough to decide whether it contains interesting science. In that case he/she can fine-tune the processing to make a close-to-perfect sky image.

In the last sprint, it was shown that we could select observations directly from the archive and start processing them to coarsely calibrated compressed datasets with just a few mouse clicks.

However, it turned out that this did not include "staging" i.e. copying the observation from the LTA tapes to disk. Also, it did not include imaging the coarsely calibrated compressed datasets. These steps have to be added.

We want to add these steps, show that we can bridge the gap and unlock the LTA. This would make the LTA a much more attractive astronomical resource.

grpc-bmi-containers

A paper on how the combination of gRPC, BMI, and Docker makes for a really nice way to share geo models with others (and makes them reproducible, etc.).

Main author tbd.
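For context, the BMI part of this combination is a small set of control and data functions that every model wrapper exposes; below is a stripped-down Python rendering of that interface (the full BMI specification has more functions, e.g. for grids and variable metadata, and its get_value takes a destination array):

```python
# Stripped-down sketch of the Basic Model Interface (BMI) a geo model exposes.
# A gRPC server running inside a Docker container can forward calls like these
# to the model, so a client can drive it remotely. Simplified with respect to
# the full BMI specification.
from abc import ABC, abstractmethod

import numpy as np


class Bmi(ABC):
    @abstractmethod
    def initialize(self, config_file: str) -> None:
        """Read the model configuration and set up the initial state."""

    @abstractmethod
    def update(self) -> None:
        """Advance the model state by one time step."""

    @abstractmethod
    def finalize(self) -> None:
        """Shut the model down and release resources."""

    @abstractmethod
    def get_current_time(self) -> float:
        """Return the current model time."""

    @abstractmethod
    def get_value(self, name: str) -> np.ndarray:
        """Return the current values of a named model variable."""
```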

Publications from January 2014 until June 2018

  • We need a list of all publications which are listed in the project reports from January 2014 until June 2018.
  • Check if all of them are in Zotero (a sketch of a scripted cross-check follows this list).
  • Check what is in Zotero but not in this list, and whether it should be there.
  • Go after publications which are not listed in the final reports.
  • Collect publications of projects which do not have a final report, i.e., either projects for which a report was not created or projects that are still in execution.
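The Zotero cross-check mentioned above could be scripted with the pyzotero client; in the sketch below, the library ID, the API key and the `report_publications.csv` file (with a `doi` column) are placeholders:

```python
# Sketch: compare publication DOIs collected from the project reports against
# what is already in the Zotero library. The library ID, API key and the CSV
# of report publications are placeholders.
import csv

from pyzotero import zotero

zot = zotero.Zotero("GROUP_LIBRARY_ID", "group", "API_KEY")
zotero_dois = {
    item["data"]["DOI"].lower()
    for item in zot.everything(zot.top())
    if item["data"].get("DOI")
}

with open("report_publications.csv") as fh:
    report_dois = {row["doi"].lower() for row in csv.DictReader(fh) if row["doi"]}

print("In the reports but missing from Zotero:", sorted(report_dois - zotero_dois))
print("In Zotero but not in the reports:", sorted(zotero_dois - report_dois))
```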

Ecosystems network

Please update the title and give a short description of the paper. I need to have an issue open to assign people to, but I do not know the details of the work.
