The data_versioning from traitecoevo

Minimal set of information required

repo name: wcornwell/taxonlookup, traitecoevo/baad.data
single file name: plant_lookup.csv, baad_data.zip
hook to load into R: read_csv, baad's bespoke functions
additional version hook? (not clear, used in baad)
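
A minimal sketch of how those fields could be captured in one record, written in Python purely for illustration. It is not datastorr's actual interface: the `DatasetSpec` class, its field names, and the `v0.0.0` tag are all hypothetical. The only outside fact it relies on is GitHub's standard URL pattern for release assets (`/releases/download/<tag>/<file>`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical record of the minimal information listed above."""
    repo: str                           # GitHub "owner/name"
    filename: str                       # the single file attached to each release
    read_hook: str                      # name of the function used to load the file
    version_hook: Optional[str] = None  # optional; its role is still unclear

    def release_url(self, tag: str) -> str:
        # GitHub serves release assets at /releases/download/<tag>/<file>.
        return (
            f"https://github.com/{self.repo}"
            f"/releases/download/{tag}/{self.filename}"
        )


# The two examples from the list above.
taxonlookup = DatasetSpec("wcornwell/taxonlookup", "plant_lookup.csv", "read_csv")
baad = DatasetSpec("traitecoevo/baad.data", "baad_data.zip", "bespoke")

# "v0.0.0" is a placeholder tag, not a real release of either dataset.
print(taxonlookup.release_url("v0.0.0"))
```

If this holds up, the repo name, the file name, and a release tag are enough to locate any version of a dataset, and the read hook is the only package-specific piece.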

Incorporate comments from Dan Noble

Create table matching users to tools

Basically defining the types of problems and the proposed solutions

Discussion of the discussion...

Re-write abstract to focus on data versioning

Subsequently discuss medium versus large data

This arises from @richfitz's comments in 384b57b

Throughout we use dataset and database somewhat interchangeably. Can I standardise?
My preference is to label any single product delivered via datastorr a dataset, and to reserve database for things like GenBank.

Thoughts anyone?

Another edit to figures to align with current text and terminology

concept needs a proper name

For the sake of marketing and communications, I think we should pick a consistent name for the concept we are discussing.

Lightweight versioned data (or LVD in the ms) could work, but of course there are many alternatives.

Possible journals

Assuming we write a paper, where to submit:

Scientific Data has an Article section: The ‘Article’ format can be used to present original reports on systems or techniques that clearly advance data sharing and reuse to support reproducible research. This includes research on sharing, managing and processing scientific research data. Articles describing data repositories, standards and ontologies are welcome when they include compelling demonstrations of data exchange, enrichment or knowledge generation made possible by the system or standard.
PLoS Computational Biology: Research articles must be declared as belonging to one of the following categories: General, Methods or Software. Software articles form a specific sub-category. ...Research articles specifically designated as Methods papers should describe outstanding methods of exceptional importance that have been shown, or have the promise to provide new biological insights. The method must already be widely adopted, or have the promise of wide adoption by a broad community of users. Enhancements to existing published methods will only be considered if those enhancements bring exceptional new capabilities. Methods articles and Software articles require presubmission inquiries.
Methods in Ecology & Evolution: As an application note. We have a good track record here.

Will this be a paper or a blog post?

Paragraph on DOIs

Licenses

Do we comment on licenses? Currently using MIT (taxonlookup), BSD_2 (baad.data)

Distinguish between data archiving for a paper, versus "living" dataset

Data archiving for papers is more or less solved; see 384b57b.

future directions section at the end of discussion?

Maybe we need to pull out CKAN, Fli, and Dat and put them at the end, so that we separate what is possible now from where things are going in the future?

Minor things needed for submission

Update README for datastorr

missing references

what should go here (to replace 'REFS')? https://github.com/traitecoevo/data_versioning/blob/master/ms.tex#L123

a better title

options:

Why We Need Versioned Data and an Easy Workflow for Setting It Up
Versioned data: why it's needed and how it can be achieved easily, cheaply, and now

Prior work

There's a lot of prior work in this space, and @cboettig tells me that people have tried this approach and ended up in a mess. Do we solve this problem? Or are we working with data that will always be simple enough to not get in a quagmire? How do people tell when they need to move to something more heavyweight?

Table 4 is currently not cited

Cut?

need to write a nice short statement defining the problem

authorship reorder

I really think that Daniel should go first (I thought I added comments on this in my edits but obviously didn't). Daniel and Will have clearly done the most work and are still actively engaged in careers where authorship order matters!

Paragraph on allowing for one or more data contributors

make sure description of git is technically correct

Figure: illustrates technology stack

I am thinking something that shows where the different technologies exist

This one is too complex, but that type of thing:

Forks

As pointed out by @cboettig, think about what happens with forked data. Do forks get numbers? What happens on a merge? (technically, can gh forks handle releases?)

Datastorr needs an icon or hex sticker

Seems like a good time to assign an icon to datastorr. Or even a hex sticker.

In #11 I put in a placeholder.

From memory, storr is named after the Old Man of Storr, so the icon could be an outline of a mountain range?

Apparently there's an R package for creating hex stickers, suitably named hexSticker.

@richfitz, what are your thoughts on this?
