
force11-scwg's Introduction

⚠️ This group is no longer active. If you are interested in implementing software citation, please join https://www.force11.org/group/software-citation-implementation-working-group ⚠️

FORCE11 Software Citation Working Group

Mission Statement (WIP)

The software citation working group will leverage the perspectives of a variety of existing initiatives working on software citation to produce a consolidated set of citation principles, with the goal of encouraging broad adoption of a consistent software citation policy across disciplines and venues. The working group will review existing efforts and make a set of recommendations. These recommendations will be put forward for endorsement by the organizations represented in this group and by others that play an important role in the community.

The group will produce a set of principles, illustrated with working examples, and a plan for dissemination and distribution. This group will not produce detailed specifications for implementation, although it may review and discuss possible technical solutions.

See the Joint Declaration of Data Citation Principles as an example of a similar deliverable.

The final output of the group was Smith AM, Katz DS, Niemeyer KE, FORCE11 Software Citation Working Group. (2016) Software citation principles. PeerJ Computer Science 2:e86 https://doi.org/10.7717/peerj-cs.86.

Co-chairs: Arfon Smith, Daniel S. Katz, and Kyle Niemeyer

Timeline

Phase 1 (June - July 2015)

Kick-off meeting (telecon) with the following goals:

  • Establish interest/backgrounds of working group participants.
  • Review mission statement, timeline and goals
  • Seek out additional participants (if we're missing key individuals)

Phase 2 (July - September 2015)

  • Gather materials documenting existing practices in member disciplines
  • Gather materials from workshops and other reports
  • Review materials, identifying overlaps and differences

Phase 3 (September 2015 - January 2016)

  • Drafting of Software Citation Principles (possibly in person at WSSSPE, Boulder, CO - 28/29 September)
  • Seek community feedback on draft
  • Iterate

Phase 4 (January - March 2016)

  • Complete the proposed final draft of the Software Citation Principles.
  • Seek out community endorsements for draft principles.

Phase 5 (April 2016)

  • Presentation of formal recommendations at FORCE2016

Communication plan

  • Monthly telecons
  • GitHub for documentation/iterating on content
  • Google groups for general discussion
  • FORCE11 email list for announcements

Members

If you are interested in joining the group, please:

  1. Add yourself to the list below through a pull request
  2. Join the group on FORCE11 to be added to the group mailing list and group folder
Name Affiliation Role
Alberto Accomazzi (@aaccomazzi) Harvard-Smithsonian CfA Participant
Alice Allen (@owlice) Astrophysics Source Code Library Participant
Micah Altman (@maltman) Program on Information Science, MIT Participant
Jay Jay Billings (@jayjaybillings) Oak Ridge National Laboratory Participant
Carl Boettiger (@cboettig) UC Berkeley Participant
Jed Brown (@jedbrown) CU Boulder Participant
Sou-Cheng Choi (@sctchoi) NORC at the University of Chicago and Illinois Institute of Technology Participant
Neil Chue Hong (@npch) Software Sustainability Institute Participant
Tom Crick (@tomcrick) Cardiff Metropolitan University Participant
Mercè Crosas (@mcrosas) IQSS, Harvard University Participant
Scott Edmunds (@ScottBGI) GigaScience, BGI Hong Kong Participant
Christopher Erdmann (@libcce) Harvard-Smithsonian CfA Participant
Martin Fenner (@mfenner) DataCite Participant
Darel Finkbeiner (@darelf) OSTI Participant
Ian Gent (@turingfan) University of St Andrews, recomputation.org Participant
Carole Goble (@carolegoble) The University of Manchester, Software Sustainability Institute Participant
Paul Groth (@pgroth) Elsevier Labs Participant
Melissa Haendel (@mellybelly) OHSU Participant
Stephanie Hagstrom (@sthagstrom) FORCE11 Participant
Robert Hanisch (@rjhanisch) NIST/ODI Participant
Edwin Henneken (@ehenneken) Harvard-Smithsonian CfA Participant
Ivan Herman (@iherman) W3C Participant
Konrad Hinsen (@khinsen) CNRS Participant
James Howison (@jameshowison) UTexas Participant
Michael Hucka (@mhucka) Caltech Participant
Lorraine Hwang (@ljhwang) UC Davis Participant
Thomas Ingraham (@tingraham) F1000Research Participant
Matthew B. Jones (@mbjones) NCEAS, UC Santa Barbara Participant
Catherine Jones (@cm-j0nes) Science and Technology Facilities Council Participant
Daniel S. Katz (@danielskatz) University of Illinois Co-chair
Alexander Konovalov (@alex-konovalov) University of St Andrews Participant
John Kratz (@JEK-III) California Digital Library Participant
Jennifer Lin (@jenniferlin15) Public Library of Science Participant
Frank Löffler (@knarrff) Louisiana State University Participant
Brian Matthews (@brianmatthews42) Science and Technology Facilities Council Participant
Abigail Cabunoc Mayes (@acabunoc) Mozilla Science Lab Participant
Daniel Mietchen (@Daniel-Mietchen) NIH Participant
Bill Mills (@BillMills) Mozilla Science Lab Participant
Evan Misshula (@EMisshula) CUNY Graduate Center Participant
August Muench (@augustfly) American Astronomical Society Participant
Fiona Murphy (@DrFionalm) Independent Researcher Participant
Lars Holm Nielsen (@lnielsen) CERN Participant
Kyle Niemeyer (@kyleniemeyer) Oregon State University Co-chair
Robert Peters (@rcpeters) ORCID.org Participant
Tom Pollard (@tompollard) MIT Participant
Karthik Ram (@_inundata) University of California, Berkeley Participant
Fernando Rios (@zoidy) Johns Hopkins University Participant
Ashley Sands (@ashleysa) UCLA Information Studies Participant
Soren Scott (@roomthily) Independent Researcher Participant
Frank J. Seinstra (@fjseins) Netherlands eScience Center Participant
Arfon Smith (@arfon) GitHub Co-chair
Kaitlin Thaney (@kaythaney) Mozilla Science Lab Participant
Ilian Todorov (@iliant) STFC Participant
Matt Turk (@MatthewTurk) University of Illinois Participant
Miguel de Val-Borro (@migueldvb) Princeton University Participant
Daan Van Hauwermeiren (@DaanVanHauwermeiren) Ghent University Participant
Stijn Van Hoey (@StijnVanHoey) Ghent University Participant
Belinda Weaver (@weaverbel) The University of Queensland Participant
Nic Weber (@nniiicc) University of Washington iSchool Participant
Marijane White (@marijane) OHSU Participant
Qian Zhang (@paopao74cn) University of Illinois Participant

(this list is in alphabetic order by surname; please keep it that way when making additions)


force11-scwg's Issues

what does an identifier point to?

Discuss the issue that an identifier can point to a specific version, which is mostly what we were thinking of here. There are also other valid use cases, such as identifiers that point to a collection of versions of the software, or identifiers that point to the latest version.

One of the drivers behind a collection is to be able to follow and obtain credit for a total software package.

figure 2 change

Move the stakeholder column to the far right, and make it less strict, recognizing that there are multiple stakeholder groups for many use cases.

Extend discussion of software metadata needs

The current section "Existing efforts around metadata standards" discusses several software metadata efforts, but doesn't clearly articulate that a consolidation effort is needed. Projects like CodeMeta are attempting to crosswalk software metadata specifications for interoperability, and this section should highlight the need for this while clarifying that it is out of scope for these citation principles. I am willing to draft a pull request with a short edit to this section.
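To make the crosswalking point concrete, here is a minimal sketch of the kind of descriptive metadata a CodeMeta record (codemeta.json) carries. This is my own illustration rather than anything from the draft: the field names follow CodeMeta/schema.org conventions, but every value is a placeholder.

```python
import json

# A minimal, hypothetical CodeMeta-style record; all values are placeholders.
# Because CodeMeta reuses schema.org terms, records like this can be
# crosswalked to other software metadata formats by mapping the keys.
codemeta_record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-package",                    # placeholder software name
    "version": "1.2.0",                           # version being described
    "author": [{"@type": "Person",
                "givenName": "Ada",
                "familyName": "Example"}],
    "identifier": "https://doi.org/10.0000/placeholder",  # placeholder DOI
    "codeRepository": "https://github.com/example/example-package",
    "license": "https://spdx.org/licenses/MIT",
}

print(json.dumps(codemeta_record, indent=2))
```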

Use case text

Strongly agree that the text in the Table 2 caption should be moved to Section 3.

section 6 is empty

We don't have any examples of how the principles would be applied to the use cases, which we had planned to include.

Please delete respecting software authors' requests.

This may be controversial, and indeed I know it is from recent Twitter discussions. But I feel it's important - or at least, even if it's too controversial for agreement, it's important to have the discussion.

There is text in the draft which says

"In addition, if the software authors ask that a paper should be cited, that should be respected."

I would like this sentence simply to be deleted.

I profoundly disagree with this point. I feel that software authors' opinion of whether their work should be cited is almost (but not completely) irrelevant.

Not only that, I think it's extremely important that the community understands that it is not up to software authors to demand citation. They can certainly request it, that is fine. But citation is part of the scientific process, not a request that an author can insist on. This is NOT trying to demean the importance of software citation; it's actually equating it to all other forms of citation. I cite papers because it's the right thing to do, not because somebody asked me to. Nobody ever cites the papers that authors want to have cited: if they did, then all papers ever written would cite all other papers ever written.

The only area where I think the author's wish should be viewed as relevant is when it's a 50:50 call on whether to cite a piece of software or not. That is, if I should cite the software, I should cite it irrespective of the author's wish, and - critically - if it's wrong to cite it, then I shouldn't cite it irrespective of the author's wish. If it's an edge case, then yes, it's reasonable to cite it if the author wants me to.

As an analogy, consider the very common case where a review comes back asking for 4 citations to somebody you strongly suspect to be the author of the review. This is often seen at best as an embarrassment or at worst as a form of mild scientific misconduct. In my view, software authors insisting on citation (as opposed to requesting it) is similar.

There is a long-running dispute in this area in the case of GNU Parallel, for example. The author insists on software citation for any paper that uses it, and explicitly asks people not to use it if they are not prepared to cite it. But there is no nuance: i.e. the author is not encouraging me to do so if it is right, but requiring me to do so if I use the software in a scientific paper.

Access to software: free vs commercial

The section talks about software that is “free” as well as “commercial” software. I am not sure whether this is about free as in freedom (or just gratis, i.e. freely available), since it is contrasted with commercial software, which is an unrelated distinction in general; see http://www.gnu.org/philosophy/words-to-avoid.html#Commercial

I suppose that “free” should be replaced by “gratis” and “commercial” be replaced by “non-free” in that section.

infographic for principles?

Should we create a graphic of some type that makes this more appealing to a wider audience?

Perhaps, in addition, create a few (3?) slides that people can use to talk about this.

"Software citations should permit ... access to the software itself"

Under the "Access" header, the data declaration states that:

"Data citations should facilitate access to the data themselves"

Under the same header, the software declaration states:

"Software citations should permit and facilitate access to the software itself "

The addition of "permit" suggests that software citations should also grant the user with permission to access the software. Is this intentional?

It doesn't seem like a good idea to make access a requirement for discovery, so "permit" might not be helpful in this sentence.

Related identifiers

While we often need to cite a specific version of software, e.g. a release, we also need a way to cite the software in general and to link multiple releases together. For this reason, we need more than one persistent identifier for each piece of software: a) a general one, and b) a specific one for each release.

This issue is similar to what we see with versions for data. One use case would be to link all versions of a piece of software with a Zenodo DOI together, and then also associate the stars and forks of the code repository.
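As a hedged illustration of this two-identifier pattern (my own sketch, not text from the draft), metadata for a single release could point back to a concept-level identifier using a DataCite-style related identifier; the DOIs below are placeholders.

```python
# Hypothetical metadata for one release, linking it to a concept-level
# identifier for "the software in general"; all DOIs are placeholders.
release_record = {
    "identifier": "10.0000/example.software.v1-2-0",  # this specific release
    "version": "1.2.0",
    "relatedIdentifiers": [
        {
            "relatedIdentifier": "10.0000/example.software",  # concept-level DOI
            "relatedIdentifierType": "DOI",
            "relationType": "IsVersionOf",  # DataCite-style relation type
        }
    ],
}

# The concept-level identifier can then resolve to the latest release (and list
# all releases), while each release identifier stays fixed for citation.
```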

Van de Sompel et al, 5 attributes important for software metadata records?

Not sure if this made it into the earlier list of relevant materials, but I think this article gives a good introduction to research objects:

Van de Sompel, H., Payette, S., Erickson, J., Lagoze, C., & Warner, S. (2004). Rethinking scholarly communication: Building the system that scholars deserve. D-Lib Magazine, 10. Retrieved from http://www.dlib.org/dlib/september04/vandesompel/09vandesompel.html

They mention the following processes for multiple kinds of scholarly communication (data, code, papers, etc.): registration, certification, awareness, archiving, and rewarding.

Do we want to ensure those 5 attributes are relayed within the metadata?

Confusion about Excel

It's a small point, but I find the text about Excel in 5.1 confusing. That is, in the same paragraph we are told to cite Excel, not to cite Excel, and that the two statements are consistent. I realise this is a parody of what is said, but some clarification might be helpful, even if it just means changing the example from Excel to something else in the storing-and-plotting-data example.

use case: data repository wants to link data, software, and papers in provenance trace

Domain and institutional data repositories have both data and software artifacts, and want to link these together in a provenance trace that can be cited. Sometimes the software is a separately identified artifact, but at other times software is included inside data packages, and the researcher wants to cite the combined product. See an example of a mixed data and software package (containing R code) here: https://knb.ecoinformatics.org/#view/doi:10.5063/F1Z899CZ

Granularity of the citation

One of the key issues with any citation, whether of a document, an individual, or software, is the specificity of what is being cited. In the case of publications, there is almost zero specificity most of the time.

It's very easy to cite an entire package even though only one function was used. Part of this problem is being addressed in the Python world by the duecredit project (https://github.com/duecredit/duecredit); a hedged usage sketch follows.
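For context, here is a sketch of the duecredit pattern referenced above: a library annotates the functions it wants credited, and citations accumulate only for code paths that are actually executed. The module name, DOI, and description are placeholders, and the exact API should be checked against the duecredit documentation.

```python
# Hypothetical module using duecredit-style annotations (placeholder values).
from duecredit import due, Doi

@due.dcite(Doi("10.0000/placeholder.doi"),        # placeholder DOI
           description="Core fitting routine",
           path="example_package.fitting")
def fit_model(data):
    """Callers that actually execute this function accumulate the citation."""
    return sum(data) / len(data)  # stand-in for the real computation
```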

Any citation should have the ability to specify more than just the obvious, but even the obvious would be a good starting point.

The citation/URL should therefore allow for greater specificity within a code base. In general, though, from a research perspective a provenance record of the workflow would be significantly more useful than a citation.

Principles should emphasize the need for a better way than citations to manage academic credit

The current wording risks leaving the take-home as "All our issues with citation, credit, and reproducibility of software can be adequately addressed within the current model of academic citation practices." We dismiss problems where the desire for credit conflicts with the desire to track provenance by saying "that's a problem for academic citation in general, so to the extent that citation still fulfils both roles for papers, it can do so for data as well."

I fear this misses the orders of magnitude difference between how these problems manifest in software vs how they are dealt with in papers. The quirks of citation practices which have been manageable in papers are exacerbated to a degree in which they may no longer be manageable.

For instance: citations to both software and papers can suffer from the 'wrapper problem' -- citing a review paper acknowledges the provenance of ideas but fails to allocate credit (citation count) to the originators. Likewise, citing a software client library acknowledges the provenance through its dependency on the server software system, but fails to transfer credit to it. The difference is one of scale -- a closely knit research community can self-police a glaring omission of credit if an author cites a textbook in place of a citation classic of the field. A reviewer is far less likely to be familiar with the original sources and underlying dependencies when they encounter a citation to a software wrapper around an existing algorithm or software system.

Both software and papers share a tension in citing for provenance and citing for credit, but software has this issue in spades. Provenance means fine-grained citation to particular version, credit means accumulating those citations against a single object. Thin wrappers around fundamental dependencies are commonplace. Authorship concepts are both more diverse and less governed by well-understood norms.

While we strive to offer practical guidelines that acknowledge the current incentive system of academic citation, a more modern system of assigning credit is sorely needed. It is not that academic software needs a separate system from academic papers, but that it underscores the need to overhaul the system of credit for both.


As discussed in the workshop, I'm working on a pull request along these lines, but comments & references welcome here.

Comment from Catherine Jones in chat

I have to go soon, so I wanted to make a general comment about the principles, section 2. The term "science/scientific" is used a lot through this section. Here in the UK this term has a meaning that restricts it to the physical sciences and excludes the social sciences, arts & humanities. The term "research" tends to be used in these circumstances. I believe that these principles apply to all domains, so maybe the wording should be reconsidered. I appreciate this may be a UK-only cultural issue. I was on a data policy committee where this was a very touchy issue.

Is the question of when to cite an entirely community-specific decision?

The current document appears to declare the question of what to cite/when to cite completely out of scope:

"The software citation principles do not define what software should be cited, but rather, how software should be cited."

The result appears to be to defer analysis to the differing scholarly communities. The declaration of "importance" suggests that software should be cited more often, but again seems to imply that practices will be community specific.

Given the focus of the document on reproducibility, might it be possible to specify some necessary conditions (not sufficient) for citation in the principles, e.g:

"When a software is used directly in the process of establishing a published claim, that software should be cited. "

Additional material and community actions

sciencecodemanifesto.org sets out principles including citation.

Sect. 4.1: there have been numerous workshops on reproducibility that have included software and data citation; the latest in the UK was the Alan Turing Institute Symposium on Reproducibility last week.

Sect. 4.2: the NIH report was also put out for public comment.

Sect. 4.3: other efforts on metadata standards include:

Sect. 5.6: we should mention RRID, which is a FORCE11 activity.

Additional considerations for contribution representation

Consider more advanced options to represent the relationship between people and various software contributions, e.g. in section 4.3.

  1. Models such as those evolving in openRIF (formerly VIVO-ISF) https://github.com/openrif, which represent contribution types and roles towards any contribution type, and are not dependent on authoring of journals (though this is not the only source of such models)
  2. Transitivity of contribution in packages that rely on one another and across versions. I quite like how this has been captured for various versions of data in the HCLS dataset description - you can have a summary level representation that includes all contributions to date, or a version level distribution that has only contributions to that version.
  3. Include reference to how software should be represented in CVs and biosketches to aid evaluation and review.

change figure 2

In Figure 2, change the "Requirements" label to "Basic Requirements".

discuss RRIDs?

@CaroleGoble wrote in #111
Sect. 5.6: we should mention RRID, which is a FORCE11 activity.

I'm separating this out, so the main part of that issue, on Related Work, can be assigned to Arfon

Reference lists

In my personal view, one of the shortcomings of the Joint Data Citation Principles is that they don't specifically mention that citations should go into reference lists. I am glad to see that the software citation principles mention reference lists. There might be a better place in the text for this, e.g. in an item 7 on interoperability, again similar to the Joint Data Citation Principles.

Additional references for section 4.3

  1. I think that, at least for historical reasons, we should mention DOAP. Although, afaik, Edd Dumbill does not pursue the project any more, it was, for a long time, almost the only game in town when it came to a more formal set of terms used in computer science (mainly in open source projects).
  2. More recent is the set of terms defined by schema.org, namely https://schema.org/SoftwareApplication. Given the importance schema.org has in the search space, this metadata set may be of great importance in practice (see the sketch below).
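For illustration only (a minimal sketch using schema.org terms; all values are placeholders), a SoftwareApplication description embedded as JSON-LD might look like this:

```python
import json

# Minimal, hypothetical schema.org/SoftwareApplication description
# (JSON-LD as commonly embedded in web pages); all values are placeholders.
software_application = {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    "name": "Example Analysis Tool",
    "softwareVersion": "1.2.0",
    "author": {"@type": "Person", "name": "Ada Example"},
    "url": "https://example.org/example-analysis-tool",
    "license": "https://spdx.org/licenses/MIT",
}

print(json.dumps(software_application, indent=2))
```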

Add prereq "use case" to narrative prior to use case table

Summary of discussion a little before noon on Sunday of the workshop:
Ensure there is an introduction to the "zero" use case in the narrative introduction to the use cases table. This prerequisite is that a "creator" has generated a piece of software that has metadata. Then the rest of the use cases follow, but we need to clarify that the software has been generated.

Citation styles

Citations in text follow the citation style being used. Two practical recommendations (which might already be work for the implementation group) are: a) include version information, and b) include a label to indicate that it is software, e.g. [Software].
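As a purely illustrative sketch (placeholder values, not a prescribed reference format), these two recommendations could be applied to a reference-list entry like so:

```python
# Hypothetical metadata; the formatted string shows where version information
# and the [Software] label would appear in a reference-list entry.
meta = {
    "author": "Example, A.",
    "year": 2016,
    "title": "Example Analysis Tool",
    "version": "1.2.0",
    "doi": "10.0000/placeholder",
}

reference = (f"{meta['author']} ({meta['year']}). {meta['title']} "
             f"(version {meta['version']}) [Software]. "
             f"https://doi.org/{meta['doi']}")
print(reference)
```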

Deletion of some text about how to cite.

I just opened and closed an issue because I may have misunderstood the text. So I am reopening this one but would still wish this text to be deleted:

"In addition, if the software authors ask that a paper should be cited, that should be respected."

It's not clear to me if the point being made is what paper to cite once a decision is made to cite some software.

I feel that sentence is still superfluous and could be deleted, but I feel much less strongly about this. The preceding and succeeding sentences surely cover every case? I.e. they clearly explain that if I should cite it, I should cite it, so I'm not sure what the "in addition" point is.

On the other hand this sentence could be read as implying that if I use software and the authors want me to cite it, then I should cite it. This is the point I was addressing earlier, to which I strongly objected. And which I would stand by if it was interpreted that way. In that case I would much more strongly urge deletion of that sentence, as per my now closed issue.

Recommended vs. required/minimal metadata for use cases

Based on comments from the 5 April call and some in the Use Cases Google Doc, there is some interest in differentiating, for each use case, between metadata that we see as required/minimal and metadata that we recommend.

My suggestion is that we use an open circle (LaTeX: \textopenbullet) for the "optional" recommended metadata.

There are already some suggestions:

  • @mfenner and @owlice suggested adding description/abstract/readme as recommended
  • @ljhwang suggested that license may become recommended (rather than required) for most

Since this may involve some discussion, perhaps rather than issuing PRs people can make additional suggestions or comments here.

Document which use cases are in scope and whether principles sufficient

The use case section appears to refer to use cases that are not within (or not entirely within) the scope of the recommendation. For example, "show how funded software has been used" seems to relate to citing a series of software, not a specific version. It is not clear that citing a series is in scope.

Recommend indicating for each use case (a) whether the use case is in the scope of the recommendations and (b) if in scope, whether the principles are necessary vs. sufficient for citation with respect to the use case.

Discussion items

Aspects that we should make some reference to in the discussion, even if it's just to rule them out of scope:

What should we say about "Software Papers"

I think a key unresolved question is how to address the practice of "software papers".

If a piece of software has a "software paper", should that be:

  1. cited on its own (superseding the software citation itself),
  2. cited in addition to the software citation itself,
  3. not cited; only cite the software itself (discourage software papers).

I'm not really sure, but I think my vote is for 2, although I acknowledge that this then creates two citations, exacerbating the "too many references" issue.

number of articles in table 1

Table 1 refers to 286 publications, but the 2nd paragraph of the section "Motivation" refers to the same source but regarding a random sample of 90 articles. Should it be 286 too?

Persistence of identifier vs. persistence of software

The persistence principle outlined in (4) is a key element in making software citable. Where software has become part of the record of science, not only should the identifier and metadata of the software be persistent; it should also be the goal to keep a persistent copy of the source code, where applicable. This links with the accessibility principle (5).

There are still many open questions about how to resolve package dependencies in the long term; therefore, I would not make persistent access to code a hard requirement, but would add something more specific about preserving the record of science.

new use case for funder?

As a funder, I want to measure the impact of the researchers I fund. This is a bit different from measuring the software itself; it might require a new line in the table, with the requirement "authors".

Line 440 re: inaccessible versions

For fast reference:

As stated in the Persistence principle (\ref{principle:persistence}), we recognize that the commercial software version may no longer be available, but it still should be cited along with information about how it was accessed.

Should this be limited to commercial software only? I can think of a few 'open' hydro models that don't maintain older versions.

Change of wording of Software Importance

(Sorry to be doing this late, I realise it would have been better to contribute earlier.)

I suggest a change of wording for Importance in section 1.

The current wording is "Software should be cited whenever and wherever a research product (such as a paper or derived software) relies upon it, specifically, as part of the standard reference list for that research product."

I feel the current text is too dogmatic about what software should be cited, and indeed contradicts the preamble text, which says "For example, in this section we do not define what software should be cited, but how it should be cited."

To emphasise the point that software should be treated the same as anything else, I would suggest a revision to something like:

Software should be cited on the same basis as any other research product such as a paper or book. That is, authors should see citing the appropriate set of software products as being as important as citing the appropriate set of papers. Software citations should not be separated and should be part of the standard reference list for that research product

Two typos, lines 301/401

Line 301: change "noteable" to "notable".

Line 401 to:

Understanding these chains of knowledge and credit have been part of the history of science field for some time, though more recent work is suggesting more nuanced evaluation of the credit chains~\cite{casrai-credit, transitive_credit_json-ld}.

(insert 'the' so it reads 'part of the history of science')
