srstevenson / nb-clean

Clean Jupyter notebooks of outputs, metadata, and empty cells, with Git integration

Home Page: https://pypi.org/project/nb-clean/

License: ISC License

Python 80.28% Jupyter Notebook 19.72%
git jupyter jupyter-notebook notebook version-control

nb-clean's People

Contributors

bneijt, carderne, danieltsiang, dependabot[bot], fcooper8472, imrehg, jamesbraza, jbfiot, srstevenson, thatlittleboy, tovrstra, uyiyei, yasirroni

nb-clean's Issues

Only preserve `cells`.

What do you think about preserving only `cells`? That is, cleaning everything except `cells`.

In the notebook example, it will destroy:

 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:Python3] *",
   "language": "python",
   "name": "conda-env-Python3-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
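A minimal sketch of what this proposal would mean, written against a plain dict rather than nb-clean's own code. The function name `keep_only_cells` is hypothetical. Note that the nbformat v4 schema lists `metadata`, `nbformat`, and `nbformat_minor` as required top-level keys, so a notebook stripped this way may no longer validate.

```python
import json


def keep_only_cells(notebook: dict) -> tuple[dict, list[str]]:
    """Return a copy holding only ``cells``, plus the names of the dropped keys."""
    dropped = sorted(key for key in notebook if key != "cells")
    return {"cells": notebook.get("cells", [])}, dropped


# Trimmed-down version of the notebook quoted above.
example = {
    "cells": [],
    "metadata": {"kernelspec": {"name": "conda-env-Python3-py"}},
    "nbformat": 4,
    "nbformat_minor": 2,
}

stripped, dropped = keep_only_cells(example)
print(json.dumps(stripped))
print(dropped)
```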

Change the current file

Hey

Is it possible to change the current file by running nb-clean -i Sample.ipynb? Instead of creating a new one and renaming it to the old one.

Thank you!

By the way, congratulations on this cool tool!

Removing the git filter

A useful feature would be the ability to remove the git filter using nb-clean, maybe something like nb-clean configure-git --uninstall. Thanks!

Filter cleans python version metadata

The filter installed by nb-clean add-filter --preserve-cell-metadata removes the Python version from the metadata at the end of the notebook. This causes a metadata mismatch between the local Git and GitHub copies of the notebook.

- "pygments_lexer": "ipython3",
- "version": "3.8.8"

+ "pygments_lexer": "ipython3"

Every time I open a notebook after pushing with the filter, the notebook shows up as modified. Is it possible to fix that? Thanks in advance!
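To make the diff above concrete, here is a small sketch of the transformation it shows, applied to a plain dict. It is not nb-clean's implementation, and `drop_language_version` is a hypothetical name. Jupyter presumably writes `language_info.version` back when the notebook is saved, which would explain why the file looks modified again after each session.

```python
import copy


def drop_language_version(notebook: dict) -> dict:
    """Return a copy of the notebook without metadata.language_info.version."""
    cleaned = copy.deepcopy(notebook)
    cleaned.get("metadata", {}).get("language_info", {}).pop("version", None)
    return cleaned


saved = {
    "cells": [],
    "metadata": {
        "language_info": {"pygments_lexer": "ipython3", "version": "3.8.8"}
    },
}
committed = drop_language_version(saved)
print(committed["metadata"]["language_info"])
```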

[BUG] Filtering to ignore metadata during checks not working as expected

Problem

Using the latest version of nb-clean (3.2.0), the check fails to ignore the following metadata even though it is configured to ignore it:

metadata: language_info.version

Examples

When checking notebook:

$ nb-clean check my-notebook.ipynb 
my-notebook.ipynb metadata: language_info.version

Not working as expected

I thought nb-clean should no longer complain about the language info version metadata, but it still does:

$ nb-clean check my-notebook.ipynb -m/--preserve-cell-metadata 
my-notebook.ipynb metadata: language_info.version
$ nb-clean check my-notebook.ipynb -m/--preserve-notebook-metadata
my-notebook.ipynb metadata: language_info.version
$ nb-clean add-filter --preserve-cell-metadata
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version

$ nb-clean remove-filter
$ nb-clean add-filter --preserve-notebook-metadata
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version
$ nb-clean add-filter --preserve-notebook-metadata check my-notebook.ipynb
usage: nb-clean [-h] {version,add-filter,remove-filter,check,clean} ...
nb-clean: error: unrecognized arguments: check my-notebook.ipynb

Working as desired

Although this is not in the README.md, this is the only way I could get it to work as I desired, i.e. ignore the language info version metadata:

$ nb-clean add-filter --preserve-cell-metadata check my-notebook.ipynb 
$ echo $?
0

Update

It turns out I misunderstood the README: it wasn't obvious to me that the table was offering either the shorthand or the longhand form of each flag. So this does work:

$ nb-clean check my-notebook.ipynb --preserve-notebook-metadata
$ echo $?
0

I would suggest modifying the README to include only the longhand (or shorthand) flags in the table, to avoid similar future confusion.

Also, there is a copy-paste typo for ignoring notebook metadata, i.e. it incorrectly copied the values from the cell metadata.

[BUG] Using preserve cell metadata flag before filename not working as expected

Problem

Using the latest version of nb-clean (3.2.0), running check with the --preserve-cell-metadata flag placed before the filename causes the command to hang indefinitely.

Examples

Working as expected

$ nb-clean check notebook.ipynb --preserve-cell-metadata
notebook.ipynb metadata: language_info.version

Not working as expected

These commands hang and never finish:

$ nb-clean check --preserve-cell-metadata notebook.ipynb 
$ nb-clean check --preserve-notebook-metadata --preserve-cell-metadata notebook.ipynb 

Other working examples

These commands also work as expected:

$ nb-clean check --preserve-notebook-metadata notebook.ipynb
notebook cell 12: metadata
$ nb-clean check notebook.ipynb --preserve-notebook-metadata
notebook cell 12: metadata
$ nb-clean check notebook.ipynb --preserve-cell-metadata --preserve-notebook-metadata
$ echo $?
0
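One possible explanation, offered as an assumption and not confirmed anywhere in this thread: if the flag is declared so that it accepts an optional list of values, argparse lets it swallow the filename that follows, leaving no input paths, and a tool that falls back to standard input would then wait for input forever. The parser below is a hypothetical stand-in for illustration only, not nb-clean's actual argument definitions.

```python
import argparse

# Hypothetical parser: an option that takes zero or more values, plus
# zero or more positional input paths.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("inputs", nargs="*", default=[])
parser.add_argument("--preserve-cell-metadata", nargs="*", default=None)

# Flag first: the filename is consumed as a value of the flag.
flag_first = parser.parse_args(["--preserve-cell-metadata", "notebook.ipynb"])
# Flag last: the filename is parsed as an input path.
flag_last = parser.parse_args(["notebook.ipynb", "--preserve-cell-metadata"])

print(flag_first)
print(flag_last)
```

If that is what happens, placing the flag after the filename, as in the working examples above, avoids the ambiguity.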

Git filter only if file >100MB

Hi, is there a flag to set a Git filter that only cleans files larger than 100 MB (the GitHub file size limit)? It would be much appreciated!
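No such flag appears in this thread. As a rough sketch of the idea, a wrapper around the cleaning step could pass small payloads through untouched. The names `clean_if_large` and `SIZE_LIMIT` are hypothetical, and the threshold is simply the figure from the question.

```python
from typing import Callable

SIZE_LIMIT = 100 * 1024 * 1024  # assumed threshold, taken from the question


def clean_if_large(
    raw: bytes,
    clean: Callable[[bytes], bytes],
    limit: int = SIZE_LIMIT,
) -> bytes:
    """Apply `clean` only when the payload is larger than `limit` bytes."""
    return clean(raw) if len(raw) > limit else raw


# Stand-in cleaner for the demo; a real one would strip notebook outputs.
def fake_clean(raw: bytes) -> bytes:
    return b"cleaned"


small = clean_if_large(b"tiny", fake_clean, limit=10)
large = clean_if_large(b"x" * 11, fake_clean, limit=10)
print(small, large)
```

A Git clean filter receives the file content on standard input, so measuring the payload itself is more reliable here than checking the size of a path on disk.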

May it be useful to sometimes preserve metadata?

Thanks for the nice tool!

Unfortunately, I find it disrupts some of my notebooks, since I use papermill, which relies on cell tags (stored in the cell metadata). These seem worth preserving.

Does metadata often contain something that is not worth saving? If so, would it make sense to have an option to preserve metadata?

Cannot clean notebooks encountering “NotJSONError” with plotly js code inside

@srstevenson Thanks for this awesome repo. I am having some trouble cleaning notebooks with html/js inside. Below is the detailed error. Please kindly check it out :)

System :

Windows Server 2022 Datacenter 21H2 20348.2402

Core Packages :

jupyterlab >= 4.0.10
nbformat 5.9.2
nb-clean 3.2.0
plotly 5.18.0

Core Commands :

nb-clean add-filter
git add plotly-example-2.ipynb

It works well on notebooks without plotly, but I get an error from this notebook, which contains plotly's HTML/JS snippets (attached: plotly-example-2.zip).

Error :

I checked the JSON format. The error occurs on line 29, which is the beginning of a chunk of JS containing confusing "" sequences.

(dev) PS D:\Dapu\prod> git add plotly-example-2.ipynb
Traceback (most recent call last):
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 19, in parse_json
    nb_dict = json.loads(s, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 29 column 301224 (char 302194)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\ProgramData\miniconda3\envs\dev\Scripts\nb-clean.exe\__main__.py", line 7, in <module>
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 298, in main
    args.func(args)
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 150, in clean
    notebook = nbformat.read(input_, as_version=nbformat.NO_CONVERT)  # type: ignore[no-untyped-call]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 174, in read
    return reads(buf, as_version, capture_validation_error, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 92, in reads
    nb = reader.reads(s, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 75, in reads
    nb_dict = parse_json(s, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 25, in parse_json
    raise NotJSONError(message) from e
nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '{\n "cells": [\n  {\n   "cell_type": "c...
error: external filter 'nb-clean clean' failed 1
error: external filter 'nb-clean clean' failed
warning: in the working copy of 'plotly-example-2.ipynb', LF will be replaced by CRLF the next time Git touches it

Cannot reproduce using nbformat directly in Python:

When I load the notebook with nbformat, the error does not occur. The whole HTML content in notebook['cells'][0]['outputs'][0]['data']['text/html'] loads fine.

import nbformat
filename = "plotly-example-2.ipynb"
with open(filename, 'r', encoding='utf-8') as f:
    notebook = nbformat.read(f, as_version=nbformat.NO_CONVERT)

notebook['cells'][0]['outputs'][0]['data']['text/html']
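To narrow this down, it may help to look at the exact characters the decoder rejects at the reported offset. The helper below is a sketch using only the standard library; `json_error_context` is a hypothetical name. The snippet above opens the file with an explicit encoding='utf-8', whereas the Git filter hands the notebook to nb-clean on standard input, so the two paths may not see the same text. That is a guess, not a diagnosis.

```python
import json
from typing import Optional


def json_error_context(text: str, radius: int = 40) -> Optional[str]:
    """Return the text around the first JSON decode error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - radius, 0)
        return text[start : err.pos + radius]
    return None


# Tiny malformed document standing in for the notebook content.
print(repr(json_error_context('{"a": }', radius=3)))
```

Running it on the text exactly as the filter receives it, and again on the text read with encoding='utf-8', would show whether the two differ around character 302194.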

Rewrite in Rust

Hi,

Do you think it would make sense to rewrite nb-clean in Rust?

Support batch and wildcard file names

Support this feature:

nb-clean clean --preserve-cell-outputs  notebooks/*

It would first list all .ipynb files in notebooks/, then clean each notebook in turn.
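A sketch of the expansion step described here, using `pathlib`. The helper `expand_notebooks` is hypothetical, and skipping `.ipynb_checkpoints` directories is an added assumption rather than part of the request.

```python
from pathlib import Path


def expand_notebooks(paths):
    """Expand files and directories into a flat list of .ipynb paths."""
    found = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Recurse into the directory, ignoring Jupyter checkpoint copies.
            found.extend(
                sorted(
                    p
                    for p in path.rglob("*.ipynb")
                    if ".ipynb_checkpoints" not in p.parts
                )
            )
        elif path.suffix == ".ipynb":
            found.append(path)
    return found
```

Each returned path could then be cleaned in a loop. Note that with `notebooks/*` the shell expands the wildcard before the program starts, so the tool only needs to accept several paths and ignore the ones that are not notebooks.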

"Check" mode for use in CI

It would be great if you could do something like nb-clean check . or nb-clean clean -i . --check to check if notebooks have been cleaned without making any changes so that nb-clean can be used in continuous integration checks.
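As an illustration of what such a check could do, the sketch below inspects a notebook dict and reports what a cleaner would remove, without modifying anything. It is not nb-clean's implementation: `find_uncleaned` is a hypothetical name, and only execution counts and outputs are checked.

```python
def find_uncleaned(notebook: dict) -> list[str]:
    """List the parts of a notebook that cleaning would remove."""
    problems = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("execution_count") is not None:
            problems.append(f"cell {index}: execution count")
        if cell.get("outputs"):
            problems.append(f"cell {index}: outputs")
    return problems


notebook = {
    "cells": [
        {"cell_type": "code", "execution_count": 1, "outputs": []},
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "execution_count": None, "outputs": []},
    ]
}
problems = find_uncleaned(notebook)
print(problems)
# A CI wrapper would exit with a non-zero status when `problems` is non-empty.
```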

When preserving output, remove the execution count on outputs of type execute_result

When using the option to preserve outputs, the execution count of the output is currently preserved.

I have tested it, and both Notebook and Lab render without problems if this field is set to null, like the execution counts of the cells themselves.

It is only relevant to outputs of type execute_result since according to the spec, it's the only type of output with that field.
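A sketch of the requested behaviour on a plain cell dict: outputs are kept, but the execution count on `execute_result` outputs is set to null. The function name is hypothetical, and the cell is modified in place.

```python
def clear_output_execution_counts(cell: dict) -> dict:
    """Set execution_count to None on execute_result outputs, keeping the rest."""
    for output in cell.get("outputs", []):
        if output.get("output_type") == "execute_result":
            output["execution_count"] = None
    return cell


cell = {
    "cell_type": "code",
    "execution_count": None,
    "outputs": [
        {"output_type": "stream", "name": "stdout", "text": ["hi\n"]},
        {
            "output_type": "execute_result",
            "execution_count": 7,
            "data": {"text/plain": ["42"]},
            "metadata": {},
        },
    ],
}
clear_output_execution_counts(cell)
print(cell["outputs"][1]["execution_count"])
```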

Preserve execution count when using `nb-clean`

Is it possible to retain the cell execution count when using nb-clean?

My objective is just to remove metadata, while retaining cell execution count and cell outputs.

I'm not sure if this option exists (it is not clear to me from reading nb-clean clean --help).

I'm using this at the moment:

nb-clean clean -eo <...ipynb>

If it doesn't exist, would you consider adding this option to nb-clean? thanks.

Remove python version info

Hi,

How can I remove the Python version info?

git diff
diff --git a/android/build.ipynb b/android/build.ipynb
index a06f06f..0d103af 100644
--- a/android/build.ipynb
+++ b/android/build.ipynb
@@ -120,7 +120,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.5"
+   "version": "3.8.6"
   }
  },
  "nbformat": 4,

Thanks.

Feedback in cli missing

There is no feedback in the CLI when a command is executed.

For example:

nb-clean clean notebook.ipynb --> "notebook.ipynb was successfully cleaned"

or

nb-clean check notebook.ipynb --> "notebook.ipynb is already clean"

Notebooks not cleaned when staged for git commit

Issue

Following the documentation to add a git filter by running nb-clean add-filter correctly creates/modifies the /.git/info/attributes and /.git/config files. However, notebooks that are staged for commit are not cleaned by the filter.

Expected Behavior

After running nb-clean add-filter in the root directory of the git repo, any notebooks staged for a commit should be cleaned by the filter prior to staging.

Setup: a new git repository has been created by running git init. nb-clean has been installed by running pip install nb-clean, and the nb-clean add-filter command has been run. The correct git files have been modified:

# .git/info/attributes

*.ipynb filter=nb-clean

# .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[filter "nb-clean"]
	clean = nb-clean clean

Create a new jupyter notebook and add some data.

Output for an unstaged notebook, test_notebook.ipynb:

$ cat test_notebook.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ce5c1671",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "0f43b93d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.005712509155273\n"
     ]
    }
   ],
   "source": [
    "time_start = time.time()\n",
    "time.sleep(5)\n",
    "print(time.time() - time_start)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19ff96fb",
   "metadata": {},
   "source": [
    "# This is a markdown cell\n",
    "\n",
    "### Blah blah blah\n",
    "\n",
    "<a href=\"#\" target=\"_blank\">Link with HTML formatting</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b5364b1b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# running many executions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c65a168c",
   "metadata": {},
   "source": [
    "The following is an empty cell to test `nb-clean clean --remove-empty-cells`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f99d74b9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
$ git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        test_notebook1.ipynb

nothing added to commit but untracked files present (use "git add" to track)

Run git add test_notebook.ipynb.

$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   test_notebook1.ipynb

Run:

$ nb-clean check test_notebook.ipynb
test_notebook1.ipynb cell 0: execution count
test_notebook1.ipynb cell 1: execution count
test_notebook1.ipynb cell 1: outputs
test_notebook1.ipynb cell 3: execution count

Suggested Fix

The .git/config file should be modified as follows:

...
[filter "nb-clean"]
    clean = nb-clean clean %f

This will run the clean function for every file that matches the filter described in the .git/info/attributes file. It might also be nice to include required = true after the clean definition.
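One way to check whether the filter ran is to inspect the staged copy instead of the working file. Git clean filters transform content as it enters the index and leave the working tree unchanged, so running a check on the file on disk can still report uncleaned cells. The sketch below assumes `git` is on the PATH and that `path` is given relative to the top of the repository. The helper names are hypothetical.

```python
import subprocess


def show_staged_command(path: str) -> list[str]:
    """Build the git command that prints the staged (index) version of a file."""
    return ["git", "show", f":{path}"]


def staged_content(path: str) -> str:
    """Return the version of `path` stored in the Git index."""
    result = subprocess.run(
        show_staged_command(path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def differs_from_working_copy(path: str) -> bool:
    """True when the staged version differs from the file on disk."""
    with open(path, encoding="utf-8") as handle:
        return staged_content(path) != handle.read()


print(show_staged_command("test_notebook.ipynb"))
```

If `differs_from_working_copy` returns True after `git add`, the filter changed what was staged even though the file on disk is untouched.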

Further, it would be nice if the clean function made the modifications to the *.ipynb files in-place, rather than making the changes and not adding them to the commit. As it is now, adding the files to a commit will only add the pre-filtered notebooks to the staging area, while the modified version is left out, meaning the user has to re-add the notebooks to the commit (assuming that the above fix is implemented). See the following articles for more explanation:

Information

OS : Ubuntu 20.04 (running on WSL2)

program version
Python 3.6.13
git 2.25.1
nb-clean 2.0.2

Option to preserve tag metadata

The metadata element in a notebook may include a "reserved" tags list that can be used to modify the display of a cell in a live notebook UI or when rendering the notebook using a tool such as Jupyter Book.

It would be useful to be able to clear all metadata except the reserved tags element when cleaning notebooks.
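A sketch of the requested behaviour on a plain cell dict: all cell metadata is cleared except the `tags` list. The function name `keep_only_tags` is hypothetical, and this is not how nb-clean itself is implemented.

```python
def keep_only_tags(cell: dict) -> dict:
    """Clear a cell's metadata in place, keeping only its tags list if present."""
    metadata = cell.get("metadata", {})
    cell["metadata"] = {"tags": metadata["tags"]} if "tags" in metadata else {}
    return cell


tagged = {
    "cell_type": "code",
    "metadata": {"tags": ["parameters"], "scrolled": True},
    "source": ["alpha = 0.1"],
}
untagged = {"cell_type": "code", "metadata": {"collapsed": False}, "source": []}

keep_only_tags(tagged)
keep_only_tags(untagged)
print(tagged["metadata"], untagged["metadata"])
```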

Add `required` option to `git filter`

Thanks for writing such a convenient tool!
I would like an option to add required = True to the Git filter configuration, so that I can choose to get this output:

[filter "nb-clean"]
	clean = nb-clean clean --preserve-cell-metadata
	required = True

Instead of

[filter "nb-clean"]
	clean = nb-clean clean --preserve-cell-metadata
	required = False # Could also use the empty line like before

The main use case for me is that I use VSCode for quick edits (including editing Jupyter notebooks). When I use the VSCode Git CLI and have nb-clean installed in a conda/mamba environment, VSCode does not use the correct environment and the cleaning is silently skipped (as VSCode doesn't find the binary).
I know that there are many possible ways to fix this (not using conda/mamba, installing nb-clean via pipx, etc.), but since it should be a minor change, maybe it could add some convenience for similar use cases?

I am also happy to add a tiny PR :)

Thanks!

Feature request: preserve notebook metadata

Hi Scott, hope you're well!

I'm interested in using nb-clean in a project to fail CI when output is committed in a notebook, but I'd like to be permissive about metadata. I see that the --preserve-cell-metadata option allows you to permit cell metadata, but language_info.version cannot currently be permitted, if I understand correctly.

Would you consider adding a flag to permit this metadata?

Request: `pyproject.toml`-based configuration

I currently use nb-clean in my pre-commit config like so:

---
default_language_version:
    python: python3

repos:
    - repo: https://github.com/srstevenson/nb-clean
      rev: 3.2.0
      hooks:
          - id: nb-clean
            args: [--preserve-cell-outputs, --remove-empty-cells]

It would be nice if the args could be moved to a pyproject.toml, so that running nb-clean clean locally (outside of pre-commit) could also use the same configuration.
