srstevenson / nb-clean

Clean Jupyter notebooks of outputs, metadata, and empty cells, with Git integration

Home Page: https://pypi.org/project/nb-clean/

License: ISC License

Python 80.28% Jupyter Notebook 19.72%

git jupyter jupyter-notebook notebook version-control

nb-clean's People

connectedcars louisdorard volodymyrss fcooper8472 jharris427 jbfiot kai-tub takana-v stoneidolon yasirroni psychemedia carderne bneijt tovrstra shacharhelmer qzyu999 thatlittleboy danieltsiang

nb-clean's Issues

Only preserve `cells`.

What do you think about only preserving `cells`? That is, cleaning everything except `cells`.

In the notebook example, it will destroy:

 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:Python3] *",
   "language": "python",
   "name": "conda-env-Python3-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2

Change the current file

Hey

Is it possible to change the current file in place by running nb-clean -i Sample.ipynb, instead of creating a new file and renaming it over the old one?

Thank you!

By the way, congratulations on this cool tool!

Removing the git filter

A useful feature would be the ability to remove the git filter using nb-clean, maybe something like nb-clean configure-git --uninstall. Thanks!

Cleans ALL the metadata, including any RISE slide information

I found that all the cell metadata was removed, not just the cell outputs

Filter cleans python version metadata

The filter installed by nb-clean add-filter --preserve-cell-metadata removes the Python version from the metadata at the end of the notebook. This causes a metadata mismatch between the local working copy and the notebook stored on GitHub.

- "pygments_lexer": "ipython3",
- "version": "3.8.8"

+ "pygments_lexer": "ipython3"

Every time I open a notebook after pushing with the filter, the notebook shows up as modified. Is it possible to fix that? Thanks in advance!

Support Jupyter `savehook`

It is nice to have the Git filter and the pre-commit hook. But how about a save hook, so that the cleaning runs every time a notebook is saved?

See https://jupyter-notebook.readthedocs.io/en/stable/extending/savehooks.html

What is the benefit? It will not cause git diff if you just open a notebook (because opening a notebook will add metadata) and save. Running it will also not add diff if nothing changed.
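As a rough sketch of what such a hook could look like: the Jupyter documentation linked above describes a `pre_save_hook` that receives the notebook model before it is written to disk. The function below is modelled on that documented pattern and is not part of nb-clean. The name `scrub_on_save` is hypothetical, and the function only clears outputs, execution counts, and cell metadata itself rather than calling nb-clean.

```python
def scrub_on_save(model, **kwargs):
    """Clear outputs, execution counts and cell metadata before saving."""
    # The hook is called for every file type; only touch notebooks.
    if model.get("type") != "notebook":
        return
    content = model["content"]
    # Only handle nbformat v4, which is the layout assumed below.
    if content.get("nbformat") != 4:
        return
    for cell in content["cells"]:
        cell["metadata"] = {}
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None


# In jupyter_notebook_config.py, where `c` is provided by Jupyter:
# c.FileContentsManager.pre_save_hook = scrub_on_save
```

Because the hook modifies the model in place before it is saved, opening and re-saving an already clean notebook would leave the file on disk unchanged in these fields.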

[BUG] Filtering to ignore metadata during checks not working as expected

Problem

Using the current latest version of nb-clean, i.e. 3.2.0, the check fails to ignore the following metadata even though it is configured to ignore it:

metadata: language_info.version

Examples

When checking notebook:

$ nb-clean check my-notebook.ipynb 
my-notebook.ipynb metadata: language_info.version

Not working as expected

I thought nb-clean should no longer complain about the language info version metadata, but it still does:

$ nb-clean check my-notebook.ipynb -m/--preserve-cell-metadata 
my-notebook.ipynb metadata: language_info.version

$ nb-clean check my-notebook.ipynb -m/--preserve-notebook-metadata
my-notebook.ipynb metadata: language_info.version

$ nb-clean add-filter --preserve-cell-metadata
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version

$ nb-clean remove-filter
$ nb-clean add-filter --preserve-notebook-metadata
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version

$ nb-clean add-filter --preserve-notebook-metadata check my-notebook.ipynb
usage: nb-clean [-h] {version,add-filter,remove-filter,check,clean} ...
nb-clean: error: unrecognized arguments: check my-notebook.ipynb

Working as desired

Although this is not in the README.md, this is the only way I could get it to work as I desired, i.e. ignore the language info version metadata:

$ nb-clean add-filter --preserve-cell-metadata check my-notebook.ipynb 
$ echo $?
0

Update

It turns out I misunderstood the README: it wasn't obvious to me that it was saying I could use either the shorthand or the longhand form of each flag. So this does work:

$ nb-clean check my-notebook.ipynb --preserve-notebook-metadata
$ echo $?
0

I would suggest modifying the README to include only the longhand (or shorthand) flags in the table, to avoid similar future confusion.

Also, there is a copy-paste typo for ignoring notebook metadata, i.e. it incorrectly copied the values from the cell metadata.

[BUG] Using preserve cell metadata flag before filename not working as expected

Problem

Using the current latest version of nb-clean, i.e. 3.2.0, checking a notebook with the preserve cell metadata flag placed before the filename causes the command to hang indefinitely.

Examples

Working as expected

$ nb-clean check notebook.ipynb --preserve-cell-metadata
notebook.ipynb metadata: language_info.version

Not working as expected

These commands hang and never finish:

$ nb-clean check --preserve-cell-metadata notebook.ipynb

$ nb-clean check --preserve-notebook-metadata --preserve-cell-metadata notebook.ipynb

Other working examples

These commands also work as expected:

$ nb-clean check --preserve-notebook-metadata notebook.ipynb
notebook cell 12: metadata

$ nb-clean check notebook.ipynb --preserve-notebook-metadata
notebook cell 12: metadata

$ nb-clean check notebook.ipynb --preserve-cell-metadata --preserve-notebook-metadata
$ echo $?
0
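One possible explanation for the hang, offered as an assumption rather than a confirmed diagnosis: if `--preserve-cell-metadata` is declared so that it accepts zero or more field names, then a filename placed directly after it is parsed as a field name, no notebook paths remain, and the command waits for input on stdin. The parser below is a hypothetical stand-in, not nb-clean's actual code. It only demonstrates how Python's `argparse` treats an option declared with `nargs="*"`.

```python
import argparse

# Hypothetical parser that mimics the suspected option layout.
parser = argparse.ArgumentParser(prog="demo-check")
parser.add_argument("paths", nargs="*")
parser.add_argument("-m", "--preserve-cell-metadata", nargs="*", default=None)

# Flag before the filename: the filename is swallowed as a field name.
before = parser.parse_args(["--preserve-cell-metadata", "notebook.ipynb"])

# Flag after the filename: the filename is parsed as a path.
after = parser.parse_args(["notebook.ipynb", "--preserve-cell-metadata"])

print("flag first:", before.paths, before.preserve_cell_metadata)
print("flag last: ", after.paths, after.preserve_cell_metadata)
```

Under this assumption, the working examples above are consistent with the flag being placed after the filename, or after another flag that takes no values.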

Git filter only if file >100MB

Hi, is there a flag to set a git filter which only cleans files >100MB (Github file size limit)? Would be much appreciated!
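No such flag appears among the options mentioned in these issues. A possible workaround, sketched here under that assumption, is a small wrapper used as the filter's clean command: Git passes the file content on stdin, and the wrapper only hands it to nb-clean clean when it exceeds a size threshold. The function name `filter_notebook`, the `SIZE_LIMIT` constant, and the wrapper itself are hypothetical.

```python
import subprocess

SIZE_LIMIT = 100 * 1024 * 1024  # 100 MB, the limit mentioned in the request


def filter_notebook(data: bytes, limit: int = SIZE_LIMIT, clean=None) -> bytes:
    """Return data unchanged if it is small, otherwise the cleaned notebook."""
    if len(data) <= limit:
        return data
    if clean is None:
        # Default: pipe the notebook through `nb-clean clean` via stdin/stdout,
        # the same way the Git filter invokes it.
        def clean(raw: bytes) -> bytes:
            result = subprocess.run(
                ["nb-clean", "clean"], input=raw, capture_output=True, check=True
            )
            return result.stdout
    return clean(data)


# As a filter script, one would read sys.stdin.buffer, call filter_notebook,
# and write the result to sys.stdout.buffer.
```

The `clean` parameter exists only so the size check can be exercised without nb-clean installed.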

may it be useful to sometimes preserve metadata?

Thanks for the nice tool!

Unfortunately I find it disrupts some of my notebooks, since I use papermill which relies on cell tags (stored in the cell metadata). And it seems to be something to preserve.

Does metadata often contain something that is not worth saving? If so, would it make sense to have an option to preserve metadata?

Cannot clean notebooks encountering “NotJSONError” with plotly js code inside

@srstevenson Thanks for this awesome repo. I am having some trouble cleaning notebooks with html/js inside. Below is the detailed error. Please kindly check it out :)

System :

Windows Server 2022 Datacenter 21H2 20348.2402

Core Packages :

jupyterlab >= 4.0.10
nbformat 5.9.2
nb-clean 3.2.0
plotly 5.18.0

Core Commands :

nb-clean add-filter
git add plotly-example-2.ipynb

It works well on notebooks without plotly.
But I get an error from this notebook, which has plotly's HTML/JS snippets in it (attached as plotly-example-2.zip).

Error :

I checked the json format. It happens on line 29 which is the beginning of a chunk of js snippet having confusing "" in it.

(dev) PS D:\Dapu\prod> git add plotly-example-2.ipynb
Traceback (most recent call last):
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 19, in parse_json
    nb_dict = json.loads(s, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 29 column 301224 (char 302194)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\ProgramData\miniconda3\envs\dev\Scripts\nb-clean.exe\__main__.py", line 7, in <module>
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 298, in main
    args.func(args)
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 150, in clean
    notebook = nbformat.read(input_, as_version=nbformat.NO_CONVERT)  # type: ignore[no-untyped-call]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 174, in read
    return reads(buf, as_version, capture_validation_error, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 92, in reads
    nb = reader.reads(s, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 75, in reads
    nb_dict = parse_json(s, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 25, in parse_json
    raise NotJSONError(message) from e
nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '{\n "cells": [\n  {\n   "cell_type": "c...
error: external filter 'nb-clean clean' failed 1
error: external filter 'nb-clean clean' failed
warning: in the working copy of 'plotly-example-2.ipynb', LF will be replaced by CRLF the next time Git touches it

Cannot reproduce using nbformat directly in Python:

When I load the notebook with nbformat directly, the error does not occur, and the whole HTML content is available in notebook['cells'][0]['outputs'][0]['data']['text/html'].

import nbformat
filename = "plotly-example-2.ipynb"
with open(filename, 'r', encoding='utf-8') as f:
    notebook = nbformat.read(f, as_version=nbformat.NO_CONVERT)

notebook['cells'][0]['outputs'][0]['data']['text/html']

Rewrite in Rust

Hi,

Do you think it would make sense to rewrite nb-clean in Rust?

Support batch and wildcard file names

Support this feature:

nb-clean clean --preserve-cell-outputs  notebooks/*

It first will list all .ipynb files in notebooks/, then iteratively clean all the notebooks.

It would be great if you could do something like nb-clean check . or nb-clean clean -i . --check to check if notebooks have been cleaned without making any changes so that nb-clean can be used in continuous integration checks.

Clean cell metadata in addition to output

Currently nb-clean only removes each cell's output and execution count. It should also remove each cell's metadata.

When preserving output, remove the execution count on outputs of type execute_result

When using the option to preserve outputs, the execution count of the output is currently preserved.

I have tested it, and both Jupyter Notebook and JupyterLab render without problems if this field is set to null, like the execution counts of the cells themselves.

It is only relevant to outputs of type execute_result since according to the spec, it's the only type of output with that field.
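A minimal sketch of the requested behaviour, operating on a notebook loaded as a plain dictionary. The function name `clear_result_execution_counts` is hypothetical and this is not nb-clean's implementation; it only illustrates nulling the field on `execute_result` outputs while leaving the outputs themselves in place.

```python
def clear_result_execution_counts(notebook: dict) -> dict:
    """Set execution_count to None on execute_result outputs, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            # Other output types (stream, display_data, error) are left alone.
            if output.get("output_type") == "execute_result":
                output["execution_count"] = None
    return notebook
```

When the notebook is written back as JSON, `None` is serialised as `null`, which is the value described above.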

Option to preserve output

Based on nb-clean --help, there is no option to preserve output cells.

Preserve execution count when using `nb-clean`

Is it possible to retain the cell execution count when using nb-clean?

My objective is just to remove metadata, while retaining cell execution count and cell outputs.

I'm not sure if this option exists (not clear to me from reading nb-clean clean --help)

I'm using this at the moment

nb-clean clean -eo <...ipynb>

If it doesn't exist, would you consider adding this option to nb-clean? thanks.

Remove python version info

Hi,

How to remove python version info?

git diff
diff --git a/android/build.ipynb b/android/build.ipynb
index a06f06f..0d103af 100644
--- a/android/build.ipynb
+++ b/android/build.ipynb
@@ -120,7 +120,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.5"
+   "version": "3.8.6"
   }
  },
  "nbformat": 4,

Thanks.
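Other issues on this page suggest that nb-clean itself treats `language_info.version` as metadata to be cleaned. Independently of that, the operation is small enough to sketch on a notebook loaded as a plain dictionary. The function name `drop_language_version` is hypothetical.

```python
def drop_language_version(notebook: dict) -> dict:
    """Remove metadata.language_info.version if present, in place."""
    language_info = notebook.get("metadata", {}).get("language_info", {})
    # pop with a default does nothing when the key is already absent.
    language_info.pop("version", None)
    return notebook
```

With the version field gone, switching between Python 3.7.5 and 3.8.6 as in the diff above would no longer change the stored notebook.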

Feedback in cli missing

There is no feedback in the CLI when a command is executed.

For example:

nb-clean clean notebook.ipynb --> "notebook.ipynb was successfully cleaned"

nb-clean check notebook.ipynb --> "notebook.ipynb is already clean"

Notebooks not cleaned when staged for git commit

Issue

Following the documentation to add a git filter by running nb-clean add-filter correctly creates/modifies the /.git/info/attributes and /.git/config files. However, notebooks that are staged for commit are not cleaned by the filter.

Expected Behavior

After running nb-clean add-filter in the root directory of the git repo, any notebooks staged for a commit should be cleaned by the filter prior to staging.

Setup: a new git repository has been created by running git init. nb-clean has been installed by running pip install nb-clean, and the nb-clean add-filter command has been run. The correct git files have been modified:

# .git/info/attributes

*.ipynb filter=nb-clean

# .git/config
[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[filter "nb-clean"]
	clean = nb-clean clean

Create a new jupyter notebook and add some data.

Output for an unstaged notebook, test_notebook.ipynb:

$ cat test_notebook.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ce5c1671",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "0f43b93d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.005712509155273\n"
     ]
    }
   ],
   "source": [
    "time_start = time.time()\n",
    "time.sleep(5)\n",
    "print(time.time() - time_start)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19ff96fb",
   "metadata": {},
   "source": [
    "# This is a markdown cell\n",
    "\n",
    "### Blah blah blah\n",
    "\n",
    "<a href=\"#\" target=\"_blank\">Link with HTML formatting</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b5364b1b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# running many executions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c65a168c",
   "metadata": {},
   "source": [
    "The following is an empty cell to test `nb-clean clean --remove-empty-cells`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f99d74b9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

$ git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        test_notebook1.ipynb

nothing added to commit but untracked files present (use "git add" to track)

Run git add test_notebook.ipynb.

$ git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   test_notebook1.ipynb

Run :

$ nb-clean check test_notebook.ipynb
test_notebook1.ipynb cell 0: execution count
test_notebook1.ipynb cell 1: execution count
test_notebook1.ipynb cell 1: outputs
test_notebook1.ipynb cell 3: execution count

Suggested Fix

The .git/config file should be modified as follows:

...
[filter "nb-clean"]
    clean = nb-clean clean %f

This will run the clean function for every file that matches the filter described in the .git/info/attributes file. It might also be nice to include required = true after the clean definition.

Further, it would be nice if the clean function made the modifications to the *.ipynb files in-place, rather than making the changes and not adding them to the commit. As it is now, adding the files to a commit will only add the pre-filtered notebooks to the staging area, while the modified version is left out, meaning the user has to re-add the notebooks to the commit (assuming that the above fix is implemented). See the following articles for more explanation:

Information

OS : Ubuntu 20.04 (running on WSL2)

program	version
`Python`	3.6.13
`git`	2.25.1
`nb-clean`	2.0.2

[Feature] support --recursive in combination with a folder as an argument

I would like support for recursive to have all the notebooks under a given folder to be considered for cleaning.

nb-clean clean --recursive banana.ipynb src would clean both banana.ipynb and all *.ipynb below the folder src.
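The path expansion described here could look roughly like the following sketch. The function name `expand_notebook_paths` and its `recursive` parameter are hypothetical and only illustrate the requested semantics: files are passed through as given, and directories are expanded to the notebooks they contain.

```python
from pathlib import Path


def expand_notebook_paths(paths, recursive=False):
    """Return files named directly plus the notebooks found in directories."""
    notebooks = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # "**/" descends into subdirectories; without it only the
            # directory's own notebooks are matched.
            pattern = "**/*.ipynb" if recursive else "*.ipynb"
            notebooks.extend(sorted(path.glob(pattern)))
        else:
            notebooks.append(path)
    return notebooks
```

A real implementation would probably also want to skip directories such as .ipynb_checkpoints, which this sketch does not do.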

This relates to issue #140 because that could also be achieved using this option.

Option to preserve tag metadata

The metadata element of a notebook cell may include a "reserved" tags list, which may be used to modify the display of the cell in a live notebook UI or when rendering the notebook with a tool such as Jupyter Book.

It would be useful to be able to clear all metadata except the reserved tags element when cleaning notebooks.
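A minimal sketch of this behaviour on a notebook loaded as a plain dictionary. The function name `keep_only_tags` and its `keep` parameter are hypothetical and not taken from nb-clean; the sketch simply drops every cell metadata key that is not in an allow-list.

```python
def keep_only_tags(notebook: dict, keep=("tags",)) -> dict:
    """Drop all cell metadata except the keys listed in keep, in place."""
    for cell in notebook.get("cells", []):
        metadata = cell.get("metadata", {})
        cell["metadata"] = {k: v for k, v in metadata.items() if k in keep}
    return notebook
```

Passing a different `keep` tuple, for example `("tags", "slideshow")`, would extend the allow-list to other fields.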

Add `required` option to `git filter`

Thanks for writing such a convenient tool!
I personally would like it if we could have an option to add required = True to the git filter option.
Optionally, I want to decide whether I want to get the output with:

[filter "nb-clean"]
	clean = nb-clean clean --preserve-cell-metadata
	required = True

Instead of

[filter "nb-clean"]
	clean = nb-clean clean --preserve-cell-metadata
	required = False # Could also use the empty line like before

My main use case is that I use VSCode for quick edits (including editing Jupyter notebooks). When I use the VSCode Git CLI with nb-clean installed in a conda/mamba environment, VSCode does not use the correct environment and the cleaning is silently skipped, because VSCode doesn't find the binary.
I know there are many possible ways to fix this (not using conda/mamba, installing nb-clean via pipx, etc.), but since it should be a minor change, maybe it could add some convenience for similar use cases?

I am also happy to add a tiny PR :)

Thanks!

Feature request: preserve notebook metadata

Hi Scott, hope you're well!

I'm interested in using nb-clean in a project to fail CI when output is committed in a notebook, but I'd like to be permissive about metadata. I see that the --preserve-cell-metadata option allows you to permit cell metadata, but language_info.version cannot currently be permitted, if I understand correctly.

Would you consider adding a flag to permit this metadata?

pre-commit hook?

Have you/someone made a pre-commit hook for this yet?

If not, are you happy for me to?

Request: `pyproject.toml`-based configuration

I currently use nb-clean in my pre-commit config like so:

---
default_language_version:
    python: python3

repos:
    - repo: https://github.com/srstevenson/nb-clean
      rev: 3.2.0
      hooks:
          - id: nb-clean
            args: [--preserve-cell-outputs, --remove-empty-cells]

It would be nice if the args could be moved to a pyproject.toml, so that running nb-clean clean locally (outside of pre-commit) could also use the same configuration.
