srstevenson / nb-clean
Clean Jupyter notebooks of outputs, metadata, and empty cells, with Git integration
Home Page: https://pypi.org/project/nb-clean/
License: ISC License
What do you think about an option to preserve only cells? That is, it would clean everything except 'cells'.
In the notebook example, it would remove:
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:Python3] *",
"language": "python",
"name": "conda-env-Python3-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
Hey
Is it possible to change the current file in place by running nb-clean -i Sample.ipynb, instead of creating a new file and renaming it to the old one?
Thank you!
By the way, congratulations on this cool tool!
A useful feature would be the ability to remove the git filter using nb-clean, maybe something like nb-clean configure-git --uninstall. Thanks!
The filter installed by nb-clean add-filter --preserve-cell-metadata strips the Python version at the end of the notebook. This causes a metadata mismatch between the local notebooks and those on GitHub.
- "pygments_lexer": "ipython3",
- "version": "3.8.8"
+ "pygments_lexer": "ipython3"
Every time I open a notebook after pushing with the filter, the notebook shows up as modified. Is it possible to fix that? Thanks in advance!
It is nice to have the git filter and pre-commit. But how about a save hook, so that it runs on save?
See https://jupyter-notebook.readthedocs.io/en/stable/extending/savehooks.html
What is the benefit? It will not cause a git diff if you just open a notebook and save (because opening a notebook will add metadata). Running it will also not add a diff if nothing changed.
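A minimal sketch of such a save hook, following the pattern shown in the Jupyter save hooks documentation linked above. It only clears outputs and execution counts and does not call nb-clean itself. The function name is illustrative, and the registration line in the trailing comment assumes a `jupyter_notebook_config.py` file; the exact config file may differ between Jupyter versions.

```python
def scrub_output_pre_save(model, **kwargs):
    """Clear outputs and execution counts before a notebook is saved."""
    # Only act on notebooks, and only on nbformat v4 content.
    if model.get("type") != "notebook":
        return
    if model["content"].get("nbformat") != 4:
        return
    for cell in model["content"]["cells"]:
        if cell.get("cell_type") != "code":
            continue
        cell["outputs"] = []
        cell["execution_count"] = None


# In jupyter_notebook_config.py (assumed location), register the hook:
# c.FileContentsManager.pre_save_hook = scrub_output_pre_save
```

Because the hook mutates the model in place before it is written, a notebook that is merely opened and saved would reach disk without outputs or execution counts.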
Using the current latest version of nb-clean, i.e. 3.2.0, it fails to ignore the following when checking the notebook, even though it is configured to ignore it:
metadata: language_info.version
When checking notebook:
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version
I thought nb-clean should no longer complain about the language info version metadata, but it still does:
$ nb-clean check my-notebook.ipynb -m/--preserve-cell-metadata
my-notebook.ipynb metadata: language_info.version
$ nb-clean check my-notebook.ipynb -m/--preserve-notebook-metadata
my-notebook.ipynb metadata: language_info.version
$ nb-clean add-filter --preserve-cell-metadata
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version
$ nb-clean remove-filter
$ nb-clean add-filter --preserve-notebook-metadata
$ nb-clean check my-notebook.ipynb
my-notebook.ipynb metadata: language_info.version
$ nb-clean add-filter --preserve-notebook-metadata check my-notebook.ipynb
usage: nb-clean [-h] {version,add-filter,remove-filter,check,clean} ...
nb-clean: error: unrecognized arguments: check my-notebook.ipynb
Although this is not in the README.md, this is the only way I could get it to work as I desired, i.e. ignore the language info version metadata:
$ nb-clean add-filter --preserve-cell-metadata check my-notebook.ipynb
$ echo $?
0
Turns out I misunderstood the README: it wasn't obvious to me that it was saying I could use either the shorthand or the longhand form of the flags. So this does work.
$ nb-clean check my-notebook.ipynb --preserve-notebook-metadata
$ echo $?
0
I would suggest modifying the README to include only the longhand (or shorthand) flags in the table, to avoid similar future confusion.
Also, there is a copy-paste typo for ignoring notebook metadata, i.e. it incorrectly copied the values from the cell metadata.
Using the current latest version of nb-clean, i.e. 3.2.0, checking the notebook with the preserve cell metadata flag placed before the filename causes the command to hang indefinitely.
$ nb-clean check notebook.ipynb --preserve-cell-metadata
notebook.ipynb metadata: language_info.version
These commands just hang and never finish:
$ nb-clean check --preserve-cell-metadata notebook.ipynb
$ nb-clean check --preserve-notebook-metadata --preserve-cell-metadata notebook.ipynb
These commands also work as expected:
$ nb-clean check --preserve-notebook-metadata notebook.ipynb
notebook cell 12: metadata
$ nb-clean check notebook.ipynb --preserve-notebook-metadata
notebook cell 12: metadata
$ nb-clean check notebook.ipynb --preserve-cell-metadata --preserve-notebook-metadata
$ echo $?
0
Hi, is there a flag to set a git filter which only cleans files >100MB (the GitHub file size limit)? Would be much appreciated!
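Nothing in this thread shows such a flag, so one option would be a small wrapper used as the filter command. The sketch below shows only the size check; `strip_outputs` is a hypothetical stand-in cleaner and `clean_if_large` is an illustrative name, neither is part of nb-clean.

```python
import json

GITHUB_LIMIT_BYTES = 100 * 1024 * 1024  # threshold taken from the question above


def strip_outputs(raw: bytes) -> bytes:
    """Stand-in cleaner (hypothetical): drop outputs and execution counts."""
    notebook = json.loads(raw.decode("utf-8"))
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return (json.dumps(notebook, indent=1, ensure_ascii=False) + "\n").encode("utf-8")


def clean_if_large(raw: bytes, threshold: int = GITHUB_LIMIT_BYTES) -> bytes:
    """Return the input unchanged unless it exceeds the size threshold."""
    if len(raw) <= threshold:
        return raw
    return strip_outputs(raw)
```

A git clean filter receives the file content on standard input and writes the result to standard output, so a wrapper script would pass `sys.stdin.buffer.read()` through `clean_if_large` and write the returned bytes to `sys.stdout.buffer`.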
Thanks for the nice tool!
Unfortunately I find it disrupts some of my notebooks, since I use papermill, which relies on cell tags (stored in the cell metadata). These seem worth preserving.
Does metadata often contain something that is not worth saving? If so, would it make sense to have an option to preserve metadata?
@srstevenson Thanks for this awesome repo. I am having some trouble cleaning notebooks with html/js inside. Below is the detailed error. Please kindly check it out :)
Windows Server 2022 Datacenter 21H2 20348.2402
jupyterlab >= 4.0.10
nbformat 5.9.2
nb-clean 3.2.0
plotly 5.18.0
nb-clean add-filter
git add plotly-example-2.ipynb
It works well on notebooks without plotly.
But I get an error from this notebook, which has plotly's HTML/JS snippets in it: plotly-example-2.zip
I checked the JSON. The error occurs on line 29, which is the beginning of a chunk of JS snippet with confusing "" in it.
(dev) PS D:\Dapu\prod> git add plotly-example-2.ipynb
Traceback (most recent call last):
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 19, in parse_json
nb_dict = json.loads(s, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\json\__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 29 column 301224 (char 302194)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "D:\ProgramData\miniconda3\envs\dev\Scripts\nb-clean.exe\__main__.py", line 7, in <module>
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 298, in main
args.func(args)
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 150, in clean
notebook = nbformat.read(input_, as_version=nbformat.NO_CONVERT) # type: ignore[no-untyped-call]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 174, in read
return reads(buf, as_version, capture_validation_error, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 92, in reads
nb = reader.reads(s, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 75, in reads
nb_dict = parse_json(s, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 25, in parse_json
raise NotJSONError(message) from e
nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '{\n "cells": [\n {\n "cell_type": "c...
error: external filter 'nb-clean clean' failed 1
error: external filter 'nb-clean clean' failed
warning: in the working copy of 'plotly-example-2.ipynb', LF will be replaced by CRLF the next time Git touches it
When I use nbformat to load the notebook, the error does not occur. Reading the whole HTML content from notebook['cells'][0]['outputs'][0]['data']['text/html'] seems to work fine.
import nbformat
filename = "plotly-example-2.ipynb"
with open(filename, 'r', encoding='utf-8') as f:
    notebook = nbformat.read(f, as_version=nbformat.NO_CONVERT)
notebook['cells'][0]['outputs'][0]['data']['text/html']
Hi,
Do you think it would make sense to rewrite nb-clean in Rust?
Support this feature:
nb-clean clean --preserve-cell-outputs notebooks/*
It would first list all .ipynb files in notebooks/, then clean each notebook in turn.
It would be great if you could do something like nb-clean check . or nb-clean clean -i . --check to check if notebooks have been cleaned without making any changes, so that nb-clean can be used in continuous integration checks.
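The kind of read-only check requested here can be sketched as follows. This is an illustration over a plain notebook dictionary, not nb-clean's implementation; the function name `find_unclean` and the message wording are assumptions.

```python
def find_unclean(notebook: dict) -> list:
    """Return a message for each code cell that still has an execution count or outputs."""
    problems = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if cell.get("execution_count") is not None:
            problems.append(f"cell {index}: execution count")
        if cell.get("outputs"):
            problems.append(f"cell {index}: outputs")
    return problems
```

In a CI job, a script could load each notebook with `json.load`, print the returned messages, and exit with a non-zero status if any list is non-empty, leaving the files themselves untouched.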
Currently nb-clean only removes each cell's output and execution count. It should also remove each cell's metadata.
When using the option to preserve outputs, the execution count of the output is currently preserved.
I have tested it, and both Notebook and Lab have no problems rendering if this field is set to null, like the execution counts of the cells themselves.
It is only relevant to outputs of type execute_result since according to the spec, it's the only type of output with that field.
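A minimal sketch of the change described above, operating on a plain notebook dictionary. The function name is illustrative and this is not taken from nb-clean's source.

```python
def clear_output_execution_counts(notebook: dict) -> dict:
    """Set execution_count to None on execute_result outputs, keeping the outputs."""
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            # Only execute_result outputs carry an execution_count field.
            if output.get("output_type") == "execute_result":
                output["execution_count"] = None
    return notebook
```

When the dictionary is written back with `json.dump`, `None` is serialized as `null`, which is the value described as rendering without problems.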
Based on nb-clean --help, there is no option to preserve output cells.
Is it possible to retain the cell execution count when using nb-clean?
My objective is just to remove metadata, while retaining cell execution count and cell outputs.
I'm not sure if this option exists (it is not clear to me from reading nb-clean clean --help).
I'm using this at the moment
nb-clean clean -eo <...ipynb>
If it doesn't exist, would you consider adding this option to nb-clean? Thanks.
Hi,
How can I remove the Python version info?
git diff
diff --git a/android/build.ipynb b/android/build.ipynb
index a06f06f..0d103af 100644
--- a/android/build.ipynb
+++ b/android/build.ipynb
@@ -120,7 +120,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.5"
+ "version": "3.8.6"
}
},
"nbformat": 4,
Thanks.
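Independent of which nb-clean options apply, removing the field shown in the diff above amounts to deleting one key from the notebook-level metadata. A minimal sketch on a plain notebook dictionary, with an illustrative function name:

```python
def drop_language_version(notebook: dict) -> dict:
    """Remove metadata.language_info.version if present, leaving other fields alone."""
    language_info = notebook.get("metadata", {}).get("language_info", {})
    language_info.pop("version", None)
    return notebook
```

The nested `codemirror_mode.version` field is a separate key and is not touched by this function.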
There is no feedback in the CLI when a command is executed.
For example:
nb-clean clean notebook.ipynb
--> "notebook.ipynb was successfully cleaned"
or
nb-clean check notebook.ipynb
--> "notebook.ipynb is already clean"
Following the documentation to add a git filter by running nb-clean add-filter correctly creates/modifies the /.git/info/attributes and /.git/config files. However, notebooks that are staged for commit are not cleaned by the filter.
After running nb-clean add-filter in the root directory of the git repo, any notebooks staged for a commit should be cleaned by the filter prior to staging.
Setup: a new git repository has been created by running git init. nb-clean has been installed by running pip install nb-clean, and the nb-clean add-filter command has been run. The correct git files have been modified:
# .git/info/attributes
*.ipynb filter=nb-clean
# .git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[filter "nb-clean"]
clean = nb-clean clean
Create a new Jupyter notebook and add some data.
Output for an unstaged notebook, test_notebook.ipynb:
$ cat test_notebook.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "ce5c1671",
"metadata": {},
"outputs": [],
"source": [
"import time"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "0f43b93d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.005712509155273\n"
]
}
],
"source": [
"time_start = time.time()\n",
"time.sleep(5)\n",
"print(time.time() - time_start)"
]
},
{
"cell_type": "markdown",
"id": "19ff96fb",
"metadata": {},
"source": [
"# This is a markdown cell\n",
"\n",
"### Blah blah blah\n",
"\n",
"<a href=\"#\" target=\"_blank\">Link with HTML formatting</a>"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "b5364b1b",
"metadata": {},
"outputs": [],
"source": [
"# running many executions"
]
},
{
"cell_type": "markdown",
"id": "c65a168c",
"metadata": {},
"source": [
"The following is an empty cell to test `nb-clean clean --remove-empty-cells`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f99d74b9",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
$ git status
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
test_notebook1.ipynb
nothing added to commit but untracked files present (use "git add" to track)
Run git add test_notebook.ipynb.
$ git status
On branch main
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: test_notebook1.ipynb
Run:
$ nb-clean check test_notebook.ipynb
test_notebook1.ipynb cell 0: execution count
test_notebook1.ipynb cell 1: execution count
test_notebook1.ipynb cell 1: outputs
test_notebook1.ipynb cell 3: execution count
The .git/config file should be modified as follows:
...
[filter "nb-clean"]
clean = nb-clean clean %f
This will run the clean function for every file that matches the filter described in the .git/info/attributes file. It might also be nice to include required = true after the clean definition.
Further, it would be nice if the clean function made the modifications to the *.ipynb files in place, rather than making the changes and not adding them to the commit. As it is now, adding the files to a commit will only add the pre-filtered notebooks to the staging area, while the modified version is left out, meaning the user has to re-add the notebooks to the commit (assuming that the above fix is implemented). See the following articles for more explanation:
OS: Ubuntu 20.04 (running on WSL2)
program | version |
---|---|
Python | 3.6.13 |
git | 2.25.1 |
nb-clean | 2.0.2 |
I would like support for a recursive option, so that all the notebooks under a given folder are considered for cleaning.
nb-clean clean --recursive banana.ipynb src
would clean both banana.ipynb and all *.ipynb files below the folder src.
This relates to issue #140 because that could also be achieved using this option.
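The path expansion behind such a recursive option could look like the sketch below. It is an illustration only: `collect_notebooks` is a hypothetical helper, and skipping `.ipynb_checkpoints` directories is an added assumption rather than something requested above.

```python
from pathlib import Path


def collect_notebooks(paths):
    """Expand files and directories into a sorted list of notebook paths."""
    found = set()
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # rglob descends into subdirectories; skip Jupyter checkpoint copies.
            found.update(
                p for p in path.rglob("*.ipynb")
                if ".ipynb_checkpoints" not in p.parts
            )
        elif path.suffix == ".ipynb":
            found.add(path)
    return sorted(found)
```

Called as `collect_notebooks(["banana.ipynb", "src"])`, it would return banana.ipynb together with every notebook below src, which is the behaviour described for the proposed command.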
The metadata element in a notebook may include "reserved" tags list data that may be used to modify the display of a cell in a live notebook UI or when rendering the notebook using a tool such as Jupyter Book.
It would be useful to be able to clear all metadata except the reserved tags element when cleaning notebooks.
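A minimal sketch of that behaviour on a plain notebook dictionary: every cell's metadata is reduced to its tags list, if it has one. The function name is illustrative and this is not nb-clean's implementation.

```python
def keep_only_tags(notebook: dict) -> dict:
    """Clear each cell's metadata except for the 'tags' entry."""
    for cell in notebook.get("cells", []):
        metadata = cell.get("metadata", {})
        cell["metadata"] = {"tags": metadata["tags"]} if "tags" in metadata else {}
    return notebook
```

Tools that rely on cell tags would still find them after cleaning, while other cell metadata such as editor state is dropped.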
Thanks for writing such a convenient tool!
I personally would like it if we could have an option to add required = True to the git filter option.
Optionally, I want to decide whether I want to get the output with:
[filter "nb-clean"]
clean = nb-clean clean --preserve-cell-metadata
required = True
Instead of
[filter "nb-clean"]
clean = nb-clean clean --preserve-cell-metadata
required = False # Could also use the empty line like before
The main use case for me is that I use VSCode for quick edits (including editing Jupyter notebooks). When I use the VSCode git CLI and have nb-clean installed in a conda/mamba environment, VSCode does not use the correct environment and the cleaning is silently skipped (as VSCode doesn't find the binary).
I know that there are many possible ways to fix it (not using conda/mamba, installing nb-clean via pipx, etc.), but since it should be a minor change, maybe it could add some more convenience for similar use cases?
I am also happy to add a tiny PR :)
Thanks!
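Until such an option exists, the same configuration can be written with plain `git config` calls. The sketch below only builds and runs those calls; both function names are hypothetical, and the filter name `nb-clean` is taken from the config snippets above.

```python
import subprocess


def filter_config_commands(clean_command: str, required: bool) -> list:
    """Build `git config` invocations for an nb-clean filter definition."""
    commands = [["git", "config", "filter.nb-clean.clean", clean_command]]
    if required:
        commands.append(["git", "config", "filter.nb-clean.required", "true"])
    return commands


def apply_filter_config(commands: list) -> None:
    """Run each command in the current repository, raising on failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

With `required` set, git is documented to treat a failing filter command as an error instead of passing the content through unchanged, which is the behaviour wanted for the VSCode case described above.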
Hi Scott, hope you're well!
I'm interested in using nb-clean in a project to fail CI when output is committed in a notebook, but I'd like to be permissive about metadata. I see that the --preserve-cell-metadata option allows you to permit cell metadata, but language_info.version cannot currently be permitted, if I understand correctly.
Would you consider adding a flag to permit this metadata?
Have you/someone made a pre-commit hook for this yet?
If not, are you happy for me to?
I currently use nb-clean in my pre-commit config like so:
---
default_language_version:
  python: python3
repos:
  - repo: https://github.com/srstevenson/nb-clean
    rev: 3.2.0
    hooks:
      - id: nb-clean
        args: [--preserve-cell-outputs, --remove-empty-cells]
It would be nice if the args could be moved to a pyproject.toml, so that running nb-clean clean locally (outside of pre-commit) could also use the same configuration.