cpburnz / python-pathspec
Utility library for gitignore style pattern matching of file paths.
License: Mozilla Public License 2.0
Consider the case of a directory structure like :
.
|-- directoryD
| |-- fileE
| `-- fileF
|-- directoryG
| |-- directoryH
| | |-- fileI
| | |-- fileJ
| | |-- fileK
| | |-- fileL
| | |-- fileM
| | `-- fileN
| `-- fileO
|-- fileA
|-- fileB
|-- fileC
`-- .gitignore
The contents of the .gitignore
file are :
fileB
directoryD/*
directoryG/*
Now if we use pathspec
to match all the files with the specs defined in .gitignore
then here are the responses of match_file(filepath)
function:
|-- directoryD : False
| |-- fileE : True
| `-- fileF : True
|-- directoryG : False
| `-- directoryH : True
| |-- fileI : False
| |-- fileJ : False
| |-- fileK : False
| |-- fileL : False
| |-- fileM : False
| `-- fileN : False
| `-- fileO : True
|-- fileA : False
|-- fileB : True
|-- fileC : True
`-- .gitignore
If you compare it with the behaviour of .gitignore inside a git repository, the files inside directoryH should all return True as a response to the match_file function. Or am I getting something wrong?
According to https://pypi.org/project/pathspec/ the latest version is 0.8.0, but there are no version git tags here.
Hey @cpburnz , thanks for the great lib!
In match_files()
(https://github.com/cpburnz/python-path-specification/blob/c00b332b2075548ee0c0673b72d7f2570d12ffe6/pathspec/pathspec.py#L170), the line
file_map = util.normalize_files(files, separators=separators)
(L190) requires files
to be completely exhausted before even the first file is matched. If files
is a list-like, this is not a problem, but when calling it from the tree_*()
methods it means that the whole iterator mechanics is pretty much useless.
It also means that if I have an ignored folder containing a very complex structure, which I want pathspec
to ignore, pathspec
will search through it although there is no way it will play a role in the results.
As an example, for an automation I'm writing on a real life repository containing a frontend application, the scan of npm generated files took about 10 minutes (before yielding the first result) and then I gave up and stopped it.
I think a possible solution is to remove this dictionary and simply do:
for file in files:
    if util.match_file(self.patterns, util.normalize_file(file)):
        yield file
(I bypassed util.match_files()
here as it, too, is not a generator and will try to convert files
to list first)
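The idea above can be sketched without pathspec at all. This is a minimal illustration, assuming the patterns are already compiled regular expressions; `match_files_lazily` is a hypothetical name, not a function of the library.

```python
import itertools
import re
from typing import Iterable, Iterator, Pattern


def match_files_lazily(regexes: Iterable[Pattern[str]], files: Iterable[str]) -> Iterator[str]:
    """Yield each matching file as soon as it is seen."""
    compiled = list(regexes)  # patterns are few; only the file stream stays lazy
    for file in files:
        norm = file.replace("\\", "/")  # treat "\" as a separator, like "/"
        if any(rx.search(norm) for rx in compiled):
            yield file


# An endless stream of names: a matcher that exhausts its input would never return.
endless = (f"src/file{i}.py" for i in itertools.count())
first = next(match_files_lazily([re.compile(r"\.py$")], endless))
print(first)
```

Because nothing is collected into a dictionary first, the first result is available after a single input item has been read.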
Hi. Thanks for the project.
The path spec concept seems to be inspired by gitignore, but I can't find a convenient way to use it as an actual ignore pattern.
That is, write a spec, and then find all files under a directory that do not match it
This may be considered a bug or "works as designed", depending on the interpretation of "followSymlinks", but wanted to point out a (design?) flaw in match_tree:
Suppose I want to archive a folder so I specify a very inclusive pattern: "Libs/"
Under 'Libs/' I have a Mac framework named 'Crmsdk'. Mac frameworks are really just directories with a particular structure inside but the interesting aspect for us is that it has symlinks. This is how Crmsdk.framework looks inside (first level):
lrwxr-xr-x 1 flo staff 23 Oct 1 17:13 CrmSdk -> Versions/Current/CrmSdk
lrwxr-xr-x 1 flo staff 26 Oct 1 17:13 Resources -> Versions/Current/Resources
drwxr-xr-x 5 flo staff 160 Jan 15 00:23 Versions
Notice that Resources
is a symlink pointing inside the Versions
folder.
Now, when I call the path_spec.match_tree() function, regardless of the value of followSymlinks
parameter, I can never get the Resources
folder as an entry in the result set. Which, someone might argue, is what I should expect because:
If followSymlinks is True, I will get entries like:
Libs/CrmSdk.framework/Resources/CrmRsc2
...
Libs/CrmSdk.framework/Versions/Current/Resources/CrmRsc1
Libs/CrmSdk.framework/Versions/Current/Resources/CrmRsc2
Which is normal because the code is following the symlinks and Resources
is just a symlink to Versions/Current/Resources
.
If followSymlinks is False, then Resources doesn't even show up in the results list as a folder after CrmSdk.framework; in other words, I get only these entries:
Libs/CrmSdk.framework/Versions/Current/Resources/CrmRsc2
I don't get any Libs/CrmSdk.framework/Resources
entries.
This makes it impossible for me to create an archive of the matched entries that I can unzip in another location and have recreated identically (because the symlinks are missing).
One may claim that this is a bug and when followSymlinks
is False, the Resources folder should be returned in the list of results (clients shouldn't make an assumption about the entries returned: they may be files, folders or symlinks - it's their job to properly handle it). Bottom line is that symlinks are missing in the results list regardless of the flag's value.
I am running unprivileged on a corporate laptop, and I encounter the following errors. It would be nice to detect this and skip the tests which depend on it.
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_0_check_symlink - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_1_check_realpath - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_2_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_3_sideways_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_4_recursive_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_5_recursive_circular_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_6_detect_broken_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_7_ignore_broken_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_8_no_follow_links - OSError: symbolic link privilege not held
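One way such detection could look, sketched with the standard library only. `can_symlink` and `ExampleSymlinkTest` are hypothetical names for illustration; they are not part of the pathspec test suite.

```python
import os
import tempfile
import unittest


def can_symlink() -> bool:
    """Return True if this process is allowed to create symbolic links."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target")
        link = os.path.join(tmp, "link")
        open(target, "w").close()
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError, AttributeError):
            # Raised on Windows without the symlink privilege, for example.
            return False
        return True


CAN_SYMLINK = can_symlink()


class ExampleSymlinkTest(unittest.TestCase):  # hypothetical test case
    @unittest.skipUnless(CAN_SYMLINK, "symbolic link privilege not held")
    def test_links(self):
        with tempfile.TemporaryDirectory() as tmp:
            target = os.path.join(tmp, "target")
            open(target, "w").close()
            link = os.path.join(tmp, "link")
            os.symlink(target, link)
            self.assertTrue(os.path.islink(link))
```

With this guard the symlink tests would be reported as skipped rather than failed on an unprivileged account.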
Hi Caleb:
Any reason why you use tabs rather than the conventional PEP 8 space-based indentation?
http://www.python.org/dev/peps/pep-0008/
Cordially
Hello @cpburnz, I've looked through existing issues but couldn't find this one. It was initially reported in yamllint issues: adrienverge/yamllint#334 (yamllint uses pathspec).
Considering these files:
file.yaml
dir/
└── dir/file.sql
dir.yaml/
└── dir.yaml/file.sql ← problem here
pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml'])
does match file.yaml
, dir.yaml
, but also dir.yaml/file.sql
. The latter isn't what I would expect, because *.yaml
feels like "files that end with .yaml", not "any file in any path that contains .yaml".
Is it the expected behavior? If yes, what is the correct pattern for "files that end with .yaml"?
Looking forward to your answer, and thanks for this very useful project 👍
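For comparison, the "files that end with .yaml" reading can be expressed outside pathspec by testing only the last path component. This is a sketch of that interpretation, not an answer about the correct gitignore pattern; `basename_matches` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def basename_matches(path: str, glob: str) -> bool:
    """Match `glob` against the final path component only."""
    return fnmatchcase(PurePosixPath(path).name, glob)


results = {
    path: basename_matches(path, "*.yaml")
    for path in ("file.yaml", "dir/file.sql", "dir.yaml/file.sql")
}
print(results)
```

Note that this check cannot tell a file from a directory: a bare "dir.yaml" would still match.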
I get the following deprecation warning
/usr/lib/python3.7/site-packages/pathspec/pathspec.py:27: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
self.patterns = patterns if isinstance(patterns, collections.Container) else list(patterns)
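The warning goes away when the abstract base class is taken from collections.abc. A minimal sketch of the same check, with `as_container` as a hypothetical stand-in for the constructor line quoted above:

```python
from collections import abc


def as_container(patterns):
    """Keep containers as they are; materialize one-shot iterables into a list."""
    return patterns if isinstance(patterns, abc.Container) else list(patterns)


original = ["*.py", "!keep.py"]
kept = as_container(original)                # a list is already a Container
built = as_container(iter(["*.py", "*.txt"]))  # an iterator is not, so it is copied
print(kept is original, built)
```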
^
is escaped, but it works with gitignore.
To test it:
mkdir test && cd test
git init
echo 'a[^b]c' > .gitignore
touch abc azc
git status -u --ignored
Notice that abc doesn't match but azc does match. pathspec's behavior with ^ is inconsistent with Git.
I read:
But I didn't understand why you would prefer to be compatible with POSIX and Python's fnmatch.translate when you could be compatible with gitignore itself. May I ask what led you to the decision to escape ^?
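The fnmatch behaviour mentioned here can be observed directly with the standard library. This only demonstrates Python's fnmatch module, not pathspec itself.

```python
from fnmatch import fnmatchcase

# "!" is fnmatch's negation marker inside a bracket expression.
bang_azc = fnmatchcase("azc", "a[!b]c")   # True: "z" is not "b"
bang_abc = fnmatchcase("abc", "a[!b]c")   # False: "b" is excluded

# A leading "^" is escaped by fnmatch, so the class means "^" or "b".
caret_abc = fnmatchcase("abc", "a[^b]c")  # True: "b" is in the class
caret_azc = fnmatchcase("azc", "a[^b]c")  # False: "z" is not in the class

print(bang_azc, bang_abc, caret_abc, caret_azc)
```

So with fnmatch the pattern a[^b]c gives the opposite answers to the Git session above, where azc is ignored and abc is not.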
The match_file
and match_files
functions do not seem to support pathlib.Path
objects, which would be convenient. If that is something that you want, I would be happy to work on a pull request.
Hi,
I was looking for a Python implementation to match files with the rules defined in .gitignore
files and this project is great!
My use case is to synchronize directories across a network and most of the control logic (filter, compare, update) is at the inode level to allow me to maximize the number of skipped elements (to not explore excluded directories for example).
I would like to update my current filter logic to support git patterns: given a list of patterns, is my file path matched or not ? The issue is that currently pathspec
seems to be heavily oriented around processing lists of paths, what if I have a single file ?
Here is what my current implementation boils down to:
spec = pathspec.PathSpec.from_lines(pathspec.GitIgnorePattern, patterns)
def match_file(file_path):
    return len(list(spec.match_files([file_path]))) > 0 # This should not be so complicated
is_ignored = match_file(u'testfile.py')
As you can see, it's pretty cumbersome: I have to create a collection with a single item, run the matcher and then extract the result.
Ideally, I would imagine that PathSpec exposes a match_file
function returning a boolean and match_files
(or filter_files
since it's currently acting as a filter ?) would just reuse it:
class PathSpec(object):
    # ...
    def match_file(self, file, separators=None): # Core logic
        norm, path = util.normalize_file(file, separators=separators) # Single file version
        is_matched = util.match_file(self.patterns, norm) # Single file version
        return is_matched # bool
    def match_files(self, files, separators=None): # Quality of life function: it just replaces a one line generator
        return (file for file in files if self.match_file(file, separators))
Basically, it boils down to the fact that the library does not expose single item functions to let me iterate over my files as I want, but hides a loop inside every function.
What do you think about adding better support for single file matching ? I am aware that due to the current architecture of the library, it would require some refactoring but I believe that it would be for the best. Could you implement it or should I do it and send a PR (since it's a big change, I'd rather wait for your feedback)
Side note: the real name of the gitignore matcher is wildmatch. How about adding this as an alias name when registering the pattern ? Your module deserves to be better referenced (I had some trouble finding it even though I knew what I was looking for).
@cpburnz I feel this repo should be just called pathspec
the same way you would install it from PyPI, as it is not the same as python-path-specification
.
Minor detail, but it confused me.
Dear Caleb,
I am a bit confused, please be so kind and help me.
Because git allows for nested .gitignore files, a base_path value is required for correct behavior. But I can't find it.
I played around and found some small glitches when using PathSpec.from_lines:
comment lines of an ignore file also create (empty) spec patterns; those should be skipped.
duplicate entries should also be skipped.
But most confusing is that there is no reference to a base directory. How would you handle nested .gitignore files?
yours sincerely
Robert
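One way a caller could supply the missing base directory is to make each path relative to the directory holding the .gitignore before matching. A sketch under that assumption; `relative_to_base` is a hypothetical helper and the paths are made up.

```python
from pathlib import PurePosixPath
from typing import Optional


def relative_to_base(path: str, base: str) -> Optional[str]:
    """Return `path` relative to `base`, or None if it lies outside `base`."""
    try:
        return PurePosixPath(path).relative_to(PurePosixPath(base)).as_posix()
    except ValueError:
        # The nested .gitignore in `base` does not apply to this path.
        return None


inside = relative_to_base("src/pkg/module.py", "src")
outside = relative_to_base("docs/index.md", "src")
print(inside, outside)
```

The relative string is what would then be handed to the spec built from that directory's .gitignore.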
Hello,
I maintain the python-pathspec
package for Fedora and CentOS, Rocky Linux, RHEL etc.
Since version 0.10.0, the setup.py
file was removed. I tried adapting our files to use python -m build
, bdist_wheel
, tox
, but after 40 minutes I fail to get it working.
Is there documentation on how to build and install the package now that setup.py
was removed? I couldn't find it in READMEs, changelogs and Git history.
What is the recommended way to:
python setup.py sdist
)/usr/lib/python3.x/site-packages/...
? (previously with 0.9.0: python setup.py install
)tox
is enough now, previously with 0.9.0: python setup.py test
)
Thanks in advance.
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.8.9, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
rootdir: /home/tkloczko/rpmbuild/BUILD/pathspec-0.8.1
plugins: forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, asyncio-0.14.0, expect-1.1.0, cov-2.11.1, mock-3.5.1, httpbin-1.0.0, xdist-2.2.1, flake8-1.0.7, timeout-1.4.2, betamax-0.8.1, pyfakefs-4.4.0, freezegun-0.4.2, flaky-3.7.0, cases-3.4.6, hypothesis-6.10.1, case-1.5.3, isort-1.3.0
collected 48 items
pathspec/tests/test_gitwildmatch.py ......................s.s... [ 58%]
pathspec/tests/test_pathspec.py ........ [ 75%]
pathspec/tests/test_util.py ............ [100%]
========================================================================= short test summary info ==========================================================================
SKIPPED [1] pathspec/tests/test_gitwildmatch.py:421: Python 3 is strict
SKIPPED [1] pathspec/tests/test_gitwildmatch.py:440: Python 3 is strict
====================================================================== 46 passed, 2 skipped in 0.13s =======================================================================
I encountered an exception in pathspec when
This line gives an error when the string "!\n" is passed, specifically due to a stray linebreak in the .gitignore file.
https://github.com/cpburnz/python-path-specification/blob/master/pathspec/patterns/gitwildmatch.py#L112
I am not sure what the expected behaviour is here, but it may be best to ignore rather than crash on the line.
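Until the library decides how to treat such a line, a caller could filter the lines before handing them over. A minimal sketch, assuming that blank lines, comment lines and a bare "!" carry no pattern; `usable_lines` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def usable_lines(lines: Iterable[str]) -> Iterator[str]:
    """Drop lines that cannot produce a pattern: blank, comment, or a bare "!"."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # blank line
        if line.startswith("#"):
            continue  # comment line
        if line.strip() == "!":
            continue  # negation marker with nothing to negate
        yield line


kept = list(usable_lines(["# build output\n", "\n", "!\n", "*.py\n", "!keep.py"]))
print(kept)
```

Order is preserved and nothing is deduplicated, because the position of a pattern can change which rule wins.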
Commit 7b125ac can be tagged with 0.8.1 for example.
This would be helpful for non-pip package managers such as chromebrew, which I'm packaging pathspec for right now.
Hi,
Since today, pathspec is used by yamllint, which is used by Ansible, OpenStack and others.
The problem is: all of this software must support Python 2.6, but pathspec currently doesn't. This leads to issues like adrienverge/yamllint#55 and ansible/ansible#26186.
In your opinion, what's the amount of work needed to support Python 2.6?
currently, the way to get a regex string for a given gitignore-style glob is:
>>> pathspec.GitIgnorePattern('/dist/').regex.pattern
'^dist/.*$'
which incurs the glob->regex translation inside GitIgnorePattern.__init__
which in turn calls RegexPattern.__init__
which automatically compiles the regex.
for the simple case of just wanting to convert a glob into a non-compiled regex string, it'd be great to have a utility function/method that could both be used inside GitIgnorePattern.__init__
and outside as part of the public API.
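The standard library has an analogous function for shell-style globs, fnmatch.translate, which returns the regex as a plain string and leaves compilation to the caller. It does not implement gitignore semantics; it only shows the shape such a utility could have.

```python
import fnmatch
import re

# translate() returns text, so nothing is compiled until the caller asks for it.
regex_text = fnmatch.translate("*.txt")
print(type(regex_text).__name__)

# Compilation is a separate, optional step.
compiled = re.compile(regex_text)
matches_txt = compiled.match("notes.txt") is not None
matches_md = compiled.match("notes.md") is not None
print(matches_txt, matches_md)
```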
I am trying to package my Python project/library. So first I created a pyproject.toml
file, following the official Python docs, using hatchling - no fancy stuff here. Then, after upgrading to the latest build
version, I ran this command from within the root folder of my project:
$ python -m build
* Creating venv isolated environment...
* Installing packages in isolated environment... (hatchling)
* Getting build dependencies for sdist...
* Building sdist...
Traceback (most recent call last):
File "/home/mfb/.local/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 351, in <module>
main()
File "/home/mfb/.local/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 333, in main
json_out['return_val'] = hook(**hook_input['kwargs'])
File "/home/mfb/.local/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 302, in build_sdist
return backend.build_sdist(sdist_directory, config_settings)
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/build.py", line 21, in build_sdist
return os.path.basename(next(builder.build(sdist_directory, ['standard'])))
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/plugin/interface.py", line 144, in build
artifact = version_api[version](directory, **build_data)
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/sdist.py", line 156, in build_standard
for included_file in self.recurse_included_files():
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/plugin/interface.py", line 168, in recurse_included_files
yield from self.recurse_project_files()
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/plugin/interface.py", line 182, in recurse_project_files
if self.config.include_path(relative_file_path, is_package=is_package):
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/config.py", line 82, in include_path
and (explicit or self.path_is_included(relative_path))
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/config.py", line 90, in path_is_included
return self.include_spec.match_file(relative_path)
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/pathspec/pathspec.py", line 176, in match_file
return self._match_file(self.patterns, norm_file)
File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/pathspec/gitignore.py", line 104, in _match_file
dir_mark = match.match.group('ps_d')
IndexError: no such group
ERROR Backend subprocess exited when trying to invoke build_sdist
I do have a .gitignore
file in my root folder, however, it includes only the default Python project template from Github.
Any ideas what is causing this error?
Hello @cpburnz, and thanks for maintaining pathspec 👍
Since release 0.12.0 and the change of signature of PathSpec.match_file()
from bool
to Tuple[Optional[bool], Optional[int]]
, yamllint tests fail. New errors look like:
self.assertEqual(c.ignore.match_file('test.yaml'), False)
AssertionError: None != False
(↑ c.ignore
is an instance of pathspec.PathSpec
)
These seem easy to fix, but I wanted to ask first. I don't understand why None
is a better return value? I guessed maybe it's a small bug?
I looked at all commits in this new release and found 92a9066 "Improve debugging". I read it, but it's not clear to me.
Looking forward to knowing more about this. Thanks! 🙂
Given a gitignore pattern like "\*sterisk
", Git (as of v2.36.1) will match a file named *sterisk
but will not match one named asterisk
. pathspec v0.9.0, meanwhile, matches both files.
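The expected behaviour can be illustrated with a tiny glob translator that treats a backslash as "take the next character literally". This is a simplified sketch, not pathspec's implementation; it handles only *, ? and backslash escapes, and `glob_to_regex` is a hypothetical name.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a simple glob to a regex string, honouring backslash escapes."""
    out = []
    i = 0
    while i < len(glob):
        ch = glob[i]
        if ch == "\\" and i + 1 < len(glob):
            out.append(re.escape(glob[i + 1]))  # escaped character is literal
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")  # any run of characters within one path segment
        elif ch == "?":
            out.append("[^/]")   # exactly one character within a segment
        else:
            out.append(re.escape(ch))
        i += 1
    return "^" + "".join(out) + "$"


escaped = re.compile(glob_to_regex(r"\*sterisk"))
literal_hit = escaped.match("*sterisk") is not None
wildcard_hit = escaped.match("asterisk") is not None
print(literal_hit, wildcard_hit)
```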
import pathspec
spec = pathspec.GitIgnoreSpec.from_lines([
    '*', # Ignore all files by default
    '!*/', # but scan all directories
    '!*.txt', # Text files
    '/test1/**' # ignore all in the directory
])
files = {
    'test1/b.bin',
    'test1/a.txt',
    'test1/c/c.txt',
    'test2/a.txt',
    'test2/b.bin',
    'test2/c/c.txt',
}
ignores = set(spec.match_files(files))
print(ignores)
{'test1/b.bin', 'test2/b.bin'}
It should be
{'test1/a.txt', 'test1/c/c.txt', 'test1/b.bin', 'test2/b.bin'}
I think GitIgnoreSpec._match_file should have something like:
if pattern.include is False and dir_mark and out_priority == 0:
    out_matched = pattern.include
    out_priority = priority
elif pattern.include is True and dir_mark:
    out_matched = pattern.include
    out_priority = priority
elif priority >= out_priority:
    out_matched = pattern.include
    out_priority = priority
When using PathSpec.match_tree, if a broken symlink is encountered, we get an unhandled exception. It's here:
I think you could fix it with os.lstat instead, but that's a backwards-incompatible change.
Perhaps iter_tree could have an option to not follow symlinks? You could follow the example of os.walk, which accepts a followlinks keyword arg, and also an onerror callback which can be used to handle problems such as permission errors when calling stat on each file.
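For reference, this is roughly what the os.walk-style interface looks like from the caller's side. It is a sketch built on os.walk itself, not on pathspec's iter_tree; `iter_tree_files` is a hypothetical name.

```python
import os
import tempfile


def iter_tree_files(root, followlinks=False, onerror=None):
    """Yield file paths relative to `root`, using "/" as the separator."""
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=followlinks, onerror=onerror):
        for name in filenames:
            full = os.path.join(dirpath, name)
            yield os.path.relpath(full, root).replace(os.sep, "/")


errors = []  # onerror receives the OSError raised while listing a directory
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.txt", "sub/b.txt"):
        open(os.path.join(root, rel), "w").close()
    found = sorted(iter_tree_files(root, onerror=errors.append))

print(found, errors)
```

Collecting errors in a list lets the caller decide afterwards whether a problem directory is fatal.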
Hello, I am using this library and noticing some odd results around white space. When the white space is leading or trailing, I do not get any matches.
# Leading whitespace does not match
>>> pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, [' whitespace.txt']).match_file(' whitespace.txt')
False
# Trailing whitespace does not match
>>> pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ['whitespace.txt ']).match_file('whitespace.txt ')
False
It seems to have no problem with whitespace internal to a string.
>>> pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ['white space.txt']).match_file('white space.txt')
True
Can you help me understand what is going on here? Both whitespace.txt
and whitespace.txt
are valid file names so I would like to figure out why I can't match them.
When using .gitignore I could do the following:
*.log
!important/*.log
trace.*
This would exclude all log files but then include important/*log. Pathspec doesn't work this way, was this intentional?
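The rule being relied on is that the last matching pattern decides the outcome. A toy evaluator makes that concrete; it uses fnmatch as a stand-in matcher, so it is not gitignore-accurate (for instance, it ignores the rule that a file cannot be re-included when its parent directory is excluded). `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

LINES = ["*.log", "!important/*.log", "trace.*"]


def is_ignored(path: str, lines=LINES) -> bool:
    """Apply patterns in order; the last one that matches wins."""
    ignored = False
    for line in lines:
        negated = line.startswith("!")
        glob = line[1:] if negated else line
        # Patterns without "/" are also tried against the file name alone.
        name = path.rsplit("/", 1)[-1]
        if fnmatchcase(path, glob) or ("/" not in glob and fnmatchcase(name, glob)):
            ignored = not negated
    return ignored


verdicts = {p: is_ignored(p) for p in ("debug.log", "important/a.log", "trace.txt", "readme.md")}
print(verdicts)
```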
Behavior of pattern "match everything" changed when upgrading from version 0.9.0 to 0.10.0:
Version 0.9.0
>>> import pathspec
>>> patterns = pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ["*"])
>>>
>>> print(patterns.match_file('anydir/file.txt'))
True
Result is as expected
Version 0.10.0
>>> import pathspec
>>> patterns = pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ["*"])
>>>
>>> print(patterns.match_file('anydir/file.txt'))
False
Result is wrong. *
should match any file, but it doesn't match anydir/file.txt
.
Discovered when checking case "match everything except for one directory":
patterns = pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ["*", "!product_dir/"])
On version 0.9.0 it correctly matches file outside of product_dir
and doesn't match file inside:
>>> print(patterns.match_file('anydir/file.txt'))
True
>>> print(patterns.match_file('product_dir/file.txt'))
False
On version 0.10.0 it doesn't match in either case:
>>> print(patterns.match_file('anydir/file.txt'))
False
>>> print(patterns.match_file('product_dir/file.txt'))
False
This library is very cool and I want to use it in a project. However, I'm running up against a severe limitation: I can't tell which part of a filename was matched.
Consider this example: I have several match patterns (let's call them SpecEntries). They specify what I'm looking for and, optionally, what the matched thing should be remapped to. Example:
'/Documentation/': 'docs/',
'/*.html': 'docs/',
'**/Examples/SDK/': 'docs/'
In the above examples, all those patterns on the left side are remapped to a 'docs/' folder.
Now I'm using match_tree to iterate a directory and compare against the patterns specified by my SpecEntries.
The first issue is that the results returned by match_tree don't specify which pattern matched which file. I worked around this by iterating through my patterns, compiling each one and calling match_files against it. Doable but inefficient (consider this an improvement request).
After the matches are returned I'd like to remap those paths according to the right hand side of the SpecEntry for example:
'/Documentation/foo.txt' -> 'docs/foo.txt'
'foo.html': 'docs/foo.html'
'blah/Examples/SDK/bar.txt' -> 'docs/bar.txt'
The problem is that there isn't an API that will allow me to do this. pathspec library knows which part of the path matched my specifier but it doesn't expose that information to me so I can't do this remapping. (I considered using regular expressions or fnmatch but they won't easily match pathspec's capabilities - for example no easy way to match '**')
Is it possible to expose the matching logic in the library API so callers can implement this kind of remapping feature?
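To make the request concrete, here is what the remapping could look like if each rule exposed a regex with a named group for the unmatched remainder. The regexes below are hypothetical, hand-written stand-ins for the three entries above, and the paths are written relative to the root; nothing here is existing pathspec API.

```python
import re
from typing import Optional, Tuple

# (regex with a "rest" group, destination prefix), one per SpecEntry.
RULES = [
    (re.compile(r"^Documentation/(?P<rest>.+)$"), "docs/"),
    (re.compile(r"^(?P<rest>[^/]+\.html)$"), "docs/"),
    (re.compile(r"^(?:.*/)?Examples/SDK/(?P<rest>.+)$"), "docs/"),
]


def remap(path: str) -> Optional[Tuple[int, str]]:
    """Return (index of the matching rule, remapped path), or None."""
    for index, (regex, prefix) in enumerate(RULES):
        match = regex.match(path)
        if match:
            return index, prefix + match.group("rest")
    return None


for source in ("Documentation/foo.txt", "foo.html", "blah/Examples/SDK/bar.txt"):
    print(source, "->", remap(source))
```

Returning the rule index alongside the new path covers both requests: which pattern matched, and which part of the path it consumed.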
Either I'm missing something or exclusions are not working properly.
Check this test case:
What it is currently:
def test_02_dir_exclusions(self):
    """
    Test directory exclusions.
    """
    spec = GitIgnoreSpec.from_lines([
        '*.txt',
        '!test1/',
    ])
    files = {
        'test1/a.txt',
        'test1/b.bin',
        'test1/c/c.txt',
        'test2/a.txt',
        'test2/b.bin',
        'test2/c/c.txt',
    }
    results = list(spec.check_files(files))
    ignores = get_includes(results)
    debug = debug_results(spec, results)
    self.assertEqual(ignores, {
        'test1/a.txt',
        'test1/c/c.txt',
        'test2/a.txt',
        'test2/c/c.txt',
    }, debug)
    self.assertEqual(files - ignores, {
        'test1/b.bin',
        'test2/b.bin',
    }, debug)
What I would expect:
def test_02_dir_exclusions(self):
    """
    Test directory exclusions.
    """
    spec = GitIgnoreSpec.from_lines([
        '*.txt',
        '!test1/',
    ])
    files = {
        'test1/a.txt',
        'test1/b.bin',
        'test1/c/c.txt',
        'test2/a.txt',
        'test2/b.bin',
        'test2/c/c.txt',
    }
    results = list(spec.check_files(files))
    ignores = get_includes(results)
    debug = debug_results(spec, results)
    self.assertEqual(ignores, {
        'test2/a.txt',
        'test2/c/c.txt',
    }, debug)
    self.assertEqual(files - ignores, {
        'test1/b.bin',
        'test2/b.bin',
        'test1/a.txt',
        'test1/b.bin',
        'test1/c/c.txt',
    }, debug)
Here is the demo for pathspec
:
from pathlib import Path
import pathspec
exclude_lines=["*", "!libfoo", "!libfoo/**"]
exclude_spec = pathspec.GitIgnoreSpec.from_lines(exclude_lines)
print(exclude_spec.match_file(Path("./libfoo/__init__.py")))
In pathspec==0.11.2
the result is True
, which means ./libfoo/__init__.py
is excluded.
Another demo to check the behavior of Git
:
The directory structure looks like this:
demo-project
├── libfoo
│ └── __init__.py
└── .gitignore
Then check whether the file is ignored with the command git check-ignore -v ./libfoo/__init__.py
No output is expected, which means ./libfoo/__init__.py is not excluded.
I suppose this difference is caused by the priority
between directory and file patterns.
python-pathspec/pathspec/gitignore.py
Lines 124 to 136 in 878be22
Including pattern !/libfoo/**
is treated as a directory pattern and will be overridden by the excluding file pattern *
, thus the ./libfoo/__init__.py
will get excluded.
p.s. I've googled this priority mechanism for several hours, but still could not find any documentation mentioning it.
Other platform info:
pathspec 0.11.2
Git 2.40.1.windows.1
Windows 11 22H2 (22621.2134)
Python 3.11.4
It is 2020: Pythonistas have largely adopted the pathlib module as a more convenient way to perform file system operations than old stinkin' os.path and its comrades.
Hence it would be great if all of pathspec's functions would accept Path objects or iterables thereof as arguments, as well as plain strings.
Putting a conversion to a string via str(my_path) at the right places is all it takes to make it work.
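The conversion could also go through os.fspath, which accepts both strings and path objects, so existing string callers keep working. A minimal sketch; `to_strings` is a hypothetical helper and the file names are made up.

```python
import os
from pathlib import PurePosixPath


def to_strings(paths):
    """Convert an iterable of str or os.PathLike objects to plain strings."""
    return [os.fspath(p) for p in paths]


converted = to_strings([PurePosixPath("src/app.py"), "README.md"])
print(converted)
```

Unlike str(), os.fspath raises TypeError for objects that are not paths at all, which can catch accidental arguments early.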
Consider the use case when you have a directory structure like this :
.
|-- 0.csv
|-- A
| `-- 1.csv
`-- .gitignore
and the contents of .gitignore
are :
*.csv
!A/0.csv
ignorespec.match_file("A/0.csv")
returns True
, which is expected.
while
ignorespec.match_file("./A/0.csv")
returns False
.
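A possible workaround on the caller's side is to normalize the path before matching, so that both spellings become the same string. This is a sketch, not a statement about how pathspec should behave; `normalize_for_match` is a hypothetical helper.

```python
import posixpath


def normalize_for_match(path: str) -> str:
    """Collapse "./" and duplicate slashes, keeping a trailing "/" if present."""
    norm = posixpath.normpath(path)
    # normpath drops a trailing slash, which directory patterns may rely on.
    if path.endswith("/") and not norm.endswith("/"):
        norm += "/"
    return norm


plain = normalize_for_match("A/0.csv")
dotted = normalize_for_match("./A/0.csv")
directory = normalize_for_match("./A/")
print(plain, dotted, directory)
```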
setup.cfg refers to https://github.com/cpburnz/python-path-specification/blob/master/pathspec_meta.py with
version: attr: pathspec_meta.__version__
On a git checkout on Windows, unless git config core.symlinks
was enabled explicitly, the following occurs when trying to install the package.
Traceback (most recent call last):
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 40, in __getattr__
for statement in self.module.body
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 391, in _parse_attr
return getattr(StaticModule(module_name), attr_name)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 47, in __getattr__
"{self.name} has no attribute {attr}".format(**locals()))
AttributeError: pathspec_meta has no attribute __version__
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "setup.py", line 5, in <module>
setup()
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\__init__.py", line 144, in setup
return distutils.core.setup(**attrs)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\distutils\core.py", line 121, in setup
dist.parse_config_files()
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\dist.py", line 690, in parse_config_files
ignore_option_errors=ignore_option_errors)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 161, in parse_configuration
meta.parse()
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 467, in parse
section_parser_method(section_options)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 440, in parse_section
self[name] = value
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 224, in __setitem__
value = parser(value)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 556, in _parse_version
version = self._parse_attr(value, self.package_dir)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 394, in _parse_attr
module = importlib.import_module(module_name)
File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\importlib\__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "C:\Users\vandjohn\gh\python-path-specification\pathspec_meta.py", line 1, in <module>
pathspec/_meta.py
NameError: name 'pathspec' is not defined
Using the following in setup.cfg works for me, but I suspect there was some problem with this (possibly a setuptools issue?) which the symlink was intended to workaround.
version: attr: pathspec._meta.__version__
@cpburnz Thanks for putting this neat little highly useful python package out there.
I like to work with files featuring weird little symbols inside their names, to make life easier for the Spanish-speaking users of my package copier
: copier-org/copier#118 (comment)
I would need pathspec
to handle files like: ñana.txt
which currently never match and hence are never ignored.
A possible solution would be to use the regex library, which is a drop-in replacement for Python's built-in re
lib, but featuring full Unicode support among other power-ups.
If you would accept a dependency to your package I would be ready to open a PR implementing the required changes.
gitignore file is as follows.
a
b
spam/**
**/api/
**/
After converting each line to a regex, I have a method called list_path that scans the Python files in the current directory and returns the ones that do not match the patterns found in the gitignore. The following test failed in the Windows environment. Does pathspec support Windows?
Hi,
I have a simple case:
pattern = "cargo/"
spec = pathspec.PathSpec.from_lines('gitwildmatch', pattern)
return spec.match_file("cargo")
=> False
I was expecting True
edit:
cargo/a
is ignored as expected
I am currently investigating a problem in the black
formatter which uses pathspec
to figure out ignored files by parsing .gitignore
files. black
uses this library and tries to check if directories are ignored before checking their contents.
It boils down to this minimum example:
from pathlib import Path
import pathspec
spec = pathspec.PathSpec.from_lines('gitwildmatch', """
/important/
""".splitlines())
assert spec.match_file("important/bar.log")
assert spec.match_file("/important/")
assert spec.match_file("important/")
assert spec.match_file("important") # does not match
assert spec.match_file(Path("important/")) # does not match as pathlib removes trailing slash
It matches the file important/bar.log as expected. However, it does not match the folder when it has no trailing slash.
This becomes especially problematic when using Path objects, as the trailing slash is removed by pathlib.
I don't know where this needs to be fixed. What do you think? Is black using it wrong? Should the match_file() method check whether the given Path object is a directory and add the trailing slash on its own before checking?
It would certainly be possible to implement a workaround in black: check with Path.is_dir(), convert to str, and append the slash...
FYI, the place where the check happens in black is here. relative_path is a pathlib.Path and can be a directory.
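The workaround mentioned above (check Path.is_dir(), convert to str, append the slash) could look roughly like this. It is a sketch, not code from black or pathspec; the helper name to_match_string is hypothetical, and in real use the path would first be made relative to the directory that holds the .gitignore.

```python
import tempfile
from pathlib import Path


def to_match_string(path: Path) -> str:
    """Return a POSIX-style string, with a trailing slash for directories."""
    text = path.as_posix()
    # pathlib drops trailing slashes, so restore one for existing directories.
    if path.is_dir() and not text.endswith("/"):
        text += "/"
    return text


with tempfile.TemporaryDirectory() as tmp:
    important = Path(tmp) / "important"
    important.mkdir()
    log_file = important / "bar.log"
    log_file.touch()
    print(to_match_string(important))  # ends with "important/"
    print(to_match_string(log_file))   # ends with "important/bar.log"
```

The resulting string keeps the trailing slash that a directory-only pattern such as /important/ relies on.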
The following program illustrates some possible variations:
#!/usr/bin/python3
import pathspec
import os
print("A. String pattern, string path:\n ", end="")
try:
    s = pathspec.PathSpec.from_lines('gitwildmatch', ['*.py'])
    for p in os.listdir('pathspec'):
        if s.match_file(p):
            print(p, end=" ")
    print()
except Exception as e:
    print("FAILED '%s'" % e)
print("B. String pattern, binary path:\n ", end="")
try:
    s = pathspec.PathSpec.from_lines('gitwildmatch', ['*.py'])
    for p in os.listdir(b'pathspec'):
        if s.match_file(p):
            print(p, end=' ')
    print()
except Exception as e:
    print("FAILED '%s'" % e)
print("C. String pattern, binary path + surrogateescape:\n ", end="")
try:
    s = pathspec.PathSpec.from_lines('gitwildmatch', ['*.py'])
    for p in os.listdir(b'pathspec'):
        if s.match_file(p.decode('utf8','surrogateescape')):
            print(p, end=' ')
    print()
except Exception as e:
    print("FAILED '%s'" % e)
print("D. Binary pattern, binary path:\n ", end="")
s = pathspec.PathSpec.from_lines('gitwildmatch', [b'*.py'])
try:
    for p in os.listdir(b'pathspec'):
        if s.match_file(p):
            print(p, end=' ')
    print()
except Exception as e:
    print("FAILED '%s'" % e)
This gives the following result when run in the source directory:
A. String pattern, string path:
util.py pattern.py pathspec.py __init__.py compat.py
B. String pattern, binary path:
FAILED 'cannot use a string pattern on a bytes-like object'
C. String pattern, binary path + surrogateescape:
b'util.py' b'pattern.py' b'pathspec.py' b'__init__.py' b'compat.py'
D. Binary pattern, binary path:
IMHO examples A-C behave as expected, while example D neither matches any files nor complains about the pattern.
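Variation C decodes byte paths by hand. A small sketch of the same idea using os.fsdecode, which decodes bytes with the filesystem encoding and its error handler (surrogateescape on POSIX systems) and returns str input unchanged. The helper name to_text_path is hypothetical.

```python
import os
from typing import Union


def to_text_path(path: Union[str, bytes]) -> str:
    """Return path as str, decoding bytes the way the os module does."""
    # os.fsdecode leaves str unchanged and decodes bytes with the
    # filesystem encoding and its error handler.
    return os.fsdecode(path)


for name in [b"util.py", "pattern.py"]:
    print(to_text_path(name))
```

Passing every path through such a helper before match_file() would avoid mixing string patterns with bytes paths, as in variation B.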
Some patterns work with both absolute and relative paths.
For example,
>>> list(pathspec.patterns.GitWildMatchPattern("*.py").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['/foo/a.py', 'foo/a.py', 'x/foo/a.py', '/x/foo/a.py']
>>> list(pathspec.patterns.GitWildMatchPattern("**").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['/foo/a.py', 'foo/a.py', 'x/foo/a.py', '/x/foo/a.py']
However, the pattern foo or /foo won't match a path that starts with /foo.
For example,
>>> list(pathspec.patterns.GitWildMatchPattern("foo").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['foo/a.py', 'x/foo/a.py', '/x/foo/a.py']
>>> list(pathspec.patterns.GitWildMatchPattern("/foo").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['foo/a.py']
Can we support matching the absolute path in this case?
I think changing output.append('(?:.+/)?') here to output.append('(?:.*/)?') could solve the issue, but I am not sure whether it has other unwanted side effects.
Another solution is to normalize /... to ..., similar to the normalization of ./ here.
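The difference between the two prefixes can be seen with plain re. The two regexes below are simplified stand-ins for what the pattern foo might compile to; they are assumptions for illustration, not pathspec's exact output.

```python
import re

# Simplified stand-ins for the regex built from the pattern "foo";
# they are assumptions for illustration, not pathspec's exact output.
current = re.compile(r"^(?:.+/)?foo(?:/.*)?$")   # prefix needs >= 1 char before "/"
proposed = re.compile(r"^(?:.*/)?foo(?:/.*)?$")  # prefix may be a bare "/"

paths = ["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]

current_hits = [p for p in paths if current.match(p)]
proposed_hits = [p for p in paths if proposed.match(p)]

print(current_hits)
print(proposed_hits)
```

With .+ the optional prefix cannot be just a single leading slash, which is why /foo/a.py is the only path left out of the first list.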
It appears that pathspec is not handling exclusions correctly for directories without a trailing slash:
import pathspec
gitignore = """\
build
!/foo/build
"""
spec = pathspec.PathSpec.from_lines("gitwildmatch", gitignore.splitlines())
# incorrectly returns True
print(spec.match_file("foo/build/file.py"))
gitignore = """\
build
!/foo/build/
"""
spec = pathspec.PathSpec.from_lines("gitwildmatch", gitignore.splitlines())
# correctly returns False
print(spec.match_file("foo/build/file.py"))
If you try doing the same with .gitignore, it works correctly:
git init repro
cd repro
echo $'build\n!/foo/build' > .gitignore
mkdir build
touch build/file.py
mkdir -p foo/build
touch foo/build/file.py
Running git status gives:
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .gitignore
        foo/

nothing added to commit but untracked files present (use "git add" to track)
And running git ls-files --others --exclude-standard gives:
.gitignore
foo/build/file.py
Hello.
We use pathspec in another project, and I have a performance question.
With a long list of rules (dozens) matched against a large number of files (hundreds of thousands), match_file takes a long time. Is there any way to improve its performance?
For example, by using one big regex instead of multiple small ones.
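To make the "one big regex" idea concrete, here is a rough sketch with the standard library only. It is not how pathspec works: fnmatch globs stand in for gitignore patterns, negation and anchoring are ignored, and the rule list is hypothetical. Whether a single alternation is actually faster depends on the patterns, so it would need to be measured.

```python
import fnmatch
import re

# Hypothetical rule list; fnmatch globs stand in for gitignore patterns
# here and do NOT follow gitignore semantics (no negation, no anchoring).
rules = ["*.pyc", "build/*", "*.log"]

# One alternation of all translated rules, compiled a single time.
combined = re.compile("|".join(f"(?:{fnmatch.translate(r)})" for r in rules))


def is_ignored(path: str) -> bool:
    """Return True if any rule matches the whole path."""
    return combined.match(path) is not None


print([p for p in ["a/b.pyc", "build/x.o", "src/main.py"] if is_ignored(p)])
```

Handling negated rules this way is harder, because the result then depends on which rule matched last rather than on whether any rule matched.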
Could you please consider switching the build system from setuptools to flit_core? This would help Linux distributions such as Gentoo avoid cyclic dependencies that make bootstrapping unbundled setuptools a real pain. If you agree, I can submit a pull request doing the conversion.
The problem is that the most recent release of setuptools (66.0.0) started using platformdirs. platformdirs uses the hatchling build backend, which in turn requires this package. This creates a dependency cycle: we can't install setuptools before installing platformdirs, and we can't build platformdirs before all of hatchling's dependencies are installed, so we effectively end up needing setuptools to build them.
flit_core has "no dependencies [except for tomli, on Python < 3.11]" by design, so it makes bootstrapping packages much easier.
I'm using the pattern
"""
!test1/
*.txt
"""
to scan a folder. According to the gitignore description
(http://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository)
build/
this should exclude everything under the test1 folder, but the run result shows that the line does not work: the result still shows all contents under the test1 folder.
[An Example Folder Structure]
d:\dev\eclipse_workspace\test_scan\src\test1\a.txt
d:\dev\eclipse_workspace\test_scan\src\test1\b.txt
d:\dev\eclipse_workspace\test_scan\src\test1\c\c1.txt
d:\dev\eclipse_workspace\test_scan\src\test2\a.txt
d:\dev\eclipse_workspace\test_scan\src\test2\b.txt
d:\dev\eclipse_workspace\test_scan\src\test2\c\c1.txt
[Code]
SCAN_PATTERN = """
!test1/
*.txt
"""
spec = pathspec.PathSpec.from_lines(pathspec.GitIgnorePattern, SCAN_PATTERN.splitlines())
spec.match_tree(r'd:\dev\eclipse_workspace\test_scan')  # raw string so that \t is not read as a tab
[Run Result]
d:\dev\eclipse_workspace\test_scan\src\test1\a.txt
d:\dev\eclipse_workspace\test_scan\src\test1\b.txt
d:\dev\eclipse_workspace\test_scan\src\test1\c\c1.txt
d:\dev\eclipse_workspace\test_scan\src\test2\a.txt
d:\dev\eclipse_workspace\test_scan\src\test2\b.txt
d:\dev\eclipse_workspace\test_scan\src\test2\c\c1.txt
Can you please have a look? Thanks.
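A side note on the path passed to match_tree: Windows paths written as ordinary Python string literals are easy to get wrong, because sequences such as \t are escape codes. A minimal sketch of the difference a raw string makes (the shortened path is only an illustration):

```python
plain = "workspace\test_scan"   # "\t" is read as a TAB character
raw = r"workspace\test_scan"    # raw string keeps the backslash and the "t"

print(repr(plain))
print(repr(raw))
print("\t" in plain, "\t" in raw)
```

Forward slashes also work in Windows paths and avoid the escaping question altogether.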
It would be great to have a method for escaping a string according to gitwildmatch, that is, putting backslashes before !, [, ], ?, and *.
I could send a PR if there's interest in this.
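A minimal sketch of what such a helper might look like, assuming that exactly the characters listed above need escaping. The function name escape_gitwildmatch is hypothetical and this is not pathspec code; it does not treat backslashes or a leading # specially.

```python
import re

# Characters the request above asks to escape; treating exactly this set
# as special is an assumption taken from the issue text.
_SPECIAL = re.compile(r"([!\[\]?*])")


def escape_gitwildmatch(text: str) -> str:
    """Prefix each special character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


print(escape_gitwildmatch("notes[1]?.txt"))
```

Strings without any of these characters are returned unchanged, so the helper is safe to apply to every literal file name before it is used as a pattern.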
pathspec provides type annotations but does not ship a py.typed marker file, so type checkers (like mypy and pyright) think stubs are missing.
https://peps.python.org/pep-0561/#packaging-type-information
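PEP 561 identifies a typed package by an empty py.typed file inside the package directory. A small sketch of checking for that marker, using a temporary, hypothetical package directory rather than an installed copy of pathspec (the helper name has_py_typed is also hypothetical):

```python
import tempfile
from pathlib import Path


def has_py_typed(package_dir: Path) -> bool:
    """Return True if the package directory contains a py.typed marker."""
    return (package_dir / "py.typed").is_file()


with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"      # hypothetical package directory
    pkg.mkdir()
    (pkg / "__init__.py").touch()
    before = has_py_typed(pkg)        # no marker yet
    (pkg / "py.typed").touch()        # PEP 561 marker is an empty file
    after = has_py_typed(pkg)

print(before, after)
```

For the marker to reach users, the file also has to be included in the package data of the built distribution.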
It took me quite a bit of time to figure this one out: if you have an exclusion pattern before an inclusion one, it won't work. The reason is that match_files() processes the patterns in whatever order they are specified:
for pattern in patterns:
    if pattern.include is not None:
        result_files = pattern.match(all_files)
        if pattern.include:
            return_files.update(result_files)
        else:
            return_files.difference_update(result_files)
return return_files
If the exclusion pattern is processed first, the difference_update won't have any effect since return_files will be empty. To fix that, I separated the patterns into two lists and process the inclusion ones first:
include_patterns, exclude_patterns = partition(patterns, lambda p: p.include)
for pattern in include_patterns:
    result_files = pattern.match(all_files)
    return_files.update(result_files)
for pattern in exclude_patterns:
    result_files = pattern.match(all_files)
    return_files.difference_update(result_files)
return return_files
where partition is defined as:
def partition(data, pred):
    """Partitions the data according to the predicate. Returns a tuple of lists (yes, no) with the partitioned elements."""
    yes, no = [], []
    for d in data:
        (yes if pred(d) else no).append(d)
    return yes, no
Of course you may fix this differently (for example, you may "sort" the patterns such that the exclusion ones come after the inclusion ones).
HTH
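The order dependence described above can be reproduced without pathspec. In this sketch, Rule and apply_in_order are hypothetical stand-ins that mimic the quoted loop with fnmatch globs; they are not the library's classes.

```python
import fnmatch
from typing import Iterable, List, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical stand-in for a pattern: a glob plus an include flag."""
    glob: str
    include: bool


def apply_in_order(rules: Iterable[Rule], files: List[str]) -> Set[str]:
    """Apply rules one by one, as in the loop quoted above."""
    result: Set[str] = set()
    for rule in rules:
        matched = set(fnmatch.filter(files, rule.glob))
        if rule.include:
            result |= matched
        else:
            result -= matched
    return result


files = ["a.txt", "b.txt", "keep.txt"]
include_all = Rule("*.txt", True)
exclude_keep = Rule("keep.txt", False)

exclusion_last = apply_in_order([include_all, exclude_keep], files)
exclusion_first = apply_in_order([exclude_keep, include_all], files)

print(sorted(exclusion_last))
print(sorted(exclusion_first))
```

When the exclusion comes first it is applied to an empty set, so keep.txt ends up in the result. Note that .gitignore itself is order-sensitive as well (a later pattern overrides an earlier one), so reordering patterns changes behaviour for inputs that rely on that.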