
cpburnz / python-pathspec

Utility library for gitignore style pattern matching of file paths.

License: Mozilla Public License 2.0

Shell 0.61% Python 97.64% Makefile 1.75%
python gitignore-patterns wildmatch

python-pathspec's People

Contributors

adrienverge, avasam, bitdeli-chef, boogles, cpburnz, dahlia, davidfraser, dcecile, ghickman, groodt, highb, hugovk, ichard26, jdufresne, kolanich, kurtmckee, mgorny, mikexstudios, nhhollander, nhumrich, pykong, sebastiaanz, tirkarthi, tomruk

python-pathspec's Issues

files inside an ignored sub-directory are not matched

Consider the case of a directory structure like :

.
|-- directoryD
|   |-- fileE
|   `-- fileF
|-- directoryG
|   |-- directoryH
|   |   |-- fileI
|   |   |-- fileJ
|   |   |-- fileK
|   |   |-- fileL
|   |   |-- fileM
|   |   `-- fileN
|   `-- fileO
|-- fileA
|-- fileB
|-- fileC
`-- .gitignore

The contents of the .gitignore file are :

fileB
directoryD/*
directoryG/*

Now if we use pathspec to match all the files with the specs defined in .gitignore then here are the responses of match_file(filepath) function:

|-- directoryD : False
|   |-- fileE : True
|   `-- fileF : True
|-- directoryG : False
|   |-- directoryH : True
|   |   |-- fileI : False
|   |   |-- fileJ : False
|   |   |-- fileK : False
|   |   |-- fileL : False
|   |   |-- fileM : False
|   |   `-- fileN : False
|   `-- fileO : True
|-- fileA : False
|-- fileB : True
|-- fileC : True
`-- .gitignore

If you compare this with the behaviour of .gitignore inside a Git repository, the files inside directoryH should all return True from match_file. Or am I getting something wrong?
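Git does not descend into an ignored directory, so everything below it is ignored implicitly. A caller that matches paths one at a time can approximate this by also testing each ancestor directory. The sketch below is only an illustration: `matches` is a hypothetical stand-in for a per-path matcher such as match_file, hard-coded with a subset of the results listed above, and `is_ignored` is not part of pathspec.

```python
from pathlib import PurePosixPath
from typing import Callable

# A subset of the paths the per-path matcher reported as True above.
REPORTED_MATCHES = {
    "fileB",
    "directoryD/fileE",
    "directoryD/fileF",
    "directoryG/directoryH",
    "directoryG/fileO",
}


def matches(path: str) -> bool:
    """Hypothetical stand-in for a per-path matcher such as match_file()."""
    return path in REPORTED_MATCHES


def is_ignored(path: str, matcher: Callable[[str], bool] = matches) -> bool:
    """Return True if the path or any of its ancestor directories matches."""
    pure = PurePosixPath(path)
    # For "a/b/c", pure.parents yields "a/b", "a" and "."; skip the final ".".
    candidates = [pure, *pure.parents]
    return any(matcher(str(c)) for c in candidates if str(c) != ".")
```

With this helper, directoryG/directoryH/fileI counts as ignored because its parent directoryG/directoryH matches, even though the file itself does not.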

`match_files()` is not a pure generator function, and it impacts `tree_*()` gravely

Hey @cpburnz , thanks for the great lib!
In match_files() (https://github.com/cpburnz/python-path-specification/blob/c00b332b2075548ee0c0673b72d7f2570d12ffe6/pathspec/pathspec.py#L170), the line

file_map = util.normalize_files(files, separators=separators)

(L190) requires files to be completely exhausted before even the first file is matched. If files is list-like, this is not a problem, but when match_files() is called from the tree_*() methods, it defeats the purpose of the iterator entirely.
It also means that if I have an ignored folder containing a very complex structure, pathspec will still search through all of it, even though nothing inside can affect the results.

As an example, for an automation I'm writing on a real life repository containing a frontend application, the scan of npm generated files took about 10 minutes (before yielding the first result) and then I gave up and stopped it.

I think a possible solution is to remove this dictionary and simply do:

for file in files:
  if util.match_file(self.patterns, util.normalize_file(file)):
    yield file

(I bypassed util.match_files() here as it, too, is not a generator and will try to convert files to list first)
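To make the difference concrete, here is a minimal, self-contained sketch of a lazy filter. It does not use pathspec: `match_files_lazily`, `is_match` and `endless_paths` are hypothetical names, and the predicate stands in for the pattern check. Because nothing is collected up front, the first result comes back even from an unbounded input.

```python
from typing import Callable, Iterable, Iterator


def match_files_lazily(
    files: Iterable[str], is_match: Callable[[str], bool]
) -> Iterator[str]:
    """Yield matching paths one at a time without materializing `files`."""
    for file in files:
        if is_match(file):
            yield file


def endless_paths() -> Iterator[str]:
    """An unbounded source; an eager implementation would never return."""
    index = 0
    while True:
        yield f"node_modules/pkg{index}/index.js"
        index += 1


# Only as many paths are consumed as are needed to find the first match.
first = next(
    match_files_lazily(endless_paths(), lambda p: p.endswith("3/index.js"))
)
```

Laziness alone does not stop the walker from descending into ignored folders; that would additionally require pruning directories during the tree walk.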

match_files with negated path spec

Hi. Thanks for the project.

The path spec concept seems to be inspired by gitignore, but I can't find a convenient way to use it as an actual ignore pattern.
That is, write a spec, and then find all files under a directory that do not match it.

match_tree doesn't return the symlinks regardless of the followSymlinks parameter

This may be considered a bug or "works as designed", depending on the interpretation of "followSymlinks", but wanted to point out a (design?) flaw in match_tree:

Suppose I want to archive a folder so I specify a very inclusive pattern: "Libs/"

Under 'Libs/' I have a Mac framework named 'Crmsdk'. Mac frameworks are really just directories with a particular structure inside but the interesting aspect for us is that it has symlinks. This is how Crmsdk.framework looks inside (first level):

lrwxr-xr-x  1 flo  staff    23 Oct  1 17:13 CrmSdk -> Versions/Current/CrmSdk
lrwxr-xr-x  1 flo  staff    26 Oct  1 17:13 Resources -> Versions/Current/Resources
drwxr-xr-x  5 flo  staff   160 Jan 15 00:23 Versions

Notice that Resources is a symlink pointing inside the Versions folder.

Now, when I call the path_spec.match_tree() function, regardless of the value of the followSymlinks parameter, I can never get the Resources folder as an entry in the result set. Which, someone might argue, is what I should expect because:

  • if followSymlinks is True I will get entries like:
Libs/CrmSdk.framework/Resources/CrmRsc2
...
Libs/CrmSdk.framework/Versions/Current/Resources/CrmRsc1
Libs/CrmSdk.framework/Versions/Current/Resources/CrmRsc2

Which is normal because the code is following the symlinks and Resources is just a symlink to Versions/Current/Resources.

  • if followSymlinks is False, then Resources doesn't even show up in the results list as a folder after CrmSdk.framework, in other words I get only these entries:
Libs/CrmSdk.framework/Versions/Current/Resources/CrmRsc2

I don't get any Libs/CrmSdk.framework/Resources entries.

This makes it impossible to create an archive of the matched entries that can be unpacked at another location to recreate the same structure, because the symlinks are missing.

One may claim that this is a bug and when followSymlinks is False, the Resources folder should be returned in the list of results (clients shouldn't make an assumption about the entries returned: they may be files, folders or symlinks - it's their job to properly handle it). Bottom line is that symlinks are missing in the results list regardless of the flag's value.
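For comparison, the standard library's os.walk lists symlinks to directories in dirnames even when it does not follow them, so a caller can report the link itself. The sketch below shows that behaviour under the assumption that the caller wants files plus symlink entries; `iter_tree_with_links` is a hypothetical helper, not pathspec's iter_tree.

```python
import os
from typing import Iterator


def iter_tree_with_links(root: str) -> Iterator[str]:
    """Yield files and symlinks below root (relative to root) without following links."""
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Symlinks to directories appear in dirnames, but os.walk does not
        # descend into them when followlinks is False.
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                yield os.path.relpath(full, root)
        # Regular files and symlinks to files (including broken ones) are here.
        for name in filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)
```

An archiver built on this would see a Resources entry and could store it as a link instead of losing it.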

test_util.py uses os.symlink which can fail on Windows

I am running unprivileged on a corporate laptop, and I encounter the following errors. It would be nice to detect this and skip the tests which depend on it.

FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_0_check_symlink - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_1_check_realpath - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_2_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_3_sideways_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_4_recursive_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_5_recursive_circular_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_6_detect_broken_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_7_ignore_broken_links - OSError: symbolic link privilege not held
FAILED pathspec/tests/test_util.py::IterTreeTest::test_2_8_no_follow_links - OSError: symbolic link privilege not held
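One way to detect this is to probe for symlink support once and skip the affected tests when the probe fails. The sketch below uses only unittest and os; `can_symlink` and `IterTreeLinkTest` are hypothetical names, not the project's actual test code.

```python
import os
import tempfile
import unittest


def can_symlink() -> bool:
    """Return True if the current process is able to create a symlink."""
    with tempfile.TemporaryDirectory() as tmp:
        try:
            os.symlink(os.path.join(tmp, "target"), os.path.join(tmp, "link"))
        except (OSError, NotImplementedError, AttributeError):
            # For example "symbolic link privilege not held" on Windows.
            return False
    return True


class IterTreeLinkTest(unittest.TestCase):
    @unittest.skipUnless(can_symlink(), "symlinks cannot be created here")
    def test_links(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            link = os.path.join(tmp, "link")
            os.symlink(os.path.join(tmp, "target"), link)
            self.assertTrue(os.path.islink(link))
```

On a machine without the privilege, the test is reported as skipped instead of failed.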

Incorrectly (?) matches files inside directories that do match

Hello @cpburnz, I've looked through existing issues but couldn't find this one. It was initially reported in yamllint issues: adrienverge/yamllint#334 (yamllint uses pathspec).

Considering these files:

file.yaml
dir/
└── dir/file.sql
dir.yaml/
└── dir.yaml/file.sql        ← problem here

pathspec.PathSpec.from_lines('gitwildmatch', ['*.yaml']) does match file.yaml, dir.yaml, but also dir.yaml/file.sql. The latter isn't what I would expect, because *.yaml feels like "files that end with .yaml", not "any file in any path that contains .yaml".

Is it the expected behavior? If yes, what is the correct pattern for "files that end with .yaml"?
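Until that is answered, a caller who only wants "files whose name ends with .yaml" could test the final path component directly. This is a workaround sketch using the standard library's fnmatch, not gitwildmatch semantics, and `file_name_matches` is a hypothetical helper.

```python
import fnmatch
import posixpath


def file_name_matches(path: str, pattern: str) -> bool:
    """Match `pattern` against the final path component only."""
    return fnmatch.fnmatchcase(posixpath.basename(path), pattern)
```

The helper cannot tell a file from a directory, so it should only be given file paths; a directory named dir.yaml would match just like file.yaml.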

Looking forward to your answer, and thanks for this very useful project 👍

Deprecation warning

I get the following deprecation warning

/usr/lib/python3.7/site-packages/pathspec/pathspec.py:27: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
self.patterns = patterns if isinstance(patterns, collections.Container) else list(patterns)
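The warning goes away when the abstract base class is imported from collections.abc. A minimal sketch of the same check follows; `as_container` is a hypothetical wrapper around the line quoted above.

```python
from collections.abc import Container
from typing import Iterable


def as_container(patterns: Iterable[object]):
    """Keep containers as-is; materialize one-shot iterables into a list."""
    return patterns if isinstance(patterns, Container) else list(patterns)
```

A tuple or list is returned unchanged, while an iterator, which has no `__contains__`, is turned into a list.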

On bracket expression negation

^ is escaped, but it works with gitignore.

To test it:

mkdir test && cd test
git init
echo 'a[^b]c' > .gitignore
touch abc azc
git status -u --ignored

Notice that abc doesn't match but azc does match. pathspec's behavior with ^ is inconsistent with Git.

I read:

# POSIX declares that the regex bracket expression negation

But I didn't understand why you would prefer to be compatible with POSIX and Python's fnmatch.translate when you could be compatible with gitignore itself. May I ask what led you to the decision to escape ^?
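The difference can be reproduced with the standard library alone. The sketch below first shows how fnmatch treats a leading ^ inside brackets, then a simplified translation that negates on both ! and ^, as the Git session above does. `bracket_to_regex` is a hypothetical helper that ignores edge cases such as a leading ] or character classes; it is not pathspec's code.

```python
import fnmatch
import re

# fnmatch escapes a leading "^" inside brackets, so "[^b]" means "'^' or 'b'".
fnmatch_abc = fnmatch.fnmatchcase("abc", "a[^b]c")  # matches
fnmatch_azc = fnmatch.fnmatchcase("azc", "a[^b]c")  # does not match


def bracket_to_regex(expr: str) -> str:
    """Translate the inside of a glob bracket expression, Git style.

    Both "!" and "^" in first position negate the set.
    """
    negate = expr[:1] in ("!", "^")
    body = expr[1:] if negate else expr
    # Keep backslashes literal inside the regex character class.
    body = body.replace("\\", "\\\\")
    return "[" + ("^" if negate else "") + body + "]"


# Equivalent of the gitignore line a[^b]c with "^" treated as negation.
pattern = re.compile("a" + bracket_to_regex("^b") + "c")
```

With the Git-style translation, azc matches and abc does not, which is the opposite of the fnmatch result.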

Support Python Pathlib

The match_file and match_files functions do not seem to support pathlib.Path objects, which would be convenient. If that is something that you want, I would be happy to work on a pull request.

Add PathSpec.match_file

Hi,
I was looking for a Python implementation to match files with the rules defined in .gitignore files and this project is great!
My use case is to synchronize directories across a network and most of the control logic (filter, compare, update) is at the inode level to allow me to maximize the number of skipped elements (to not explore excluded directories for example).
I would like to update my current filter logic to support git patterns: given a list of patterns, is my file path matched or not ? The issue is that currently pathspec seems to be heavily oriented around processing lists of paths, what if I have a single file ?

Here is what my current implementation boils down to:

spec = pathspec.PathSpec.from_lines(pathspec.GitIgnorePattern, patterns)
def match_file(file_path):
    return len(list(spec.match_files([file_path]))) > 0  # This should not be so complicated

is_ignored = match_file(u'testfile.py')

As you can see, it's pretty cumbersome: I have to create a collection with a single item, run the matcher and then extract the result.

Ideally, I would imagine that PathSpec exposes a match_file function returning a boolean and match_files (or filter_files since it's currently acting as a filter ?) would just reuse it:

class PathSpec(object):
    # ...

    def match_file(self, file, separators=None):  # Core logic
        norm, path = util.normalize_file(file, separators=separators)  # Single file version
        is_matched = util.match_file(self.patterns, norm)  # Single file version
        return is_matched  # bool

    def match_files(self, files, separators=None):  # Quality of life function: it just replaces a one line generator
        return (file for file in files if self.match_file(file, separators))

Basically, it boils down to the fact that the library does not expose single-item functions that would let me iterate over my files as I want; instead it hides a loop inside every function.
What do you think about adding better support for single-file matching? I am aware that, given the current architecture of the library, it would require some refactoring, but I believe it would be for the best. Could you implement it, or should I do it and send a PR? (Since it's a big change, I'd rather wait for your feedback.)

Side note: the real name of the gitignore matcher is wildmatch. How about adding this as an alias name when registering the pattern? Your module deserves to be better referenced (I had some trouble finding it even though I knew what I was looking for).

Rename repo to pathspec

@cpburnz I feel this repo should be just called pathspec the same way you would install it from PyPI, as it is not the same as python-path-specification.

Minor detail, but it confused me.

Can you explain, please?

Dear Caleb,
I am a bit confused; please be so kind as to help me.
Because Git allows nested .gitignore files, a base_path value is required for correct behavior, but I can't find one.

I played around and found some small glitches: if you use PathSpec.from_lines, comment lines of an ignore file also create (empty) spec patterns. Those should be skipped.
Duplicate entries should also be skipped.

But most confusing is that there is no reference to a base directory. How would you handle nested .gitignore files?

yours sincerely

Robert
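One way a caller might deal with nested .gitignore files is to keep one spec per file and match each path relative to the directory that holds that file. The sketch below covers only the path arithmetic; `path_relative_to_ignore_file` is a hypothetical helper and says nothing about how pathspec itself handles this.

```python
from pathlib import PurePosixPath
from typing import Optional


def path_relative_to_ignore_file(path: str, ignore_dir: str) -> Optional[str]:
    """Return `path` relative to the directory holding a .gitignore file.

    Returns None when the path lies outside that directory, in which case
    the patterns from that file do not apply to it.
    """
    try:
        return str(PurePosixPath(path).relative_to(PurePosixPath(ignore_dir)))
    except ValueError:
        return None
```

The relative path is what would then be handed to the spec built from that particular .gitignore.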

Dist failure for Fedora, CentOS, EPEL

Hello,

I maintain the python-pathspec package for Fedora and CentOS, Rocky Linux, RHEL etc.

Since version 0.10.0, the setup.py file has been removed. I tried adapting our files to use python -m build, bdist_wheel and tox, but after 40 minutes I have failed to get it working.

Is there documentation on how to build and install the package now that setup.py has been removed? I couldn't find it in the READMEs, changelogs or Git history.

What is the recommended way to:

  • build the package? (previously with 0.9.0: python setup.py sdist)
  • install the package on the system, inside /usr/lib/python3.x/site-packages/...? (previously with 0.9.0: python setup.py install)
  • test the package once installed? (I guess tox is enough now, previously with 0.9.0: python setup.py test)

Thanks in advance.

0.8.1: pytest warnings

=========================================================================== test session starts ============================================================================
platform linux -- Python 3.8.9, pytest-6.2.3, py-1.10.0, pluggy-0.13.1
rootdir: /home/tkloczko/rpmbuild/BUILD/pathspec-0.8.1
plugins: forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, asyncio-0.14.0, expect-1.1.0, cov-2.11.1, mock-3.5.1, httpbin-1.0.0, xdist-2.2.1, flake8-1.0.7, timeout-1.4.2, betamax-0.8.1, pyfakefs-4.4.0, freezegun-0.4.2, flaky-3.7.0, cases-3.4.6, hypothesis-6.10.1, case-1.5.3, isort-1.3.0
collected 48 items

pathspec/tests/test_gitwildmatch.py ......................s.s...                                                                                                     [ 58%]
pathspec/tests/test_pathspec.py ........                                                                                                                             [ 75%]
pathspec/tests/test_util.py ............                                                                                                                             [100%]

========================================================================= short test summary info ==========================================================================
SKIPPED [1] pathspec/tests/test_gitwildmatch.py:421: Python 3 is strict
SKIPPED [1] pathspec/tests/test_gitwildmatch.py:440: Python 3 is strict
====================================================================== 46 passed, 2 skipped in 0.13s =======================================================================

Python 2.6 support

Hi,

Since today, pathspec is used by yamllint, which is used by Ansible, OpenStack and others.

The problem is that all of this software must support Python 2.6, but pathspec currently doesn't. This leads to issues like adrienverge/yamllint#55 and ansible/ansible#26186.

In your opinion, what's the amount of work needed to support Python 2.6?

Provide a library function/method for converting a glob to an uncompiled regex string.

currently, the way to get a regex string for a given gitignore-style glob is:

>>> pathspec.GitIgnorePattern('/dist/').regex.pattern
'^dist/.*$'

which incurs the glob->regex translation inside GitIgnorePattern.__init__ which in turn calls RegexPattern.__init__ which automatically compiles the regex.

for the simple case of just wanting to convert a glob into a non-compiled regex string, it'd be great to have a utility function/method that could both be used inside GitIgnorePattern.__init__ and outside as part of the public API.

IndexError with my .gitignore file when trying to build a Python package

Problem description

I am trying to package my Python project/library. So first I created a pyproject.toml file, following the official Python docs, using hatchling - no fancy stuff here. Then, after upgrading to the latest build version, I ran this command from within the root folder of my project:

$ python -m build
* Creating venv isolated environment...
* Installing packages in isolated environment... (hatchling)
* Getting build dependencies for sdist...
* Building sdist...
Traceback (most recent call last):
  File "/home/mfb/.local/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 351, in <module>
    main()
  File "/home/mfb/.local/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 333, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/home/mfb/.local/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 302, in build_sdist
    return backend.build_sdist(sdist_directory, config_settings)
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/build.py", line 21, in build_sdist
    return os.path.basename(next(builder.build(sdist_directory, ['standard'])))
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/plugin/interface.py", line 144, in build
    artifact = version_api[version](directory, **build_data)
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/sdist.py", line 156, in build_standard
    for included_file in self.recurse_included_files():
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/plugin/interface.py", line 168, in recurse_included_files
    yield from self.recurse_project_files()
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/plugin/interface.py", line 182, in recurse_project_files
    if self.config.include_path(relative_file_path, is_package=is_package):
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/config.py", line 82, in include_path
    and (explicit or self.path_is_included(relative_path))
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/hatchling/builders/config.py", line 90, in path_is_included
    return self.include_spec.match_file(relative_path)
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/pathspec/pathspec.py", line 176, in match_file
    return self._match_file(self.patterns, norm_file)
  File "/tmp/build-env-bgb1kk5q/lib/python3.8/site-packages/pathspec/gitignore.py", line 104, in _match_file
    dir_mark = match.match.group('ps_d')
IndexError: no such group

ERROR Backend subprocess exited when trying to invoke build_sdist

I do have a .gitignore file in my root folder, however, it includes only the default Python project template from Github.

Any ideas what is causing this error?
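For context on the last frame: Python's re module raises exactly this error when a match object is asked for a named group that the pattern does not define. The sketch below reproduces that in isolation with a made-up pattern; it does not diagnose why the group is missing in this build.

```python
import re

# A pattern that defines a group called "name" but no group called "ps_d".
match = re.match(r"(?P<name>\w+)", "example")
assert match is not None

try:
    match.group("ps_d")  # The group name used in the traceback above.
except IndexError as error:
    message = str(error)

# groupdict() only contains groups the pattern defines, so .get() is safe.
dir_mark = match.groupdict().get("ps_d")
```

So the regex that produced the match in gitignore.py apparently lacked a ps_d group, which may point at a pattern compiled by a different code path than the one _match_file expects.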

PathSpec.match_file() returns None since 0.12.0

Hello @cpburnz, and thanks for maintaining pathspec 👍

Since release 0.12.0 and the change of signature of PathSpec.match_file() from bool to Tuple[Optional[bool], Optional[int]], yamllint tests fail. New errors look like:

self.assertEqual(c.ignore.match_file('test.yaml'), False)
AssertionError: None != False

(↑ c.ignore is an instance of pathspec.PathSpec)

These seem easy to fix, but I wanted to ask first. I don't understand why None is a better return value. Could it be a small bug?

I looked at all the commits in this new release and found 92a9066 "Improve debugging". I read it, but it's not clear to me.

Looking forward to knowing more about this. Thanks! 🙂

Include directory should override exclude file

import pathspec

spec = pathspec.GitIgnoreSpec.from_lines([
    '*',  # Ignore all files by default
    '!*/',  # but scan all directories
    '!*.txt',  # Text files
    '/test1/**'  # ignore all in the directory
])
files = {
    'test1/b.bin',
    'test1/a.txt',
    'test1/c/c.txt',
    'test2/a.txt',
    'test2/b.bin',
    'test2/c/c.txt',
}
ignores = set(spec.match_files(files))
print(ignores)

{'test1/b.bin', 'test2/b.bin'}

It should be

{'test1/a.txt', 'test1/c/c.txt', 'test1/b.bin', 'test2/b.bin'}

I think GitIgnoreSpec._match_file should have something like:

                if pattern.include is False and dir_mark and out_priority == 0:
                    out_matched = pattern.include
                    out_priority = priority
                elif pattern.include is True and dir_mark:
                    out_matched = pattern.include
                    out_priority = priority
                elif priority >= out_priority:
                    out_matched = pattern.include
                    out_priority = priority

dangling symlinks cause crash

When using PathSpec.match_tree, if a broken symlink is encountered, we get an unhandled exception. It's here:

https://github.com/cpburnz/python-path-specification/blob/da86e2c4d557df2d0a7cc9743268a7173d3a4828/pathspec/util.py#L68

I think you could fix it by using os.lstat instead, but that would be a backwards-incompatible change.

Perhaps iter_tree could have an option to not follow symlinks? You could follow the example of os.walk, which accepts a followlinks keyword argument and also an onerror callback that can be used to handle problems such as permission errors when calling stat on each file.
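The underlying difference is that os.stat follows symlinks and raises when the target is missing, while os.lstat and os.path.islink inspect the link itself. Below is a small sketch of a classification that tolerates dangling links; `classify` is a hypothetical helper, not a proposal for iter_tree's actual signature.

```python
import os


def classify(path: str) -> str:
    """Classify a path without raising on dangling symlinks."""
    if os.path.islink(path):
        # os.path.exists() follows the link, so False means the target is gone.
        return "symlink" if os.path.exists(path) else "broken symlink"
    if os.path.isdir(path):
        return "directory"
    return "file" if os.path.exists(path) else "missing"
```

A tree walker could use a check like this to skip or report broken links instead of crashing.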

Leading & trailing whitespace

Hello, I am using this library and noticing some odd results around white space. When the white space is leading or trailing, I do not get any matches.

# Leading whitespace does not match
>>> pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ['   whitespace.txt']).match_file('   whitespace.txt')
False

# Trailing whitespace does not match
>>> pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ['whitespace.txt   ']).match_file('whitespace.txt   ')
False

It seems to have no problem with whitespace internal to a string.

>>> pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ['white   space.txt']).match_file('white   space.txt')
True

Can you help me understand what is going on here? Both '   whitespace.txt' and 'whitespace.txt   ' are valid file names, so I would like to figure out why I can't match them.

Why doesn't !<path> act the same as .gitignore?

When using .gitignore I could do the following:

*.log
!important/*.log
trace.*

This would exclude all log files but then include important/*.log. Pathspec doesn't work this way. Was this intentional?
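The rule being relied on is "the last matching pattern decides". The sketch below illustrates only that ordering rule, using the standard library's fnmatch as a stand-in matcher. It is an assumption-laden simplification: fnmatch's * also crosses "/", patterns are matched against the whole path, and Git's handling of excluded parent directories is not modelled. `is_ignored` and `RULES` are hypothetical names.

```python
import fnmatch
from typing import Iterable


def is_ignored(path: str, lines: Iterable[str]) -> bool:
    """Apply patterns in order; the last one that matches decides."""
    ignored = False
    for line in lines:
        negated = line.startswith("!")
        pattern = line[1:] if negated else line
        if fnmatch.fnmatchcase(path, pattern):
            # A negated pattern re-includes, a plain pattern ignores.
            ignored = not negated
    return ignored


RULES = ["*.log", "!important/*.log", "trace.*"]
```

Under this rule, debug.log and trace.txt are ignored, while important/debug.log is re-included by the negated line.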

Since version 0.10.0 pure wildcard does not work in some cases

Behavior of pattern "match everything" changed when upgrading from version 0.9.0 to 0.10.0:


Version 0.9.0

>>> import pathspec
>>> patterns = pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ["*"])
>>> 
>>> print(patterns.match_file('anydir/file.txt'))
True

Result is as expected


Version 0.10.0

>>> import pathspec
>>> patterns = pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ["*"])
>>> 
>>> print(patterns.match_file('anydir/file.txt'))
False

Result is wrong. * should match any file, but it doesn't match anydir/file.txt.


Discovered when checking case "match everything except for one directory":

patterns = pathspec.PathSpec.from_lines(pathspec.patterns.GitWildMatchPattern, ["*", "!product_dir/"])

On version 0.9.0 it correctly matches file outside of product_dir and doesn't match file inside:

>>> print(patterns.match_file('anydir/file.txt'))
True
>>> print(patterns.match_file('product_dir/file.txt'))
False

On version 0.10.0 it doesn't match in either case:

>>> print(patterns.match_file('anydir/file.txt'))
False
>>> print(patterns.match_file('product_dir/file.txt'))
False

pathspec should expose the information of what matched in a string/path

This library is very cool and I want to use it in a project. However, I'm running up against a severe limitation: I can't tell which part of a filename was matched.
Consider this example: I have several match patterns (let's call them SpecEntries). They specify what I'm looking for and, optionally, what the matched thing should be remapped to. Example:
'/Documentation/': 'docs/',
'/*.html': 'docs/',
'**/Examples/SDK/': 'docs/'

In the above examples, all those patterns on the left side are remapped to a 'docs/' folder.
Now I'm using match_tree to iterate over a directory and compare against the patterns specified by my SpecEntries.

  1. The first issue is that the results returned by match_tree don't specify which pattern matched which file. I worked around this by iterating through my patterns, compiling each one and calling match_files against it. Doable but inefficient (consider this an improvement request).

  2. After the matches are returned, I'd like to remap those paths according to the right-hand side of the SpecEntry, for example:

    '/Documentation/foo.txt' -> 'docs/foo.txt'
    'foo.html' -> 'docs/foo.html'
    'blah/Examples/SDK/bar.txt' -> 'docs/bar.txt'

The problem is that there isn't an API that will allow me to do this. The pathspec library knows which part of the path matched my specifier, but it doesn't expose that information, so I can't do this remapping. (I considered using regular expressions or fnmatch, but they won't easily match pathspec's capabilities; for example, there is no easy way to match '**'.)

Is it possible to expose the matching logic in the library API so callers can implement this kind of remapping feature?
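To show the kind of result being asked for, here is a sketch built on plain re, with one hand-written regex per SpecEntry. Each regex captures the unmatched remainder in a named group so that it can be appended to the target folder. `RULES`, `remap` and the regexes themselves are hypothetical equivalents of the three patterns above, not something pathspec generates or exposes.

```python
import re
from typing import Optional, Tuple

# (compiled regex, target folder) pairs; "rest" is the part to keep.
RULES = [
    (re.compile(r"^Documentation/(?P<rest>.+)$"), "docs/"),
    (re.compile(r"^(?P<rest>[^/]+\.html)$"), "docs/"),
    (re.compile(r"^(?:.+/)?Examples/SDK/(?P<rest>.+)$"), "docs/"),
]


def remap(path: str) -> Optional[Tuple[int, str]]:
    """Return (index of the first matching rule, remapped path), or None."""
    for index, (regex, target) in enumerate(RULES):
        match = regex.match(path)
        if match:
            return index, target + match.group("rest")
    return None
```

The returned index answers "which pattern matched", and the named group answers "which part of the path was left over".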

Exclusions not working

Either I'm missing something or exclusions are not working properly.
Check this test case:

def test_02_dir_exclusions(self):

What it is currently:

	def test_02_dir_exclusions(self):
		"""
		Test directory exclusions.
		"""
		spec = GitIgnoreSpec.from_lines([
			'*.txt',
			'!test1/',
		])
		files = {
			'test1/a.txt',
			'test1/b.bin',
			'test1/c/c.txt',
			'test2/a.txt',
			'test2/b.bin',
			'test2/c/c.txt',
		}

		results = list(spec.check_files(files))
		ignores = get_includes(results)
		debug = debug_results(spec, results)

		self.assertEqual(ignores, {
			'test1/a.txt',
			'test1/c/c.txt',
			'test2/a.txt',
			'test2/c/c.txt',
		}, debug)
		self.assertEqual(files - ignores, {
			'test1/b.bin',
			'test2/b.bin',
		}, debug)

What I would expect:

	def test_02_dir_exclusions(self):
		"""
		Test directory exclusions.
		"""
		spec = GitIgnoreSpec.from_lines([
			'*.txt',
			'!test1/',
		])
		files = {
			'test1/a.txt',
			'test1/b.bin',
			'test1/c/c.txt',
			'test2/a.txt',
			'test2/b.bin',
			'test2/c/c.txt',
		}

		results = list(spec.check_files(files))
		ignores = get_includes(results)
		debug = debug_results(spec, results)

		self.assertEqual(ignores, {
			'test2/a.txt',
			'test2/c/c.txt',
		}, debug)
		self.assertEqual(files - ignores, {
			'test1/b.bin',
			'test2/b.bin',
			'test1/a.txt',
			'test1/b.bin',
			'test1/c/c.txt',
		}, debug)

`GitIgnoreSpec` behaviors differ from git

Here is the demo for pathspec:

from pathlib import Path

import pathspec

exclude_lines=["*", "!libfoo", "!libfoo/**"]
exclude_spec = pathspec.GitIgnoreSpec.from_lines(exclude_lines)

print(exclude_spec.match_file(Path("./libfoo/__init__.py")))

In pathspec==0.11.2 the result is True, which means ./libfoo/__init__.py is excluded.


Another demo to check the behavior of Git:

The directory structure looks like this:

demo-project
├── libfoo
│   └── __init__.py
└── .gitignore

Then check whether the file is ignored with the command git check-ignore -v ./libfoo/__init__.py

Expect no output, which means ./libfoo/__init__.py is not excluded.


I suppose this difference is caused by the priority between directory and file patterns.

if dir_mark:
    # Pattern matched by a directory pattern.
    priority = 1
else:
    # Pattern matched by a file pattern.
    priority = 2

if pattern.include and dir_mark:
    out_matched = pattern.include
    out_priority = priority
elif priority >= out_priority:
    out_matched = pattern.include
    out_priority = priority

The including pattern !libfoo/** is treated as a directory pattern and is overridden by the excluding file pattern *, so ./libfoo/__init__.py gets excluded.

P.S. I spent several hours searching for this priority mechanism, but could not find any documentation mentioning it.


Other platform info:
pathspec 0.11.2
Git 2.40.1.windows.1
Windows 11 22H2 (22621.2134)
Python 3.11.4
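The effect of the quoted priority logic can be traced with a standalone rendering of it. In the sketch below each matching pattern is reduced to an (include, dir_mark) pair; `resolve` is a hypothetical function, and the dir_mark values for the negated patterns are assumptions taken from the description above, not verified against pathspec.

```python
from typing import Iterable, Optional, Tuple

# One entry per pattern that matched the file, in pattern order:
# (include, dir_mark). `include` mirrors pattern.include; `dir_mark` says the
# match came from a directory pattern.
PatternMatch = Tuple[bool, bool]


def resolve(matches: Iterable[PatternMatch]) -> Optional[bool]:
    """Replay the priority rules from the snippet quoted above."""
    out_matched: Optional[bool] = None
    out_priority = 0
    for include, dir_mark in matches:
        priority = 1 if dir_mark else 2
        if include and dir_mark:
            out_matched, out_priority = include, priority
        elif priority >= out_priority:
            out_matched, out_priority = include, priority
    return out_matched


# "*", "!libfoo", "!libfoo/**" with both negations assumed to be directory matches.
as_reported = resolve([(True, False), (False, True), (False, True)])
# The same patterns if the negations were treated as file matches instead.
as_file_patterns = resolve([(True, False), (False, False), (False, False)])
```

Under these assumptions the first call yields True (excluded) and the second yields False, which matches the difference from Git described above.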

Feature request: Accept Path as arguments

It is the year 2020: Pythonistas have largely adopted the pathlib module as a more convenient way to perform file system operations than old stinkin' os.path and its comrades.

Hence it would be great if all of pathspec's functions accepted Path objects, or iterables thereof, as arguments, in addition to plain strings.

Putting a conversion to a string via str(my_path) at the right places is all it takes to make it work.
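As a sketch of that conversion, os.fspath accepts both strings and Path-like objects and returns the string form, so a thin wrapper is enough on the caller's side. `to_str_paths` and the `PathLike` alias are hypothetical names, not part of pathspec.

```python
import os
from typing import Iterable, Iterator, Union

PathLike = Union[str, "os.PathLike[str]"]


def to_str_paths(files: Iterable[PathLike]) -> Iterator[str]:
    """Convert str or Path-like inputs to plain strings, lazily."""
    for file in files:
        yield os.fspath(file)
```

Unlike str(my_path), os.fspath raises TypeError for objects that are not paths at all, which catches accidental misuse early.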

A "./" in front of the filename produces wrong matches

Consider the use case when you have a directory structure like this :

.
|-- 0.csv
|-- A
|   `-- 1.csv
`-- .gitignore

and the contents of .gitignore are :

*.csv
!A/0.csv

ignorespec.match_file("A/0.csv") returns True, which is expected.
while
ignorespec.match_file("./A/0.csv") returns False.
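Until the two spellings are treated alike, a caller can normalize paths before matching. The sketch below uses posixpath.normpath for that; `normalize` is a hypothetical helper. Note that normpath also collapses ".." segments and drops trailing slashes, which may not be wanted when symlinks or directory markers matter.

```python
import posixpath


def normalize(path: str) -> str:
    """Drop a leading "./" and other redundant segments before matching."""
    return posixpath.normpath(path)
```

After normalization, "./A/0.csv" and "A/0.csv" become the same string and would be matched identically.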

Symlink pathspec_meta.py breaks Windows

setup.cfg refers to https://github.com/cpburnz/python-path-specification/blob/master/pathspec_meta.py with

version: attr: pathspec_meta.__version__

On a git checkout on Windows, unless git config core.symlinks was enabled explicitly, the following occurs when trying to install the package.

Traceback (most recent call last):
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 40, in __getattr__
    for statement in self.module.body
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 391, in _parse_attr
    return getattr(StaticModule(module_name), attr_name)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 47, in __getattr__
    "{self.name} has no attribute {attr}".format(**locals()))
AttributeError: pathspec_meta has no attribute __version__

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    setup()
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\distutils\core.py", line 121, in setup
    dist.parse_config_files()
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\dist.py", line 690, in parse_config_files
    ignore_option_errors=ignore_option_errors)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 161, in parse_configuration
    meta.parse()
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 467, in parse
    section_parser_method(section_options)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 440, in parse_section
    self[name] = value
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 224, in __setitem__
    value = parser(value)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 556, in _parse_version
    version = self._parse_attr(value, self.package_dir)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\site-packages\setuptools\config.py", line 394, in _parse_attr
    module = importlib.import_module(module_name)
  File "C:\Users\vandjohn\Downloads\WPy64-3771\python-3.7.7.amd64\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\vandjohn\gh\python-path-specification\pathspec_meta.py", line 1, in <module>
    pathspec/_meta.py
NameError: name 'pathspec' is not defined

Using the following in setup.cfg works for me, but I suspect there was some problem with this (possibly a setuptools issue?) that the symlink was intended to work around.

version: attr: pathspec._meta.__version__

Feature request: Unicode support

@cpburnz Thanks for putting this neat little highly useful python package out there.
I would like to work with files featuring weird little symbols inside their names, to make life easier for the Spanish-speaking users of my package copier: copier-org/copier#118 (comment)

I would need pathspec to handle files like ñana.txt, which currently never match and hence are never ignored.

A possible solution would be to use the regex library, which is a drop-in replacement for Python's built-in re library, but with full Unicode support among other power-ups.

If you would accept adding a dependency to your package, I would be ready to open a PR implementing the required changes.

The pattern_to_regex method does not seem to work correctly on windows.

The gitignore file is as follows.

a
b
spam/**
**/api/
**/

After converting each line to a regex, I have a method called list_path that scans the Python files in the current directory and returns those that do not match any pattern found in the gitignore file. The following test failed in a Windows environment. Does pathspec support Windows?
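One common source of Windows-only mismatches is the backslash path separator, since gitignore-style patterns are written with forward slashes. The report does not confirm that this is the cause here, so the following is only a sketch of how paths could be normalized before matching. The helper name to_posix is hypothetical and not part of pathspec.

```python
from pathlib import PureWindowsPath


def to_posix(path: str) -> str:
    """Return a Windows-style relative path with forward slashes.

    Hypothetical helper for illustration; not part of pathspec.
    """
    # PureWindowsPath treats both "\\" and "/" as separators on any OS.
    return PureWindowsPath(path).as_posix()


normalized = to_posix(r"spam\api\views.py")
```

The normalized string can then be passed to whatever matching routine is in use.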

Fail to match directory

Hi,
I have a simple case:

pattern = "cargo/"
spec = pathspec.PathSpec.from_lines('gitwildmatch', pattern)
return spec.match_file("cargo")
=> False

I was expecting True

edit:
cargo/a is ignored as expected

Checking directories via match_file() does not work on Path objects

I am currently investigating a problem in the black formatter which uses pathspec to figure out ignored files by parsing .gitignore files. black uses this library and tries to check if directories are ignored before checking their contents.

It boils down to this minimum example:

from pathlib import Path

import pathspec

spec = pathspec.PathSpec.from_lines('gitwildmatch', """
/important/
""".splitlines())

assert spec.match_file("important/bar.log")
assert spec.match_file("/important/")
assert spec.match_file("important/")
assert spec.match_file("important")  # does not match
assert spec.match_file(Path("important/"))  # does not match as pathlib removes trailing slash

It matches the file important/bar.log as expected, however, it does not match the folder when it has no trailing slash.

This becomes especially problematic when using Path objects, as the trailing slash is removed by pathlib.

I don't know where this needs to be fixed. What do you think? Is black using it wrong? Should the match_file() method check if the given Path object is a directory and add the trailing slash on its own before checking?

It would certainly be possible to implement a workaround in black and check with Path.is_dir(), convert to str and append the slash...

FYI, the place where the check happens in black is here. relative_path is a pathlib.Path and can be a directory.
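The workaround described above (check Path.is_dir(), convert to str, append the slash) could look roughly like this on the caller's side. This is a minimal sketch, not black's actual code; the helper name to_match_string and the root argument are assumptions made for illustration.

```python
from pathlib import Path


def to_match_string(path: Path, root: Path) -> str:
    """Return path relative to root as a POSIX string.

    Directories get a trailing slash, because pathlib strips it and a
    pattern such as "/important/" only applies to directories.
    Hypothetical helper; not part of pathspec or black.
    """
    relative = path.relative_to(root).as_posix()
    if path.is_dir() and not relative.endswith("/"):
        relative += "/"
    return relative
```

The returned string, rather than the Path object, would then be handed to spec.match_file().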

Unintuitive behavior with binary paths/patterns

The following program illustrates some possible variations:

#!/usr/bin/python3

import pathspec
import os

print("A. String pattern, string path:\n    ", end="")
try:
    s = pathspec.PathSpec.from_lines('gitwildmatch', ['*.py'])
    for p in os.listdir('pathspec'):
        if s.match_file(p):
            print(p, end=" ")
    print()
except Exception as e:
    print("FAILED '%s'" % e)

print("B. String pattern, binary path:\n    ", end="")
try:
    s = pathspec.PathSpec.from_lines('gitwildmatch', ['*.py'])
    for p in os.listdir(b'pathspec'):
        if s.match_file(p):
            print(p, end=' ')
    print()
except Exception as e:
    print("FAILED '%s'" % e)

print("C. String pattern, binary path + surrogateescape:\n    ", end="")
try:
    s = pathspec.PathSpec.from_lines('gitwildmatch', ['*.py'])
    for p in os.listdir(b'pathspec'):
        if s.match_file(p.decode('utf8','surrogateescape')):
            print(p, end=' ')
    print()
except Exception as e:
    print("FAILED '%s'" % e)

print("D. Binary pattern, binary path:\n    ", end="")
s = pathspec.PathSpec.from_lines('gitwildmatch', [b'*.py'])
try:
    for p in os.listdir(b'pathspec'):
        if s.match_file(p):
            print(p, end=' ')
    print()
except Exception as e:
    print("FAILED '%s'" % e)

Gives the following result when run in the source directory:

A. String pattern, string path:
    util.py pattern.py pathspec.py __init__.py compat.py 
B. String pattern, binary path:
    FAILED 'cannot use a string pattern on a bytes-like object'
C. String pattern, binary path + surrogateescape:
    b'util.py' b'pattern.py' b'pathspec.py' b'__init__.py' b'compat.py' 
D. Binary pattern, binary path:
    

IMHO examples A-C behave as expected, while example D neither matches any files nor complains about the pattern.
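Until binary patterns are handled consistently, one way to avoid cases B and D is to decode bytes paths to str before matching, similar to case C. The sketch below uses os.fsdecode from the standard library, which applies the filesystem encoding and error handler; the helper name to_text_path is hypothetical.

```python
import os


def to_text_path(path):
    """Return path as str, decoding bytes with the filesystem encoding.

    Hypothetical helper for illustration; str input is returned unchanged.
    """
    return os.fsdecode(path)


decoded = [to_text_path(p) for p in (b"util.py", "pattern.py")]
```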

Bug of matching absolute paths for some patterns.

Some patterns work with both absolute and relative paths.
For example,

>>> list(pathspec.patterns.GitWildMatchPattern("*.py").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['/foo/a.py', 'foo/a.py', 'x/foo/a.py', '/x/foo/a.py']
>>> list(pathspec.patterns.GitWildMatchPattern("**").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['/foo/a.py', 'foo/a.py', 'x/foo/a.py', '/x/foo/a.py']

However, the pattern foo or /foo won't match a path that starts with /foo.
For example,

>>> list(pathspec.patterns.GitWildMatchPattern("foo").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['foo/a.py', 'x/foo/a.py', '/x/foo/a.py']
>>> list(pathspec.patterns.GitWildMatchPattern("/foo").match(["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]))
['foo/a.py']

Can we support matching the absolute path in this case?
I think changing output.append('(?:.+/)?') here to output.append('(?:.*/)?') could solve the issue, but I am not sure whether it has other unwanted side effects.
Another solution is to normalize /... to ..., similar to how ./ is normalized here.
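The difference between the two prefixes can be checked with plain re. The regexes below are a simplified stand-in for what the pattern foo might compile to (an assumption, not the library's exact output); only the optional directory prefix differs.

```python
import re

paths = ["/foo/a.py", "foo/a.py", "x/foo/a.py", "/x/foo/a.py"]

# Simplified, assumed shape of the regex for the pattern "foo".
current = re.compile(r"^(?:.+/)?foo(?:/.*)?$")   # prefix needs at least one character
proposed = re.compile(r"^(?:.*/)?foo(?:/.*)?$")  # prefix may be empty before the slash

matched_current = [p for p in paths if current.match(p)]
matched_proposed = [p for p in paths if proposed.match(p)]
```

With `.+` the leading slash of /foo/a.py has nothing before it to consume, so that path is skipped; with `.*` all four paths match.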

`!` doesn't exclude files in directories if the pattern doesn't have a trailing slash

It appears that pathspec is not handling exclusions correctly for directories without a trailing slash:

import pathspec


gitignore = """\
build
!/foo/build
"""
spec = pathspec.PathSpec.from_lines("gitwildmatch", gitignore.splitlines())
# incorrectly returns True
print(spec.match_file("foo/build/file.py"))


gitignore = """\
build
!/foo/build/
"""
spec = pathspec.PathSpec.from_lines("gitwildmatch", gitignore.splitlines())
# correctly returns False
print(spec.match_file("foo/build/file.py"))

If you try doing the same with .gitignore, it works correctly:

git init repro
cd repro
echo $'build\n!/foo/build' > .gitignore
mkdir build
touch build/file.py
mkdir -p foo/build
touch foo/build/file.py

Running git status gives:

On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore
	foo/

nothing added to commit but untracked files present (use "git add" to track)

And running git ls-files --others --exclude-standard gives:

.gitignore
foo/build/file.py

Performance improvement?

Hello.

We are users of pathspec in some other project. I have a performance question.

When a long list of rules (dozens) is matched against a large number of files (hundreds of thousands), match_file takes a long time. Is there any way to improve its performance?
For example, by using one big regex instead of multiple small ones.
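As a rough illustration of the "one big regex" idea, several regexes can be joined with alternation and compiled once. This is a sketch using plain re with made-up example regexes, not pathspec internals. It only works for a list of ignore rules without negation, because patterns starting with ! depend on rule order and cannot be merged this simply.

```python
import re

# Example regexes standing in for already-translated ignore rules (assumed).
regexes = [r".*\.pyc", r"build/.*", r".*/__pycache__/.*"]

# Wrap each rule in a non-capturing group and join with alternation.
combined = re.compile("|".join(f"(?:{r})" for r in regexes))


def is_ignored(path: str) -> bool:
    """Return True if any of the combined rules matches the whole path."""
    return combined.fullmatch(path) is not None
```

Whether this is actually faster depends on the patterns and should be measured on real data.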

Please consider switching the build-system to flit_core to ease setuptools bootstrap

Could you please consider switching the build system from setuptools to flit_core? This would help Linux distributions such as Gentoo avoid cyclic dependencies that make bootstrapping unbundled setuptools a real pain. If you agree, I can submit a pull request doing the conversion.

The problem is that the most recent release of setuptools (66.0.0) started using platformdirs. platformdirs uses the hatchling build backend, which in turn requires this package. This creates a dependency cycle: we can't install setuptools before installing platformdirs, and we can't build platformdirs before all of hatchling's dependencies are installed, so we effectively end up needing setuptools to build them.

flit_core has "no dependencies [except for tomli, on Python < 3.11]" by design, so it makes bootstrapping packages much easier.

Exclude folder using exclamation mark ('!') doesn't work

I'm using pattern
"""
!test1/
*.txt
"""

to scan a folder. From the gitignore description
(http://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository)

ignore all files in the build/ directory

build/

This should exclude everything under the test1 folder, but the run result shows that the line does not work: the result still shows all contents under the test1 folder.

[An Example Folder Structure]
d:\dev\eclipse_workspace\test_scan\src\test1\a.txt
d:\dev\eclipse_workspace\test_scan\src\test1\b.txt
d:\dev\eclipse_workspace\test_scan\src\test1\c\c1.txt
d:\dev\eclipse_workspace\test_scan\src\test2\a.txt
d:\dev\eclipse_workspace\test_scan\src\test2\b.txt
d:\dev\eclipse_workspace\test_scan\src\test2\c\c1.txt

[Code]
SCAN_PATTERN = """
!test1/
*.txt
"""

spec = pathspec.PathSpec.from_lines(pathspec.GitIgnorePattern, SCAN_PATTERN.splitlines())
# Raw string, so that "\t" in the path is not read as a tab character.
spec.match_tree(r'd:\dev\eclipse_workspace\test_scan')

[Run Result]
d:\dev\eclipse_workspace\test_scan\src\test1\a.txt
d:\dev\eclipse_workspace\test_scan\src\test1\b.txt
d:\dev\eclipse_workspace\test_scan\src\test1\c\c1.txt
d:\dev\eclipse_workspace\test_scan\src\test2\a.txt
d:\dev\eclipse_workspace\test_scan\src\test2\b.txt
d:\dev\eclipse_workspace\test_scan\src\test2\c\c1.txt

Can you please have a look? Thanks.

method to escape gitwildmatch

It would be great to have a method for escaping a string according to gitwildmatch, that is, putting backslashes before !, [, ], ?, and *.

I could send a PR if there's interest in this.
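A minimal sketch of such a helper, covering only the characters listed above. The function name escape_gitwildmatch is hypothetical, and other characters that may need escaping (such as a backslash or a leading #) are deliberately left out.

```python
import re

# Characters named in the request: ! [ ] ? *
_SPECIAL = re.compile(r"([!\[\]?*])")


def escape_gitwildmatch(text: str) -> str:
    """Put a backslash before each special character in text.

    Hypothetical helper for illustration; not an existing pathspec API.
    """
    return _SPECIAL.sub(r"\\\1", text)


escaped = escape_gitwildmatch("a*b?.txt")
```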

Exclusion patterns work only if processed after the inclusion ones.

It took me quite a bit of time to figure out this one: if you have an exclusion pattern before an inclusion one, it won't work. The reason is that match_files() processes the patterns in whatever order they are specified:

	for pattern in patterns:
		if pattern.include is not None:
			result_files = pattern.match(all_files)
			if pattern.include:
				return_files.update(result_files)
			else:
				return_files.difference_update(result_files)
	return return_files

If the exclusion pattern is processed first, the difference_update won't have any effect since the return_files will be empty. To fix that I separated the patterns in two lists and process the inclusion ones first:

	include_patterns, exclude_patterns = partition(patterns, lambda p: p.include)

	for pattern in include_patterns:
		result_files = pattern.match(all_files)
		return_files.update(result_files)

	for pattern in exclude_patterns:
		result_files = pattern.match(all_files)
		return_files.difference_update(result_files)
	return return_files

where partition is defined as:

def partition(data, pred):
	"""Partitions the data according to the predicate. Returns a tuple of lists (yes, no) with the partitioned elements."""
	yes, no = [], []
	for d in data:
		(yes if pred(d) else no).append(d)
	return yes, no

Of course you may fix this differently (for example, you may "sort" the patterns such that the exclusion ones come after the inclusion ones).

HTH
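For context, gitignore itself is order-dependent: the last matching pattern decides the outcome, so reordering rules changes their meaning. The toy model below uses fnmatch from the standard library instead of pathspec and (glob, include) pairs as an assumed representation, purely to show the effect of order.

```python
from fnmatch import fnmatchcase


def match_files(patterns, files):
    """Apply (glob, include) pairs in order; later rules override earlier ones."""
    matched = set()
    for glob, include in patterns:
        hits = {f for f in files if fnmatchcase(f, glob)}
        if include:
            matched |= hits   # inclusion rule adds its hits
        else:
            matched -= hits   # exclusion rule removes its hits
    return matched


files = ["a.txt", "keep.txt", "b.py"]
result_in_order = match_files([("*.txt", True), ("keep.txt", False)], files)
result_reversed = match_files([("keep.txt", False), ("*.txt", True)], files)
```

In the first case keep.txt is removed after being added; in the second the exclusion runs against an empty set and has no effect, which is the behaviour described above.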
