victordomingos / count-files

A CLI utility written in Python to help you count files, grouped by extension, in a directory. By default, it will count files recursively in current working directory and all of its subdirectories, and will display a table showing the frequency for each file extension (e.g.: .txt, .py, .html, .css) and the total number of files found.

Home Page: https://no-title.victordomingos.com/projects/count-files/

License: MIT License

Python 99.84% HTML 0.16%

cli python3 argparse file-management statistics

count-files's Introduction

English | Português | Русский | Українська

Count Files

A command-line interface (CLI) utility written in Python to help you count or search for files with a specific extension, files without an extension, or all files regardless of extension, in the specified directory.

Documentation

Dependencies

To run this application, you need to have a working Python 3.6+ installation.

Installation

On regular desktop operating systems

Count Files is a platform-independent application that runs in Python and can be easily installed using pip:

pip3 install count-files

If you are interested in the current development version, you can simply clone this git repository and install it using pip3 install -e . (run from inside the cloned directory). Please note, however, that only released versions are expected to be stable and usable. The development code is often unstable or buggy, for the simple reason that it is a work in progress.

On iPhone or iPad (in Pythonista 3 for iOS)

It may also be used on iOS (iPhone/iPad) using the StaSh command-line in the Pythonista 3 app. Please see documentation for further instructions.

How to use

To check the list of available options and their usage, you just need to use one of the following commands:

count-files -h

count-files --help

By default, the program counts or searches for files recursively in current working directory and all of its subdirectories.
For fully supported operating systems (Linux, macOS, iOS, Windows), any hidden files or folders are ignored by default. For other operating systems on which Python can run, the option to include/exclude hidden files is not available, so all existing files are included.
The names of extensions are case insensitive by default. The results for ini and INI will be the same.
Optionally, you can pass it a path to the directory to scan, choose non-recursive counting or searching, process the file extensions with case-sensitive mode and enable search or counting in hidden files and folders.

See more about CLI arguments in English, Portuguese, Russian, Ukrainian.

The simplest form of usage is to type the command in the shell without any arguments. It will display a table showing the frequency of each file extension (e.g.: .txt, .py, .html, .css) and the total number of files found.

count-files

Another main feature of this application is searching for files by a given extension, which presents the user with a list of the paths of all files found.

count-files -fe txt [path]

count-files --file-extension txt [path]

You can also count the total number of files with a certain extension, without listing them.

count-files -t py [path]

count-files --total py [path]

For information about files without an extension, specify a single dot as the extension name.

count-files -fe . [path]

count-files --file-extension . [path]

count-files -t . [path]

count-files --total . [path]

If you need to list or to count all the files, regardless of the extension, specify two dots as the extension name.

count-files -fe .. [path]

count-files --file-extension .. [path]

count-files -t .. [path]

count-files --total .. [path]

You can also search for files using Unix shell-style wildcards: *, ?, [seq], [!seq].

count-files -fm *.py? [path]

count-files --filename-match *.py? [path]
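The wildcard syntax listed above is the same one accepted by Python's standard fnmatch module. As a rough illustration of how the pattern *.py? behaves (the file names are made up, and whether count-files uses fnmatch internally is an assumption here), consider:

```python
from fnmatch import fnmatch

# Hypothetical file names, used only to illustrate the pattern syntax.
names = ["script.py", "script.pyc", "module.pyw", "notes.txt"]

# "*.py?" requires exactly one character after ".py",
# so "script.py" itself does not match.
matches = [name for name in names if fnmatch(name, "*.py?")]
print(matches)
```

In most Unix shells it is safer to quote the pattern (for example '*.py?'), so that the shell does not expand it before count-files receives it.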

Did you find a bug or do you have a suggestion?

Please open a new issue or a pull request in the repository.

count-files's People

nataliabondarenko ndanl tusharbihani rbhaiya gentlesik slad99 surajwate xyt556 maruthapandian

count-files's Issues

To be considered: can we simplify this by using a dictionary instead of a class?

As suggested by Daniel Baboiu @ Linkedin:

You don't need a full class for the counter. Just use a dictionary with extensions as keys (best method for storing data), then (if you need sorting) convert it to a list of tuples, and sort with the standard list methods.

To be considered: can we simplify our code by following this suggestion?
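A minimal sketch of what that suggestion could look like, assuming a plain list of file names as input. The function names and the "[no extension]" label are illustrative and not taken from the project's code:

```python
import os


def count_extensions(filenames):
    """Count file extensions using a plain dictionary (hypothetical helper)."""
    counts = {}
    for name in filenames:
        # Text after the last dot, or a placeholder label when there is none.
        ext = os.path.splitext(name)[1][1:] or "[no extension]"
        counts[ext] = counts.get(ext, 0) + 1
    return counts


def sort_by_frequency(counts):
    """Convert the dictionary to a list of tuples, most frequent first."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


sample = ["a.txt", "b.py", "c.txt", "Makefile", "d.py", "e.txt"]
table = sort_by_frequency(count_extensions(sample))
print(table)
```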

To consider: switching from dictionary to Counter

To consider: switching from dictionary to Counter (as suggested by Aaron Kurlan):
https://docs.python.org/3/library/collections.html#collections.Counter
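For comparison, here is a short sketch of the same counting done with collections.Counter. The helper name and the sample file names are assumptions made for illustration only:

```python
import os
from collections import Counter


def count_extensions(filenames):
    """Count file extensions with a Counter (hypothetical helper)."""
    return Counter(
        os.path.splitext(name)[1][1:] or "[no extension]"
        for name in filenames
    )


sample = ["a.txt", "b.py", "c.txt", "Makefile", "d.py", "e.txt"]
counts = count_extensions(sample)

# most_common() already returns (extension, count) tuples sorted by count.
top = counts.most_common(1)
total = sum(counts.values())
print(top, total)
```

Counter removes the need for the manual get/increment step and provides sorting by frequency through most_common().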

Failing test on test_argument_parser.py

======================================================================
FAIL: test_for_hidden (tests.test_argument_parser.TestArgumentParser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/fact/Dropbox/Count-files/tests/test_argument_parser.py", line 96, in test_for_hidden
    [self.get_locations('data_for_tests'), '-nr'])), 0)
AssertionError: None != 0

----------------------------------------------------------------------
Ran 6 tests in 0.012s

FAILED (failures=1)

Test runner fails

I get a traceback while running this:

$ python3.7 test_runner.py 
Traceback (most recent call last):
  File "test_runner.py", line 2, in <module>
    from tests.test_argument_parser import TestArgumentParser
ModuleNotFoundError: No module named 'tests.test_argument_parser'

IMG for Readme

Argparse and help system.

Maybe implement sub-parsers:
This can be done. For example, git has a help sub-command in addition to the usual --help:
git --help
git help <command> (launches the default browser to display an HTML file), git help -a, git help -g
I do not know how it is implemented there, but with a parser you can do something like this:

from argparse import ArgumentParser
import os
from textwrap import fill


# handlers
def foo(args):
    print('ext:', args.file_extension, 'path:', args.path, 'preview:', args.preview)
    # print start_message
    """print(fill(show_start_message(args.extension, args.case_sensitive,
                                  recursive, include_hidden, location),
               width=START_TEXT_WIDTH), end="\n\n")"""
    print('the process of getting the list here')


def bar(args):
    # here you can write anything and even make a responsive help text
    # key-values may be stored in settings
    arg_help = {
        'desc': 'usage: test help <command>',
        'path': 'path help',
        'fe': 'file-extension help',
        'file-extension': 'file-extension help'
    }
    print(fill(arg_help.get(args.argument, 'not implemented'), width=5))


parser = ArgumentParser(
    prog='test',
    description='desc'
)
# common arguments
parser.add_argument('path', nargs='?', default=os.getcwd(), type=str,
                    help='The path to the folder containing the files to be counted.')
parser.add_argument('-st', '--supported-types', action='store_true',
                    help='The list of currently supported file types for preview.')

subparsers = parser.add_subparsers(help='Usual sub-command help')

parser_bar = subparsers.add_parser('help', help='Help by certain argument name')
parser_bar.add_argument('-a', '--argument', type=str, default='desc',
                        choices=('desc', 'path', 'fe', 'file-extension'))
parser_bar.set_defaults(func=bar)

# special arguments
parser_search = subparsers.add_parser('search', help='File searching by extension help')
parser_search.add_argument('-fe', '--file-extension', type=str, required=True,  # or set default='py',
                           help="Search files by file extension ...")
parser_search.add_argument('-p', '--preview', action='store_true', default=False,
                           help='Display a short preview (only available for text files).')
# parser_search.add_argument(...)
parser_search.set_defaults(func=foo)


def main(*args):

    args = parser.parse_args(*args)

    if args.supported_types:
        parser.exit(status=0, message='supported_type_info_message')

    if os.path.abspath(args.path) == os.getcwd():
        location = os.getcwd()
        loc_text = ' the current directory'
    else:
        location = os.path.expanduser(args.path)
        loc_text = ':\n' + os.path.normpath(location)

    if not os.path.exists(location):
        parser.exit(status=1, message=f'The path {location} '
                                      f'does not exist, or there may be a typo in it.')

    # if not include_hidden and is_hidden_file_or_dir(location): ... parser.exit
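    args.path = location
    # args.func is only set when a sub-command (help/search) was given
    if hasattr(args, 'func'):
        args.func(args)
    else:
        parser.print_help()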


if __name__ == "__main__":
    main()

it looks like this:

Originally posted by @NataliaBondarenko in #84 (comment)

Licensing update

Up until now, this project has been made available under Creative Commons Attribution Share Alike 4.0. It's a fairly permissive license that allows a wide range of use cases, including commercial use and code modifications, provided that the license and copyright notice are retained, any code changes are clearly documented, and derivative works are redistributed under the same or a similar license.

However, this license was not made for software and is not recommended for that purpose. So, I am looking for another license that is more suitable for software and that keeps about the same level of freedom. Any suggestions?

Error while searching for files with a given file extension (macOS)

$ countfiles -fe txt /

Recursively searching for .txt files in /.

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/bin/countfiles", line 11, in <module>
    load_entry_point('countfiles', 'console_scripts', 'countfiles')()
  File "/Users/username/Dropbox/Count-files/countfiles/__main__.py", line 102, in main_flow
    include_hidden=include_hidden)
  File "/Users/username/Dropbox/Count-files/countfiles/utils/word_counter.py", line 84, in get_files_by_extension
    in Path(os.path.expanduser(location)).rglob(f"*.{extension}")
  File "/Users/username/Dropbox/Count-files/countfiles/utils/word_counter.py", line 83, in <listcomp>
    files = sorted([f for f
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1094, in rglob
    for p in selector.select_from(self):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 543, in _select_from
    for starting_point in self._iterate_directories(parent_path, is_dir, scandir):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 533, in _iterate_directories
    for p in self._iterate_directories(path, is_dir, scandir):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 533, in _iterate_directories
    for p in self._iterate_directories(path, is_dir, scandir):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 531, in _iterate_directories
    if entry.is_dir() and not entry.is_symlink():
OSError: [Errno 9] Bad file descriptor: '/dev/fd/4'

This error does not occur when using the counting feature (countfiles /), only when searching for file extensions.

Since /dev/ is composed of device files, I would suggest simply skipping it on Linux and macOS systems. Any additional thoughts?
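One way to sketch that idea is to switch from rglob to os.walk and prune the unwanted directory before descending into it. The function name, the skip list and the parameters below are assumptions for illustration and not the project's actual code:

```python
import os

# Assumption: directories we never want to descend into on Linux/macOS.
SKIPPED_DIRS = frozenset({"/dev"})


def find_by_extension(top, extension, skipped=SKIPPED_DIRS):
    """Return sorted paths under `top` whose names end in `.extension`."""
    found = []
    for root, dirs, files in os.walk(top):
        # Editing dirs in place tells os.walk not to visit those subtrees.
        dirs[:] = [d for d in dirs if os.path.join(root, d) not in skipped]
        for name in files:
            if name.endswith("." + extension):
                found.append(os.path.join(root, name))
    return sorted(found)
```

By default os.walk also ignores errors raised while listing a directory, which may avoid this kind of OSError even for paths that are not on the skip list.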

Make the display of file size in file path lists optional

I have been thinking about making the file size display for each listed file path optional. That would allow this utility to be used to build simple lists of paths separated by newlines, which can be useful in some contexts. The additional info (file size) may prevent the list from being used by other tools.

-hs
--hide-file-sizes

Suggestion: add the ability to search for files for a specific extension

Hello, Victor Domingos!
Something like this.
countfiles.py
import:

import os
from pathlib import Path

in class WordCounter:

    def get_files_by_extension(self, args_path: str, args_fe: str):
        """This is like calling Path.glob() with '**' added in front of the given pattern
        The '**' pattern means 'this directory and all subdirectories, recursively'.
        In other words, it enables recursive globbing.
        (including folder name or filename with dot)
        :param args_path: path to file, args.path
        :param args_fe: file extension, args.fe
        :return:
        """
        print('path: ', os.path.expanduser(args_path))
        print('extension: ', args_fe)
        files = sorted(Path(os.path.expanduser(args_path)).rglob(f"*.{args_fe}"))
        print('len files: ', len(files))
        if files:
            for el in files:
                print('path: ', el)
                print('size: ', el.stat().st_size)
                print('text: ', el.read_text()[0:200].replace('\n', ' '))
        return

# Path.stat().st_size return information about this path (similarly to os.stat())
# Size of the file in bytes, if it is a regular file or a symbolic link.
# The size of a symbolic link is the length of the pathname it contains,
# without a terminating null byte.
# Depending on the extension: Path.read_text() may not work
# Path.read_text(), this makes sense if there is module-level docstrings

to ArgumentParser:

parser.add_argument('-fe', required=False, choices=('py', 'html'),
                                   type=str, help='find files by extension')

in flow control:

if args.fe:
    fc.get_files_by_extension(args.path, args.fe)

Images for Readme

Let's upload here the screenshots we will be using on our readme.

Absolute import instead of relative import

Hello, Victor!
Can you change the relative imports below (first pair) to the absolute imports (second pair)

from .utils.file_handlers import get_file_extension
from .utils.word_counter import WordCounter

from countfiles.utils.file_handlers import get_file_extension
from countfiles.utils.word_counter import WordCounter

in countfiles/__main__.py?
I get an error message when I run the program.

Ideas for Count Files v1.5

Let's gather a few ideas about what kind of improvements we could consider including in the next Count Files release. For now, this issue will function as a simple brainstorming place to help us define priorities and plan ahead about where we should focus our efforts.

To get things started, let me introduce some possible ideas:

Find out ways to simplify the help system:
- make text blocks a bit shorter, if possible;
- maybe implement sub-parsers as a way to hide some of the complexity and allow for more clear visualization of each of the available arguments and their meaning.
- consider whether we should move from argparse to a similar third-party package without sacrificing portability.
Add file search by other criteria:
- date last modified;
- name starts with (name prefix);
- name ends with (name suffix);
- name contains;
- extension starts with (extension prefix);
- extension ends with (extension suffix);
- extension contains;
- file size bigger than;
- file size smaller than.
Add some new preview systems for common file types:
- List more usual extensions that correspond in fact to text files;
- Images supported by Pillow (keeping external dependencies optional - only show preview if required library is available, else display a simple message);
- PDF files.

Help system - improve readability

Some parts of the help system are too long and maybe can be made shorter, now that we have the ability to specify help topics or keywords. Instead of a long page, we may add a few lines indicating the presence of other related subsections or pages.

It also applies to the new help system, for instance when using count-files -ah docs. Ideally each subpage should fit in a 25 or 40 line terminal display, unless we have a very good reason to keep all the text in a single help page.

Also: wouldn't it be better to use "topic(s)" instead of "keyword(s)"? It seems to be a bit more clear.

@NataliaBondarenko

def get_files

Check it, please, on your PC.

import os, sys, ctypes
from pathlib import Path

def get_file_extension(filepath: str) -> str:
    """Extract only the file extension from a given path.
    If the file name does not have an extension, return '' (empty string).
    Behavior:
    select2.3805311d5fc1.css.gz -> gz, .gitignore -> ''
    Pipfile -> '', .hidden_file.txt -> txt
    """
    extension = os.path.splitext(filepath)[1][1:]
    if extension:
        return extension
    else:
        # change here: returns instead of an empty string -> '.'
        return '.'
    
def is_hidden_file_or_dir(filepath: str) -> bool:
    platform_name = sys.platform
    filepath = os.path.normpath(filepath)
    if platform_name.startswith('win'):
        # list with full paths of all parents in filepath except drive
        list_for_check = list(Path(filepath).parents)[:-1]
        list_for_check.append(Path(filepath))
        response = []
        for some_path in list_for_check:
            try:
                attrs = ctypes.windll.kernel32.GetFileAttributesW(str(some_path))
                assert attrs != -1
                result = bool(attrs & 2)
            except (AttributeError, AssertionError):
                result = False
            response.append(result)
        if any(response):
            return True
        return False
    elif platform_name.startswith('linux'):
        return bool('/.' in filepath)
    elif platform_name.startswith('darwin'):
        return bool('/.' in filepath)

# It is possible to bring 1) and 3) branches to a single algorithm.
# We can also do: recursive: bool, include_hidden: bool
def get_files(path: str, extension: str, recursive: bool, include_hidden: bool):
    result = []
    if extension == '.':
        print("1) extension == '.'")
        for root, dirs, files in os.walk(path):
            for f in files:
                f_path = os.path.join(root, f)
                f_extension = get_file_extension(f_path)
                # change here: skip what does not match the specified extension
                if f_extension != extension or not os.path.isfile(f_path):
                # if get_file_extension(f_path) or not os.path.isfile(f_path):
                    continue
                if include_hidden or not is_hidden_file_or_dir(f_path):
                    result.append(f_path)
            if not recursive:
                break
    elif not extension:
        print("2) not extension")
        # change here:
        for root, dirs, files in os.walk(path):
            for f in files:
                f_path = os.path.join(root, f)
                if not os.path.isfile(f_path):
                    continue
                if include_hidden or not is_hidden_file_or_dir(f_path):
                    result.append(f_path)
            if not recursive:
                break
    else:
        print("3) extension == 'some'")
        for root, dirs, files in os.walk(path):
            for f in files:
                f_path = os.path.join(root, f)
                f_extension = get_file_extension(f_path)
                # change here: skip what does not match the specified extension
                if f_extension != extension or not os.path.isfile(f_path):
                # if not f_extension or not os.path.isfile(f_path):
                    continue
                if include_hidden or not is_hidden_file_or_dir(f_path):
                    # if f_extension == extension:
                    result.append(f_path)
            if not recursive:
                break
    # return data for all requests in the same form -> list with strings(paths)
    return result

# C:/Users/Net/Count-files/tests/data_for_tests
# C:/Users/Net/Count-files/tests/test_hidden_windows
a = get_files('C:/Users/Net/Count-files/tests/test_hidden_windows', '', recursive=True, include_hidden=True)
b = get_files('C:/Users/Net/Count-files/tests/data_for_tests', '', recursive=False, include_hidden=False)
c = get_files('C:/Users/Net/Count-files/tests/test_hidden_windows', '', recursive=False, include_hidden=True)
d = get_files('C:/Users/Net/Count-files/tests/hidden_py', '', recursive=True, include_hidden=False)
print(len(a), a) # 6
print(len(b), b) # 4
print(len(c), c)# 2
print(len(d), d)# 2
"""
6 ['C:/Users/Net/Count-files/tests/test_hidden_windows\\hidden_wor_windows.txt',
'C:/Users/Net/Count-files/tests/test_hidden_windows\\not_hidden.txt',
'C:/Users/Net/Count-files/tests/test_hidden_windows\\folder_hidden_for_win\\hidden_for_win.py',
'C:/Users/Net/Count-files/tests/test_hidden_windows\\folder_hidden_for_win\\not_hidden.py',
'C:/Users/Net/Count-files/tests/test_hidden_windows\\not_nidden_folder\\hidden_for_win.xlsx',
'C:/Users/Net/Count-files/tests/test_hidden_windows\\not_nidden_folder\\not_hidden.xlsx']

4 ['C:/Users/Net/Count-files/tests/data_for_tests\\html_file_for_tests.html',
'C:/Users/Net/Count-files/tests/data_for_tests\\md_file_for_tests.md',
'C:/Users/Net/Count-files/tests/data_for_tests\\no_extension',
'C:/Users/Net/Count-files/tests/data_for_tests\\py_file_for_tests.py']

2 ['C:/Users/Net/Count-files/tests/test_hidden_windows\\hidden_wor_windows.txt',
'C:/Users/Net/Count-files/tests/test_hidden_windows\\not_hidden.txt']

2 ['C:/Users/Net/Count-files/tests/hidden_py\\no_extension',
'C:/Users/Net/Count-files/tests/hidden_py\\test_file.py']
"""
# C:/Users/Net/Count-files/tests/hidden_py
a = get_files('C:/Users/Net/Count-files/tests/hidden_py', '.', recursive=True, include_hidden=True)
b = get_files('C:/Users/Net/Count-files/tests/hidden_py', '.', recursive=False, include_hidden=False)
c = get_files('C:/Users/Net/Count-files/tests/hidden_py', '.', recursive=False, include_hidden=True)
d = get_files('C:/Users/Net/Count-files/tests/hidden_py', '.', recursive=True, include_hidden=False)
print(len(a), a) # 3
print(len(b), b) # 1
print(len(c), c)# 2
print(len(d), d)# 1
"""
3 ['C:/Users/Net/Count-files/tests/hidden_py\\no_extension',
'C:/Users/Net/Count-files/tests/hidden_py\\no_extension_hidden',
'C:/Users/Net/Count-files/tests/hidden_py\\hidden\\no_extension_sub']
1 ['C:/Users/Net/Count-files/tests/hidden_py\\no_extension']
2 ['C:/Users/Net/Count-files/tests/hidden_py\\no_extension',
'C:/Users/Net/Count-files/tests/hidden_py\\no_extension_hidden']
1 ['C:/Users/Net/Count-files/tests/hidden_py\\no_extension']
"""
a = get_files('C:/Users/Net/Count-files/tests/hidden_py', 'py', recursive=True, include_hidden=True)
b = get_files('C:/Users/Net/Count-files/tests/hidden_py', 'py', recursive=False, include_hidden=False)
c = get_files('C:/Users/Net/Count-files/tests/hidden_py', 'py', recursive=False, include_hidden=True)
d = get_files('C:/Users/Net/Count-files/tests/hidden_py', 'py', recursive=True, include_hidden=False)
print(len(a), a) # 4
print(len(b), b) # 1
print(len(c), c)# 2
print(len(d), d)# 1
"""
4 ['C:/Users/Net/Count-files/tests/hidden_py\\hidden_test.py',
'C:/Users/Net/Count-files/tests/hidden_py\\test_file.py',
'C:/Users/Net/Count-files/tests/hidden_py\\hidden\\hidden_test_sub.py',
'C:/Users/Net/Count-files/tests/hidden_py\\hidden\\test_file_sub.py']
1 ['C:/Users/Net/Count-files/tests/hidden_py\\test_file.py']
2 ['C:/Users/Net/Count-files/tests/hidden_py\\hidden_test.py',
'C:/Users/Net/Count-files/tests/hidden_py\\test_file.py']
1 ['C:/Users/Net/Count-files/tests/hidden_py\\test_file.py']
"""

Allow to choose inclusion or exclusion of hidden files/directories when using --file-extension or -fe

Following the pattern of all command-line arguments being optional and freely combinable, I think countfiles should allow choosing inclusion or exclusion of hidden files/directories when using --file-extension or -fe, by adding the -a or --all switches.

Example:

countfiles -fe html -a ~/Documents
countfiles --file-extension html --all ~/Documents

Try to improve performance while searching files with feedback or list display

I noticed that when searching for all files on Mac (for instance without extension), it takes a lot more time when printing the file paths to the screen. Turning off the list (countfiles -fe . -nl -nf -a /) takes about 1m05s, while with the list (count-files -fe . / -a) it takes about 1m40s, an increase of about 50%. Without list but with feedback (count-files -fe . / -a -nl), it takes 1m30s (a bit more than without feedback, but less than with the full list).

These tests were conducted on a machine with a fast SSD, but on slower computers it could make an even bigger difference.

Wondering if splitting it into two threads sharing a FIFO queue would make it significantly snappier without making the codebase a lot more complex.
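A rough sketch of that producer/consumer arrangement, using only the standard library. The function names are hypothetical, the scan is replaced by a plain list of paths, and the "printing" step appends to a list so the example stays self-contained. Whether this would actually be faster in practice is not something this sketch demonstrates:

```python
import queue
import threading


def scan(paths, out_queue):
    """Producer: stands in for the directory scan, pushing each path found."""
    for path in paths:
        out_queue.put(path)
    out_queue.put(None)  # sentinel: tells the consumer there is nothing more


def display(in_queue, sink):
    """Consumer: takes paths in FIFO order; a real tool would print them."""
    while True:
        path = in_queue.get()
        if path is None:
            break
        sink.append(path)


def run(paths):
    fifo = queue.Queue()
    shown = []
    consumer = threading.Thread(target=display, args=(fifo, shown))
    consumer.start()
    scan(paths, fifo)   # the scan keeps going while the consumer displays
    consumer.join()
    return shown
```

queue.Queue handles the locking between the two threads, so the added complexity is mostly limited to the sentinel and the join.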

Add a new option to count only files with a given extension, but without listing them.

Similar to countfiles -fe, but without actually listing any files, just displaying the stats.

E.g.:

countfiles -fe txt -nl DIRPATH
countfiles --file-extension txt --no-list DIRPATH

Sample Output:

$ countfiles -fe txt -nl ~/Desktop

Recursively counting all files with extension .txt, ignoring hidden files and directories, in /Users/Username/Desktop

   Found 37 file(s).
   Total combined size: 93.2 KiB.
   Average file size: 2.5 KiB (max: 6.0 KiB, min: 109.0 B).

$

Add file search by other criteria

Add file search by other criteria:

date last modified;

name starts with (name prefix);

name ends with (name suffix);

name contains;

extension starts with (extension prefix);

extension ends with (extension suffix);

extension contains;

file size bigger than;

file size smaller than.

Originally posted by @victordomingos in #84 (comment)

It will be difficult, since there are many filters. We can use os.stat, glob or pathlib.
Operating systems usually have their own tools for searching and sorting files and folders according to a pattern.
We need to think about which tasks we can solve here without duplicating them.

Originally posted by @NataliaBondarenko in #84 (comment)

Refactored as a package - some tasks pending

Update README.md with the correct instructions to run the package.
Update README.md with installation guidelines.
Create a setup.py that includes the correct command-line entry point to run the application (maybe we'll need to rename the main module or create a separate one for that purpose). (I need help with this one!)

ModuleNotFoundError on Pythonista/iOS

After installing on Pythonista for iOS, I get the following traceback:

Traceback (most recent call last):
  File "/private/var/mobile/Containers/Shared/AppGroup/F3C0E711-6D38-4FDF-81F2-DC3B97E4E9F1/Pythonista3/Documents/stash_extensions/bin/countfiles.py", line 2, in <module>
    from countfiles.__main__ import main_flow
  File "/private/var/mobile/Containers/Shared/AppGroup/F3C0E711-6D38-4FDF-81F2-DC3B97E4E9F1/Pythonista3/Documents/stash_extensions/bin/countfiles.py", line 2, in <module>
    from countfiles.__main__ import main_flow
ModuleNotFoundError: No module named 'countfiles.__main__'; 'countfiles' is not a package

With the kind help of another Pythonista user at their support forum, we found out that it has to do with the fact that the entry point (a small Python script generated by setup.py) has the same name as the package. So, an easy solution seems to be renaming the entry point to something like “count-files” (with a dash).

Improve recursion when ignoring hidden directories

As suggested by Daniel Baboiu @ Linkedin:

Recursive search does a full recurse of hidden directories, only to ignore all files in them.

What would be the best way to do recursion and skip hidden directories?
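One common approach is to prune hidden directories from os.walk's dirs list in place, so their contents are never visited at all. A minimal sketch, assuming the Unix dot-prefix convention for hidden names (the function name is illustrative):

```python
import os


def walk_visible(top):
    """Yield non-hidden file paths, never descending into hidden directories."""
    for root, dirs, files in os.walk(top):
        # In-place assignment: os.walk only recurses into what remains in dirs.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            if not name.startswith("."):
                yield os.path.join(root, name)
```

This only works with the default top-down traversal, since with topdown=False the subdirectories have already been visited by the time dirs is seen.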

Improve (or disable) text preview for binary files

Displaying a text preview for non-text files does not seem that useful, as the content shown will most often not even be readable. So, it seems to me that there are two options for improvement:

A. Detect if it is a text file (how?) and only show a preview if that's the case.
B. Detect the file type (maybe using puremagic) and display useful metadata (for instance, image size) if available.
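For option A, a frequently used heuristic is to read a small sample from the start of the file and treat it as binary if it contains a NUL byte. The sketch below assumes that heuristic. The function name and sample size are illustrative, and it will misclassify some encodings (UTF-16 text, for example, contains NUL bytes):

```python
def looks_like_text(path, sample_size=1024):
    """Heuristic: a file with a NUL byte in its first bytes is likely binary."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)  # never reads more than sample_size
    return b"\x00" not in sample
```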

Search or count for multiple extensions

Search or count for multiple extensions: count-files --multi-extensions txt py

Originally posted by @NataliaBondarenko in #84 (comment)

Add some new preview systems for common file types:

In the standard library, there are tools for working with common file types. Text files can be read.
But what kind of data could be used for previewing images or PDF files?

Originally posted by @NataliaBondarenko in #84 (comment)

Incorrect error messages

$ count-files.py optimize-images/ -a -fe png -p
 Recursively searching all files with (case-insensitive) extension .png,
including hidden files and directories, in optimize-images/
Sorry, there is no preview available for this file type. You may want to try again without preview.
This is the list of currently supported file types for preview: c, css, html, js, json, md, py, txt.
Previewing files without extension is not supported. You can use the "--preview" argument together with the search for all files regardless of the extension ("--file-extension .."). In this case, the preview will only be displayed for files with a supported extension.

Missing tests – get_files_without_extension()

The function get_files_without_extension() now has a recursive keyword argument, so it would need to get its test suite updated.

dev branch

We also need to create a branch for development, for experiments with new and old features. If the changes are successful, they will be transferred to the master (production) branch.

Failing test on macOS: test_some_functions.py

$ python3.7 tests/test_some_functions.py 
F.......s..s
======================================================================
FAIL: test_count_files_by_extension (__main__.TestSomeFunctions)
Testing def count_files_by_extension, case_sensitive and recursive params.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_some_functions.py", line 104, in test_count_files_by_extension
    self.assertEqual(str(result1), "Counter({'gz': 3, 'txt': 2, 'md': 2, '[no extension]': 2, 'py': 2, "
AssertionError: "Counter({'gz': 3, 'md': 2, 'txt': 2, '[no extension]': 2, 'p[56 chars] 1})" != "Counter({'gz': 3, 'txt': 2, 'md': 2, '[no extension]': 2, 'p[56 chars] 1})"
- Counter({'gz': 3, 'md': 2, 'txt': 2, '[no extension]': 2, 'py': 2, 'TXT': 1, 'html': 1, 'json': 1, 'css': 1, 'woff': 1})
?                  ---------
+ Counter({'gz': 3, 'txt': 2, 'md': 2, '[no extension]': 2, 'py': 2, 'TXT': 1, 'html': 1, 'json': 1, 'css': 1, 'woff': 1})
?                            +++++++++


----------------------------------------------------------------------
Ran 12 tests in 0.005s

FAILED (failures=1, skipped=2)
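The diff above shows that both Counters hold the same counts and only the order of tied entries ('md' and 'txt') differs in the string form. A small sketch of why comparing the objects themselves is more robust than comparing str() output (the sample data is made up for illustration):

```python
from collections import Counter

# Same counts, but tied extensions were encountered in a different order.
first = Counter(["gz", "gz", "gz", "md", "md", "txt", "txt"])
second = Counter(["gz", "gz", "gz", "txt", "txt", "md", "md"])

same_counts = first == second            # mapping equality ignores order
same_repr = str(first) == str(second)    # string form keeps encounter order for ties

print(same_counts, same_repr)
```

In a test, an assertion such as assertEqual(result, Counter({...})) would therefore not depend on the order in which the file system returns the files.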

Make the file extensions case insensitive by default and add a new option to allow case-sensitive behaviour

From the README:

The list of file types for which preview is available can be viewed with the `-st` or `--supported-types` argument.
The names of extensions are case sensitive. The results for `ini` and `INI` will be different.

I am thinking that we should make it case-insensitive by default, and add a new option (-cs or --case-sensitive) to search or count with case sensitivity. It shouldn't be very complicated, right?
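A minimal sketch of how such a switch could work, assuming extensions are folded to lower case unless case sensitivity is requested. The function name, the parameter and the choice of lower case are assumptions, not the project's implementation:

```python
import os
from collections import Counter


def count_extensions(filenames, case_sensitive=False):
    """Count extensions, folding case unless case_sensitive is True."""
    counts = Counter()
    for name in filenames:
        ext = os.path.splitext(name)[1][1:]
        if not case_sensitive:
            ext = ext.lower()  # assumption: normalise to lower case
        counts[ext] += 1
    return counts


print(count_extensions(["settings.ini", "OTHER.INI"]))
print(count_extensions(["settings.ini", "OTHER.INI"], case_sensitive=True))
```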

Huge spike in memory usage when searching with preview (large files)

In some cases, the application seems to freeze while trying to generate a file preview for a big file (for instance, countfiles -fe . -p ~/). In my case, it happened with mailbox files that were about 1.2GB in size. Memory usage by the Python process spiked to about 10GB and there was no visible output for several minutes, when a short text preview for that file should have been shown. Doing a simple $ head FILEPATH in the shell seems to open the file correctly and print the expected file contents.

The possible culprit seems to be generate_preview() (to begin with, shouldn't we be opening the file in a context manager?). We need to check its behaviour and change it as needed. Is it reading the whole file content into a bytes object before slicing the first part? If so, let's make sure it reads only the required amount of data from the file.
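A sketch of a bounded read, assuming the preview only needs the first few hundred bytes. The function name read_preview and its parameters are hypothetical and do not reflect the actual signature of generate_preview():

```python
def read_preview(path, max_bytes=200):
    """Return a short one-line preview without loading the whole file."""
    # The context manager closes the file even if reading fails.
    with open(path, "rb") as handle:
        chunk = handle.read(max_bytes)  # reads at most max_bytes from disk
    # Undecodable bytes are replaced rather than raising an error.
    return chunk.decode("utf-8", errors="replace").replace("\n", " ")
```

Because read() is given an explicit size, memory use stays proportional to max_bytes regardless of how large the file is.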

Optimize the search with skipping root hidden folders.

Hello! How about reducing the file checks? Please try this, I think it can be useful.
Example of skipping hidden root folders on each iteration of the loop in def search_files(), using the check
if not include_hidden: if is_hidden_file_or_dir(root): continue

for root, dirs, files in os.walk(dirpath):
    if not include_hidden:
        if is_hidden_file_or_dir(root):
            print('SKIPPED:')
            print('ROOT:', root, '- is hidden')
            print(len(files), 'skipped files in hidden root')
            # if root is hidden skip all files
            continue
    for f in files:
        # do necessary checks with files from not hidden folders

Example output:

This way, you can skip unnecessary scanning of files/dirs located in known hidden folders.
I think this will allow us to speed up the processing of files a little.
Also, if the root does not contain any hidden folders, then you only need to check the file names for a leading dot (for Unix). For Windows it's more difficult, since attributes are requested for each part of the path.
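
A related variant, sketched here under the Unix dot-name convention only: with os.walk() in its default top-down mode, removing entries from `dirs` in place stops the walk from descending into them at all, so hidden subtrees are never visited. The helper name `walk_visible` is hypothetical:

```python
import os
import tempfile


def walk_visible(dirpath):
    """Yield paths of non-hidden files, never descending into dot-folders."""
    for root, dirs, files in os.walk(dirpath):
        # In-place slice assignment: os.walk (top-down) skips removed dirs.
        dirs[:] = [d for d in dirs if not d.startswith('.')]
        for f in files:
            if not f.startswith('.'):
                yield os.path.join(root, f)


# Demo tree: one hidden folder, one visible folder, one hidden file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, '.git'))
    os.makedirs(os.path.join(tmp, 'src'))
    for rel in ['.git/config', 'src/main.py', '.hidden']:
        open(os.path.join(tmp, *rel.split('/')), 'w').close()
    found = sorted(os.path.basename(p) for p in walk_visible(tmp))

print(found)
```

Because hidden folders are pruned as soon as they are seen, only the names in the current folder need checking, not every part of the path.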

Hidden files and directories detection does not work on Windows

Currently, the detection of hidden files and directories is based only on the .filename convention used in Unix-like operating systems. That is not the case in Windows, where files with names starting with a dot are not necessarily expected to be hidden and where hidden files can have names starting with any other of the allowed characters.

In order to implement this feature on Windows, we will first need to detect the family of the host operating system and then adjust accordingly. In principle, it should be pretty easy by using something like:

import os
if os.name == 'nt':

To determine if a file or directory is hidden on Windows, we can test the st_file_attributes field of the result returned by os.stat() against the stat.FILE_ATTRIBUTE_HIDDEN flag:

https://docs.python.org/3/library/stat.html
https://msdn.microsoft.com/en-us/library/windows/desktop/gg258117.aspx

However, I have a little problem here, as I don't have easy access to a Windows machine to test if it will work as expected...
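
A possible shape for the check, as a sketch only: the function name `is_hidden` is an assumption, and the Windows branch relies on `st_file_attributes`, which os.stat() provides only on Windows.

```python
import os
import stat


def is_hidden(path):
    """Return True if path is hidden by the host system's convention."""
    if os.name == 'nt':
        # Windows: the hidden state is a file attribute, not a naming rule.
        attrs = os.stat(path).st_file_attributes
        return bool(attrs & stat.FILE_ATTRIBUTE_HIDDEN)
    # Unix-like systems: names starting with a dot are hidden.
    name = os.path.basename(os.path.abspath(path))
    return name.startswith('.')


if os.name != 'nt':
    print(is_hidden('.gitignore'), is_hidden('README.md'))
```

On Unix-like systems the function never touches the file system, so it also works for paths that do not exist; on Windows the path must exist for os.stat() to succeed.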

How to count symbolic links and windows shortcuts?

Example with symbolic link

Example with Windows shortcut

Should I pay attention to such files or count them on a par with others?

Originally posted by @NataliaBondarenko in #84 (comment)

Simplify word increment code

As suggested by Daniel Baboiu @ Linkedin:

For example, lines 46-49 (if/else increasing count for word) can be written as
self.counters[word] = self.counters.get(word,0)+1
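
For illustration, here is the suggested one-liner on a plain dict (standing in for self.counters), next to collections.Counter, which gives the same totals without an explicit loop. The sample word list is made up:

```python
from collections import Counter

words = ['py', 'txt', 'py', 'md', 'py']

# The suggested form: .get() supplies 0 for words not seen yet.
counters = {}
for word in words:
    counters[word] = counters.get(word, 0) + 1

# Equivalent result using the standard library.
same_result = Counter(words)

print(counters)
print(counters == same_result)
```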

While recursively scanning in large directories there is no feedback

Scanning a large number of files recursively in all subdirectories can take some minutes, but currently the user gets no feedback during that time. There should be some [optional?] indicator that the application is still working in those cases. For instance, printing a total file/folder count every few seconds, or after processing a few thousand files, before generating and printing the final report.
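
One way the "every few thousand files" idea could be sketched. The function name, the `every` parameter and the choice of stderr for progress output are all assumptions:

```python
import io
import os
import sys
import tempfile


def count_with_feedback(dirpath, every=1000, stream=sys.stderr):
    """Count files under dirpath, reporting progress every `every` files."""
    total = 0
    for root, dirs, files in os.walk(dirpath):
        for _ in files:
            total += 1
            if total % every == 0:
                # \r rewrites the same terminal line instead of scrolling.
                print(f'\r{total} files scanned...', end='',
                      file=stream, flush=True)
    if total >= every:
        print(file=stream)  # finish the progress line before the report
    return total


# Demo: five files, progress reported every two files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        open(os.path.join(tmp, f'file{i}.txt'), 'w').close()
    log = io.StringIO()
    total = count_with_feedback(tmp, every=2, stream=log)

print(total)
```

Writing progress to stderr keeps the final report on stdout clean when the output is piped to a file.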

If there is a very long extension, the table won't display properly

When counting files, if there happens to be a file whose name has a dot followed by a very long text, it may not fit the terminal window, making the output look weird. Also, such a very long extension probably shouldn't be considered an extension. Usually file name extensions are small abbreviations or small words. While some filesystems allow for longer file names, which could in theory have long extensions, there is probably no use case for very long file extensions as a way to classify file types.

Therefore, I suggest we limit the extension size to a maximum of 65 characters, in order to accommodate even some large numbers in the table without the need to truncate text. Any file extension longer than the specified maximum length would be treated as [no extension].
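
The rule could be sketched like this. The 65-character limit and the [no extension] label come from the proposal above; the function and constant names are hypothetical:

```python
import os

MAX_EXTENSION_LENGTH = 65  # limit proposed in this issue


def get_extension(filename, max_length=MAX_EXTENSION_LENGTH):
    """Return the extension without the dot, or '[no extension]'."""
    ext = os.path.splitext(filename)[1][1:]
    if not ext or len(ext) > max_length:
        # Missing or overly long extensions share the same bucket.
        return '[no extension]'
    return ext


print(get_extension('report.txt'))
print(get_extension('weird.' + 'x' * 66))
```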

Test fails: test_viewing_modes.py

This test is failing on Python 3.7 (macOS 10.12 Sierra):

$ python3.7 test_viewing_modes.py 
..F.
======================================================================
FAIL: test_show_result_for_search_files (__main__.TestViewingModes)
Testing def show_result_for_search_files. Search by extension.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_viewing_modes.py", line 137, in test_show_result_for_search_files
    shallow=False), True)
AssertionError: False != True

----------------------------------------------------------------------
Ran 4 tests in 0.004s

FAILED (failures=1)

Allow to search for files with no file extension

I propose using a dot (.) to mean no extension, when using -fe or --file-extension:

countfiles.py -fe .
countfiles.py --file-extension .

The above lines would list files like .gitignore, Pipfile.

When recursively searching for files without extension, files with names starting with dot are ignored

A leading dot in a file name does not, by itself, mark an extension. For instance ".gitignore" has no extension, but a file like ".hidden_file.txt" has one. The function get_files_without_extension_path() needs to be adjusted according to this principle.

README.md is out of date

We need to update it with some examples for the new features.

Add a new option to list all files, regardless of their extension

Similar to countfiles -fe, but listing all files, with or without an extension in their file names.

E.g.:

countfiles -l DIRPATH
countfiles --list-files DIRPATH

The other parameters (recursion/no-recursion, include-hidden) would be in place as usual.

Improving the initial `print()` statement that shows the selected options

@NataliaBondarenko, in the latest commit you made a few changes in the print statements that provide feedback to the user about the options that were selected and that will be used in the program execution. I think it needs some visual improvement. I tend to prefer to display a sentence in natural language, as it allows for a relatively concise (fewer lines) presentation for the user, but I understand that there may be situations where that would not be a good option.

This can and probably should be made a separate function, for the sake of better code readability and maintenance.

Before (yes, it was missing information about recursion):

Searching for .txt files, including hidden items, in /Users/fact/Library/Application Support/Postbox/Profiles/2ea5zkkp.default/ImapMail/imap.googlemail.com/.

Now:

Search options
location: /Users/fact/Library/Application Support/Postbox/Profiles/2ea5zkkp.default/ImapMail/imap.googlemail.com/
extension: txt
recursion: False
include hidden: True

I would like to hear your thoughts on this before making any changes in this regard.
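
To make the discussion concrete, here is a hypothetical separate function that produces the natural-language variant and also states the recursion setting that the old message omitted. The name `describe_search` and the exact wording are only a suggestion:

```python
def describe_search(location, extension, recursive=True, include_hidden=False):
    """Build a one-sentence summary of the selected search options."""
    what = 'files without extension' if extension == '.' else f'.{extension} files'
    how = 'recursively' if recursive else 'without recursion'
    hidden = 'including' if include_hidden else 'ignoring'
    return f'Searching {how} for {what}, {hidden} hidden items, in {location}.'


print(describe_search('/tmp', 'txt', recursive=False, include_hidden=True))
```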

Consider switching to os.scandir()

The documentation mentions that os.scandir() may offer better performance (probably by yielding entries lazily instead of building a giant list in one step). This could result in significant performance improvements for big queries.

Code example from docs:

with os.scandir(path) as it:
    for entry in it:
        if not entry.name.startswith('.') and entry.is_file():
            print(entry.name)
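
The docs example is not recursive. A recursive variant could be sketched as follows, assuming hidden items are detected by the dot-name convention only and that symlinked directories are not followed. The name `iter_files` is hypothetical:

```python
import os
import tempfile


def iter_files(path, include_hidden=False):
    """Recursively yield file paths under path using os.scandir()."""
    with os.scandir(path) as it:
        for entry in it:
            if not include_hidden and entry.name.startswith('.'):
                continue
            if entry.is_dir(follow_symlinks=False):
                yield from iter_files(entry.path, include_hidden)
            elif entry.is_file():
                yield entry.path


# Demo tree with one nested folder and one hidden file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ['a.txt', 'sub/b.py', '.hidden']:
        open(os.path.join(tmp, *rel.split('/')), 'w').close()
    found = sorted(os.path.basename(p) for p in iter_files(tmp))

print(found)
```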

Failing test for test_word_counter.py (macOS)

$ python3.7 tests/test_word_counter.py
LOCATION:  tests/compare_tables/test_2columns_sorted.txt
LOCATION:  tests/compare_tables/test_2columns_most_common.txt
LOCATION:  tests/data_for_tests

LOCATION:  tests/compare_tables/2columns_sorted.txt
LOCATION:  tests/compare_tables/2columns_most_common.txt
.LOCATION:  tests/data_for_tests
LOCATION:  tests/compare_tables/test_show_result_no_list.txt
LOCATION:  tests/compare_tables/show_result_no_list.txt
F
======================================================================
FAIL: test_show_result_for_search_files (__main__.TestWordCounter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_word_counter.py", line 57, in test_show_result_for_search_files
    shallow=False), True)
AssertionError: False != True

----------------------------------------------------------------------
Ran 2 tests in 0.012s

FAILED (failures=1)

About the ArgumentParser arguments.

There will be a lot of text here :) But in this case I need to give examples.
I want to consult with you about some of the features of using a parser.
The current usage help:
python -m countfiles -h

usage: __main__.py [-h] [-a] [-alpha] [-nr] [-nt] [-fe FILE_EXTENSION]
                         [-p] [-ps PREVIEW_SIZE] [-nl]
                         [path]

Count files, grouped by extension, in a directory. By default, it will count
files recursively in current working directory and all of its subdirectories,
and will display a table showing the frequency for each file extension (e.g.:
.txt, .py, .html, .css) and the total number of files found. Any hidden files
or folders (those with names starting with '.') are ignored by default.

positional arguments:
  path                  The path to the folder containing the files to be
                        counted.

optional arguments:
  -h, --help            show this help message and exit
  -a, --all             Include hidden files and directories (names starting
                        with '.')
  -alpha, --sort-alpha  Sort the table alphabetically, by file extension.
  -nr, --no-recursion   Don't recurse through subdirectories
  -nt, --no-table       Don't show the table, only the total number of files
  -fe FILE_EXTENSION, --file-extension FILE_EXTENSION
                        Search files by file extension (use a single dot '.'
                        to search for files without any extension)
  -p, --preview         Display a short preview (only available for text files
                        when using '-fe' or '--file_extension')
  -ps PREVIEW_SIZE, --preview-size PREVIEW_SIZE
                        Specify the number of characters to be displayed from
                        each found file when using '-p' or '--preview')
  -nl, --no-list        Don't show the list, only the total number of files
                        and information about file sizes

The user can use, for example, such a set of commands:

# -fe has no representation in the form of a table/no table, the default list will be shown
print(parser.parse_args(['~/Count-files', '-fe', 'txt', '-nt']))
# the table will be displayed by default, not preview
print(parser.parse_args(['~/Count-files', '-p']))

Namespace(all=False, file_extension='txt', no_list=False, no_recursion=False, no_table=True, path='~/Count-files', preview=False, preview_size=395, sort_alpha=False)
Namespace(all=False, file_extension=None, no_list=False, no_recursion=False, no_table=False, path='~/Count-files', preview=True, preview_size=395, sort_alpha=False)

In general, nothing prevents the user from combining options that do not work together.
Perhaps there is a need for more explicit instructions on how to use the parser?
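
Besides better instructions, the parser itself could reject such combinations. Below is a reduced sketch (only four of the arguments, rebuilt here for illustration) that uses ArgumentParser.error(), which prints the usage line and exits with status 2. The wrapper name `parse_checked` and the error messages are assumptions:

```python
import argparse

parser = argparse.ArgumentParser(prog='countfiles')
parser.add_argument('path', nargs='?', default='.')
parser.add_argument('-nt', '--no-table', action='store_true')
parser.add_argument('-fe', '--file-extension')
parser.add_argument('-p', '--preview', action='store_true')


def parse_checked(argv):
    """Parse argv and reject option combinations that have no effect."""
    args = parser.parse_args(argv)
    if args.preview and args.file_extension is None:
        parser.error("'-p' is only available together with '-fe'")
    if args.no_table and args.file_extension is not None:
        parser.error("'-nt' has no effect when searching with '-fe'")
    return args


print(parse_checked(['~/Count-files', '-fe', 'txt', '-p']))
```

With this check, both example commands above would stop with an error message instead of silently ignoring one of the options.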

changelog.txt is out of date

We need to summarize the main changes since v.1.3, and try to wrap up this version for PyPI submission.

The current feature set seems to be pretty stable by now, so with a few adjustments I think we could advance to the TestPyPI stage.

Filenames that start with dot (hidden files on Unix-like systems) should not be treated as extensions

For instance, .gitignore should be treated as a filename, not an extension.

Also, we should be making use of the existing os.path.splitext() instead of regular string splitting, as it will probably avoid some errors. That function already takes care of the .name issue mentioned above.
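
A quick check of how os.path.splitext() treats these names (the file names are just examples): leading dots are kept as part of the root, and only the last suffix counts as the extension.

```python
import os

names = ['.gitignore', '.hidden_file.txt', 'Pipfile', 'archive.tar.gz']

# Map each name to its (root, ext) pair as returned by splitext().
results = {name: os.path.splitext(name) for name in names}

for name, (root, ext) in results.items():
    print(repr(name), '->', repr(root), repr(ext))
```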

Excluding folders or file extensions

include/exclude some dirs: count-files --exclude folder1 folder2
include/exclude some types: count-files --exclude ini (system files or those for which there is no preview)

Originally posted by @NataliaBondarenko in #84 (comment)

Failing tests after changes related to packaging

@NataliaBondarenko, I added a short setup.py and for that reason needed to alter the definition of main_flow(), removing its args parameter. When installing as a package with pip install -e, the package seems to be correctly installed and the countfiles command-line entry point seems to accept all the parameters as expected.

However, we now have some tests failing in test_argument_parser.py. I am not sure if I did anything wrong, or if we just need to somehow update the call to main_flow(). Do you know how to set up unittest to properly detect the command-line arguments in these cases?
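
One common approach, sketched with a stand-in main_flow() (the real one is not shown here, so this body is an assumption): when the entry point takes no parameters and argparse reads sys.argv itself, a test can patch sys.argv for the duration of the call.

```python
import argparse
import sys
import unittest
from unittest import mock


def main_flow():
    """Hypothetical stand-in for an entry point that reads sys.argv itself."""
    parser = argparse.ArgumentParser(prog='countfiles')
    parser.add_argument('path', nargs='?', default='.')
    parser.add_argument('-a', '--all', action='store_true')
    return parser.parse_args()  # no argument: argparse uses sys.argv[1:]


class TestArgumentParser(unittest.TestCase):
    def test_all_flag(self):
        # sys.argv[0] is the program name; the rest are the CLI arguments.
        with mock.patch.object(sys, 'argv', ['countfiles', 'some_dir', '-a']):
            args = main_flow()
        self.assertTrue(args.all)
        self.assertEqual(args.path, 'some_dir')


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestArgumentParser)
result = unittest.TextTestRunner().run(suite)
```

An alternative is to give main_flow() an optional parameter that defaults to None and pass it through to parse_args(), so tests can supply a list directly while the installed entry point still calls it without arguments.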
