Today I was thinking that instead of building a comprehensive grammar into a lightweight tool like this, it would be useful to allow it to serve other programs as well as possible. An example could be performing operations on files hierarchies (for example deletion).
Naturally, a BFS traversal of a directory tree will find it hard to actually remove the entire tree in one go since the folders one encounters first can't easily be removed. Yet the files can.
Instead of:
This would be possible:
# delete all files in a file hierarchy
$ bfs somedir -type f | parallel -X rm
# get rid of empty husk
$ rm -rf somedir
So that's great, doing one thing and one thing well. I wondered whether it could be made even better (and by better I mean faster):
It would seem that sorting by inode would be a good shot. Buffering up an array and sorting it in C is needlessly messy, and there are other tools that can do that. But bfs(1)
would need to output the inode number as calling an external program to stat(2)
every file is very expensive. Luckily, stat(2)
is not necessary since the inode number is present in the dirent
struct.
So we can imagine:
$ bfs -i .
10720996 ./.git
10721207 ./.gitignore
10982737 ./bfs
10979979 ./bfs.c
10982713 ./bfs.d
10982714 ./bfs.o
Notice that the output is not sorted by inode number. Naively doing:
$ bfs -i -type f . | sort -n | parallel -X rm -f
Would maybe not be ideal as it is very possible that this places files from lots of different directories close to each other. Intuitively it seems best to "empty" one directory in one god. this would need to be benchmarked though, because I'm just grasping at straws here.
So we could compare that to sorting by inode within a directory. For that it would be sufficient to output the inode of the parent directory as well:
$ bfs -i -p -type f .
...
10720996 10979934 ./.git/ORIG_HEAD
10720996 10721190 ./.git/packed-refs
10720996 10720997 ./.git/refs
10721002 10721003 ./.git/hooks/applypatch-msg.sample
10721002 10721004 ./.git/hooks/commit-msg.sample
10721002 10721005 ./.git/hooks/post-update.sample
...
$ bfs -i -p -type f . | sort -n -k1,2 | cut -f 2- -d' ' | parallel -X rm
That seems nice, if the intuition about what's more convenient for a file system is accurate, which remains to be seen. That said, this will also place directories in non-BFS order, which might be important. An alternative is to output a generation number instead of the parent directories' inode:
$ bfs -i -g -type f .
...
48 10977657 ./.git/objects/a4/7cee3a89da7a2cff6b615e381fd486d5d9c06e
49 10977653 ./.git/objects/a5/41b1002b2830a0bde86a2e50bf7e58cd41ea85
50 10721056 ./.git/objects/ac/11476e20136976bf689acf847fdcea4e05e37a
50 10976289 ./.git/objects/ac/2e304d1fca0aab2eebba954851c3ef759aaaf5
50 10721187 ./.git/objects/ac/6b2e7981854281d75b24fbb972933a25cb5bcc
51 10721048 ./.git/objects/ad/628c395cc45add3e3a87f21250cbd0df1d860e
52 10979608 ./.git/objects/b0/4f245cb03897e2616aaeb643cb03244993b4cd
53 10721137 ./.git/objects/b1/337c8e1dbfe6ae1b9d79e2b16af210562c2e17
53 10721119 ./.git/objects/b1/d430ec463fd1d49bf47443cb143af0e8e9905e
54 10721080 ./.git/objects/b2/510860f1a20cbaab76b3efca68efe9bc530c66
55 10741243 ./.git/objects/b4/0ccb5fe4e00584ec867706b71db707bf3f63ce
56 10721111 ./.git/objects/be/538a8a3cd111795549eecb16f05bce8c8ba907
...
This will allow one to sort only within directories, keeping BFS traversal order.
However, this still has a big problem: sort(1)
wait until there's no more input before sorting and outputting. A bfs(1)
run on a very large directory tree could take a long time, and in the meanwhile one could be performing operations on files instead of twiddling thumbs.
Something like this could work
bfs -i -dirdelim='\n' -type f | parallel --pipe --recend='\n\n' "sort -n | cut -f 2- -d' ' | parallel -X rm"
While it looks like it could work, it doesn't somehow, parallel(1)
doesn't understand. It also looks complicated, which is almost never good.
Anyway, I just wanted to write down my thoughts for possible features. I've implemented a few of them (the output I pasted is mostly real :).