dharple / detox
Tames problematic filenames
License: BSD 3-Clause "New" or "Revised" License
table_get() and table_put() can fail on small tables when the table size and key are multiples. I discovered this by accident while working on #40, using a Unicode value of 0x4000.
I forgot to update configure.ac, and now detox 1.3.2 reads as version 1.3.0.
Text::Unidecode has already solved transliteration, more or less.
There's no common history between them.
Eriberto, the Debian maintainer, has requested that I replace the existing Makefile.in and configure.in with more modern ones, Makefile.am and configure.ac.
Add a regression test confirming that the max_length filter does the right thing when confronted with a .tar.gz or similar extension.
For instance, super long filename.tar.gz should be reduced to super_lon.tar.gz rather than super_long_fil.gz.
Noticed while working on detoxrc.5 for #22.
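The behavior the regression test should confirm can be sketched roughly as follows. This is a minimal illustration, not detox's actual max_length code: the function names find_extension and truncate_keep_ext are hypothetical, and the rule that ".tar.<something>" counts as one compound extension is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return a pointer to the start of the extension
 * inside name, or to the terminating NUL when there is none.
 * Assumption: ".tar.<x>" is treated as a single compound extension.
 */
static const char *find_extension(const char *name)
{
    const char *end = name + strlen(name);
    const char *dot = strrchr(name, '.');

    if (dot == NULL || dot == name)
        return end;                 /* no extension, or a dot file */
    if (dot - name > 4 && strncmp(dot - 4, ".tar", 4) == 0)
        return dot - 4;             /* keep ".tar.gz" and friends together */
    return dot;
}

/*
 * Shorten name to at most max_len bytes by cutting the stem and keeping
 * the extension. Returns 0 on success, -1 if the result cannot fit.
 */
int truncate_keep_ext(const char *name, size_t max_len,
                      char *out, size_t out_size)
{
    size_t len, ext_len, keep;
    const char *ext;

    assert(name != NULL && out != NULL);

    len = strlen(name);
    ext = find_extension(name);
    ext_len = strlen(ext);

    if (len <= max_len) {           /* already short enough: plain copy */
        if (len + 1 > out_size)
            return -1;
        memcpy(out, name, len + 1);
        return 0;
    }
    if (ext_len >= max_len || max_len + 1 > out_size)
        return -1;                  /* no room left for any stem */

    keep = max_len - ext_len;       /* bytes of the stem that survive */
    memcpy(out, name, keep);
    memcpy(out + keep, ext, ext_len + 1);
    return 0;
}
```

With a limit of 16 bytes, super_long_filename.tar.gz keeps its whole .tar.gz suffix and only the stem is cut. The sketch counts bytes, so it would need extra care to avoid splitting a multi-byte UTF-8 character.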
Add a regression test confirming that DETOX_SEQUENCE overrides the default sequence.
Noticed while working on detoxrc.5 for #22.
Something like this
$ stat *Syphilect*
File: ' '$'\n\n''Epidemic Consummation'$'\n''by Syphilectomy'$'\n'
I did a mkdir from a clipboard copy of a web page. That's why it has the newlines (0x0a) in it.
The filename is exactly:
anliot@ace ~/music/ Epidemic Consummationby Syphilectomy $ pwd |od -t x1
0000000 2f 68 6f 6d 65 2f 61 6e 6c 69 6f 74 2f 6d 75 73
0000020 69 63 2f 20 20 20 20 20 0a 0a 45 70 69 64 65 6d
0000040 69 63 20 43 6f 6e 73 75 6d 6d 61 74 69 6f 6e 0a
0000060 62 79 20 53 79 70 68 69 6c 65 63 74 6f 6d 79 0a
0000100 0a
0000101
this didn't crash detox :
$ detox *Syphil* -s uncgi -v
I have built (and retested with ) the current detox code from GitHub. I don't know what the issue is, but I wouldn't be a bit surprised if it was a library error.
I don't see any unicode in the filename, really
cheers!
Add a Travis test for macOS to confirm builds there.
After #2 was finished, there are still lingering references to libpopt. Remove these.
Running detox without a config file leaves the user with only the default sequence. This is OK, but doesn't agree with the man pages at all.
Refactor the config_file_spoof logic so that we can easily build extra sequences, and then build out a full set of sequences.
Alternatively, do something like what was done in #21, converting the stock detoxrc into C code that can be loaded at will.
detox does not handle 4 byte UTF-8, but there's no reason it can't.
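As a rough illustration of what handling 4-byte sequences involves, here is a minimal decoder sketch. It is not taken from detox; the function name utf8_decode and its return convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence starting at s (a NUL-terminated byte string).
 * On success stores the code point in *cp and returns the number of bytes
 * consumed (1 to 4). Returns 0 for malformed input.
 */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t value;

    assert(s != NULL && cp != NULL);

    if (s[0] < 0x80) {                      /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2 bytes */
        len = 2;
        value = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3 bytes */
        len = 3;
        value = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4 bytes */
        len = 4;
        value = s[0] & 0x07;
    } else {
        return 0;                           /* stray continuation or invalid lead */
    }

    for (i = 1; i < len; i++) {
        /* Each continuation byte must look like 10xxxxxx. A NUL fails
         * this test, so the loop never reads past the terminator. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (s[i] & 0x3F);
    }

    if (value < min_cp[len] || value > 0x10FFFF)
        return 0;                           /* overlong or out of range */
    if (value >= 0xD800 && value <= 0xDFFF)
        return 0;                           /* surrogates are not valid */

    *cp = value;
    return len;
}
```

The 4-byte branch differs from the 3-byte one only in the lead-byte mask and the minimum code point, which is consistent with the point that nothing fundamental prevents supporting it.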
I think it would be a nice improvement. For example: I want to rename foo.bar.baz.pdf to foo_bar_baz.pdf.
Thanks for your work!
Running
detox -r -v .
doesn't find anything, while
detox -r -v $(pwd)
works as expected.
Detox version 1.3.0, Fedora 27.
I want to reconfigure detox to have a less opinionated "safe" character subset (essentially, I want to keep Latin-1 and Cyrillic characters but remove characters that are unsafe for sh -c input, such as '"$, and for SMB shares and HFS volumes). I have the following table:
default
start
0x23 _ # '#'
0x25 _ # %
0x2b +
0x2c _ # ,
0x2d -
0x2e .
0x3d _ # =
0x5e _ # ^
0x5f _
0x7e ~
#0x20 _ # space
0x21 _ # !
0x22 _ # "
0x24 _ # $
0x27 _ # '
0x2a _ # *
0x2f _ # /
0x3a _ # :
0x3b _ # ;
0x3c _ # <
0x3e _ # >
0x3f _ # ?
#0x40 # @
0x5c _ # \
0x60 _ # `
0x7c _ # |
#0x28 - # (
#0x29 - # )
#0x5b - # [
#0x5d - # ]
0x7b - # {
0x7d - # }
0x26 _and_ # &
end
And a simple detoxrc:
sequence default {
safe {
filename "/home/driib/mysafe.tbl";
};
};
However, I get the following dry-run output, where allowed and unsafe characters are stripped alike (seemingly every second character):
nas-now/downloads/z2020P2/01 5G Core Networks.pdf -> nas-now/downloads/z2020P2/0 GCr ewrspf1
Would you be so kind as to point me in the right direction? If I add a non-empty default, the dry run produces the correct output (albeit not the one I would like):
nas-now/downloads/macOS High Sierra Patcher.dmg -> nas-now/downloads/_________________________.___
I will fall back to something like this but I would really like to take advantage of writing more complex sequences in detox instead:
find . -type d -print0 | xargs -0 -I '{}' sh -c "rename -n \"s/[\\\$!'\\\"]/_/g\" {}/*"
Thank you, Merry Christmas and happy holidays!
While testing detox I accidentally ran it across my entire projects directory (all of my dev projects) when I created a symlink in /tmp that pointed at ../.. and then ran detox in a test with -r --special.
Switch from using tabs to using spaces.
astyle --style=kr --indent-switches *.[ch]
This needs to skip config_file_yacc.[ch] and config_file_lex.c.
Bonus points if the output is also a viable transition table.
Call it something that indicates it works on a single byte with the high bit set.
Confirm that compilation on FreeBSD works and regression tests pass. Travis
Dear Doug,
thanks for providing and sharing this nice tool. I discovered it today, thanks to an article in the c't, a highly regarded computer magazine here in Germany. In order to spread the joy, I packaged v1.3.3 for openSUSE right away, in such a way that it's ready to enter the official distribution, building on the blocks from Antoine Ginies. In the course of that, I noticed a lot of rather disturbing compiler complaints during the build:
[ 5s] gcc -DHAVE_CONFIG_H -I. -DDATADIR=\"/usr/share\" -DSYSCONFDIR=\"/etc\" -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -flto=auto -g -c -o detox.o detox.c
[ 5s] detox.c: In function 'main':
[ 5s] detox.c:339:29: warning: pointer targets in passing argument 1 of 'parse_file' differ in signedness [-Wpointer-sign]
[ 5s] 339 | file_work = parse_file(*file_walk, main_options);
[ 5s] | ^~~~~~~~~~
[ 5s] | |
[ 5s] | char *
[ 5s] In file included from detox.c:46:
[ 5s] file.h:38:49: note: expected 'unsigned char *' but argument is of type 'char *'
[ 5s] 38 | extern unsigned char *parse_file(unsigned char *filename, struct detox_options *options);
[ 5s] | ~~~~~~~~~~~~~~~^~~~~~~~
[ 5s] detox.c:339:16: warning: pointer targets in assignment from 'unsigned char *' to 'char *' differ in signedness [-Wpointer-sign]
[ 5s] 339 | file_work = parse_file(*file_walk, main_options);
[ 5s] | ^
[ 5s] detox.c:340:16: warning: pointer targets in passing argument 1 of 'parse_dir' differ in signedness [-Wpointer-sign]
[ 5s] 340 | parse_dir(file_work, main_options);
[ 5s] | ^~~~~~~~~
[ 5s] | |
[ 5s] | char *
[ 5s] In file included from detox.c:46:
[ 5s] file.h:40:38: note: expected 'unsigned char *' but argument is of type 'char *'
[ 5s] 40 | extern void parse_dir(unsigned char *indir, struct detox_options *options);
[ 5s] | ~~~~~~~~~~~~~~~^~~~~
[ 5s] detox.c:344:17: warning: pointer targets in passing argument 1 of 'parse_file' differ in signedness [-Wpointer-sign]
[ 5s] 344 | parse_file(*file_walk, main_options);
[ 5s] | ^~~~~~~~~~
[ 5s] | |
[ 5s] | char *
[ 5s] In file included from detox.c:46:
[ 5s] file.h:38:49: note: expected 'unsigned char *' but argument is of type 'char *'
[ 5s] 38 | extern unsigned char *parse_file(unsigned char *filename, struct detox_options *options);
[ 5s] | ~~~~~~~~~~~~~~~^~~~~~~~
[ 5s] detox.c:347:20: warning: pointer targets in passing argument 1 of 'parse_special' differ in signedness [-Wpointer-sign]
[ 5s] 347 | parse_special(*file_walk, main_options);
[ 5s] | ^~~~~~~~~~
[ 5s] | |
[ 5s] | char *
[ 5s] In file included from detox.c:46:
[ 5s] file.h:42:42: note: expected 'unsigned char *' but argument is of type 'char *'
[ 5s] 42 | extern void parse_special(unsigned char *in, struct detox_options *options);
[ 5s] | ~~~~~~~~~~~~~~~^~
[ 5s] detox.c:366:20: warning: pointer targets in passing argument 1 of 'parse_inline' differ in signedness [-Wpointer-sign]
[ 5s] 366 | parse_inline(*file_walk, main_options);
[ 5s] | ^~~~~~~~~~
[ 5s] | |
[ 5s] | char *
[ 5s] In file included from detox.c:46:
[ 5s] file.h:44:41: note: expected 'unsigned char *' but argument is of type 'char *'
[ 5s] 44 | extern void parse_inline(unsigned char *filename, struct detox_options *options);
[ 5s] | ~~~~~~~~~~~~~~~^~~~~~~~
You can see the full build log by clicking on the succeeded link.
Yes, our compilers are parameterized quite squeamishly, but it often helps to discover issues early. That doesn't work so well anymore, if a project triggers that many warnings, though...
Of course, we could muzzle the compiler, but it would be nice, if you could take a look into this issue yourself. I'm sure, eliminating these signedness issues improves the overall value of this fine project even more.
Move the default translation character out of the translation table, so transliterators can be added in any order, and only the last will filter out untransliteratables.
Running the regression tests on Ubuntu Precise (12.04) via Travis fails.
Noticed while testing #31
When a copy of detox is installed on the system (from an OS package or manually), it interferes with the regression tests, because detox will pick up on the translation tables or system config file and change the behavior being tested.
detox 1.3.0 on linux mint
Steps:
Expected output:
Actual output
The wipeup filter handles this, and the --remove-trailing filter has been deprecated for years now.
Hi Doug!
Thanks for your email regarding the updated version of detox. I tried to create an updated PKGBUILD for Arch Linux, but the below message comes up in the build process. This is my rewritten PKGBUILD for v1.3.0
==> Starting build()...
configure: error: cannot find install-sh, install.sh, or shtool in "." "./.." "./../.."
==> ERROR: A failure occurred in build().
Aborting...
Investigate C testing libraries
It probably only makes sense in the context of single byte characters, and it should be a separate filter using the ISO 8859-1 base.
gengetopt allows for abstraction of the command line options. Review whether it's a good fit.
Update the default behavior of detox (with or without a config file) so that the only tables run are safe and wipeup. Add notes to the config file about the iso8859_1 and utf_8 sequences being transliteration.
Add a new syntax to allow a user to state that they want to use the built in table.
From Eriberto (the Debian package maintainer):
I would like to suggest two features. Something like:
-c '^~': detox will also replace the characters ^ and ~ with _.
-d '^~': detox will delete the characters if found.
This fits nicely in with my vision of v2, pushing all of the actual sequencing to the command line and away from config files and custom conversion tables.
The example in the man pages, detox -c my_detoxrc -L -v, does not work; option -c is documented neither in the man page nor in --help.
EXAMPLES
...
detox -c my_detoxrc -L -v
Will list the sequences within my_detoxrc, showing their filters and options.
$ detox -c my_detoxrc -L -v
detox: invalid option -- 'c'
usage: detox [-hLnrvV] [-f configfile] [-s sequence] [--dry-run] [--inline] [--special]
file [file ...]
Should it be -f instead?
The basic safe filter removes any character that doesn't match. This is significantly different from when the safe table can be loaded; the default is not set, so any unmatched character is left alone.
So, any UTF-8 characters that get passed through the basic safe filter get removed, while the same character passed through the table-based safe filter is left alone.
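The difference between the two behaviors comes down to what happens to a byte that no rule matches. The sketch below illustrates that idea only; struct rule, apply_rules and the fallback parameter are hypothetical names and do not reflect detox's real table structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-byte rule: replace `from` with the string `to`. */
struct rule {
    unsigned char from;
    const char *to;
};

/*
 * Apply rules to `in`, writing a NUL-terminated result to `out`.
 * Unmatched bytes are copied through when fallback is NULL (like a table
 * with no default) and replaced by fallback otherwise; an empty fallback
 * removes them. Returns 0 on success, -1 if `out` is too small.
 */
int apply_rules(const struct rule *rules, size_t n_rules, const char *fallback,
                const char *in, char *out, size_t out_size)
{
    size_t used = 0;

    assert(rules != NULL && in != NULL && out != NULL);

    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        char self[2] = { *in, '\0' };
        const char *piece = (fallback != NULL) ? fallback : self;
        size_t i, piece_len;

        for (i = 0; i < n_rules; i++) {
            if (rules[i].from == c) {
                piece = rules[i].to;    /* a matching rule always wins */
                break;
            }
        }

        piece_len = strlen(piece);
        if (used + piece_len + 1 > out_size)
            return -1;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
    }

    if (used + 1 > out_size)
        return -1;
    out[used] = '\0';
    return 0;
}
```

Run over the same input, a NULL fallback leaves the UTF-8 bytes of an unmatched character in place, while an empty fallback drops them, which mirrors the inconsistency described above.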
Look into a tool to assist with managing the man pages for detox. Ronn (https://github.com/rtomayko/ronn) looks interesting.
Issue #14 links to a Debian bug ticket. Review the Debian src package for potential fixes to add to detox.
popt was added years ago, to support a build on OS/X (I think). libpopt is no longer being maintained. Remove support for it.
Review the BUGS in the man pages; check to see if they still are bugs, and if so, create github issues to address them.
See issue #31 . Address the underlying issue: convert all internal strings over to either a signed or unsigned char *, and address the consequences of whichever choice is made.
Using signed chars causes the character math to get wonky.
Using unsigned chars causes warnings with the standard C library functions.
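Both halves of that trade-off can be seen in a few lines. This is an illustrative sketch only; the helper names index_from_signed, index_from_char and ustrlen are hypothetical and are not detox functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * With a signed char, any byte with the high bit set becomes negative
 * when widened to int, so using it as a table index or in range
 * comparisons goes wrong.
 */
int index_from_signed(signed char c)
{
    return c;                       /* e.g. the byte 0xE9 arrives here as -23 */
}

/*
 * Converting through unsigned char at the boundary always yields a
 * value in the range 0..255, whatever the signedness of plain char.
 */
int index_from_char(char c)
{
    return (unsigned char)c;
}

/*
 * Keeping strings as unsigned char * means every call into the standard
 * library needs a cast; a thin wrapper keeps the cast in one place and
 * avoids the -Wpointer-sign warning at each call site.
 */
size_t ustrlen(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

Whichever representation is chosen internally, the casts end up concentrated either around the character math or around the libc calls.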
The wipeup filter should trim _ and - off the end of the filename.
The remove_trailing option should probably be the normal behavior.
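Trimming _ and - off the end of a name could look roughly like the sketch below. The function name trim_trailing is hypothetical, and keeping at least one character so that the name never becomes empty is an assumption of this example, not something the issue specifies.

```c
#include <assert.h>
#include <string.h>

/*
 * Remove trailing '_' and '-' characters from name, in place.
 * Always leaves at least one character so the result is never empty.
 * Returns the new length.
 */
size_t trim_trailing(char *name)
{
    size_t len;

    assert(name != NULL);

    len = strlen(name);
    while (len > 1 && (name[len - 1] == '_' || name[len - 1] == '-'))
        len--;
    name[len] = '\0';
    return len;
}
```

This version looks only at the literal end of the string, so a name such as report_.txt is left untouched; trimming before the extension would be a separate decision.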
The UTF-8 filter behaves like the safe filter, in that many characters between 0x20 and 0x3F are converted to _ or -. This should be done by the safe filter, and the UTF-8 filter should only be for transliterating UTF-8 to ASCII.
"Enhanced bash scripting": could detox generate a valid filename from a string, so I can avoid an error from an invalid filename?
Something like this:
#!/bin/bash
echo "input filename"
read -r e
filename=$(echo "$e" | detox)  # or detox "$e"
# rest of script
Rename master to main (after a notice period) and create a 1.x branch to have a place for bug fixes on v1.x.
Right now we're building two different .o files from several .c files, based on the INLINE_MODE flag. This is overkill, and hackish. Create separate source files where appropriate, and clean up this mess.
From Debian bug 861537[1].
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=861537
Seems that there is a patch in a Detox fork[2].
[2] https://github.com/mikrosimage/detox/tree/1.3.0.mikros
Cheers,
Eriberto
Create a new version, 1.4.0, to release the current set of changes, and give the 1.x branch the ability to do some regression testing. Ties with issue #25 .
Reproduction steps:
create test.sh:
#!/bin/sh
touch "hi there"
(There is a TAB between hi and there)
chmod +x test.sh
./test.sh
ls -1 | inline-detox
# OUTPUT: hi here
# or
detox -n hi*
# OUTPUT: hi there -> hi here
Add tests for the --special flag. Add tests for the --special argument with --recursive, when confronted with a symlink loop. Add symlinks that reference . or .. on tests that use --special --recursive.
Update the regression tests to run in some sort of confinement. A docker container, or chroot jail would work.