
Comments (8)

oniony commented on May 25, 2024

Hi there. Thanks for reporting this issue.

You're right: when a repair is made, the tags should be pulled from the other file so that the remaining file has the superset of both files' tags. I'll look at this as part of the 0.5.0 release.

In the meantime, TMSU has a dupes command that can be used to identify duplicate files in the database (or for a specific file):

$ tmsu dupes
Set of 2 duplicates:
    ./test/file.txt
    ./another/file.txt

When you identify a duplicate, the easiest way forward is to choose which of the two files you wish to keep and then tag it with all of the other file's tags:

$ tmsu tag --from another/file.txt test/file.txt

Then you can safely delete the second file:

$ tmsu untag --all another/file.txt
$ rm another/file.txt

from tmsu.

oniony commented on May 25, 2024

I think this will have to wait until after 0.5.0. Something is making me not like the idea of 'repair' automatically synchronizing tags across duplicates and I can't quite put it into words. I think part of the worry is that the operation could be slow: identifying all sets of duplicates and then comparing tags. Or it could be that although a user has multiple files with the same fingerprint that they consider them separate files, though I can't come up with a convincing case to illustrate what I mean. Let me think about this some more.


xuanngo2001 commented on May 25, 2024

Let me argue to make this issue a stronger case. :-)

From the user's perspective, this is a bug. If duplicate files have been merged, then tags should also follow.

When I repair anything, I expect it to be slow, but it has to be right. I don't mind waiting longer for the process to fix everything correctly. Most importantly, information (tags) should not be lost.

"same fingerprint that they consider them separate files" doesn't hold. Either the fingerprint algorithm is not strong enough to identify uniqueness or you are trying to trick the application. If it is the latter, then all bets are off.


oniony commented on May 25, 2024

I'll look at this issue soon. It had fallen off my radar.


0ion9 commented on May 25, 2024

@limelime :
"Either the fingerprint algorithm is not strong enough to identify uniqueness or you are trying to trick the application. If it is the latter, then all bets are off."

I want to point out that generic hashing algorithms are only capable of making probabilistic guarantees about their behaviour, not absolute ones. So it might be unlikely to have a set of 2 files with the same fingerprint and different content, but it's perfectly possible no matter the strength of your fingerprinting. No matter how strong the hash, it can never make an absolute guarantee that two files with different content will not have the same hash. I have personally encountered collisions with both MD5 and SHA1.

(This is one of the reasons I'm extremely cautious about tmsu repair, although IIRC tmsu repair also compares other metadata like file size after hash, which further reduces risk. It may be a little off-topic, but I think the approach of generating an editable script like rmlint does is generally more sound, to allow the user to deal correctly with exceptional cases. Undeniably slower, but IMO the extra level of control is necessary when performing potentially destructive actions.)
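As a rough illustration of the "editable script" idea, the sketch below turns `tmsu dupes` output into a list of commented-out merge commands that can be reviewed before anything runs. It assumes the output format shown earlier in this thread (a "Set of N duplicates:" header followed by indented paths). `shell_quote` and `dupes_to_script` are hypothetical helper names, not part of TMSU, and the added `cmp -s` byte-for-byte check is my own suggestion for guarding against fingerprint collisions.

```shell
#!/bin/sh
# Hypothetical helpers; not part of TMSU.

# Wrap a string in single quotes so it is safe to paste into a shell script.
shell_quote() {
    printf '%s\n' "$1" | sed "s/'/'\\\\''/g; 1s/^/'/; \$s/\$/'/"
}

# Read `tmsu dupes` output on stdin and print a reviewable script on stdout.
# Assumed input format: "Set of N duplicates:" then one indented path per line.
dupes_to_script() {
    keep=""
    while IFS= read -r line; do
        case "$line" in
            "Set of "*)
                keep=""                          # a new duplicate set begins
                printf '\n# %s\n' "$line"
                ;;
            "")
                ;;                               # skip blank separator lines
            *)
                path=$(printf '%s\n' "$line" | sed 's/^[[:space:]]*//')
                if [ -z "$keep" ]; then
                    keep=$(shell_quote "$path")  # first path is the default keeper
                    printf '# keep: %s\n' "$keep"
                else
                    dup=$(shell_quote "$path")
                    # Emitted as a comment: nothing runs until the user uncomments it.
                    printf '# cmp -s -- %s %s && tmsu tag --from %s %s && tmsu untag --all %s && rm -- %s\n' \
                        "$keep" "$dup" "$dup" "$keep" "$dup" "$dup"
                fi
                ;;
        esac
    done
}
```

One possible use is `tmsu dupes | dupes_to_script > merge.sh`, followed by editing `merge.sh` to pick a different keeper or uncomment only the lines you agree with.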


oniony commented on May 25, 2024

@limelime

"From the user's perspective, this is a bug."
"When I repair anything, I expect it to be slow, but it has to be right. I don't mind waiting longer for the process to fix everything correctly. Most importantly, information (tags) should not be lost."

I don't see how you could think this is a bug. TMSU does not combine tagged files during a repair operation: it will only repair a moved file by identifying an untagged file with the same fingerprint. If TMSU cannot find an untagged file with the same fingerprint then it will, instead, report the file as missing. At no point could any tag information be lost, as files are simply not merged by 'repair' at this time.

You seem to be saying that 'if TMSU implemented my original suggestion there would be a bug in the implementation', which doesn't make sense as you have no idea how it would be implemented.

Unless I'm missing something? Are you saying you have identified a bug in the 'repair' subcommand where missing files are repaired by merging them with an already tagged file? I have performed a couple of tests and everything seems to be working as expected:

$ echo "hello" >file1
$ cp file1 file2
$ tmsu tag file1 tag1
tmsu: New tag 'tag1'.
$ tmsu tag file2 tag2
tmsu: New tag 'tag2'.
$ tmsu dupes
Set of 2 duplicates:
    ./file1
    ./file2
$ rm file1
$ tmsu repair .
/home/paul/test/file1: missing
$ cp file2 file3
$ tmsu repair .
/home/paul/test/file1: updated path to /home/paul/test/file3

As you can see, 'repair' only identified the file as moved when there was an untagged candidate. No tagging information was lost at any time.
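The rule described above can be sketched in a few lines of shell. This is only a model of the behaviour described in this comment, not TMSU's implementation: `find_moved_candidate` is a hypothetical name, a plain text file stands in for the database's list of tagged paths, and SHA-256 via `sha256sum` (GNU coreutils) is an assumed stand-in for whatever fingerprint TMSU actually records.

```shell
#!/bin/sh
# Hypothetical model of the 'repair' rule described above; not TMSU code.
# find_moved_candidate FINGERPRINT TAGGED_LIST DIR
#   FINGERPRINT  hex SHA-256 digest recorded for the missing file (assumed algorithm)
#   TAGGED_LIST  text file with one already-tagged path per line (stand-in for the database)
#   DIR          directory to search for a replacement
# Prints the first untagged file with a matching digest, otherwise "missing".
find_moved_candidate() {
    fingerprint=$1
    tagged_list=$2
    dir=$3
    for candidate in "$dir"/*; do
        [ -f "$candidate" ] || continue
        # Files that are already tagged are never candidates, so tags are never merged.
        grep -qxF -- "$candidate" "$tagged_list" && continue
        digest=$(sha256sum < "$candidate" | cut -d' ' -f1)
        if [ "$digest" = "$fingerprint" ]; then
            printf 'updated path to %s\n' "$candidate"
            return 0
        fi
    done
    printf 'missing\n'
    return 1
}
```

With only tagged duplicates present the function reports "missing"; once an untagged copy appears, that copy is reported as the new path, which mirrors the transcript above.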


oniony commented on May 25, 2024

"same fingerprint that they consider them separate files doesn't hold. Either the fingerprint algorithm is not strong enough to identify uniqueness or you are trying to trick the application. If it is the latter, then all bets are off."

If a user has two files on their disk with the same contents, why should TMSU be so arrogant as to assume that these files share the same destiny and treat them identically, even though they are separate filesystem entities?

Just because two files are the same now, does not mean they will be the same in the future. The user might plan on editing one, or both, such that they serve different purposes: the tool should not assume.

The better argument may be 'why would a user want these duplicate files on their disk in the first place?' and that's a good question: they likely don't, and so they can use the available tooling to remove the duplicates they do not want. TMSU helps here by providing a 'dupes' subcommand. This duplicate detection is already outside of TMSU's remit (as a file tagging utility), but I included it as I figured that if TMSU has the fingerprints (for moved file detection) it may as well leverage this information.

If a user wants to remove duplicates from their filesystem they can do this:

$ tmsu dupes
Set of 2 duplicates:
  ./file1
  ./file2
$ tmsu tag --from file2 file1
$ tmsu untag --all file2
$ rm file2

Now I agree that's a bit long-winded and not entirely intuitive, so perhaps there could be a facility to do all of this. However, such a facility would alter the filesystem, and I've been reluctant to add such functionality as right now TMSU does not alter your files: its only write access is to the database. This is actually one of the 'features' I put on the website at http://tmsu.org/. I've been resisting adding operations that modify files as I believe this would cause a trust issue for new (actually all) users.

So, my recommendation is to create a shell function to do this:

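# Usage: tmsu-mergefiles DUPLICATE KEEPER
# Copies DUPLICATE's tags onto KEEPER, untags DUPLICATE, then deletes it.
tmsu-mergefiles() {
    tmsu tag --from "$1" "$2" && tmsu untag --all "$1" && rm "$1"
}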

In fact, in another issue we had a discussion about a tmsu-rm type command. Perhaps the most pragmatic thing to do would be to include such scripts with TMSU. That way you get the functionality you want, TMSU itself never modifies your files and everybody is theoretically happy. Plus this feels a lot more Unixesque than putting this simple stuff into TMSU.
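If such a script were bundled, a slightly more defensive variant might be worth considering. The sketch below is only a suggestion under stated assumptions: `tmsu_merge_checked` is a hypothetical name, and the extra checks (both paths exist, they are not the same underlying file, and `cmp -s` confirms the contents match byte for byte) are additions that are not part of the function above. The `tmsu tag --from` and `tmsu untag --all` calls are the ones already used in this thread.

```shell
#!/bin/sh
# Hypothetical, more cautious take on the merge function; not part of TMSU.
# Usage: tmsu_merge_checked DUPLICATE KEEPER
tmsu_merge_checked() {
    dup=$1
    keep=$2
    if [ ! -f "$dup" ] || [ ! -f "$keep" ]; then
        echo "tmsu_merge_checked: both arguments must be existing files" >&2
        return 2
    fi
    # Refuse when both paths name the same underlying file (this also covers
    # hard links), so that the only copy is never deleted.
    if [ "$dup" -ef "$keep" ]; then
        echo "tmsu_merge_checked: '$dup' and '$keep' are the same file" >&2
        return 2
    fi
    # Compare contents byte for byte rather than trusting a fingerprint match.
    if ! cmp -s -- "$dup" "$keep"; then
        echo "tmsu_merge_checked: contents differ, refusing to merge" >&2
        return 1
    fi
    # Same three steps as above: copy tags, untag the duplicate, delete it.
    tmsu tag --from "$dup" "$keep" && tmsu untag --all "$dup" && rm -- "$dup"
}
```

The byte comparison costs one extra read of each file, which seems a reasonable price for an operation that ends in `rm`.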


oniony commented on May 25, 2024

I've opened issue #35 to include some scripts for performing filesystem operations whilst maintaining the tag information.

Merging files I feel should be left to the user with the help of 'dupes' as necessary.

