Comments (10)
@aurelg The postprocessing step would be fast and definitely not a bottleneck. The main bottleneck is I/O for reading files to compute the hashes.
I generally agree this feature is much easier to implement inside fclones.
However, this is not as simple as the provided Python script. When automatically deleting user files, one has to be extremely cautious. For example, a file might be moved to a different location during the scanning phase, so fclones registers it as a duplicate, but by the moment it wants to delete it, there is no duplicate anymore.
This:

```python
if isfile(dst):
    unlink(dst)
link(src, dst)
```

might end up deleting the only existing copy of the file.
Better to move the file aside first, then create the link, and only if everything went OK, drop the moved file.
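The move-first ordering described above could be sketched like this. A minimal sketch: the function name and the `.fclones.bak` suffix are illustrative, not part of fclones.

```python
#!/usr/bin/env python3
"""Safer replace-with-hardlink: move the duplicate aside before linking,
so a failed link() can be rolled back instead of losing the only copy."""
import os


def replace_with_hardlink(src: str, dst: str) -> None:
    backup = dst + ".fclones.bak"  # hypothetical temporary name
    os.rename(dst, backup)         # 1. move the duplicate aside first
    try:
        os.link(src, dst)          # 2. create the hard link in its place
    except OSError:
        os.rename(backup, dst)     # roll back: restore the original file
        raise
    os.unlink(backup)              # 3. all OK: drop the moved copy
```

If the `link()` call fails (cross-device link, permissions, etc.), the original file is restored instead of being lost.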
What do you mean, exactly?
Most of the Python code above deals with reconstructing proper data structures from the fclones output. I guess such data structures are probably already available inside fclones, so a dedicated flag could bypass the need for implementing (and maintaining) a parser.
I'm not very happy with the Python dependency either. IMHO the link between an independent Python project and fclones would be so tight that I don't think it's worth the split.
I'd prefer a shell-based approach as well: it would be more portable. But I fear it could become rather limiting later, as shell scripts get complex, hard to read, and unreliable compared to Python once tests, additional switches, or edge-case handling are needed.
Anyhow, a postprocessing step would probably limit (if not defeat) the speed advantage of fclones vs jdupes/fdupes.
Implemented in #53 released as v0.12.0.
IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch
What do you mean, exactly?
I like your approach of using Python. Maybe Bash is not enough, although it's more powerful than people would expect, and this could be done with it in a more portable way. The Python wrapper would need to be an independent project, since it would not be just a helper command anymore... But yes, a fclones-helpers
package would totally make sense :-)
Anyhow, a postprocessing step would probably limit (if not defeat) the speed advantage of fclones vs jdupes/fdupes.
I think the bottleneck is in the hashes...
It might also be nice to avoid creating dst if it has been removed since fclones was executed. Such edge cases come from the arbitrary amount of time (and changes on the filesystem) between the execution of fclones and the postprocessing. An implementation inside fclones could be far more robust. 👍
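One way a postprocessor could guard against such filesystem drift is to re-verify a pair just before acting on it. A minimal sketch, assuming `src`/`dst` came from the fclones report; the function name and the choice of SHA-256 are illustrative, not fclones behavior:

```python
#!/usr/bin/env python3
"""Re-check that two files are still duplicates at the moment we act,
guarding against changes made after the fclones scan finished."""
import hashlib
import os


def still_duplicates(src: str, dst: str) -> bool:
    # Both files must still exist...
    if not (os.path.isfile(src) and os.path.isfile(dst)):
        return False
    # ...have the same size (cheap check first)...
    if os.path.getsize(src) != os.path.getsize(dst):
        return False

    # ...and the same content hash right now.
    def digest(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    return digest(src) == digest(dst)
```

The size comparison short-circuits the expensive hashing for files that have obviously diverged.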
fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically.
In #25:
@pkolaczk wrote:
That's right, fclones doesn't offer any way of deleting files automatically yet. I believe this is a task for a different program (or a subcommand) that would take output of fclones.
and @piranna replied:
From a UNIX perspective, yes, it makes sense for that task to be done by another command, but it would be so much attached to the fclones output format... :-/ Maybe a shell script wrapper that offers an interface compatible with fdupes? :-) That would be easy to implement, but I'm not sure if it should be hosted here in the fclones repo or be totally independent...
IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch. For instance, here's an (untested) Python implementation that leverages the CSV output (expected in `fclones_out.csv`) to replace duplicates with hard links:

```python
#!/usr/bin/env python

import logging
from os import link, unlink
from os.path import isfile


def main() -> None:
    with open("fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler.readlines()
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)
                if isfile(dst):
                    unlink(dst)
                link(src, dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()
```

PS: I think this deserves a ticket on its own, feel free to delete it if you don't agree. :-)
I added a few things - love the code. Assumes you output the CSV file to /tmp for tidiness. Remember to put the primary directory last on the fclones command line so those files are kept as the priority (in contrast to rdfind, where it's the first directory that is kept).
```python
#!/usr/bin/env python3

import os
import logging
from pathlib import Path


def main() -> None:
    with open("/tmp/fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler.readlines()
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            for dst in duplicates[1:]:
                # logging.debug("%s -> %s", src, dst)
                dst = dst.strip('\n')
                my_file = Path(dst)
                if my_file.is_file():
                    os.remove(dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()
```
And here is a version that just moves files to a duplicates directory ($HOME/Duplicates) for safety:
```python
#!/usr/bin/env python3

import os
import shutil
import logging
from pathlib import Path


def main() -> None:
    with open("/tmp/fclones_out.csv") as f_handler:
        for duplicates in (
            fclone_output_line.split(",")[3:]
            for fclone_output_line in f_handler.readlines()
            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]
            moveto = "/Users/MyUserName/Duplicates/"
            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)
                dst = dst.strip('\n')
                my_file_list = Path(dst)
                if my_file_list.is_file():
                    myfile = os.path.basename(dst)
                    sink = moveto + myfile
                    shutil.move(dst, sink)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()
```
Assumes you output the CSV file to /tmp for tidiness

Better if it gets the info directly from stdin :-)
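The stdin suggestion could look like the sketch below: pipe the fclones CSV report straight into the script instead of going through a temp file. The column layout (paths from the fourth column on) follows the scripts earlier in the thread and is an assumption about the report format; `groups_from_csv` is an illustrative name, not part of fclones.

```python
#!/usr/bin/env python3
"""Parse an fclones-style CSV duplicate report from an iterable of lines
(e.g. sys.stdin), yielding (original, duplicates) tuples."""
import sys


def groups_from_csv(lines):
    for line in lines:
        if line.startswith("size"):  # skip the header row
            continue
        # Paths start at the fourth comma-separated column.
        paths = [p.strip() for p in line.split(",")[3:]]
        if len(paths) > 1:
            yield paths[0], paths[1:]


# Intended usage (assuming fclones writes its CSV report to stdout):
#   fclones <args> | ./this_script.py
# for src, dups in groups_from_csv(sys.stdin):
#     ...replace each path in dups with a link to src...
```

Note that a plain `split(",")` breaks on paths containing commas, which is one more reason the parsing belongs inside fclones itself.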
Assumes you output the CSV file to /tmp for tidiness

Better if it gets the info directly from stdin :-)

I like to check before deleting!! :-) And the move version loses the directory structure, so I equally want to check first.
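The lost-directory-structure concern above could be addressed by mirroring each file's path, relative to the scanned root, under the quarantine directory. A minimal sketch; `quarantine`, `root`, and `moveto` are illustrative names, not fclones options:

```python
#!/usr/bin/env python3
"""Move a duplicate into a quarantine directory while preserving its
directory structure relative to the scanned root."""
import os
import shutil


def quarantine(dst: str, root: str, moveto: str) -> str:
    rel = os.path.relpath(dst, root)   # path relative to the scanned root
    sink = os.path.join(moveto, rel)   # mirror it under the quarantine dir
    os.makedirs(os.path.dirname(sink), exist_ok=True)
    shutil.move(dst, sink)
    return sink
```

Restoring a file later is then just the reverse move, since the original path can be reconstructed from the quarantined one.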