superd_legacy's People

Contributors

bzitzow, sipulak, snakedoc, tracehagan

superd_legacy's Issues

Fix DedupeSQL.writeRecord() so it does not open a new db connection every iteration. Also fix prepared statements so they are only initialized once.

Reference #12

DedupeSQL.writeRecord() currently opens a new db connection on every iteration, and that needs to be fixed. The creation of the prepared statement also needs to move so that it's only created once and we save the i/o call. On the db side, the prepared statement is only really made once; all other attempts to create it get ignored since it lives in the database's "cache of recent queries" until the db decides to get rid of it (when the db is closed). But we still waste cpu time on pointless i/o telling the db to create a statement that we know already exists.
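A minimal sketch of the shape this fix could take, not the actual DedupeSQL code. The class name RecordWriter, the table name files, and the file_hash and file_size columns are assumptions made for illustration. The connection is handed in once, the statement is prepared once in the constructor, and writeRecord() only binds parameters and executes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper: one connection and one prepared statement for the whole run. */
class RecordWriter implements AutoCloseable {
    // Assumed schema; replace with the real table and column names.
    private static final String INSERT_SQL =
            "INSERT INTO files (file_path, file_hash, file_size) VALUES (?, ?, ?)";

    private final PreparedStatement insert;

    RecordWriter(Connection connection) throws SQLException {
        // Prepared exactly once, instead of once per writeRecord() call.
        this.insert = connection.prepareStatement(INSERT_SQL);
    }

    /** Binds the values and runs the already-prepared INSERT. */
    void writeRecord(String path, String hash, long size) throws SQLException {
        insert.setString(1, path);
        insert.setString(2, hash);
        insert.setLong(3, size);
        insert.executeUpdate();
    }

    /** Closes the statement only; the caller still owns the connection. */
    @Override
    public void close() throws SQLException {
        insert.close();
    }
}
```

The caller would open the connection once before the scan starts, create one RecordWriter, call writeRecord() for each file, and close both when the run ends.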

Create Contributors file, fix copyright announcement.

Create a Contributors file and reference it in the copyright notice. Contributors should get public credit as they work on the project, but listing everyone in the copyright and license notice is messy. Move all names into the Contributors file, listed by the date they joined or first contributed to the project. Allow each contributor to promote something of their choice, such as a personal website, company, project, product, interesting fact, etc. Contributors can update their text whenever they wish by issuing a pull request for that file. NO pull request will be accepted if it changes the Contributors file as well as something else (code updates). Contributors file changes will only be accepted if it's the sole file changed in a commit. This is to keep minor whitespace changes and the like from cluttering normal, real commits.

Project Setup?

How can I set up the project?

  • I see lib/libs.xml. What do I do with it to pull in the libraries? Does the H2 Database library provide the SQLite objects?

Add User Interface (CLI or GUI)

Probably lean towards a CLI for now, since it makes the program more reusable in other programs and later projects. The jcurses library looks very nice, and we could do something simple with a couple of columns and selectable checkboxes or something.

Issue needs research.

Add ToolTip to all buttons and text fields

This is to make the program easier and more intuitive to use. A ToolTip for every table row might also be nice, so the user can see the full cell contents without having to resize columns or the window.
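One way the table part could work, assuming the GUI is built with Swing (the issue does not say which toolkit is used). The class name ToolTipCellRenderer is made up for this sketch. It reuses the cell's own value as its tooltip, so long values stay readable without resizing.

```java
import java.awt.Component;
import javax.swing.JTable;
import javax.swing.table.DefaultTableCellRenderer;

/** Hypothetical renderer: shows each cell's full text as its tooltip. */
class ToolTipCellRenderer extends DefaultTableCellRenderer {
    @Override
    public Component getTableCellRendererComponent(JTable table, Object value,
            boolean isSelected, boolean hasFocus, int row, int column) {
        Component cell = super.getTableCellRendererComponent(
                table, value, isSelected, hasFocus, row, column);
        // A null tooltip means "no tooltip" for empty cells.
        setToolTipText(value == null ? null : value.toString());
        return cell;
    }
}
```

It would be installed with table.setDefaultRenderer(Object.class, new ToolTipCellRenderer()). Buttons and text fields need no custom code, since every JComponent already has setToolTipText(String).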

Add batching of SQL INSERT statements

Look into using:

PreparedStatement.addBatch()
PreparedStatement.executeBatch()
H2.db.getConnection().commit()

This may increase performance by cutting down the constant stream of tiny I/O operations during a DedupeR.class run.
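A rough sketch of how the batching could look with plain JDBC. BatchInserter, the files table, the file_hash column, and the batchSize parameter are assumptions for illustration, and a java.sql.Connection is passed in directly rather than going through the project's own H2 wrapper.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

/** Hypothetical helper: queues INSERTs and sends them to the db in batches. */
class BatchInserter {
    // Assumed schema; replace with the real table and column names.
    private static final String INSERT_SQL =
            "INSERT INTO files (file_path, file_hash) VALUES (?, ?)";

    /** Each record is {path, hash}. Returns the number of batches executed. */
    static int insertAll(Connection connection, List<String[]> records, int batchSize)
            throws SQLException {
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false); // commit per batch, not per row
        int batches = 0;
        try (PreparedStatement insert = connection.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (String[] record : records) {
                insert.setString(1, record[0]);
                insert.setString(2, record[1]);
                insert.addBatch();
                if (++pending == batchSize) {
                    insert.executeBatch();
                    connection.commit();
                    pending = 0;
                    batches++;
                }
            }
            if (pending > 0) { // flush the final partial batch
                insert.executeBatch();
                connection.commit();
                batches++;
            }
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
        return batches;
    }
}
```

The sketch does not roll back on failure, so a real version would need to decide what happens to a half-written batch. The right batch size would have to be measured.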

Add Web Service to gather statistics

Would be cool to gather statistics on how people use the program and the data it finds. We could implement a basic web service (hosted at snakedoc.net) that listens for the program to report data and stores it locally in a database for analysis later on.

Data would be anonymous - only file sizes and hashes would be transmitted. This would allow us to gather stats on the number of files people have in drives/directories, what size those files are, etc. Essentially we would be collecting data on how much duplicated data exists in the world, not only on specific drives, but for all users of the program (how many users have files with the same hash?). Completely out of scope for the core of the program, but might be interesting nonetheless.

Move all statistics and end-of-program display stuff to its own class

We can just move the end-of-program stuff to another class and have it run its own queries to get whatever info we need out of the database. Having the checkdedupes and deduper classes run program logic and also do stats gathering and display really doesn't make them very reusable.

Experiment with H2 Database's in-memory db capability.

This could yield a massive performance increase in some circumstances (for example, tons of small files causing lots of small, rapid i/o to the database). Instead of the db being a file, it's all in memory, which should make it much faster. The downside is that the database is not persistent: when the program terminates, or crashes ungracefully (an exception not caught in a try/catch and handled), the db is gone. We could have it "sync out" to a db backup stored in the filesystem for persistence, then read that back into memory when the program starts up again.

The system needs more available memory than the expected size of the db, which could be a problem on small or embedded systems. Available-memory detection could be built in to check at startup whether this requirement is met...
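A small sketch of the startup check, assuming the expected db size can be estimated up front (how to estimate it is left open). MemoryCheck, the database name superd, and the file path argument are made up for illustration. The jdbc:h2:mem: URL prefix is H2's in-memory mode, and DB_CLOSE_DELAY=-1 keeps the in-memory db alive until the JVM exits rather than dropping it when the last connection closes.

```java
/** Hypothetical startup check: use the in-memory db only if it should fit. */
class MemoryCheck {
    /** Approximate bytes the JVM can still allocate before reaching its max heap. */
    static long availableHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        long used = runtime.totalMemory() - runtime.freeMemory();
        return runtime.maxMemory() - used;
    }

    /** True if the expected db size fits in the heap that is still available. */
    static boolean canHoldInMemory(long expectedDbBytes) {
        return expectedDbBytes >= 0 && expectedDbBytes <= availableHeapBytes();
    }

    /** Picks an in-memory URL when the db should fit, a file-based URL otherwise. */
    static String chooseJdbcUrl(long expectedDbBytes, String filePath) {
        return canHoldInMemory(expectedDbBytes)
                ? "jdbc:h2:mem:superd;DB_CLOSE_DELAY=-1" // "superd" is an assumed name
                : "jdbc:h2:" + filePath;
    }
}
```

The heap figure is only an estimate, since garbage collection and other allocations change it from moment to moment. A real check would probably want some headroom on top of the expected db size.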

Verify /** */ Comments work in Properties File

I'm not sure if the Java Properties API will recognize /** */ as valid comments or if it would result in undefined behavior. The standard I've seen used is the normal # symbol, but that does not mean it's the only thing the Properties API will read as a comment; it may just be looking for LF followed by property name = property value.

Needs testing and confirmation we can read and write reliably to the properties file using /** */ as comment markers.
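A quick way to test this with nothing but the standard library. PropertiesCommentCheck is a throwaway name for this sketch. It loads properties from a string, so a /** */ line can be fed in and the resulting keys inspected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Throwaway check: how does java.util.Properties treat a given line? */
class PropertiesCommentCheck {
    /** Parses the text exactly as Properties.load() would parse a file. */
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }
}
```

Going by the documented java.util.Properties line format, only lines whose first non-blank character is # or ! are comments. A line such as `/** block comment */` is therefore read as a property with the key `/**` and the value `block comment */`, which parse() should confirm. Properties.store() also does not preserve comments read by load(), so hand-written comments of any style would be lost when the program writes the file back.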

Walker sometimes scans files more than once

Don't know if you want to take a crack at this one. It seems that Walker sometimes scans a file more than once... Not sure why...

The database should throw out any duplicates because we used the UNIQUE constraint on the file_path column in the table, but obviously it's best not to scan a file multiple times if we don't have to.

It only seems to happen occasionally... weird. See the screenshot of the console output:

image
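The cause isn't identified here, so this is only a defensive sketch and not a fix for the actual Walker class. It assumes the repeats come from the same file being reached through more than one route, for example overlapping start directories. OnceOnlyWalker is a hypothetical name. Each file is resolved to its real path and remembered in a set, so a second visit is skipped before any hashing or db write happens.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical walker: returns each regular file once, even if reached twice. */
class OnceOnlyWalker {
    static List<Path> walk(List<Path> roots) throws IOException {
        final Set<Path> seen = new HashSet<>();
        final List<Path> files = new ArrayList<>();
        for (Path root : roots) {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                        throws IOException {
                    if (attrs.isRegularFile()) {
                        // The real path is absolute and normalized, so the same
                        // file always maps to the same set entry.
                        Path real = file.toRealPath();
                        if (seen.add(real)) {
                            files.add(real);
                        }
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        return files;
    }
}
```

Logging whenever seen.add() returns false would also show which paths are being visited twice, which might help track down the real cause.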
