clipboard-history's People

Contributors

supercilex

clipboard-history's Issues

Development plan and design decisions

Development plan

Phase 1

Put everything in one bigass crate and commit shitty code. The goal is to get the core functionality 100% working. Develop with a normal cosmic iced app (or just a truly normal iced app if it turns out cosmic apps require the DE to be running).

Phase 2

De-shittify the code, implement the fastest algorithms, cleanup, etc.

Phase 3

Split the core algorithms and data structures into their own crate. There will probably be three crates: core, applet, and app. We can probably call them Gerono because the clipboard is like an hourglass in the sense that all apps will (implicitly) write to it and then any app can get data out. The clipboard is a pinch point.

Phase 4

Final polish, figure out distribution, how to ship, etc.

Design decisions

  • Database compression
    • No. Use bcachefs and enable compression on the data folder.
  • Database encoding
    • Why are we special? As in why not use sqlite? Copying to clipboard has a distinct usage pattern wherein the data is almost exclusively append-only. This means we can provide significant performance and efficiency improvements by implementing our own database format.
    • Use SoftbearStudios/bitcode and implement streaming support: SoftbearStudios/bitcode#6. Implement the same log structure used for GCH. Use encoding hints liberally. A consequence of using this lib is that any change to the data types (like adding an enum) is backwards incompatible, so we'll need a migration story. Make the first byte of each shard be a version number.
  • Shard the database across files, trying to match the number of files to the number of cores in the machine, with some minimum file size (maybe 8MB?) before starting to shard. This means we'll be able to load the database into memory at max speed. Also use a max shard size (maybe 128MB?) so any shard-wide ops aren't horrible to do (e.g. database migration only needs to occur on shards that are modified, so most of them can be untouched, but the ones we do touch probably shouldn't be humongous).
    • TODO think about how and when to shard. Ideally have an old and new folder, with an atomic rename to commit the changes. Use hard links to migrate shards that didn't change. Or maybe since most shards won't change, just have an old and new shard file and do the atomic rename for those files?
  • Only load into memory and display in the UI the clipboard entries for the latest shard. Realistically no one is going to be scrolling through thousands of clipboard entries. For search, stream through all the other shards. Search works like in https://github.com/SUPERCILEX/gnome-clipboard-history where you do both a plaintext match and a regex match (using BurntSushi's stuff of course).
  • Make IDs u32s that are local per shard (as long as the max shard size is less than 4GB we can't run out of address space, so no need to worry about overflow). This means if you delete something in another shard, we'll put the delete in that shard. Moving or favoriting will require copying the data to the current shard (which is what you want anyway since the item is hot and should be loaded as part of the single-shard init).
  • TODO We need some way to know whether a shard is dirty and needs compaction. Maybe a marker file? Maybe rename the shard file? Maybe a single database metadata file? How to ensure atomicity/consistency?
  • Can we mmap the shard? No, because compaction might move bytes around and append support is iffy AFAIK. So loading a shard will consist of streaming through the file, parsing each entry into a Vec, and allocating each string using some slab library (actually maybe that doesn't help? Need to check the implementation. Ideally the allocator is backed by a stack).
  • For large strings, don't actually store them in the shard (how about if they're 64KiB or longer, so we can use a u16 string len) and instead use the non-plaintext mime type handling.
  • Support arbitrary mime types by saving their data somewhere outside the log.
    • TODO figure out a design for this. Use one file per entry because we want people to be able to see their copied images for example. Maybe use mime type as directory name?
  • TODO can we support multiple writers? That probably requires a client/server model which I really don't like. At the very least it'd be nice to be able to know we'll corrupt things if we write. Use a lock file?
  • In memory, since we're storing entries in a Vec we need some way to support moves and deletes. The most cache efficient way to do this will be to use tombstones. So just slap a tombstone in place of the entry and append to the front of the list. Then compaction will take care of making sure the vec doesn't grow too big.
  • Tooling: we should have a cli that can read shard files (input is a list).
    • Stats command which shows you stats on the items in the shard.
    • Export command that supports human output and json output.
    • Debug command that analyses byte offsets etc.
    • Migrate command that converts from GCH to our format. Also support Clipboard Indicator and maybe if I'm feeling bold.
    • Generate command which creates pseudo random shards for perf testing.
  • A ratatui client would be cool to have. Probs the easiest to start with for development.
  • TODO consistency model. Figure out when to fsync changes to disk. Pretty sure we want to let normal copies go through without a sync, but maybe compaction should sync? There should be an fsync on application close.
  • TODO Crash resilience. Think about how the format can break and how we can recover gracefully.
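
To make the shard layout a bit more concrete, here is a minimal sketch of an append-only shard with a version number as its first byte and a u16 length in front of each string. It is an illustration only: it uses plain length-prefixed framing rather than bitcode, and `FORMAT_VERSION`, `MAX_INLINE_LEN`, `append_entry` and `stream_entries` are hypothetical names, not part of any real crate. Treating a short read at the tail as a torn write is also just one possible answer to the crash resilience TODO.

```python
import io
import struct

FORMAT_VERSION = 1        # hypothetical version number
MAX_INLINE_LEN = 0xFFFF   # largest length a u16 prefix can hold


def append_entry(shard, text):
    """Append one text entry; return False if it is too large to inline."""
    data = text.encode("utf-8")
    if len(data) > MAX_INLINE_LEN:
        return False  # caller would store it outside the log instead
    shard.seek(0, io.SEEK_END)
    if shard.tell() == 0:
        shard.write(bytes([FORMAT_VERSION]))  # first byte of the shard
    shard.write(struct.pack("<H", len(data)))  # little-endian u16 length
    shard.write(data)
    return True


def stream_entries(shard):
    """Yield entries in insertion order without loading the whole file."""
    shard.seek(0)
    version = shard.read(1)
    if not version:
        return  # empty shard
    if version[0] != FORMAT_VERSION:
        raise ValueError(f"unsupported shard version {version[0]}")
    while True:
        header = shard.read(2)
        if len(header) < 2:
            return  # clean end of file, or a torn length prefix
        (length,) = struct.unpack("<H", header)
        data = shard.read(length)
        if len(data) < length:
            return  # torn write at the tail; ignore the partial entry
        yield data.decode("utf-8")


if __name__ == "__main__":
    demo = io.BytesIO()  # stands in for a real shard file opened in binary mode
    append_entry(demo, "first copy")
    append_entry(demo, "second copy")
    for entry in stream_entries(demo):
        print(entry)
```

Because the reader only ever moves forward, the same loop would work for loading the latest shard and for streaming through older shards during search.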

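The tombstone idea for in-memory moves and deletes can be sketched roughly like this. It assumes the newest entries live at the end of the list (so "the front" of the UI is the tail of the storage) and that a slot index doubles as the entry ID. `EntryList` and its methods are hypothetical names for illustration, not the planned API.

```python
class EntryList:
    """Append-only list where deletes and moves leave tombstones behind."""

    def __init__(self):
        self.slots = []  # None marks a tombstone
        self.live = 0    # number of non-tombstone slots

    def add(self, text):
        self.slots.append(text)
        self.live += 1
        return len(self.slots) - 1  # slot index doubles as the entry ID

    def delete(self, entry_id):
        if self.slots[entry_id] is not None:
            self.slots[entry_id] = None
            self.live -= 1

    def move_to_front(self, entry_id):
        text = self.slots[entry_id]
        if text is None:
            raise KeyError(entry_id)
        self.delete(entry_id)   # leave a tombstone in the old slot
        return self.add(text)   # re-append so it becomes the newest entry

    def entries(self):
        """Newest first, skipping tombstones."""
        return [s for s in reversed(self.slots) if s is not None]

    def compact(self):
        """Drop tombstones; IDs are reassigned, so callers must refresh them."""
        self.slots = [s for s in self.slots if s is not None]


if __name__ == "__main__":
    history = EntryList()
    first = history.add("a")
    history.add("b")
    history.move_to_front(first)
    print(history.entries())
```

Nothing is shifted on delete or move, so both stay O(1); the cost is deferred to `compact`, which could be triggered once `live` falls well below `len(slots)`.
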
Stats from my clipboard:

Stats: Stats {
    raw: RawStats {
        num_entries: 5001,
        total_str_bytes: 3701123,
        ops: OpCountStats {
            num_save_texts: 5130,
            num_deletes: 129,
            num_favorites: 1,
            num_unfavorites: 0,
            num_moves: 25,
        },
    },
    computed: ComputedStats {
        mean_entry_length: 721.4664717348928,
        median_entry_length: 18,
        min_entry_length: 1,
        max_entry_length: 585304,
    },
}
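
As a sanity check, the numbers above appear to be consistent with each other under two assumptions that the dump itself does not state: that the entry count is saves minus deletes, and that the mean is taken over every saved text rather than over the surviving entries. A quick check of that arithmetic:

```python
# Values copied from the stats dump above
total_str_bytes = 3_701_123
num_entries = 5_001
num_save_texts = 5_130
num_deletes = 129
reported_mean = 721.4664717348928

# Assumption: surviving entries = saves - deletes
surviving = num_save_texts - num_deletes

# Assumption: the mean is computed per saved text, not per surviving entry
mean_per_save = total_str_bytes / num_save_texts

print(surviving == num_entries)
print(abs(mean_per_save - reported_mean) < 1e-9)
```

The mean (about 721 bytes) sitting far above the median (18 bytes) suggests a heavily skewed distribution where a few huge entries dominate the byte count, which is in line with keeping very large strings out of the shard.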

[Image: stats (histogram of entry lengths)]

import matplotlib.pyplot as plt
import numpy as np

def generate_histogram(file_path):
    # Read data from the file
    with open(file_path, 'r') as file:
        numbers = [float(line) for line in file if line.strip()]  # skip blank lines

    # Calculate the minimum and maximum powers of 2
    min_power = int(np.floor(np.log2(min(numbers))))
    max_power = int(np.ceil(np.log2(max(numbers))))

    # Create a histogram with bins as powers of 2
    # Float base so negative powers work if any value is below 1
    bins = 2.0 ** np.arange(min_power, max_power + 1)
    plt.hist(numbers, bins=bins, color='blue', edgecolor='black')

    # Add labels and title
    plt.xscale('log', base=2)
    plt.xlabel('Value (log base 2)')
    plt.ylabel('Frequency')
    plt.title('Histogram of entry lengths')

    # Modify x-axis labels to show bucket ranges
    bin_ranges = [f'2^{power}' for power in range(min_power, max_power + 1)]
    plt.xticks(bins, bin_ranges, rotation='vertical')

    # Add space at the bottom of the plot
    plt.subplots_adjust(bottom=0.15)

    # Display the histogram
    plt.savefig("stats.png", dpi=300)

file_path = 'tmp.txt'
generate_histogram(file_path)
