The clipboard-history from supercilex

Compression

No. Use bcachefs and enable compression on the CCH config folder.

Encoding

Use https://github.com/SoftbearStudios/bitcode and implement streaming support: SoftbearStudios/bitcode#6. Implement the same log structure used for GCH.

Use encoding hints liberally.

Development plan and design decisions

Development plan

Phase 1

Put everything in one bigass crate and commit shitty code. The goal is to get the core functionality 100% working. Develop with a normal cosmic iced app (or just a truly normal iced app if it turns out cosmic apps require the DE to be running).

Phase 2

De-shittify the code, implement the fastest algorithms, cleanup, etc.

Phase 3

Split the core algorithms and data structures into their own crate. There will probably be three crates: core, applet, and app. We can probably call them Gerono because the clipboard is like an hourglass in the sense that all apps will (implicitly) write to it and then any app can get data out. The clipboard is a pinch point.

Phase 4

Final polish, figure out distribution, how to ship, etc.

Design decisions

Database compression
- No. Use bcachefs and enable compression on the data folder.
Database encoding
- Why are we special? As in why not use sqlite? Copying to clipboard has a distinct usage pattern wherein the data is almost exclusively append-only. This means we can provide significant performance and efficiency improvements by implementing our own database format.
- Use SoftbearStudios/bitcode and implement streaming support: SoftbearStudios/bitcode#6. Implement the same log structure used for GCH. Use encoding hints liberally. A consequence of using this lib is that any change to the data types (like adding an enum) is backwards incompatible, so we'll need a migration story. Make the first byte of each shard be a version number.
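A minimal sketch of the version-byte idea, so a reader can tell up front whether a shard needs migrating. The names `CURRENT_VERSION` and `open_shard` are made up for illustration, and plain Python file objects stand in for whatever the real reader ends up using.

```python
CURRENT_VERSION = 1  # hypothetical current format version


def open_shard(src):
    """Read the leading version byte and report whether migration is needed."""
    first = src.read(1)
    if not first:
        raise ValueError("empty shard: missing version byte")
    version = first[0]
    if version > CURRENT_VERSION:
        # Written by a newer build: refuse rather than misparse the data.
        raise ValueError(f"shard version {version} is newer than this build")
    # The second value is True when the shard predates the current format.
    return version, version < CURRENT_VERSION
```

Since the check only touches the first byte, untouched shards can stay on an old version until something actually needs to rewrite them.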
Shard the database across files, trying to match the number of files to the number of cores in the machine, with some minimum file size (maybe 8MB?) before starting to shard. This means we'll be able to load the database into memory at max speed. Also use a max shard size (maybe 128MB?) so any shard-wide ops aren't horrible to do (e.g. database migration only needs to occur on shards that are modified, so most of them can be untouched, but the ones we do touch probably shouldn't be humongous).
- TODO think about how and when to shard. Ideally have an old and new folder, with an atomic rename to commit the changes. Use hard links to migrate shards that didn't change. Or maybe since most shards won't change, just have an old and new shard file and do the atomic rename for those files?
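The sharding policy above could be sketched roughly like this. It is only an illustration: `plan_shard_count` is a hypothetical helper, the two size constants are the "maybe" figures from the note, and how to round is an assumption.

```python
import math
import os

MIN_SHARD_BYTES = 8 * 1024 * 1024    # assumed from the "maybe 8MB?" note
MAX_SHARD_BYTES = 128 * 1024 * 1024  # assumed from the "maybe 128MB?" note


def plan_shard_count(total_bytes, cores=None):
    """Hypothetical helper: pick how many shard files to spread the data over."""
    cores = cores or os.cpu_count() or 1
    # Never create shards smaller than the minimum size.
    by_min = max(1, total_bytes // MIN_SHARD_BYTES)
    # Never let a shard exceed the maximum size.
    by_max = max(1, math.ceil(total_bytes / MAX_SHARD_BYTES))
    # Aim for one shard per core, within those two bounds.
    return max(by_max, min(cores, by_min))
```

With these numbers, a 4 MB database stays in one file, a 64 MB database on an 8-core machine gets 8 shards, and a 4 GB database gets 32 shards because the max shard size wins over the core count.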
Only load into memory and display in the UI the clipboard entries for the latest shard. Realistically no one is going to be scrolling through thousands of clipboard entries. For search, stream through all the other shards. Search works like in https://github.com/SUPERCILEX/gnome-clipboard-history where you do both a plaintext match and a regex match (using BurntSushi's stuff of course).
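One reading of "both a plaintext match and a regex match" is sketched below: an entry matches if it contains the query literally or if the query, treated as a regex, matches it. `search_entries` is a hypothetical name, and Python's `re` module is only a stand-in for the Rust regex crate.

```python
import re


def search_entries(entries, query):
    """Lazily yield entries that match `query` as plain text or as a regex."""
    try:
        pattern = re.compile(query)
    except re.error:
        pattern = None  # not a valid regex: fall back to plaintext only
    for text in entries:
        if query in text or (pattern is not None and pattern.search(text)):
            yield text
```

Because `entries` can be any iterable, older shards can be streamed through one at a time without being loaded into memory.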
Make IDs u32s that are local to each shard (as long as the max shard size is less than 4GB we can't run out of ID space, so no need to worry about overflow). This means if you delete something in another shard, we'll put the delete in that shard. Moving or favoriting will require copying the data to the current shard (which is what you want anyway since the item is hot and should be loaded as part of the single-shard init).
TODO We need some way for compaction to know whether a shard is dirty. Maybe a marker file? Maybe rename the shard file? Maybe a single database metadata file? How to ensure atomicity/consistency?
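To make the "single database metadata file" option concrete, here is a rough sketch that records dirty shards and commits the file with an atomic rename. The file name `metadata.json`, the JSON layout, and both function names are assumptions for illustration only.

```python
import json
import os
import tempfile


def write_metadata(db_dir, dirty_shards):
    """Atomically replace a hypothetical `metadata.json` listing dirty shards."""
    final_path = os.path.join(db_dir, "metadata.json")
    fd, tmp_path = tempfile.mkstemp(dir=db_dir, prefix=".metadata-")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"dirty_shards": sorted(dirty_shards)}, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # get the new contents on disk before renaming
    os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path


def read_dirty_shards(db_dir):
    """Return the set of dirty shard names, or an empty set if no file exists."""
    try:
        with open(os.path.join(db_dir, "metadata.json")) as f:
            return set(json.load(f)["dirty_shards"])
    except FileNotFoundError:
        return set()
```

Readers see either the old file or the new one, never a half-written mix. Surviving a power loss would probably also need an fsync on the directory itself, which this sketch leaves out.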
Can we mmap the shard? No, because compaction might move bytes around and append support is iffy AFAIK. So loading a shard will consist of streaming through the file, parsing each entry into a Vec and allocating a string using some slab library (actually maybe that doesn't help? Need to check the implementation. Ideally the allocator is backed by a stack).
For large strings, don't actually store them in the shard (how about if they're 65K bytes or longer so we can use a u16 string len) and instead use the non-plaintext mime type handling.
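A sketch of what a u16 length prefix implies for writing and streaming entries. This is not the bitcode encoding; it is a hand-rolled stand-in, and `write_entry`, `read_entries`, and the little-endian choice are assumptions.

```python
import struct

MAX_INLINE_LEN = 0xFFFF  # largest length a u16 prefix can describe


def write_entry(out, text):
    """Append one entry as a little-endian u16 length plus UTF-8 bytes.

    Returns False, writing nothing, when the text is too long to inline;
    the caller would then store it outside the shard.
    """
    data = text.encode("utf-8")
    if len(data) > MAX_INLINE_LEN:
        return False
    out.write(struct.pack("<H", len(data)))
    out.write(data)
    return True


def read_entries(src):
    """Stream through a shard body, yielding each decoded entry in order."""
    while True:
        header = src.read(2)
        if len(header) < 2:
            return  # clean end of file, or a truncated trailing header
        (length,) = struct.unpack("<H", header)
        data = src.read(length)
        if len(data) < length:
            return  # truncated trailing entry, e.g. after a crash
        yield data.decode("utf-8")
```

Note that the limit applies to the encoded byte length, not the character count, so a string with multi-byte characters hits it sooner.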
Support arbitrary mime types by saving their data somewhere outside the log.
- TODO figure out a design for this. Use one file per entry because we want people to be able to see their copied images for example. Maybe use mime type as directory name?
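If the mime type does become the directory name, the path for an entry might be built along these lines. The layout (`<data_dir>/<type>/<subtype>/<id>`) and the name `blob_path` are guesses for illustration, not a decided design.

```python
from pathlib import PurePosixPath


def blob_path(data_dir, mime_type, entry_id):
    """Build a hypothetical on-disk path for one non-plaintext entry."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    kind, _, subtype = essence.partition("/")
    # Reject anything that could escape the data directory.
    if not kind or not subtype or ".." in essence or "/" in subtype:
        raise ValueError(f"unsupported mime type: {mime_type!r}")
    return PurePosixPath(data_dir) / kind / subtype / str(entry_id)
```

Mime types come from other applications, so they have to be treated as untrusted input before being turned into paths.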
TODO can we support multiple writers? That probably requires a client/server model which I really don't like. At the very least it'd be nice to be able to know we'll corrupt things if we write. Use a lock file?
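The lock file idea could look something like the sketch below, which relies on exclusive file creation so that a second writer finds out before touching anything. The file name `lock` and the helper names are hypothetical.

```python
import os


def try_lock(db_dir):
    """Try to take an exclusive lock; return the lock path, or None if held."""
    lock_path = os.path.join(db_dir, "lock")
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the owner to help debug stale locks
    return lock_path


def unlock(lock_path):
    """Release the lock by removing the file."""
    os.remove(lock_path)
```

The obvious weakness is a stale lock after a crash. An advisory lock such as flock would be released automatically when the process dies, which may be a better fit.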
In memory, since we're storing entries in a Vec we need some way to support moves and deletes. The most cache efficient way to do this will be to use tombstones. So just slap a tombstone in place of the entry and append to the front of the list. Then compaction will take care of making sure the vec doesn't grow too big.
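A small sketch of the tombstone approach. It assumes the "front" of the list is the newest end, so moving an entry means tombstoning its old slot and pushing it again. `EntryList` and its methods are illustrative names, with a Python list standing in for the Vec.

```python
TOMBSTONE = None  # marker left behind where an entry used to be


class EntryList:
    """Hypothetical in-memory list: newest entries live at the end."""

    def __init__(self):
        self.slots = []
        self.live = 0

    def push(self, text):
        self.slots.append(text)
        self.live += 1
        return len(self.slots) - 1  # index of the new entry

    def delete(self, index):
        if self.slots[index] is not TOMBSTONE:
            self.slots[index] = TOMBSTONE
            self.live -= 1

    def move_to_front(self, index):
        text = self.slots[index]
        if text is TOMBSTONE:
            raise KeyError(index)
        self.delete(index)
        return self.push(text)

    def compact(self):
        """Drop tombstones so the list stops growing; indices change."""
        self.slots = [s for s in self.slots if s is not TOMBSTONE]

    def newest_first(self):
        return [s for s in reversed(self.slots) if s is not TOMBSTONE]
```

Deletes and moves never shift other entries, so existing indices stay valid until `compact` runs.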
Tooling: we should have a cli that can read shard files (input is a list).
- Stats command which shows you stats on the items in the shard.
- Export command that supports human output and json output.
- Debug command that analyses byte offsets etc.
- Migrate command that converts from GCH to our format. Also support Clipboard Indicator and maybe if I'm feeling bold.
- Generate command which creates pseudo random shards for perf testing.
A ratatui client would be cool to have. Probs the easiest to start with for development.
TODO consistency model. Figure out when to fsync changes to disk. Pretty sure we want to let normal copies go through without a sync, but maybe compaction should sync? There should be an fsync on application close.
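The policy floated above (no sync on normal copies, sync on compaction and on close) could be sketched as follows. `ShardWriter` is a hypothetical class, and whether compaction should sync is still an open question.

```python
import os


class ShardWriter:
    """Hypothetical append-only writer with an explicit sync step."""

    def __init__(self, path):
        self.file = open(path, "ab")

    def append(self, record):
        # Normal copies: hand the bytes to the OS but do not wait for the disk.
        self.file.write(record)
        self.file.flush()

    def sync(self):
        # Compaction or shutdown: block until the data is on stable storage.
        self.file.flush()
        os.fsync(self.file.fileno())

    def close(self):
        self.sync()
        self.file.close()
```

With this split, a crash can lose the most recent unsynced copies but should not affect data written before the last sync.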
TODO Crash resilience. Think about how the format can break and how we can recover gracefully.

Stats from my clipboard:

Stats: Stats {
    raw: RawStats {
        num_entries: 5001,
        total_str_bytes: 3701123,
        ops: OpCountStats {
            num_save_texts: 5130,
            num_deletes: 129,
            num_favorites: 1,
            num_unfavorites: 0,
            num_moves: 25,
        },
    },
    computed: ComputedStats {
        mean_entry_length: 721.4664717348928,
        median_entry_length: 18,
        min_entry_length: 1,
        max_entry_length: 585304,
    },
}

import matplotlib.pyplot as plt
import numpy as np

def generate_histogram(file_path):
    # Read data from the file
    with open(file_path, 'r') as file:
        # Skip blank lines (e.g. a trailing newline) so float() doesn't fail
        numbers = [float(line) for line in file if line.strip()]

    # Calculate the minimum and maximum powers of 2
    min_power = int(np.floor(np.log2(min(numbers))))
    max_power = int(np.ceil(np.log2(max(numbers))))

    # Create a histogram with bins as powers of 2
    bins = 2 ** np.arange(min_power, max_power + 1)
    plt.hist(numbers, bins=bins, color='blue', edgecolor='black')

    # Add labels and title
    plt.xscale('log', base=2)
    plt.xlabel('Value (log base 2)')
    plt.ylabel('Frequency')
    plt.title('Histogram of entry lengths')

    # Modify x-axis labels to show bucket ranges
    bin_ranges = [f'2^{int(np.log2(bins[i]))}' for i in range(len(bins))]
    plt.xticks(bins, bin_ranges, rotation='vertical')

    # Add space at the bottom of the plot
    plt.subplots_adjust(bottom=0.15)

    # Display the histogram
    plt.savefig("stats.png", dpi=300)

file_path = 'tmp.txt'
generate_histogram(file_path)

Places to post the release

Rust, gnome, pop_os, linux, arch linux, fedora, and kde subreddits
https://wiki.archlinux.org/title/Clipboard
https://arewewaylandyet.com/
Rust users forum maybe? TWIR
Hacker News
