I'm writing a hex editor , a

having to do bookkeeping work on every access. <p dir="

Question: Saving copy-on-write mmap changes back to file about memmap2-rs HOT 12 CLOSED

razrfalcon commented on July 29, 2024

Question: Saving copy-on-write mmap changes back to file

from memmap2-rs.

Comments (12)

RazrFalcon commented on July 29, 2024

I personally using memmap2 only for reading, so I cannot help you here. Maybe someone else would.

from memmap2-rs.

adamreichold commented on July 29, 2024

This does not answer your general question and I am not sure if file sizes prohibit this in your case, but editors usually do extra work to atomically replace the file being edited, i.e. instead of modifying foo.bin, they would copy it to .foo.bin.swp and modify only that file and finally when the user wants to save, rename .foo.bin.swp to foo.bin thereby atomically replacing one file by the other.

If you adopt this approach, you can map the .foo.bin.swp directly and modify its contents which would not touch the original.

If you want to optimize the impact of copying foo.bin to .foo.bin.swp in the first place, you could do something like cp --reflink=auto foo.bin .foo.bin.swp which shares the underlying file extents if possible thereby copying only metadata and data only on demand, i.e. if you modify the mapping and the page cache's write back wants to push the pages to disk.

from memmap2-rs.

crumblingstatue commented on July 29, 2024

I am not sure if file sizes prohibit this in your case

I want the hex editor to be usable for things like viewing and editing block devices, which could be hundreds of gigabytes large, so I'm not sure how well this approach would play with that.

you could do something like cp --reflink=auto foo.bin .foo.bin.swp

I want my hex editor to work at least on windows and linux, so I don't want to use unix specific commands. Not sure if windows supports something similar to this.

from memmap2-rs.

adamreichold commented on July 29, 2024

In that case, I would suggest looking at it from the other way around: Mmap the files directly for reading and writing. But instead of writing pending modifications into the memory map, store those in separate data structures (a sort of overlay that is considered when displayed any modified file range) and write them into the memory map only when saving.

This is sort of like your option 2 but without closing and re-opening the memory map and keeping the buffers storing the modifications around until saving. Of course, this still means modifications will not be atomic from the PoV of other reads of the underlying file.

from memmap2-rs.

crumblingstatue commented on July 29, 2024

instead of writing pending modifications into the memory map, store those in separate data structures (a sort of overlay that is considered when displayed any modified file range) and write them into the memory map only when saving.

This is what I would like to avoid doing if possible, as it makes the implementation a lot more complex.
Copy-on-write mmap would allow my data model to remain &mut [u8], which is drastically more simple than having to do bookkeeping work on every access.

from memmap2-rs.

adamreichold commented on July 29, 2024

having to do bookkeeping work on every access.

Don't you have to do the book keeping anyway if you are going for option 2?

(I don't think option 1 is obviously sound. This was discussed repeatedly on URLO without reaching a consensus that the current memory map semantics yield anything which soundness can be based upon. But then again, since this is somewhat unchartered territory, soundness of option 2 is not really ensured either. Using memory maps is always mildly problematic with Rust ATM. (Even copy-on-write maps only protect other processes from seeing your writes, it does not protect your process from seeing concurrent modifications performed by other processes.))

from memmap2-rs.

crumblingstatue commented on July 29, 2024

Don't you have to do the book keeping anyway if you are going for option 2?

There is nothing to do when trying to access memory. The only thing that has to be done is keep track of what changed during editing operations.

However, if I'm not using a copy-on-write memory map, I have to, on every access, look up if that access is within an edited range, and if it is, find the edit buffer that corresponds to that location, and return a slice to that instead of the original memory.
What if the request spans both an edited range and a non-edited range? I can't return a contiguous slice that spans both my edit buffer and the original memory, unless I allocate and copy on every access.

from memmap2-rs.

adamreichold commented on July 29, 2024

Since you need detailed control of when writeback (user presses save) happens and care about soundness and portability, I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read. (Meaning just read fixed size blocks from storage into memory to display and optionally modify them (and track this on a block level). If a configured memory limit is reached, write out clean blocks to make space. When the user saves write out all dirty blocks.)

from memmap2-rs.

crumblingstatue commented on July 29, 2024

I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read.

I don't see how fixed size blocks would help with simplicity, as there could still be requests that span blocks. And there are certain APIs that expect slices, like memmem. If I wanted to do a string search with such an implementation, I couldn't just simply call memmem, I would have to call memmem for every page, plus search each page boundary for matches as well.
And this is just one example of how such an implementation complicates things.
I would rather let the operating system take care of all this hard work for me, as it was written by professionals who know what they are doing, and has received lots of testing over the years.

from memmap2-rs.

crumblingstatue commented on July 29, 2024

I did some preliminary testing, and this passes both on windows and linux:

#[test]
fn test_mmap_write() {
    use std::fs::OpenOptions;
    use std::io::{Seek, Write};
    std::fs::write(
        "testfile-small.txt",
        include_bytes!("../testfile-small-in.txt"),
    )
    .unwrap();
    let mut f = OpenOptions::new()
        .read(true)
        .write(true)
        .open("testfile-small.txt")
        .unwrap();
    let mut mmap = unsafe { memmap2::MmapOptions::new().map_copy(&f).unwrap() };
    assert_eq!(&*mmap, include_bytes!("../testfile-small-in.txt"));
    mmap[7..=12].copy_from_slice(b"Edited");
    assert_eq!(&*mmap, include_bytes!("../testfile-small-edited-in.txt"));
    mmap.flush().unwrap();
    f.seek(std::io::SeekFrom::Start(7)).unwrap();
    f.write_all(&mmap[7..=12]).unwrap();
    f.flush().unwrap();
    let new = std::fs::read("testfile-small.txt").unwrap();
    assert_eq!(new, include_bytes!("../testfile-small-edited-in.txt"));
    mmap[70..=86].copy_from_slice(b"Changed you again");
    assert_eq!(&*mmap, include_bytes!("../testfile-small-edited2-in.txt"));
    mmap.flush().unwrap();
    f.seek(std::io::SeekFrom::Start(70)).unwrap();
    f.write_all(&mmap[70..=86]).unwrap();
    f.flush().unwrap();
    let new = std::fs::read("testfile-small.txt").unwrap();
    assert_eq!(new, include_bytes!("../testfile-small-edited2-in.txt"));
}

testfile-small-in.txt:

Hello, World! I am a small test file.
I'm pretty nice as you can see.
Change me please.
Thank you!.

testfile-small-edited-in.txt:

Hello, Edited I am a small test file.
I'm pretty nice as you can see.
Change me please.
Thank you!.

testfile-small-edited2-in.txt:

Hello, Edited I am a small test file.
I'm pretty nice as you can see.
Changed you again
Thank you!.

This gives me some level of confidence that I could use this strategy.
I could expose it under an --unsafe-memmap option, and warn users about potential risks, but the option would still be available for users who understand the risks.

Then if some smart person wants to contribute a smart paging algorithm for my hex editor, I'll accept their pull request. But until then, I'd like to have at least basic usability for huge files without greatly complicating the implementation.

from memmap2-rs.

adamreichold commented on July 29, 2024

I think you should probably implement your own paging to keep the simplicity of not having to implement an overlay on read.

I think my tone was wrong here. What I would like to have said is that if you ask about soundness - meaning defined behaviour no matter the conditions under which the code executed to me - and portability, then I fear that the semantics this crate (or rather the underlying system calls) can provide will be insufficient. The intersection between memory maps and Rust's take on memory safety is still not really defined as the discussions on URLO seem to indicate.

Testing a certain pattern before usage can of course provide evidence that it is practical, but it cannot answer the question of soundness. This is why I suggested to try to implement your own paging without using the kernel-provided memory maps, because soundness should be achievable there based on the currently established interactions between Rust and the kernel.

Concerning the pattern your test case employs, I think your option 2 is still preferable and it avoids aliasing the same memory via the memory map and via the file handle (which will eventually write into the same physical page in the kernel's page cache). An alternative might to build an API which ensures that your code can only either Deref the Mmap or accessing the underlying File, but not both at the same time.

But finally, even option 2 needs to assume that your process is the only one mapping the underlying file to be safe from aliasing issues which is usually not the case for an interactively used editor.

from memmap2-rs.

crumblingstatue commented on July 29, 2024

I'm fine with assuming that I have exclusive access to the file. I'll document this invariant for the users when they opt into mmap. The default will still be reading everything into a Vec. mmap will be an option available in cases it's needed, like huge files or block devices, and the user should take precautions that they provide exclusive access.

As far as the question of soundness goes, I don't need it to be 100% theoretically sound, I just want to know if there is something obviously wrong with these approaches that is known to break things.

As for why I would rather use a copy on write memmap than implementing my own paging, the answer is greatly reduced implementation complexity. If I can rely on data being [u8], I can reuse all the algorithms and libraries that work on byte slices.
This works because mmap maps pages on page fault, so I (and the libraries I use) don't have to do anything special, just access bytes directly.
If I did my own paging, I would have to write significantly more complex algorithms, and I couldn't use those libraries that work on byte slices either. I would have to write my own versions that request pages explicitly instead of it happening on page fault.
Unless I can write my own page fault handler for doing this, but I get the feeling that is another can of worms that would lead to similar issues that mmap has.

from memmap2-rs.

Question: Saving copy-on-write mmap changes back to file about memmap2-rs HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent