
Encrypted Filesystem for TypeScript/JavaScript Applications

Home Page: https://polykey.com

License: Apache License 2.0

Topics: filesystem, encryption, encrypted-store

js-encryptedfs's Introduction

js-encryptedfs


Encrypted filesystem library for TypeScript/JavaScript applications

  • Virtualised - files, directories and permissions are all virtual constructs; they do not correspond to real filesystems
  • Orthogonally Persistent - all writes are automatically persisted
  • Encrypted-At-Rest - all persistence is encrypted
  • Random Read & Write - encryption and decryption operate over fixed-size blocks
  • Streamable - files do not need to be loaded fully into memory
  • Comprehensive continuous benchmarks in CI/CD

Development based on js-virtualfs: https://github.com/MatrixAI/js-virtualfs

Installation

npm install --save encryptedfs

Usage

import type { EFSWorkerModule } from 'encryptedfs';

// `spawn` and `Worker` are assumed here to come from the threads library,
// which @matrixai/workers builds on
import { spawn, Worker } from 'threads';
import { WorkerManager } from '@matrixai/workers';
import { EncryptedFS, utils } from 'encryptedfs';

const key = utils.generateKeySync(256);

const efs = await EncryptedFS.createEncryptedFS({
  dbPath: '/tmp/efs',
  dbKey: key,
});

// optionally set up the worker manager for multi-threaded encryption/decryption
const workerManager = await WorkerManager.createWorkerManager<EFSWorkerModule>({
  workerFactory: () => spawn(new Worker('./src/workers/efsWorker'))
});

efs.setWorkerManager(workerManager);

// create a new directory
const newDir = `test`;
await efs.mkdir(newDir);

// write out to a file
await efs.writeFile(`${newDir}/testFile`, 'output');

// read in the file (contents = 'output')
const contents = await efs.readFile(`${newDir}/testFile`);

// closes the EFS
await efs.stop();

// destroys the EFS state
await efs.destroy();

Encryption & Decryption Protocol

Encryption and decryption are implemented using the node-forge library. However, it is possible to plug in your own encrypt and decrypt functions.

Internally we use AES-GCM symmetric encryption with a master dbKey that can be 128, 192 or 256 bits long.

The dbKey can be generated with any of several methods:

  • generateKey - random, asynchronous
  • generateKeySync - random, synchronous
  • generateKeyFromPass - derived from a user-provided "password", asynchronous
  • generateKeyFromPassSync - derived from a user-provided "password", synchronous

For example:

const [key, salt] = await generateKeyFromPass('secure password');

This uses PBKDF2 to derive a symmetric key. The default key length will be 256 bits. For deterministic key generation, make sure to specify the salt parameter.

const [key, salt] = await generateKeyFromPass('secure password', 'salt');

Construction of EncryptedFS takes an optional blockSize parameter, which defaults to 4 KiB. All files are broken up into 4 KiB plaintext blocks; when encrypted, they are persisted as ciphertext blocks.

The ciphertext blocks contain an initialization vector plus an authentication tag. Here is an example of the structure:

| iv (16 bytes) | authTag (16 bytes) | ciphertext data (x bytes) |

The ciphertext data length is equal to the plaintext block length.
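
For illustration, here is a minimal TypeScript sketch of how a ciphertext block with this layout could be split apart (the helper name is illustrative, not part of the EFS API):

// Sketch: splitting a ciphertext block into its parts, assuming the
// 16-byte iv and authTag layout shown above
const IV_SIZE = 16;
const AUTH_TAG_SIZE = 16;

function parseCipherBlock(block: Buffer): {
  iv: Buffer;
  authTag: Buffer;
  data: Buffer;
} {
  const iv = block.subarray(0, IV_SIZE);
  const authTag = block.subarray(IV_SIZE, IV_SIZE + AUTH_TAG_SIZE);
  const data = block.subarray(IV_SIZE + AUTH_TAG_SIZE);
  return { iv, authTag, data };
}

With the default 4 KiB plaintext blocks, each ciphertext block is then 16 + 16 + 4096 = 4128 bytes on disk.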

Differences with Node Filesystem

There are some differences between EFS and Node FS:

  • User, Group and Other permissions: In EFS, User, Group and Other permissions are strictly confined to their permission class. For example, a User in EFS does not have the permissions that Group or Other has, while in Node FS a User also has the permissions that Group and Other have.
  • Sticky Files: In Node FS, the sticky bit is a permission bit set on a file or directory that allows only the owner of the file/directory or the root user to delete or rename it. EFS does not support the use of sticky bits.
  • Character Devices: Node FS supports Character Devices, which can be written to and read from. EFS does not support Character Devices yet.

Development

Run nix-shell, and once you're inside, you can use:

# install (or reinstall packages from package.json)
npm install
# build the dist
npm run build
# run the repl (this allows you to import from ./src)
npm run ts-node
# run the tests
npm run test
# lint the source code
npm run lint
# automatically fix the source
npm run lintfix

Benchmarks

npm run bench

View benchmarks here: https://github.com/MatrixAI/js-encryptedfs/blob/master/benches/results with https://raw.githack.com/

Docs Generation

npm run docs

See the docs at: https://matrixai.github.io/js-encryptedfs/

Publishing

Publishing is handled automatically by the staging pipeline.

Prerelease:

# npm login
npm version prepatch --preid alpha # premajor/preminor/prepatch
git push --follow-tags

Release:

# npm login
npm version patch # major/minor/patch
git push --follow-tags

Manually:

# npm login
npm version patch # major/minor/patch
npm run build
npm publish --access public
git push
git push --tags

js-encryptedfs's People

Contributors

cmcdragonkai, emmacasolin, meanmangosteen, robert-cronin, scottmmorris, tegefaulkes


js-encryptedfs's Issues

File Integrity - Merkle Tree for Encrypted Chunks

There is no restriction on using only a single key for encryption with EFS; each new instantiation of EFS may use a different key. It would be beneficial to ensure that when an encrypted file is opened, the key loaded in the EFS is the same as the key that the file was encrypted with.

This helps preserve the integrity of the file by not performing any operation unless the key is verified to be the same. It can be argued that using AES-GCM already provides integrity, but there is a difference, albeit a subtle one.

This key validation measure is designed to prevent the user from destroying the integrity of the file. For example, there is nothing stopping the user from writing a block anywhere in an existing file using a different key from the one it was written with, even when using AES-GCM. Key validation on open() can prevent this. The most common use case here would not involve a malicious attacker but rather user error.

The integrity provided by GCM is designed to alert the user if data modification has occurred. This would have occurred outside the EFS universe, and generally implies that the file has been maliciously tampered with or corrupted, so the data is not to be trusted.

One way to perform the key validation is to hash the key and put it in the metadata header of a file when it is open(*, w* )'d. On any open(*, !w*) of a file, the key the EFS is loaded with will be hashed and compared to the one in the header of the file.
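
A minimal sketch of this check, assuming a SHA-256 fingerprint and a hypothetical file header (header.keyHash is illustrative, not an existing field):

import { createHash } from 'crypto';

// Hypothetical key fingerprint for the header; not part of the current API
function keyFingerprint(key: Buffer): string {
  return createHash('sha256').update(key).digest('hex');
}

// On open(*, w*): store the fingerprint of the loaded key
// header.keyHash = keyFingerprint(dbKey);

// On open(*, !w*): refuse to operate if the fingerprints differ
// if (header.keyHash !== keyFingerprint(dbKey)) {
//   throw new Error('EFS key does not match the key this file was encrypted with');
// }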

feeding associated data in GCM cipher mode

MatrixAI/Polykey#14 (comment)

The 'with associated data (AD)' part means that you can feed additional data into the algorithm to tie the ciphertext to some context. This is so

attempts to "cut-and-paste" a valid ciphertext into a different context are detected and rejected.

Most commonly, the AD would be the header of an encrypted network packet, but in our case the AD could be the filename and/or metadata, which would bind it to its ciphered file data.

We have to decide what sort of data would be most appropriate to use as AD, or whether to use AD at all. It is optional.
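
As an illustration only (the EFS crypto layer is pluggable), here is how associated data works with Node's built-in AES-256-GCM; using the filename as the AD is an assumption for this sketch:

import { createCipheriv, createDecipheriv, randomBytes } from 'crypto';

// Encrypt a block, binding it to `ad` (e.g. the filename)
function encryptBlock(key: Buffer, plaintext: Buffer, ad: Buffer) {
  const iv = randomBytes(16);
  const cipher = createCipheriv('aes-256-gcm', key, iv); // 256-bit key
  cipher.setAAD(ad);
  const data = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, authTag: cipher.getAuthTag(), data };
}

// Decrypt a block; final() throws if the AD (or anything else) was tampered with
function decryptBlock(
  key: Buffer,
  block: { iv: Buffer; authTag: Buffer; data: Buffer },
  ad: Buffer,
): Buffer {
  const decipher = createDecipheriv('aes-256-gcm', key, block.iv);
  decipher.setAAD(ad);
  decipher.setAuthTag(block.authTag);
  return Buffer.concat([decipher.update(block.data), decipher.final()]);
}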

Integrate Generic Crypto library

ATM this is using Node crypto. This is not portable to the other systems that we intend to use js-encryptedfs with.

Now this isn't using PGP, so we don't use kbpgp or openpgp.js. However we are using AES.

It seems we have a couple of choices here:

I'm not sure if the NaCL library fits our requirements.

Furthermore nativescript will require a completely different implementation too. So we would likely need to tackle that later.

Update jest test configuration

This applies to EncryptedFS as well.
We should alter the jest configuration to set up the module name mapper.

const { pathsToModuleNameMapper } = require('ts-jest/utils');
const { compilerOptions } = require('./tsconfig');
module.exports = {
  roots: ['<rootDir>/tests'],
  testMatch: ['**/?(*.)+(spec|test|unit.test).+(ts|tsx|js)'],
  transform: {
    '^.+\\.tsx?$': 'ts-jest',
  },
  globalSetup: '<rootDir>/tests/setup.ts',
  globalTeardown: '<rootDir>/tests/teardown.ts',
  moduleNameMapper: pathsToModuleNameMapper(compilerOptions.paths, {
    prefix: '<rootDir>/src/',
  }),
};

The global setup and teardown might be useful for polykey to set the daemons running or do other stateful setup. Tests that aren't unit tests should also be renamed from ...unit.test.ts to ...test.ts.

Migrate to TypeScript and JavaScript-Demo Environment

The JavaScript demo environment has a typescript branch: https://github.com/MatrixAI/JavaScript-Demo/tree/typescript

We intend to migrate our projects to this, except for the ones that have reached stability like js-virtualfs.

But all the other polykey projects are going to be migrated over as well.

So there are 2 aims here:

  1. Start using typescript + webpack
  2. Use the same Nix structure as JS demo

Make sure that all tests have been migrated as well to Jest.

Node process context control

VFS has certain methods to control the file system context (e.g. chdir, setUid, setGid etc), but this is not exposed on the native nodejs fs module. It is instead controlled via the process module. In order to change the context on the lower fs, one would need access to the process module and this entails some extra research as to the best way to approach this. Essentially we want to be careful about tampering with the process context.

One solution to this is to ask for these methods as an additional parameter when the user passes in upperDir and lowerDir. The new constructor for EFS could look something like this:

const efs = new EncryptedFS(
  vfs,    // upperDir
  vfs,    // upperDir context control (for chdir, setUid, setGid, etc...)
  fs,     // lowerDir
  process, // lowerDir context control (for chdir, setUid, setGid, etc...)
  ...
)

These context control objects have to conform to an interface to ensure that the relevant methods exist:

interface FSContextControl {
  chdir(path: string): void;
  setUid(uid: number): void;
  setGid(gid: number): void;
}

But if we are going to use process, then we have to ensure that EFS is notified whenever the process context is changed externally to EFS (i.e. manually by user). This could be done by creating a proxy process that acts as an observer pattern and using this in EFS instead so that EFS is notified every time the cwd/gid/uid is changed in the process context. This should also go in usage notes and operator warnings to ensure correct usage by the end user.
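
A minimal sketch of such a proxy, assuming an illustrative onContextChange hook that EFS would subscribe to:

// Wrap `process` so EFS is notified when the context changes externally
function watchProcess(
  proc: NodeJS.Process,
  onContextChange: (method: string) => void,
): NodeJS.Process {
  const watched = ['chdir', 'setuid', 'setgid'];
  return new Proxy(proc, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value === 'function' && watched.includes(String(prop))) {
        return (...args: Array<unknown>) => {
          const result = value.apply(target, args);
          onContextChange(String(prop)); // notify EFS after the change
          return result;
        };
      }
      return value;
    },
  });
}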

`EncryptedFS.rmdir` does not remove files on Windows

Describe the bug

EncryptedFS.rmdir() does not work as expected on Windows. It removes directories but not files.


Update: The issue may in fact be that Linux and Mac are removing files when they shouldn't and Windows is actually behaving correctly. This would mean that the tests are wrong. The tests that fail on Windows due to this all throw an EEXIST error when trying to open a (new) file and create a file descriptor to it. When this happens the error looks something like this:

ErrorEncryptedFSError: EEXIST: file already exists, dir\file1

      2041 |             // Target already exists cannot be created exclusively
      2042 |             if (flags & constants.O_CREAT && flags & constants.O_EXCL) {
    > 2043 |               throw new errors.ErrorEncryptedFSError({
           |                     ^
      2044 |                 errno: errno.EEXIST,
      2045 |                 path: path as string,
      2046 |                 syscall: 'open',

      at constructor_._open (src/EncryptedFS.ts:2043:21)
      at src/EncryptedFS.ts:1885:15
      at Object.maybeCallback (src/utils.ts:405:12)
      at Object.<anonymous> (tests/EncryptedFS.concurrent.test.ts:832:12)

To Reproduce

Run the following script on Windows:

import fs from 'fs';
import os from 'os';
import path from 'path';
import Logger, { LogLevel, StreamHandler } from '@matrixai/logger';
import { DB } from '@matrixai/db';
import EncryptedFS from './src/EncryptedFS';
import * as utils from './src/utils';
import INodeManager from './src/inodes/INodeManager';

async function main() {
  const logger = new Logger(`${EncryptedFS.name} Concurrency`, LogLevel.WARN, [
      new StreamHandler(),
  ]);
  const dbKey: Buffer = utils.generateKeySync(256);
  let dataDir: string;
  let db: DB;
  let iNodeMgr: INodeManager;
  let efs: EncryptedFS;

  dataDir = await fs.promises.mkdtemp(
    path.join(os.tmpdir(), 'encryptedfs-test-'),
  );
  db = await DB.createDB({
    dbPath: dataDir,
    crypto: {
      key: dbKey!,
      ops: {
        encrypt: utils.encrypt,
        decrypt: utils.decrypt,
      },
    },
    // @ts-ignore - version of js-logger is incompatible (remove when js-db updates to 5.* here)
    logger: logger.getChild(DB.name),
  });
  iNodeMgr = await INodeManager.createINodeManager({
    db,
    logger: logger.getChild(INodeManager.name),
  });
  efs = await EncryptedFS.createEncryptedFS({
    db,
    iNodeMgr,
    logger,
  });

  const path1 = path.join('dir', 'file1');
  await efs.mkdir('dir');
  await efs.mkdir('dir/dir2');
  let fd = await efs.open(path1, 'wx+');
  await efs.close(fd);
  console.log('Dir exists before rmdir? ', await efs.exists('dir'));
  console.log('Dir2 exists before rmdir? ', await efs.exists('dir/dir2'));
  console.log('FD exists before rmdir? ', await efs.exists(path1));
  await efs.rmdir('dir', { recursive: true });
  console.log('Dir exists after rmdir? ', await efs.exists('dir'));
  console.log('Dir2 exists after rmdir? ', await efs.exists('dir/dir2'));
  console.log('FD exists after rmdir? ', await efs.exists(path1));

  await efs.stop();
  await fs.promises.rm(dataDir, {
    force: true,
    recursive: true,
  });
}

main()

Output:

Dir exists before rmdir?  true
Dir2 exists before rmdir?  true
FD exists before rmdir?  true
Dir exists after rmdir?  false
Dir2 exists after rmdir?  false
FD exists after rmdir?  true

Expected behavior

The file should not exist after rmdir is called. This is observed when running the same script on Linux:

Dir exists before rmdir?  true
Dir2 exists before rmdir?  true
FD exists before rmdir?  true
Dir exists after rmdir?  false
Dir2 exists after rmdir?  false
FD exists after rmdir?  false

Platform

  • OS: Windows

Additional context

May be related to MatrixAI/Polykey-CLI#14

Upper directory (in-memory fs) should act as a block cache

Similar to a page cache in operating systems, all read and write file operations can first be tried against the image of the file contained in the VFS. This would be on a block-level basis. If the desired block is not present in the file image inside the upper directory (VFS), a 'block fault' would occur. The corresponding block, persisted on disk, would then be read, decrypted and populated into the image in the upper dir.

Every subsequent read of the block would simply be retrieved from memory, instead of performing a disk read.

Every write would also populate the upper dir image to ensure the block in the upper dir contains the most up-to-date data.

With these measures, blocks in the upper dir image cannot become 'dirty', so the integrity of every read of loaded blocks in the upper dir is guaranteed.

To know which blocks are currently loaded in a file in the upper dir, a set of loaded block numbers can be maintained for each file. Block numbers would only ever be added, never removed.
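
A minimal sketch of this scheme for a single file, with illustrative readBlockFromDisk/decryptBlock helpers standing in for the lower-dir read and decryption:

declare function readBlockFromDisk(blockNum: number): Promise<Buffer>;
declare function decryptBlock(ciphertext: Buffer): Buffer;

class FileBlockCache {
  protected loaded: Set<number> = new Set(); // block numbers present in the upper dir
  protected blocks: Map<number, Buffer> = new Map();

  async readBlock(blockNum: number): Promise<Buffer> {
    if (!this.loaded.has(blockNum)) {
      // 'block fault': read from disk, decrypt, populate the image
      const ciphertext = await readBlockFromDisk(blockNum);
      this.blocks.set(blockNum, decryptBlock(ciphertext));
      this.loaded.add(blockNum);
    }
    return this.blocks.get(blockNum)!;
  }

  writeBlock(blockNum: number, plaintext: Buffer): void {
    // writes always populate the image, so loaded blocks can never be dirty
    this.blocks.set(blockNum, plaintext);
    this.loaded.add(blockNum);
  }
}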

Documentation

Documentation needs to be created and should include:

  • Overview (explanation of upper and lower directory/chunks etc)
  • Installation
  • Usage
  • Warnings (e.g. for usage of context control actions like chdir and setuid etc.)

This can go into README.md

@ imports don't work with webworkers

Another issue with @ imports.

The web worker says it can't find the util file, which has some common crypto constants. So I have just left it as a standard relative import.

Full error:

Error: Cannot find module '@encryptedfs/util'
    Require stack:
    - /home/<user>/Documents/github/js-encryptedfs/src/EncryptedFSCrypto.ts
    - /home/<user>/Documents/github/js-encryptedfs/src/EncryptedFSCryptoWorker.ts
    - /home/<user>/Documents/github/js-encryptedfs/[worker eval]

Fix Permission Hierarchy - User, Group, Other permission checking

Specification

The current README.md states:

User, Group and Other permissions: In EFS User, Group and Other permissions are strictly confined to their permission class. For example, a User in EFS does not have the permissions that a Group or Other has while in Node FS a User also has permissions that Group and Other have.

Additional comments

@scottmmorris: when I was testing the permissions in efs, if a user didn't have write permissions a write wouldn't work in any case. But I think for some of the pjdfs tests, if the group or other had write permissions, the write was supposed to work even if the user didn't have the permission.

The pjdfs tests are doing it correctly. Permissions are "hierarchical", meaning one should be checking user, group and other permissions.

Additional context

Tasks

  1. Add back the permission tests from pjdfs
  2. Update the permission checking algorithm to properly match the expected behaviour (one reading of it is sketched below)
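
As a sketch only, here is one possible reading of the expected behaviour, where the applicable permission classes are unioned rather than strictly confined. This is an interpretation for illustration, not the project's confirmed algorithm:

// mode is the requested access bits (r=4, w=2, x=1)
function checkPermission(
  mode: number,
  uid: number,
  gid: number,
  stat: { mode: number; uid: number; gid: number },
): boolean {
  const perms = stat.mode & 0o777;
  let allowed = perms & 0o007; // Other bits apply to everyone
  if (gid === stat.gid) allowed |= (perms >> 3) & 0o007; // add Group bits
  if (uid === stat.uid) allowed |= (perms >> 6) & 0o007; // add User bits
  return (allowed & mode) === mode;
}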

Cryptor module is not asynchronous

Following issue #18, the cryptor module utilises node's 'crypto' module for AES ciphering. But the 'crypto' module only offers a sync API.

Cryptor tries to expose an async method by wrapping sync calls in process.nextTick(), but this, as discussed in #18, does not actually make it async. It will still block the event loop.

There is this library:
https://www.npmjs.com/package/@ronomon/crypto-async

But even it is not truly async. It uses a C++ addon, but on the main thread. So once again, it will block the event loop. It also does not support AES-GCM, which EFS uses.

There is possible scope to develop a new, truly async crypto (or at least AES) library utilising node's thread pool, like the linked library mentions in its 'some new ideas' section.

As for now, since there seems to be no suitable solution for an async crypto lib, EFS will continue using the built-in sync crypto library.

Benchmark & Load Test EncryptedFS

Right now we only have benchmarks for crypto operations and the DB operations.

To properly assess what the optimal block size is (probably more than 4 KiB, but definitely less than 1 MiB), we have to do a proper load test on the EFS instance.

This requires performing a sequential set of operations vs parallel set of operations.

With the tests done in MatrixAI/Polykey#209, we can see that due to the speed of encryption/decryption, using a worker thread will always be slower, because worker calls have a ~1.5 ms overhead.

The speedup can only happen when there are multiple worker threads being used, so when there are multiple parallel operations occurring, either through batching and/or multiple EFS reads/writes.

Benny is not the right tool to do load testing, but we may be able to adapt it for proper load testing.

Otherwise we can use a better load testing tool like artillery, but it needs to be easy to integrate, and not go through the network. This is not an HTTP server. It's a library.

Remember: as long as the main thread is doing other things, and the work is more expensive than the call overhead divided by the number of workers, then we should have a profitable use of worker threads.
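
As a rough worked example using the numbers above: with a 1.5 ms call overhead and 4 workers running in parallel, the amortised overhead is 1.5 / 4 = 0.375 ms per call, so each offloaded unit of work needs to cost more than about 0.375 ms of main-thread time before the workers pay for themselves.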

Additional Context

Tasks

  1. Investigate how to use benny to create
  2. Investigate other load testing tools like benny
  3. Write a load test involving EFS, random read/write operations vs sequential read/write operations
  4. Consider how these benchmarks interact with PK usage (with vault committing protocol)
  5. Add an additional test for using WorkerManager inside EncryptedFS

Storage and management of encryption keys

EFS uses symmetric keys for encryption. To decrypt a file, you need the same symmetric key that was used for encryption. EFS should provide some functionality for the generation, storage, and retrieval of keys. A separate KeyManager class can be made to do this.

Its functions would be:

Key Generation:

PBKDF2 will be used if a symmetric key is to be generated. It is highly recommended that a salt be used, especially for low-entropy, dictionary-based passwords. We would also need to store this salt somewhere.

Key Storage:

There needs to be a way to store the keys as well, if the user wishes to persist the key on disk, plus a method to simply retrieve the key as a Buffer. Perhaps there should also be a warning message printed to alert the user that the key should be protected with asymmetric crypto when writing it to disk.

Key/Salt Retrieval

From disk and into an in-memory buffer.

Another issue is where to store all these artefacts. As of now they can be stored in ~/.efs/ by default unless a path is specified. There is also no restriction on using only one key, i.e. each EFS instance can be instantiated with a different key. They may all have to share the ~/.efs/ space, so subdirectories may be needed for each key/profile.

This way, EFS is simply responsible for taking a key as a Buffer in its constructor, which the KeyManager will provide either by loading it from disk or by generating it from a passphrase/salt pair; EFS is then free to go about its business.
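
A minimal sketch of the proposed KeyManager surface, using the existing generateKeyFromPass utility; the class, its methods and the file layout are illustrative, not an existing API:

import { promises as fs } from 'fs';
import { utils } from 'encryptedfs';

class KeyManager {
  constructor(protected keyPath: string) {}

  async generate(password: string): Promise<Buffer> {
    const [key, salt] = await utils.generateKeyFromPass(password);
    await fs.writeFile(`${this.keyPath}.salt`, salt);
    // Warning: a raw symmetric key written to disk should be protected,
    // e.g. wrapped with asymmetric crypto
    await fs.writeFile(this.keyPath, key);
    return key;
  }

  async load(): Promise<Buffer> {
    // retrieve the key from disk into an in-memory buffer
    return await fs.readFile(this.keyPath);
  }
}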

Streams don't seem to emit the close event

Describe the bug

Some testing from #71 shows that streams are not emitting the close event after ending.

Because streams have auto close enabled, after the finish event, a close event should be emitted.

import fs from 'fs';

const s = fs.createWriteStream('./tmp/s');
s.write('abc');
s.write('abc');
s.on('finish', () => { console.log('emitted finish'); });
s.on('close', () => { console.log('emitted close'); });
s.end();

In the above, a regular FS stream does end up emitting the close event.

However in our streams, this is not occurring. Not sure why.

It seems calling destroy() does not result in a close event.

This may be an upstream readable-stream bug. But the latest readable-stream is significantly different. We can try upgrading.

New tests should be written for these streams to check for the close event.

To Reproduce

  1. Do the above but with EFS

Expected behavior

Should say:

emitted finish
emitted close
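
A sketch of what such a test could look like in jest (createEfs is an illustrative helper, not an existing one):

declare function createEfs(): Promise<any>;

test('write streams emit close after end', async () => {
  const efs = await createEfs();
  const stream = efs.createWriteStream('file');
  const closed = new Promise<void>((resolve) => stream.on('close', resolve));
  stream.write('abc');
  stream.end();
  await closed; // times out while the bug is present
  await efs.stop();
});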

Unit testing the EFS API

Each method exposed by EncryptedFS should be unit tested to ensure the expected behaviour of the method is met. This allows all changes made to the code base to be tested against the test suite to ensure there are no breaking changes in the updates.

As of now, the unit tests are constructed using ava.

Decryption yields plaintext with trailing zero padding

When you do a write in efs and the last block's size does not equal the block size, efs will zero-pad to fill the remaining space in the block. Since the original file size is not stored anywhere, it is lost once the file is encrypted. efs cannot tell the difference between trailing zeros that were part of the plaintext and trailing zeros that were added to fill the block. So as of now, when you do a readFileSync() there will be extraneous zeros in the read buffer if the original plaintext's last block was not block-aligned.

import EFS from '../../lib/EncryptedFS.js';
import fs from 'fs';

const efs = new EFS({genKey: true, keyPass: Buffer.from('very password')});
const writeBuf = Buffer.allocUnsafe(10).fill(0x11);
let readBuf = Buffer.allocUnsafe(20).fill(0xff);
const efsFd = efs.openSync('sandbox/tmp/testTrailingZerosEFS.txt', 'w+');
efs.writeSync(efsFd, writeBuf, 0, writeBuf.length, 0);
efs.readSync(efsFd, readBuf, 0, readBuf.length, 0);
console.log(readBuf.toString('hex'));
// 11111111110000000000

const fd = fs.openSync('sandbox/tmp/testTrailingZeros.txt', 'w+');
fs.writeSync(fd, writeBuf, 0, writeBuf.length, 0);
readBuf = Buffer.allocUnsafe(20).fill(0xff);
fs.readSync(fd, readBuf, 0, readBuf.length, 0);
console.log(readBuf.toString('hex'));
// 1111111111ffffffffff

Preserving file metadata when persisting files

Metadata information for a plaintext file is lost when you persist it. For example, when you write a file, efs will zero-pad the last block so that it is block-aligned. When you decrypt the file, efs does not know whether the zero padding was part of the original file or added in by efs. Since the file size is never stored, this information is lost. This leads to #7. Storing the metadata becomes even more important once efs starts to provide confidentiality of the metadata. We need some way of recovering the original metadata. This could be stored in a header of the ciphertext file; the exact format still needs to be decided. But I imagine this would be prepended to the plaintext before it is ciphered and then persisted.
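
A minimal sketch of one possible approach, prepending an 8-byte size header to the plaintext before encryption so padding can be trimmed on decryption (the header format is an assumption, not a decided design):

// Prepend the plaintext size so trailing zero padding can be trimmed later
function addSizeHeader(plaintext: Buffer): Buffer {
  const header = Buffer.alloc(8);
  header.writeBigUInt64BE(BigInt(plaintext.length));
  return Buffer.concat([header, plaintext]);
}

// After decryption, use the stored size to drop the padding
function stripSizeHeader(decrypted: Buffer): Buffer {
  const size = Number(decrypted.readBigUInt64BE(0));
  return decrypted.subarray(8, 8 + size);
}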

Identify and Eliminate Unscalable Operations with Lazy GC

Specification

Some operations are "unscalable". For example to delete a file, one has to iterate over the entire file blocks and batch up all the delete operations and then execute them. This is due to the structure of how we are storing each file block as a separate key value entry in a sublevel.

If the file is large, this can take a long time. Furthermore deletion of files should be a lot faster than this. Real filesystems do not need to read the entire file just to delete it.

In order to solve this problem, one way is to use lazy garbage collection. When deleting a file, it is marked as no longer accessible. This is already possible since file accessibility is determined by hardlinks, which are persisted in the directory inodes.

However what is going to actually delete the key value entries? Real filesystems can simply mark filespace as reusable. Unfortunately this is not possible with a key value structure, and it would just get quite complex.

Therefore, we may instead have a garbage collection process that runs as a stream/loop and quietly deletes unused entries in the background continuously as the EFS is being used. Right now we have something similar to this in order to remove zombie inodes in our gc sublevel.

Recommend researching the garbage collection techniques like in https://github.com/kenfox/gc-viz and seeing how we can apply it.

Deleting a file isn't the only case where this can occur. I'm sure there may be other operations that are unscalable. Identify all of these in this issue.

Additional context

Tasks

  1. Identify all unscalable operations and whether they apply to this situation
  2. Research the garbage collection techniques above
  3. Integrate the technique above in a lazy way so that it doesn't hog the CPU or block other operations. Because JS async contexts cannot be interrupted, they must yield; the implementation must be as asynchronous as possible and coroutine/cooperative-multitasking friendly, perhaps by using setTimeout yields (implemented as sleep promises), as sketched below
  4. Consider whether it would be possible to multi-thread the leveldb usage in worker threads and end up with parallel lazy GC. Note that leveldb is currently not "multi-process" safe, but there is a multilevel adapter for multiple processes; experiment with worker threads, as it uses threads and not processes
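
A minimal sketch of task 3's cooperative GC loop, with illustrative nextZombieInode/deleteInodeEntries helpers standing in for the gc sublevel:

declare function nextZombieInode(): Promise<number | undefined>;
declare function deleteInodeEntries(ino: number): Promise<void>;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function gcLoop(signal: { stopped: boolean }): Promise<void> {
  while (!signal.stopped) {
    const ino = await nextZombieInode();
    if (ino == null) {
      await sleep(1000); // nothing to collect; back off
      continue;
    }
    await deleteInodeEntries(ino);
    await sleep(0); // yield so other async contexts can run
  }
}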

Add in `rm` method

Specification

The rm method takes over some of the functionality of rmdir. This rm method was introduced in Node 14.14. I believe some of our uses of EFS may start using rm instead of rmdir. It's probably a good idea to support rm as well.

Additional context

  • MatrixAI/Polykey#266 (comment) - was confused over why the mkdirExists didn't work, and it's because the FileSystem type in PK is not fulfilled by EFS atm due to the lack of rm.

Tasks

  1. Investigate the docs for rm and fsPromises.rm in NodeJS https://nodejs.org/api/fs.html
  2. Add in the rm method
  3. Keep the old options available in rmdir even though they are deprecated in Node FS.
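
A minimal sketch of rm in terms of existing fs-style EFS calls (stat/unlink/rmdir), following the Node fs.promises.rm options; error handling is simplified, and force here suppresses all errors rather than just missing paths:

import type { EncryptedFS } from 'encryptedfs';

async function rm(
  efs: EncryptedFS,
  path: string,
  options: { force?: boolean; recursive?: boolean } = {},
): Promise<void> {
  try {
    const stat = await efs.stat(path);
    if (stat.isDirectory()) {
      await efs.rmdir(path, { recursive: options.recursive ?? false });
    } else {
      await efs.unlink(path);
    }
  } catch (e) {
    if (options.force) return; // simplified: real rm only ignores missing paths
    throw e;
  }
}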

Implement True Snapshot Isolation for LevelDB

Is your feature request related to a problem? Please describe.

Proper MVCC transactions make use of a snapshot that represents the state of the database. As mutations (put & delete) build up, they apply to the snapshot. This makes it easier to build up a transaction by composing procedures/functions that mutate a transaction object. It means you get read-after-write consistency within the snapshot. Consider an operation that depends on the state of the database, say a counter increment: if a prior mutation in the transaction already incremented the same counter, it would be incoherent for a subsequent mutation to add 1 to the counter thinking the counter hasn't already been incremented.

Right now leveldb already supports snapshots natively. However it's not exposed via the JS wrapper. There are pre-existing issues.

If we could have snapshot isolated transactions, it would simplify some of our algorithms here for inodes especially since we have counter increment operations that result from linking and unlinking the inodes.

Describe the solution you'd like

Have a snapshot ability for the leveldb that we can expose through our DB abstraction. Similar to the python library of leveldb: https://plyvel.readthedocs.io/en/latest/api.html#snapshot

Note that a snapshot by itself is not sufficient to provide snapshot-isolated transactions. A combination of a "mutating" snapshot and the batching object which can overlay changes on top of the leveldb database can produce snapshot-isolated transactions.

This would mean an API like:

// it's very similar to https://github.com/Level/level#dbbatch-chained-form
const t = await this.db.transaction();
t.put(...);
t.get(...); // get will return the value determined by the snapshot (with overlaid in-memory values)
t.del(...);
t.commit();

In fact I suspect it would be possible to just extend the Batch object to have this.
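
A minimal sketch of that overlay idea; the Transaction class and snapshotGet callback are illustrative, not the DB abstraction's actual API:

class Transaction {
  // in-memory overlay of puts and deletes (undefined = tombstone)
  protected overlay: Map<string, Buffer | undefined> = new Map();

  constructor(
    protected snapshotGet: (key: string) => Promise<Buffer | undefined>,
  ) {}

  put(key: string, value: Buffer): void {
    this.overlay.set(key, value);
  }

  del(key: string): void {
    this.overlay.set(key, undefined);
  }

  async get(key: string): Promise<Buffer | undefined> {
    // read-after-write: consult the overlay first, then the snapshot
    if (this.overlay.has(key)) return this.overlay.get(key);
    return await this.snapshotGet(key);
  }
}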

Additional context

Adapt EFS to the File System Interface

We should adapt EFS to the File System Interface.

This is pertinent to making EFS work with isomorphic-git in js-polykey. All of our current methods work with promises but are exposed on the main object, whereas the fs interface describes them as being part of a promises API exposed at fs.promises.

Another thing is lstat does not work properly yet as it only propagates the upperDir method and it might not be loaded. This should be changed to propagating the lowerDir method.

related to MatrixAI/Polykey#43 (comment)

Encryptfs Implementation

Just some comments on the current implementation.

This should be moved to a js-encryptfs eventually.

It is sufficient to use import fs from 'fs';. This is because ES6 modules have arrived in NodeJS proper. The import { default as fs } from 'fs'; is superfluous; it does nothing. We only do this when we want to rename the default to something else when exporting. We also use default when we are in a CJS environment attempting to use an ES6 module's default export.

Flow has a library of default type definitions corresponding to Node JS stuff: https://github.com/facebook/flow/blob/master/lib/node.js#L784

So it should be possible to type fs objects via a: fs.

No need to "create new instance of fs". Node fs is not a class, it's just an object. A simple = fs is enough.

ES6 class methods don't need the function keyword.

So the EncryptedFS upper directory is already in memory, and async is not needed there. But the lower directory has to be async, so the bridging between the two is interesting. My understanding is that we have a "proxy" architecture, where reads and writes go directly in-memory to the in-memory vfs, while operations going to the lower directory would have to do so asynchronously. If we did that synchronously, we'd block the main thread. At the same time, the crypto itself may take time, so async may be better if we eventually need to multithread it as well.

So what should be the API exposed from EncryptedFS? Well, if it has to satisfy the fs API, you actually need both sync and async APIs. This was done in VFS by doing sync first and wrapping sync as async. You cannot wrap async back into sync; async is infectious. If we try to do this in EncryptedFS, it would be possible when interacting with fs, but I think it might be slow. Instead we definitely want to use the native async fs API when possible. So I think this means we would have separate codepaths for async and sync.

Another problem is whether the async and sync methods of EFS would each use the corresponding sync or async methods in VFS. Now, async in VFS is a simulated wrapper around the sync version; it's still sync. In most cases it does nothing but add a bit of async overhead. But in some cases, such as streaming, it makes sense to use async. I think for most cases we just use the synchronous version, and then in select places we can use the async method from VFS.

Make sure to check the VFS tests to see what we expect from a lot of the writes and read IO ops. Some have interesting behaviour regarding permissions and file descriptors and also the types they expect.

Benchmark the EFS

Currently the benchmarks don't actually bench the EFS itself.

However it would be dominated by the speed of the crypto and DB. DB already has its own benchmarks, so crypto benchmarks should be removed once crypto is abstracted out when we rework the crypto with PK.

Tasks

lstat/lstatSync should propagate lowerDir

For now, the lstat functions should just propagate the lowerDir method. Right now they propagate the upperDir and that is not working well with isomorphic-git on js-polykey.

Until we can get the upper and lower directories agreeing on file stats, it should stay this way.

Refactoring EFS & Align coding standards with js-polykey

Specification

  • The js-encryptedfs is a core library of js-polykey, its coding standards should be aligned with how we are developing things in js-polykey
  • This means the class structure
  • Ensuring that we are properly using async await in most cases, and keeping the callback style of the original methods
  • Testing
  • Linting

Tasks

  1. Make sure all methods in EncryptedFS use public or protected; no need to use private
  2. Helper methods should probably be moved into a separate utilities function. There are a lot of helper methods.
  3. Provide a promise interface under fs.promises. So you can do efs.promises.f where f is an awaitable version of the functions.

Addressed https://gitlab.com/MatrixAI/Engineering/Polykey/js-encryptedfs/-/merge_requests/46

NPM releases should create tags in the repo

Furthermore we can integrate the pushing to the CI/CD pipeline.

However I'm not sure yet, since not every commit results in a release. It should activate only when we decide there should be a release.

We can leave it for manual execution for now.

Either way, I'm not sure if you have pushed up an npm release, because I don't see any git tags. Note that you must do git push --tags to do it.

Should investigate how to do this automatically with the pipeline, Gitlab CI does have manual pipelines as well I remember.

Generating a salt for key derivation

Should we be generating a salt for our key derivation in Crypto?

We could do it by generating some random bytes:
const salt = crypto.randomBytes(128).toString('base64')
I am not sure how one would store this though. We should also consider having multiple salts, as well as storing the number of attempts. Or do we leave it up to the user to provide the salts?

Structured Change Detection for Mutation Events and Schema Compliance

Specification

Whilst considering vault schemas, it became apparent that it would be extremely beneficial to implement some kind of filesystem event-watching API integrated into the EFS, such that we can track changes before and after they're made directly to the files in the EFS. That is, before and after changes are made to files, we can generate an accumulating list of these changes. Then, we can provide hooks for these pre-edit and post-edit changes.

This would allow us to solve 3 problems in one:

  • enforce the vault schema before a file change is made (i.e. pre-edit vs. having to do a post-edit check)
  • easier automatic commit message generator
  • accurate elimination of dirty commits

The latter 2 problems are currently being solved by a hacky, post-edit solution through recursive scans over the EFS (see the top-level comment from @scottmmorris here).

Additional context

Some further discussion of pre-edit and post-edit from the vaults refactoring MR:

Tasks

  1. ...
  2. ...
  3. ...

Confidentiality of filename and metadata

The filename and a file's metadata (size, date modified, permission) should be encrypted. This is because these entities can leak information about the file, which is meant to be secret, even though the file content is encrypted.

I think to start with, just the filename can be encrypted, we can deal with encrypted metadata after.

Both CryFS and EncFS encrypt filenames; however, only CryFS encrypts metadata as well. How they accomplish this needs to be revisited.

Crypto - initVector Variable Length

If an initVector is not provided to the Crypto class upon initialisation, a random 16-byte vector is generated. Should this be variable length? Plausibly one could assume this functionality is already available to users who specify their own initVector, but there are methods that cannot be changed. My suggestion is to use the length of the existing initVector to drive any further creation of initVectors; this lets the user control the initVector length.
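
A minimal sketch of that suggestion (the function name is illustrative):

import { randomBytes } from 'crypto';

// Derive the length of any further initVectors from the user-provided one
function nextInitVector(existingIv: Buffer): Buffer {
  return randomBytes(existingIv.length);
}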

Generated docs should be viewable directly from Github

There are 2 ways to do this.

Do it the same way as js-virtualfs: we provide a link for raw rendering of the GitHub repo. See the js-virtualfs README.md for an example, since the docs entry point is index.html.

Or use gh-pages to provide a "pages" version of this repo.

Github has since improved their pages feature. So I think we can do the latter: https://help.github.com/en/github/working-with-github-pages/configuring-a-publishing-source-for-your-github-pages-site (it appears it can do it directly from the docs directory).

Much better than any man pages!

Streams are old and legacy and should be refactored to v16 LTS

Specification

The streams module is still based on readable-stream 3.6.0; it was ported from VFS, but the implementation has gone through changes that haven't been fully verified. We should upgrade readable-stream to 4.x.x https://www.npmjs.com/package/readable-stream which is cut from nodejs v18. Then refactor our streams abstraction, and also write tests to verify all the functionality.

Of particular note is the fact that our _destroy doesn't look right. There are a lot of callbacks and errors being threaded around.

Additional context

  • #74 had to deal with the streams not properly asynchronously closing the streams and file descriptors

Tasks

  1. Upgrade to 4.x.x of readable-stream
  2. Review https://nodejs.org/api/stream.html#implementing-a-writable-stream and compare with our current implementation
  3. Consider reviewing the source of readable stream for the default implementation, as the main thing is the opening and closing of our EFS file descriptors
