layr-team / layr

A decentralized (p2p) file storage system built atop the Kademlia DHT that enforces data integrity, privacy, and availability through sharding, proofs of retrievability, redundancy, and encryption, with a smart-contract-powered incentive scheme

JavaScript 100.00%
tcp file-storage p2p nodejs streams kademlia dht decentralized-storage decentralized distributed-systems

layr's Issues

Retrieve File Should Test if a Target Node is Online Before Sending it a Request

Currently, retrieve file only handles the happy path: it assumes that the host node of the first shard copy it locates is online.

We should implement an alive-test using a simple ping, and only send a request to that node if the ping succeeds.

Further, the check that retrieve file currently uses to make sure a BatNode is online is not truly testing liveness: it only tests that the BatNode server was created at some point in the past.
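
A minimal sketch of such an alive-test over TCP, assuming the host's address is available as { host, port }; sendRetrieveRequest and tryNextCopy are hypothetical placeholders:

const net = require('net');

function pingNode({ host, port }, onAlive, onDead) {
  const socket = net.createConnection({ host, port });
  socket.setTimeout(2000); // treat two seconds of silence as "offline"

  socket.on('connect', () => {
    socket.end();
    onAlive();
  });
  socket.on('timeout', () => { socket.destroy(); onDead(); });
  socket.on('error', onDead);
}

// Only issue the retrieve request if the ping succeeds
pingNode(hostNode, () => sendRetrieveRequest(hostNode), () => tryNextCopy());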

Use import statements to selectively include functionality

The idea is that a module exports a larger object, and consumers import only the functions they need. For example, import a single function from fileUtils instead of importing the whole fileUtils module that returns the entire object of functions, as we currently do. Modern libraries have this setup so that consumers can do things like import { PropTypes } from 'react', etc.
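
A sketch of the refactor, assuming fileUtils is converted to named exports (the function names here are hypothetical):

// fileUtils.js
export function sliceFile(filePath, shardSize) { /* ... */ }
export function encryptFile(filePath, key) { /* ... */ }

// batnode.js — pull in only what is needed
import { sliceFile } from './fileUtils';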

Automatically clean up shards after uploading/downloading

Since we use the same "shards" folder to temporarily store file pieces when a client uploads and downloads files, it would be more convenient if our system automatically cleaned up for clients once the uploading/downloading process finishes.
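
A minimal sketch, assuming the temporary pieces live in ./shards and that cleanup runs from an upload/download completion callback:

const fs = require('fs');
const path = require('path');

function cleanUpShards(shardsDir = './shards') {
  fs.readdir(shardsDir, (err, files) => {
    if (err) return; // nothing to clean up
    files.forEach((file) => fs.unlink(path.join(shardsDir, file), () => {}));
  });
}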

Daemonize CLI interface

This is more of a nice-to-have, but since our primary interface is a CLI, it would make the UX for issuing commands much nicer. Here are some links I looked up on the subject.

Daemonizing a process (not running it forever)
https://www.npmjs.com/package/daemon - This is what Kadence uses https://github.com/kadence/kadence/blob/master/bin/kadence.js#L20
https://stackoverflow.com/a/12214993/3950092 - node-daemonize2
https://github.com/niegowski/node-daemonize2
https://stackoverflow.com/questions/10428684/how-to-implement-console-commands-while-server-is-running-in-node-js - using process.stdin or prompt library

Running a process forever (related, but probably not something we want to do)
https://www.digitalocean.com/community/tutorials/how-to-set-up-a-node-js-application-for-production-on-ubuntu-16-04#install-pm2
https://stackoverflow.com/a/4988180/3950092 - simplest way to send process to background
https://stackoverflow.com/questions/4018154/how-do-i-run-a-node-js-app-as-a-background-service
https://github.com/Storj/storjshare-daemon

https://github.com/kadence/kadence/blob/master/bin/kadence.js#L126-L135 - Kadence code around stopping a process. PM2 also has docs around graceful stops
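
A rough sketch using the daemonize2 package linked above (the file names here are hypothetical; see its README for the exact options):

const daemon = require('daemonize2').setup({
  main: 'batnode.js',   // entry point to run in the background
  name: 'layr',
  pidfile: 'layr.pid'
});

switch (process.argv[2]) {
  case 'start': daemon.start(); break;
  case 'stop':  daemon.stop();  break;
  default: console.log('Usage: node daemon.js [start|stop]');
}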

Remove readFile method from BatNode

Reading a file is the Node file system (fs module)'s responsibility, and we aren't adding any functionality by putting this code in batnode.js.

Similar methods that could be handled by other modules:

  • connect
  • writeFile

DRY up auditing code

auditShard should be able to make use of getHostNode to DRY up some of the code. It didn't work as planned initially, but we should revisit it later. #14 (comment)

Export shared constants for unchanged values to a module

Example: our default port and host values for the 2nd BatNode server communicating with the command line are the same and never change. It would be convenient to define such constants once in a module. Extracting them into a module allows us to look up these unchanged values across the project, and defining them in one place also prevents typo errors.
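
A sketch of such a module (the values shown are illustrative, not the project's actual defaults):

// constants.js
module.exports = {
  DEFAULT_HOST: 'localhost',
  DEFAULT_PORT: 1900,
  CLI_SERVER_PORT: 1800
};

// elsewhere
const { DEFAULT_HOST, DEFAULT_PORT } = require('./constants');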

Improve upload process with async just-in-time process

Right now our upload process works, but it is inefficient for large files.

Below is our current process:

  1. Encrypt the file
  2. Once encryption finishes, divide it into k shards
  3. Once all the shards are finished, distribute them to the network

This slows down the upload process for large files, since sharding a big file takes a long time.

A more efficient way would be an "async" one (see the sketch after this list).
Steps 1 and 2 remain unchanged.
3. Once the first shard is finished, we can distribute it to the network while the client is still writing the next shard.
4. There is a very small chance that uploading the first shard will be faster than the client writing the 2nd shard; if that happens, we can check whether the 2nd shard is complete before distributing it.
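
A minimal sketch of the just-in-time idea, assuming a writeShard helper that produces one shard at a time and a distribute function that sends a shard to the network (both hypothetical):

function uploadJustInTime(encryptedFilePath, shardCount, idx = 0) {
  if (idx >= shardCount) return;
  writeShard(encryptedFilePath, idx, (shardPath) => {
    distribute(shardPath); // ship shard idx over the network...
    uploadJustInTime(encryptedFilePath, shardCount, idx + 1); // ...while writing shard idx + 1
  });
}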

async/await pattern in audit

We were hoping to use the async/await pattern in audit to solve a problem where we need to asynchronously return data from the auditFile method, in order to pass the correct data object into the data event handler for the CLI to access. However, after a bit of research I'm a little skeptical that we can get async/await working quickly the way we were thinking. The problem is that auditFile executes a series of synchronous actions; those synchronous actions trigger events that have handlers, and these events (mostly the 'data' event) and their handlers then execute asynchronously. You'd have to do something differently to have all the methods called in the audit operation execute asynchronously and be tied to their event handlers.

The basic issue is that async/await works with functions that return promises. Therefore, I think what you would have to do to make this strategy work is make all audit-related methods promise-based, but I'm not sure I'd be able to pull that off soon. The way you'd start is to wrap the entire auditShardData method body in a returned promise (i.e. return new Promise((resolve, reject) => ...)) that resolves in the data event handler. Article 2 below shows a decent example of this. The problem then is that there are three or four other methods between there and the auditFile method we need to return data from, which would also have to be promisified.

Options to explore (see the sketch after the links below):

  • promisifying the audit-related methods
  • promise-based net.Socket wrapper libraries
  • custom events using an EventEmitter

Some related links:

  1. https://github.com/mkloubert/node-simple-socket - promise-based socket library; not popular at all, though
  2. https://techbrij.com/node-js-tcp-server-client-promisify - example of promisifying a client send method
  3. https://www.ibm.com/developerworks/community/blogs/binhn/entry/creating_a_tls_tunnel_with_node_js_and_promise?lang=en - just another example of net.connect w/ promises
  4. https://stackoverflow.com/questions/40352682/promisify-event-handlers-and-tiemout-in-nodejs - simple example of event based promise
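
A sketch of where you would start, per the above: wrap the socket exchange in a returned promise that resolves in the data event handler (the message shape here is hypothetical):

function auditShardData(client, message) {
  return new Promise((resolve, reject) => {
    client.write(JSON.stringify(message));
    client.once('data', (data) => resolve(JSON.parse(data)));
    client.once('error', reject);
  });
}

// auditFile could then await each audit in sequence
async function auditFile(client, shardIds) {
  const results = {};
  for (const shardId of shardIds) {
    results[shardId] = await auditShardData(client, { messageType: 'AUDIT_FILE', shardId });
  }
  return results;
}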

Add guard cases before generating .env for users

Currently there are edge cases that we need to take care of when generating a new .env file:

  • The user already HAS a Stellar account but no PRIVATE KEY.
  • The user created an empty .env.
    With our current code in master, since the user self-created the .env file and already added a Stellar account, the system will also skip generating the PRIVATE KEY, so we should handle these cases differently (see the sketch below).

I made the changes on a new branch, test-env, to fix this bug.
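
A minimal sketch of the guard cases, assuming dotenv-style parsing and that the keys are named STELLAR_ACCOUNT_ID and PRIVATE_KEY (both names are assumptions):

const fs = require('fs');
const dotenv = require('dotenv');

function ensureEnv(envPath = './.env') {
  const env = fs.existsSync(envPath)
    ? dotenv.parse(fs.readFileSync(envPath))
    : {};

  // Check each value independently: a self-created .env may have an account
  // but no private key, or may be entirely empty.
  if (!env.STELLAR_ACCOUNT_ID) { /* generate and append the account */ }
  if (!env.PRIVATE_KEY) { /* generate and append the private key */ }
}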

Have hosted folders created automatically if they don't already exist

If a user tries to upload a file via the command line or manually and there is no existing hosted directory on the server node(s), then no files will be written to that node.

Additionally, the error isn't handled, and it causes the server without the hosted folder to crash. Creating the folder if it doesn't exist means we won't have to handle those errors at all.
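
A sketch of creating the folder on demand before writing a shard (the ./hosted path is an assumption):

const fs = require('fs');

function ensureHostedDir(dir = './hosted') {
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir); // or fs.mkdirSync(dir, { recursive: true }) on Node >= 10.12
  }
}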

Improve on flat file storage

Data hosts currently store files in a one-dimensional folder. As the number of files a data host stores increases, lookup will take longer and longer, since lookup time grows linearly with the number of items in the hosted folder.
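
A common fix is to fan shards out into subdirectories keyed by a prefix of the shard ID, so each directory stays small. A sketch, not the current layout:

const path = require('path');

function shardPath(shardId, baseDir = './hosted') {
  // e.g. shard "a3f9..." is stored at ./hosted/a3/f9/a3f9...
  return path.join(baseDir, shardId.slice(0, 2), shardId.slice(2, 4), shardId);
}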

Random Challenges and Their Results are Stored and Sent to Host

This makes the PoR answer unpredictable and variable. Challenges cannot be reused.

Benefits of this method:

  1. Higher degree of confidence in audit accuracy: if the host passes, we can be more confident that they have the file.

Cons of this method:

  1. Introduces an O(number of audits) space complexity for the data owner
  2. Places an upper bound on the number of audits a data owner can do (they will have to download and re-upload the file if they run out of audits, which is more costly)
  3. It is computationally costly for the data host, who has to process the entirety of the file data for each audit
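
A sketch of precomputing challenges before upload, assuming SHA-256 proofs over salt + shard content (the structure is illustrative):

const crypto = require('crypto');

function generateChallenges(shardData, count) {
  return Array.from({ length: count }, () => {
    const salt = crypto.randomBytes(16).toString('hex');
    const expected = crypto.createHash('sha256')
      .update(salt)
      .update(shardData)
      .digest('hex');
    return { salt, expected }; // the owner stores both; only the salt is sent to the host
  });
}

This makes the trade-offs above concrete: the owner stores one { salt, expected } pair per audit, each pair is single-use, and the host must hash the entire shard to answer each challenge.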

Users can set the amount of storage they want to offer to the network

  1. Users can set the amount of storage their Batnode offers
  2. Batnode tracks max storage (which is set by the user)
  3. On Batnode initialization, current storage is set as a property to Batnode object
  4. When a user tries to store a file on a host candidate, the candidate's used storage is calculated and compared: shard size <= max storage - used storage
  5. Optimization challenge: checking available storage without reading each file and adding up the data it uses
  6. Optimization challenge: if two nodes contact a host node at the same time, asking if it has enough storage for their shards, the host node will say yes to both because it has stored neither; but by saying yes to both, it agrees to store more data than it has the capacity to store. There needs to be an in-memory data structure of available storage that is updated immediately when a node agrees to store a shard, even before that shard has been stored (see the sketch after this list).
  • An edge case, though: this data structure may be rendered inaccurate if a shard the host agreed to store never makes it over the wire!
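
A sketch of the in-memory reservation idea from point 6 (names are hypothetical):

class StorageTracker {
  constructor(maxStorage) {
    this.maxStorage = maxStorage; // set by the user
    this.reserved = 0;            // bytes promised but possibly not yet on disk
  }

  tryReserve(shardSize, usedOnDisk) {
    if (shardSize <= this.maxStorage - usedOnDisk - this.reserved) {
      this.reserved += shardSize; // updated immediately, before the shard arrives
      return true;
    }
    return false;
  }

  release(shardSize) {
    // called when the shard lands on disk, or when it never makes it over the wire
    this.reserved -= shardSize;
  }
}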

Able to upload larger files with JSONStream but unable to download correctly

If we add the JSONStream library, we can upload large files without setting a very small size for each shard.

Branch I have been working on: https://github.com/WilfredTA/batnode_proto/tree/jsonstream

Previously, we couldn't download large files to the client, as we would experience data loss when trying to write large data into the client's shards folder with the current method:

issueRetrieveShardRequest(shardId, hostBatNode, options, finishCallback) {
  // ...
  client.on('data', (data) => {
    fs.writeFileSync(`./shards/${saveShardAs}`, data, 'utf8')
    // ...

The servers can read content quickly, but the client cannot write to the shards folder at the same speed during the download process.

For example, when the servers are ready to read the 2nd shard and send its content back to the client, the client is still trying to finish writing the content of the 1st shard. What happens is that the client stops writing the previous shard and starts writing the next one.

If we compare the downloaded shard sizes in the client's shards folder with the uploaded shard sizes in the hosting server's folder, we notice the downloaded shards on the client are smaller than the same shards in the hosting server's folder. We know that during the download process, the client prematurely stops writing a shard to disk before accepting the next shard request.

We currently fix this with a write stream and a setTimeout call:

let writeStream = fs.createWriteStream(fileDestination);
const completeFileSize = manifest.fileSize;
// set the divisor slightly below 16kb ~ 16384 (the default high water mark for read/write streams)
const waitTime = Math.floor(completeFileSize / 16000);

// use a "once" listener here instead of "on" in order to pipe only once
client.once('data', (data) => {
  writeStream.write(data);
  client.pipe(writeStream);

  if (distinctIdx < distinctShards.length - 1) {
    finishCallback();
  } else {
    setTimeout(function() { fileUtils.assembleShards(fileName, distinctShards); }, waitTime);
  }
});

In the future, it may be better to use an event- or async/await-based approach instead of calculating an estimated wait time here.
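
For example, a minimal sketch (not the current implementation) that waits for the write stream to flush instead of estimating a timeout; the variable names follow the snippet above:

client.pipe(writeStream);

writeStream.on('finish', () => {
  // all buffered data has been flushed to disk, so it is safe to assemble
  fileUtils.assembleShards(fileName, distinctShards);
});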

Stellar Smart Contract to Ensure Payment and File Storage Between Untrusted Parties

Our current shard transfer algorithm goes like this:

  1. A node with an ID close to the shard ID is found
  2. That node is pinged to make sure it's still alive
  3. If it's alive, initiate shard payment and transfer
  4. If it's not alive, remove node from contacts and re-search

The "shard payment and transfer" subroutine goes like this:

  1. Given a target node's address, ask for its Stellar account id
  2. Send a payment to that Stellar account id
  3. If the payment is successful, send the shard to the target node for storage

The problem with this is that the data owner cannot trust the target node to host their file. Nor can the data owner trust the inherent volatility of network connections. It is possible that a host node, upon receiving a payment, disconnects from the network. Finally, it is possible that the host node simply deletes the file right after receiving it, keeping the payment but freeing up storage.

We therefore need to "batch" file storage and payment for file storage such that the failure of one entails the failure of the other. To further prevent deletion of file storage immediately after receiving the file, the two nodes must agree on a duration for which the host will store the data owner's shard.

To ensure that this agreement is honored, the host node must be able to prove that it still has the file at the end of the agreed-upon duration. It must therefore pass a data availability and integrity audit immediately prior to receiving payment.

To ensure that the host node is actually paid by the data owner, an escrow account is set up with the funds to pay the host node at the end of the agreed-upon duration.

Edge cases:

  1. What if the host node is generally online and available, but happens to be offline at the time of the final audit that verifies that it is storing the data it agreed to store?
  2. What if the host node is offline for the entire duration, but then gets online immediately preceding the audit in order to get paid? They haven't really satisfied their end of the bargain in this case.

We can redefine the agreement to account for these edge cases. We can say that the host node agrees to host the data owner's file and also to be available a given percentage of the time. Every time the host node is audited between the time of initial data storage and the end of the agreed-upon hosting duration, the result of that audit is stored. At the end of the agreed-upon duration, the ratio of passed audits to total audits is calculated; if that ratio is >= the agreed-upon availability ratio, the host node is paid. Otherwise, it is not.
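
The payout rule at the end of the term reduces to something like this sketch (the threshold is whatever ratio the two nodes agreed on):

function shouldPayHost(auditResults, agreedAvailability /* e.g. 0.9 */) {
  const passed = auditResults.filter((result) => result === true).length;
  return passed / auditResults.length >= agreedAvailability;
}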

Edge cases of the new agreement:

  1. The host node cannot trust the data owner (who is also the auditor) to keep an honest record of the results of the audit. Depending on where these records are stored, it may be possible for the data owner to manipulate these records so that the data owner doesn't have to pay the host at the end of the agreement.

storage duration/agreement between hosts and users

To further incentivize hosts to share their storage with the network, having a storage duration agreement before uploading would be fairer to the hosts.

For example, our system could suggest a default storage duration of 3 months for each file. Before the storage duration expires, the system will notify users, who then need to decide whether they would like to extend it. If a user wants to extend, we need to verify the user's wallet and subtract the payment on the first day of the extension.

Use separate method for preparing audit data

Using something like the method below will clean up auditFile

prepareAuditData(shards, shaIds) {
  return shaIds.reduce((acc, shaId) => {
    acc[shaId] = {};

    shards[shaId].forEach((shardId) => {
      acc[shaId][shardId] = false;
    });

    return acc;
  }, {});
}

Set the payment amount based on file size instead of a fixed amount

Currently, we use a fixed amount of 10 for each transaction when the user downloads/uploads shards, no matter how big or small the piece of the file is. For example, an owner pays the same number of lumens for a 5MB piece of data as for a 1KB piece of data:

https://github.com/layr-team/batnode_proto/blob/20f947dea5a25ccbf43e36114979721b61044968/batnode.js#L51

While this works in our alpha phase, in a real-world situation we will need to calculate the amount for each shard/file based on its size to ensure fair usage.
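
A sketch of size-based pricing (the rate is illustrative, not a decided value):

const LUMENS_PER_MB = 0.1; // hypothetical rate

function paymentForShard(shardSizeInBytes) {
  const sizeInMB = shardSizeInBytes / (1024 * 1024);
  // Stellar's smallest unit is one stroop (0.0000001 XLM), so floor the price there
  return Math.max(0.0000001, sizeInMB * LUMENS_PER_MB);
}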

Retrieving a file from a secondary machine

Great project, I really learned a lot from your YouTube video (https://www.youtube.com/watch?v=oCS05QSQ-1k). What isn't clear to me is the following: the benefit of a centralized cloud service is that when I upload on machine A, I can come online on machine B, independent of whether A is online or not, and get what A uploaded previously. What I don't get with Layr is that it works with a manifest file. As you show in your demo, you need to hand the CLI a path to a manifest. If I upload something on machine A, how am I getting the necessary manifest on machine B to access the uploaded file?

Use constants for CLI messages

Messages like 'You can audit file to make sure file integrity' currently appear in two files, so importing constants seems like an easy way to edit them in both places at once. Community leaders in React/Redux also seem to like using constants in general, so we should probably consider using them in additional situations, such as for numbers.

Refactor processUpload function in fileUtils

We pass the callback parameter in processUpload down through about four additional methods. It would be nice if we could make this part of fileUtils a bit easier to read.

Remove main-thread-blocking I/O Operations

Some of our I/O operations are achieved with synchronous (thread-blocking) methods. This is a known anti-pattern in node (see here).

The reason we use synchronous actions in some places is to prevent another action from executing until the synchronous action has completed. This matches one of the use cases of the Async library, so this issue may be resolved when we refactor to Async-based asynchronous code rather than callback-based asynchronous code.
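
A sketch of one such swap, replacing a blocking read with its asynchronous counterpart while preserving ordering (filePath and doNextThing are placeholders):

const fs = require('fs');

// Before: const data = fs.readFileSync(filePath); doNextThing(data);
fs.readFile(filePath, (err, data) => {
  if (err) throw err;
  doNextThing(data); // still runs only after the read completes
});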

Optimize file streaming

Currently, space complexity for transferring data from one node to another scales linearly with the amount of data in the shard being transferred: peak memory usage (space complexity) = O(bytesInShard)

We can push peak memory usage down to constant space complexity by using something like Node.js's pipe function.

Instead of fs.readFile, we can pipe the data from a read stream into a TCP stream.
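
A minimal sketch of the pipe idea, which keeps only one chunk in memory at a time (the shard path and socket variable are assumptions):

const fs = require('fs');

fs.createReadStream(`./hosted/${shardId}`).pipe(socket);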

The JSONStream library we are using solves the problem of larger JSON objects being split in two when transferred over streams. It seems to do this by holding the JSON in memory and delaying the 'data' event on the JSONStream until it has received a full JSON object. That means JSONStream's peak memory usage also scales linearly with the size of the JSON object sent to it. I need to verify this suspicion against their source code, though.

The problem with piping smaller JSON objects that each contain a portion of the total shard data is that the shard data needs to be written in the order it was received, which is hard to manage when the data is written via event handlers.

Essentially what we need to do is write multiple chunks that are not received in order without storing all chunks in memory.
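
One way to do that is to tag each chunk with its byte offset and write at that position, so arrival order no longer matters. A sketch (the chunk framing is hypothetical):

const fs = require('fs');

const fd = fs.openSync(fileDestination, 'w');

function onChunk({ offset, data }) {
  const buf = Buffer.from(data);
  // the position argument places each chunk correctly regardless of arrival order
  fs.write(fd, buf, 0, buf.length, offset, (err) => { if (err) throw err; });
}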

Auditing should report the shard copy ID that failed and patching should remove it from the manifest

The only case in which an audit fails but the shard copy ID is still accessible is when the host node was offline at the time of the audit. In all other cases, it is in the data owner's best interest to completely forget about that shard copy ID forever.

Therefore, auditing should report the copy ID that failed and patching should remove that copy ID from the manifest.

In the future, we can move from two audit states (true or false) to three. The third state will handle the case in which the audited node is simply offline. That's a relatively easy change once the one above is made, since the node alive-test is a simple ping with an event handler: if the ping fails, set the result for that shard copy ID to the third state.
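
A sketch of the three states (the names are hypothetical):

const AUDIT_RESULT = {
  PASSED: 'passed',
  FAILED: 'failed',            // forget this shard copy ID and patch it out
  HOST_OFFLINE: 'host_offline' // the copy may still be intact; retry later
};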
