ssbc / ssb-db

A database of unforgeable append-only feeds, optimized for efficient replication for peer to peer protocols

Home Page: https://scuttlebot.io/

License: MIT License

JavaScript 100.00%

ssb-db's Introduction

ssb-db

A secret-stack plugin that stores valid secure-scuttlebutt messages in an append-only log.

Table of contents

What does it do? | Example | Concepts | API | Stability | License

What does it do?

ssb-db provides tools for dealing with unforgeable append-only message feeds. You can create a feed, post messages to that feed, verify a feed created by someone else, stream messages to and from feeds, and more (see API).

Unforgeable means that only the owner of a feed can modify that feed, as enforced by digital signing (see Security properties).

This property makes ssb-db useful for peer-to-peer applications. ssb-db also makes it easy to encrypt messages.

Example

In this example, we create a feed, post a signed message to it, then create a stream that reads from the feed. Note: ssb-server already includes ssb-db as a dependency, so the example here loads it as a plugin for secret-stack.

/**
 * create an ssb-db instance and add a message to it.
 */
var pull = require('pull-stream')

//create a secret-stack instance and add ssb-db, for persistence.
var createApp = require('secret-stack')({})
  .use(require('ssb-db'))


// create the db instance.
// Only one instance may be created at a time, due to OS locks on the port and database files.

var app = createApp(require('ssb-config'))

//your public key, the default key of this instance.

app.id

//or, called remotely

app.whoami(function (err, data) {
  console.log(data.id) //your id
})

// publish a message with the default identity
//  - publish appends a signed message to your key's chain.
//  - the `type` attribute is required.

app.publish({ type: 'post', text: 'My First Post!' }, function (err, msg) {
  // the message as it appears in the database:
  console.log(msg)

  // and its hash:
  console.log(msg.key)
})

// collect all the messages into an array, then call back once the stream ends
// https://github.com/pull-stream/pull-stream/blob/master/docs/sinks/collect.md
pull(
  app.createLogStream(),
  pull.collect(function (err, messagesArray) {
    console.log(messagesArray)
  })
)

// collect all messages for a particular keypair into an array, then call back once the stream ends
// https://github.com/pull-stream/pull-stream/blob/master/docs/sinks/collect.md
pull(
  app.createHistoryStream({id: app.id}),
  pull.collect(function (err, messagesArray) {
    console.log(messagesArray)
  })
)

Concepts

Building upon ssb-db requires understanding a few concepts that it uses to ensure the unforgeability of message feeds.

Identities

An identity is simply a public/private key pair.

Even though there is no worldwide store of identities, it's infeasible for anyone to forge your identity. Identities are binary strings, so not particularly human-readable.

Feeds

A feed is an append-only sequence of messages. Each feed is associated 1:1 with an identity. The feed is identified by its public key. This works because public keys are unique.

Since feeds are append-only, replication is simple: request all messages in the feed that are newer than the latest message you know about.

Note that append-only really means append-only: you cannot delete an existing message. If you want to enable entities to be deleted or modified in your data model, that can be implemented in a layer on top of ssb-db using delta encoding.

Messages

Each message contains:

  • A message object. This is the thing that the end user cares about. If the message is not encrypted, this is a plain object; if it is encrypted, this is an encrypted string.
  • A content-hash of the previous message. This prevents somebody with the private key from changing the feed history after publishing, as a newly-created message wouldn't match the "prev-hash" of later messages which were already replicated.
  • The signing public key.
  • A signature. This prevents malicious parties from writing fake messages to a stream.
  • A sequence number. This prevents a malicious party from making a copy of the feed that omits or reorders messages.

Since each message contains a reference to the previous message, a feed must be replicated in order, starting with the first message. This is the only way that the feed can be verified. A feed can be viewed in any order after it's been replicated.
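
For illustration, here is roughly the shape of a single signed message as stored in the log (the ids and signature below are truncated placeholders, not real values):

{
  previous: '%XphM...=.sha256',    // content-hash of the previous message (null for the first)
  author: '@EMovh...=.ed25519',    // the feed id, i.e. the signing public key
  sequence: 2,                     // position in the feed
  timestamp: 1449715308265,        // claimed (author-supplied) timestamp
  hash: 'sha256',
  content: { type: 'post', text: 'second post!' },
  signature: 'nkY4...=.sig.ed25519' // signature over all of the above
}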

Object ids

The text inside a message can refer to three types of ssb-db entities: messages, feeds, and blobs (i.e. attachments). Messages and blobs are referred to by their hashes, but a feed is referred to by its signing public key. Thus, a message within a feed can refer to another feed, or to a particular point within a feed.

Object ids begin with a sigil: @ for a feedId, % for a msgId, and & for a blobId.

Note that ssb-db does not include facilities for retrieving a blob given the hash.

Replication

It is possible to easily replicate data between two instances of ssb-db. First, they exchange maps of their newest data. Then, each one downloads all data newer than its newest data.

ssb-server is a tool that makes it easy to replicate multiple instances of ssb-db using a decentralized network.

Security properties

ssb-db maintains useful security properties even when it is connected to a malicious ssb-db database. This makes it ideal as a store for peer-to-peer applications.

Imagine that we want to read from a feed for which we know the identity, but we're connected to a malicious ssb-db instance. As long as the malicious database does not have the private key:

  • The malicious database cannot create a new feed with the same identifier
  • The malicious database cannot write new fake messages to the feed
  • The malicious database cannot reorder the messages in the feed
  • The malicious database cannot send us a new copy of the feed that omits messages from the middle
  • The malicious database can refuse to send us the feed, or only send us the first N messages in the feed
  • Messages may optionally be encrypted. See test/end-to-end.js.

API

require('ssb-db')

SecretStack.use(require('ssb-db')) => SecretStackApp

ssb-db is designed to act as a plugin within the SecretStack plugin framework. The main export is the plugin, which extends the SecretStack app with this plugin's functionality and API. ssb-db adds persistence to a SecretStack setup. Without other plugins, this instance will not have replication or querying. Loading ssb-db directly is useful for testing, but it's recommended to instead start from a plugin bundle such as ssb-server.

For legacy reasons, all the ssb-db methods are mounted on the top level object, so it's app.get instead of app.db.get as it would be with other ssb-* plugins.

In the API docs below, we'll just call it db.

db.get: async

db.get(id | seq | opts, cb) // cb(error, message)

Get a message by its hash-id.

  • If id is a message id, the message is returned.
  • If seq is provided, the message at that offset in the underlying flumelog is returned.
  • If opts is passed, the message id is taken from either opts.id or opts.key.
  • If opts.private = true the message will be decrypted if possible.
  • If opts.meta = true is set, or seq is used, the message will be in {key, value: msg, timestamp} format. Otherwise the raw message (without key and timestamp) is returned. This is for backwards compatibility reasons.

Given that most other APIs (such as createLogStream) return {key, value, timestamp} by default, it's recommended to use db.get({id: key, meta: true}, cb).

Note that the cb callback is called with 3 arguments: cb(err, msg, offset), where the 3rd argument is the offset position of that message in the log (flumelog-offset).
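
For example, to fetch a known message by its id and decrypt it if it's a private message addressed to us (key is a placeholder for a message id):

db.get({ id: key, meta: true, private: true }, function (err, msg, offset) {
  if (err) throw err
  // msg is in { key, value, timestamp } form because meta: true was passed
  console.log(msg.value.content)
})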

db.add: async

db.add(msg, cb) // cb(error, data)

Append a raw message to the local log. msg must be a valid, signed message. ssb-validate is used internally to validate messages.

db.publish: async

db.publish(content, cb) // cb(error, data)

Create a valid message with content with the default identity and append it to the local log. ssb-validate is used to construct a valid message.

This is the recommended method for publishing new messages, as it handles the tasks of correctly setting the message's timestamp, sequence number, previous-hash, and signature.

  • content (object): The content of the message.
    • .type (string): The object's type.

db.del: async

⚠ This could break your feed. Please don't run this unless you understand it.

Delete a message by its message key or a whole feed by its key. This only deletes the message from your local database, not the network, and could have unintended consequences if you try to delete a single message in the middle of a feed.

The intended use-cases are deleting all messages from a given feed, or deleting a single message from the tip of your feed if you're completely confident that it hasn't left your device.

// delete a single message
db.del(msg.key, (err, key) => {
  if (err) throw err
})
// delete all messages by this author
db.del(msg.value.author, (err, key) => {
  if (err) throw err
})

db.whoami: async

db.whoami(cb) // cb(error, {"id": FeedID })

Get information about the current ssb-server user.

db.createLogStream: source

db.createLogStream({ live, old, gt, gte, lt, lte, reverse, keys, values, limit, fillCache, keyEncoding, valueEncoding, raw }): PullSource

Create a stream of the messages that have been written to this instance in the order they arrived. This is mainly intended for building views.

  • live (boolean) Keep the stream open and emit new messages as they are received. Defaults to false.
  • old (boolean) If false the output will not include the old data. If live and old are both false, an error is thrown. Defaults to true.
  • gt (greater than), gte (greater than or equal) (timestamp) Define the lower bound of the range to be streamed. Only records where the key is greater than (or equal to) this option will be included in the range. When reverse=true the order will be reversed, but the records streamed will be the same.
  • lt (less than), lte (less than or equal) (timestamp) Define the higher bound of the range to be streamed. Only key/value pairs where the key is less than (or equal to) this option will be included in the range. When reverse=true the order will be reversed, but the records streamed will be the same.
  • reverse (boolean) Set true and the stream output will be reversed. Beware that due to the way LevelDB works, a reverse seek will be slower than a forward seek. Defaults to false.
  • keys (boolean) Whether the data event should contain keys. If set to true and values set to false then data events will simply be keys, rather than objects with a key property. Defaults to true.
  • values (boolean) Whether the data event should contain values. If set to true and keys set to false then data events will simply be values, rather than objects with a value property. Defaults to true.
  • limit (number) Limit the number of results collected by this stream. This number represents a maximum number of results and may not be reached if you get to the end of the data first. A value of -1 means there is no limit. When reverse=true the highest keys will be returned instead of the lowest keys. Defaults to -1 (no limit).
  • keyEncoding / valueEncoding (string) The encoding applied to each read piece of data.
  • raw (boolean) Provides access to the raw flumedb log. Defaults to false.

The objects in this stream will be of the form:

{
  "key": Hash,
  "value": Message,
  "timestamp": timestamp
}
  • timestamp is the time at which the message was received. It is generated by monotonic-timestamp. The range queries (gt, gte, lt, lte) filter against this receive timestamp.

If the raw option is provided, createRawLogStream is called instead, and messages are returned in the form:

{
  "seq": offset,
  "value": {
    "key": Hash,
    "value": Message,
    "timestamp": timestamp
  }
}

All options supported by flumelog-offset are supported.
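
For example, a sketch that collects the 10 most recently received messages (newest first), using pull-stream as in the example at the top:

pull(
  db.createLogStream({ reverse: true, limit: 10 }),
  pull.collect(function (err, msgs) {
    if (err) throw err
    // each entry is { key, value, timestamp }
    msgs.forEach(function (m) { console.log(m.timestamp, m.key) })
  })
)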

db.createHistoryStream: source

db.createHistoryStream(id, seq, live) -> PullSource
//or
db.createHistoryStream({ id, seq, live, limit, keys, values, reverse }) -> PullSource

Create a stream of the history of id. If seq > 0, then only stream messages with sequence numbers greater than seq. If live is true, the stream will be in live mode.

createHistoryStream and createUserStream serve the same purpose.

createHistoryStream exists as a separate call because it provides fewer range parameters, which makes it safer for RPC between untrusted peers.

Note: since createHistoryStream is provided over the network to anonymous peers, not all options are supported. createHistoryStream does not decrypt private messages.

  • id (FeedID) The id of the feed to fetch.
  • seq (number) If seq > 0, then only stream messages with sequence numbers greater than or equal to seq. Defaults to 0.
  • live (boolean): Keep the stream open and emit new messages as they are received. Defaults to false
  • keys (boolean): Whether the data event should contain keys. If set to true and values set to false then data events will simply be keys, rather than objects with a key property. Defaults to true
  • values (boolean) Whether the data event should contain values. If set to true and keys set to false then data events will simply be values, rather than objects with a value property. Defaults to true.
  • limit (number) Limit the number of results collected by this stream. This number represents a maximum number of results and may not be reached if you get to the end of the data first. A value of -1 means there is no limit. When reverse=true the highest keys will be returned instead of the lowest keys. Defaults to -1 (no limit).
  • reverse (boolean) Set true and the stream output will be reversed. Beware that due to the way LevelDB works, a reverse seek will be slower than a forward seek. Defaults to false.

db.messagesByType: source

db.messagesByType({ type: string, live, old, reverse: bool?, gt, gte, lt, lte: timestamp, limit: number }) -> PullSource

Retrieve messages with a given type, ordered by receive-time. All messages must have a type, so this is a good way to select messages that an application might use. This function returns a source pull-stream.

As with createLogStream, messagesByType takes all the options from pull-level#read (gt, lt, gte, lte, limit, reverse, live, old).
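
For example, a sketch that collects the 20 most recently received 'post' messages:

pull(
  db.messagesByType({ type: 'post', reverse: true, limit: 20 }),
  pull.collect(function (err, posts) {
    if (err) throw err
    console.log(posts.length + ' recent posts')
  })
)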

db.createFeedStream: source

db.createFeedStream({ live, old, gt, gte, lt, lte, reverse, keys, values, limit, fillCache, keyEncoding, valueEncoding, raw })

Like createLogStream, but messages are in order of the claimed time, instead of the received time.

This may sound like a better idea, but it has surprising effects with live messages (you may receive an old message in real time); for old messages, however, it makes sense.

The range queries (gt, gte, lt, lte) filter against this claimed timestamp.

As with createLogStream, createFeedStream takes all the options from pull-level#read (gt, lt, gte, lte, limit, reverse, live, old).

db.createUserStream: source

db.createUserStream({ id: feed_id, lt, lte, gt, gte: sequence, reverse, old, live, raw: boolean, limit: number, private: boolean })

createUserStream is like createHistoryStream, except all options are supported. Local access is allowed, but not remote anonymous access. createUserStream can decrypt private messages if you pass the option { private: true }.
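
For example, a sketch that reads our own feed from sequence number 100 onwards and decrypts any private messages (db.id is the default identity, analogous to app.id in the example at the top):

pull(
  db.createUserStream({ id: db.id, gte: 100, private: true }),
  pull.collect(function (err, msgs) {
    if (err) throw err
    console.log(msgs.length + ' messages from our own feed')
  })
)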

db.links: source

db.links({ source, dest: feedId|msgId|blobId, rel, meta, keys, values, live, reverse }) -> PullSource

Get a stream of links from a feed to a blob/msg/feed id. The objects in this stream will be of the form:

{ 
  "source": FeedId,
  "rel": String,
  "dest": Id,
  "key": MsgId,
  "value": Object?
}
  • source (string) An id or filter, specifying where the link should originate from. To filter, just use the sigil of the type you want: @ for feeds, % for messages, and & for blobs. Optional.
  • dest (string) An id or filter, specifying where the link should point to. To filter, just use the sigil of the type you want: @ for feeds, % for messages, and & for blobs. Optional.
  • rel (string) Filters the links by the relation string. Optional.
  • live (boolean): Keep the stream open and emit new messages as they are received. Defaults to false.
  • values (boolean) Whether the data event should contain values. If set to true and keys set to false then data events will simply be values, rather than objects with a value property. Defaults to false.
  • keys (boolean) Whether the data event should contain keys. If set to true and values set to false then data events will simply be keys, rather than objects with a key property. Defaults to true.
  • reverse (boolean): Set true and the stream output will be reversed. Beware that due to the way LevelDB works, a reverse seek will be slower than a forward seek. Defaults to false.
  • meta (boolean) If set to false, source, hash, and rel will be left off. Defaults to true.

Note: if source and dest are provided but not rel, ssb will have to scan all the links from source and then filter by dest. Your query will be more efficient if you also provide rel.
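
For example, a sketch that finds every message in which a given feed links to a given message (feedId and msgId are placeholders, and 'vote' is just an illustrative relation name):

pull(
  db.links({ source: feedId, dest: msgId, rel: 'vote', values: true }),
  pull.collect(function (err, links) {
    if (err) throw err
    // each entry is of the { source, rel, dest, key, value } form shown above
    console.log(links)
  })
)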

db.addMap: sync

db.addMap(fn)

Add a map function to be applied to all messages on read. The fn function should expect (msg, cb), and must eventually call cb(err, msg) to finish.

These modifications only change the value being read, but the underlying data is never modified. If multiple map functions are added, they are called serially and the msg output by one map function is passed as the input msg to the next.

Additional properties may only be added to msg.value.meta, and modifications may only be made after the original value is saved in msg.value.meta.original.

db.addMap(function (msg, cb) {
  if (!msg.value.meta) {
    msg.value.meta = {}
  }

  if (msg.value.timestamp % 3 === 0)
    msg.value.meta.fizz = true
  if (msg.value.timestamp % 5 === 0)
    msg.value.meta.buzz = true
  cb(null, msg)
})

const metaBackup = require('ssb-db/util').metaBackup

db.addMap(function (msg, cb) {
  // This could instead go in the first map function, but it's added as a second
  // function for demonstration purposes to show that `msg` is passed serially.
  if (msg.value.meta.fizz && msg.value.meta.buzz) {
    msg.meta = metaBackup(msg.value, 'content')

    msg.value.content = {
      type: 'post',
      text: 'fizzBuzz!'
    }
  }
  cb(null, msg)
})

db._flumeUse: view

db._flumeUse(name, flumeview) => View

Add a flumeview to the current instance. This method was intended to be a temporary solution, but is now used by many plugins, which is why it starts with _.

See creating a secret-stack plugin for more details.

db.getAtSequence: async

db.getAtSequence([id, seq], cb) //cb(err, msg)

Get a message from a given feed at a given sequence number. Takes a two-element array ([feedId, sequence]) and calls back with the message or an error.

Needed for ssb-ebt replication

db.getVectorClock: async

db.getVectorClock(cb) //cb(error, clock)

Load a map of id to latest sequence ({<id>: <seq>,...}) for every feed in the database.

Needed for ssb-ebt replication
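
For example, a sketch that prints how many feeds are known locally and the latest sequence held for our own feed:

db.getVectorClock(function (err, clock) {
  if (err) throw err
  // clock maps feed ids to their latest known sequence numbers
  console.log(Object.keys(clock).length + ' known feeds')
  console.log('our own feed is at sequence', clock[db.id])
})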

db.progress: sync

db.progress()

Return the current status of various parts of the scuttlebutt system that indicate progress. This api is hooked by a number of plugins, but ssb-db adds an indexes section (which represents how fully built the indexes are).

The output might look like:

{
  "indexes": {
    "start": 607551054,
    "current": 607551054,
    "target": 607551054
  }
}

Progress is represented linearly from start to target; current shows how far it has come, and once current equals target the progress is complete. The numbers could be anything, but start <= current <= target always holds; if all three numbers are equal, that should be considered 100%.

db.status: sync

db.status()

Returns metadata about the status of various ssb plugins. ssb-db adds a sync section, which shows where each index is up to. The purpose is to provide an overview of how ssb is working.

Output might look like this:

{
  "sync": {
    "since": 607560288,
    "plugins": {
      "last": 607560288,
      "keys": 607560288,
      "clock": 607560288,
      "time": 607560288,
      "feed": 607560288,
      "contacts2": 607560288,
      "query": 607560288,
      ...
    },
    "sync": true
  }
}

sync.since is where the main log is up to, and sync.plugins.<name> is where each plugin's indexes are up to.

db.version: sync

db.version()

Return the version of ssb-db. Currently, this returns only the ssb-db version, not the ssb-server version or the version of any other plugins. We should fix this soon.

db.queue: async

db.queue(msg, cb) //cb(error, msg)

Add a message to be validated and written, but don't worry about actually writing it. The callback is called when the database is ready for more writes to be queued. Usually that means it's called back immediately. This method is not exposed over RPC.

db.flush: async

db.flush(cb) //cb()

Calls back once all queued writes have actually been written to disk.
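
For example, a sketch that queues two already-validated messages (msg1 and msg2 are placeholders for valid, signed messages) and then waits for both to hit disk:

// queue both writes without waiting for each individual disk write
db.queue(msg1, function (err) { if (err) throw err })
db.queue(msg2, function (err) { if (err) throw err })

// flush once to know everything queued so far is persisted
db.flush(function () {
  console.log('both queued messages are now on disk')
})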

db.getFeedState: async

db.getFeedState(feedId, cb) // cb(err, state)

Calls back with state, an object of the form { id, sequence }: the most recent message ID and sequence number according to ssb-validate.

NOTE

  • this may contain messages that have been queued and not yet persisted to the database
    • this is required for e.g. boxers which depend on knowing previous message state
  • this is the current locally known state of the feed; if it's a foreign feed, the state may have progressed beyond what you know but you haven't received a copy yet, so use this carefully
  • "no known state" is represented by { id: null, sequence: 0 }

db.post: Observable

db.post(fn({key, value: msg, timestamp})) => Obv

Observable that calls fn whenever a message is appended (with that message). This method is not exposed over RPC.

db.since: Observable

db.since(fn(seq)) => Obv

An observable of the current log sequence. This is always a positive integer that usually increases, except in the exceptional circumstance that the log is deleted or corrupted.
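
For example, two small observers, one logging each appended message and one watching the log offset grow:

// called once per appended message
db.post(function (msg) {
  console.log('appended', msg.key)
})

// called whenever the log sequence changes
db.since(function (seq) {
  console.log('log is now at offset', seq)
})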

db.addBoxer: sync

db.addBoxer({ value: boxer, init: initBoxer })

Add a boxer to the list of boxers that will try to automatically box (encrypt) message content when an appropriate content.recps is provided.

Where:

  • boxer (msg.value.content, feedState) => ciphertext which is expected to either:
    • successfully box the content (based on content.recps), returning a ciphertext String
    • not know how to box this content (because the recps are outside its understanding), and return undefined (or null)
    • break (because it should know how to handle recps, but can't), and so throw an Error
    • The feedState object contains id and sequence properties that describe the most recent message ID and sequence number for the feed. This is the same data exposed by db.getFeedState().
  • initBoxer (done) => null (optional)
    • is a function which allows you to set up your boxer
    • you're expected to call done() once all your initialisation is complete

db.addUnboxer: sync

db.addUnboxer({ key: unboxKey, value: unboxValue, init: initUnboxer })

Add an unboxer object. Any encrypted message is passed to the unboxer to test whether it can be unboxed (decrypted).

Where:

  • unboxKey(msg.value.content, msg.value) => readKey
    • Is a function which tries to extract the message key from the encrypted content (ciphertext).
    • Is expected to return readKey which is the read capability for the message
  • unboxValue(msg.value.content, msg.value, readKey) => plainContent
    • Is a function which takes a readKey and uses it to try to extract the plainContent from the ciphertext
  • initUnboxer (done) => null (optional)
    • is a function which allows you to set up your unboxer
    • you're expected to call done() once all your initialisation is complete
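
A minimal skeleton showing the shape of both registrations; canBox, myBox, myUnboxKey and myUnboxValue stand in for a hypothetical encryption scheme and are not part of ssb-db:

db.addBoxer({
  value: function (content, feedState) {
    if (!canBox(content.recps)) return       // not ours: return undefined so other boxers can try
    return myBox(content, content.recps)     // return the ciphertext string
  }
})

db.addUnboxer({
  key: function (ciphertext, value) {
    return myUnboxKey(ciphertext, value)     // the read capability for this message, if we have one
  },
  value: function (ciphertext, value, readKey) {
    return myUnboxValue(ciphertext, readKey) // the decrypted plain content
  }
})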

db.box: async

db.box(content, recps, cb) // cb(err, ciphertext)

Attempt to encrypt content to recps (an array of keys/identifiers).

db.unbox: sync

db.unbox(data, key)

Attempt to decrypt data using key, a symmetric key that is passed to the unboxer objects.

db.rebuild: async

db.rebuild(cb)

Rebuilds the indexes. This takes a while to run, and using SSB features before it completes may lead to confusing experiences, as the indexes will be out of sync.

Deprecated APIs

db.getLatest: async

db.getLatest(feed, cb) //cb(err, {key, value: msg})

Get the latest message for the given feed, in {key, value: msg} style. May be used by some front ends, and by ssb-feed.

db.latestSequence: async

db.latestSequence(feed, cb) //cb(err, sequence)

Calls back with the sequence number of the latest message for the given feed.

db.latest: source

db.latest() => PullSource

Returns a stream of {author, sequence, ts} tuples. ts is the time claimed by the author, not the received time.

db.createWriteStream: sink

db.createWriteStream() => PullSink

Create a pull-stream sink that expects a stream of messages and calls db.add on each item, appending every valid message to the log.
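
For example, a sketch that copies a stream of already-signed messages into the local log (source is a placeholder for any pull-stream source of valid messages, e.g. a history stream from a peer):

pull(
  source,                  // placeholder: a pull-stream source of valid, signed messages
  db.createWriteStream()   // each message is passed to db.add
)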

db.createSequenceStream: source

db.createSequenceStream() => PullSource

Create a pull-stream source that provides the latest sequence number from the database. Each time a message is appended the sequence number should increase and a new event should be sent through the stream.

Note: In the future this stream may be debounced. The number of events passed through this stream may be less than the number of messages appended.

db.createFeed: sync (deprecated)

db.createFeed(keys?) => Feed

Use ssb-identities instead.

Create and return a Feed object. A feed is a chain of messages signed by a single key (the identity of the feed).

This handles the state needed to append valid messages to a feed. If keys are not provided, then a new key pair will be generated.

May only be called locally, not from a ssb-client connection.

The following methods apply to the Feed type.

Feed#add(message, cb)

Adds a message of a given type to a feed. This is the recommended way to append messages.

message is a JavaScript object. It must have a type property that is a string between 3 and 32 characters long.

If message has a recps property which is an array of feed ids, then the message content will be encrypted using private-box to those recipients. Any invalid recipient will cause an error, rather than accidentally posting the message publicly or without a recipient.

Feed#id

The id of the feed (which is the feed's public key)

Feed#keys

The key pair for this feed.

Stability

Stable: Expect patches and possible feature additions.

License

MIT

ssb-db's People

Contributors

alanshaw, arj03, black-puppydog, christianbundy, clehner, connoropolous, dominictarr, donpdonp, emyarod, fmarier, garrows, gmarcos87, luandro, missinglink, mixmix, mlegore, mmckegg, nhq, nichoth, pfrazee, prayagverma, soapdog, staltz, theproductiveprogrammer, unlmtd


ssb-db's Issues

should never happen - seq too high

A user installed the latest version without clearing their .ssb, then used a grimwire invite. Now, during gossip, my client crashes with:

/Users/paulfrazee/secure-scuttlebutt/validation.js:217
        throw new Error('should never happen - seq too high')
              ^
Error: should never happen - seq too high
    at validate (/Users/paulfrazee/secure-scuttlebutt/validation.js:217:15)
    at /Users/paulfrazee/secure-scuttlebutt/validation.js:227:7
    at Object.v.validate (/Users/paulfrazee/secure-scuttlebutt/validation.js:241:14)
    at EventEmitter.db.add (/Users/paulfrazee/secure-scuttlebutt/index.js:152:16)
    at /Users/paulfrazee/secure-scuttlebutt/index.js:235:12
    at /Users/paulfrazee/secure-scuttlebutt/node_modules/pull-paramap/index.js:41:11
    at check (/Users/paulfrazee/scuttlebot/node_modules/pull-many/index.js:71:16)
    at next (/Users/paulfrazee/scuttlebot/node_modules/pull-many/index.js:101:21)
    at Object.weird.read (/Users/paulfrazee/scuttlebot/node_modules/muxrpc/pull-weird.js:24:7)
    at onStream (/Users/paulfrazee/scuttlebot/node_modules/muxrpc/node_modules/packet-stream/index.js:90:14)

Merkle Tree Logs

Idea: if we were to use merkle tree logs, then it would be possible to securely replicate a portion of a feed.

The messages would still contain a link to the previous message, but the messages would also be hashed into a binary tree. Then I could request the 5th message, and it would point to the hash of the first 4 messages (since 4 is a power of 2).

it would be

hash(
  hash(hash(msgs[0]) + hash(msgs[1])) +
  hash(hash(msgs[2]) + hash(msgs[3]))
)

Then, if I requested message 1 later, I could verify it was consistent with what I was given last time.

Currently we only use a singly linked list. This is simple, but can only be used to verify that messages are part of a log if you have the entire log.

I suspect that this would be very useful on the edges of the network. Maybe you want to retrieve a reply that a friend of a friend of a friend made? If you did this for everyone on reddit it would be too much resources - but if you got a few messages from them, it would still be highly secure, because you would be able to prove those messages were part of the same tree, using just a few extra hashes...

This would mean rewriting most of the validation code.

jshint

Reading through the Scuttlebutt source, I notice many lint errors, such as undeclared variables etc. I'd like to propose that we adopt a very permissive linting regime, to:

  1. Enforce strict mode
  2. Catch undeclared or unused variables
  3. Require curly braces on statements where they would otherwise be optional

All of these things have a record of preventing bugs and security holes in software. I'll put together a pull request, see if you like it.

'Retires' or 'deletes' reltype

We need a way for users/apps to remove messages from the indexes. A good example: the replication protocol collects the 'follow' links to decide who to fetch, and we need a way to remove links when the user 'unfollows.'

One solution would be to create an 'unfollow' reltype, but I think we can go ahead and generalize the solution with a single reltype. The mechanic would be, if a message is published with the 'retires' (or 'deletes' or whatever) reltype by the same author of the target message, SSB de-indexes the target message from the type index, and from the indexes created by links in the target msg.

Occasional decode failure

For some reason, gui posts in the current WIP branch of phoenix are creating a crash-error in SSB.

/home/pfrazee/secure-scuttlebutt/codec.js:56
    v.value.signature = v.signature
     ^
TypeError: Cannot read property 'value' of undefined
    at Object.decode (/home/pfrazee/secure-scuttlebutt/codec.js:56:6)
    at Object.decode (/home/pfrazee/secure-scuttlebutt/node_modules/varstruct-match/index.js:46:30)
    at Object.decodeValue (/home/pfrazee/secure-scuttlebutt/node_modules/level-sublevel/node_modules/levelup/lib/codec.js:53:35)
    at /home/pfrazee/secure-scuttlebutt/node_modules/level-sublevel/nut.js:120:34

It's not clear to me what's different about the gui posts that's causing it. I've not observed it in action or text posts, but they don't have a significant structural difference. Here is the publishing code:

wsrpc.api.add({type: 'post', is: 'gui', text: text, timezone: localTZ}, cb)

Hiding messages on the feed through a secure branching algorithm

User feeds are broadcast-only because of constraints in the feed's security model.

Each entry has a sequence number and the signature of the previous message. There should be only one sequence of messages for a given keypair, so discovering two messages with the same sequence number is the sign of a forged message. Put another way, branches in the log are signs of errors or attacks.

Each branch can be identified uniquely by its chain of signatures. Peers can compare chains with each other to determine which branch is the forgery and which is canonical. The drawback of this design is that all messages in the feed must be broadcast to everyone: gaps in the log could signal that someone is hiding the messages from you. You want to have the full chain so you know nobody's blocking parts of a feed, and so you can verify the full history.

Proposal

If the server could selectively hide entries from its peers, it could deliver to select recipients without using encryption or ephemeral message. Messages could specify their recipients (including their pubs) and servers would only replicate to peers on that list. Doing so would be simpler than encryption, which requires key exchange, and would have better delivery guarantees than ephemeral messages, which can't benefit from the feed's signature-chain. (Message-hiding would not be a replacement for encryption.)

Details

Secure branching would require a peer-authentication handshake (to confirm the peer is the owner of a pubkey) and a protocol addition which does not weaken the feed's authenticity guarantees.

I propose we allow servers to branch their feeds (publish two messages under the same sequence number) so long as one of the messages is a link to the other. As links are hashes of their target messages, the branch-link message would provide a signature of the message with content. The branch-link would then be publicly shared, while the message with content would only go to its recipients. The following message in the public feed would include the branch-link's signature.

Example:

init msg
⬆️
broadcast: "hello!"
⬆️
branch link ➡️ send to bob: "I'm online!"
⬆️
broadcast: following bob
⬆️
branch link ➡️ send to alice, bob: "We're online!"

To make sure the content is distributed correctly, the branch-link might include the content's recipients, or possibly a passphrase which the recipient group had established previously to mark messages as meant for them.

Thoughts/concerns?

Managed pub-server registration

In the interest of simplicity, I propose setting up some pub servers which automatically accept new users. Clients may then hardcode those pub servers and choose one during setup to host the user's account.

This may require a registration endpoint/protocol.

Managing feed length: subfeeds, compaction

Large feeds will slow down edge-creation in the network as newly-introduced nodes will need to receive each others' full histories. Because this threatens the SSB network's efficiency, we should identify strategies to solve this within the core message types.

I suggest we isolate the feeds which represent identities in the network. (This relates to using feeds as a replacement to certs, #5). Identity-feeds would be restricted to a defined set of types; all other message-types would have no meaning. This would allow us to:

  • Put application-data into subfeeds, reducing the growth-rate of the identity-feeds.
  • Create a compaction scheme which can operate on the message types defined for the identity feed. (If we allow application-defined types, compaction would not be feasible.)

app kit :)

it'd be nice to have some intro guidelines around how one might create a custom app on the protocol

Reltype de-duplication from keynames

For links with simple, single-word reltypes, perhaps we can detect using $feed, $msg, or $ext, and let $rel be optional. If not specified, the reltype is created from the keyname of the link object, with camel-case converted to dashed lowercase.

Examples:

{ original: { $msg: <hash> } } → 'original'
{ repliesTo: { $msg: <hash> } } → 'replies-to'
{ 'is-about': { $msg: <hash> } } → 'is-about'

For multiple reltypes, the $rel keyword is required, and the keyname is not used. Example:

{ original: { $msg: <hash>, $rel: 'foo bar' } } → 'foo bar'

feed.add requires a callback

This code works:

'use strict';

var pull = require('pull-stream');
require('rimraf').sync('./interval.db') // delete the db from last time
var ssb = require('../create')('./interval.db')
var feed = ssb.createFeed()

setInterval(function () {
  feed.add('msg', 'hello there!', function () {

  })
}, 2000)

pull(
  ssb.createFeedStream({ tail: true }),
  pull.drain(function (message) {
    console.log('MESSAGE: ', message)
  })
)

but this code doesn't:

'use strict';

var pull = require('pull-stream');
require('rimraf').sync('./interval.db') // delete the db from last time
var ssb = require('../create')('./interval.db')
var feed = ssb.createFeed()

setInterval(function () {
  feed.add('msg', 'hello there!')
}, 2000)

pull(
  ssb.createFeedStream({ tail: true }),
  pull.drain(function (message) {
    console.log('MESSAGE: ', message)
  })
)

There's not much reason to use add without a callback, but this little quirk threw me off for a while.

Discussion on encrypted messages

Allowing entries in ssb to be fully encrypted would have some interesting properties.

Private messaging could work in a similar way to how bitmessage handles hiding of metadata. You just append a message to your feed, that is encrypted to a certain user, without revealing which user it is.

Everyone syncing will then try to decrypt the message, or maybe only a fixed-size header for performance reasons.

This can give you pretty reliable messaging, without attackers learning very much, except by correlating activity between the actual participants in the discussion. Here, some kind of 'chaff' would make sense, i.e. users have a privacy incentive to periodically post noise in their stream, to stop these kinds of correlation attacks.

Further along, exploring ways of having multiple parties deriving a shared secret, maybe even off the chains, would also allow group messaging.

I think this would make sense to add at the ssb level, and not leave up to applications to deal with. What are your thoughts?

Replication issues

Setup:
From a fresh start, NodeA/FeedA had 3 messages (create, "First post", "Second post"). NodeB follows FeedA; NodeB adds NodeA to network.

Replication run 1:
NodeB is informed the current sequence for FeedA is 3. NodeA then only sends seq 1 and 2 (wrong behavior!). NodeB correctly processes.

Replication run 2:
NodeB informs NodeA that local sequence for FeedA is 2; NodeA tells NodeB latest for FeedA is 3. NodeA then sends NodeB seq 2. NodeB errors out with "sequence out of error, expected 3, got 2", which is correct.

Looks like NodeA is not sending the last message in the feed. Off by one error?

ranking names/keys by link analysis

pagerank uses a straightforward ranking system, where pages linked are ranked by the sum of ranks of the pages that link to them. pagerank is calculated in an iterative process, all pages start off with equal rank, and then rank flows along the links. thus, if many pages link to you, you get a higher rank, or a few very highly ranked pages link to you you get a higher rank.

check out the images on the wikipedia page rank article https://en.wikipedia.org/wiki/PageRank

Since it starts off with all pages being equal, pagerank is vulnerable to spam - blackhat SEOers create many fake pages which then link to a target page, inflating its rank.

Google calculates (originally did, at least) a global pagerank value per site, so that it only has to do one pagerank calculation for the whole webgraph. There are some algorithms that attempt to prevent spam by starting with a known quality set of pages, for example trustrank: https://en.wikipedia.org/wiki/TrustRank, and then letting trust flow out of those pages as in pagerank. This essentially amounts to doing pagerank with a non-uniform initial state (some sites start off with higher initial rank).

We could use a similar approach with ssb. We can't provide a known good set, because that would be centralization, but since each user will need to calculate the rank for their peers, they could use an initial state where they trust themselves completely, and then let trust flow out from the links they have created - giving each user subjective ranks for their peers. My hunch is this would prevent much spam, because a spammer would have to trick you into explicitly trusting them, which would be much harder than just generating fake accounts.

defence against sock puppets

this is a really good article about attacking various sites with sock puppets,
(new york times, reddit, discus, twitter, etc)

http://conference.hitb.org/hitbsecconf2014kul/sessions/weapons-of-mass-distraction-sock-puppetry-for-fun-profit/

The attacks are mostly very simple, but due to the way that these services prevent spam there is not much they can do. Strongly defending against this attack is of fundamental importance to ssb, since it's much easier to generate a new ssb identity (just generate a public key).

blocking an abusive user

https://gist.github.com/dominictarr/43a351b525fc3c6eaf31#comment-1283868

There are some situations where we actually want the truncate attack. Say someone is harassing me, and I want to block them. I could ask my friends not to relay my information to them, and they would shun that node, pretending that I have just stopped publishing messages. This is essentially a coordinated truncate attack, performed by honest nodes against a malicious node. I think being able to block a harasser is more important than being able to guarantee freshness.

"truncate attack" discussed in that gist.

SSB Server

We're looking at moving the responsibilities of https://github.com/pfraze/phoenix-rpc into a standard SSB Server. One proposal was to keep the SSB API internal and expose the Feed API.

Some collected notes:

  • Connections need to be authenticated and be given varying permissions
  • SSB Server will need to manage keys (create, read) perhaps as part of the feed api. It should protect the private key.
  • Clients need to:
    • Add messages, query the feed (with pagination, filtering by author), query the link and message-type indexes
    • Manage the network nodes and trigger syncs. It should be possible to fetch all the messages added during a sync.
    • Run crypto ops with the keys (sign data, verify signature, generate hmac, encrypt/decrypt)
  • We're looking at moving phoenix-relay into SSB Server as well. One thing to note, the SSB Server currently serves an HTTP interface which is useful for looking up the users on the relay (see http://grimwire.com:64000/)

Also, do we want to call it SSB Server, or give it a fancy DB name?

naming registries

following on from this twitter thread here is how I think you might do a name registry on top of ssb:

Like in unix, we have files and directories. A file is just a mapping from a name to a hash (which is an external object, eg a tarball); a directory is a mapping from a name to a certificate.
The certificate says key X may post records appending to this directory (Y)
X can then make posts that say that, in directory Y, a given name maps to <foo_hash>.

readers would collect all the statements that X has made about directory Y, and will then know the contents of the directory.

The directories and files could then be referred to via paths:

<registry_id>:<directory_foo>:<file_bar>

This would look up the registry, then look for the directory it has named 'foo', then the file that directory has named 'bar'. When the registry created that directory, it could assign another key to "update" it, in which case, to resolve that path, you'd need to retrieve the posts from that key.

@Raynos @mikolalysenko @andreypopp

whoami optional param for keypair ownership challenge

Third party apps need proof that the local machine owns the claimed keypair. I propose we add an optional first param (the challenge) which, if provided, will be signed by ssb and returned in the whoami response.

eg

var challenge = Math.random()
ssb.whoami(challenge, function(err, id) {
  if (!ssbKeys.verify(id.public, id.sig, challenge)) {
    alert('Hackers afoot!')
  }
})

Latest changes break following - can't figure out why

This code works at commit 09af601, but doesn't work with the latest

'use strict';

var pull = require('pull-stream')
var net = require('net')
var toStream = require('pull-stream-to-stream')
var rimraf = require('rimraf')

rimraf.sync('./following1.db') // delete the db from last time
var ssb1 = require('../create')('./following1.db')
var feed1 = ssb1.createFeed()

rimraf.sync('./following2.db') // delete the db from last time
var ssb2 = require('../create')('./following2.db')
var feed2 = ssb2.createFeed()

var count = 0;

// Let's add some messages
feed1.add('msg', 'hello there! ' + count++, function () {})
feed1.add('msg', 'hello there! ' + count++, function () {})
feed1.add('msg', 'hello there! ' + count++, function () {})
feed1.add('msg', 'hello there! ' + count++, function () {})
feed1.add('msg', 'hello there! ' + count++, function () {})

// Adding a `follows` message causes feed2 to replicate messages added to feed1
feed2.add({ type: 'follow', $feed: feed1.id, $rel: 'follows' }, function () {})

setTimeout(function () {
  var a = feed1.createReplicationStream({ rel: 'follows' }, function () {})
  var b = feed2.createReplicationStream({ rel: 'follows' }, function () {
    pull(
      ssb2.createLogStream({ tail: true }),
      pull.drain(function (message) {
        console.log('SSB2: ', message)
      })
    )
  })

  // pull-stream provides us with this convenient way
  //  to connect duplex streams
  pull(a, b, a)
}, 1000)

Recency guarantees through signed pings

SSB is an eventually-consistent system on an asynchronous network. Asynchronous means there's no time-bound on delivery: the message can arrive on the other node at any time. Since the wall clock isn't consistent globally, there's no way to know how old the information is. That missing guarantee is called recency.

To provide a recency guarantee, we need a synchronous channel. This is trivial on a local network. On public net, a sync channel can be arranged through proxies (eg pub relays). Once it's established, users can send requests and measure the latency, giving them an upper-bound on how old the response is. This is the recency.

A "signed ping" is a request for the current seq number of a user's feed over the synchronous channel. With the resulting response time, the caller knows roughly how old the feed is. The ping response is signed to prove authorship, and response seq numbers prevent replay attacks. Signed pings can be cached, but they'd have relatively low TTLs (depending on the recency-guarantee needed).

Signed pings only work if both users are reachable. They can be used as heartbeats to provide fault detection. Because the heartbeat includes latest feed seqs, the recency guarantee transfers to the feed's messages, and they become effectively synchronous. Groups of nodes can then implement leader election (eg Raft) using SSB.

idea for simplifying replication protocol

I'm thinking about simplifying the replication protocol, so that it just uses the rpc protocol.

So, instead of sending all the {id, seq} pairs, then receiving all the messages,
it just uses the api to request a stream from the remote. remote.createHistoryStream({id, seq})
we would just need to request all the feeds we follow.

This would also make it very easy to add other request response things that you might want to process during replication - such as a follow request with a code...

There might be other things too?

This would increase the level of abstraction in replication, but make its operation much more obvious. Also, implementations would only need to implement rpc correctly, and not rpc AND replication.

Trustless trust

It would be nice if ssb could create a confirmations field to ensure that the data in there is trustable and was verified at least x times by other nodes. This could make the replicated data much more reliable, since you might or might not know peers, beyond the initial public ones, that delegate which other nodes in the network you are able to connect to.

id = hash of first message? not pubkey?

Should the id be the hash of the first message?

This would mean that the id is more directly tied to the feed,
and it would also be possible to have multiple feeds with the same key (but different ids)

Maybe this makes very little difference?

@pfraze what do you think?

ephemeral messages

A channel for communication outside of replication would be useful for on-boarding new users https://github.com/pfraze/phoenix/issues/22

This could also be utilized to allow a limited capacity to push messages to a node. For example, sometimes you want people who do not follow you to be able to send you a message. One way to secure this would be to route messages through the social network: A can send a message to C by sending it through their mutual friend B. Then C can read the message and choose to follow A.

remove punctuation from relation names

Module names on npm used to be any ascii string.
If someone published a module name that you wanted
then you could publish another one with capital letters.

Capital letters were removed, because some operating systems don't support them in file names.
AHEM.

However, you can still do that with punctuation: foobar and foo-bar are different modules.
This is actually a security hole! The official coffee script module has a dash: https://www.npmjs.org/package/coffee-script but it turns out a non-trivial number of people accidentally type npm install coffeescript, so it would be pretty easy to use that to deliver a Trojan https://www.youtube.com/watch?v=49Bpzq6okWk

I'm not sure how that might be a security hole for us, but it does create confusion (which app uses reply-to, which replyTo, and which replyto?), so I propose trimming reply-to, replyTo, and REPLY_TO all down to replyto.

If I could travel back in time and write npm before isaacs did, this is one of the things I could do to make it better.

simplest way to handle attachments?

I'm trying to figure out the simplest way to handle attachments.

thinking out loud here

I originally thought that the answer was to put them through as a part of the replication protocol.

Then I thought hey maybe we could cut around that and make something simpler by
just using http (at least in the early versions), and requesting or posting hashes.

But then I realized that it depends on where you expect those hashes to be.

Okay, it generally helps to talk about a concrete usecase.

So: oakdb. oakdb is a secure database such that you might implement a package manager.
there are two operations: 1) give a hash a path (you sign a record saying you name hash X)
2) link a path to another user's path (you sign a record saying your name for another users's path is Y) think of this as a symlink that is also a certificate.

to get to the hashes that I have named you could use a path like this:

MY_HASH/foo/1.0.0, which would point to the correct hash for foo@1.0.0

These hashes are probably tarballs, of course.
you could use this with a central registry. The central registry could just be a "meeting place" or it could have be a root namespace, maybe so that I can leave off the hash of the registry's key. instead of REGISTRY_HASH/foo/1.0.0 I could just say /foo/1.0.0.
I would just have to request to the registry to bless my foo module in some way.

Now, when I publish a module, I'd need to push the tarball as well. In couchdb you publish by pushing to the registry, but in oakdb we have a local replica, so we just put the message and tarball in that and then replicate with the registry.

Or maybe we just have a tarball store beside the registry, and on publish make sure that the tarball store has the tarballs we have published? If we push the tarballs in order we'd just have to remember the last tarball we sent it.

Suppose we did have a central registry that gave us certificates for module names?
we should probably only put the certified modules in the tarball store.
it's possible to also have a tarball that isn't published to the registry - say a fork.
So which tarballs belong where?

is the answer to traverse the links and put the tarballs which are reachable along a path into your replica?

Okay so lets say we can calculate whether we want a given tarball. can we also calculate whether another node should want a given tarball, or do they have to tell us?

If they tell us, do they need to know that we may have their tarball?
Say, we have either published it, or linked to it?

Suppose each app indexes who published/reshared what and then used that info to figure out where to get something?

error when replicating phoenix

something weird going on, causing ssb to crash

/home/dominic/c/phoenix/node_modules/secure-scuttlebutt/index.js:160
      if(--n) throw new Error('called twice')
                    ^
Error: called twice
    at Object.cb (/home/dominic/c/phoenix/node_modules/secure-scuttlebutt/index.js:160:21)
    at /home/dominic/c/phoenix/node_modules/secure-scuttlebutt/validation.js:160:13
    at /home/dominic/c/phoenix/node_modules/level-sublevel/shell.js:92:46
    at /home/dominic/c/phoenix/node_modules/level-sublevel/nut.js:107:13

hmm, if I try it again it gets a little further but then crashes again... eventually it worked without crashing.

switch to JSON encoding?

<domanic> pfraze, I have been wondering whether we should simply make secure scuttlebutt use json?
<pfraze> domanic, you think?
<domanic> would make implementing much easier because you wouldn't have to implement a binary parser, and debug it
<domanic> (thinking of third party implementations here)
<domanic> binary is awkward though, we'd need to use base64 for hashes and signatures
<pfraze> yeah, that'd be the downside
<domanic> but... we could still save bandwidth by just compressing the streams
<domanic> that would probably bring it back down to a binary format
<domanic> question: will it be faster than binary?
<domanic> this is slightly faster: https://github.com/mafintosh/protocol-buffers
<pfraze> might be in js environments, yeah
<pfraze> interesting, how recent is that?
<domanic> only a few months old
<pfraze> hmm. Not an order of magnitude, but still faster
<domanic> yeah, < 10% faster
<domanic> but the C implementation will be faster
<mafintosh> pfraze, domanic: if you checkout the source so see that readability is a trade-off for getting the performance we ended up getting out of the protocol-buffers module
<domanic> mafintosh, that is what I'm thinking - it's better to optimize for adoptability
<domanic> make it easier to implement a competing implementation - you need that for a p2p protocol to be truly decentralized
<mafintosh> domanic: protocol-buffers parsers are widely available though for almost every platform/language
<domanic> true, but for ssb we need an unstructured embedded format anyway
<domanic> like json - currently we are using msgpack
<domanic> ... but the js implementation for that is slower than javascript
<mafintosh> domanic: ah okay
<pfraze> yeah, pbuf requires a schema definition
<pfraze> json is wire readable
<domanic> on the other hand - there is a kinda macho thing here...
<domanic> binary protocol is more hard core
<mafintosh> domanic: i don't think you'll be able to get JSON like perf out of a non-schema binary protocol anyways
<domanic> and if you are gonna mess with crypto then you better be able to handle binary
<domanic> mafintosh, yeah, no especially not in pure js
<mafintosh> domanic: unrelated, we can probably speed up https://github.com/dominictarr/varstruct by orders of magnitude if we code generate the parsers etc
<domanic> mafintosh, for sure
<domanic> the problem with JSON though, when doing crypto/signatures is that JSON is unstable
<domanic> because of eg, whitespace. 
<nathan7> sorted JSON works
<mafintosh> domanic: also how unicode characters are encoded
<domanic> nathan7, but if you have to sort the json then you don't have the perf of the native JSON implementation
<domanic> this might be better: https://camlistore.googlesource.com/camlistore/+/master/doc/json-signing/json-signing.txt
<mafintosh> i ran into that problem yesterday while trying to generate shasums for docker images (their JSON encoder mapped unicode chars to '\u...' etc.)
<domanic> actually I'm gonna test that
<jbenet> domanic: protobuf with optional fields
<domanic> jbenet, that works for the outer, but the inner message has an arbitrary structure
<jbenet> Oh yeah, I just use an opaque 'bytes Data' field
<domanic> jbenet, that is what we started with too
<domanic> but we decided there was a lot to be gained from a consistent format
<domanic> in particular - we could index structures within messages
<domanic> so if apps create messages that securely refer to other messages, we can detect that without relying on the app (which might be untrustworthy)
<domanic> pfraze, okay so straightforward JSON.stringify is 4 times faster than sorting first
<pfraze> domanic, how does it compare to msgpack-js?
<domanic> pfraze, okay - sorted stringify is slower than msgpack + varstruct
<domanic> but json is still 2.5 times faster than varstruct/msgpack
<pfraze> will the character encoding and whitespace issues be significant?
<pfraze> if not, given the wire-readability, I think that might be the right call
<domanic> okay so there are few other factors
<domanic> how fast to verify a signature?
<pfraze> there'd probably be a lot of base64/buffer conversions
<domanic> the only thing is that we'd either have to re-encode consistently, OR keep the encoded version around for every decoded message
<domanic> my hunch is the latter is gonna be cheaper

issues to investigate

  • performance to create signature
  • performance to verify a signature (this will happen way more)
  • performance to parse and re-encode the stream

this is a nice method that works around the whitespace problem:
https://camlistore.googlesource.com/camlistore/+/master/doc/json-signing/json-signing.txt

maybe we should do that, and keep the encoded form around so that "re-encoding" is just returning that string.

Also, if all objects can be kept to a single line, then we can just use line-separated JSON, which is much cheaper than a streaming JSON parser, which would have to be implemented in js.
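For the sorted-JSON variant from the chat above, a minimal sketch (illustrative, not the actual ssb code; sha256 stands in for whatever hash we settle on): stringify with sorted keys once, sign/hash that exact string, and keep it around so "re-encoding" is just returning it.

var crypto = require('crypto') // sha256 here is just for illustration

function stableStringify (obj) {
  if (obj === null || typeof obj !== 'object') return JSON.stringify(obj)
  if (Array.isArray(obj)) return '[' + obj.map(stableStringify).join(',') + ']'
  return '{' + Object.keys(obj).sort().map(function (k) {
    return JSON.stringify(k) + ':' + stableStringify(obj[k])
  }).join(',') + '}'
}

function encodeOnce (msg) {
  var encoded = stableStringify(msg)
  return {
    encoded: encoded, // sign/verify/hash this exact string, never a re-stringify
    value: msg,
    hash: crypto.createHash('sha256').update(encoded).digest('base64')
  }
}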

Auto-"joining" bidirectional link messages

The linking mechanism adds references in the index which connect the link's target to the message containing the link. There may be cases, however, where users want to create links between two previously-published messages. To do this, I propose supporting a special relation which signifies that a message other than the one containing the link should be indexed. I suggest 'anchor' for the relation.

A traditional example:

12d238..aa = {
  ...
  message: {
    replyTo: { $msg: 0a3cc3..2f, $rel: 'reply-to' },
    plain: 'This is my reply!'
  }
}

A bidirectional example:

12d238..aa = {
  ...
  message: {
    plain: 'This is my reply!'
  }
}
ffac09..23 = {
  ...
  message: {
    replyTo: { $msg: 0a3cc3..2f, $rel: 'reply-to' },
    anchor: { $msg: 12d238..aa, $rel: 'anchor' }
  }
}

In both cases, the indexes will connect 0a3cc3..2f to 12d238..aa.

Include ID in messages

It would be very handy to have the IDs of messages when they're read out of SSB. Currently, phoenix is generating them by hand.

Looking at the code, we can either store the IDs in the messages themselves, or modify each read function to add the ID to the messages

standardizing getter output

in phoenix code, I've started to standardize the message objects in the {key:, value:} form, for two reasons:

  1. i can modify that wrapper object without modifying the content of the message, so i can stick on new data (eg msg.wasRead = true)
  2. it gives me access to the msg's key

problem is that the getters are inconsistent about the output format, and sometimes won't let you get it in this form:

  • createFeedStream can't get it with keys
  • createHistoryStream can't get it with keys
  • createLogStream can only get it with keys
  • messagesByType respects the keys opt
  • messagesLinkedToMessage can't get it with keys
  • all other link-based getters use their own format

so I'm thinking we should update the first 3 of those to respect the keys opt, and then update messagesLinkedToMessage to work like the other link-based getters
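For concreteness, a sketch of the target shape, assuming an ssb instance (the opts here are the proposal, not what createFeedStream currently supports):

var pull = require('pull-stream')

pull(
  ssb.createFeedStream({ keys: true, values: true }), // proposed opts
  pull.collect(function (err, msgs) {
    if (err) throw err
    msgs.forEach(function (msg) {
      // msg.key   -> the message hash, so apps don't recompute it by hand
      // msg.value -> the signed message content, untouched
      msg.wasRead = true // app-level data can hang off the wrapper object
    })
  })
)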

that sound good?

Using opts in createLogStream() screws with sublevel

Providing an opts object causes createLogStream to read values from the top-level database, not the 'log' sublevel. The subsequent map from log item to message fails, causing the stream to emit a 'Key not found in database' error.

deadlock when encrypting replication channel

the replication protocol waits for the sink to be finished before closing the source.
This causes a race condition when the channel is encrypted. The sink needs to see the last message
to be done, but it will never decode it because the source needs a bit more data pushed into the cipher (or to call cipher.final()).

The solution is for the source to give a "goodbye" message when it has received everything it was expecting.
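A minimal local sketch of the goodbye idea (not the actual replication code): the source appends a sentinel once the underlying stream ends, and the remote sink stops as soon as it sees it, so it never waits on bytes stuck in the cipher.

var pull = require('pull-stream')

var GOODBYE = { goodbye: true } // sentinel; the real protocol would encode this on the wire

function withGoodbye (source) {
  var sent = false
  return function (abort, cb) {
    if (abort) return source(abort, cb)
    source(abort, function (end, data) {
      if (end === true && !sent) { sent = true; return cb(null, GOODBYE) }
      if (end) return cb(end)
      cb(null, data)
    })
  }
}

pull(
  withGoodbye(pull.values([1, 2, 3])),
  pull.drain(function (data) {
    if (data === GOODBYE) return false // returning false aborts the drain
    console.log('received', data)
  })
)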

annotate hashes to be tagged pointers? does this lead to schemas?

I've been thinking about this a lot since discussing on irc with @pfraze the other day,
ssb has 3 types of links:

  • hash of a pubkey (a Feed id)
  • hash of a message (a Message id - i.e. the previous-message hash on every message)
  • hash of an arbitrary blob (an attachment)

If hashes were tagged with the type of thing they point to, you get a really nice ability:
the system can index relationships automatically: that feed X is linked to feed Y (for example, because they are friends), that message B was created strictly after message A (to which it was a reply), or that some message refers to an attachment J. Without tags, the specific
meaning of each hash must be interpreted from its context, or the hashed object must be
retrieved and parsed.

Another idea that we have discussed recently is identifying messages via the hash of their schema. There are certainly nice things about this idea, but also unknowns.
The idea would be to have a canonical representation of each schema, and then id that schema with its hash. This would allow objects to be tagged like in git, but also allow
user applications to create new types, and to reflect and parse the documents without
running those applications.

To combine these ideas, we would need links to be a hash plus a type hash.
That would make each link 64 bytes long, which wouldn't fit on an 80-char terminal line.

Maybe we could just use a 1-byte tag for messages, feeds and attachments.
An attachment could be tagged by having the hash of its schema at the beginning.
Of course, this would be incompatible with most standard mimetypes, so we'd need a raw
blob as well... so that would be 4 id types? (feed, message, attachments: tagged and raw?)

What should links be like?

We could get away with {T}{hash}, but I think there is a strong case for including other metadata in the link, such as the size of an attachment, the feed id and sequence of a message, or the ip address of a relay as part of a feed id. Sometimes this extra metadata
might be unnecessary or unwarranted, or would just create a token that is too long.
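As a strawman for the 1-byte-tag option, a sketch (the tag values are made up and this is not the ssb wire format):

var TYPES = { feed: 1, message: 2, blob: 3 } // hypothetical tag values

function encodeLink (type, hash) { // hash: 32-byte Buffer (e.g. blake2s output)
  if (!TYPES[type]) throw new Error('unknown link type: ' + type)
  return Buffer.concat([Buffer.from([TYPES[type]]), hash]) // {T}{hash}
}

function decodeLink (buf) {
  var type = Object.keys(TYPES).filter(function (k) {
    return TYPES[k] === buf[0]
  })[0]
  return { type: type, hash: buf.slice(1) }
}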

@jbenet has a similar idea over here: jbenet/random-ideas#1

Not able to get following working

I'm trying to write some examples of SSB doing different things as a way to learn how it works. Kind of stuck on getting it to replicate:

'use strict';

var pull = require('pull-stream')
var net = require('net')
var toStream = require('pull-stream-to-stream')
var rimraf = require('rimraf')

rimraf.sync('./following1.db') // delete the db from last time
var ssb1 = require('../create')('./following1.db')
var feed1 = ssb1.createFeed()

rimraf.sync('./following2.db') // delete the db from last time
var ssb2 = require('../create')('./following2.db')
var feed2 = ssb2.createFeed()

var count = 0;

setInterval(function () {
  feed1.add('msg', 'hello there! ' + count++, function () {})
}, 2000)

// publish follows link
feed2.add({ type: 'follow', $feed: feed1.id, $rel: 'follows' }, function () {})

var a = ssb1.createReplicationStream({ rel: 'follows' }, function () {})
var b = ssb2.createReplicationStream({ rel: 'follows' }, function () {})

pull(a, b, a)

pull(
  ssb1.createFeedStream({ tail: true }),
  pull.drain(function (message) {
    console.log('SSB1: ', message)
  })
)

pull(
  ssb2.createLogStream({ tail: true }),
  pull.drain(function (message) {
    console.log('SSB2: ', message)
  })
)

Efficiently replicate document sets via causal links

Idea: an efficient replication handshake via causal links.

Each document links to the previous documents in the dataset. This may be one or more documents. The documents may represent either a snapshot or a diff, but it's assumed that both parties in a replication require all the documents in the set.

If the links are just hashes, then it's necessary to exchange the list of hashes (or use something like https://github.com/dominictarr/merkle-stream or dispersy, but both of these will replicate documents in random order - which may be inconvenient).

Basically, data replication is two things: 1) remote set comparison, and 2) sending data. (1) is the interesting part.

I investigated remote set comparison with merkle-stream, and got these results: https://github.com/dominictarr/merkle-stream/blob/master/test/bench.png The bottom axis is the proportion of difference between the two sets (in the center they have 50% of the same records),
and the right axis (for the blue graph) is the bandwidth used compared to just sending all the hashes. Clearly, merkle-trees are only efficient for this if the datasets are nearly the same or differ by only a few records.

After finding this result I abandoned this approach, and decided that to efficiently compare remote sets you need to leverage the structure inherent in the data. Or, you can put structure into the data so as to make it easier to replicate (this is scuttlebutt's approach).

Scuttlebutt works well for data that can be partitioned into sections owned by a single writer (like "social network" feeds) but not for say, a wiki. With a wiki, each update would point back to the previous document(s). If there happens to be no concurrent updates, it will be a linked list, but if there are then it will be a tree.

I think the solution is more or less like what git uses. You traverse the tree back into the past (linkwise) and send those hashes to the other side. There needs to be a way for the other side to reconstruct your tree - they don't need the entire files yet, just the relations. I have two ideas for how to do this, which I will describe below. The protocol is symmetrical, so when you are sending your tree, you are also receiving your friend's tree. When you get to a point in the tree where you know all the branches, then we know the subset that is missing.

To illustrate, I'll describe the simplest case.

I have document A, and you edit it creating A', so I send hash(A) and you send hash(A') and hash(A). When you get hash(A) from me, you'll know that I am missing A', so you can send that to me.
Objects should always be sent in chronological order, so that it's possible to reason about what objects a remote instance has: if they know an object B, they must know the objects that B links to.
(although leaf objects with no links can be handled differently)

If we both edit it, then we have A' and A"; I'll send A',A and you'll send A",A. If there are many users editing a popular document in parallel (eg, a hot reddit page) then we'll have lots of concurrent updates. Basically we just send the top of the tree until we have found a shared cut. By "cut" I mean a minimal set of vertices (hashes) that we both share, which partitions off all the hashes unknown to the other side.

Now, one question is how many hashes do we send? The simple approach would be to send some fixed amount, say 1k worth of hashes, and if we still haven't found the cut then send 2k, and so on.
Of course, this amount could be a configuration parameter, which could be adjusted dynamically, perhaps according to how many updates you expect since you last replicated.

To encode the tree, you could have a binary format where you send a hash, plus an integer N (the number of hashes it links to), plus an array A of pointers to those linked hashes. This would save you from sending the hash B twice to say A->B, B->C.

Another option would be to give every document a monotonic counter, that is, 1 + the maximum counter from the documents that it links to. This way, the hashes can be kept in a single index
and read in one scan. This method would not encode the links explicitly, but since the counter increases monotonically, if we both have the same set of hashes with the same counter value,
then that is a valid cut. This works well if there is a reasonable chance of updates being serial (I make A" after receiving your A', in which case the cut is a single hash). That is probably a fairly reasonable assumption even for the most-edited wikipedia pages (http://fivethirtyeight.com/datalab/the-100-most-edited-wikipedia-articles/): 45,000 edits in a year is 1 every 120 seconds,
which gives a reasonable chance to synchronize. Probably it would be simplest to include this counter inside the document and reject documents that have invalid counters.
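A minimal sketch of the counter rule, assuming an in-memory map of known documents keyed by hash (names are illustrative):

function expectedCounter (links, docsByHash) {
  var max = 0
  links.forEach(function (hash) {
    var doc = docsByHash[hash]
    if (!doc) throw new Error('unknown link: ' + hash)
    if (doc.counter > max) max = doc.counter
  })
  return max + 1 // 1 + the maximum counter among the linked documents
}

function validateCounter (doc, docsByHash) {
  // reject documents whose claimed counter doesn't match what their links imply
  return doc.counter === expectedCounter(doc.links, docsByHash)
}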

Related issues and projects
@substack's https://github.com/substack/fwdb/
https://github.com/jbenet/ipfs/issues/8
pfrazee/eco#3

thoughts? @pfraze @mafintosh @jbenet @substack @chrisdickinson

Every node is its own centralized authority: stolen keys

if someone steals your private key, it should not be possible for them to impersonate you (add messages to your chain)

idea: if nodes detect a branch (two valid extensions to a message chain), then consider that key compromised.
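A sketch of what branch detection could look like, assuming messages in the {key, value} form with author and sequence fields (illustrative, not the current validation code):

function isFork (a, b) {
  // two distinct messages claiming the same author and sequence number
  return a.value.author === b.value.author &&
         a.value.sequence === b.value.sequence &&
         a.key !== b.key
}

// e.g. when a replicated message collides with one we already have:
// if (isFork(incoming, existing)) markCompromised(incoming.value.author)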

Nodes should also identify themselves when replicating - and no node should send me my own messages - my private key should never leave my device, so if I am receiving a message, it's one that I didn't write.

Currently the replication algorithm is not aware of the identity of the local node.

Improved messagesByType: multiple types, added since time X

Applications will need to fetch the messages, update their local state, then fetch any new messages since the last fetch. As a practical example, Phoenix needs to construct the profiles of users by fetching all the profile messages and getting the last nickname value published by a given user.

There's two tricks here. First, we need to be able to fetch multiple message types at a time. The phoenix feed view, for instance, will probably want everything in https://github.com/pfraze/phoenix/issues/102 (init, text, profile, follow, pub).

Second, we need to be able to get messages that have newly arrived. If I receive a message at 6pm that was published at 4pm, and I say "give me everything that's arrived since 5pm," then I'd want that message published at 4pm.
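A hypothetical shape for the improved call, assuming an ssb instance and a remembered lastFetchTimestamp (the types array and the gt-on-arrival-time opt are the proposal here, not the current API):

var pull = require('pull-stream')

pull(
  ssb.messagesByType({
    types: ['init', 'text', 'profile', 'follow', 'pub'], // several types at once
    gt: lastFetchTimestamp, // filter on when the message *arrived*, not its claimed publish time
    keys: true
  }),
  pull.collect(function (err, msgs) {
    if (err) throw err
    // update local state, e.g. the latest nickname published by each user
  })
)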

reject too-big messages

oops, @pfraze created message H6yP9cn3T7D4lB/Yf9ndjVD9gRPwBFImhjsldFtS1+0=.blake2s which is way too big and should not have been validated.

now there are two problems: fix validation to disallow such messages, and clean the current instances so that the invalid messages are removed.

Should feeds be IDed with the public key to protect against hash-collision attacks?

This relates to #8, but regards internal architecture.

Currently, the feeds are identified with Blake2 hashes of their public keys. This means an effective collision-attack could create unpredictable behaviors.

How much of an efficiency impact would we take if we used public keys as the IDs? Based on the hex-strings I've been staring at, the hash (32 bytes) looks like it's half as large as the ECC public key (64 bytes). If that's right, then I think we should consider accepting the drop in efficiency for a simpler security design.

normalize timestamps to UTC-0 ?

I don't want to put timezones on the timestamps because that's location-specific info we don't need to emit. However, without correcting for timezones, posts will appear to come from the future.

timestamps aren't trustworthy so it's not a huge deal, but the phoenix ui uses them to order the feed (at least for now). Would it be better if we normalized all message timestamps to UTC-0? Other machines could then convert to their local timezone.
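One option: Date.now() already returns milliseconds since the Unix epoch, which is UTC by definition, so storing that number carries no timezone information at all; readers convert to local time only for display.

var msg = {
  type: 'post',
  text: 'hello',
  timestamp: Date.now() // UTC epoch milliseconds, no timezone attached
}

// on another machine, render in the viewer's local timezone:
console.log(new Date(msg.timestamp).toString())    // local time
console.log(new Date(msg.timestamp).toISOString()) // UTC, e.g. for debugging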
