noncesense-research-lab / archival_network

Investigating the frequency of alternative blocks, reorganizations, potential double-spend attacks, selfish mining, and more.

License: MIT License

Shell 0.30% Jupyter Notebook 99.70%


archival_network's Issues

IP addresses

Retain the IP addresses from which each copy of a block/transaction broadcast is received. This will be helpful for studying topology/latency. Additionally, it will allow cross-referencing between blocks, though this could be indexed in other ways if operating in IP-blind mode.

Prior to incorporation into publications/dashboards, exact IP addresses should be obscured to preserve privacy.
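
One possible way to obscure addresses before publication is a keyed hash, so the same IP always maps to the same label (cross-referencing still works) but the label cannot be brute-forced back to an address without the key. This is only a sketch: the function name, the 16-character truncation, and the idea of a project-held secret key are assumptions, not something the project has decided on.

```python
import hashlib
import hmac


def pseudonymize_ip(ip: str, secret_key: bytes) -> str:
    """Return a stable label for an IP address using HMAC-SHA256.

    secret_key is a hypothetical project-held secret; without it the small
    IPv4 address space cannot simply be hashed exhaustively to undo the mapping.
    """
    digest = hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate for readability in plots and tables (assumed length).
    return digest[:16]
```

The key has to stay private and stay the same across datasets that are meant to be cross-referenced.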

Analysis: Do side-chains with lots of PoW correspond to lost work on main-chain?

Since our mysterious entity has ~ 50 MH/s, look at whether:

A) times with side chains correspond to losing 50 MH/s for an hour on the main chain.

-- or --

B) if the massive hashpower pouring in does NOT correspond to a loss on the main chain, which would mean that this is a different entity.

Testnet ASIC presence 2017?

Were the ASICs from spring 2017 run on the testnet or stagenet before making an appearance in the main chain hashrate?

The bigger question is whether we should keep an eye on testnet hash rate for any surreal spikes.

Compatible (unified?) configuration script

When the current MAP_VPS_setup.sh script is executed on a fresh Debian install, user map is not added to sudoers (maybe this is intentional, but configuring monerod-archive requires sudo to write in /opt).

It would be ideal if the MAP_VPS_setup also pulls down and places the monerod-archive binary in the appropriate location.

The MAP_VPS_setup.sh could also wget the archival daemon configuration script to create the directory and configure monerod-archive as an auto start service, as mentioned in #36

Wrong link on monerodocs.org

https://monerodocs.org/interacting/monerod-reference/#testing-monero-itself

Here they give you a shoutout, but it links to https://www.noncesense.org/ - appears your DNS is not configured to redirect www to https://noncesense.org/

Temporal analysis of block discovery time

Look for signatures of selfish mining

Installing Grafana

This issue will track my work in order to install Grafana in one VPS for MAP Project.

Install Grafana on the MAP-GRAFANA machine.
Install InfluxDB.
Point all the CollectD instances to InfluxDB.
Create the dashboard "Nodes" and its rows. Each row will be named after the hostname of its VPS.

Substitute NaNs for delta_time and delta_length instead of dropping starting and orphan blocks

As noted in the text of altchain_temporal_study.py, this is a planned enhancement.

Analysis: Histogram of time between splits

Potentially two histograms:
A) histogram of length between orphaned single blocks (should be natural)
B) histogram of length between peculiar side chains

Document "remote node" setup for MAP servers

Hack together a quick wiki page about how to connect to Tokyo as a remote node.

Note, the goal here is to connect data about propagation of transactions that originate at our nodes. R & D purposes.

If you want privacy, I would suggest NOT connecting to one of our nodes operating at --log-level > 9000

Use DataFrame.loc instead of multiple indexing to assign values.

There are currently multiple warnings raised for using multiple indexing to assign values. This is bad practice, see Pandas docs.
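
A minimal illustration of the difference, on a toy frame. The delta_time column name comes from the study script; the block_height column, the values, and the "negative means invalid" rule are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"block_height": [100, 101, 102], "delta_time": [120.0, -5.0, 240.0]}
)

# Chained indexing such as
#     df[df["delta_time"] < 0]["delta_time"] = float("nan")
# may assign into a temporary copy and triggers SettingWithCopyWarning.
# A single .loc call selects rows and column in one step and assigns in place.
df.loc[df["delta_time"] < 0, "delta_time"] = float("nan")
```

The same pattern would cover substituting NaNs for delta_time and delta_length rather than dropping rows.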

THIS_BLOCK_ID missing

The member

// hash cash
mutable crypto::hash hash;

on struct block is intentionally omitted from its SERIALIZE directive, and so it never appears in JSON and is not archived.

See cryptonote_basic.h struct block: https://github.com/monero-project/monero/blob/ebf2818ab5f42b10745cb99d07920f3197c3d914/src/cryptonote_basic/cryptonote_basic.h#L386

Should we try to add this field to serialization and bring it into the exported block JSON?

Why frequent 15-35 len alt chains??

Why do many nodes report alt chains going 20 or 30 blocks deep?

Example histogram here: https://github.com/Mitchellpkt/Monero_AltBlock_Research/blob/master/Plotting/node_586b0c5_histogram.png

It has been posited that this is due to a bug in 0.12.0.0 causing these long side chains.

Can you shed light on what could be causing this? Seems like a phenomenal waste of PoW.

Do peer lists contain internal IP addresses?

If so, what fraction? Probably small.
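
A quick sketch of how the fraction could be computed once peer list IPs are extracted as strings (the extraction step and the function name are assumptions). Note that Python's is_private is broader than the RFC 1918 ranges: it also flags loopback and link-local addresses.

```python
import ipaddress


def fraction_private(peer_ips):
    """Return the fraction of addresses in peer_ips that are not globally routable.

    peer_ips is assumed to be an iterable of IPv4/IPv6 address strings.
    """
    addrs = [ipaddress.ip_address(ip) for ip in peer_ips]
    if not addrs:
        return 0.0
    return sum(addr.is_private for addr in addrs) / len(addrs)
```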

Extract archived txns and query via RPC

Create a script that takes the custom monerod-archive log file as input and extracts a list of all txn hashes that show up in main chain block and a list of all txn hashes that show up in the alternative blocks.

Goal 1: Explore setdiff(alt_txns, main_chain_txns) to see what exactly goes on in those alt blocks….

Goal 2: Run each of the alt chain transaction hashes through RPC/get_transactions and make sure that we can retrieve the details (key image, ring member) for all alt txns. (This is important to verify soon; IFF alternative transactions are not available through the RPC, we must jump on modifying the patch to include transaction write-out functionality).
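
Rough sketch of both goals, assuming the txn hashes have already been parsed out of the log into two lists (the log parsing is not shown because the log format is not described here). The request body in get_transactions_payload reflects my understanding of the daemon's get_transactions endpoint (txs_hashes, decode_as_json); treat it as an assumption and check it against the RPC documentation.

```python
def alt_only_txns(alt_txns, main_chain_txns):
    """Goal 1: hashes seen in alternative blocks but never on the main chain."""
    return sorted(set(alt_txns) - set(main_chain_txns))


def get_transactions_payload(tx_hashes):
    """Goal 2: build the JSON body for a get_transactions request.

    Field names are assumed from the daemon RPC documentation; nothing is
    sent over the network here.
    """
    return {"txs_hashes": list(tx_hashes), "decode_as_json": True}
```

Any hash that the daemon cannot return for the alt-only set would be the signal that transaction write-out needs to go into the patch.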

Are we prepared for triple blocks?

It just dawned on me that it is statistically likely that we will encounter heights with 3 versions of a solved block.

Why?

Natural splits due to latency often lead to a single orphaned block. If that was the only source of forks, we expect a split to two versions.

However, we know that there is a second phenomenon: a miner frequently running out side chains 20 - 35 blocks long. While those artificial side chains are being produced, there will still be the usual benign orphaned blocks.

Once I have values for (frequency of single orphaned blocks) and (frequency of these longer side chains) it will be possible to calculate the statistical frequency of expected triplets.

However, back-of-the-envelope: I suspect that random latency splits occurring at the same time as the artificial side chains is a common event.

Are we prepared for this?

@neptuneresearch, how does the custom daemon handle this?

Node-receipt timestamp

The "timestamp" included in the block by the miner could be spoofed, inaccurate, or not updated.

Should record two fields:

miner-reported timestamp (MRT)
node-received timestamp (NRT)

Time-received resolution

Time received is stored in the filename to second resolution. Several copies of new blocks all seem to roll in together within a second, so we cannot identify latency. Need to record sub-second received time.

NOT the same as the timestamp in the block, which is chosen by the miner, and could be spoofed.

De-duplicate alternative blocks from restarts

Restarts and syncing can cause a single alternative block to be recorded multiple times in the logs. Use the first timestamp for a given version as its observation date, and ignore subsequent reports of the same alternative block.
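
A minimal sketch of that rule, assuming each log report has been reduced to a (timestamp, block_hash) pair; the tuple layout and function name are hypothetical.

```python
def first_observations(records):
    """Map each block hash to the earliest timestamp at which it was reported.

    records is an iterable of (timestamp, block_hash) pairs in any order;
    later reports of the same block (e.g. after a restart) are ignored.
    """
    first_seen = {}
    for timestamp, block_hash in records:
        if block_hash not in first_seen or timestamp < first_seen[block_hash]:
            first_seen[block_hash] = timestamp
    return first_seen
```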

Empirical study of historical miner penalties

Suppose we have block at height H, which would normally generate a coinbase reward of R(H) if the block is small and there is no penalty. If there is an oversize block, then the coinbase is reduced. Define P(H) as the penalty imposed on each block. Total coinbase payout T(H) is thus:
T(H) = R(H) - P(H)

I'm interested in collecting these variables, and in a histogram of {P}, which I assume will have a lot of small blocks with P(H) = 0. How often are penalties applied, and what does their distribution look like? Bonus points for a 3D distribution showing how it evolves over time, i.e.
x-axis: time bins
y-axis: P (bins?)
z-axis: counts of penalties applied within that time window

Even a 2D histogram would be sweet, before jumping into the 3D.

Further, consider the miner's gain (G) from electing to oversize the block and take the penalty - how well were they compensated? Now we include the total fees the miner collected, F(H)
G(H) = F(H) - P(H)
I'm curious about the frequency of oversized blocks and profitability in those instances.
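
A starting-point sketch for the arithmetic above, working in integer atomic units. It assumes the data dump yields, per height, the unpenalized reward R(H), the coinbase actually paid excluding fees T(H), and the total fees F(H); how those are derived from {height, total fees, block reward, block size} is left open. Function names and the binning scheme are made up.

```python
from collections import Counter


def penalty(base_reward, coinbase_paid):
    """P(H) = R(H) - T(H)."""
    return base_reward - coinbase_paid


def miner_gain(total_fees, base_reward, coinbase_paid):
    """G(H) = F(H) - P(H)."""
    return total_fees - penalty(base_reward, coinbase_paid)


def penalty_histogram(penalties, bin_width):
    """Count penalties per bin; keys are the lower edge of each bin."""
    return Counter((p // bin_width) * bin_width for p in penalties)
```

For example, a block with R = 100, T = 90 and F = 25 has P = 10 and G = 15, so oversizing paid off in that case.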

This project is totally open game. I have zero time to pursue NRL endeavors, at least for the next month. I would love for somebody to tackle this. Could even be a simple Jupyter notebook. Ping @neptuneresearch for data dumps of {height, total fees, block reward, block size} which I think is all that's necessary for the first steps described above.

Enumerate Monero nodes

During this interview on Bitcoin Uncensored, fluffy comments: "Monero Hash tracks nodes, but they don't track every single node. It's not like they are plowing through nodes like chain analysis would, trying to enumerate them"

Well, that's actually a very interesting idea.... Should be a quick iterative process... Request a peer list from each connected node. Connect to those nodes and request their peer lists (memoize by not repeatedly connecting to already-sampled nodes). Repeat recursively until we know about all of the open nodes.
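
The crawl described above is a breadth-first traversal with a visited set. A sketch, where get_peer_list is a hypothetical callable that connects to a node and returns its peer list (the actual networking is not shown):

```python
from collections import deque


def enumerate_nodes(seed_nodes, get_peer_list):
    """Return every node reachable by repeatedly requesting peer lists.

    seed_nodes: nodes we are already connected to.
    get_peer_list: hypothetical function node -> iterable of peer nodes.
    Each node is queried at most once (the memoization step).
    """
    seen = set(seed_nodes)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for peer in get_peer_list(node):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen
```

In practice unreachable nodes, timeouts, and stale peer list entries would need handling, so this would count advertised nodes rather than strictly live ones.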

Occasionally checking the size of the Monero network will be valuable for ascertaining how the number and distribution of nodes impacts other characteristics. This will also help with determining how MAP nodes should be geographically distributed, to match the profile of Monero network activity.

As a speculative side note / secondary analysis... This kind of network mapping is probably already a routine procedure for one or more surveilling entities. Could we turn this idea on its head and analyze our connection history across multiple MAP nodes for evidence of such activity by non-MAP entities? If the scanning party does not take steps to intentionally obscure/obfuscate their search pattern, it would be trivial to see their loggers sweep across our archival nodes.

Consider three MAP nodes {A, B, C} configured so that node B is always connected to node A and node C. If a scan is executed without concealing the behavior, what would we expect to see in our combined logs across the MAP network?

13:59:00 Some unknown node X connects to MAP node A
13:59:01 Node X requests peer list from MAP node A
13:59:02 Node X disconnects from A
13:59:03 Node X connects to node B
13:59:04 Node X requests peer list from MAP node B
13:59:05 Node X disconnects from B
13:59:06 Node X connects to node C
13:59:07 Node X requests peer list from MAP node C
13:59:08 Node X disconnects from C

This propagating blip would be strongly suggestive of a network scan in progress
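
One naive way to look for that blip in combined logs: flag any peer that connects to several distinct MAP nodes within a short window. The event tuple layout, the threshold of 3 nodes, and the 60-second window are all assumptions for illustration. Ordinary peers will also connect to more than one MAP node over time, so hits are only candidates for a closer look.

```python
from collections import defaultdict


def suspected_scanners(events, min_nodes=3, window_seconds=60):
    """Return peers seen connecting to >= min_nodes MAP nodes within the window.

    events: iterable of (timestamp_seconds, peer_id, map_node) tuples.
    """
    visits_by_peer = defaultdict(list)
    for timestamp, peer_id, map_node in events:
        visits_by_peer[peer_id].append((timestamp, map_node))

    flagged = set()
    for peer_id, visits in visits_by_peer.items():
        visits.sort()
        start = 0
        for end in range(len(visits)):
            # Shrink the window until it spans at most window_seconds.
            while visits[end][0] - visits[start][0] > window_seconds:
                start += 1
            nodes_in_window = {node for _, node in visits[start:end + 1]}
            if len(nodes_in_window) >= min_nodes:
                flagged.add(peer_id)
                break
    return flagged
```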

Naming our metrics

Let's say we have a table of block ID and node receipt timestamps (NRTs) for an archival node.

Height HH:MM:SS
1643586 // 00:00:00, 00:00:04, 00:00:09
1643587 // 00:02:05, 00:02:06, 00:02:11, 00:03:00
1643588 // 00:04:06, 00:05:00, 00:07:05
1643589 // 00:07:04, 00:07:14, 00:07:18, 00:07:21

Notation

NRT(H,x) is the xth time that we received a copy of block H

Block discovery waiting time

How long did it take for somebody to solve block H?
W(H) := NRT(H,first) - NRT(H-1, first)
A histogram of this quantity over many H's tells us about mining activity.

Broadcast delay &/or timestamp spoofing

Difference between block's miner timestamp and actual broadcast to network
Maybe call this D for 'delay'
D(H) = NRT(H,first) - MRT(H)
(don't need to specify first or last for MRT since it will be the same in all copies)
A histogram of this quantity over many H's would theoretically provide information about latency etc. However, there is a lot of timestamp spoofing, which becomes the more interesting feature of this histogram

Block broadcast window

What is the time difference between first and last receipt of a certain block by a given node?
B(H) := NRT(H,last) - NRT(H,first)
What are the implications? What would a histogram of this show us? Essentially, the time envelope for bursts of network activity around block discovery times. This might be an interesting way to heuristically detect a running node by network traffic rates, even if actual content is concealed by VPN, etc.

Block receipt count

How many times do we receive a copy of a given block?
C(H) := # of NRT entries for height H
What does this tell us?
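
To make the single-node definitions concrete, here is a sketch that computes W, B and C from the example table above. "First" and "last" are taken as the minimum and maximum receipt time for a height; the helper names are made up.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Node receipt timestamps from the example table, keyed by height.
raw_nrt = {
    1643586: ["00:00:00", "00:00:04", "00:00:09"],
    1643587: ["00:02:05", "00:02:06", "00:02:11", "00:03:00"],
    1643588: ["00:04:06", "00:05:00", "00:07:05"],
    1643589: ["00:07:04", "00:07:14", "00:07:18", "00:07:21"],
}
nrt = {h: sorted(hms_to_seconds(t) for t in times) for h, times in raw_nrt.items()}


def waiting_time(nrt, height):
    """W(H) = NRT(H, first) - NRT(H-1, first)."""
    return nrt[height][0] - nrt[height - 1][0]


def broadcast_window(nrt, height):
    """B(H) = NRT(H, last) - NRT(H, first)."""
    return nrt[height][-1] - nrt[height][0]


def receipt_count(nrt, height):
    """C(H) = number of NRT entries for height H."""
    return len(nrt[height])
```

With the table values, W(1643587) is 125 seconds and B(1643588) is 179 seconds. Note that the first copy of 1643589 arrives one second before the last copy of 1643588, so broadcast windows of consecutive blocks can overlap.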

Global block propagation time

How long does it take for a broadcast to propagate across the network to the last node? Use extended notation: NRT(N,H,x) indicating the timestamp when MAP node N received the xth copy of block at height H

Suppose MAP node 'orange' is the first to hear a block, and MAP node 'ginger' is the last to hear about that block. Then we are interested in
G(H) := NRT(ginger, H, first) - NRT(orange, H, first)

More generally,
G(H) := NRT(last node, H, first copy) - NRT(first node, H, first copy)

This would be very interesting for both blocks and transactions - and can be used to estimate the expected number of orphaned blocks due to natural causes.
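
A sketch of the multi-node version, taking the latest first-receipt minus the earliest first-receipt (the same sign convention as the orange/ginger example). The input is assumed to be one first-receipt timestamp per MAP node for a given height; with only a few MAP nodes this is a lower bound on the true network-wide propagation time.

```python
def global_propagation_time(first_receipts):
    """G(H): spread between the earliest and latest first receipt of a block.

    first_receipts: dict mapping MAP node name -> NRT(node, H, first) in seconds.
    """
    times = list(first_receipts.values())
    return max(times) - min(times)
```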

Ring size histogram

Plot distribution of ring sizes since January 2017 (and changes over time)

height_to_time fraction converter for initial sync.

Syncing the blockchain requires patience, and the progress indicator given is not a good indicator of progress.

Synced XXXXX/1625048 provides a fraction completed in terms of 'height', but this is very poorly correlated with how much time is left. This is because early blocks sync very quickly, and later blocks are quite slow.

While different nodes sync at different absolute speeds, based on bandwidth and power, the relative speeds and slowdowns seem perceptually similar. That means that one could make a plot of [fraction of sync time] vs [fraction of sync height].

While this is mostly a novelty, it would be interesting to find the kinks in the plot and mark what they correspond to (e.g. changes in volume, changes in features, etc). This could be useful for studying how scaling of various technologies has performed in practice, relative to theoretical O(*n)
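
If somebody logs (elapsed seconds, synced height) pairs during an initial sync, normalizing them into the two fractions is straightforward. A sketch, with the sample format and function name assumed:

```python
def sync_progress_curve(samples):
    """Convert (elapsed_seconds, height) samples into fractional coordinates.

    samples must be sorted by time and end at the fully synced state.
    Returns a list of (fraction_of_sync_height, fraction_of_sync_time) points.
    """
    total_time, total_height = samples[-1]
    return [(height / total_height, elapsed / total_time) for elapsed, height in samples]
```

Curves from different nodes could then be overlaid to check whether the kinks really line up.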

JetBrains open source license

Our project has great need and use for JetBrains CLion, PyCharm and WebStorm. We could probably make use of some of their other tools also.

@Mitchellpkt Can you apply for an open source license for our project?

Snippet from JetBrains open source page:

Open Source Licenses

Get free licenses for JetBrains tools if your non-commercial open source project meets these requirements:

    Your project meets the Open Source definition
    Your project is at least 3 months old
    Your project is actively and regularly developed
    You are the project lead or an active committer
    Your project is NOT sponsored by a commercial company or organization and does NOT have paid employees
    Your project does NOT provide commercial services (such as consulting or training) around the software, and does NOT distribute paid versions of the software

Qualifying open source projects may apply for licenses to the All Products pack, TeamCity, YouTrack, and Upsource

Here's the link to apply: JetBrains open source license request

Analysis: Fish out some of the super-slow blocks

Just for anecdotal fun.

Sort the b1s data frame by delta_time, and peek at a few of the silly slow blocks.

Analyze transaction chains with unusual ring sizes

I think that the necessity for a fixed ringsize is relatively self-evident from a statistical perspective (I fully support fixed ring-size). However, it’d be fun to pull out some proof for good measure. (Only looking at transactions since January 2017, for relevance)

Fishing around for signatures of anomalous behaviors falls right in the ballpark of #noncesense-research-lab :- ) … Seems like a straightforward project to iterate over transactions with non-standard ringsizes and check whether any of their ring members were outputs generated by txns that had unusual ringsizes themselves.

I wouldn’t be surprised if we locate a few chains of transactions with a string of unusually-sized rings surrounded by 7-member decoys. (Of course, this can be compared against the background likelihood of selecting a decoy generated by a transaction with > 7 ring members.)

Wanted to check whether somebody else is already working on this? If not, suggestions for tackling? My default approach would be to use RPC (python wrapper?) to scan through [2017-present] transaction tree, memoize ring size info, then analyze that. Let me know if you have advice for starting points/libraries, better approaches, or prior art.

Configure monerod-archive as auto-start service

Suggestion from @neptuneresearch

Right now, the daemon is manually launched by ./monerod-archive --detach (+other args)

This method has no auto-restart, so the nodes are not very resilient.

It would be better if we register the archival daemon as a systemd service (?)

Ideally this would go in the configuration script

Missing: Historical data of orphaned block contents

If you have any tips for obtaining data on the contents of orphaned blocks, please comment or contact me.

This is crucial for ascertaining whether or not a double-spend attack has ever occurred or been attempted (by checking whether two blocks at the same height contain the same key image spending to a different recipient stealth address)

It is currently unknown whether or not anybody retained these records. Can you find a copy?

Persistent `~/.bitmonero/bitmonero.log`

Uh oh, it seems like the bitmonero.log files are not persistent. I should have realized that before.

Given the archival nature of our project, we want to retain monerod log files, so we have them handy for future analyses.

This issue is being filed as the primary obstacle to analysis addressing issue #28

@serhack @neptuneresearch - I assume there's probably an easy way to mitigate this?

Subsecond node-receipt timestamp

When each copy of a block is received (perhaps multiple copies from multiple nodes), record the node receipt timestamp to milliseconds.

This enables study of latency. What's the timing on the shortest route for a txn/block to arrive at MAP node? What's the timing on the longest route? What is the scale of the time difference? milliseconds? seconds?

Do Frankfurt and Tokyo have identical peer lists?

Check this to see whether iterative enumeration is even necessary.
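
A simple comparison sketch, assuming each node's peer list has been dumped to a list of address strings (the dump step is not shown and the function name is made up). A Jaccard index of 1.0 would mean the lists are identical.

```python
def peer_list_overlap(peers_a, peers_b):
    """Compare two peer lists and report what is unique to each.

    Returns the entries only in A, only in B, and the Jaccard index
    (size of intersection divided by size of union).
    """
    a, b = set(peers_a), set(peers_b)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return {"only_a": a - b, "only_b": b - a, "jaccard": jaccard}
```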

Use unsupervised machine learning to link transactions with related origins, based on ring member age distribution

As discussed in the "custom ring composition spoils Monero fungibility" wiki, any non-standard algorithm for decoy selection can be used to group transactions that are potentially made by the same wallet or entity. This can be automated by applying unsupervised clustering algorithms on the empirical age distribution of decoys used in real transactions.

In an ideal world, where all users and wallets follow the typical decoy selection algorithm, all Monero transactions fall into the same indistinguishable cluster. However, a set of transactions generated with significant deviations away from the norm (e.g. using a uniform selection algorithm) will shift to their own cluster.

The largest cluster(s) with the most members represent the fungible bulk of Monero, and the outlier clusters should be quite interesting to inspect.

Note, it might be useful to try log(age) coordinates as well, to catch signatures on shorter timescales.

Discord link is invalid

Hey all 😁 I see that the discord link on the readme is broke... care to post a new one? 🙏🏼

Retain loose transactions

Record all transaction broadcasts received by the node. This means keeping:

Transactions that were recorded on the main chain (OOTB daemon retains these)
Transactions that ended up in the side chain (OOTB daemon discards these)
Transactions that appear in both (will be frequent, for benign splits)

Right now, we retain the Txn hashes, but we need to know:

Stealth address
Ring signature members (sender & decoys)
Key image (!!!!)

How many unique IP addresses for nodes?

Check peerlist logs to see if there are multiple nodes that share the same IP address (for instance, multiple users running over the same VPN service).

Why does this matter? Using small round exaggerated numbers: Suppose there are 200 active nodes, and 100 of them are using VPN company X over a small set of IP addresses. If Monero activity through VPN company X is halted (whether intentionally or by accident) this would cause a disproportionately large blow to the network.

It's a quasi-centralization that could cause small points of failure to have a larger impact.
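
A sketch of the counting step, assuming peerlist entries can be reduced to (peer_id, ip) pairs so that distinct nodes behind one address can be told apart. That input shape is an assumption about what the logs contain.

```python
from collections import defaultdict


def shared_ip_summary(peers):
    """Summarize how many nodes sit behind shared IP addresses.

    peers: iterable of (peer_id, ip) pairs.
    """
    ids_by_ip = defaultdict(set)
    for peer_id, ip in peers:
        ids_by_ip[ip].add(peer_id)

    total_nodes = sum(len(ids) for ids in ids_by_ip.values())
    shared = {ip: len(ids) for ip, ids in ids_by_ip.items() if len(ids) > 1}
    nodes_on_shared = sum(shared.values())
    return {
        "unique_ips": len(ids_by_ip),
        "shared_ips": shared,
        "fraction_nodes_on_shared_ips": nodes_on_shared / total_nodes if total_nodes else 0.0,
    }
```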

Analysis: Histogram of side chain length

Visualize the wonkiness. Definitely multimodal.

Mobile

G_LIBC Upgrade

Debian 9 ships with G_LIBC (package known as libc6-dev) version 2.24 preinstalled.

To upgrade, follow these steps. Warning: if you run commands and you don't know what they do, please don't try to upgrade G_LIBC!

1. Open /etc/apt/sources.list with any editor.
2. Add the following line: deb http://ftp.us.debian.org/debian sid main
3. Run apt-get update
4. Let's upgrade LIBC! Run apt-get install libc6-dev and then wait...
5. Reopen /etc/apt/sources.list and comment out (with "#") the line you added in step 2.
6. Enjoy!

NEVER NEVER run an apt-get upgrade or apt-get full-upgrade while the sid line is active in /etc/apt/sources.list. If you upgrade all the packages to "sid", the system can become unstable.

Change to launching monerod with --log-level 2

Right now launch_monerod.sh launches with --log-level 0

I think level 2 gives more output? Is this right?

It will need to be updated in launch_monerod.sh and the node instruction document.

Analysis: plot each side-chain's average block time against its length

Expect different clusters, separating different phenomena / players.

Choose format

Right now data is organized in messy grep'd out log dumps. Need a clean format to use for this project. It seems like each entry receiving a given block should contain:

Block height
Block nonce
Timestamp (optional, since redundant with block height, but finer-grained)
Block version A identifier
Block version B identifier
(Block version C identifier?)
Did this cause a reorganization (0/1)

Suggestions for format? Other ideas for data to include? Thanks!
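
One option to kick off the discussion: one JSON object per line, with the fields listed above. The class name, field names, and the choice of JSON Lines are only a suggestion, and the block version identifiers are kept in a list so a third version at the same height needs no schema change.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class BlockObservation:
    height: int
    nonce: int
    block_ids: List[str]          # version A, B, (C...) identifiers at this height
    caused_reorg: bool            # did this cause a reorganization
    timestamp: Optional[float] = None  # optional, finer-grained than height


def to_json_line(observation):
    """Serialize one observation as a single JSON line."""
    return json.dumps(asdict(observation), sort_keys=True)
```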

Fix plot titles

Do not use "rate" unless referring to quantity per unit time.