probe-lab / network-measurements
License: MIT License
We are currently capturing the number of clients observed in the public IPFS DHT network and report it as part of our weekly reports (currently in this repo; see the Week 17 example), as well as at probelab.io: https://probelab.io/ipfsdht/#client-vs-server-node-estimate.
As per this discussion thread in Slack, this is great, but it only captures part of the story: it focuses on the public IPFS DHT only, which in turn means it mostly covers Kubo. However, IPFS is more than the Kubo implementation and more than the public IPFS DHT. A request from @BigLep is to be able to "show the number of peer ids observed across various "networks" and break out by implementation".
To do this, we'd need to identify data sources (i.e., how to collect the data) from different: i) IPFS implementations (e.g., Kubo, Helia, Iroh), and ii) networks that run IPFS nodes (e.g., the IPFS DHT, the Lotus DHT, cid.contact/IPNI, etc.). We should also ideally deduplicate the PeerIDs to avoid double-counting a peer that participates in more than one network (?).
I'm starting this issue to capture first what we want to target and then come up with data collection ideas (e.g., through measurement tools, logs etc.).
cc: @BigLep @dennis-tra
eta: 2023Q1
Study the impact of balancing Kademlia buckets over each bucket subkeyspace.
Associated PR: #36
In https://github.com/ipfs/interop, we still have tests running libp2p circuit relay v1, which makes sense because it has functionality that relay v2 does not; however, it has caused some issues. See
I'm wondering if we can get metrics on which relay versions are being used and how much traffic exists for each. I understand that we should be able to query the DHT for multiaddrs that indicate which relay version(s) are available.
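As a rough sketch of that check: relay support typically shows up in a peer's advertised protocol IDs (e.g. as returned by `ipfs id`). The classification below uses the standard libp2p relay protocol IDs and is illustrative only, not a full measurement tool:

```python
# Sketch: classify which circuit-relay versions a peer advertises, given the
# protocol list returned by e.g. `ipfs id <peer>` (jq .Protocols).
# The protocol IDs below are the standard libp2p ones.

RELAY_V1 = "/libp2p/circuit/relay/0.1.0"
RELAY_V2_HOP = "/libp2p/circuit/relay/0.2.0/hop"

def relay_versions(protocols: list[str]) -> set[str]:
    """Return the set of relay versions a peer supports, based on its protocols."""
    versions = set()
    if RELAY_V1 in protocols:
        versions.add("v1")
    if RELAY_V2_HOP in protocols:
        versions.add("v2")
    return versions

# Example: a peer advertising only relay v2
print(relay_versions(["/ipfs/id/1.0.0", "/libp2p/circuit/relay/0.2.0/hop"]))  # {'v2'}
```

Aggregating this over a crawl would give the per-version node counts; traffic volume per version would need instrumentation on the relays themselves.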
As far as which metrics would be useful, I think the following is a good start:
Questions I want to answer with this data:
Please let me know if this request/issue is better suited elsewhere! Thanks.
I'm wondering what the impact is of peers that join the IPFS DHT and rotate their PeerIDs excessively. We've seen in recent reports, e.g., the Week 5 Nebula Report, that there are 5 peers which each rotate their PeerID 5000 times within the space of a week. This comes down to a peer getting a fresh PeerID every couple of minutes. The number of rotating PeerIDs seen is roughly as large as the number of relatively stable nodes in the network (aka the network size). The routing table of DHT peers is updated every 10 minutes, so the impact likely doesn't stick around for longer than that, but given the excessive number of rotations, I feel this requires a second thought.
I can see three cases where this might have an impact (although there might be more):
The first case should be covered by the concurrency factor, although the large number of rotations might be causing issues. We could check the second case through the CID Hoarder - @cortze it's worth spinning up an experiment to cross-check what happens with previous results. Not sure what can be done for the third case :)
Thoughts on whether this is actually a problem or not:
It's worth checking whether those PeerIDs co-exist in parallel in the network, or whether, when we see a new PeerID from a given IP address, the previous one(s) we've seen from that address have disappeared. @dennis-tra do we know that already? Is there a way to check that from the Nebula logs?
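One way to run that check, assuming we can extract per-peer visibility windows from the logs (the tuple shape below is an assumption for illustration, not Nebula's actual schema):

```python
# Sketch: given (peer_id, ip, first_seen, last_seen) observations, check
# whether PeerIDs seen from the same IP co-existed (overlapping visibility
# windows) or replaced each other sequentially.

from collections import defaultdict

def coexisting_peers(observations):
    """Return {ip: set of (peer_a, peer_b) pairs whose windows overlap}."""
    by_ip = defaultdict(list)
    for peer_id, ip, first_seen, last_seen in observations:
        by_ip[ip].append((peer_id, first_seen, last_seen))
    overlaps = defaultdict(set)
    for ip, peers in by_ip.items():
        peers.sort(key=lambda p: p[1])  # sort by first_seen
        for i, (pa, fa, la) in enumerate(peers):
            for pb, fb, lb in peers[i + 1:]:
                if fb <= la:  # windows overlap -> the two PeerIDs co-existed
                    overlaps[ip].add((pa, pb))
    return dict(overlaps)

obs = [
    ("PeerA", "1.2.3.4", 0, 100),
    ("PeerB", "1.2.3.4", 50, 150),   # overlaps with PeerA
    ("PeerC", "1.2.3.4", 200, 300),  # appears after both are gone
]
print(coexisting_peers(obs))  # {'1.2.3.4': {('PeerA', 'PeerB')}}
```

IPs with no overlapping pairs would be consistent with sequential rotation rather than many distinct NATed peers.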
Also, from @mcamou:
re: thousands of PeerIDs with the same IP, I don't think that we can completely rule out that they are different peers mainly due to NAT. On the one hand, some ISPs implement CG-NAT, where they do use a single IP for multiple customers. On the other hand, you might have large companies who have a single Internet PoP for their whole network.
Depending on how many IPs we have in this state, we might want to run a study on the above 2 cases (and others we might think of). One thing to look at would be whether the same PeerID shows up consistently or whether it's a one-off.
Extra thoughts more than welcome.
We're seeing a very large number of offline peers each week (graph below; latest graph here). Offline peers are defined as those seen online 10% of the time or less (https://probelab.io/ipfsdht/#availability). This might be affecting the churn we're seeing in the network: the churn CDF shows a median lifetime of ~20 minutes, but the real value will be lower, since the churn calculation excludes nodes we have never contacted.
Such short-lived peers do not actually contribute to the network: they fill other peers' routing tables, but do not stay online to provide records, if they happen to store any.
This is a tracking issue for figuring out more details, together with some thoughts on what we can do to find out where this large number is coming from.
We see:
We need to:
As a solution, we could avoid adding peers to the routing table immediately after they're seen online, and instead wait for some amount of time before adding them. In the meantime, new peers can be pinged more frequently when they are first added to the routing table, with the ping frequency gradually decreasing over time as the peer proves to be stable.
The primary question here would be how long should we wait before adding peers to the routing table.
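A minimal sketch of the admission-delay idea; all the constants below are placeholders for illustration, not proposed values:

```python
# Sketch: delay routing-table admission for newly seen peers, then ping them
# frequently at first and back off as they prove stable.
# All parameters are assumptions, not tuned values.

ADMISSION_DELAY = 600       # wait 10 min before admitting a newly seen peer
INITIAL_PING_INTERVAL = 60  # ping fresh routing-table entries every minute
MAX_PING_INTERVAL = 600     # settle at the usual 10-minute refresh interval
BACKOFF_FACTOR = 2          # double the interval after each successful ping

def next_ping_interval(successful_pings: int) -> int:
    """Ping interval (seconds) after a number of consecutive successful pings."""
    interval = INITIAL_PING_INTERVAL * (BACKOFF_FACTOR ** successful_pings)
    return min(interval, MAX_PING_INTERVAL)

# A newly admitted peer is pinged at 60s, 120s, 240s, 480s, then every 600s.
print([next_ping_interval(n) for n in range(6)])  # [60, 120, 240, 480, 600, 600]
```

With numbers like these, a peer that churns within its first ~10 minutes never enters other peers' routing tables at all.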
Other thoughts and ideas more than welcome.
Brave browser ships a feature which downloads and runs Kubo.
We want to measure the number of Brave IPFS nodes on the public network.
@lidel said they announce themselves as `kubo/0.16.0/brave` and that we could find them by:

```
ipfs id QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN | jq .AgentVersion
```
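Given agent-version strings from a crawl, counting Brave nodes would then be a simple filter on the `/brave` suffix per @lidel's note above; the sample data below is made up:

```python
# Sketch: count Brave-shipped Kubo nodes from a list of agent-version strings
# (e.g. collected by a crawler, or via `ipfs id <peer> | jq .AgentVersion`).
# The sample list is illustrative only.

def count_brave_nodes(agent_versions: list[str]) -> int:
    """Brave ships Kubo with an agent version ending in '/brave'."""
    return sum(1 for av in agent_versions if av.endswith("/brave"))

agents = ["kubo/0.16.0/brave", "kubo/0.18.1", "kubo/0.16.0/brave", "go-ipfs/0.12.0"]
print(count_brave_nodes(agents))  # 2
```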
Context: ProbeLab is monitoring the uptime and performance of several PL websites at https://probelab.io/websites/. Those sites are pinned in two stable providers (among other nodes that decide to pin these sites in the P2P network): i) PL's pinning cluster, and ii) Fleek's cluster.
One of the things we're monitoring is whether those stable providers are continuously making those sites available.
Assumption: We've worked closely with both the team that operates PL's pinning cluster and Fleek to make sure everything is in place and correctly configured (e.g., all nodes are running the Accelerated DHT Client) to reprovide the CIDs for the websites, so we've been expecting the situation to be rather stable. Stable here means websites are pinned to 7 nodes from PL's pinning cluster and 2 nodes from Fleek's fleet.
Results: Our results are presented under each website's results page, e.g., https://probelab.io/websites/blog.ipfs.tech/#website-trend-hosters-blogipfstech for https://blog.ipfs.tech.
This is a tracking issue for the resolution of the situation. Tagging @gmasgras and @cewood for the PL team and will propagate further to Fleek folks.
We've recently started measuring the performance of PL websites over kubo. We've been presenting some of these results in our weekly reports and we're also now putting more results at probelab.io (e.g., https://probelab.io/websites/protocol.ai/ for protocol.ai). As a way to get more insight into why the performance is what it is, we have collected the number of providers for each one of them. That will enable us to see if, for instance, there are no providers for a site.
We've found an unexpected result, which might make sense if one gives it deeper thought: there are a ton of unreachable providers for most of the websites we're monitoring, as shown in the graph below for protocol.ai. Note that there should be two stable providers for protocol.ai, i.e., that's where we currently pin content.
This happens because clients fetch the site, re-provide it and then leave the network, leaving stale records behind. In turn, this means that popular content, which is supposed to be privileged due to the content-addressing nature of IPFS, is actually disadvantaged, because clients have to contact tens of "would-be" providers before they find one that is actually available.
I'm starting this issue to raise attention to the problem, which should be addressed ASAP, IMO. We've previously discussed a couple of fixes in Slack, such as setting a TTL for provider records equal to the average uptime of the node publishing the record. However, this would be a breaking protocol change and would therefore not be easy to deploy before the Composable DHT is in place. Turning off reproviding (temporarily, until we have the Composable DHT) could be another avenue to fix this issue.
Other ideas are more than welcome. Tagging people who contributed to the discussion earlier, or would likely have ideas, or be aware of previous discussion around this issue: @Jorropo @guillaumemichel @aschmahmann @lidel @dennis-tra
RFM-16 suggests the following method for testing bitswap efficacy:
Pick a large number of random CIDs (as many as needed in order to be able to arrive to a statistically safe conclusion) and share them with all the nodes involved in the experiment.
and then:
Carry out Bitswap discovery for these CIDs.
This might very well do the trick, particularly in a closed network.
However, I'd like to suggest an alternative to consider that could work "in the wild." Given a set of peers (i.e. `ipfs swarm peers`), request their current wantlist (i.e. `ipfs bitswap wantlist --peer={PEER_ID}`). Then, poll that wantlist to see how long certain CIDs stay on it. This metric, the average lifespan of a CID on a wantlist, could be very useful towards getting a sense of the overall user experience of an IPFS node user.
It was also suggested (I believe by @guseggert) that this "average lifespan of a wantlist entry" metric could be rolled into `ipfs stat`.
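A minimal sketch of the polling idea, with the polls simulated in-memory; in practice `record_poll` would be fed by repeated `ipfs bitswap wantlist --peer={PEER_ID}` calls:

```python
# Sketch: estimate the average lifespan of a CID on peers' wantlists by
# polling them periodically. A CID's lifespan is approximated as the span
# between the poll where it first appeared and the poll where it vanished.

class WantlistTracker:
    def __init__(self):
        self.first_seen = {}  # (peer, cid) -> poll time when first observed
        self.lifespans = []   # completed lifespans, in seconds

    def record_poll(self, peer: str, cids: set[str], now: float):
        # Close out entries that disappeared since the last poll.
        for (p, cid), t0 in list(self.first_seen.items()):
            if p == peer and cid not in cids:
                self.lifespans.append(now - t0)
                del self.first_seen[(p, cid)]
        # Start tracking newly appearing entries.
        for cid in cids:
            self.first_seen.setdefault((peer, cid), now)

    def average_lifespan(self) -> float:
        return sum(self.lifespans) / len(self.lifespans) if self.lifespans else 0.0

t = WantlistTracker()
t.record_poll("Peer1", {"cidA", "cidB"}, now=0)
t.record_poll("Peer1", {"cidB"}, now=30)  # cidA gone after ~30s
t.record_poll("Peer1", set(), now=90)     # cidB gone after ~90s
print(t.average_lifespan())  # 60.0
```

The resolution is bounded by the polling interval, so short-lived entries would be over-estimated by up to one interval.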
Thank you for your consideration!
Right now, the weekly report includes a snapshot in time for a specific week:
This is useful for understanding the current distribution, but does not help with building intuition about trends: how fast new versions are adopted, or whether specific older versions ramp down differently over multiple weeks.
We have historical data, so perhaps we could create a visualization: a line plot where the X-axis is time (last 12 months) and the Y-axis is the % of peers running a specific Kubo version that week (week-to-week). Similar to this (webextension version plot from the Firefox add-on store):
This would be similar to existing:
But focused on percentages and a bigger time window (12 months).
@dennis-tra is this feasible with existing data and tooling, or too much of an ask?
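Assuming weekly per-version peer counts can be exported from the existing data, the transformation such a plot needs could look like this (the data shape and numbers are made up):

```python
# Sketch: turn weekly (week, version, peer_count) records into the per-version
# %-share series that a 12-month trend plot would consume.

from collections import defaultdict

def version_shares(records):
    """records: iterable of (week, version, count) -> {version: [(week, pct), ...]}"""
    totals = defaultdict(int)
    for week, _version, count in records:
        totals[week] += count
    series = defaultdict(list)
    for week, version, count in records:
        series[version].append((week, 100.0 * count / totals[week]))
    return dict(series)

records = [
    ("2023-W20", "kubo/0.18", 800), ("2023-W20", "kubo/0.20", 200),
    ("2023-W21", "kubo/0.18", 700), ("2023-W21", "kubo/0.20", 300),
]
print(version_shares(records)["kubo/0.20"])  # [('2023-W20', 20.0), ('2023-W21', 30.0)]
```

Each series is one plot line; normalising to percentages per week keeps the lines comparable even as the network size changes.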
I have expanded the scope of this issue to be feedback on the various website-monitoring reports that have come in during 202302 and 202303. I'll consider this done when we have a first draft that I would feel comfortable sharing with other leaders and not needing to be there to answer/explain it. After that we can develop a separate process for how we report ongoing observations, questions, and suggestions.
This concerns https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-7/ipfs/README.md#website-monitoring
First off, thanks for adding this! Good stuff.
A few things that I think would be helpful to document:
Request from @BigLep in FIL slack (#probe-lab channel).
The ProbeLab team is currently running a continuous experiment to measure the IPFS DHT Publish & Lookup performance (see details here: https://probelab.io/ipfsdht/#performance). There is a request to do the same for IPNI indexers, ideally using the same set of nodes to avoid extra costs.
From @BigLep: "Stopwatch starts when we begin the `GET /routing/v1/providers/{CID}` (link) call and the stopwatch ends when the HTTP request completes."
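A minimal sketch of that stopwatch using Python's standard library; the endpoint and CID are placeholders, and the actual measurement nodes would presumably instrument this differently:

```python
# Sketch: time a `GET /routing/v1/providers/{CID}` call against an IPNI
# endpoint, per the stopwatch definition above (request start until the
# HTTP response completes).

import time
import urllib.request

def provider_url(endpoint: str, cid: str) -> str:
    """Build the delegated-routing providers URL for a CID."""
    return f"{endpoint}/routing/v1/providers/{cid}"

def time_provider_lookup(endpoint: str, cid: str) -> float:
    """Elapsed seconds from request start until the response is fully read."""
    start = time.monotonic()
    with urllib.request.urlopen(provider_url(endpoint, cid)) as resp:
        resp.read()  # stopwatch stops when the request completes
    return time.monotonic() - start

# Hypothetical usage (not run here):
# latency = time_provider_lookup("https://cid.contact", "bafybei...")
```

Running this from the same set of nodes as the DHT experiment would make the two latency series directly comparable.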
I would like data like this:
| Stream Handler | Seen |
| --- | --- |
| /ipfs/bitswap/1.0.0 | 123456 |
| /ipfs/id/1.0.0 | 234567 |
Hi,
I had a thought that IPFS nodes may self-organize into higher level clusters because of the way connections are formed and maintained.
More specifically, knowing the high level of node churn, do longer-running nodes tend towards connecting to each other?
I would think not, because of the way latency is prioritised, which results in nodes organising based on distance. Is this good?
More generally, how do we measure self-organization, and could we not use this to our benefit too?
We've been observing a slight increase in the DHT Lookup Latency since around the mid of June 2023. The increase is in the order of ~10% and is captured in our measurement plots at: https://probelab.io/ipfskpi/#dht-lookup-performance-long-plot. This is a tracking issue to identify the cause of the latency increase.
Below is the short-term latency graph (https://probelab.io/ipfsdht/#dht-lookup-performance-overall-plot):
Observing the CDFs of the DHT Lookup latency across different regions over time, we see a clear move towards the right of the plot for several regions, most notably for `eu-central`, but also `ap-south-1` and `af-south-1` (in Week 27).
Week 24 (2023-06-12/18)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-24/ipfs#dht-performance
Week 25 (2023-06-19/25)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-25/ipfs#dht-performance
Week 26 (2023-06-26 - 2023-07-02)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-26/ipfs#dht-performance
Week 27 (2023-07-03/09)
https://github.com/plprobelab/network-measurements/tree/master/reports/2023/calendar-week-27/ipfs#dht-performance
The latency seems to be heading back down, but we're not sure if there's a specific reason for this behaviour. Some thoughts:
- The release of `kubo-v0.21.0-rc1` and later releases at the end of June: https://github.com/ipfs/kubo/releases/tag/v0.21.0-rc1. There doesn't seem to be anything that could affect performance there, other than "Saving previously seen nodes for later bootstrapping" (https://github.com/ipfs/kubo/blob/release-v0.21.0/docs/changelogs/v0.21.md#saving-previously-seen-nodes-for-later-bootstrapping), but even in this case, the original bootstrappers are not removed.
- The switch to Boxo in `kubo-v0.20.0`: https://github.com/ipfs/kubo/releases#boxo-under-the-covers. Not sure if something in there could affect performance (?)
- Most nodes in the network still run `kubo-v0.18` as per: https://probelab.io/ipfsdht/#kubo-version-distribution, but there are about 3.5k nodes on `v0.20.0` and `v0.21.0`, which could be enough to cause this slight increase.
- Something specific to the `eu-central` node.

Any other thoughts @Jorropo @aschmahmann @lidel @hacdias ?
Summarising several approaches from out-of-band discussions here to have them documented.
Description: The kubo README file is stored and advertised by every node in the network (ipfs/kubo#9590 (comment)), regardless of whether the node starts out as a client or a server. The provider records for this README become stale after a while, either because peers are categorised as clients (and are therefore unreachable), or because they leave the network (churn). But the records are still there until they expire. We could count the number of providers across the network for the kubo README CID and approximate the network-wide client vs server ratio.
Downside: This approach would only count kubo nodes (which is a good start and likely the vast majority of clients).
Description: We have:
Maybe we can estimate what share of queries should come across the honeypot and then estimate the total number of clients in the network, based on the number of unique clients the honeypot sees. This would be a low overhead setup and may allow better estimates with more honeypots.
Downside: The approach would need maintenance and infrastructure cost of the honeypot(s).
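The extrapolation itself is simple once the expected query fraction is estimated; the fraction and count below are made-up inputs, not measured values:

```python
# Sketch of the honeypot extrapolation above: if we can estimate what fraction
# of all client queries a honeypot should intercept, the total client
# population follows from the unique clients it actually saw.

def estimate_total_clients(unique_clients_seen: int, expected_query_fraction: float) -> int:
    """Extrapolate the total client count from the honeypot's observed sample."""
    return round(unique_clients_seen / expected_query_fraction)

# e.g. a honeypot expected to be seen by 2% of clients that observed 1,500 unique ones
print(estimate_total_clients(1500, 0.02))  # 75000
```

More honeypots would both shrink the variance of the estimate and help validate the assumed query fraction.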
Description: Another approximation we could get is by running multiple DHT servers; think of a few baby hydras. Each DHT server would log all PeerIDs sending DHT requests, and we would get the % of clients vs servers by correlating the logs with crawl results. This gives the % of clients and servers observed; we average the results of all DHT servers, and extrapolate this number to get the total number of clients, given that we know the total number of servers.
Downside: The approach would need maintenance and infrastructure cost of the DHT servers/baby-hydras.
Description: We capture the total number of unique PeerIDs through the bootstrappers. What this gives us is the "total number of nodes that joined the network as either clients or servers". Given that we have the total number of DHT server nodes from the Nebula crawler, we can get a pretty good estimate of the number of clients that join the network. The calculation would simply be: `Total number of unique PeerIDs (seen by bootstrappers) - DHT server PeerIDs (found by Nebula)`. In this case, clients will include other non-kubo clients (whether based on the Go IPFS codebase, Iroh, etc.) and js-ipfs-based ones too (Node.js, and maybe browser, although the browser ones shouldn't be talking to the bootstrappers anyway).
Downside: We rely on data from a central point - the bootstrappers.
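The Approach 4 calculation is a plain set difference over PeerIDs; the IDs below are made up for illustration:

```python
# Sketch of the Approach 4 calculation: unique PeerIDs seen by the
# bootstrappers, minus the DHT server PeerIDs found by Nebula.

def estimate_clients(bootstrap_peerids: set[str], dht_server_peerids: set[str]) -> int:
    """Clients ~= peers that contacted the bootstrappers but never act as DHT servers."""
    return len(bootstrap_peerids - dht_server_peerids)

bootstrappers_saw = {"PeerA", "PeerB", "PeerC", "PeerD"}
nebula_servers = {"PeerB", "PeerD"}
print(estimate_clients(bootstrappers_saw, nebula_servers))  # 2
```

Working on sets (rather than counts) also handles the deduplication for free, since a peer seen by both sources is only subtracted once.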
Approach 4 seems like the easiest way to get quick results. All the rest would be good to have, to compare results and gain extra data points.
Any other views, or suggested approaches?
I just realized from looking at https://github.com/protocol/network-measurements/blob/master/reports/2023/calendar-week-21/ipfs/README.md#website-monitoring that we're missing specs.ipfs.tech. Please add. This is needed as part of external monitoring of specs.ipfs.tech per ipfs/specs#418
Specifically trying to understand the differences in TTFB performance between ipfs.io gateways and go-ipfs nodes across the following resolution paths:
These performance metrics can help inform where bottlenecks are happening, and how to think through setting a reasonable SLA for services that build on top of IPFS.
Hi,
We're working on analyzing the security of Filecoin's Consensus mechanism, which significantly relies on timing assumptions.
To define a model that best captures reality, it would be extremely helpful to know the current latencies in Filecoin's mainnet, in particular the latencies associated with broadcasting.
Any information would be valuable! (Mean latency per sender/receiver, complete distribution of latencies, 95th percentile, etc.)
I'm wondering what would be the outcome of the following experiment.
Have we verified that they will receive the EU-based copy? @dennis-tra did we look into this aspect for the experiments we reported here: https://gateway.ipfs.io/ipfs/bafybeidbzzyvjuzuf7yjet27sftttod5fowge3nzr3ybz5uxxldsdonozq ?
Step 3 above would also be worth a look, i.e., do both PeerIDs end up in all the provider records published in the system? Or if not, at which fraction of the records do we have both peers?