
Comments (7)

**yiannisbot** commented on July 29, 2024

Thanks for raising this issue @davidd8! It's indeed a worthwhile experiment to run, which we will prioritise as team capacity allows. Unfortunately, we do not currently have results for this specific ask. However, I'm including below some results comparing content retrieval through go-ipfs (a controlled experiment, where we create CIDs in one geographic location and retrieve them from another) with content retrieval through the gateway (using PL gateway logs). The results are not directly comparable, though, i.e., we do not request the same CIDs through the different resolution paths, so take them with a pinch of salt.

The first figure shows the CDF of the go-ipfs retrieval latency broken down into steps. Notice that Fig. d) includes the Bitswap discovery process, which times out at 1 sec and always comes back empty in this experiment, because the CIDs we retrieve were artificially created, i.e., no other node can have this content. Fig. e) shows the DHT lookup only (i.e., excluding the Bitswap step and its 1 sec timeout), while f) shows the Bitswap content fetch.
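For readers less familiar with these plots: a CDF point (x, y) says that a fraction y of retrievals finished within x seconds. A minimal sketch of how such curves are computed, with made-up latencies standing in for the measurement data:

```python
import random

def empirical_cdf(samples):
    """Return (value, cumulative fraction) pairs, i.e., the points of the CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]

# Hypothetical per-step latencies in seconds (the real data comes from the
# controlled go-ipfs experiment and is not reproduced here).
random.seed(42)
dht_walk = [random.uniform(0.2, 1.5) for _ in range(1000)]
content_fetch = [random.uniform(0.3, 1.2) for _ in range(1000)]
total = [d + f for d, f in zip(dht_walk, content_fetch)]

cdf = empirical_cdf(total)
# Fraction of retrievals completing within 2 seconds:
within_2s = sum(1 for t in total if t <= 2.0) / len(total)
```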

[Screenshot 2022-04-06 at 09:22:55: go-ipfs retrieval latency CDFs]
[Screenshot 2022-03-10 at 14:39:28]

The next figure shows results from the logs of one of PL's gateways, albeit from a small sample.

[Screenshot 2022-04-06 at 09:27:30: PL gateway latency CDF]


One other note related to your request: points 1 and 2, i.e., retrieval that includes a Hydra vs. retrieval that doesn't, cannot be configured at experiment setup time, as Hydras are integrated into the system and are expected (if spread correctly across the keyspace) to always be present. So, to figure out the difference between these points, we'd have to carry out retrievals and then filter and analyse the requests that did vs. those that did not hit a Hydra peer. Alternatively, one would have to connect to and direct a request at a Hydra peer explicitly - is that what you're looking for? Let me know if you have any other thoughts on how the experiment could be set up.
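The retrospective-filtering idea can be sketched as follows. The record format, field names, and PeerIDs below are all hypothetical; a real analysis would parse the actual retrieval logs and the published list of Hydra heads:

```python
# Hypothetical set of known Hydra head PeerIDs (placeholder strings).
hydra_peers = {"12D3KooWHydraA", "12D3KooWHydraB"}

# Hypothetical retrieval records: which peers were contacted during each lookup.
retrievals = [
    {"cid": "cid-1", "peers": ["12D3KooWabc", "12D3KooWHydraA"], "latency_s": 0.9},
    {"cid": "cid-2", "peers": ["12D3KooWdef", "12D3KooWghi"], "latency_s": 1.4},
]

# Split the sample by whether any contacted peer was a Hydra head.
with_hydra = [r for r in retrievals if hydra_peers & set(r["peers"])]
without_hydra = [r for r in retrievals if not hydra_peers & set(r["peers"])]
```

Comparing the latency distributions of `with_hydra` vs. `without_hydra` would then answer points (1) and (2).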

from network-measurements.

**davidd8** commented on July 29, 2024

Thanks @yiannisbot, these graphs are great. If I'm understanding them correctly, the PL gateway has a p80 TTFB of < 75ms, whereas the fastest go-ipfs path (yellow) has a p80 TTFB of ~1.2s (500ms DHT walk + 700ms content fetch). Is there a way to see the breakdown of response latency for the PL gateway based on whether it hits the cache, hits the node store, or looks up a peer / the DHT?

Your suggestion to retrospectively filter the results and compare requests that do and do not hit a Hydra peer makes sense as a way to answer (1) and (2).

For peering (3), an experiment with known CIDs could help show the difference in TTFB between go-ipfs nodes and the ipfs.io gateway. The tricky part may be controlling for IPFS nodes' proximity to the content. It may also be interesting to see whether there's a significant difference between IPFS clusters, e.g., comparing response times for a Pinata CID vs. a web3.storage CID.


**yiannisbot** commented on July 29, 2024

These are very interesting things to find out. The best approach will be to set up an experiment to benchmark these different content-fetch avenues. I'll try to do that ASAP.

@davidd8 clarifying a few things relating to the graphs and your comments above:

> the PL gateway has a p80 TTFB of < 75ms

Yes, but this includes the cached content, which is everything to the left of the dashed line in the bottom figure above. So it's not a "clean" p80; we'd have to exclude the cached responses to get a more realistic figure. Intuitively, a clean p80 (non-cached content) through the gateway would have to be at least as high as the DHT path, no?

> Is there a way to see the breakdown of response latency for the PL gateway based on whether it hits the cache, node store, or looks up a peer / DHT?

See above regarding the dashed line. The figure shows this, but it's hard to tell from this plot what the CDF looks like for cached vs. non-cached content. We'd probably have to reprocess the data and produce a different plot.

> the fastest go-ipfs TTFB (yellow) has a p80 TTFB of ~1.2s (500ms DHT walk + 700ms content fetch)

A few notes here:

  • The overall latency, if you include the initial Bitswap discovery, is closer to 2.1s (see the leftmost top figure). But this includes the Bitswap discovery step, which takes 1s by itself. It's debatable whether it's worth waiting that long, but that's the default behaviour of go-ipfs. Of course, if content is found through Bitswap, the fetch completes faster. We're looking into this with this RFM: https://github.com/protocol/network-measurements/blob/master/RFMs.md#rfm-16--effectiveness-of-bitswap-discovery-process
  • The fastest DHT walk indeed takes 500ms.
  • Content fetch for eu_central (yellow) indeed takes about 700ms. But that's not TTFB; it's the time to fetch the whole content, so more like TTLB :) The file size we experimented with is 0.5 MB, so Bitswap seems to take quite a while to complete even for such a small file. This needs looking into too.

I'll put together an experiment description and come back to you for a review.


**yiannisbot** commented on July 29, 2024

> Intuitively, a clean p80 (non-cached content) through the gateway would have to be at least as high as the DHT way, no?

Correction: this holds unless the content is found (through Bitswap) at one of the clustered peers, i.e., not cached locally, but found in one of the directly connected storage clusters, which are queried through Bitswap before getting to the DHT step.


**yiannisbot** commented on July 29, 2024

Grant to support this experiment: https://www.dgm.xyz/grants/g5riWRq4BkhDvl9vsjda


**dennis-tra** commented on July 29, 2024

A quick update on the results of DHT lookup times with and without considering Hydra peers.

The graphs below come from two parallel DHT experiments with six nodes each. Those nodes ran in different AWS regions, similar to the experimental setup in our SIGCOMM paper. The first set of six nodes was operated without any special modifications. The second set of six nodes was fed a list of Hydra head PeerIDs that were excluded from any DHT operation; e.g., if another DHT server node serves us a Hydra head as a closer peer to the desired CID, we don't take that response into account. Unlike the experiment in the SIGCOMM paper, we bumped the kubo version to 0.16.0.
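In pseudocode, the exclusion rule amounts to dropping denylisted peers from every "closer peers" response before the walk continues. A sketch in Python (the actual modification lives in the Go DHT client; the PeerIDs below are placeholders):

```python
# Placeholder Hydra head PeerIDs fed to the modified nodes.
hydra_denylist = {"12D3KooWHydra1", "12D3KooWHydra2"}

def filter_closer_peers(closer_peers, denylist=hydra_denylist):
    """Drop denylisted (Hydra) peers from a DHT 'closer peers' response."""
    return [p for p in closer_peers if p not in denylist]

# A response naming a Hydra head and a regular peer: only the regular
# peer is considered for the next step of the walk.
next_step = filter_closer_peers(["12D3KooWHydra1", "12D3KooWRegular"])
```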

With that being said, these are the results:

Hydras Excluded:

[chart]

Hydras Included:

[chart]

Some numbers:

| Region | Retrievals | Publications |
| --- | --- | --- |
| me_south_1 | 2155 | 587 |
| ap_southeast_2 | 2779 | 538 |
| af_south_1 | 2086 | 587 |
| us_west_1 | 2787 | 540 |
| eu_central_1 | 2652 | 565 |
| sa_east_1 | 2762 | 562 |

Retrieval latency percentiles (seconds):

| Region | 50th | 90th | 95th |
| --- | --- | --- | --- |
| af_south_1 | 3.434 | 4.663 | 5.533 |
| ap_southeast_2 | 3.649 | 7.128 | 11.181 |
| eu_central_1 | 2.389 | 6.216 | 9.010 |
| me_south_1 | 2.829 | 3.690 | 4.336 |
| sa_east_1 | 3.407 | 7.264 | 9.795 |
| us_west_1 | 2.879 | 8.585 | 11.558 |
| ALL | 3.125 | 6.048 | 9.298 |
| Both DHT walks | 0.637 | 1.374 | 2.240 |
| One DHT walk | 0.319 | 0.687 | 1.120 |
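For reference, percentile summaries like the ones above can be reproduced from raw latency samples with Python's standard library. A sketch (the actual analysis pipeline isn't shown in this thread):

```python
import statistics

def summarize(latencies):
    """Return the 50th/90th/95th percentiles of a list of latencies (seconds)."""
    # statistics.quantiles with n=100 returns 99 cut points,
    # where q[i] is the (i+1)-th percentile.
    q = statistics.quantiles(latencies, n=100)
    return {"p50": q[49], "p90": q[89], "p95": q[94]}
```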

As these graphs are hard to compare, @WYiluo went ahead and produced these comparison charts:

50th percentile:

[chart]

90th percentile:

[chart]

95th percentile:

[chart]

Note: these charts compare not just the DHT walks but the overall retrieval latencies!




**yiannisbot** commented on July 29, 2024

Closing this issue, as Hydras are no longer part of the network. We're gearing up to produce comparative performance results between the DHT and IPNI. Initial thinking is outlined here: #47, with separate issues or RFMs to follow.

If more help is needed on this, please re-open or create a new issue. Thanks.

