yelp / casper

A fast web application platform built in Rust and Luau
License: Other
Right now we copy each key into a new table, which forces you to add a new line every time you add a config option. We should simply return the entire cache_entry, i.e. refactor:
cacheability_info = {
    is_cacheable = true,
    ttl = cache_entry['ttl'],
    pattern = cache_entry['pattern'],
    cache_name = cache_name,
    reason = nil,
    vary_headers_list = vary_headers_list,
    bulk_support = cache_entry['bulk_support'],
    id_identifier = cache_entry['id_identifier'],
    dont_cache_missing_ids = cache_entry['dont_cache_missing_ids'],
    enable_invalidation = cache_entry['enable_invalidation'],
    refresh_cache = false,
    num_buckets = cache_entry['buckets'],
}
into:
cacheability_info = {
    is_cacheable = true,
    cache_entry = cache_entry,
    cache_name = cache_name,
    reason = nil,
    vary_headers_list = vary_headers_list,
    refresh_cache = false,
}
Context
During debugging, it's helpful to know how much TTL is left for a cache key. This helps when investigating whether the cache is stale and/or the TTL is working correctly.
Example use case:
We don't track timestamps in Elasticsearch for optimization reasons. The cache was returning old data, and we wanted to confirm that the cache was last updated before the write happened.
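A minimal sketch of how remaining TTL could be derived, assuming we store the write time next to each cached value (the helper names here are hypothetical, not Casper's actual storage layer):

```python
import time

def wrap_for_cache(value, ttl_seconds):
    # Hypothetical helper: record the write timestamp alongside the value.
    return {"value": value, "stored_at": time.time(), "ttl": ttl_seconds}

def remaining_ttl(entry):
    # Seconds of TTL left for a cached entry; 0 if it has already expired.
    elapsed = time.time() - entry["stored_at"]
    return max(0.0, entry["ttl"] - elapsed)

entry = wrap_for_cache({"name": "Bob"}, ttl_seconds=300)
print(round(remaining_ttl(entry)))  # 300
```

Note that if the backing store is Cassandra, CQL can also report this directly via `SELECT TTL(column) FROM table`, which may avoid tracking timestamps ourselves.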
/status?check_cassandra=true
should also return the C* nodes that the driver is using. Ideally they should be split between local and remote nodes, so that we can check that the driver is using the right ones. That'd make it easier to debug cases where the driver is misconfigured or behaving strangely.
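A sketch of what the local/remote split could look like in the status response; the field names and datacenter labels here are illustrative assumptions, not the actual endpoint's schema:

```python
def build_cassandra_status(driver_hosts, local_dc):
    # Split the driver's known C* hosts into local vs. remote by datacenter.
    local = [h["address"] for h in driver_hosts if h["dc"] == local_dc]
    remote = [h["address"] for h in driver_hosts if h["dc"] != local_dc]
    return {"cassandra": {"local_nodes": local, "remote_nodes": remote}}

hosts = [
    {"address": "10.0.0.1", "dc": "uswest1"},
    {"address": "10.1.0.1", "dc": "useast1"},
]
print(build_cassandra_status(hosts, local_dc="uswest1"))
```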
Currently "make dev" fails outside of Yelp devboxes because there's no "/nail":
=> make dev
....
docker run -d -t \
-p 32927:8888 \
-e "PAASTA_SERVICE=spectre" \
-e "PAASTA_INSTANCE=test" \
-v /nail/etc:/nail/etc:ro \
-v /nail/srv/configs/spectre:/nail/srv/configs/spectre:ro \
-v /var/run/synapse/services/:/var/run/synapse/services/:ro \
-v /Users/abrousse/git/casper:/code:ro \
--name=spectre-dev-abrousse spectre-dev-abrousse
e61b51810461c509e15a176397bbc9fd78af769f0b814f32c7e5017a6511e0e8
docker: Error response from daemon: Mounts denied:
The paths /var/run/synapse/services/ and /nail/srv/configs/spectre and /nail/etc
are not shared from OS X and are not known to Docker.
We use https://github.com/Yelp/casper/blob/master/lua/caching_handlers.lua#L9 to extract the ids from the URL. However, it's only called when we store the result in the cache, not when we read it. Since our get_bucket logic returns different results based on whether we have a list of ids, we end up writing the result to a different bucket than the one the read expects.
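A toy model of why the write and read sides can disagree; the bucketing scheme below is a simplified assumption, not Casper's real get_bucket implementation:

```python
import hashlib

def _stable_hash(s):
    # Deterministic hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def get_bucket(cache_key, ids, num_buckets=16):
    # Requests with extracted ids bucket by the ids; others bucket by the raw key.
    if ids:
        return _stable_hash(",".join(sorted(ids))) % num_buckets
    return _stable_hash(cache_key) % num_buckets

# The write path extracts ids from the URL; the read path currently passes none,
# so the two sides can land in different buckets and the read misses:
write_bucket = get_bucket("/v1/biz?ids=1,2", ids=["1", "2"])
read_bucket = get_bucket("/v1/biz?ids=1,2", ids=None)
print(write_bucket, read_bucket)
```

Running id extraction on both paths (so both calls receive the same ids) would make the two buckets agree.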
Right now Casper relies on the caller's Smartstack to add the X-Smartstack-Source header. This is a bit surprising, given that the request to Casper will have a source of casper.main or similar. I think it makes more sense for Casper to set this header itself when it proxies the request.
The specific reason I'd like this is the way I plan to implement this logic in Envoy: a special priority-routing configuration with Casper as the most preferred priority, failing over gradually to the real service.
Example:
If X-Smartstack-Source is present:
P0 az-local endpoints (habitat)
P1 region-local endpoints (region)
P2 the rest (ecosystem)
If X-Smartstack-Source is not present
P0 casper endpoints
P1 az-local endpoints (habitat)
P2 region-local endpoints (region)
P3 the rest (ecosystem)
There might be other ways to implement this, but this seems the most straightforward for now. It also gives us the advantage of failing over from Spectre gradually, based on health status, rather than all at once.
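The two ladders above can be sketched as a simple header check; the labels are illustrative shorthand for the priority tiers, not real Envoy configuration:

```python
def routing_priorities(headers):
    # If the request already carries X-Smartstack-Source, it has been through
    # Casper once, so Casper is skipped to avoid loops.
    if "X-Smartstack-Source" in headers:
        return ["az-local (habitat)", "region-local (region)", "rest (ecosystem)"]
    return ["casper", "az-local (habitat)", "region-local (region)", "rest (ecosystem)"]

print(routing_priorities({})[0])                                      # casper
print(routing_priorities({"X-Smartstack-Source": "casper.main"})[0])  # az-local (habitat)
```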
Similar to what https://github.com/openzipkin/zipkin/ does, we can have Casper create the Cassandra schema when it starts. This would simplify the logic used in itests and acceptance tests.
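The usual trick is to make startup idempotent with `IF NOT EXISTS` statements. A sketch, with hypothetical keyspace and table names (Casper's real schema differs):

```python
SCHEMA_STATEMENTS = [
    # IF NOT EXISTS makes re-running these on every startup safe.
    """CREATE KEYSPACE IF NOT EXISTS spectre
       WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""",
    """CREATE TABLE IF NOT EXISTS spectre.cache_store (
           cache_key text PRIMARY KEY,
           body blob,
           headers text
       )""",
]

def create_schema(execute):
    # Run each statement through the driver's execute function at startup.
    for stmt in SCHEMA_STATEMENTS:
        execute(stmt)

ran = []
create_schema(ran.append)  # stand-in for a real session.execute
print(len(ran))  # 2
```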
Here's the output of make itest on my MacBook:
=> make itest
....
bin/docker-compose-1.19.0 -f itest/docker-compose.yml up -d spectre backend cassandra Creating itest_syslog_1 ... done
Creating itest_cassandra_1 ... done
Creating itest_backend_1 ... done
Creating itest_cassandra_1 ...
Creating itest_spectre_1 ... done
bin/docker-compose-1.19.0 -f itest/docker-compose.yml exec -T cassandra /opt/setup.sh
ERROR: No container found for cassandra_1
make: *** [run-itest] Error 1
Looking into the Cassandra image, it seems the problem is Cassandra refusing to start:
=> bin/docker-compose-1.19.0 -f itest/docker-compose.yml up cassandra
Starting itest_cassandra_1 ... done
Attaching to itest_cassandra_1
cassandra_1 | Cassandra 2.0 and later require Java 7u25 or later.
itest_cassandra_1 exited with code 1
...and that's probably because Java is broken inside the Docker image:
=> bin/docker-compose-1.19.0 -f itest/docker-compose.yml images
Container Repository Tag Image Id Size
-------------------------------------------------------------------------
itest_backend_1 itest_backend latest 35ad69bdd1ab 173 MB
itest_cassandra_1 itest_cassandra latest 909b2b4d803d 577 MB
itest_spectre_1 spectre-dev-abrousse latest f11d204004e9 484 MB
itest_syslog_1 itest_syslog latest 566da6e9c7a1 154 MB
=> docker run -ti 909b2b4d803d bash
dckruser@2c86d8724510:/$ java -version
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error occurred during initialization of VM
java.lang.NoClassDefFoundError: java/lang/ref/Reference$1
at java.lang.ref.Reference.<clinit>(Reference.java:235)
Currently we rely on configuration to know which parts of an HTTP request are part of the cache key (code link). Instead of a static configuration option per service, it'd be neat to make this more granular and dynamic, driven by the server through a Vary HTTP header in the response: Vary: Accept-Encoding, Cookie. See these docs for more info.
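A minimal sketch of a Vary-driven cache key: the key covers the URL plus only those request headers the server listed in its Vary response header (the function name and key format are assumptions for illustration):

```python
import hashlib

def vary_cache_key(url, request_headers, vary_header):
    # Include only the headers named in Vary; unlisted headers don't fragment the cache.
    parts = [url]
    for name in [h.strip().lower() for h in vary_header.split(",") if h.strip()]:
        parts.append("%s=%s" % (name, request_headers.get(name, "")))
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

k1 = vary_cache_key("/v1/biz", {"cookie": "a=1"}, "Accept-Encoding, Cookie")
k2 = vary_cache_key("/v1/biz", {"cookie": "a=2"}, "Accept-Encoding, Cookie")
print(k1 != k2)  # True: different Cookie values get separate cache entries
```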
By default nginx drops any header that contains underscores: http://nginx.org/en/docs/http/ngx_http_core_module.html#underscores_in_headers
When an instance fails its healthcheck, paasta sends a SIGTERM, waits a bit, and then kills the process if it hasn't stopped yet.
Right now we don't catch SIGTERM, so the process dies as soon as it receives the signal. All in-flight requests are lost, and clients see this as a 503.
This idea stemmed from a discussion with @mattiskan. If a high QPS service is proxied through Casper, the service naturally gets provisioned less and less over time. However, if Casper is down, we're in a tight spot: the traffic gets forwarded to the underlying service (because of the "fail-safe" philosophy baked in proxied_through), the underlying service is under-provisioned and may error/time out, causing user-facing problems until either (a) Casper is brought back up or (b) sufficient capacity is added to the underlying service.
To avoid these situations, let's add a new per-namespace configuration option to let a fixed portion of hit traffic through, something like hit_passthrough: 0.65 (name TBD).
In case of a cache hit, we'd still forward the request in Casper's post-request callback (with 65% likelihood). This would not only ensure we keep some capacity in proxied services, but could also be a useful tool to gauge whether a particular service would collapse if Casper went down (currently the only way to find out is to shut down Casper for real, which is not ideal!).
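The sampling itself is a one-liner; a sketch of the decision, assuming the option name hit_passthrough from above (the function name is hypothetical):

```python
import random

def should_passthrough(hit_passthrough, rng=random.random):
    # On a cache hit, decide whether to still forward the request upstream.
    return rng() < hit_passthrough

# With hit_passthrough: 0.65, roughly 65% of hits are forwarded.
rng = random.Random(0)
sample = sum(should_passthrough(0.65, rng.random) for _ in range(10000))
print(0.6 < sample / 10000 < 0.7)  # True
```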
I'm trying to get the build to pass for PR #49, but every time I relaunch I get a different failure. PR #47 seems to be running into the same issues. So far, the affected tests are:
TestPostMethod.test_post_cache_hit_even_if_body_doesnt_match_without_vary
TestPostMethod.test_post_cached_with_id_can_be_purged
TestPostMethod.test_post_always_cached_for_extended_json_content_type
I've never gotten a failure while running make itest locally.
As far as I can tell, lua-cassandra doesn't automatically detect changes in the ring topology; you have to manually call cluster:refresh() for it to pick up any change.
This is a problem when we're replacing nodes in the C* cluster, since the driver will keep trying to connect to the old nodes and ignore any new hosts.
We should just call refresh() every N seconds, ideally after the response has been returned.
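A sketch of the rate-limited refresh, written as a wrapper so it can run in a post-response hook; the class and method names are illustrative, with only cluster:refresh() taken from lua-cassandra:

```python
import time

class RefreshingCluster:
    # Call the underlying cluster's refresh() at most once per interval.
    def __init__(self, cluster, interval_seconds=60):
        self.cluster = cluster
        self.interval = interval_seconds
        self.last_refresh = 0.0

    def maybe_refresh(self, now=None):
        now = time.time() if now is None else now
        if now - self.last_refresh >= self.interval:
            self.cluster.refresh()
            self.last_refresh = now
            return True
        return False

class FakeCluster:
    def __init__(self):
        self.refreshes = 0
    def refresh(self):
        self.refreshes += 1

c = FakeCluster()
r = RefreshingCluster(c, interval_seconds=60)
print(r.maybe_refresh(now=100), r.maybe_refresh(now=120), r.maybe_refresh(now=161))
# True False True: only calls past the 60s interval trigger a real refresh
```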
At the moment we have ID extraction support in URLs (through bulk endpoint support and enable_id_extraction), but surrogate keys would help invalidate groups of resources across caches. See these docs on how Fastly uses them.
Another big difference from our current support is that surrogate keys are driven by a header returned by the server. Keys can be arbitrary, representing experiment cohorts or deploy versions (things that aren't in the request or response object). For example:
200 OK
Surrogate-Key: elite musician myexperiment-enabled
Content-Type: text/json
{"name": "Bob", "last_name": "Dylan", "num_reviews": 42}
Surrogate key support enables invalidating, for instance, all "musician" or all "elite" resources at once.
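Under the hood this needs a reverse index from surrogate key to cache keys. A minimal in-memory sketch (the class and method names are assumptions; a real implementation would persist this next to the cache):

```python
from collections import defaultdict

class SurrogateIndex:
    # Map each surrogate key from a Surrogate-Key response header to the set of
    # cache keys it covers, so one purge call can invalidate the whole group.
    def __init__(self):
        self.by_surrogate = defaultdict(set)

    def record(self, cache_key, surrogate_header):
        for key in surrogate_header.split():
            self.by_surrogate[key].add(cache_key)

    def purge(self, surrogate_key):
        # Return (and forget) the cache keys to delete for one surrogate key.
        return self.by_surrogate.pop(surrogate_key, set())

idx = SurrogateIndex()
idx.record("/biz/bob", "elite musician myexperiment-enabled")
idx.record("/biz/ann", "musician")
print(sorted(idx.purge("musician")))  # ['/biz/ann', '/biz/bob']
```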
It would be useful to add support for caching POST requests.
Since some of the response ids might be part of the request body, we should be able to attach the request body (or just the keys needed) to the response when building the cache key.
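One simple way to fold the body into the key is to hash it; a sketch under the assumption that the full body (rather than selected fields) goes into the key, with a hypothetical key format:

```python
import hashlib
import json

def post_cache_key(url, body_bytes):
    # Include a digest of the POST body in the cache key, since the ids that
    # select the response may live in the body rather than the URL.
    digest = hashlib.sha256(body_bytes).hexdigest()
    return "%s|%s" % (url, digest)

# Serialize with sort_keys so semantically equal bodies hash identically.
body = json.dumps({"ids": [1, 2, 3]}, sort_keys=True).encode()
print(post_cache_key("/v1/biz", body))
```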