mozilla-services / autopush

Python Web Push Server used by Mozilla

Home Page: https://autopush.readthedocs.io/

License: Mozilla Public License 2.0

Python 99.45% Makefile 0.16% Nix 0.07% Lua 0.25% Shell 0.02% Dockerfile 0.05%

mozilla python services-engineering-team webpush

autopush's Issues

Add iOS Feedback monitoring and cleanup

Add a regular, daily check of the iOS feedback service to pull IDs that need to be removed from APNs.

Bugfix: Disconnect duplicate UAIDs, sweep all connections for excessive idle time

On connect, we should sendClose any prior client connection of the same UAID.

Also, we should have a separate scheduled task that sweeps through all the connections and, if the last ping was more than N minutes/seconds ago, initiates a server-side ping.

Or we turn on autobahn's websocket auto-ping feature.

Fix cancelled deferred error

We're cancelling Deferreds that have already been cancelled. These should be caught more gracefully, and all Deferreds should include a trap for cancellation (CancelledError).

While working on a different project with a different service it occurred to me that we have a lot of moving parts. There are some services that exist in a "dev", "stage", "prod" or other such environment which are isolated from the others. This may be fine, or it may cause unusual issues due to information isolation (e.g. state existing in "stage" will not exist in "dev" leading to changes potentially not being properly reflected in all calls)

It would be useful for data responses to return the environment they are in as an optional component. This way, systems that may be consuming or processing results can easily confirm the source of the data that they are using and not accidentally "cross the streams".

thoughts?

Enable Sourcegraph

I want to use Sourcegraph for autopush code search, browsing, and usage examples. Can an admin enable Sourcegraph for this repository? Just go to https://sourcegraph.com/github.com/mozilla-services/autopush. (It should only take 30 seconds.)

Thank you!

Aggressively delete old TCP connections on re-registration

Idle TCP connections continue to slowly grow. We should cull old connections at new device registrations.

Bugfix: Track Deferreds for cancelling

All Deferreds generated in a client should be tracked, so they can be cancelled if a client suddenly disconnects.
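
A minimal sketch of such tracking, assuming each tracked object exposes addBoth and cancel the way Twisted's Deferred does. The DeferredTracker class and its method names are hypothetical and not part of autopush.

```python
class DeferredTracker(object):
    """Hypothetical helper: remembers outstanding deferreds for one client."""

    def __init__(self):
        self._pending = set()

    def track(self, d):
        """Remember d until it fires, then forget it automatically."""
        self._pending.add(d)

        def _forget(result):
            self._pending.discard(d)
            return result  # pass the result (or failure) through untouched

        d.addBoth(_forget)
        return d

    def cancel_all(self):
        """Cancel everything still outstanding, e.g. on a sudden disconnect."""
        # Iterate over a copy: cancelling fires callbacks that mutate the set.
        for d in list(self._pending):
            d.cancel()
        self._pending.clear()
```

A connection object could own one tracker, wrap every deferred it creates in track(), and call cancel_all() from its connection-lost handler.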

Cyclone handlers in websocket.py overload cyclone's settings

@jrconlin already fixed the endpoint to use 'ap_settings'; this should be fixed in websocket.py's handler as well.

Integrate Loads into Autopush test process

Add load testing component for potential continuous deployment.

remove conflicting requirements from pip/setup.py

requirements.txt and setup.py include a number of conflicting package versions.

Add /status to endpoint and connection node

Both autopush/autoendpoint should have a /status that returns OK.

Add provisioned error metrics.

Several parts of the code catch and retry provisioned throughput exceptions. These parts of code should emit metrics indicating that along with context.

For example: 'error.provisioned.new_connection', 'error.provisioned.store_notification'.
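
One way this could look, as a sketch only: a decorator that retries on a throughput error and bumps a named counter each time. The exception class, the metrics Counter, and the decorator name are stand-ins invented for illustration; the real code would use the actual DynamoDB exception and metrics client.

```python
import functools
from collections import Counter


class ProvisionedThroughputExceeded(Exception):
    """Stand-in for the real DynamoDB throughput exception (assumption)."""


metrics = Counter()  # stand-in for a real metrics client (assumption)


def count_provisioned_errors(metric_name, retries=1):
    """Retry on throughput errors, incrementing metric_name on each one."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return func(*args, **kwargs)
                except ProvisionedThroughputExceeded:
                    metrics[metric_name] += 1
                    if attempt == retries:
                        raise  # out of retries, let the caller see it
        return wrapper
    return decorator
```

Each retrying call site would pass its own metric name, which provides the context the issue asks for.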

Add a TTL header for incoming messages

The Web Push spec defines a TTL header for app servers to specify message retention duration. Setting TTL: 0 indicates a message is ephemeral, and can be dropped immediately if the client is disconnected. This feature was requested by the Loop team to match the current Loop Push behavior, so let's add it to Simple Push.

DynamoDB doesn't support TTLs, but we can work around this by storing the expiration time and dropping expired messages when the client reconnects. We can also specify a TTL header in the response with the actual time-to-live...so, if an app server sends too many updates, we can indicate we're dropping messages via TTL: 0.

The Web Push spec also says that an omitted TTL is equivalent to TTL: 0. This could be a back-compat issue for us, since the current behavior is to store messages if the client is disconnected. OTOH, I like that storing messages requires an explicit opt-in, so it'd be nice to make this change while we have few users.
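
The workaround of storing an expiration time and dropping expired messages on reconnect could be sketched as follows. The function names and the "expires_at" field are hypothetical; messages are plain dicts here rather than DynamoDB items.

```python
import time


def expiry_for(ttl_seconds, now=None):
    """Absolute expiration time to store alongside a message."""
    now = time.time() if now is None else now
    return now + ttl_seconds


def drop_expired(messages, now=None):
    """Keep only messages whose stored expiration is still in the future."""
    now = time.time() if now is None else now
    return [m for m in messages if m["expires_at"] > now]
```

With this comparison, a message stored with TTL: 0 is already expired at the moment it is written, so it is never delivered after a reconnect.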

Tag WebSocket server metrics with the device ID

SimplePushServerProtocol only tags emitted metrics with the user agent. It'd be helpful to include the uaid_hash and remote-ip, too, like we do for AutoendpointHandler. The former would help with triaging Bugzilla tickets; the latter with tracking down connection counts per client.

Also, some other metrics we could collect:

Whether a mobilenetwork field is present in the client hello, indicating the client is on a cellular network. If we wanted, we could break this down by carrier, too—mobilenetwork["mcc"] and mobilenetwork["mnc"].
Malformed and incomplete messages that call bad_message.
Whether the client was already connected (break down by local vs. remote).

Silence timeouts in endpoint

Every user timeout for an HTTP connection to the endpoint results in a logged Sentry error. These shouldn't be logged, as user timeouts are normal.

Log the websocket user-agent

Bug: pinger's callback fails to accept its result

_check_router should take the result of the pinger.register call, along with the bool, but it doesn't take a result or check it.

https://github.com/mozilla-services/autopush/blob/master/autopush/websocket.py#L174

Ensure we're using the recommended TLS settings

The Wiki has a list of recommended settings, as does the Go server.

Add optional CORS support for Update messages

For releng testing

Fix 5xx error in endpoint

A logging addition most likely has resulted in the increased 5xx rate on the endpoint server.

Adaptive pong delay

We should pong no more than once every 5 seconds. However, our pong delay is a hard-coded value that doesn't consider client latencies. Instead, we should track the last time we saw a ping: if that was more than 5 seconds ago, respond immediately; otherwise, subtract the elapsed time from 10 (to ensure the client doesn't time out waiting for our pong).
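
Read literally, the rule above is a small piece of arithmetic. The sketch below is one interpretation, not the project's implementation; the function and parameter names are made up for illustration.

```python
def pong_delay(seconds_since_last_ping, min_interval=5.0, window=10.0):
    """Seconds to wait before answering the current ping.

    min_interval: the 5 second floor between pongs from the issue text.
    window: the 10 second figure the elapsed time is subtracted from.
    """
    if seconds_since_last_ping > min_interval:
        return 0.0  # the client has been quiet long enough, answer now
    return window - seconds_since_last_ping
```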

Abuse Mitigation Bug

Meta bug for dealing with detecting and preventing system abuse.

Abusive behaviors may include:

Excessive channel registrations
Excessive UAID registrations
Excessive posts to invalid or inactive UAID/Channels

Add a UDP wake-up bridge

Client bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1157696

Now that @bbangert's fantastic routing refactor has landed, we can add a separate bridge for the TEF UDP wake-up platform. Unlike our other bridges, this one only sends a signal for the client to reconnect if it's offline.

It works like this:

When the client connects to the push server, it opens a UDP port and includes the IP, port, and carrier info in the handshake: {"messageType":"hello","uaid":"","channelIDs":[],"wakeup_hostport":{"ip":"127.0.0.1","port":65535},"mobilenetwork":{"mcc":"001","mnc":"01","netid":"001-01.default"}}. We'll store these params for the wake-up request.
If the server supports UDP wake-up, and the connection idles for 10 seconds, the server closes the WebSocket with a special close code (4774). The client won't reconnect when it receives this close code.
When the push server wants to wake up a client, it sends an HTTP request to the TEF wake-up server (authenticated with a TLS client cert issued by TEF).
The TEF wake-up server sends a UDP packet to the client, which prompts the client to reconnect to the push server.

If we see a client isn't online, we can just store the message, and send the ping afterward. The client will get that message the next time it reconnects.
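
As a rough illustration of the first step, the wake-up parameters can be pulled out of the hello message shown above. The wakeup_params function and the shape of the dict it returns are assumptions; only the field names come from the handshake example.

```python
import json

HELLO = (
    '{"messageType":"hello","uaid":"","channelIDs":[],'
    '"wakeup_hostport":{"ip":"127.0.0.1","port":65535},'
    '"mobilenetwork":{"mcc":"001","mnc":"01","netid":"001-01.default"}}'
)


def wakeup_params(raw_hello):
    """Return the params to store for a wake-up request, or None if absent."""
    msg = json.loads(raw_hello)
    hostport = msg.get("wakeup_hostport")
    network = msg.get("mobilenetwork")
    if msg.get("messageType") != "hello" or not hostport or not network:
        return None  # client did not offer UDP wake-up
    return {
        "ip": hostport.get("ip"),
        "port": hostport.get("port"),
        "mcc": network.get("mcc"),
        "mnc": network.get("mnc"),
        "netid": network.get("netid"),
    }
```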

Switch to server driven pings

Add logic for server to send pings to client.

include "ping_interval":_seconds_ as part of "hello" response
include logic for websocket connector to send pings at a given interval.
if interval needs to change, send a "long form" ping message containing the new ping_interval.

consider:

On idle disconnect, note remote IP.
Add logic to allow for IP block based ping interval specifications.
Add logic to dynamically adjust ping delays.

SimplePushServerProtocol name refactor

SimplePushServerProtocol should be renamed to PushServerProtocol, as it's not just SimplePush. All tests, etc. should be updated appropriately.

This is blocked by #57.

Add Datadog stats output

Make storage/router DynamoDB table names configurable

The router/storage table names used must be configurable, as 'router' and 'storage' could cause name conflicts.
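
A sketch of what configurable table names might look like with argparse. The flag names below are illustrative assumptions, with defaults matching the current 'router' and 'storage' names.

```python
import argparse


def build_parser():
    """Build a parser with hypothetical table-name options."""
    parser = argparse.ArgumentParser(description="table name options (sketch)")
    parser.add_argument("--router_tablename", default="router",
                        help="DynamoDB router table name")
    parser.add_argument("--storage_tablename", default="storage",
                        help="DynamoDB storage table name")
    return parser
```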

WebPush style messages should allow CLI config for notifications per retrieval

After #57 lands, the number of notifications per fetch is hardcoded to 10. This should be configurable via a CLI option.

Add Sentry support

Exceptions should be logged out using Sentry.

Add WebPush style individual message delivery w/data

Modify the architecture to store individual messages with payload

Implementation Plan

To retain efficient message delivery and storage, a separate router and db table is needed for storing individual messages. Our current router table has some space to store arbitrary additional data, but in this case we need to know in advance whether the channel exists or not, to avoid storing lots of arbitrary data that may be for expired/unregistered channels.

This first version will have a new table, keyed by hash/range key: uaid-channel_id / timestamp

Using a channel_id of "" combined with uaid will return a structure with a 'channels' field that has all registered channels.

After the router lookup indicates to use individual message delivery, the individual message router will verify the channel id is valid, and deliver/store it as needed.
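
The key scheme can be illustrated with an in-memory stand-in for the new table. Everything here is a sketch under assumptions: the exact hash key encoding, the range key used for the channel registry row, and the class and method names are not specified in the plan.

```python
REGISTRY_RANGE_KEY = 0  # assumed range key for the channel registry row


def hash_key(uaid, channel_id=""):
    """Hash key of the form uaid-channel_id; "" selects the registry row."""
    return "%s-%s" % (uaid, channel_id)


class MessageTable(object):
    """In-memory stand-in for the individual message table."""

    def __init__(self):
        self._rows = {}  # (hash key, range key) -> item

    def channels(self, uaid):
        """All channels registered for uaid, from the channel_id "" row."""
        row = self._rows.get((hash_key(uaid), REGISTRY_RANGE_KEY))
        return row["channels"] if row else set()

    def register_channel(self, uaid, channel_id):
        key = (hash_key(uaid), REGISTRY_RANGE_KEY)
        self._rows.setdefault(key, {"channels": set()})["channels"].add(channel_id)

    def store(self, uaid, channel_id, timestamp, data):
        """Store one message, refusing channels that were never registered."""
        if channel_id not in self.channels(uaid):
            return False
        self._rows[(hash_key(uaid, channel_id), timestamp)] = {"data": data}
        return True
```

Checking the registry row before writing is what keeps data for expired or unregistered channels out of the table.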

Websocket Changes

The router type will be stored for the connection, so several methods can act appropriately based on whether the prior simplepush style delivery is used, or webpush style. The following methods will need to be modified to toggle implementation based on router_type:

register
unregister
process_notifications
ack

Database Changes & Additions

A new table will be added that stores individual messages and all the registered channels for each user.

Router

A new 'webpush router' for webpush style data retention and routing will be added. It retains individual messages, instead of collapsing by version.

Feature: Add structured log output for endpoint

The endpoint should log more details on every hit, using the structured log output that is set up, to track:

User-Agent
Status code
Hash of UAID
Remote-IP
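
A minimal sketch of one such log record, emitted as a JSON line. The field names and the choice of SHA-256 for the UAID hash are assumptions made for illustration; the raw UAID is deliberately left out of the record.

```python
import hashlib
import json


def endpoint_log_record(user_agent, status_code, uaid, remote_ip):
    """Build one structured log line for an endpoint hit."""
    uaid_hash = hashlib.sha256(uaid.encode("utf-8")).hexdigest()
    return json.dumps({
        "user_agent": user_agent,
        "status_code": status_code,
        "uaid_hash": uaid_hash,
        "remote_ip": remote_ip,
    }, sort_keys=True)
```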

Add GCM and APNS Bridges

Register a producer for the websocket protocol for more intelligent pausing

Add sphinx docs for autopush

Sphinx docs for a full doc site for autopush should be added. The restful docs for the registration endpoint should be switched to the sphinx http restful plugin for nice HTTP docs.

Add Routing for IWakeup and TEF Wakeup protocol

Retain connect info from clients, and add IWakeup interface and TEF wake-up protocol, deploy to staging for TEF testing.

Look into switching to Python3 SSL backport

Look into switching twisted's SSL for the Python 3 SSL backport to save ~10 KB per connection.

[what is our current total per connection size?]

Ensure logging stdout matches ops logging format

Send a special WebSocket close code for clients that ping too frequently

For context: https://bugzilla.mozilla.org/show_bug.cgi?id=1152264

In #78, we introduced an adaptive response delay for clients that ping too frequently. Unfortunately, folks are still reporting high battery and data usage, even with the fix in place. This is caused by a bug in the client's adaptive ping logic, and affects all FxOS 2.x releases (1.x is unaffected because it didn't ship with adaptive pings).

Any client can potentially enter this state (especially those on unreliable networks), and there's no recovery apart from manually resetting the prefs. The client patch is in place, but has not yet been uplifted.

On our end, we can detect when clients enter a ping loop, and send a special WebSocket close code (4774). This is normally used for UDP wake-up: if the client detects this code, it won't reconnect, as it expects the server to wake it up for incoming notifications.

The trade-off is that phones on non-TEF networks won't receive any push notifications until their network status changes—either they lose reception and reconnect, or their phone switches between cellular and Wi-Fi. (TEF has their own UDP wake-up platform, so we can actually make this work for them). But it's a small price to pay for battery life and reasonable data usage.

A vague plan:

Remove the adaptive response delay, and disconnect clients that ping too frequently.
On disconnect, store the connection lifespan in DynamoDB. I think this calls for a weighted moving average, to minimize the impact on well-behaved clients that happen to be on spotty networks.
When the client reconnects, look up its average connection lifespan. If it drops below a threshold (15 seconds?), flush any pending messages, then close its connection via self.sendClose(code=4774).
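
The lifespan tracking could be sketched with an exponentially weighted moving average, which is one possible choice of weighting and not something the plan specifies. The function names and the 0.3 weight are assumptions; the 15 second threshold and the 4774 close code come from the plan above.

```python
PING_LOOP_CLOSE_CODE = 4774
LIFESPAN_THRESHOLD = 15.0  # seconds, the tentative threshold from the plan


def update_average(previous_avg, lifespan, weight=0.3):
    """Fold the latest connection lifespan into the stored average."""
    if previous_avg is None:
        return float(lifespan)  # first observation for this client
    return weight * lifespan + (1.0 - weight) * previous_avg


def should_close_on_reconnect(average_lifespan, threshold=LIFESPAN_THRESHOLD):
    """True if the client looks stuck in a ping loop."""
    return average_lifespan is not None and average_lifespan < threshold
```

A lower weight makes the average react more slowly, so a single short connection on a spotty network would not immediately push a well-behaved client under the threshold.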

Add max connections

Expose the autobahn max connection as a config option so we can cut off excess connections before the server overloads.

Add a server backoff protocol

If the server is overloaded, it'll be helpful to have a way to tell the client to go away and reconnect later. https://bugzilla.mozilla.org/show_bug.cgi?id=1184278 tracks the client work to support this.

Example stage/prod configs.

I'll need example stage and prod configs. If they are close to the same, one will suffice.

Switch internal routing from requests lib to twisted's http client

Two blocking operations hit the thread-pool in the endpoint: decryption and internal delivery to a connection node. We should switch to using twisted's http client to reduce the use of the thread-pool.

http://twisted.readthedocs.org/en/latest/web/howto/client.html

make build doesn't work with OSX pypy binary

pip is missing from the pypy tarball:

    $ make build
    make: *** No rule to make target `/Users/rpappalardo/Dropbox/git/services-test/build/autopush/pypy/bin/pypy', needed by `/Users/rpappalardo/Dropbox/git/services-test/build/autopush/pypy/bin/pip'.  Stop.

Add option to lower min ping interval

Right now the ping interval is hard-coded. We need to make it an option so that we can lower it temporarily to fix clients that lowered their value too far.

Modularize prop ping further

Right now each prop ping's code is its own module, but the prop ping isn't very modular inside endpoint.py.

For each prop ping, we should probably have a mapping:

proprietary code | should_store | func
gcm                false          gcm.some_func
tef                true           tef.other_func

Indicating whether we should store the message and/or attempt local delivery, or pass it.
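
The mapping above translates fairly directly into a lookup table. In this sketch, gcm_ping and tef_ping are placeholders for the real per-bridge functions, and dispatch is a hypothetical entry point for endpoint.py.

```python
from collections import namedtuple

PropPing = namedtuple("PropPing", ["should_store", "func"])


def gcm_ping(message):
    return "gcm:%s" % message  # placeholder for the real GCM call


def tef_ping(message):
    return "tef:%s" % message  # placeholder for the real TEF call


PROP_PINGS = {
    "gcm": PropPing(should_store=False, func=gcm_ping),
    "tef": PropPing(should_store=True, func=tef_ping),
}


def dispatch(proprietary_code, message, store):
    """Store the message if the mapping says so, then run the prop ping."""
    entry = PROP_PINGS.get(proprietary_code)
    if entry is None:
        return None  # unknown bridge, nothing to do
    if entry.should_store:
        store(message)
    return entry.func(message)
```

Adding a new bridge then means adding one row to PROP_PINGS rather than another branch in the endpoint handler.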

Startup check of backend

On startup, autopush/autoendpoint should do a preliminary write/read from both DynamoDB tables to ensure they have appropriate permissions.

Setup docker build file that runs it all

The current Dockerfile builds just the project, to run either autopush or autoendpoint.

It'd be useful for rel-eng to have a single docker that can be started and 'does it all'. My thought is to add another dockerfile (docker now lets you have additional docker files and specify the name when building), and have it spin up moto, autopush, and autoendpoint, with auto* using the moto daemon for AWS instead of actual AWS.

Add /status/health or /health

The /status endpoint is a good starting point. I'd also like an endpoint that does a deeper check. Off the top of my head:

make sure dynamodb is working

Add timer on connect for hello

Right now we have clients that connect, and never say hello. We only detect this 5 mins in with the autoPing. We should set a timer for 20 seconds to remove these errant connections earlier.

Bug: Send un-ack'd direct delivery notifications to storage

If a notification is directly delivered but not ack'd, and the client drops, we drop the notification entirely. Per the todo in websocket.py:

    # TODO: Any notifications directly delivered but not ack'd need
    # to be punted to an endpoint router

We should add this code so that the connection node fires off a notification delivery to the router to redeliver these un-ack'd messages.