t2bot / matrix-media-repo

Highly configurable multi-domain media repository for Matrix.

Home Page: https://docs.t2bot.io/matrix-media-repo

License: MIT License


matrix-media-repo's Introduction

matrix-media-repo

MMR is a highly configurable multi-homeserver media repository for Matrix. It is an optional component of your homeserver setup, and recommended only for large individual servers or hosting providers with many servers.

If you're looking for an S3 connector, please consider using synapse-s3-storage-provider instead.

Smaller homeservers can still set this up, though they may find it difficult to deploy or use. A high level of knowledge regarding the Matrix homeserver stack is assumed.

Documentation and support

Matrix room: #media-repo:t2bot.io

Documentation: docs.t2bot.io

Developers

MMR requires compiling at least once before it'll run in a development setting. See the compilation steps before continuing.

This project offers a development environment you can use to test against a client and homeserver.

As a first-time setup, run:

docker run --rm -it -v ./dev/synapse-db:/data -e SYNAPSE_SERVER_NAME=localhost -e SYNAPSE_REPORT_STATS=no matrixdotorg/synapse:latest generate

Then, whenever you want to bring the services online, run docker compose -f dev/docker-compose.yaml up. The homeserver will be behind an nginx reverse proxy which routes media requests to http://host.docker.internal:8001. To test accurately, it is recommended to add the following homeserver configuration to your media repo config:

name: "localhost"
csApi: "http://localhost:8008" # This is exposed by the nginx container

Federated media requests should function normally with this setup, though the homeserver itself will be unable to federate. For convenience, an element-web instance is also hosted at the root of the same port.

A PostgreSQL server is also started by the Docker stack for convenience. To use it, add the following to your configuration:

database:
  postgres: "postgres://postgres:[email protected]:5432/postgres?sslmode=disable"
  pool:
    maxConnections: 10
    maxIdleConnections: 10

Note that the PostgreSQL image is insecure and not recommended for production use. It also does not follow best practices for database management; use at your own risk.

Note: Running the Go tests requires Docker, and may pollute your cached images with tons of layers. It is suggested to clean these images up manually from time to time, or rely on an ephemeral build system instead.

matrix-media-repo's People

Contributors

999eagle, ananace, anoadragon453, bbaovanc, buffless-matt, darkkirb, dependabot[bot], dereisele, dkasak, fizzadar, fusl, gregblake, half-shot, halkeye, jaywink, jcgruenhage, jellykells, matmaul, mweinelt, russelldavies, silkeh, sorunome, spantaleev, t3chguy, targodan, thestranjer, tleydxdy, turt2live

matrix-media-repo's Issues

Cache popular media in memory for a short time

The idea behind this is to reduce the load on the disk and make requests a bit faster. When an image is uploaded to the repository in a popular room, many servers will be requesting that media to show thumbnails for their users. A similar problem can occur when multiple users of the configured homeservers are requesting the media.

The caching would use a priority-based system where popular media is kept in the cache longer compared to low traffic media. This would be scored simply on the number of downloads. The mechanics are listed below, and the numbers for them should be configurable:

  • Raw files and thumbnails are distinct items - this is to avoid us caching the giant raw file when everyone wants the 100kb thumbnail
  • Only cache/track files under 100mb
  • Only cache up to 1gb (~100 files pessimistically)
  • Only cache media with more than 5 downloads. When an item falls below 5, drop it from the cache.
  • Uploaded files automatically get scored at 1 download (in the case of images, the thumbnail gets scored at 1 and the raw file at zero).
  • The number of downloads should only be for the last 30 minutes (to prevent a download every 15 minutes perpetually keeping the file tracked)
  • When a file not yet tracked by the cache becomes eligible for caching, it can only evict files with the same number of downloads or fewer. It should evict the least-downloaded file with the oldest last_accessed timestamp first.
  • A larger file with a higher download count may evict multiple files to make room in the cache, provided the cache cannot fit the large file as-is and the smaller files each have 5 or more fewer downloads.

Another thing to consider would be scoring files that are ramping up to or plateauing at a large number of downloads higher than those which are ramping down or seeing very little sustained traffic. The rules above try to approximate that in a coarse way that is also memory-efficient; a sketch of the eviction rule follows.
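
A minimal sketch of the eviction rule, in Go. The cachedFile type and its fields are illustrative assumptions rather than MMR's actual types, and the constants mirror the configurable numbers above:

package cache

import (
    "sort"
    "time"
)

// cachedFile is an illustrative record, not MMR's actual type.
type cachedFile struct {
    key          string
    sizeBytes    int64
    downloads    int // downloads within the last 30 minutes
    lastAccessed time.Time
}

const (
    maxFileBytes  = 100 * 1024 * 1024  // only cache/track files under 100mb
    maxCacheBytes = 1024 * 1024 * 1024 // only cache up to 1gb total
    minDownloads  = 5                  // items below this are dropped
)

// evictFor returns which entries to evict so that candidate fits, or
// false when the candidate may not displace enough lower-scored entries.
func evictFor(cache []cachedFile, usedBytes int64, candidate cachedFile) ([]cachedFile, bool) {
    if candidate.sizeBytes >= maxFileBytes || candidate.downloads < minDownloads {
        return nil, false
    }
    if usedBytes+candidate.sizeBytes <= maxCacheBytes {
        return nil, true // fits with no eviction
    }
    // Consider the least-downloaded entries first, breaking ties on the
    // oldest last access.
    victims := append([]cachedFile(nil), cache...)
    sort.Slice(victims, func(i, j int) bool {
        if victims[i].downloads != victims[j].downloads {
            return victims[i].downloads < victims[j].downloads
        }
        return victims[i].lastAccessed.Before(victims[j].lastAccessed)
    })
    var evicted []cachedFile
    for _, v := range victims {
        if usedBytes+candidate.sizeBytes <= maxCacheBytes {
            break
        }
        if v.downloads > candidate.downloads {
            return nil, false // only equal-or-lower scores may be evicted
        }
        // Evicting more than one file is only allowed when the victims
        // trail the candidate by 5+ downloads.
        if len(evicted) > 0 && candidate.downloads-v.downloads < 5 {
            return nil, false
        }
        usedBytes -= v.sizeBytes
        evicted = append(evicted, v)
    }
    if usedBytes+candidate.sizeBytes > maxCacheBytes {
        return nil, false
    }
    return evicted, true
}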

Cache URL previews

We'll cache previews indefinitely. Operators can clear the URL previews through the API or through the database.

ts = Math.min(ts, now()) - this avoids cases where people say "give me github.com in the year 5000".
Further, round ts down to intervals of 1 hour to avoid cases where people want previews every few seconds. A sketch of both steps appears after the list below.

Logic for determining which cached preview to return:

  1. If a record exists for the given ts, return it
  2. If no exact record exists but a newer one does, return the next-newest record
  3. Generate a new preview
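
A minimal Go sketch of the clamping and bucketing, with the database lookup left out:

package main

import (
    "fmt"
    "time"
)

// bucketTs clamps a requested preview timestamp to "now" and rounds it
// down to a 1-hour interval, per the rules above.
func bucketTs(requested, now time.Time) time.Time {
    if requested.After(now) {
        requested = now // ts = min(ts, now()): no previews from the year 5000
    }
    return requested.Truncate(time.Hour)
}

func main() {
    now := time.Now().UTC()
    fmt.Println(bucketTs(now.AddDate(2975, 0, 0), now)) // clamped to the current hour
}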

Support pluggable storage mechanisms

Synapse is going the way of storage providers so, in theory, it can use S3 to store media. We should follow suit.

This will also help with HA (#15).

By using some kind of URI structure in the database, we could use something like s3://bucket/path or mr://hostname/path. One routing sketch is below.
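
A hedged sketch of what that routing could look like in Go; the Store interface and backend types here are invented for illustration and do not reflect MMR's real datastore code:

package storage

import (
    "fmt"
    "net/url"
)

// Store is an illustrative storage interface.
type Store interface {
    Put(path string, data []byte) error
}

type s3Store struct{ bucket, prefix string }

func (s *s3Store) Put(path string, data []byte) error { return nil } // stub

type localStore struct{ host, root string }

func (l *localStore) Put(path string, data []byte) error { return nil } // stub

// storeFor picks a backend from a datastore URI such as s3://bucket/path
// or mr://hostname/path.
func storeFor(uri string) (Store, error) {
    u, err := url.Parse(uri)
    if err != nil {
        return nil, err
    }
    switch u.Scheme {
    case "s3":
        return &s3Store{bucket: u.Host, prefix: u.Path}, nil
    case "mr":
        return &localStore{host: u.Host, root: u.Path}, nil
    default:
        return nil, fmt.Errorf("unknown datastore scheme %q", u.Scheme)
    }
}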

Quarantine media API

This is two parts: supporting the Synapse endpoint and adding a new one for a specific media record (a rough handler shape for the latter is sketched after the list).

  • Implement POST /_matrix/client/r0/admin/quarantine_media/<room_id>?access_token=<access_token> (from synapse)
    • If one doesn't already exist, add an endpoint to synapse to get the media in a room (probably admin)
  • Add POST /_matrix/client/r0/admin/quarantine_media/<origin>/<media_id>?access_token=<access_token> (for specific media)
  • Quarantined media should optionally return a template image instead of 404ing.
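
A rough Go shape for the per-media endpoint; the mux wiring and response body are hypothetical, and a real handler would also verify that the access token belongs to an admin:

package main

import (
    "net/http"
    "strings"
)

const prefix = "/_matrix/client/r0/admin/quarantine_media/"

func quarantineHandler(w http.ResponseWriter, r *http.Request) {
    parts := strings.Split(strings.TrimPrefix(r.URL.Path, prefix), "/")
    if len(parts) != 2 {
        http.Error(w, `{"errcode":"M_INVALID_PARAM"}`, http.StatusBadRequest)
        return
    }
    origin, mediaID := parts[0], parts[1]
    // ... check the access_token, then flag origin/mediaID as quarantined
    // in the database ...
    _, _ = origin, mediaID
    w.Write([]byte("{}"))
}

func main() {
    http.HandleFunc(prefix, quarantineHandler)
    http.ListenAndServe(":8001", nil)
}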

HA/LB support

This includes cross-instance locking to avoid duplicate processing. The idea is to run multiple instances of the repo to help spread load. In HA mode, each instance should be configured to use the REST API so that they share a configuration. One locking option is sketched below.
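
One way to get cross-instance locking without new infrastructure is a Postgres advisory lock, since every instance already shares the database. This is only an illustration of the idea, not MMR's actual HA implementation:

package ha

import (
    "context"
    "database/sql"
)

// withMediaLock runs fn while holding an exclusive advisory lock for the
// given media key, so only one instance processes it at a time. Advisory
// locks are per-connection, so one connection is pinned for the duration.
func withMediaLock(ctx context.Context, db *sql.DB, key int64, fn func() error) error {
    conn, err := db.Conn(ctx)
    if err != nil {
        return err
    }
    defer conn.Close()
    if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", key); err != nil {
        return err
    }
    defer conn.ExecContext(ctx, "SELECT pg_advisory_unlock($1)", key)
    return fn()
}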

More granular download permissions

  • Always require auth
  • On local content
  • On remote content (always)
  • On the first download of remote content
  • Only for thumbnails
    • Repeat options for local/remote from this list
  • Never
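
Illustrative Go decision logic for the options above; the mode names are hypothetical, not real MMR config values:

package perms

// requiresAuth sketches one way to express the options; thumbnails would
// repeat the local/remote modes with the thumbnail flag set.
func requiresAuth(mode string, isLocal, isThumbnail, firstRemoteDownload bool) bool {
    switch mode {
    case "always":
        return true
    case "local":
        return isLocal
    case "remote":
        return !isLocal
    case "remote_first":
        return !isLocal && firstRemoteDownload
    case "thumbnails_local":
        return isThumbnail && isLocal
    case "thumbnails_remote":
        return isThumbnail && !isLocal
    case "never":
        return false
    }
    return true // fail closed on unknown modes
}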

Delete (quarantine) media when it's redacted

Synapse issue: matrix-org/synapse#1263

There's no spec for redactions deleting media, so one may have to be created. When the homeserver detects that a media object is dereferenced everywhere, it should contact the media repo and ask it to delete the file. The homeserver should ignore the response code entirely (200, 404, etc. are all valid) as it would be just a suggestion, and the repo may not implement it.

The homeserver would also be responsible for tracking remote media being redacted. This is to prevent the rest of the world from recommending deletion to the media repo. A shared-secret auth on the API would probably be enough to verify that the right homeserver is contacting the repo.

This may be possible to do with the new pluggable storage layer in Synapse's develop branch. In theory, a shim could be written to proxy the calls to us, bypassing Synapse. The suggested call is sketched below.
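
A sketch of the homeserver-side call described above; the endpoint path and shared-secret header are hypothetical, since no such API is specified yet:

package shim

import (
    "fmt"
    "net/http"
)

// suggestDelete tells the media repo that a file is dereferenced
// everywhere. Any response code (200, 404, ...) is acceptable, since
// deletion is only a suggestion the repo may not implement.
func suggestDelete(repoURL, origin, mediaID, sharedSecret string) {
    url := fmt.Sprintf("%s/_media_repo/delete/%s/%s", repoURL, origin, mediaID)
    req, err := http.NewRequest(http.MethodDelete, url, nil)
    if err != nil {
        return
    }
    req.Header.Set("X-Shared-Secret", sharedSecret)
    if resp, err := http.DefaultClient.Do(req); err == nil {
        resp.Body.Close()
    }
}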

Stream files as they are being downloaded from the remote server

Currently, files are downloaded completely, analyzed, then sent to the requesting party. Instead, the repo should stream data to the requesting party as it arrives over the wire, performing analysis after the entire file has been received.

This is most noticeable on large files.
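
In Go, io.TeeReader gives this shape almost directly: bytes are copied to the client as they arrive, while a second copy accumulates for post-transfer analysis. A sketch, assuming the remote response body is already open:

package relay

import (
    "bytes"
    "io"
    "net/http"
)

// relayRemoteMedia streams bytes to the client as they arrive from the
// remote server; io.TeeReader captures a copy for analysis afterwards. A
// real implementation would spool large files to disk instead of memory.
func relayRemoteMedia(w http.ResponseWriter, remote io.Reader) error {
    var buf bytes.Buffer
    if _, err := io.Copy(w, io.TeeReader(remote, &buf)); err != nil {
        return err
    }
    return analyze(buf.Bytes())
}

// analyze stands in for the post-transfer checks (type sniffing, hashing, etc.).
func analyze(data []byte) error { return nil }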

Config options for content types to thumbnail

Currently the list seems to exclude GIFs. Instead of a dedicated config option, just make a configurable list of content types to thumbnail. This would also allow for future PDF/video/audio(?) thumbnailing. A minimal check is sketched below.
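
A minimal check against such a list, assuming a hypothetical list-valued config option:

package thumbs

// canThumbnail checks a content type against the configured list.
func canThumbnail(contentType string, configuredTypes []string) bool {
    for _, t := range configuredTypes {
        if t == contentType { // e.g. "image/gif", "application/pdf"
            return true
        }
    }
    return false
}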

Compare performance against synapse/dendrite

This media repo is intended to be the repo for multiple homeservers, so it is expected to see N times the load of synapse/dendrite (where N is the number of homeservers the deployment is responsible for). It also needs to perform quickly so that clients don't get flak for having slow media (most users aren't going to notice or care what is powering their media).

The results of this should be published somewhere and be updated regularly. Possible expansions would be to have a test on commits/PRs to verify the times aren't on an upward trend.

Off the top of my head, here's a few things to compare against synapse/dendrite:

  • Uploads (small/med/large) - random data to avoid dedupe
  • Uploads (small/med/large) - same data to encourage dedupe
  • Thumbnailing local media (small/med/large) - cold & warm cache
  • Downloading local media (small/med/large) - cold & warm cache
  • Downloading remote media (small/med/large) - cold & warm cache (ignore latency of remote server as there's not much we can do about that)
  • Thumbnailing remote media (small/med/large) - cold & warm cache (ignore latency of remote server)
  • Processing 10,000 concurrent uploads - cold start
  • Processing 10,000 concurrent local downloads - cold start
  • Processing 10,000 concurrent local thumbnails - cold start
  • Processing 10,000 concurrent remote downloads - cold start (ignore latency of remote server)
  • Processing 10,000 concurrent remote thumbnails - cold start (ignore latency of remote server)
  • Processing 10,000 concurrent random requests (uploads, download, thumbnail - local and remote - avoid and encourage dedupe) - attempt to cause caches to become useless and cold
  • ... and other things to generally try and break the repo

Note: 10k concurrent requests may be ambitious. It should handle at least 1k before exhibiting symptoms of load. One of the upload benchmarks is sketched below.
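
As a starting point, here is what the first bullet could look like as a Go benchmark in a _test.go file; the URL and access token are placeholders for a local deployment:

package bench

import (
    "bytes"
    "crypto/rand"
    "net/http"
    "testing"
)

// BenchmarkUploadSmallRandom uploads small random payloads so that
// deduplication never kicks in.
func BenchmarkUploadSmallRandom(b *testing.B) {
    payload := make([]byte, 64*1024)
    for i := 0; i < b.N; i++ {
        rand.Read(payload) // fresh random bytes defeat dedupe
        req, err := http.NewRequest(http.MethodPost,
            "http://localhost:8001/_matrix/media/r0/upload?access_token=TOKEN",
            bytes.NewReader(payload))
        if err != nil {
            b.Fatal(err)
        }
        req.Header.Set("Content-Type", "application/octet-stream")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            b.Fatal(err)
        }
        resp.Body.Close()
    }
}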

REST API for configuration

Config options:

restApi:
  enabled: false # default is 'true'
  token: "your_token_here"

If the REST API is enabled, the config file is only used for getting the database connection string.

Command line parameter for config file location

Needing to have both the migrations folder and the config file in the working directory makes putting this into a Docker container kind of hard. Being able to configure the location of the migrations folder or the config file (or both) would make this a lot easier.
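
A minimal sketch of such flags using Go's standard flag package; the flag names are suggestions, not necessarily what was shipped:

package main

import "flag"

func main() {
    configPath := flag.String("config", "media-repo.yaml", "path to the config file")
    migrationsPath := flag.String("migrations", "./migrations", "path to the migrations folder")
    flag.Parse()
    // ... load config from *configPath, run migrations from *migrationsPath ...
    _, _ = configPath, migrationsPath
}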
