t2bot / matrix-media-repo

Highly configurable multi-domain media repository for Matrix.

Home Page: https://docs.t2bot.io/matrix-media-repo

License: MIT License


matrix-media-repo's Introduction

matrix-media-repo

MMR is a highly configurable multi-homeserver media repository for Matrix. It is an optional component of your homeserver setup, and recommended only for large individual servers or hosting providers with many servers.

If you're looking for an S3 connector, please consider using synapse-s3-storage-provider instead.

Smaller homeservers can still set this up, though they may find it difficult to deploy or use. A high level of knowledge regarding the Matrix homeserver stack is assumed.

Documentation and support

Matrix room: #media-repo:t2bot.io

Documentation: docs.t2bot.io

Developers

MMR requires compiling at least once before it'll run in a development setting. See the compilation steps before continuing.

This project offers a development environment you can use to test against a client and homeserver.

As a first-time setup, run:

docker run --rm -it -v ./dev/synapse-db:/data -e SYNAPSE_SERVER_NAME=localhost -e SYNAPSE_REPORT_STATS=no matrixdotorg/synapse:latest generate

Then, whenever you want to bring the services online, run docker compose -f dev/docker-compose.yaml up. The homeserver will be behind an nginx reverse proxy which routes media requests to http://host.docker.internal:8001. To test accurately, it is recommended to add the following homeserver configuration to your media repo config:

name: "localhost"
csApi: "http://localhost:8008" # This is exposed by the nginx container

Federated media requests should function normally with this setup, though the homeserver itself will be unable to federate. For convenience, an element-web instance is also hosted at the root of the same port.

A PostgreSQL server is also started by the Docker stack for convenience. To use it, add the following to your configuration:

database:
  postgres: "postgres://postgres:[email protected]:5432/postgres?sslmode=disable"
  pool:
    maxConnections: 10
    maxIdleConnections: 10

Note that the PostgreSQL image is insecure and not recommended for production use. It also does not follow best practices for database management; use at your own risk.

Note: Running the Go tests requires Docker, and may pollute your cached images with tons of layers. It is suggested to clean these images up manually from time to time, or rely on an ephemeral build system instead.

matrix-media-repo's People

Contributors

999eagle, ananace, anoadragon453, bbaovanc, buffless-matt, darkkirb, dependabot[bot], dereisele, dkasak, fizzadar, fusl, gregblake, half-shot, halkeye, jaywink, jcgruenhage, jellykells, matmaul, mweinelt, russelldavies, silkeh, sorunome, spantaleev, t3chguy, targodan, thestranjer, tleydxdy, turt2live

matrix-media-repo's Issues

Cache popular media in memory for a short time

The idea behind this is to reduce the load on the disk and make requests a bit faster. When an image is uploaded to the repository in a popular room, many servers will be requesting that media to show thumbnails for their users. A similar problem can occur when multiple users of the configured homeservers are requesting the media.

The caching would use a priority-based system where popular media is kept in the cache longer compared to low traffic media. This would be scored simply on the number of downloads. The mechanics are listed below, and the numbers for them should be configurable:

  • Raw files and thumbnails are distinct items - this is to avoid us caching the giant raw file when everyone wants the 100kb thumbnail
  • Only cache/track files under 100mb
  • Only cache up to 1gb (~100 files pessimistically)
  • Only cache media with more than 5 downloads. When an item falls below 5, drop it from the cache.
  • Uploaded files automatically get scored at 1 download (in the case of images, the thumbnail gets scored at 1 and the raw file at zero).
  • The number of downloads should only be for the last 30 minutes (to prevent a download every 15 minutes perpetually keeping the file tracked)
  • When a file not yet tracked by the cache becomes eligible for caching, it can only evict files with the same number of downloads or fewer. It should evict the least-downloaded file with the oldest last_accessed timestamp first.
  • A larger file with a higher download count may evict multiple files to make room in the cache, provided the cache cannot fit the large file as-is and the smaller files each have 5 or more fewer downloads.

Another thing to consider would be scoring files that are ramping up to or plateauing at a large number of downloads higher than those which are ramping down or seeing very little sustained traffic. The rules above try to approximate that in a coarse way that is also memory-efficient; a sketch of the eviction rule follows.
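
A minimal sketch of the eviction rule, in Go. The cachedFile type and its fields are illustrative assumptions rather than MMR's actual types, and the constants mirror the configurable numbers above:

package cache

import (
    "sort"
    "time"
)

// cachedFile is an illustrative record, not MMR's actual type.
type cachedFile struct {
    key          string
    sizeBytes    int64
    downloads    int // downloads within the last 30 minutes
    lastAccessed time.Time
}

const (
    maxFileBytes  = 100 * 1024 * 1024  // only cache/track files under 100mb
    maxCacheBytes = 1024 * 1024 * 1024 // only cache up to 1gb total
    minDownloads  = 5                  // items below this are dropped
)

// evictFor returns which entries to evict so that candidate fits, or
// false when the candidate may not displace enough lower-scored entries.
func evictFor(cache []cachedFile, usedBytes int64, candidate cachedFile) ([]cachedFile, bool) {
    if candidate.sizeBytes >= maxFileBytes || candidate.downloads < minDownloads {
        return nil, false
    }
    if usedBytes+candidate.sizeBytes <= maxCacheBytes {
        return nil, true // fits with no eviction
    }
    // Consider the least-downloaded entries first, breaking ties on the
    // oldest last access.
    victims := append([]cachedFile(nil), cache...)
    sort.Slice(victims, func(i, j int) bool {
        if victims[i].downloads != victims[j].downloads {
            return victims[i].downloads < victims[j].downloads
        }
        return victims[i].lastAccessed.Before(victims[j].lastAccessed)
    })
    var evicted []cachedFile
    for _, v := range victims {
        if usedBytes+candidate.sizeBytes <= maxCacheBytes {
            break
        }
        if v.downloads > candidate.downloads {
            return nil, false // only equal-or-lower scores may be evicted
        }
        // Evicting more than one file is only allowed when the victims
        // trail the candidate by 5+ downloads.
        if len(evicted) > 0 && candidate.downloads-v.downloads < 5 {
            return nil, false
        }
        usedBytes -= v.sizeBytes
        evicted = append(evicted, v)
    }
    if usedBytes+candidate.sizeBytes > maxCacheBytes {
        return nil, false
    }
    return evicted, true
}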

Cache URL previews

We'll cache previews indefinitely. Operators can clear the URL previews through the API or through the database.

ts = Math.min(ts, now()) - this avoids cases where people say "give me github.com in the year 5000".
Further, round ts down to intervals of 1 hour to avoid cases where people want previews every few seconds. A sketch of both steps appears after the list below.

Logic for determining which cached preview to return:

  1. If a record exists for the given ts, return it
  2. If no exact record exists but a newer one does, return the next-newest record
  3. Generate a new preview
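
A minimal Go sketch of the clamping and bucketing, with the database lookup left out:

package main

import (
    "fmt"
    "time"
)

// bucketTs clamps a requested preview timestamp to "now" and rounds it
// down to a 1-hour interval, per the rules above.
func bucketTs(requested, now time.Time) time.Time {
    if requested.After(now) {
        requested = now // ts = min(ts, now()): no previews from the year 5000
    }
    return requested.Truncate(time.Hour)
}

func main() {
    now := time.Now().UTC()
    fmt.Println(bucketTs(now.AddDate(2975, 0, 0), now)) // clamped to the current hour
}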

Support pluggable storage mechanisms

Synapse is going the way of storage providers so, in theory, it can use S3 to store media. We should follow suit.

This will also help with HA (#15).

By using some kind of URI structure in the database, we could use something like s3://bucket/path or mr://hostname/path. One routing sketch is below.
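
A hedged sketch of what that routing could look like in Go; the Store interface and backend types here are invented for illustration and do not reflect MMR's real datastore code:

package storage

import (
    "fmt"
    "net/url"
)

// Store is an illustrative storage interface.
type Store interface {
    Put(path string, data []byte) error
}

type s3Store struct{ bucket, prefix string }

func (s *s3Store) Put(path string, data []byte) error { return nil } // stub

type localStore struct{ host, root string }

func (l *localStore) Put(path string, data []byte) error { return nil } // stub

// storeFor picks a backend from a datastore URI such as s3://bucket/path
// or mr://hostname/path.
func storeFor(uri string) (Store, error) {
    u, err := url.Parse(uri)
    if err != nil {
        return nil, err
    }
    switch u.Scheme {
    case "s3":
        return &s3Store{bucket: u.Host, prefix: u.Path}, nil
    case "mr":
        return &localStore{host: u.Host, root: u.Path}, nil
    default:
        return nil, fmt.Errorf("unknown datastore scheme %q", u.Scheme)
    }
}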

Quarantine media API

This is two parts: supporting the Synapse endpoint and adding a new one for a specific media record (a rough handler shape for the latter is sketched after the list).

  • Implement POST /_matrix/client/r0/admin/quarantine_media/<room_id>?access_token=<access_token> (from synapse)
    • If one doesn't already exist, add an endpoint to synapse to get the media in a room (probably admin)
  • Add POST /_matrix/client/r0/admin/quarantine_media/<origin>/<media_id>?access_token=<access_token> (for specific media)
  • Quarantined media should optionally return a template image instead of 404ing.
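
A rough Go shape for the per-media endpoint; the mux wiring and response body are hypothetical, and a real handler would also verify that the access token belongs to an admin:

package main

import (
    "net/http"
    "strings"
)

const prefix = "/_matrix/client/r0/admin/quarantine_media/"

func quarantineHandler(w http.ResponseWriter, r *http.Request) {
    parts := strings.Split(strings.TrimPrefix(r.URL.Path, prefix), "/")
    if len(parts) != 2 {
        http.Error(w, `{"errcode":"M_INVALID_PARAM"}`, http.StatusBadRequest)
        return
    }
    origin, mediaID := parts[0], parts[1]
    // ... check the access_token, then flag origin/mediaID as quarantined
    // in the database ...
    _, _ = origin, mediaID
    w.Write([]byte("{}"))
}

func main() {
    http.HandleFunc(prefix, quarantineHandler)
    http.ListenAndServe(":8001", nil)
}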

HA/LB support

This includes cross-instance locking to avoid duplicate processing. The idea is to run multiple instances of the repo to help spread load. In HA mode, each instance should be configured to use the REST API so that they share a configuration. One locking option is sketched below.
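
One way to get cross-instance locking without new infrastructure is a Postgres advisory lock, since every instance already shares the database. This is only an illustration of the idea, not MMR's actual HA implementation:

package ha

import (
    "context"
    "database/sql"
)

// withMediaLock runs fn while holding an exclusive advisory lock for the
// given media key, so only one instance processes it at a time. Advisory
// locks are per-connection, so one connection is pinned for the duration.
func withMediaLock(ctx context.Context, db *sql.DB, key int64, fn func() error) error {
    conn, err := db.Conn(ctx)
    if err != nil {
        return err
    }
    defer conn.Close()
    if _, err := conn.ExecContext(ctx, "SELECT pg_advisory_lock($1)", key); err != nil {
        return err
    }
    defer conn.ExecContext(ctx, "SELECT pg_advisory_unlock($1)", key)
    return fn()
}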

More granular download permissions

  • Always require auth
  • On local content
  • On remote content (always)
  • On the first download of remote content
  • Only for thumbnails
    • Repeat options for local/remote from this list
  • Never
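
Illustrative Go decision logic for the options above; the mode names are hypothetical, not real MMR config values:

package perms

// requiresAuth sketches one way to express the options; thumbnails would
// repeat the local/remote modes with the thumbnail flag set.
func requiresAuth(mode string, isLocal, isThumbnail, firstRemoteDownload bool) bool {
    switch mode {
    case "always":
        return true
    case "local":
        return isLocal
    case "remote":
        return !isLocal
    case "remote_first":
        return !isLocal && firstRemoteDownload
    case "thumbnails_local":
        return isThumbnail && isLocal
    case "thumbnails_remote":
        return isThumbnail && !isLocal
    case "never":
        return false
    }
    return true // fail closed on unknown modes
}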

Delete (quarantine) media when it's redacted

Synapse issue: matrix-org/synapse#1263

There's no spec for redactions deleting media, so one may have to be created. When the homeserver detects that a media object is dereferenced everywhere, it should contact the media repo and ask it to delete the file. The homeserver should ignore the response code entirely (200, 404, etc. are all valid) as it would be just a suggestion, and the repo may not implement it.

The homeserver would also be responsible for tracking remote media being redacted. This is to prevent the rest of the world from recommending deletion to the media repo. A shared-secret auth on the API would probably be enough to verify that the right homeserver is contacting the repo.

This may be possible to do with the new pluggable storage layer in Synapse's develop branch. In theory, a shim could be written to proxy the calls to us, bypassing Synapse. The suggested call is sketched below.
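
A sketch of the homeserver-side call described above; the endpoint path and shared-secret header are hypothetical, since no such API is specified yet:

package shim

import (
    "fmt"
    "net/http"
)

// suggestDelete tells the media repo that a file is dereferenced
// everywhere. Any response code (200, 404, ...) is acceptable, since
// deletion is only a suggestion the repo may not implement.
func suggestDelete(repoURL, origin, mediaID, sharedSecret string) {
    url := fmt.Sprintf("%s/_media_repo/delete/%s/%s", repoURL, origin, mediaID)
    req, err := http.NewRequest(http.MethodDelete, url, nil)
    if err != nil {
        return
    }
    req.Header.Set("X-Shared-Secret", sharedSecret)
    if resp, err := http.DefaultClient.Do(req); err == nil {
        resp.Body.Close()
    }
}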

Stream files as they are being downloaded from the remote server

Currently, files are downloaded completely, analyzed, then sent to the requesting party. Instead, the repo should stream data to the requesting party as it arrives over the wire, performing analysis after the entire file has been received.

This is most noticeable on large files.
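
In Go, io.TeeReader gives this shape almost directly: bytes are copied to the client as they arrive, while a second copy accumulates for post-transfer analysis. A sketch, assuming the remote response body is already open:

package relay

import (
    "bytes"
    "io"
    "net/http"
)

// relayRemoteMedia streams bytes to the client as they arrive from the
// remote server; io.TeeReader captures a copy for analysis afterwards. A
// real implementation would spool large files to disk instead of memory.
func relayRemoteMedia(w http.ResponseWriter, remote io.Reader) error {
    var buf bytes.Buffer
    if _, err := io.Copy(w, io.TeeReader(remote, &buf)); err != nil {
        return err
    }
    return analyze(buf.Bytes())
}

// analyze stands in for the post-transfer checks (type sniffing, hashing, etc.).
func analyze(data []byte) error { return nil }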

Config options for content types to thumbnail

Currently the list seems to exclude GIFs. Instead of a dedicated config option, just make a configurable list of content types to thumbnail. This would also allow for future PDF/video/audio(?) thumbnailing. A minimal check is sketched below.
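
A minimal check against such a list, assuming a hypothetical list-valued config option:

package thumbs

// canThumbnail checks a content type against the configured list.
func canThumbnail(contentType string, configuredTypes []string) bool {
    for _, t := range configuredTypes {
        if t == contentType { // e.g. "image/gif", "application/pdf"
            return true
        }
    }
    return false
}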

Compare performance against synapse/dendrite

This media repo is intended to be the repo for multiple homeservers, so it is expected to see N times the load of synapse/dendrite (where N is the number of homeservers the deployment is responsible for). It also needs to perform quickly so that clients don't get flak for having slow media (most users aren't going to notice or care what is powering their media).

The results of this should be published somewhere and be updated regularly. Possible expansions would be to have a test on commits/PRs to verify the times aren't on an upward trend.

Off the top of my head, here's a few things to compare against synapse/dendrite:

  • Uploads (small/med/large) - random data to avoid dedupe
  • Uploads (small/med/large) - same data to encourage dedupe
  • Thumbnailing local media (small/med/large) - cold & warm cache
  • Downloading local media (small/med/large) - cold & warm cache
  • Downloading remote media (small/med/large) - cold & warm cache (ignore latency of remote server as there's not much we can do about that)
  • Thumbnailing remote media (small/med/large) - cold & warm cache (ignore latency of remote server)
  • Processing 10,000 concurrent uploads - cold start
  • Processing 10,000 concurrent local downloads - cold start
  • Processing 10,000 concurrent local thumbnails - cold start
  • Processing 10,000 concurrent remote downloads - cold start (ignore latency of remote server)
  • Processing 10,000 concurrent remote thumbnails - cold start (ignore latency of remote server)
  • Processing 10,000 concurrent random requests (uploads, download, thumbnail - local and remote - avoid and encourage dedupe) - attempt to cause caches to become useless and cold
  • ... and other things to generally try and break the repo

Note: 10k concurrent requests may be ambitious. It should handle at least 1k before exhibiting symptoms of load. One of the upload benchmarks is sketched below.
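
As a starting point, here is what the first bullet could look like as a Go benchmark in a _test.go file; the URL and access token are placeholders for a local deployment:

package bench

import (
    "bytes"
    "crypto/rand"
    "net/http"
    "testing"
)

// BenchmarkUploadSmallRandom uploads small random payloads so that
// deduplication never kicks in.
func BenchmarkUploadSmallRandom(b *testing.B) {
    payload := make([]byte, 64*1024)
    for i := 0; i < b.N; i++ {
        rand.Read(payload) // fresh random bytes defeat dedupe
        req, err := http.NewRequest(http.MethodPost,
            "http://localhost:8001/_matrix/media/r0/upload?access_token=TOKEN",
            bytes.NewReader(payload))
        if err != nil {
            b.Fatal(err)
        }
        req.Header.Set("Content-Type", "application/octet-stream")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            b.Fatal(err)
        }
        resp.Body.Close()
    }
}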

REST API for configuration

Config options:

restApi:
  enabled: false # default is 'true'
  token: "your_token_here"

If the REST API is enabled, the config file is only used for getting the database connection string.

Command line parameter for config file location

Needing to have both the migrations folder and the config file in the working directory makes putting this into a Docker container kind of hard. Being able to configure the location of the migrations folder or the config file (or both) would make this a lot easier.
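
A minimal sketch of such flags using Go's standard flag package; the flag names are suggestions, not necessarily what was shipped:

package main

import "flag"

func main() {
    configPath := flag.String("config", "media-repo.yaml", "path to the config file")
    migrationsPath := flag.String("migrations", "./migrations", "path to the migrations folder")
    flag.Parse()
    // ... load config from *configPath, run migrations from *migrationsPath ...
    _, _ = configPath, migrationsPath
}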
