
Gosty

Scalable cloud transcoding service on Kubernetes

Table of Contents

Table of contents generated with markdown-toc


System Overview



Development

Requirements

  • go 1.16.7
  • docker
  • docker-compose
  • npm
  • yarn
  • minikube

Using Docker Compose

  1. Run docker-compose up
  2. Change config.env to use the Docker Compose environment values (see the sketch below)
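
A minimal sketch of what config.env might contain when pointing at the docker-compose services. Only GOSTY_FILESERVER_SERVICE_HOST appears elsewhere in this document; the other keys are hypothetical placeholders, so check the real config.env for the actual names:

# hypothetical example values for local docker-compose; verify the key names against config.env
GOSTY_FILESERVER_SERVICE_HOST=localhost
GOSTY_MONGODB_HOST=localhost      # hypothetical key
GOSTY_RABBITMQ_HOST=localhost     # hypothetical key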

Using Docker containers

If you want to run the apps in containers while the database and message broker run on minikube, make sure to attach the minikube network to the containers:

docker run -p 8000:8000 --network minikube -e GOSTY_FILESERVER_SERVICE_HOST=192.168.49.4 localhost:5000/gosty-apiserver
docker run -p 8001:8001 --network=minikube localhost:5000/gosty-fileserver
docker run --network minikube -e GOSTY_FILESERVER_SERVICE_HOST=192.168.49.4 localhost:5000/gosty-worker

Using Kind

Initialize the kind cluster (MongoDB and RabbitMQ will run on the host machine using docker-compose):

bash ./hack/create-kind-local-registry.sh

Deploy using kustomize

kustomize build deployment/kustomize/environments/gke | kubectl apply -f -

Using Minikube

kustomize build deployment/kustomize/environments/minikube | kubectl apply -f -

Deployment

Deploy on GKE

Initialize the cluster with this script:

bash ./hack/create-cluster.sh

Using Managed RabbitMQ and MongoDB

Set your MongoDB & RabbitMQ credentials in the ConfigMap (switch to a Secret if you prefer and modify the manifest accordingly).

Then apply the Linkerd and gosty components.
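
A minimal sketch of what that ConfigMap might look like; the resource name, key names, and connection strings here are hypothetical, so adjust them to match the actual manifest in the repo:

apiVersion: v1
kind: ConfigMap
metadata:
  name: gosty-config           # hypothetical name
  namespace: gosty
data:
  MONGODB_URI: "mongodb+srv://<user>:<password>@<your-managed-mongodb>/gosty"   # hypothetical key
  RABBITMQ_URI: "amqps://<user>:<password>@<your-managed-rabbitmq>:5671"        # hypothetical key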

Deploy RabbitMQ and MongoDB inside Cluster

RabbitMQ and MongoDB are deployed using Helm, so make sure Helm is installed first:

curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

Also add the Bitnami repo:

helm repo add bitnami https://charts.bitnami.com/bitnami

RabbitMQ

kubectl create -f k8s/rabbitmq/service.yaml # create the NodePort service, skip if you don't need it
helm install rabbit bitnami/rabbitmq -f k8s/rabbitmq/helm-values.yaml --create-namespace --namespace gosty

MongoDB

kubectl create -f deployment/k8s/mongodb/service.yaml # create the NodePort service, skip if you don't need it
helm install mongodb bitnami/mongodb -f deployment/k8s/mongodb/helm-values.yaml --create-namespace --namespace gosty

API Server, File Server, Worker

Apply the rest of the k8s resource manifests:

kustomize build deployment/kustomize/environments/gke | kubectl apply -f -

Elasticsearch-Fluentd-Kibana

These resources will be deployed in the fluentd-monitoring namespace. This stack is currently only used for log monitoring.

kubectl apply -f ./deployment/k8s/fluentd
kubectl -n fluentd-monitoring port-forward svc/kibana 5601

Linkerd

Install the Linkerd CLI:

curl -sL run.linkerd.io/install | sh

Install the Linkerd components:

kubectl apply -f deployment/k8s/linkerd/manifest

Access the Linkerd dashboard:

linkerd viz dashboard

Inject Linkerd into RabbitMQ, MongoDB & the ingress controller:

kubectl get statefulset -n gosty rabbit-rabbitmq -o yaml | linkerd inject - | kubectl apply -f -
kubectl get statefulset -n gosty mongodb -o yaml | linkerd inject - | kubectl apply -f -
kubectl get statefulset -n gosty mongodb-arbiter -o yaml | linkerd inject - | kubectl apply -f -
kubectl get deployment -n kube-system ingress-nginx-controller -o yaml | linkerd inject - | kubectl apply -f -

PromQL to add average Kubernetes CPU usage to the Grafana dashboard:

sum (rate (container_cpu_usage_seconds_total{image!="",kubernetes_io_hostname=~"^$Node$",namespace="gosty"}[1m]))

Additional

Local Image Registry

Run a local image registry for faster development iteration.

Using minikube's local registry

In case you want to run it inside the Kubernetes cluster (minikube), enable minikube's local registry addon:

minikube addons enable registry

Forward the registry service to a local port:

kubectl port-forward --namespace kube-system svc/registry 5000:80

Push the local images to minikube's local registry:

docker build -t localhost:5000/{image-name} -f docker/Dockerfile-{image-name} .
docker push localhost:5000/{image-name}

Using microk8s local registry

microk8s.enable registry

You can access it via MicroK8s's IP (NodePort service).

K3s local registry

Create this file at /etc/rancher/k3s/registries.yaml:

mirrors:
  registry.local:
    endpoint:
      - "http://192.168.56.1:5000" # your local container registry

Using Docker Compose

docker-compose -f docker-compose-registry.yaml up -d registry

The registry will be exposed on 0.0.0.0:5000.

K8s manifest adjustment

Don't forget to change the image in the k8s Deployment manifests:

spec:
  containers:
    - name: { image-name }
      image: localhost:<registry's cluster ip/container registry's ip>/{image-name}:latest 

Spekt8

I set up spekt8 for cluster visualization:

kubectl create -f ./deployment/k8s/plugins/spekt8/fabric8-rbac.yaml
kubectl apply -f ./deployment/k8s/plugins/spekt8/spekt8-deployment.yaml
kubectl port-forward -n gosty deployment/spekt8 3000:3000

Dashboard

Accessing k8s dashboard

kubectl -n kube-system port-forward svc/kubernetes-dashboard 8443:443       

Chaos Mesh

I added some testing scenarios in k8s/chaos using Chaos Mesh. First, install Chaos Mesh on the cluster:

curl -sSL https://mirrors.chaos-mesh.org/v1.1.2/install.sh | bash

Testing

Run a Pod for Testing File Upload from Inside the Cluster

# omit --rm to keep the instance after exit
kubectl run curl-test --rm --image=marsblockchain/curl-git-jq-wget -it -- sh

Download the Test Files

# this is a 20MB test file
wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=102o0T6XeB0znP-r0dkvKhDYTObniockI' -O sony.mp4
# this is a 200MB test file (Blender Foundation's Big Buck Bunny)
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1mw1JHv739M46J6Jv5cXkVHb-_n7O0blK' -r -A 'uc*' -e robots=off -nd # will download 2 files
mv uc?export\=download\&confirm\=uuRp\&id\=1mw1JHv739M46J6Jv5cXkVHb-_n7O0blK bunny.mp4 # rename the file

Post data via curl

curl http://gosty-apiserver.gosty.svc.cluster.local/api/video/upload -F [email protected] -v
curl http://34.149.27.149/api/video/upload -F [email protected] -v

Submit query to morph

python cli_submit.py -l bunny.mp4 -s 256x144 426x240 640x360 854x480 1280x720 1920x1080
python cli_submit.py -l bunny.mp4 -s 854x480

Execute Flaky Endpoint

while true; do sleep 60 && curl "http://34.134.157.70/api/scheduler/progress/update"; done # change the IP accordingly

GCP Compute Metrics MQL to get the average load (for the morph comparison)

  • CPU

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by 25m, [value_utilization_aggregate: aggregate(value.utilization)/25]

  • Memory

fetch gce_instance
| metric 'agent.googleapis.com/memory/percent_used'
| group_by 25m, [value_percent_used_mean: aggregate(value.percent_used)/25.15]

Resize the GKE cluster to 0 when not in use

To save on the bill:

gcloud container clusters resize ${CLUSTER_NAME} --zone=us-central1-a --num-nodes=0

Issues

Minikube ingress IP changes when the ingress controller is restarted

# change this according to your `kubectl get ingress`
sed -i -e 's/192.168.59.2/'"192.168.59.3"'/g' /etc/hosts

ImagePullBackOff on MicroK8s

When the internet connection is bad, this sometimes happens (especially for large images). Solution: set an image pull timeout limit for the kubelet and re-run the kubelet manually (somehow I can't edit the kubelet service's ExecStart, so it needs to be run by hand):

sudo systemctl stop snap.microk8s.daemon-kubelet.service
sudo /snap/microk8s/2094/kubelet --kubeconfig=/var/snap/microk8s/2094/credentials/kubelet.config --cert-dir=/var/snap/microk8s/2094/certs --client-ca-file=/var/snap/microk8s/2094/certs/ca.crt --anonymous-auth=false --network-plugin=cni --root-dir=/var/snap/microk8s/common/var/lib/kubelet --fail-swap-on=false --cni-conf-dir=/var/snap/microk8s/2094/args/cni-network/ --cni-bin-dir=/var/snap/microk8s/2094/opt/cni/bin/ --feature-gates=DevicePlugins=true --eviction-hard="memory.available<100Mi,nodefs.available<1Gi,imagefs.available<1Gi" --container-runtime=remote --container-runtime-endpoint=/var/snap/microk8s/common/run/containerd.sock --containerd=/var/snap/microk8s/common/run/containerd.sock --node-labels=microk8s.io/cluster=true --authentication-token-webhook=true --cluster-domain=cluster.local --cluster-dns=10.152.183.10 --image-pull-progress-deadline=30s

Random pods evicted

MicroK8s has an issue where random pods get evicted even though resources are fine; just restart the deployment and delete the old ReplicaSets:

export NS=<NAMESPACE> 
kubectl -n $NS delete rs $(kubectl -n $NS get rs | awk '{if ($2 + $3 + $4 == 0) print $1}' | grep -v 'NAME')

Pods stuck on terminating state

Sometimes this happens on the k3s cluster after a while:

export NS=<namespace>
for p in $(kubectl get pods -n $NS | grep Terminating | awk '{print $1}'); do kubectl delete pod -n $NS $p --grace-period=0 --force;done

Hostpath provisioner only writable by root

If you're running into this issue, where the pod won't start because it can't write to the PV, my current workaround is to change the directory permissions. On every node in your cluster, run the command below:

sudo chmod -R 777 /tmp/hostpath-provisioner/gosty

Viper needs a .env file

If you look closely, config.env is still added to the final Docker images because Viper has this issue. The env values are later overridden by the env values injected by K8s.

Problem scaling the RabbitMQ StatefulSet

RabbitMQ uses an Erlang cookie as a shared secret for authentication between RabbitMQ nodes, and it is stored on the volume. If the Erlang cookie is not defined in a secret, it is generated randomly; when scaling via the Kubernetes API, each node will then have a different cookie, so they can't work together. Remove the previous volume to make sure the cookie is renewed.
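
One way to avoid this is to pin the cookie at install time so every node shares it. A sketch using the Bitnami chart's auth.erlangCookie value (assuming the chart version in use exposes it):

helm install rabbit bitnami/rabbitmq \
  -f k8s/rabbitmq/helm-values.yaml \
  --set auth.erlangCookie=<some-long-random-string> \
  --create-namespace --namespace gosty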

Debug K8s DNS

Some images have problems with DNS resolution, for example Alpine 3.11 and 3.13 (which this project used to use) with this issue. The command below runs a one-off pod to debug DNS:

kubectl run --restart=Never --rm -i --tty alpine --image=alpine:3.12 -- nslookup kube-dns.kube-system.svc.cluster.local

Private registry on MicroK8s

MicroK8s private registry communication needs to be over HTTPS, else it won't work; here is the reference for the setup.

What can be improved

  • Use a more proper permanent storage system (consider using object storage, e.g. MinIO or GCS)
  • Currently, every worker always downloads a copy of the file, processes it on its local pod volume, removes the original file, and then sends the processed file to the file server; this can use a lot of bandwidth. It could be improved by using a shared volume on the node and checking whether another worker has already downloaded the file before processing it.

Todo

  • Run an experiment using the static policy of the K8s CPU Manager (see the sketch below)
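
A minimal sketch of the kubelet configuration that enables the static CPU Manager policy; this is standard Kubernetes kubelet configuration, not something specific to this repo:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# the static policy requires a non-zero CPU reservation for system daemons
kubeReserved:
  cpu: "1"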

Acknowledgements

  • Credit to gibbok for the video player in the web client (heavily modified for this project's use case)


gosty's Issues

Fix total elapsed time

The current implementation uses (last task finished - first task submitted). If multiple tasks are submitted, they will be queued and the elapsed time would be quite inaccurate. It needs to use the start time from when the worker picks up the job.

Build React on the runner

Build React on the CI runner so we can improve build speed
by using GitHub Actions' runner caching support.


This issue has been automatically created by todo-actions based on a TODO comment found in docker/web.Dockerfile:8. It will automatically be closed when the TODO comment is removed from the default branch (main).

Stale worker stuck in working state

Sometimes the worker info on the web stays in the working state if there is a problem in the middle of an early task; the current workaround is to restart the API server.

Make the chunk file size dynamic based on the number of workers, worker failure rate, etc.

Lots of smaller tasks with the same number of workers will just add processing overhead.
The upside is that, in case of a pod failure in the middle of processing, the task that
needs to be re-processed is a smaller chunk. Currently this is set dynamically
via an env var.


This issue has been automatically created by todo-actions based on a TODO comment found in pkg/apiserver/services/scheduler_logic.go:357. It will automatically be closed when the TODO comment is removed from the default branch (main).
