
antidote-core's People

Contributors

cloudtoad, jameskellynet, jnpr-raylam, lara29, mierdin, tinyfire9

antidote-core's Issues

Remove TopologyType field

Not needed anymore. Was going to be useful if we used read-only, shared topologies, but we abandoned that approach.

  • Will need to rewrite scheduler to use presence of devices in lieu of this explicit field
  • Update defs
  • Update existing lessons

Pick up existing lessons after restart

Syringe holds the state of which lessons are provisioned in memory, so if Syringe restarts, it loses track of them. This isn't too bad: now that GC is in place and works purely from Kubernetes calls, the orphaned lessons get cleaned up eventually. But it does mean everyone has to start new lessons from scratch. It would be nice if, when Syringe gets a request to create a resource, it detected whether one already exists and, if so, added it back to the in-memory map.

Syringe Architectural Changes

Considering a few changes:

  • Make the API server totally stateless. Keep state only in the scheduler, via Kubelab.
  • De-couple the scheduler into its own running binary. Communicate between scheduler and API via message queue
  • Use database behind scheduler (shouldn't need this for API)

Potential race condition and poor handling of kubelab state

I was working on the NAPALM lesson, which is the first to use jupyter notebooks as the lab guide. It uses the notebook for the lab guide for stages 1, 2, and 4. For stage 3 it will revert back to a markdown file.

EVERY ONCE IN A WHILE (and I mean very rarely) it will fail to load the markdown version. It was very difficult to reproduce this. As a result of this, it's also not clear whether the problem is caused by starting at stage 1 and navigating to stage 3, or if it could happen any time. I only tried going directly to stage 3 a few times but it never happened when doing that. However, with the infrequency of the event, it's still inconclusive.

I still haven't gotten to the root cause, but digging through the scheduler code made me realize how fragile and shitty it is the way I'm mapping kubelab to livelessons, handling kubelab state, applying state changes between stages, etc. This whole thing needs to be revamped.

I hate discussing a solution before I've REALLY nailed down the problem, but after spending hours trying to reliably reproduce, I'm leaning towards just rebuilding the scheduler to not suck and assuming that will take care of it.

For instance, there's a global var in scheduler.go:

kubeLabs                           = map[string]*KubeLab{}

I'm only writing to this map in one place, and in my testing, I was only spinning up lessons for me, so there shouldn't have been concurrent writes (though obviously I should still be using a Mutex or something to control concurrent writes). I also am not sure of the implications of this global variable as opposed to a property of the scheduler.
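
One way to address both concerns is to move the map onto the scheduler and guard it with a mutex. The sketch below is an assumption about how that could look, not the existing code: the accessor names `setKubeLab` and `getKubeLab` are hypothetical, and `LessonScheduler` is reduced to the fields that matter here.

```go
package main

import (
	"fmt"
	"sync"
)

// KubeLab is a stand-in for the real type.
type KubeLab struct{ Namespace string }

// LessonScheduler here only models the lab map; the real fields are omitted.
type LessonScheduler struct {
	mu       sync.RWMutex
	kubeLabs map[string]*KubeLab
}

func NewLessonScheduler() *LessonScheduler {
	return &LessonScheduler{kubeLabs: map[string]*KubeLab{}}
}

// setKubeLab stores a lab under its UUID while holding the write lock.
func (ls *LessonScheduler) setKubeLab(uuid string, kl *KubeLab) {
	ls.mu.Lock()
	defer ls.mu.Unlock()
	ls.kubeLabs[uuid] = kl
}

// getKubeLab looks a lab up while holding the read lock.
func (ls *LessonScheduler) getKubeLab(uuid string) (*KubeLab, bool) {
	ls.mu.RLock()
	defer ls.mu.RUnlock()
	kl, ok := ls.kubeLabs[uuid]
	return kl, ok
}

func main() {
	ls := NewLessonScheduler()

	// Concurrent writers are safe because every access goes through the lock.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			uuid := fmt.Sprintf("lesson-%d", i)
			ls.setKubeLab(uuid, &KubeLab{Namespace: uuid + "-ns"})
		}(i)
	}
	wg.Wait()

	_, ok := ls.getKubeLab("lesson-7")
	fmt.Println(ok)
}
```

Keeping the map as a field also means each scheduler instance (for example, one per test) gets its own state instead of sharing a package-level variable.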

Confusing Lesson DIR parameters

Some config parameters point to something like /antidote, others /antidote/lessons.

Should reposition configuration input as "curriculum". This will allow you to incorporate collection resources underneath this as well.

trace from launch day

time="2018-10-11T08:00:15Z" level=info msg="New KubeLab for lesson 19 is of TopologyType custom"
time="2018-10-11T08:00:15Z" level=debug msg="Creating devices and connections"
time="2018-10-11T08:00:15Z" level=error msg="Problem creating network vqfx1-vqfx2-net: network-attachment-definitions.k8s.cni.cncf.io \"vqfx1-vqfx2-net\" is forbidden: unable to create new content in namespace 19-rapsjjo3ypp6m1bc-ns because it is being terminated"
time="2018-10-11T08:00:15Z" level=error msg="network-attachment-definitions.k8s.cni.cncf.io \"vqfx1-vqfx2-net\" is forbidden: unable to create new content in namespace 19-rapsjjo3ypp6m1bc-ns because it is being terminated"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xf56869]
goroutine 447350 [running]:
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).createKubeLab(0xc420409520, 0xc4218010c0, 0xc422fb37a0, 0x2, 0x2)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:347 +0xd79
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).handleRequest(0xc420409520, 0xc4218010c0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:120 +0xbcd
created by github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:110 +0xb5

might be solved by #5

Delaying incoming requests until after nuke finishes

Kind of a niche problem, but requests for lessons immediately after startup, and before the nuke has completed, give a 408 error. We should probably cache all incoming requests while nuking is in progress, and wait to process them until it has finished.

time="2019-01-13T22:40:27Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:40:31 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:40:32Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
time="2019-01-13T22:40:37Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:40:41 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:40:42Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
time="2019-01-13T22:40:47Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:40:51 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:40:52Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
ERROR: 2019/01/13 22:40:53 grpc: server failed to encode response:  rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
10.40.0.3 - - [13/Jan/2019:22:40:56 +0000] "GET /exp/lessondef/all HTTP/1.1" 200 143850
10.40.0.3 - - [13/Jan/2019:22:40:56 +0000] "GET /exp/lessondef/15 HTTP/1.1" 200 27172
time="2019-01-13T22:40:57Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:41:01 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:41:02Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.40.0.3 - - [13/Jan/2019:22:40:57 +0000] "POST /exp/livelesson HTTP/1.1" 408 66
10.40.0.3 - - [13/Jan/2019:22:41:07 +0000] "GET /exp/syringeinfo HTTP/1.1" 200 114

Should also consider handling this error better in antidote-web. The behavior on that side is that the loading screen constantly shows, when in fact it broke on the initial error within seconds. Should go through all requests on that side and make sure nothing is able to fail silently like that.

kubelab crashes


10.38.0.38 - - [17/Feb/2019:18:22:41 +0000] "GET /exp/lessondef/17 HTTP/1.1" 200 4235
time="2019-02-17T18:22:41Z" level=debug msg="Scheduler received new request. Sending to handle function." Operation=3 Stage=2 Uuid=17-wxd1zqhqmej6qoiv
panic: runtime error: index out of range
goroutine 180539 [running]:
github.com/nre-learning/syringe/scheduler.(*KubeLab).ToLiveLesson(0xc420c6eae0, 0xc421310a20)
	/go/src/github.com/nre-learning/syringe/scheduler/kubelab.go:105 +0xb7f
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).handleRequestMODIFY(0xc4201edc70, 0xc420b699c0)
	/go/src/github.com/nre-learning/syringe/scheduler/requests.go:176 +0x146
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).(github.com/nre-learning/syringe/scheduler.handleRequestMODIFY)-fm(0xc420b699c0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:93 +0x34
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start.func2(0xc4206d2810, 0xc420b699c0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:107 +0x6d
created by github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:106 +0x3ff

Metrics not quite right

Something changed in the last release. Ever since last Monday (v0.2.0), the metrics have been askew:

[Screenshot, 2019-02-04 5:20:41 PM; image not available]

This contradicts the current namespaces (the only session is me):

kubectl get ns

NAME                     STATUS   AGE
14-jjtigg867ghr3gye-ns   Active   46m
default                  Active   25d
kube-public              Active   25d
kube-system              Active   25d
prod                     Active   25d
ptr                      Active   25d

Certificate management for iframe resources

Previously, there was no generic iframe resource - it was specific to jupyter notebooks. Thus, it was easy to load the letsencrypt cert into the docker image because I controlled the build.

However, if we're planning to allow generic web resources to be embedded via iframe, this means in order to display the content, it needs to be a trusted cert. So we need to put some thought into how we're going to do this - I can't always slipstream a cert into every image that needs it.

Shouldn't panic on failure to connect to k8s

panic: Post https://10.96.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.96.0.1:443: connect: connection refused
goroutine 22 [running]:
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).createNetworkCrd(0xc4204c12c0, 0x1386d40, 0xc420568360)
	/go/src/github.com/nre-learning/syringe/scheduler/networks.go:33 +0x94
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start(0xc4204c12c0, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:83 +0x3d
main.main.func2(0xc4204c12c0, 0xc42036aec0)
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:65 +0x2f
created by main.main
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:64 +0x538

State management abstraction

To prepare for the upcoming move to external state, we should prepare an abstraction for state management. Support internal state as we do today, but behind an interface that could also be satisfied by an external driver like etcd.

Should continue to support both for different use cases. Be able to configure the driver choice via an environment variable.
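
A rough sketch of what such an abstraction might look like. Everything named here is an assumption: the `StateStore` interface, the `LiveLesson` fields, and the `SYRINGE_STATE_DRIVER` environment variable are hypothetical, and only the in-memory driver is implemented.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// LiveLesson is reduced to two fields for illustration.
type LiveLesson struct {
	UUID  string
	Stage int
}

// StateStore is the interface that both the in-memory driver and a future
// external driver (for example one backed by etcd) would satisfy.
type StateStore interface {
	Get(uuid string) (LiveLesson, bool, error)
	Put(ll LiveLesson) error
	Delete(uuid string) error
}

// memoryStore keeps state in a mutex-guarded map, as Syringe does today.
type memoryStore struct {
	mu sync.RWMutex
	m  map[string]LiveLesson
}

func newMemoryStore() *memoryStore { return &memoryStore{m: map[string]LiveLesson{}} }

func (s *memoryStore) Get(uuid string) (LiveLesson, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	ll, ok := s.m[uuid]
	return ll, ok, nil
}

func (s *memoryStore) Put(ll LiveLesson) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[ll.UUID] = ll
	return nil
}

func (s *memoryStore) Delete(uuid string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, uuid)
	return nil
}

// newStateStore picks a driver by name; an empty name means in-memory.
func newStateStore(driver string) (StateStore, error) {
	switch driver {
	case "", "memory":
		return newMemoryStore(), nil
	default:
		return nil, fmt.Errorf("unsupported state driver %q", driver)
	}
}

func main() {
	// SYRINGE_STATE_DRIVER is a made-up variable name for this example.
	store, err := newStateStore(os.Getenv("SYRINGE_STATE_DRIVER"))
	if err != nil {
		fmt.Println(err)
		return
	}
	_ = store.Put(LiveLesson{UUID: "15-qofps64daw5eqzf6", Stage: 1})
	ll, ok, _ := store.Get("15-qofps64daw5eqzf6")
	fmt.Println(ll.Stage, ok)
}
```

The methods return errors even though the in-memory driver never fails, so that a networked driver can report failures without changing the interface.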

Improve logging for endpoint configuration

Config logs don't contain any context of what they're configuring.

We should also error out earlier than 10 or so failures. Like 3 should be fine.

10.38.0.38 - - [11/Mar/2019:00:48:07 +0000] "GET /exp/lessondef HTTP/1.1" 200 181938
time="2019-03-11T00:48:09Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:09Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:11 +0000] "GET /exp/livelesson/21-qofps64daw5eqzf6 HTTP/1.1" 200 2778
10.38.0.38 - - [11/Mar/2019:00:48:11 +0000] "GET /exp/lessondef HTTP/1.1" 200 181938
10.38.0.0 - - [11/Mar/2019:00:48:13 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:14Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:14Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:15 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:19Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:19Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.0 - - [11/Mar/2019:00:48:23 +0000] "GET / HTTP/1.1" 200 2
10.38.0.38 - - [11/Mar/2019:00:48:23 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:24Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:24Z" level=error msg="Problem configuring with config-vqfx2"
time="2019-03-11T00:48:29Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:29Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:32 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
10.38.0.0 - - [11/Mar/2019:00:48:33 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:34Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:34Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:39 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:39Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:39Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.0 - - [11/Mar/2019:00:48:43 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:44Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:44Z" level=error msg="Problem configuring with config-vqfx2"
time="2019-03-11T00:48:47Z" level=debug msg="No old namespaces found. No need to GC."
10.38.0.38 - - [11/Mar/2019:00:48:48 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:49Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:49Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.0 - - [11/Mar/2019:00:48:53 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:54Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:54Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "GET /exp/lessondef/16 HTTP/1.1" 200 23674
time="2019-03-11T00:48:58Z" level=debug msg="Scheduler received new request. Sending to handle function." Operation=4 Stage=0 Uuid=16-qofps64daw5eqzf6
time="2019-03-11T00:48:58Z" level=debug msg="Booping 16-qofps64daw5eqzf6-ns"
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "POST /exp/livelesson HTTP/1.1" 200 28
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "GET /exp/livelesson/16-qofps64daw5eqzf6 HTTP/1.1" 200 4451
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "GET /exp/lessondef HTTP/1.1" 200 181938
time="2019-03-11T00:48:59Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:59Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:49:00 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:49:02Z" level=debug msg="Recording periodic influxdb metrics"
time="2019-03-11T00:49:02Z" level=debug msg="Creating influxdb point: ID: 17 | NAME: Version Control with Git | ACT
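
A sketch of the two changes suggested above: give up once a job has failed three times, and include the lesson and endpoint in the message. `JobStatus` mirrors the counters in the "Job Status" log lines; `checkConfigJob`, the constant, and the lesson/endpoint pairing used in `main` are hypothetical and only for illustration.

```go
package main

import "fmt"

// JobStatus mirrors the counters printed in the "Job Status" log lines.
type JobStatus struct {
	Name       string
	Active     int
	Failed     int
	Successful int
}

// maxConfigFailures is the early cut-off proposed in this issue.
const maxConfigFailures = 3

// checkConfigJob reports whether the job has finished successfully. It returns
// an error carrying lesson and endpoint context once the job has failed
// maxConfigFailures times; otherwise the caller should keep polling.
func checkConfigJob(lessonUUID, endpoint string, st JobStatus) (bool, error) {
	if st.Successful > 0 {
		return true, nil
	}
	if st.Failed >= maxConfigFailures {
		return false, fmt.Errorf("lesson %s: configuring endpoint %s via job %s failed %d times, giving up",
			lessonUUID, endpoint, st.Name, st.Failed)
	}
	return false, nil
}

func main() {
	st := JobStatus{Name: "config-vqfx2", Active: 1, Failed: 3}
	done, err := checkConfigJob("15-qofps64daw5eqzf6", "vqfx2", st)
	fmt.Println(done, err)
}
```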

Configuration improvements

If the configs are the same between stages don't try reconfiguring

Also should clean up the approach with merge/overwrite

iframe resource shouldn't be tested

time="2019-01-15T06:49:08Z" level=debug msg="Connectivity testing endpoint selfservice via :0"
time="2019-01-15T06:49:08Z" level=debug msg="Connectivity testing endpoint selfservice via 10.107.156.241:5000"

It works right now so be careful removing anything important

Interactive wizard for building a new lesson and a new collection.

Should use the same functionality regardless of the resource type. Build it once, and iterate over the fields of whatever resource type is specified.
Should probably embed questions and choices into the data model somehow.

Should also update docs with a page on creating a new lesson/collection/etc that uses this.

Strange GC behavior

time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Finished garbage-collecting 5 old lessons"
time="2019-01-29T22:07:59Z" level=debug msg="Received result from scheduler." Operation=3

Standardize logging configuration

We need to be able to configure a syslog destination, or a file. Currently just statically outputting to stdout.

Should also standardize the logging output itself. Like, for every message, lesson ID and session ID should be implicitly provided so it's always there.

Collections

Need a new resource type: collection.

You can refer to these collections in the lesson definition (should load collections first from YAML files, and then add a check to make sure the collection ID exists).

The collection resource will have all of the metadata needed to construct a page describing that collection.

TODO

  • In the web UI - support a mock collection set, and provide an option to declare which one should be top of the list, so we can use this for evangelism.

TLC to the configuration abilities in Syringe

  • Shell scripts for non-network devices
  • Remove configs from lesson definition - just rely on files placed
  • Update existing lessons to remove configs
  • Put in check on import to ensure that all files are in the right place

If the configs are the same between stages don't try reconfiguring

Also should clean up the approach with merge/overwrite

grpc error when removing whitelist entry

[mierdin@antidote-controller-lklr ~]$ kubectl exec -n=prod syringe-fbc65bdf5-zf4l4 syrctl wl remove m9eujuuzvqiq47b4
rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
command terminated with exit code 1

influx error

10.47.0.0 - - [12/Feb/2019:02:40:52 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:02 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:12 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:22 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:32 +0000] "GET / HTTP/1.1" 200 2
time="2019-02-12T02:41:34Z" level=debug msg="No old namespaces found. No need to GC."
10.47.0.0 - - [12/Feb/2019:02:41:42 +0000] "GET / HTTP/1.1" 200 2
time="2019-02-12T02:41:52Z" level=debug msg="Recording periodic influxdb metrics"
time="2019-02-12T02:41:52Z" level=debug msg="Creating influxdb point: ID: 15 | NAME: Event-Driven Network Automation with StackStorm | ACTIVE: 1"
time="2019-02-12T02:41:52Z" level=debug msg="Creating influxdb point: ID: 30 | NAME: Network Automation with Salt | ACTIVE: 1"
10.47.0.0 - - [12/Feb/2019:02:41:52 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:42:02 +0000] "GET / HTTP/1.1" 200 2
10.40.2.110 - - [12/Feb/2019:02:42:03 +0000] "GET /exp/lessondef/30 HTTP/1.1" 200 8746
time="2019-02-12T02:42:03Z" level=debug msg="Scheduler received new request. Sending to handle function." Operation=4 Stage=0 Uuid=30-neiflnh3stq6q9yl
time="2019-02-12T02:42:03Z" level=debug msg="Booping 30-neiflnh3stq6q9yl-ns"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xfce8a1]
goroutine 2372 [running]:
github.com/nre-learning/syringe/api/exp.(*server).recordRequestTSDB(0xc4204b5bc0, 0xc42035e900, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/api/exp/influxdb.go:106 +0x321
github.com/nre-learning/syringe/api/exp.(*server).RequestLiveLesson(0xc4204b5bc0, 0x142f560, 0xc420820210, 0xc42035e8c0, 0xc4204b5bc0, 0xc420820150, 0x1199de0)
	/go/src/github.com/nre-learning/syringe/api/exp/livelessons.go:86 +0x651
github.com/nre-learning/syringe/api/exp/generated._LiveLessonsService_RequestLiveLesson_Handler(0x12d9060, 0xc4204b5bc0, 0x142f560, 0xc420820210, 0xc420333220, 0x0, 0x0, 0x0, 0xc420c0cc80, 0x16)
	/go/src/github.com/nre-learning/syringe/api/exp/generated/livelesson.pb.go:1016 +0x241
github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc42045ad80, 0x143b720, 0xc420516d80, 0xc420474b00, 0xc4204b5ce0, 0x1d9dad8, 0x0, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:966 +0x4bc
github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).handleStream(0xc42045ad80, 0x143b720, 0xc420516d80, 0xc420474b00, 0x0)
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:1245 +0xd69
github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4203de070, 0xc42045ad80, 0x143b720, 0xc420516d80, 0xc420474b00)
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:685 +0x9f
created by github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:683 +0xa1

namespace delete

time="2018-08-24T22:10:00Z" level=error msg="failed to create namespace, not creating kubelab"
time="2018-08-24T22:10:00Z" level=error msg="Error creating lesson: namespaces \"12-6viedvg5rctwdpcd-ns\" already exists"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xf0e5f9]
goroutine 6 [running]:
github.com/nre-learning/syringe/scheduler.(*KubeLab).ToLiveLesson(0x0, 0xc4201b9ef8)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:300 +0x79
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start(0xc42033b4e0, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:98 +0x53e
main.main.func2(0xc42033b4e0, 0xc420367430)
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:78 +0x2f
created by main.main
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:77 +0x4c9

Enforce DNS-1123 resource name compliance on ingest

Endpoint names currently have no validation on ingest that forces compliance with kubernetes standards. See below error when trying to create an endpoint name with capital letters.

time="2018-11-06T07:41:26Z" level=error msg="Problem creating pod StackStorm: Pod \"StackStorm\" is invalid: [metadata.name: Invalid value: \"StackStorm\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.containers[0].name: Invalid value: \"StackStorm\": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')]"

Validation should be added to all relevant fields in db types that will result in a kubernetes resource name that conforms to this spec.
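
A sketch of such a check at ingest, reusing the DNS-1123 label regex quoted in the error message above plus the 63-character label limit. `validateEndpointName` is a hypothetical function name. The Kubernetes `k8s.io/apimachinery` validation package provides an equivalent `IsDNS1123Label` helper that could be used instead of a hand-rolled regex.

```go
package main

import (
	"fmt"
	"regexp"
)

// DNS-1123 labels are limited to 63 characters.
const dns1123LabelMaxLength = 63

// The pattern is the one quoted in the Kubernetes error, anchored at both ends.
var dns1123LabelRegexp = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validateEndpointName rejects names that Kubernetes would refuse as a
// DNS-1123 label, so the problem surfaces on ingest rather than at pod creation.
func validateEndpointName(name string) error {
	if len(name) == 0 || len(name) > dns1123LabelMaxLength {
		return fmt.Errorf("endpoint name %q must be 1-%d characters long", name, dns1123LabelMaxLength)
	}
	if !dns1123LabelRegexp.MatchString(name) {
		return fmt.Errorf("endpoint name %q must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"stackstorm", "StackStorm", "vqfx-1", "-bad"} {
		fmt.Printf("%-12s %v\n", name, validateEndpointName(name))
	}
}
```

The label form is used because the endpoint name ends up as a container name as well as a pod name, and the label rule is the stricter of the two shown in the error.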

Remove URL for lesson guide

Shouldn't be too tough to read the files from the expected location on request, and pass that data through the API itself.

Lesson spares

It should be easy to set, on a per-kubelab basis, the number of spares to keep ready at all times. Default to 0. Tweak as demand rises and as costs can handle. Will need to figure out how to deal with namespaces. Maybe create the pod ahead of time and then add the service into the namespace? That will ensure DNS consistency with JIT-provisioned stuff. Need an API for this, so that we can not only adjust these levels at runtime, but also get metrics on how long lesson spares have remained unused, so we can optimize the configuration over time.

The problem you'll have to deal with is that resources can't have their namespaces changed. So if you are provisioning stuff ahead of time, when they're "called up" to be used, you'll have to access them where they live. Garbage collection will also need to be changed to clean up resources created either JIT or in advance.

Run initial clone directly from Syringe

Currently, there are two places you have to update the lesson repo in Syringe, since Syringe must use an init container to clone its own copy of the lesson directory, but then the environment variable is used to create init containers for all pods and jobs spawned by Syringe.

We should use the environment variable as the source of truth, and use something like https://github.com/src-d/go-git to take care of the initial clone, rather than an init container for the syringe pod(s). The pods/jobs spawned by syringe can and should continue to use init containers, however, as these are all informed by the env variable.

This will also make selfmedicate simpler.

NOTE that #75 introduces the use of local lesson directory for Syringe. When implementing the feature for this issue, be careful to honor this new functionality as well - only clone if the configuration permits. Otherwise use the local directory.

Syringe database back-end

#4 is a temporary fix, but the long-term fix is to back-end Syringe with a proper database for the small amount of state it keeps.

This will allow you to create multiple instances of Syringe, as they'll be stateless. When they need to make a change, lock the value in etcd, change it, and unlock. Do things properly.

This will make #3 unnecessary, as when syringe crashes, the state will be kept elsewhere. It will also mean that we can truly have a proper load-balancing setup, as everything will scale out. What little state exists will exist in a database built to keep it distributed and properly accessed/locked.

Increase syringe replicas to 3 once done

Also should make sure that when a new request comes in, that other requests are delayed or declined until the first is completed.

Provide config directly to a pod via a volume

Many network images, such as the vqfx and vmx images currently supported, allow for the passing of configs at boot time.

This probably won't work with snapshots, but for the rare occasions where it's preferable to boot an image from scratch, providing a config file on first boot can help save precious seconds.
