nre-learning / antidote-core Goto Github PK



💉 Core Services that make up the Antidote Platform

License: Apache License 2.0

Makefile 1.55% Go 92.16% Dockerfile 0.32% Shell 5.96%

golang kubernetes


antidote-core's Issues

Timeout earlier on job failure

If you see failures, you can return an error via the API much sooner.

Remove TopologyType field

Not needed anymore. Was going to be useful if we used read-only, shared topologies, but we abandoned that approach.

Will need to rewrite scheduler to use presence of devices in lieu of this explicit field
Update defs
Update existing lessons

Pick up existing lessons after restart

Syringe holds the state of which lessons are provisioned in memory, so if Syringe restarts, it loses track of them. This isn't too bad: now that GC is in place and works purely from Kubernetes calls, the orphaned lessons will get cleaned up eventually. But it means everyone will need to start new stuff from scratch. It would be nice if, when Syringe gets a request to create a resource, it detected whether one already exists and, if so, added it back to the in-memory map.

Syringe Architectural Changes

Considering a few changes:

Make the API server totally stateless. Keep state only in the scheduler, via Kubelab.
De-couple the scheduler into its own running binary. Communicate between scheduler and API via message queue
Use database behind scheduler (shouldn't need this for API)

k8s mock for unit tests

https://medium.com/@e_frogers/unit-testing-with-kubernetes-client-go-283b11aaa7db

Potential race condition and poor handling of kubelab state

I was working on the NAPALM lesson, which is the first to use jupyter notebooks as the lab guide. It uses the notebook for the lab guide for stages 1, 2, and 4. For stage 3 it will revert back to a markdown file.

EVERY ONCE IN A WHILE (and I mean very rarely) it will fail to load the markdown version. It was very difficult to reproduce this. As a result of this, it's also not clear whether the problem is caused by starting at stage 1 and navigating to stage 3, or if it could happen any time. I only tried going directly to stage 3 a few times but it never happened when doing that. However, with the infrequency of the event, it's still inconclusive.

I still haven't gotten to the root cause, but digging through the scheduler code made me realize how fragile and shitty my approach is to mapping kubelab to livelessons, handling kubelab state, applying state changes between stages, and so on. This whole thing needs to be revamped.

I hate discussing a solution before I've REALLY nailed down the problem, but after spending hours trying to reliably reproduce, I'm leaning towards just rebuilding the scheduler to not suck and assuming that will take care of it.

For instance, there's a global var in scheduler.go:

kubeLabs                           = map[string]*KubeLab{}

I'm only writing to this map in one place, and in my testing, I was only spinning up lessons for me, so there shouldn't have been concurrent writes (though obviously I should still be using a Mutex or something to control concurrent writes). I also am not sure of the implications of this global variable as opposed to a property of the scheduler.
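As a rough illustration of the mutex idea above, here is a minimal sketch in which the map lives inside a small store type that the scheduler could own as a field, rather than as a package-level global. The `kubeLabStore` type, its methods, and the trimmed-down `KubeLab` struct are hypothetical and are not taken from the Syringe code.

```go
package main

import (
	"fmt"
	"sync"
)

// KubeLab is a stand-in for the scheduler's real type; only the fields
// needed for this sketch are included.
type KubeLab struct {
	LessonID int
	Stage    int
}

// kubeLabStore is a hypothetical wrapper that owns the map and guards it
// with a RWMutex, so it can be a field of the scheduler instead of a global.
type kubeLabStore struct {
	mu   sync.RWMutex
	labs map[string]*KubeLab
}

func newKubeLabStore() *kubeLabStore {
	return &kubeLabStore{labs: make(map[string]*KubeLab)}
}

// Set stores a kubelab under its UUID, taking the write lock.
func (s *kubeLabStore) Set(uuid string, kl *KubeLab) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.labs[uuid] = kl
}

// Get looks up a kubelab by UUID, taking only the read lock.
func (s *kubeLabStore) Get(uuid string) (*KubeLab, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	kl, ok := s.labs[uuid]
	return kl, ok
}

func main() {
	store := newKubeLabStore()

	// Many goroutines writing at once; without the lock this would be a
	// data race on the map.
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			store.Set(fmt.Sprintf("19-session%d", n), &KubeLab{LessonID: 19, Stage: 1})
		}(i)
	}
	wg.Wait()

	_, ok := store.Get("19-session7")
	fmt.Println("found:", ok)
}
```

Running the scheduler's tests with `go test -race` would be one way to check whether the current global map is actually being written concurrently.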

Fail validation if there are missing config files for devices

Need a check in Syringe if there are no configs but there are devices.

Confusing Lesson DIR parameters

Some config parameters point to something like /antidote, others /antidote/lessons.

Should reposition configuration input as "curriculum". This will allow you to incorporate collections resources underneath this as well.

trace from launch day

time="2018-10-11T08:00:15Z" level=info msg="New KubeLab for lesson 19 is of TopologyType custom"
time="2018-10-11T08:00:15Z" level=debug msg="Creating devices and connections"
time="2018-10-11T08:00:15Z" level=error msg="Problem creating network vqfx1-vqfx2-net: network-attachment-definitions.k8s.cni.cncf.io \"vqfx1-vqfx2-net\" is forbidden: unable to create new content in namespace 19-rapsjjo3ypp6m1bc-ns because it is being terminated"
time="2018-10-11T08:00:15Z" level=error msg="network-attachment-definitions.k8s.cni.cncf.io \"vqfx1-vqfx2-net\" is forbidden: unable to create new content in namespace 19-rapsjjo3ypp6m1bc-ns because it is being terminated"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xf56869]
goroutine 447350 [running]:
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).createKubeLab(0xc420409520, 0xc4218010c0, 0xc422fb37a0, 0x2, 0x2)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:347 +0xd79
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).handleRequest(0xc420409520, 0xc4218010c0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:120 +0xbcd
created by github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:110 +0xb5

might be solved by #5

Delaying incoming requests until after nuke finishes

Kind of a niche problem, but requests for lessons that arrive immediately after startup, before the nuke has completed, get a 408 error. We should probably cache all incoming requests while nuking is in progress, and wait to process them until it has finished.

time="2019-01-13T22:40:27Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:40:31 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:40:32Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
time="2019-01-13T22:40:37Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:40:41 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:40:42Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
time="2019-01-13T22:40:47Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:40:51 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:40:52Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
ERROR: 2019/01/13 22:40:53 grpc: server failed to encode response:  rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
10.40.0.3 - - [13/Jan/2019:22:40:56 +0000] "GET /exp/lessondef/all HTTP/1.1" 200 143850
10.40.0.3 - - [13/Jan/2019:22:40:56 +0000] "GET /exp/lessondef/15 HTTP/1.1" 200 27172
time="2019-01-13T22:40:57Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.47.0.0 - - [13/Jan/2019:22:41:01 +0000] "GET / HTTP/1.1" 200 2
time="2019-01-13T22:41:02Z" level=debug msg="Waiting for namespace 15-jjtigg867ghr3gye-ns to delete..."
10.40.0.3 - - [13/Jan/2019:22:40:57 +0000] "POST /exp/livelesson HTTP/1.1" 408 66
10.40.0.3 - - [13/Jan/2019:22:41:07 +0000] "GET /exp/syringeinfo HTTP/1.1" 200 114

Should also consider handling this error better in antidote-web. The behavior on that side is that the loading screen constantly shows, when in fact it broke on the initial error within seconds. Should go through all requests on that side and make sure nothing is able to fail silently like that.
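One possible way to hold requests back until the nuke completes is a simple gate built on a channel that is closed once startup cleanup is done. This is only a sketch: `startupGate` and its methods are hypothetical names, and the sleep stands in for the real namespace deletion.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// startupGate blocks callers until Open is called. Hypothetical helper,
// not part of Syringe.
type startupGate struct {
	ready chan struct{}
	once  sync.Once
}

func newStartupGate() *startupGate {
	return &startupGate{ready: make(chan struct{})}
}

// Open releases all current and future waiters. Safe to call more than once.
func (g *startupGate) Open() {
	g.once.Do(func() { close(g.ready) })
}

// IsOpen reports, without blocking, whether Open has been called.
func (g *startupGate) IsOpen() bool {
	select {
	case <-g.ready:
		return true
	default:
		return false
	}
}

// Wait blocks until the gate opens or ctx is cancelled.
func (g *startupGate) Wait(ctx context.Context) error {
	select {
	case <-g.ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	gate := newStartupGate()

	// Stand-in for the startup nuke of old namespaces.
	go func() {
		time.Sleep(50 * time.Millisecond)
		gate.Open()
	}()

	// A request handler would wait here instead of failing immediately.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := gate.Wait(ctx); err != nil {
		fmt.Println("still nuking, rejecting request:", err)
		return
	}
	fmt.Println("nuke finished, processing request")
}
```

The timeout on the context keeps a request from hanging forever if the nuke itself gets stuck, in which case an explicit error can be returned to antidote-web.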

kubelab crashes


10.38.0.38 - - [17/Feb/2019:18:22:41 +0000] "GET /exp/lessondef/17 HTTP/1.1" 200 4235
time="2019-02-17T18:22:41Z" level=debug msg="Scheduler received new request. Sending to handle function." Operation=3 Stage=2 Uuid=17-wxd1zqhqmej6qoiv
panic: runtime error: index out of range
goroutine 180539 [running]:
github.com/nre-learning/syringe/scheduler.(*KubeLab).ToLiveLesson(0xc420c6eae0, 0xc421310a20)
	/go/src/github.com/nre-learning/syringe/scheduler/kubelab.go:105 +0xb7f
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).handleRequestMODIFY(0xc4201edc70, 0xc420b699c0)
	/go/src/github.com/nre-learning/syringe/scheduler/requests.go:176 +0x146
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).(github.com/nre-learning/syringe/scheduler.handleRequestMODIFY)-fm(0xc420b699c0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:93 +0x34
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start.func2(0xc4206d2810, 0xc420b699c0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:107 +0x6d
created by github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:106 +0x3ff

Metrics not quite right

Something changed in the last release. Ever since last Monday (v0.2.0), the metrics have been askew:

This contradicts the current namespaces (the only session is me):

kubectl get ns

NAME                     STATUS   AGE
14-jjtigg867ghr3gye-ns   Active   46m
default                  Active   25d
kube-public              Active   25d
kube-system              Active   25d
prod                     Active   25d
ptr                      Active   25d

Certificate management for iframe resources

Previously, there was no generic iframe resource - it was specific to jupyter notebooks. Thus, it was easy to load the letsencrypt cert into the docker image because I controlled the build.

However, if we're planning to allow generic web resources to be embedded via iframe, this means in order to display the content, it needs to be a trusted cert. So we need to put some thought into how we're going to do this - I can't always slipstream a cert into every image that needs it.

Add authentication to grpc

Shouldn't panic on failure to connect to k8s

panic: Post https://10.96.0.1:443/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: dial tcp 10.96.0.1:443: connect: connection refused
goroutine 22 [running]:
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).createNetworkCrd(0xc4204c12c0, 0x1386d40, 0xc420568360)
	/go/src/github.com/nre-learning/syringe/scheduler/networks.go:33 +0x94
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start(0xc4204c12c0, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:83 +0x3d
main.main.func2(0xc4204c12c0, 0xc42036aec0)
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:65 +0x2f
created by main.main
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:64 +0x538
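Instead of panicking when the Kubernetes API is unreachable, the startup path could retry a few times and then return an error to the caller. The sketch below assumes a hypothetical `retry` helper and a fake `connect` function; it does not use the real client-go calls from `createNetworkCrd`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable simulates the "connection refused" error from the trace.
var errUnavailable = errors.New("connection refused")

// retry calls fn up to attempts times, doubling the delay between tries.
// It returns nil on the first success, or the last error wrapped with
// context once all attempts are used up.
func retry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Fake connection attempt that succeeds on the third call.
	connect := func() error {
		calls++
		if calls < 3 {
			return errUnavailable
		}
		return nil
	}

	if err := retry(5, 10*time.Millisecond, connect); err != nil {
		// Log and exit cleanly rather than panic.
		fmt.Println("scheduler could not start:", err)
		return
	}
	fmt.Println("connected after", calls, "attempts")
}
```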

State management abstraction

To prepare for the upcoming move to external state, we should prepare an abstraction for state management. Support internal state as we do today, but through an interface that could also be satisfied by an external driver like etcd.

Should continue to support both for different use cases. Be able to configure the plugin choice via env
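A minimal sketch of what such an abstraction might look like follows. The `StateStore` interface, the `LessonState` struct, and the `SYRINGE_STATE_DRIVER` environment variable are all assumptions made for illustration; only the in-memory driver is implemented here, and an etcd-backed driver would be another type satisfying the same interface.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"sync"
)

// LessonState is a hypothetical, trimmed-down record of a provisioned lesson.
type LessonState struct {
	UUID  string
	Stage int
}

// ErrNotFound is returned when no state exists for a UUID.
var ErrNotFound = errors.New("lesson state not found")

// StateStore is the abstraction both internal and external drivers satisfy.
type StateStore interface {
	Put(ls LessonState) error
	Get(uuid string) (LessonState, error)
	Delete(uuid string) error
}

// memoryStore keeps state in process memory, as Syringe does today.
type memoryStore struct {
	mu      sync.RWMutex
	lessons map[string]LessonState
}

func (m *memoryStore) Put(ls LessonState) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lessons[ls.UUID] = ls
	return nil
}

func (m *memoryStore) Get(uuid string) (LessonState, error) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	ls, ok := m.lessons[uuid]
	if !ok {
		return LessonState{}, ErrNotFound
	}
	return ls, nil
}

func (m *memoryStore) Delete(uuid string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.lessons, uuid)
	return nil
}

// newStateStore picks a driver by name; unknown names are an error.
func newStateStore(driver string) (StateStore, error) {
	switch driver {
	case "", "memory":
		return &memoryStore{lessons: make(map[string]LessonState)}, nil
	default:
		return nil, fmt.Errorf("unsupported state driver %q", driver)
	}
}

func main() {
	// SYRINGE_STATE_DRIVER is an assumed variable name, not an existing one.
	store, err := newStateStore(os.Getenv("SYRINGE_STATE_DRIVER"))
	if err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
	_ = store.Put(LessonState{UUID: "14-jjtigg867ghr3gye", Stage: 1})
	ls, err := store.Get("14-jjtigg867ghr3gye")
	fmt.Println(ls, err)
}
```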

Improve logging for endpoint configuration

Config logs don't contain any context of what they're configuring.

We should also error out earlier than 10 or so failures. Like 3 should be fine.

10.38.0.38 - - [11/Mar/2019:00:48:07 +0000] "GET /exp/lessondef HTTP/1.1" 200 181938
time="2019-03-11T00:48:09Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:09Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:11 +0000] "GET /exp/livelesson/21-qofps64daw5eqzf6 HTTP/1.1" 200 2778
10.38.0.38 - - [11/Mar/2019:00:48:11 +0000] "GET /exp/lessondef HTTP/1.1" 200 181938
10.38.0.0 - - [11/Mar/2019:00:48:13 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:14Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:14Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:15 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:19Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:19Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.0 - - [11/Mar/2019:00:48:23 +0000] "GET / HTTP/1.1" 200 2
10.38.0.38 - - [11/Mar/2019:00:48:23 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:24Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:24Z" level=error msg="Problem configuring with config-vqfx2"
time="2019-03-11T00:48:29Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:29Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:32 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
10.38.0.0 - - [11/Mar/2019:00:48:33 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:34Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:34Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:39 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:39Z" level=info msg="Job Status" active=1 failed=2 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:39Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.0 - - [11/Mar/2019:00:48:43 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:44Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:44Z" level=error msg="Problem configuring with config-vqfx2"
time="2019-03-11T00:48:47Z" level=debug msg="No old namespaces found. No need to GC."
10.38.0.38 - - [11/Mar/2019:00:48:48 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:48:49Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:49Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.0 - - [11/Mar/2019:00:48:53 +0000] "GET / HTTP/1.1" 200 2
time="2019-03-11T00:48:54Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:54Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "GET /exp/lessondef/16 HTTP/1.1" 200 23674
time="2019-03-11T00:48:58Z" level=debug msg="Scheduler received new request. Sending to handle function." Operation=4 Stage=0 Uuid=16-qofps64daw5eqzf6
time="2019-03-11T00:48:58Z" level=debug msg="Booping 16-qofps64daw5eqzf6-ns"
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "POST /exp/livelesson HTTP/1.1" 200 28
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "GET /exp/livelesson/16-qofps64daw5eqzf6 HTTP/1.1" 200 4451
10.38.0.38 - - [11/Mar/2019:00:48:58 +0000] "GET /exp/lessondef HTTP/1.1" 200 181938
time="2019-03-11T00:48:59Z" level=info msg="Job Status" active=1 failed=3 jobName=config-vqfx2 successful=0
time="2019-03-11T00:48:59Z" level=error msg="Problem configuring with config-vqfx2"
10.38.0.38 - - [11/Mar/2019:00:49:00 +0000] "GET /exp/livelesson/15-qofps64daw5eqzf6 HTTP/1.1" 200 7144
time="2019-03-11T00:49:02Z" level=debug msg="Recording periodic influxdb metrics"
time="2019-03-11T00:49:02Z" level=debug msg="Creating influxdb point: ID: 17 | NAME: Version Control with Git | ACT
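The early exit suggested in this issue could be sketched as a small check over the job's counters, which also puts the job name into the error so the log line carries context. The `jobStatus` struct mirrors the fields printed in the "Job Status" log lines above, but the type, the `checkJob` function, and the threshold constant are hypothetical.

```go
package main

import "fmt"

// jobStatus mirrors the counters shown in the "Job Status" log lines.
type jobStatus struct {
	JobName    string
	Active     int32
	Failed     int32
	Successful int32
}

// maxJobFailures is the suggested cutoff; 3 is the value proposed above.
const maxJobFailures = 3

// checkJob reports whether polling should stop, and returns an error that
// names the job once the failure threshold has been reached.
func checkJob(st jobStatus) (done bool, err error) {
	if st.Successful > 0 {
		return true, nil
	}
	if st.Failed >= maxJobFailures {
		return true, fmt.Errorf("job %s failed %d times, giving up", st.JobName, st.Failed)
	}
	return false, nil
}

func main() {
	polls := []jobStatus{
		{JobName: "config-vqfx2", Active: 1, Failed: 2},
		{JobName: "config-vqfx2", Active: 1, Failed: 3},
	}
	for _, st := range polls {
		done, err := checkJob(st)
		fmt.Println("done:", done, "err:", err)
	}
}
```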

Configuration improvements

If the configs are the same between stages don't try reconfiguring

Also should clean up the approach with merge/overwrite

iframe resource shouldn't be tested

time="2019-01-15T06:49:08Z" level=debug msg="Connectivity testing endpoint selfservice via :0"
time="2019-01-15T06:49:08Z" level=debug msg="Connectivity testing endpoint selfservice via 10.107.156.241:5000"

It works right now so be careful removing anything important

Interactive wizard for building a new lesson and a new collection.

Should use the same functionality regardless of the resource type. Build it once, and iterate over the fields of whatever resource type is specified.
Should probably embed questions and choices into the data model somehow.

Should also update docs with a page on creating a new lesson/collection/etc that uses this.

Strange GC behavior

time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Deleted namespace 19-ne14ch9gqwobnrr4-ns"
time="2019-01-29T22:07:59Z" level=info msg="Finished garbage-collecting 5 old lessons"
time="2019-01-29T22:07:59Z" level=debug msg="Received result from scheduler." Operation=3

Standardize logging configuration

We need to be able to configure a syslog destination, or a file. Currently just statically outputting to stdout.

Should also standardize the logging output itself. Like, for every message, lesson ID and session ID should be implicitly provided so it's always there.

Collections

Need a new resource type: collection.

You can refer to these collections in the lesson definition (should load collections first from YAML files, and then add a check to make sure the collection ID exists).

The collection resource will have all of the metadata needed to construct a page describing that collection.

TODO

In the web UI - support a mock collection set, and provide an option to declare which one should be top of the list, so we can use this for evangelism.

Add ability to mark a session ID to bypass GC

Explore locking down created bridge networks too, if necessary

Follow up to nre-learning/antidote#43

Need to track spin-up time in influx

Also add to grafana. Then write stress tests and watch this graph.

TLC to the configuration abilities in Syringe

Shell scripts for non-network devices
Remove configs from lesson definition - just rely on files placed
Update existing lessons to remove configs
Put in check on import to ensure that all files are in the right place

If the configs are the same between stages don't try reconfiguring

Also should clean up the approach with merge/overwrite

grpc error when removing whitelist entry

[mierdin@antidote-controller-lklr ~]$ kubectl exec -n=prod syringe-fbc65bdf5-zf4l4 syrctl wl remove m9eujuuzvqiq47b4
rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
command terminated with exit code 1

influx error

10.47.0.0 - - [12/Feb/2019:02:40:52 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:02 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:12 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:22 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:41:32 +0000] "GET / HTTP/1.1" 200 2
time="2019-02-12T02:41:34Z" level=debug msg="No old namespaces found. No need to GC."
10.47.0.0 - - [12/Feb/2019:02:41:42 +0000] "GET / HTTP/1.1" 200 2
time="2019-02-12T02:41:52Z" level=debug msg="Recording periodic influxdb metrics"
time="2019-02-12T02:41:52Z" level=debug msg="Creating influxdb point: ID: 15 | NAME: Event-Driven Network Automation with StackStorm | ACTIVE: 1"
time="2019-02-12T02:41:52Z" level=debug msg="Creating influxdb point: ID: 30 | NAME: Network Automation with Salt | ACTIVE: 1"
10.47.0.0 - - [12/Feb/2019:02:41:52 +0000] "GET / HTTP/1.1" 200 2
10.47.0.0 - - [12/Feb/2019:02:42:02 +0000] "GET / HTTP/1.1" 200 2
10.40.2.110 - - [12/Feb/2019:02:42:03 +0000] "GET /exp/lessondef/30 HTTP/1.1" 200 8746
time="2019-02-12T02:42:03Z" level=debug msg="Scheduler received new request. Sending to handle function." Operation=4 Stage=0 Uuid=30-neiflnh3stq6q9yl
time="2019-02-12T02:42:03Z" level=debug msg="Booping 30-neiflnh3stq6q9yl-ns"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xfce8a1]
goroutine 2372 [running]:
github.com/nre-learning/syringe/api/exp.(*server).recordRequestTSDB(0xc4204b5bc0, 0xc42035e900, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/api/exp/influxdb.go:106 +0x321
github.com/nre-learning/syringe/api/exp.(*server).RequestLiveLesson(0xc4204b5bc0, 0x142f560, 0xc420820210, 0xc42035e8c0, 0xc4204b5bc0, 0xc420820150, 0x1199de0)
	/go/src/github.com/nre-learning/syringe/api/exp/livelessons.go:86 +0x651
github.com/nre-learning/syringe/api/exp/generated._LiveLessonsService_RequestLiveLesson_Handler(0x12d9060, 0xc4204b5bc0, 0x142f560, 0xc420820210, 0xc420333220, 0x0, 0x0, 0x0, 0xc420c0cc80, 0x16)
	/go/src/github.com/nre-learning/syringe/api/exp/generated/livelesson.pb.go:1016 +0x241
github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc42045ad80, 0x143b720, 0xc420516d80, 0xc420474b00, 0xc4204b5ce0, 0x1d9dad8, 0x0, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:966 +0x4bc
github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).handleStream(0xc42045ad80, 0x143b720, 0xc420516d80, 0xc420474b00, 0x0)
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:1245 +0xd69
github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc4203de070, 0xc42045ad80, 0x143b720, 0xc420516d80, 0xc420474b00)
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:685 +0x9f
created by github.com/nre-learning/syringe/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
	/go/src/github.com/nre-learning/syringe/vendor/google.golang.org/grpc/server.go:683 +0xa1

Deprecate IGNORE_DISABLED in favor of TIER var

Need to lock a lesson when it's being changed

Will need to lock down a lesson when it's being changed, for instance you can send a request for stage 1 and then stage 2 in succession and cause a race condition.
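One way to serialize changes to a single lesson is a lock per lesson UUID, so that a stage 1 request and a stage 2 request for the same session cannot interleave. The `lessonLocks` type below is a hypothetical sketch, not existing Syringe code.

```go
package main

import (
	"fmt"
	"sync"
)

// lessonLocks hands out one mutex per lesson UUID.
type lessonLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newLessonLocks() *lessonLocks {
	return &lessonLocks{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for uuid, creating it on first use.
func (l *lessonLocks) lockFor(uuid string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	m, ok := l.locks[uuid]
	if !ok {
		m = &sync.Mutex{}
		l.locks[uuid] = m
	}
	return m
}

// WithLock runs fn while holding the lock for uuid, so changes to the same
// lesson are applied one at a time.
func (l *lessonLocks) WithLock(uuid string, fn func()) {
	m := l.lockFor(uuid)
	m.Lock()
	defer m.Unlock()
	fn()
}

func main() {
	locks := newLessonLocks()
	stage := 0

	// Two stage-change requests for the same session arriving together.
	var wg sync.WaitGroup
	for _, requested := range []int{1, 2} {
		wg.Add(1)
		go func(s int) {
			defer wg.Done()
			locks.WithLock("17-wxd1zqhqmej6qoiv", func() {
				stage = s
			})
		}(requested)
	}
	wg.Wait()
	fmt.Println("final stage:", stage)
}
```

This sketch never removes entries from the map; a real implementation would probably drop the lock when the lesson is garbage-collected. It also only guarantees that the changes do not overlap, not the order in which they are applied.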

Duplicate env vars in config

LessonDir vs LessonsDir

Augment API to provide more insight into lesson spin-up progress

TODO

Open sister PR in antidote-web to take advantage of this

namespace delete

time="2018-08-24T22:10:00Z" level=error msg="failed to create namespace, not creating kubelab"
time="2018-08-24T22:10:00Z" level=error msg="Error creating lesson: namespaces \"12-6viedvg5rctwdpcd-ns\" already exists"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xf0e5f9]
goroutine 6 [running]:
github.com/nre-learning/syringe/scheduler.(*KubeLab).ToLiveLesson(0x0, 0xc4201b9ef8)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:300 +0x79
github.com/nre-learning/syringe/scheduler.(*LessonScheduler).Start(0xc42033b4e0, 0x0, 0x0)
	/go/src/github.com/nre-learning/syringe/scheduler/scheduler.go:98 +0x53e
main.main.func2(0xc42033b4e0, 0xc420367430)
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:78 +0x2f
created by main.main
	/go/src/github.com/nre-learning/syringe/cmd/syringed/main.go:77 +0x4c9

Print reason for skipping a lesson on import (i.e. due to tier, etc)

Should also add a message that indicates that a lesson was skipped because of tier

Add ability to kill a session via syrctl

Can't just kill the NS with kubectl, since the state will still be in syringe. Need to trigger a syringe cleanup

Enforce DNS-1123 resource name compliance on ingest

Endpoint names currently have no validation on ingest that forces compliance with kubernetes standards. See below error when trying to create an endpoint name with capital letters.

time="2018-11-06T07:41:26Z" level=error msg="Problem creating pod StackStorm: Pod \"StackStorm\" is invalid: [metadata.name: Invalid value: \"StackStorm\": a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.containers[0].name: Invalid value: \"StackStorm\": a DNS-1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name',  or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')]"

Validation should be added to all relevant fields in db types that will result in a kubernetes resource name that conforms to this spec.
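As a sketch of what an ingest-time check might look like, the function below applies the DNS-1123 label regex quoted in the error message above, plus the 63-character label limit. The `validateEndpointName` function name is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1123Label is the label regex from the Kubernetes error message,
// anchored so the whole name must match.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// validateEndpointName rejects names that Kubernetes would refuse as a
// DNS-1123 label (used for container names, among others).
func validateEndpointName(name string) error {
	if len(name) == 0 || len(name) > 63 {
		return fmt.Errorf("endpoint name %q must be 1 to 63 characters", name)
	}
	if !dns1123Label.MatchString(name) {
		return fmt.Errorf("endpoint name %q is not a valid DNS-1123 label", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"StackStorm", "stackstorm", "vqfx1", "-bad"} {
		fmt.Printf("%s: %v\n", name, validateEndpointName(name))
	}
}
```

The Kubernetes apimachinery module ships its own validation helpers for this, which would likely be a better choice than a hand-rolled regex if that dependency is already vendored.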

Remove URL for lesson guide

Shouldn't be too tough to read the files from the expected location on request, and pass that data through the API itself.

Should gracefully handle influx connection failures

Syringe isn't crashing, but the goroutine which periodically publishes usage data seems to be.

Should make sure we keep trying to connect, and should also improve logging.

Remove subnet requirement from connections list

Add “quiz” resource type

Should support branching questions (with limits).

/cc @riw777 for his input on this

Use key-based auth instead of static password for endpoints

Adjust NetworkPolicy to allow git-clone init pod connection to github

Lesson spares

It should be easy to set, on a per-kubelab basis, the number of spares to keep ready at all times. Default to 0. Tweak as demand rises and as costs can handle. Will need to figure out how to deal with namespaces. Maybe create the pod ahead of time and then add the service into the namespace? That will ensure DNS consistency with JIT provisioned stuff. Need an API for this, so that we can not only adjust these levels at runtime, but also get metrics on how long lesson spares have remained unused, so we can optimize the configuration over time.

The problem you'll have to deal with is that resources can't have their namespaces changed. So if you are provisioning stuff ahead of time, when they're "called up" to be used, you'll have to access them where they live. Garbage collection will also need to be changed to clean up resources created JIT or in advance.

Run initial clone directly from Syringe

Currently, there are two places you have to update the lesson repo in Syringe, since Syringe must use an init container to clone its own copy of the lesson directory, but then the environment variable is used to create init containers for all pods and jobs spawned by Syringe.

We should use the environment variable as the source of truth, and use something like https://github.com/src-d/go-git to take care of the initial clone, rather than an init container for the syringe pod(s). The pods/jobs spawned by syringe can and should continue to use init containers, however, as these are all informed by the env variable.

This will also make selfmedicate simpler.

NOTE that #75 introduces the use of local lesson directory for Syringe. When implementing the feature for this issue, be careful to honor this new functionality as well - only clone if the configuration permits. Otherwise use the local directory.

Disable influxdb writes by default - enable them explicitly via env

This is so the dev logs don't get cluttered with connection failed messages
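An opt-in flag could be as small as the sketch below. The variable name `SYRINGE_INFLUXDB_ENABLED` and the `influxEnabled` helper are assumptions for illustration; anything unset or unparsable is treated as disabled.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// influxEnabled reports whether influxdb writes were explicitly enabled.
// getenv is injected so the function can be tested without touching the
// real environment; pass os.Getenv in production.
func influxEnabled(getenv func(string) string) bool {
	// SYRINGE_INFLUXDB_ENABLED is a hypothetical variable name.
	v, err := strconv.ParseBool(getenv("SYRINGE_INFLUXDB_ENABLED"))
	return err == nil && v
}

func main() {
	if !influxEnabled(os.Getenv) {
		fmt.Println("influxdb writes disabled; skipping metrics")
		return
	}
	fmt.Println("influxdb writes enabled")
}
```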

Mocked testing

https://medium.com/@e_frogers/unit-testing-with-kubernetes-client-go-283b11aaa7db

Syringe database back-end

#4 is a temporary fix, but the long-term fix is to back-end Syringe with a proper database for the small amount of state it keeps.

This will allow you to create multiple instances of syringe, as they'll be stateless. When they need to make a change, lock the value in etcd, change, and unlock. Do things properly

This will make #3 unnecessary, as when syringe crashes, the state will be kept elsewhere. It will also mean that we can truly have a proper load-balancing setup, as everything will scale out. What little state exists will exist in a database built to keep it distributed and properly accessed/locked.

Increase syringe replicas to 3 once done

Also should make sure that when a new request comes in, other requests are delayed or declined until the first is completed.

Provide config directly to a pod via a volume

Many network images, such as the vqfx and vmx images currently supported, allow for the passing of configs at boot time.

This probably won't work with snapshots, but for the rare occasions where it's preferable to boot an image from scratch, providing a config file on first boot can help save precious seconds.
