
ram's People

Contributors

brunoreboul, dependabot[bot], sdenef-adeo


ram's Issues

Missing assets in BQ asset table leading to missing ancestryPaths in result dashboards

Usually this corner case is not that visible, as for many asset types the information is already provided by the RESOURCE feed.
Nevertheless, the issue is visible when:

  • RAM is set up to focus on IAM policies only (no RESOURCE feed)
  • An asset type has IAM policies and no configured RESOURCE feed

Fix: add the instance stream2bq_iam_assets to the microservice stream2bq, configured to be triggered on the Pubsub topic cai-iam-policies and to write to the BQ table assets.

ram -config configures the stream2bq instances using the method deployment.configureStream2bqAssetTypes() in the package ramcli.

AssetType is empty for Cloud Identity GroupsSettings and GroupMembers assets in `last_compliancestatus` view

AssetName pattern
//directories//groups//groupSettings
//directories//groups//members/

The number of groups, groupsettings and groupmembers may be 10x the number of GCP assets.
There is no valuable ancestryPath for these assets.
As a consequence, these assets are not streamed into the BQ asset table: the cost/value ratio is too low.

In the `last_compliancestatus` view, the assetType comes from the asset table, which explains why assetType is empty for these assets.

Workaround: update the `last_compliancestatus` view SQL code to deduce the assetType from the asset name for these assets, per the mapping below (an illustrative sketch follows):

www.googleapis.com/admin/directory/groups
www.googleapis.com/admin/directory/members
groupssettings.googleapis.com/groupSettings
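
For illustration, the equivalent mapping logic in Go could look like the sketch below; the actual fix lives in the view's SQL, and the function name is hypothetical.

package main

import (
    "fmt"
    "strings"
)

// deduceAssetType infers the assetType from an asset name for the
// Cloud Identity assets that are not streamed to the BQ asset table.
// Illustrative only: the real fix is implemented in the view's SQL.
func deduceAssetType(assetName string) string {
    switch {
    case strings.HasSuffix(assetName, "/groupSettings"):
        return "groupssettings.googleapis.com/groupSettings"
    case strings.Contains(assetName, "/members/"):
        return "www.googleapis.com/admin/directory/members"
    case strings.Contains(assetName, "/groups/"):
        return "www.googleapis.com/admin/directory/groups"
    default:
        return "" // unknown pattern: leave assetType empty
    }
}

func main() {
    fmt.Println(deduceAssetType("//directories/d1/groups/g1/groupSettings"))
}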

"missing go.sum entry" error: RAM deployment failed in Cloud Build

Error:

Step #1 - "build a fresh ram cli": Already have image: golang
Step #1 - "build a fresh ram cli": go: github.com/BrunoReboul/[email protected]: missing go.sum entry; to add it:
Step #1 - "build a fresh ram cli":  go mod download github.com/BrunoReboul/ram

Occurs on any build, including existing successful builds when hitting RETRY.

Root Cause:

The Cloud Functions Go runtime is bound to Go v1.13.

RAM developers' Go version is aligned to the same version, to avoid using new Go features that won't be available on Cloud Functions:

go version go1.13.8 linux/amd64

The Cloud Build environment uses the official golang container image, which is continuously updated and currently uses Go v1.16:

Step #0 - "display go language version": Status: Downloaded newer image for golang:latest
Step #0 - "display go language version": docker.io/library/golang:latest
Step #0 - "display go language version": go version go1.16 linux/amd64

With Go 1.16, the default value of the -mod flag changed:

In Go 1.15 and lower, the -mod=mod flag was enabled by default, so updates were performed automatically. Since Go 1.16, the go command acts as if -mod=readonly were set instead: if any changes to go.mod are needed, the go command reports an error and suggests a fix.

Proposed fix

Update the Cloud Build triggers' build steps definition:

  • add a go version step to ease troubleshooting by always displaying which Go version the golang Docker container uses
  • add the -mod=mod flag to the go build command to allow automatic go.mod updates

steps:
 - name: golang
   args:
     - go
     - version
   id: display go language version
 - name: golang
   args:
     - go
     - build
     - '-mod=mod'
     - ram.go
   id: build a fresh ram cli

As RAM user, I want the Owner and Resolver fields to be always populated so that the % of assigned non-compliance increases and the % of resolution improves too

Always populated means:

  • Have YAML content providing the owner and resolver identity for a given folder ID or org ID (to address the missing GCP labels on folders and orgs)
  • When a resource does not have an owner or resolver label, use the parent project's owner and resolver labels
  • When a project does not have an owner or resolver label, use the parent folder's owner and resolver provided in the YAML content
  • When a folder does not have an owner or resolver label, use the parent folder's owner and resolver provided in the YAML content

Good practice: at least have the owner and resolver described for each top-level folder.

To deal with emails in label values (a sketch follows this list):

  • replace _dot_ with .
  • replace _at_ with @
    example: marie-pierre_dot_dupondt_at_mycompany_dot_com translates to marie-pierre.dupondt@mycompany.com
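
A minimal sketch of this translation, assuming simple string replacement (the function name is illustrative):

package main

import (
    "fmt"
    "strings"
)

// labelValueToEmail converts a GCP label value back into an email
// address: label values cannot contain "." or "@", so they are
// encoded as "_dot_" and "_at_". The function name is illustrative.
func labelValueToEmail(labelValue string) string {
    email := strings.ReplaceAll(labelValue, "_at_", "@")
    return strings.ReplaceAll(email, "_dot_", ".")
}

func main() {
    fmt.Println(labelValueToEmail("marie-pierre_dot_dupondt_at_mycompany_dot_com"))
    // Output: marie-pierre.dupondt@mycompany.com
}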

Update the DataStudio report template accordingly to leverage owner and resolver

As RAM manager I want a `compliant` field (true/false) in `monitor` service finish log entries to ease filtering

Expected behaviour

When searching the monitor service logs in Cloud Logging, I would like to filter log entries depending on the compliance result status, so please add a field for that (a sketch follows the options below).
It could be one of:

  • compliant: true|false
  • status: compliant|not_compliant|whatever
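
Either way, a minimal sketch of what such a structured finish entry could carry, assuming JSON-structured log entries (the type and field names are illustrative, not RAM's actual logging types):

package monitor

// finishEntry sketches the proposed structured finish log entry; the
// type and field names are illustrative, not RAM's actual types.
type finishEntry struct {
    Message   string `json:"message"`   // e.g. "finish compliant ..."
    Compliant bool   `json:"compliant"` // proposed field to filter on
}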

Actual behaviour

In the current version, I have to filter on the message field, which starts with finish compliant|not_compliant.

RAM version

v0.3.2

Dumpinventory should optimize retry when "quota exceeded" / exhausted

Context:

  • All dumpinventory jobs start on the same Cloud Scheduler schedule
  • The number of dumpinventory jobs = (number of asset types in solution.yaml + 1 for IAM) × number of orgs
    • Example: (34 + 1) × 2 = 70
  • Meanwhile, the Cloud Asset Inventory quota is limited to 60 requests/minute
  • The cloud function exits on error / REDO_ON_TRANSIENT, which gets worse as the default backoff does around 100 retries in the first minute, keeping the quota exhausted

Proposed workaround (sketched below):

  • If the error from the requested API is "quota exceeded", then wait for a configurable timer before exiting
  • Pro: fixes the issue
  • Con: more CPU time to pay for while waiting, balanced by it being at most a couple of tens of cloud functions once a week; anyway, useless retries are also paid for when nothing fixes the issue
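
A minimal sketch of the workaround, assuming the quota error is detected from the error text (the helper name and the backoff value are illustrative):

package dumpinventory

import (
    "log"
    "strings"
    "time"
)

// quotaBackoff is the configurable wait before exiting when the API
// reports quota exhaustion (the value here is illustrative).
var quotaBackoff = 90 * time.Second

// waitOnQuotaError sketches the proposed workaround: when the Cloud
// Asset Inventory call fails on quota, sleep before returning the
// error, so the automatic retry does not hit the exhausted quota again.
func waitOnQuotaError(err error) error {
    if err != nil && strings.Contains(strings.ToLower(err.Error()), "quota") {
        log.Printf("quota exceeded, waiting %v before exiting", quotaBackoff)
        time.Sleep(quotaBackoff)
    }
    return err // a non-nil error still triggers the retry
}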

Bug: stream2bq insertID too long

The BigQuery streaming insertID field length is limited to 128 characters.

The violation table insertID may exceed 1,000 characters.

This issue did not pop up before using cloud.google.com/go/bigquery 1.16.0.

Proposed solution: hash the string to reduce its size (sketched below).
The underlying one-way function being deterministic, it preserves BigQuery's best-effort deduplication.
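
A minimal sketch of the fix, assuming SHA-256 (the function name is illustrative):

package stream2bq

import (
    "crypto/sha256"
    "encoding/hex"
)

// hashInsertID sketches the proposed fix: reduce an arbitrarily long
// insertID to a fixed 64-character hex digest. The hash being
// deterministic, identical rows keep identical insertIDs, so
// BigQuery's best-effort streaming deduplication keeps working.
func hashInsertID(insertID string) string {
    sum := sha256.Sum256([]byte(insertID))
    return hex.EncodeToString(sum[:]) // 64 chars, well under the 128 limit
}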

As Compliance Manager, I want single files (YAML, CSV) with all configured constraints so that a text diff tool can be used to compare 1) sets of rules 2) over time

File names: constraints.yaml, constraints.csv
Example of yaml structure:

services:
  - name: bq
    rules:
      - name: dataset_location
        constraints:
          - apiVersion: constraints.gatekeeper.sh/v1alpha1
            kind: GCPBigQueryDatasetLocationConstraintV1
            metadata:
              name: myorg_sanboxes_europe_bq

File location: in the monitor folder.
Add the files to .gitignore to avoid conflicts.
Generate them on any ramcli execution.

Bug: very first deployment of RAM stuck due to a race on the Resource Manager quota

Error

  • Deploying all microservice instances at the same time by using a ram-vx.y.z-env tag does not guarantee in which order the instances are deployed (each deployment is designed to be idempotent)
  • setfeed deployments usually complete in 1min30s, while deployments based on cloud functions complete in 4min30s
  • This leads to activating the real-time triggers while the publish2fs cache has not yet been deployed
  • The monitor instances are triggered by the real-time flows; as the cache is empty, each execution falls back on querying Resource Manager to resolve org / folder / project IDs into displayNames
  • The Resource Manager quota is far smaller than the rate of real-time changes on many existing orgs, leading to continuously exhausting the Resource Manager quota
  • The remaining deployments then all fail, as each deployment needs a couple of queries to Resource Manager to check IAM bindings
  • The result is a deadlock

Workaround

  • Delete the CAI feeds
  • Relaunch all deployments except setfeeds
  • Manually trigger dumpinventory for orgs, folders, and projects; wait for the Firestore cache to be populated
  • Deploy setfeed as the last microservice

Fix

  • Remove the fallback mechanism that queries Resource Manager when the data is not found in the cache, so the deadlock can no longer occur
  • Simplify the install doc accordingly

Refactor initialRetryCheck: when to exit with error, what to log

  1. Get rid of the function InitialRetryCheck, as it does not save any code redundancy
  2. How to exit:
  • exit with error means a log entry with REDO_ON_TRANSIENT + a specific message, e.g.:
    • not being able to retrieve the cloud function metadata means exit with error, to retry
  • exit nil:
    • case ERROR means having logged NORETRY_ERROR + a specific message, e.g.:
      • pubsub message too old
    • case INFO means having logged NORETRY_INFO
  3. Always include pubsub_id %s (unless not retrievable) to enable tracing of retries

For each core.go:

  • Global type: add PubSubID string, keep the structure ordered
  • Initialize: replace "ERROR - " with "" as all issues are returned in an err structure
  • EntryPoint func:
    • Replace InitialRetryCheck with direct code (see the sketch below)
    • Update all log messages
    • Remove the retry / no-retry comments

Decommission initialRetryCheck.
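
A minimal sketch of the proposed exit rules in an EntryPoint, assuming illustrative helpers and message wording (none of these names are RAM's actual API):

package convertfeed2logs

import (
    "context"
    "fmt"
    "log"
)

// PubSubMessage and the helpers below are illustrative stand-ins for
// RAM's actual types; the sketch only shows the proposed exit rules.
type PubSubMessage struct {
    Data []byte `json:"data"`
}

// EntryPoint sketches the proposed convention:
//   - transient issue -> log REDO_ON_TRANSIENT and return an error (retried)
//   - permanent issue -> log NORETRY_ERROR and return nil (acknowledged)
//   - nothing to do   -> log NORETRY_INFO and return nil
func EntryPoint(ctx context.Context, m PubSubMessage) error {
    pubsubID := "hypothetical-id" // in RAM it would come from the event metadata

    if _, err := fetchMetadata(ctx); err != nil { // hypothetical helper
        log.Printf("REDO_ON_TRANSIENT pubsub_id %s cannot retrieve cloud function metadata: %v", pubsubID, err)
        return fmt.Errorf("fetchMetadata: %v", err) // non-nil => retried
    }
    if tooOld(m) { // hypothetical helper
        log.Printf("NORETRY_ERROR pubsub_id %s pubsub message too old", pubsubID)
        return nil // nil => acknowledged, never retried
    }
    log.Printf("NORETRY_INFO pubsub_id %s finish", pubsubID)
    return nil
}

func fetchMetadata(ctx context.Context) (string, error) { return "metadata", nil }
func tooOld(m PubSubMessage) bool                       { return false }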

ram -pipe without -service nor -instance leads to weird regex trigger tags on the splitdump and publish2fs microservices

ram -pipe without -service nor -instance means: target all microservices, all instances.
Weird means:

  • (^publish2fs_cloudresourcemanager_Folder-v\d*.\d*.\d*-prd)|(^publish2fs-v\d*.\d*.\d*-prd)|(^ram-v\d*.\d*.\d*-prd)|(^container_Cluster-v\d*.\d*.\d*-prd)
  • (^splitdump_single_instance-v\d*.\d*.\d*-prd)|(^splitdump-v\d*.\d*.\d*-prd)|(^ram-v\d*.\d*.\d*-prd)|(^bigquery_Dataset-v\d*.\d*.\d*-prd)

Upgrade dependencies v0.3.1

Last update was on 2020-07-08
Update dependencies:

cloud.google.com/go v0.60.0 => v0.72.0
cloud.google.com/go/bigquery v1.9.0 => v1.13.0
cloud.google.com/go/firestore v1.2.0 => v1.3.0
cloud.google.com/go/logging v1.0.0 => v1.1.2
cloud.google.com/go/pubsub v1.4.0 => v1.8.3
cloud.google.com/go/storage v1.10.0 => v1.12.0
github.com/google/uuid v1.1.1 => v1.1.2
github.com/open-policy-agent/opa v0.21.0 => v0.24.0
golang.org/x/oauth2 v0.0.0-20200107190931-bf48bf16ab8d => v0.0.0-20201109201403-9fd604954f58
google.golang.org/api v0.28.0 => v0.35.0
google.golang.org/genproto v0.0.0-20200626011028-ee7919e894b5 => v0.0.0-20201119123407-9b1e624d6bc4
gopkg.in/yaml.v2 v2.3.0 => v2.4.0

Name length greater than 63 leads to failed cloud function deployment

Error

Deployment of the cloud function instance dumpinventory_org_cloudresourcemanager_Organization failed.

Cause

Cloud function name length is limited to 63 characters.

OrganizationIDs have variable length: the fixed part of the name is 51 characters, leaving 12 for the organizationID, while it can be longer (14 observed).

Fix

Truncate the cloud function name to a maximum of 63 characters, as sketched below.
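
A minimal sketch of the fix (the helper name is illustrative):

package ramcli

// truncateName sketches the fix: cap a generated cloud function name
// at the 63-character limit. The helper name is illustrative.
func truncateName(name string) string {
    const maxLen = 63
    if len(name) > maxLen {
        return name[:maxLen]
    }
    return name
}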

Bug: failure during cold start must exit with error to avoid an invalid cloud function instance receiving more traffic

When errors occur during the initialize function, they are logged on purpose as basic log entries, so that the cloud function terminates without error and retries are avoided.

As retry is avoided by design, the pubsub message is acknowledged, and so it is not persisted to BigQuery, leading to data losses.

This behavior addresses non-transient errors well, like a 403 missing permission, where 1) retrying will not solve the 403 and 2) it avoids having to pay for compute time during 1 hour of retries.

On the other hand, if the error is transient and occurs for a proportion of executions, like 443: write: connection reset by peer, then it leads to losing the same proportion of pubsub messages, while they would be worth retrying.

As RAM manager, I want a tool to check whether configured instances have a Cloud Build trigger, so that it saves the time of doing this control manually

A group of instances means:

  • the instances of a microservice, e.g., upload2gcs
  • the instances of all microservices aka RAM
  • the instances related to an asset type (TO BE CHALLENGED)

Checking for Cloud Build triggers assumes an environment is provided, to find the associated project hosting the triggers.

In addition to succeeding or failing (issuing an error), the result should list what is found vs what is missing, as sketched below.
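
A minimal sketch of the control's core, assuming the configured instance names and the existing trigger names have already been listed (all names are illustrative):

package ramcli

// missingTriggers sketches the proposed control: compare configured
// instance names against existing Cloud Build trigger names and
// report what is found vs what is missing. All names are illustrative.
func missingTriggers(configured, existing []string) (found, missing []string) {
    have := make(map[string]bool, len(existing))
    for _, name := range existing {
        have[name] = true
    }
    for _, name := range configured {
        if have[name] {
            found = append(found, name)
        } else {
            missing = append(missing, name)
        }
    }
    return found, missing
}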

Bug: ram cli -check -deploy rotates keys while it should not

Scope: convertfeed2logs, listgroups, listgroupmembers, getgroupsettings
Impact: as new keys are created while not needed, and recorded to Firestore, the running useful keys are no longer protected (their names are overwritten in Firestore), which will lead to the deletion of running keys in a multi-org scenario the next time the cloud function is instantiated (i.e., executes initialize).

Fix: during deploy, only create and record keys when the mode is not -check; to be implemented in the 4 impacted services, as sketched below.
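
A minimal sketch of the fix, assuming a boolean check-mode flag (none of these names are RAM's actual API):

package ramcli

// deploySAKey sketches the fix: only create and record a service
// account key when not in -check mode. All names are illustrative.
func deploySAKey(checkMode bool) error {
    if checkMode {
        // -check must only report: never rotate keys nor touch Firestore
        return nil
    }
    // Create a new key and record its name in Firestore, so the key in
    // use stays protected from cleanup in a multi-org scenario.
    return createAndRecordKey() // hypothetical helper
}

func createAndRecordKey() error { return nil }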
