anomaly-detection's Introduction

caa-streamworx

caa-streamworx is a code repository for the backend data pipeline of a CRA data science project that investigates the security value of SiteMinder logs, with a view towards automated intrusion detection. The work is part of a BCIP initiative in collaboration with streamworx.ai, and development is carried out by the AISS and CAA teams at the CRA.

The code in this repository is written in Python and is meant to be deployed on a Databricks service running in a CRA Azure subscription, so it is largely a Spark-based project. Because deployment takes place in Azure, some DevOps steps are staged in the appropriate Azure services, and some additional development takes place in notebooks that are not version-controlled here.

Installation/Deployment

Creating a new library release:

  1. Increment version number in setup.cfg in master.
  2. Create a pull request from master to azureDevOpsbranch.
  3. Run Pipeline:
       • Open dev.azure.com
       • Select Pipelines -> Pipelines
       • Select CRA-CAA.caa-streamworx
       • Select Run Pipeline
  4. Edit release on Azure DevOps page:
       • Open dev.azure.com
       • Select Pipelines -> Releases
       • Select Edit -> Tasks
  5. Update version number.
  6. Update feed and dbfs names in Release script to the current version number.
  7. Save, then Create Release -> Create.

Note: If the release is not initially deployed, try again, as we may not have been assigned a worker yet.
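
The version bump in step 1 can be scripted. The following is a minimal sketch, assuming the version is stored as `version = X.Y.Z` under the `[metadata]` section of setup.cfg (the usual setuptools layout, not confirmed by this README). The function name and the version number `0.0.41` are hypothetical.

```python
import configparser
import io


def bump_patch_version(cfg_text: str) -> str:
    """Return setup.cfg text with the patch part of [metadata] version incremented."""
    # interpolation=None so that any '%' characters in setup.cfg are left untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    major, minor, patch = parser["metadata"]["version"].split(".")
    parser["metadata"]["version"] = f"{major}.{minor}.{int(patch) + 1}"
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical setup.cfg content used only for illustration
example = "[metadata]\nname = caaswx\nversion = 0.0.41\n"
print(bump_patch_version(example))
```

Note that `configparser` drops comments when it rewrites a file, so for a hand-maintained setup.cfg a manual edit may still be preferable.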

To install on single cluster:

  1. Use: %pip install /dbfs/FileStore/wheelFiles/caaswx-0.0.XX-py3-none-any.whl in a notebook to install for that session.

To change initial install cluster:

  1. Select the release pipeline, then Edit -> Variables, and change variables to those corresponding to desired cluster.

For further details reference here.

Project Structure

This project consists of three broad categories of assets for constructing pipelines and machine learning models:

  • Scripts for performing repeatable tasks, like ingesting data from a fixed source in a specific format
  • Transformers, implemented as Python classes extending the appropriate base classes in either scikit-learn or Spark
  • Development-related items like unit tests and test data.

Spark-related scripts and transformers are located in /src/caaswx/spark/. Testing modules are located in /tests/. Testing data is located in /data/.
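
To illustrate the transformer idea, here is a dependency-free sketch that only mirrors the scikit-learn `fit`/`transform` convention. It is not taken from this repository: in the real project a transformer would presumably subclass a scikit-learn or `pyspark.ml` base class, and the class name and the `IS_FAILED` column are hypothetical. The event IDs 2, 6 and 9 are the ones the feature documentation lists for COUNT_FAILED.

```python
class FailedEventFlagger:
    """Stand-in for a transformer: adds a 0/1 flag for reject events to each row."""

    def __init__(self, failed_event_ids=(2, 6, 9)):
        self.failed_event_ids = set(failed_event_ids)

    def fit(self, rows, y=None):
        # Stateless: nothing is learned, so return self as the convention expects
        return self

    def transform(self, rows):
        # Return new dicts so that the input rows are not mutated
        return [
            {**row, "IS_FAILED": int(row["SM_EVENTID"] in self.failed_event_ids)}
            for row in rows
        ]


rows = [{"SM_EVENTID": 1}, {"SM_EVENTID": 6}]
print(FailedEventFlagger().fit(rows).transform(rows))
```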

Sample Usage

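In a Databricks notebook:

import caaswx

# `spark` is the SparkSession that Databricks notebooks predefine
df = spark.table("raw_logs")
feature_generator = caaswx.spark.transformers.UserFeatureGenerator(window_step=900, window_length=600)

# Preview the first 50 rows of generated features
feature_generator.transform(df).take(50)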

Description of dataset's columns

| Column Name | Description |
| --- | --- |
| CRA_SEQ | Serves as the primary key for the SiteMinder data and can be used for counting unique rows via aggregation steps. |
| CRA_TZ_OFFSET | Time zone offset (the majority of rows have 5 or 6 as the value of this column). |
| SM_ACTION | Records the HTTP action (GET, POST, or PUT). It can contain NULLs. |
| SM_AGENTNAME | Name associated with the agent that is being used in conjunction with the policy server. |
| SM_AUTHDIRNAME | This is not used by the reports generator and by the programs of this project. |
| SM_AUTHDIRNAMESPACE | This is not used by the reports generator and by the programs of this project. |
| SM_AUTHDIRSERVER | This is not used by the reports generator and by the programs of this project. |
| SM_CATEGORYID | The identifier for the type of logging. |
| SM_CLIENTIP | The IP address for the client machine that is trying to utilize a protected resource. |
| SM_DOMAINNAME | The unique name for the domain in which the realm and resource the user is accessing exist. |
| SM_DOMAINOID | The unique identifier for the domain in which the realm and resource the user is accessing exist. |
| SM_EVENTID | Marks the particular event that caused the logging to occur. |
| SM_HOSTNAME | Machine on which the server is running. |
| SM_IMPERSONATORDIRNAME | Directory name of the administrator that is acting as the impersonator in an impersonated session. |
| SM_IMPERSONATORNAME | Login name of the administrator that is acting as the impersonator in an impersonated session. |
| SM_REALMNAME | Name of the current realm in which the resource that the user wants resides. |
| SM_REALMOID | Identifier of the current realm in which the resource that the user wants resides. |
| SM_REASON | Motivations for logging. Values of 32000 and above are user-defined. |
| SM_RESOURCE | Resource, for example a web page, that the user is requesting. |
| SM_SESSIONID | Session identifier for this user’s activity. |
| SM_STATUS | Some descriptive text about the action. |
| SM_TIMESTAMP | Marks the time at which the entry was made to the database. |
| SM_TIMESTAMPTRUNC | Stores the truncated timestamp recording the date from SM_TIMESTAMP. |
| SM_TRANSACTIONID | This is not used by the reports generator. |
| SM_USERNAME | Username logged into the session. |

Feature Documentation

serverfeaturegenerator.py

| Feature | Description |
| --- | --- |
| StartTime | The beginning of a time window |
| EndTime | The end of a time window |
| VolOfAllLoginAttempts | Number of all login attempts in the specified time window |
| VolOfAllFailedLogins | Number of all failed login attempts in the specified time window |
| MaxOfFailedLoginsWithSameIPs | Maximum number of all failed login attempts with same IPs |
| NumOfIPsLoginMultiAccounts | Number of IPs used for logging into multiple accounts |
| NumOfReqsToChangePasswords | Number of requests to change passwords; see #65 |
| NumOfUsersWithEqualIntervalBtnReqs | Number of users with at least interval_threshold intervals between consecutive requests that are equal up to precision interval_epsilon |
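
The last feature, NumOfUsersWithEqualIntervalBtnReqs, is easier to follow with a small example. The sketch below is plain Python rather than the project's Spark code, and it rests on one reading of the description: a user counts if at least `interval_threshold` of their gaps between consecutive requests lie within `interval_epsilon` of each other. The function names and the `(user, timestamp)` input layout are assumptions.

```python
from collections import defaultdict


def max_near_equal_gaps(timestamps, epsilon):
    """Size of the largest group of consecutive-request gaps lying within epsilon of each other."""
    ts = sorted(timestamps)
    gaps = sorted(b - a for a, b in zip(ts, ts[1:]))
    best = left = 0
    # Sliding window over the sorted gaps: shrink from the left while the spread exceeds epsilon
    for right, gap in enumerate(gaps):
        while gap - gaps[left] > epsilon:
            left += 1
        best = max(best, right - left + 1)
    return best


def num_users_with_equal_intervals(records, interval_threshold, interval_epsilon):
    """records: iterable of (user, timestamp_in_seconds) pairs from one time window."""
    by_user = defaultdict(list)
    for user, ts in records:
        by_user[user].append(ts)
    return sum(
        max_near_equal_gaps(ts, interval_epsilon) >= interval_threshold
        for ts in by_user.values()
    )


# A scripted client requesting every 10 seconds versus an irregular user
records = [("bot", t) for t in (0, 10, 20, 30, 40)] + [("human", t) for t in (0, 7, 30, 31)]
print(num_users_with_equal_intervals(records, interval_threshold=3, interval_epsilon=1))
```

Sorting the gaps first means equal intervals are counted even when they are not adjacent in time; whether the project requires adjacency is not stated here.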

Userfeaturegenerator.py

| Feature | Description |
| --- | --- |
| COUNT_ADMIN_LOGOUT | Count of Admin Logout events during the time window, defined by sm_eventid = 8. |
| COUNT_AUTH_ACCEPT | Count of Auth Accept events during the time window, defined by sm_eventid = 1. |
| COUNT_ADMIN_ATTEMPT | Count of Auth Attempt events during the time window, defined by sm_eventid = 3. |
| COUNT_ADMIN_REJECT | Count of Auth Reject events during the time window, defined by sm_eventid = 2. |
| COUNT_AZ_ACCEPT | Count of Az Accept events during the time window, defined by sm_eventid = 5. |
| COUNT_AZ_REJECT | Count of Az Reject events during the time window, defined by sm_eventid = 6. |
| COUNT_AUTH_LOGOUT | Count of Auth Logout events during the time window, defined by sm_eventid = 10. |
| COUNT_VISIT | Count of Visit events during the time window, defined by sm_eventid = 13. |
| COUNT_AUTH_CHALLENGE | Count of Auth Challenge events during the time window, defined by sm_eventid = 4. |
| COUNT_ADMIN_REJECT | Count of Admin Reject events during the time window, defined by sm_eventid = 9. |
| COUNT_ADMIN_LOGIN | Count of Admin Login events during the time window, defined by sm_eventid = 7. |
| COUNT_VALIDATE_ACCEPT | Count of Validate Accept events during the time window, defined by sm_eventid = 11. |
| COUNT_VALIDATE_REJECT | Count of Validate Reject events during the time window, defined by sm_eventid = 12. |
| COUNT_FAILED | Count of all Reject events during the time window, defined by sm_eventid = 2, 6 and 9. |
| COUNT_GET | Count of all GET HTTP actions in SM_ACTION during the time window. |
| COUNT_POST | Count of all POST HTTP actions in SM_ACTION during the time window. |
| COUNT_HTTP_METHODS | Count of all GET and POST HTTP actions in SM_ACTION during the time window. |
| COUNT_OU_AMS | Count of all “ams” or “AMS” occurrences in SM_USERNAME or SM_RESOURCE during the time window. |
| COUNT_OU_CMS | Count of all “cra-cp” occurrences in SM_USERNAME during the time window. |
| COUNT_OU_IDENTITY | Count of all “ou=Identity” occurrences in SM_USERNAME during the time window. |
| COUNT_OU_CRED | Count of all “ou=Credential” occurrences in SM_USERNAME during the time window. |
| COUNT_OU_SECUREKEY | Count of all “ou=SecureKey” occurrences in SM_USERNAME during the time window. |
| COUNT_PORTAL_MYA | Count of all “mima” occurrences in SM_RESOURCE during the time window. |
| COUNT_PORTAL_MYBA | Count of all “myba” occurrences in SM_RESOURCE during the time window. |
| COUNT_UNIQUE_ACTIONS | Count of distinct HTTP Actions in SM_ACTION during the time window. |
| COUNT_UNIQUE_IPS | Count of distinct IPs in SM_CLIENTIP during the time window. |
| COUNT_UNIQUE_EVENTS | Count of distinct EventIDs in SM_EVENTID during the time window. |
| COUNT_UNIQUE_USERNAME | Count of distinct CNs in CN during the time window. |
| COUNT_UNIQUE_RESOURCES | Count of distinct Resource Strings in SM_RESOURCE during the time window. |
| COUNT_UNIQUE_SESSIONS | Count of distinct SessionIDs in SM_SESSIONID during the time window. |
| COUNT_RECORDS | Count of CRA_SEQ values (the dataset's primary key) during the time window. |
| UNIQUE_SM_ACTIONS | A distinct list of HTTP Actions in SM_ACTION during the time window. |
| UNIQUE_SM_CLIENTIPS | A distinct list of IPs in SM_CLIENTIP during the time window. |
| UNIQUE_SM_PORTALS | A distinct list of Resource Strings in SM_RESOURCE during the time window. |
| UNIQUE_SM_TRANSACTIONS | A distinct list of Transaction IDs in SM_TRANSACTIONID during the time window. |
| SM_SESSION_IDS | A distinct list of SessionIDs in SM_SESSIONID during the time window. |
| COUNT_UNIQUE_OU | A count of distinct entries containing “ou=” and a string ending in “,” in SM_USERNAME during the time window. |
| UNIQUE_USER_OU | A distinct list of entries containing “ou=” and a string ending in “,” in SM_USERNAME during the time window. |
| COUNT_PORTAL_RAC | A count of entries containing “rep” followed by a string ending in “/” in SM_RESOURCE during the time window. |
| UNIQUE_PORTAL_RAC | A distinct list of entries containing “rep” followed by a string ending in “/” in SM_RESOURCE during the time window. |
| UNIQUE_USER_APPS | A distinct list of root nodes from each record in SM_RESOURCE during the time window. |
| COUNTUNIQUE_USER_APPS | A count of distinct root nodes from each record in SM_RESOURCE during the time window. |
| USER_TIMESTAMP | Minimum timestamp in SM_TIMESTAMP during the time window. |
| AVG_TIME_BT_RECORDS | Average time between records during the time window. |
| MAX_TIME_BT_RECORDS | Maximum time between records during the time window. |
| MIN_TIME_BT_RECORDS | Minimum time between records during the time window. |
| UserLoginAttempts | Total number of login attempts from the user within the specified time window. |
| UserAvgFailedLoginsWithSameIPs | Average number of failed logins with same IPs from the user. The user may use multiple IPs: count the failed logins for each IP, then average those counts over all the IPs used by the same user. |
| UserNumOfAccountsLoginWithSameIPs | Total number of accounts visited by the IPs used by this user (this might be tricky to implement and expensive to compute; open to nixing). |
| UserNumOfPasswordChange | Total number of requests for changing passwords by the user (see #65, "Seeing a password change from the events in `raw_logs`"). |
| UserIsUsingUnusualBrowser | Whether the browser used by the user in the current time window differs from the one used in the previous time window, or changes within the current time window. |
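
To make a few of these definitions concrete, the following plain-Python sketch computes some of the count and time-gap features for rows that already belong to one user and one time window. It is an illustration, not the project's Spark implementation. It assumes SM_EVENTID is an integer, SM_ACTION holds upper-case method names or None, SM_TIMESTAMP is a number of seconds, and each row has a unique CRA_SEQ; the function name is hypothetical.

```python
from collections import Counter
from statistics import mean

# Reject event IDs as listed for COUNT_FAILED in the table above
FAILED_EVENT_IDS = {2, 6, 9}


def user_window_features(rows):
    """rows: list of dicts for one user and one window."""
    events = Counter(r["SM_EVENTID"] for r in rows)
    # NULL actions are skipped, since SM_ACTION can contain NULLs
    actions = Counter(r["SM_ACTION"] for r in rows if r["SM_ACTION"] is not None)
    ts = sorted(r["SM_TIMESTAMP"] for r in rows)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return {
        "COUNT_AUTH_ACCEPT": events[1],
        "COUNT_VISIT": events[13],
        "COUNT_FAILED": sum(events[e] for e in FAILED_EVENT_IDS),
        "COUNT_GET": actions["GET"],
        "COUNT_POST": actions["POST"],
        "COUNT_HTTP_METHODS": actions["GET"] + actions["POST"],
        "COUNT_UNIQUE_EVENTS": len(events),
        "COUNT_RECORDS": len(rows),
        # Gap statistics are undefined when the window holds fewer than two records
        "AVG_TIME_BT_RECORDS": mean(gaps) if gaps else None,
        "MAX_TIME_BT_RECORDS": max(gaps) if gaps else None,
        "MIN_TIME_BT_RECORDS": min(gaps) if gaps else None,
    }


rows = [
    {"SM_EVENTID": 1, "SM_ACTION": "GET", "SM_TIMESTAMP": 0},
    {"SM_EVENTID": 2, "SM_ACTION": "POST", "SM_TIMESTAMP": 10},
    {"SM_EVENTID": 6, "SM_ACTION": None, "SM_TIMESTAMP": 40},
    {"SM_EVENTID": 13, "SM_ACTION": "GET", "SM_TIMESTAMP": 45},
]
print(user_window_features(rows))
```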

Sessionfeaturegenerator.py

| Feature | Description |
| --- | --- |
| SESSION_APPS | A distinct list of root nodes from each record in SM_RESOURCE during the time window. |
| COUNT_UNIQUE_APPS | A count of distinct root nodes from each record in SM_RESOURCE during the time window. |
| SESSION_USER | A distinct list of CNs in CN during the time window. |
| COUNT_ADMIN_LOGIN | Count of Admin Login events during the time window, defined by sm_eventid = 7. |
| COUNT_ADMIN_LOGOUT | Count of Admin Logout events during the time window, defined by sm_eventid = 8. |
| COUNT_ADMIN_REJECT | Count of Auth Reject events during the time window, defined by sm_eventid = 2. |
| COUNT_FAILED | Count of all Reject events during the time window, defined by sm_eventid = 2, 6 and 9. |
| COUNT_VISIT | Count of Visit events during the time window, defined by sm_eventid = 13. |
| COUNT_GET | Count of all GET HTTP actions in SM_ACTION during the time window. |
| COUNT_POST | Count of all POST HTTP actions in SM_ACTION during the time window. |
| COUNT_HTTP_METHODS | Count of all GET and POST HTTP actions in SM_ACTION during the time window. |
| COUNT_RECORDS | Count of CRA_SEQ values (the dataset's primary key) during the time window. |
| COUNT_UNIQUE_ACTIONS | Count of distinct HTTP Actions in SM_ACTION during the time window. |
| COUNT_UNIQUE_EVENTS | Count of distinct EventIDs in SM_EVENTID during the time window. |
| COUNT_UNIQUE_USERNAME | Count of distinct CNs in CN during the time window. |
| COUNT_UNIQUE_RESOURCES | Count of distinct Resource Strings in SM_RESOURCE during the time window. |
| COUNT_UNIQUE_REP | A count of entries containing “rep” followed by a string ending in “/” in SM_RESOURCE during the time window. |
| SESSION_SM_ACTION | A distinct list of HTTP Actions in SM_ACTION during the time window. |
| SESSION_RESOURCE | A distinct list of Resource Strings in SM_RESOURCE during the time window. |
| SESSION_REP_APP | A distinct list of entries containing “rep” followed by a string ending in “/” in SM_RESOURCE during the time window. |
| SESSSION_FIRST_TIME_SEEN | Minimum time at which a record was logged during the time window. |
| SESSSION_LAST_TIME_SEEN | Maximum time at which a record was logged during the time window. |
| SDV_BT_RECORDS | Standard deviation of timestamp deltas during the time window. |
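
The sketch below shows, in plain Python, how the app and timing features for one session could be derived. It makes two assumptions that the table does not spell out: a "root node" is taken to be the first path segment of SM_RESOURCE, and the standard deviation is the sample standard deviation. The function names and the resource strings are hypothetical, and the feature names are spelled as in the table.

```python
from statistics import stdev


def root_node(resource):
    """First non-empty path segment of a resource string (assumed meaning of 'root node')."""
    if not resource:
        return None
    path = resource.split("?", 1)[0]  # ignore any query string
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else None


def session_window_features(rows):
    """rows: non-empty list of dicts with SM_TIMESTAMP (seconds) and SM_RESOURCE."""
    ts = sorted(r["SM_TIMESTAMP"] for r in rows)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    apps = sorted({root_node(r["SM_RESOURCE"]) for r in rows} - {None})
    return {
        "SESSION_APPS": apps,
        "COUNT_UNIQUE_APPS": len(apps),
        "SESSSION_FIRST_TIME_SEEN": ts[0],
        "SESSSION_LAST_TIME_SEEN": ts[-1],
        # A sample standard deviation needs at least two gaps (three records)
        "SDV_BT_RECORDS": stdev(gaps) if len(gaps) > 1 else None,
    }


rows = [
    {"SM_TIMESTAMP": 0, "SM_RESOURCE": "/myba/home"},
    {"SM_TIMESTAMP": 10, "SM_RESOURCE": "/myba/login?x=1"},
    {"SM_TIMESTAMP": 30, "SM_RESOURCE": "/rep/app/"},
]
print(session_window_features(rows))
```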

IPfeaturegenerator.py

| Feature | Description |
| --- | --- |
| IP_APP | A distinct list of root nodes from each record in SM_RESOURCE during the time window. |
| IP_AVG_TIME_BT_RECORDS | Average time between records during the time window. |
| IP_MAX_TIME_BT_RECORDS | Maximum time between records during the time window. |
| IP_MIN_TIME_BT_RECORDS | Minimum time between records during the time window. |
| IP_COUNT_ADMIN_LOGIN | Count of Admin Login events during the time window, defined by sm_eventid = 7. |
| IP_COUNT_ADMIN_LOGOUT | Count of Admin Logout events during the time window, defined by sm_eventid = 8. |
| IP_COUNT_ADMIN_REJECT | Count of Admin Reject events during the time window, defined by sm_eventid = 9. |
| IP_COUNT_AUTH_ACCEPT | Count of Auth Accept events during the time window, defined by sm_eventid = 1. |
| IP_COUNT_ADMIN_ATTEMPT | Count of Auth Attempt events during the time window, defined by sm_eventid = 3. |
| IP_COUNT_AUTH_CHALLENGE | Count of Auth Challenge events during the time window, defined by sm_eventid = 4. |
| IP_COUNT_AUTH_LOGOUT | Count of Auth Logout events during the time window, defined by sm_eventid = 10. |
| IP_COUNT_ADMIN_REJECT | Count of Auth Reject events during the time window, defined by sm_eventid = 2. |
| IP_COUNT_AZ_ACCEPT | Count of Az Accept events during the time window, defined by sm_eventid = 5. |
| IP_COUNT_AZ_REJECT | Count of Az Reject events during the time window, defined by sm_eventid = 6. |
| IP_COUNT_FAILED | Count of all Reject events during the time window, defined by sm_eventid = 2, 6 and 9. |
| IP_COUNT_GET | Count of all GET HTTP actions in SM_ACTION during the time window. |
| IP_COUNT_POST | Count of all POST HTTP actions in SM_ACTION during the time window. |
| IP_COUNT_HTTP_METHODS | Count of all GET and POST HTTP actions in SM_ACTION during the time window. |
| IP_COUNT_OU_AMS | Count of all “ams” or “AMS” occurrences in SM_USERNAME or SM_RESOURCE during the time window. |
| IP_COUNT_OU_CMS | Count of all “cra-cp” occurrences in SM_USERNAME during the time window. |
| IP_COUNT_OU_IDENTITY | Count of all “ou=Identity” occurrences in SM_USERNAME during the time window. |
| IP_COUNT_OU_CRED | Count of all “ou=Credential” occurrences in SM_USERNAME during the time window. |
| IP_COUNT_OU_SECUREKEY | Count of all “ou=SecureKey” occurrences in SM_USERNAME during the time window. |
| IP_COUNT_PORTAL_MYA | Count of all “mima” occurrences in SM_RESOURCE during the time window. |
| IP_COUNT_PORTAL_MYBA | Count of all “myba” occurrences in SM_RESOURCE during the time window. |
| IP_COUNT_UNIQUE_ACTIONS | Count of distinct HTTP Actions in SM_ACTION during the time window. |
| IP_COUNT_UNIQUE_EVENTS | Count of distinct EventIDs in SM_EVENTID during the time window. |
| IP_COUNT_UNIQUE_USERNAME | Count of distinct CNs in CN during the time window. |
| IP_COUNT_UNIQUE_RESOURCES | Count of distinct Resource Strings in SM_RESOURCE during the time window. |
| IP_COUNT_UNIQUE_SESSIONS | Count of distinct SessionIDs in SM_SESSIONID during the time window. |
| IP_COUNT_PORTAL_RAC | A count of entries containing “rep” followed by a string ending in “/” in SM_RESOURCE during the time window. |
| IP_COUNT_RECORDS | Count of CRA_SEQ values (the dataset's primary key) during the time window. |
| IP_COUNT_VISIT | Count of Visit events during the time window, defined by sm_eventid = 13. |
| IP_COUNT_VALIDATE_ACCEPT | Count of Validate Accept events during the time window, defined by sm_eventid = 11. |
| IP_COUNT_VALIDATE_REJECT | Count of Validate Reject events during the time window, defined by sm_eventid = 12. |
| IP_UNIQUE_SM_ACTIONS | A distinct list of HTTP Actions in SM_ACTION during the time window. |
| IP_UNIQUE_USERNAME | A distinct list of CNs in CN during the time window. |
| IP_UNIQUE_SM_SESSION | A distinct list of SessionIDs in SM_SESSIONID during the time window. |
| IP_UNIQUE_SM_PORTALS | A distinct list of Resource Strings in SM_RESOURCE during the time window. |
| IP_UNIQUE_SM_TRANSACTIONS | A distinct list of Transaction IDs in SM_TRANSACTIONID during the time window. |
| IP_UNIQUE_USER_OU | A distinct list of entries containing “ou=” and a string ending in “,” in SM_USERNAME during the time window. |
| IP_UNIQUE_REP_APP | A distinct list of entries containing “rep” followed by a string ending in “/” in SM_RESOURCE during the time window. |
| IP_TIMESTAMP | Earliest timestamp during the time window. |
| IP_COUNT_UNIQUE_OU | A count of distinct entries containing “ou=” and a string ending in “,” in SM_USERNAME during the time window. |
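
The “ou=” features can be illustrated with a regular expression. This plain-Python sketch groups rows by SM_CLIENTIP and collects the text between “ou=” and the next “,” in SM_USERNAME. Whether the project keeps the “ou=” prefix in the stored value, and whether matching is case-sensitive, is not stated in this README, so both choices here are assumptions, as are the function name, the usernames and the IP addresses.

```python
import re
from collections import defaultdict

# Text between "ou=" and the next comma (case-sensitive; an assumption)
OU_PATTERN = re.compile(r"ou=([^,]*),")


def ou_features_by_ip(rows):
    """rows: dicts with SM_CLIENTIP and SM_USERNAME, all from one time window."""
    ous = defaultdict(set)
    for r in rows:
        # Treat a missing username as an empty string so the IP still appears in the result
        ous[r["SM_CLIENTIP"]].update(OU_PATTERN.findall(r["SM_USERNAME"] or ""))
    return {
        ip: {"IP_UNIQUE_USER_OU": sorted(values), "IP_COUNT_UNIQUE_OU": len(values)}
        for ip, values in ous.items()
    }


rows = [
    {"SM_CLIENTIP": "192.0.2.1", "SM_USERNAME": "cn=u1,ou=Credential,ou=cra-cp,o=x"},
    {"SM_CLIENTIP": "192.0.2.1", "SM_USERNAME": "cn=u2,ou=Identity,o=x"},
    {"SM_CLIENTIP": "192.0.2.2", "SM_USERNAME": None},
]
print(ou_features_by_ip(rows))
```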

License

MIT

anomaly-detection's People

Contributors

andrewgpeng, chitra-rajagopal, dg1223, ghulamusman, kangchenyufei, kerrycerqueira, ketan14a, sagarmoghe, skenar
