As threats to and attacks on network infrastructure increase, companies need to monitor the traffic flowing through their networks closely. Monitoring lets them promptly identify unusual network activity, which is often caused by malicious actors. This project addresses that need with a solution that monitors network traffic continuously in near real time and triggers an alert whenever abnormal behavior is detected.
- `subscriberId`: The identifier of the subscriber or user associated with the network activity.
- `srcIP`: The sender's IP address.
- `dstIP`: The IP address the network traffic is being sent to.
- `srcPort`: The port number used on the sender's side of the connection. Port numbers can help identify the application or service associated with the traffic.
- `dstPort`: The port number used on the receiver's side of the connection. It helps determine the service or application receiving the traffic.
- `txBytes`: The number of bytes transmitted from the source to the destination. Unusual spikes in data volume can indicate anomalies.
- `rxBytes`: The number of bytes received by the destination. As with `txBytes`, monitoring `rxBytes` helps identify unusual data patterns.
- `tcpFlag`: Information about the state of a TCP connection (e.g., SYN, ACK, FIN). Anomalies in flag patterns may indicate suspicious behavior.
- `startTime`: When the connection or network activity started.
- `endTime`: When the connection or activity ended. Connection durations can be analyzed for anomalies.
- `timestamp`: When the log entry was recorded.
- `protocolName`: The name of the protocol used for the network connection (e.g., HTTP, UDP, TCP). Anomalies can be detected by monitoring protocol usage.
- `protocolNumber`: The protocol as a numeric identifier (e.g., 6 for TCP, 17 for UDP). It can be used to classify connections by protocol.
- `geoCountry`: The country associated with the `srcIP`. Geolocation data can help identify suspicious traffic from unexpected locations.
- `geoCity`: The city associated with the `srcIP`. Like `geoCountry`, it can be used to track traffic from specific geographical areas.
- `latitude`: The latitude associated with the `srcIP`. Together with `longitude`, it can be used to map the locations of network activity and detect geographically unusual patterns.
- `longitude`: The longitude associated with the `srcIP`. Together with `latitude`, it can be used to map the locations of network activity and detect geographically unusual patterns.
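To make the schema concrete, here is a minimal sketch of a single log record as a Python dataclass. The field names come from the list above; the Python types, the `NetlogRecord` class name, the `duration_seconds` helper, and the sample values are assumptions made for illustration and are not taken from the project's code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Numeric protocol identifiers mentioned in the field list above.
PROTOCOL_NAMES = {6: "TCP", 17: "UDP"}


@dataclass(frozen=True)
class NetlogRecord:
    """One network log entry (hypothetical typing of the documented fields)."""

    subscriberId: str
    srcIP: str
    dstIP: str
    srcPort: int
    dstPort: int
    txBytes: int
    rxBytes: int
    tcpFlag: str
    startTime: datetime
    endTime: datetime
    timestamp: datetime
    protocolName: str
    protocolNumber: int
    geoCountry: str
    geoCity: str
    latitude: float
    longitude: float

    def duration_seconds(self) -> float:
        """Connection duration, a derived value that can be checked for anomalies."""
        return (self.endTime - self.startTime).total_seconds()


# Sample values are invented; the IPs are from documentation-only ranges.
example = NetlogRecord(
    subscriberId="sub-0001",
    srcIP="192.0.2.10",
    dstIP="198.51.100.7",
    srcPort=50514,
    dstPort=443,
    txBytes=1200,
    rxBytes=5400,
    tcpFlag="ACK",
    startTime=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    endTime=datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc),
    timestamp=datetime(2024, 1, 1, 12, 0, 31, tzinfo=timezone.utc),
    protocolName="TCP",
    protocolNumber=6,
    geoCountry="Exampleland",
    geoCity="Sample City",
    latitude=0.0,
    longitude=0.0,
)
```

Calling `example.duration_seconds()` gives the connection length in seconds, which is one way the `startTime` and `endTime` fields could feed a duration-based check.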
- BigQuery
- Cloud Storage
- Pub/Sub
- Dataflow
- Cloud Build
- Vertex AI
- Historical Data Ingestion
- Real-Time Data Ingestion
- Data Normalization
- Feature Store
- Model Development
- Real-Time Model Serving
- Model Retraining for Drift
To ensure the creation of consistent and meaningful mock data, we implemented specific techniques:
- Time Sequencing: We establish a logical sequence in the data by guaranteeing that the end time of an event always occurs after its start time. The time intervals between events are drawn from a uniform distribution with an adjustable parameter, `average_time_between_requests` (measured in seconds). This parameter controls the time gaps between events and keeps the data ordered.
- Bytes Size Per Request: The time gap determined by `average_time_between_requests` also controls the size of the `txBytes` (bytes transmitted) and `rxBytes` (bytes received) fields. By adjusting this parameter, we can influence the data volume associated with each event. This lets us create data that mimics scenarios ranging from low traffic to high-volume data transfers, providing a more comprehensive testing environment.
- Geolocation Attributes: To make the generated data more realistic, the value of `srcIP` is used to derive the related attributes `geoCountry`, `geoCity`, `latitude`, and `longitude`, so the location fields of each record are consistent with its source address.
- Destination IP: The `dstIP` must be generated deterministically because it plays a key role when events are aggregated. We therefore implemented a custom generator that keeps the simulated values consistent and reliable.
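The time-sequencing, byte-size, and destination-IP techniques above can be sketched as follows. This is an illustrative sketch only, not the project's generator. It assumes that gaps are drawn uniformly from 0 to twice `average_time_between_requests` (so the mean equals the parameter), that byte counts scale linearly with the gap through a hypothetical `bytes_per_second` factor, and that `dstIP` is picked from a small private-range pool by hashing `subscriberId`. The function names and `MIN_GAP_SECONDS` are likewise invented for the example.

```python
import hashlib
import random
from datetime import datetime, timedelta, timezone

# Assumed floor so that endTime is always strictly after startTime.
MIN_GAP_SECONDS = 0.001


def deterministic_dst_ip(subscriber_id: str, pool_size: int = 16) -> str:
    """Map a subscriber to a stable destination IP (hypothetical scheme).

    hashlib is used instead of the built-in hash(), which is salted per
    process and would break determinism across runs.
    """
    digest = hashlib.sha256(subscriber_id.encode("utf-8")).digest()
    host = digest[0] % pool_size + 1  # host part in 1..pool_size
    return f"10.0.0.{host}"


def generate_events(subscriber_id, count, average_time_between_requests,
                    start, rng, bytes_per_second=1000):
    """Generate `count` back-to-back events with ordered timestamps."""
    events = []
    cursor = start
    dst_ip = deterministic_dst_ip(subscriber_id)
    for _ in range(count):
        # Uniform gap whose mean is average_time_between_requests (assumption).
        gap = rng.uniform(0.0, 2.0 * average_time_between_requests)
        gap = max(gap, MIN_GAP_SECONDS)
        start_time = cursor
        end_time = start_time + timedelta(seconds=gap)
        events.append({
            "subscriberId": subscriber_id,
            "dstIP": dst_ip,
            "startTime": start_time,
            "endTime": end_time,
            # Byte counts grow with the gap (assumed linear relation).
            "txBytes": int(gap * bytes_per_second),
            "rxBytes": int(gap * bytes_per_second * rng.uniform(0.5, 1.5)),
        })
        cursor = end_time  # the next event starts where this one ended
    return events


def sample_events(average_time_between_requests, seed=42, count=50):
    """Convenience wrapper with a seeded RNG so runs are reproducible."""
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    return generate_events("sub-0001", count, average_time_between_requests,
                           start, random.Random(seed))
```

With this sketch, `sample_events(5.0)` produces a low-volume series and `sample_events(50.0)` a higher-volume one, since both the gaps and the byte counts scale with the parameter. Every event for a given subscriber shares the same `dstIP`, which is what makes per-destination aggregation stable.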
git clone https://github.com/hardeybisey/Netlog-Anomaly-Detection-Pipeline.git
gcloud auth configure-docker
gcloud services enable bigquery
gcloud services enable storage_component
gcloud services enable dataflow
gcloud services enable cloudbuild.googleapis.com
gcloud config set project <project_id>
export PROJECT_ID="$(gcloud config get-value project -q)"
export PROJECT_NUMBER="$(gcloud projects describe "${PROJECT_ID}" --format="value(projectNumber)")"
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:[email protected] --role roles/editor
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:[email protected] --role roles/storage.objectAdmin
export TOPIC_NAME="netlog-stream"
export SUBSCRIPTION_ID="netlog-stream-sub"
export DATASET_ID="netlog_dataset"
export BUCKET="$(gcloud config get-value project -q)-netlog-bucket"
export TAG="$(date +%Y%m%d%H%M%S)"
export EVENT_TYPE="normal"
gcloud builds submit . --config cloud-build.yaml --substitutions _TOPIC_NAME=${TOPIC_NAME},_SUBSCRIPTION_ID=${SUBSCRIPTION_ID},_DATASET_ID=${DATASET_ID},_BUCKET=${BUCKET},_TAG=${TAG},_EVENT_TYPE=${EVENT_TYPE}