
shinesolutions / aem-orchestrator

Java application for orchestrating AEM infrastructure created using aem-aws-stack-builder

License: Apache License 2.0

Java 100.00%
aem aem-opencloud auto-recovery self-healing orchestration

aem-orchestrator's Introduction

AEM Orchestrator

AEM Orchestrator is a stateless Java application for orchestrating AEM infrastructure created using aem-aws-stack-builder. Its primary function is to keep Adobe Experience Manager (AEM) running in a healthy state despite scaling events or other impacts on the stack. It does this by listening to a predefined SQS queue and reacting to changes on the stack.

Build

This project requires Java 8 to compile and run the source code. Apache Maven 3.3 was used as the build tool.

Create JAR file

$ mvn clean package

This will create a JAR file called aem-orchestrator-x.x.x.jar in the 'target' directory. By default, the generated JAR file contains all of the required dependencies.

Usage

Requirements

The AEM Orchestrator requires aem-aws-stack-builder to have created a stack in order to work. You'll need to get this running before attempting to run AEM Orchestrator. If you are using Puppet, you should also take a look at puppet-aem-orchestrator.

The AEM Orchestrator uses AWS Instance Profiles authentication. If you do not already have this, you will need to set it up.

The application requires an application.properties file in the same base directory as the JAR file. Here is an example of a properties file with the minimum properties set:

aws.cloudformation.stackName.author = example-aem-author-stack
aws.cloudformation.stackName.authorDispatcher = example-aem-author-dispatcher-stack
aws.cloudformation.stackName.publish = example-aem-publish-stack
aws.cloudformation.stackName.publishDispatcher = example-aem-publish-dispatcher-stack
aws.cloudformation.stackName.messaging = example-aem-messaging-stack
aws.sqs.queueName = example-aem-asg-event-queue

The aem-aws-stack-builder will generate these names for you; they only need to be added to the Orchestrator's application.properties file. See here for more information on configuration properties. You can also view the base application.properties file.
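
As a rough illustration of the minimum configuration, the sketch below checks that a set of loaded properties contains the six keys from the example above. The class and method names are hypothetical and are not part of AEM Orchestrator; this is also not how the Orchestrator itself reads its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of AEM Orchestrator: reports which of the
// minimum properties from the example above are missing or blank.
class MinimumPropertiesCheck {

    static final List<String> REQUIRED_KEYS = Arrays.asList(
            "aws.cloudformation.stackName.author",
            "aws.cloudformation.stackName.authorDispatcher",
            "aws.cloudformation.stackName.publish",
            "aws.cloudformation.stackName.publishDispatcher",
            "aws.cloudformation.stackName.messaging",
            "aws.sqs.queueName");

    // Returns the required keys that are absent or have an empty value.
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    // Parses properties text in the same "key = value" format as the example.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

An empty list from missingKeys means all six minimum properties are present.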

Running the JAR

The JAR file is created as a 'fully executable' jar. See Spring Boot deployment and install. There are two ways to run the JAR:

$ java -jar aem-orchestrator-x.x.x.jar

or as a 'fully executable' direct application (only works on Unix/Linux based systems):

$ ./aem-orchestrator-x.x.x.jar

Logging

By default the Orchestrator will log to a file called orchestrator.log in the root directory. It uses logback and here is the default configuration file. Note that debug is enabled by default. To override the default logging, place your custom logback.xml file in the same directory as the JAR file.

Functionality

The AEM Orchestrator reacts to three different types of events:

  1. Instance Scale Up
  2. Instance Scale Down
  3. Alarms

Scale Events

The scale up and scale down events are generated by the AWS autoscaling groups (ASG). These can occur at the following tiers of the stack:

  • Author Dispatcher
  • Publish
  • Publish Dispatcher

For example, if a Publish instance stopped responding, the ASG would terminate it, which would generate a message telling the Orchestrator to perform a Scale Down Publish Action. The ASG would also start a new Publish instance, which would generate a Scale Up Publish Action. In each case, the AWS instance ID is passed to the Orchestrator so that it can perform its action.

Alarms

Currently there is only one alarm, the AEM Content Health Check. This will trigger if the Publish Dispatcher is not seeing healthy content (defined in a descriptor file) on the Publish instance. When the alarm triggers the Orchestrator is notified via an SQS message and it will perform an Alarm Content Health Check Action.

Recovery

Aside from stack startup, the AEM Orchestrator is designed to recover the stack from multiple scenarios:

  • Termination of a single instance (Author Dispatcher, Publish or Publish Dispatcher)
  • Termination of many or all instances
  • Scale up of a single instance (Author Dispatcher, Publish or Publish Dispatcher)
  • Scale up of all instances

NOTE: The AEM Orchestrator must be running to perform this recovery. However, because it is stateless, it can be started after an instance has been terminated or scaled up. Upon startup, it will begin processing messages off the SQS queue. In essence, the queue holds the state of the stack, not the AEM Orchestrator.

Reading Messages

The AEM Orchestrator attaches itself as a listener to the SQS queue defined in the application.properties. It works in an asynchronous manner and does not poll the queue. If the Orchestrator fails to perform an action on a message for whatever reason, the message will not be acknowledged and will instead be kept in flight on the queue to be reprocessed at a later time. Only one message is processed at a time (no concurrency). There is no guarantee on the order of processing, but the Orchestrator is designed to handle messages in any order.
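
The acknowledge-only-on-success behaviour described above can be sketched with a small in-memory model. This is an illustration only: the class and its fields are hypothetical, and moving a failed message to the back of a local queue is a simplification of SQS making an unacknowledged message available for delivery again.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

// In-memory illustration only; the real application listens to an SQS queue.
class AckOnSuccessSketch {

    final Deque<String> queue = new ArrayDeque<>();
    final List<String> acknowledged = new ArrayList<>();

    // Processes the message at the head of the queue, one at a time.
    // The message is acknowledged (removed for good) only if the handler
    // succeeds; on failure it is put back so it can be reprocessed later.
    boolean processNext(Consumer<String> handler) {
        String message = queue.pollFirst();
        if (message == null) {
            return false; // nothing to process
        }
        try {
            handler.accept(message);
            acknowledged.add(message);
            return true;
        } catch (RuntimeException e) {
            queue.addLast(message); // not acknowledged, stays on the queue
            return false;
        }
    }
}
```

Because a failed message is handled again from the start, each action has to tolerate being run more than once for the same message.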

The message format is the Amazon SNS HTTP/HTTPS Notification JSON Format. A JSON definition of the format can be found here.

Troubleshooting

If all goes well at start up you should see something like this in your log file:

DEBUG c.s.a.s.OrchestratorMessageListener - Initialising message receiver, starting SQS connection
INFO  c.s.aemorchestrator.AemOrchestrator - AEM Orchestrator started

Otherwise you will need to view the orchestrator.log and try and decipher what is going wrong. Most likely causes are:

  • Stack not properly initialised
  • Invalid or missing IAM role permissions
  • Invalid or missing property in the application.properties file
  • AEM not in a healthy state on author or publish instances
  • AWS Instance Profile authentication not set up

How do I know it's working?

Upon startup of the stack you will see the Orchestrator log showing the processing of many messages. Here is a log example (with debug off) for a Scale Up Publish Dispatcher Action:

INFO  c.s.a.s.OrchestratorMessageListener - Message received ID:3dfea271-254c-4cef-a793-ff67a2218239
INFO  c.s.a.a.ScaleUpPublishDispatcherAction - ScaleUpPublishDispatcherAction executing
INFO  c.s.a.a.ScaleUpPublishDispatcherAction - Desired capacity already matching for publish auto scaling group and it's dispatcher. No changes will be made
INFO  c.s.a.s.OrchestratorMessageListener - Acknowledged message (removing from queue): ID:3dfea271-254c-4cef-a793-ff67a2218239

If the Orchestrator log isn't moving but there are definitely messages in the SQS queue, then something may be wrong (refer to likely causes above).

aem-orchestrator's People

Contributors

cliffano, dependabot[bot], jimmyting44, karchit, kaveensingh31, lenuhc, mbloch1986, michaeldiender-shinesolutions, nletts, ovlords

aem-orchestrator's Issues

Initial publish instances shouldn't snapshot each other

When a new stack is initialised, the first two publish instances snapshot each other because each identifies the other as the healthy source to base its snapshot on.

This likely happens because the publish instance currently does not wait for the SnapshotId tag to exist, so both instances start AEM at about the same time rather than letting Orchestrator control the sequence.

Make SSL certificate verification configurable when connecting to AEM

The verification of the SSL certificate when connecting to AEM via SSL should be configurable. Currently Orchestrator does not support SSL connections to AEM when AEM uses a self-signed certificate.

The default value should be false, so that the SSL certificate is not verified, which matches the usual use case of a self-signed certificate.

TEST_NOTIFICATION log level should not be an error

Currently an autoscaling TEST_NOTIFICATION notification type generates an error in Orchestrator's log file. This event comes from the ASG, so it is a known event and Orchestrator doesn't have to do anything with it.

2017-02-24 11:38:26 [SessionCallBackSchedulerThread-1] ERROR c.s.a.handler.SqsMessageHandler - No event handler found for message type: autoscaling:TEST_NOTIFICATION

I have made sure that none of publish-dispatcher, publish, and author ASGs declare this notification type.

Configurable snapshot tags

When Orchestrator creates a snapshot, it should copy the tags from the EC2 instance it's taking the snapshot from.

Orchestrator's application.properties needs a new configuration property holding a comma-separated list of tag names.
Default value for this config: Component,Name,StackPrefix.

Orchestrator should also add a tag SnapshotType with value orchestration.
This is used to distinguish the retention policy between various snapshot types.
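
A minimal sketch of the requested behaviour, assuming the instance tags are available as a simple map. The class and method names are hypothetical; the tag names and the SnapshotType value come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper illustrating the request; not existing Orchestrator code.
class SnapshotTagSketch {

    // Copies the configured tags from the instance (skipping any the instance
    // does not have) and adds the SnapshotType tag.
    static Map<String, String> snapshotTags(Map<String, String> instanceTags,
                                            String tagNamesCsv) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : tagNamesCsv.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && instanceTags.containsKey(trimmed)) {
                result.put(trimmed, instanceTags.get(trimmed));
            }
        }
        result.put("SnapshotType", "orchestration");
        return result;
    }
}
```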

Scaling up publish instances should pause not disable

Hey, first off thanks for open sourcing a lot of your tools! They have been very helpful in my quest to automate scaling AEM6 publish servers in Google Cloud with HashiCorp Consul.

I am opening this issue in reference to the scale up publish event.

During a scale up publish event, step 4 takes a healthy publish agent and pauses it in the replication queue. This is to make sure no changes get pushed to the healthy publish agent during the snapshot.

The issue I found is in the method getPauseReplicationAgentRequest, which is meant to pause the healthy publish agent: it disables the publish agent instead of pausing it. Wouldn't this cause the healthy publish instance to miss any updates that were published during the snapshot? Or is there a process that prevents users from publishing during a scale up publish event that I am not seeing?

Thanks again for sharing your great work with the community,
Ry

Rename contenthealthcheck alarm metric

It would be an enhancement to update the metric for the content healthcheck alarm.

At the moment the metric namespace is the Publish Dispatcher CloudFormation stack name, which makes the CloudWatch metric overview in the AWS Management Console confusing. A different namespace would also make the metric easier to check in acceptance tests.

e.g. the namespace could be "AEM", keeping the existing dimensions "PairInstanceID" and "contentHealthCheck" and adding a "StackPrefix" dimension.

Change retry backoff policy to exponential on resource readiness checks

Orchestrator currently checks resource readiness with a fixed backoff policy:

FixedBackOffPolicy backOffPolicy = new FixedBackOffPolicy();

Even though the backoff period can be configured, it is still not ideal when the error continuously occurs for a long period of time, potentially causing a large number of API calls which might in turn cause the account to hit API rate limit.

The backoff policy should be replaced with an exponential backoff policy: https://github.com/spring-projects/spring-retry/blob/master/src/main/java/org/springframework/retry/backoff/ExponentialBackOffPolicy.java
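
To show why this reduces the number of API calls, here is a standalone sketch of how an exponential delay grows per attempt. It uses plain Java rather than Spring Retry, and the class, method and parameter names are illustrative assumptions, not Orchestrator code.

```java
// Standalone illustration of exponential backoff; not Orchestrator code.
class ExponentialBackoffSketch {

    // Delay before retry number `attempt` (0-based): the initial delay is
    // multiplied on every attempt and capped at maxMillis.
    static long delayMillis(long initialMillis, double multiplier,
                            long maxMillis, int attempt) {
        double delay = initialMillis * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxMillis);
    }
}
```

With an initial delay of 1000 ms, a multiplier of 2.0 and a cap of 30000 ms, the first delays are 1000, 2000, 4000 and 8000 ms, and every delay from the sixth attempt onwards is 30000 ms. A fixed policy with the same 1000 ms period would keep calling the API once per second for as long as the error persists.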

Wait for author ELB to be ready before processing any message

When the stack is starting up, its resources are still being initialised and are not ready. This can lead Orchestrator to:

  • log plenty of error messages because Orchestrator sends requests to the Author ELB before it is ready
  • keep taking Publish snapshots before failing when trying to create a replication agent via the Author ELB, which then leads to an AWS error for too many snapshots taken from the same volume

Orchestrator needs to wait for author ELB to be ready (keep checking until its health check is responding with 200) before processing any message from the queue. This will prevent the above errors.
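
One possible shape for such a gate, sketched under the assumption that the health check can be wrapped in a function returning an HTTP status code. All names here are hypothetical, and the attempt limit is an addition for the sake of a terminating example.

```java
import java.util.function.IntSupplier;

// Hypothetical readiness gate; the status check is supplied by the caller.
class ReadinessGateSketch {

    // Polls statusCheck until it returns 200 or maxAttempts checks were made.
    // Returns true once the endpoint reports 200, false otherwise.
    static boolean waitUntilReady(IntSupplier statusCheck, int maxAttempts,
                                  long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (statusCheck.getAsInt() == 200) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Message processing would start only after waitUntilReady returns true.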

Coverage report

To provide visibility on our current test coverage report, please integrate a coverage tool (Cobertura? Jacoco?) to Orchestrator's Maven POM.

Device name should be configurable

Device name is currently a constant which defaults to /dev/sdb https://github.com/shinesolutions/aem-orchestrator/blob/master/src/main/java/com/shinesolutions/aemorchestrator/actions/ScaleUpPublishAction.java#L35

This value is configurable during stack creation because the source image might already use /dev/sdb, in which case the stack has to use /dev/sdc for the AEM repository.
Hence the Orchestrator config should also support a configurable device name for the AEM repository.

Use AEM Healthcheck URL to verify if Publish is healthy

When we introduced the new EC2 tag COMPONENT_INIT_STATUS to identify whether the provisioning of the EC2 instance finished successfully, we replaced the existing Publisher healthcheck logic in the orchestrator with a check of this EC2 tag.

Ideally we should use both to identify if the Publisher is healthy or not.

  • Verify that provisioning was successful
  • Verify that AEM is healthy via the AEM Healthcheck URL

The Commit where we replaced the healthcheck logic:
a5eee90

Missing snapshot name

The snapshots that Orchestrator creates currently have an empty name.

For consistency and to assist in filtering snapshot resources, please add the following name:

AEM <component> Snapshot <instance_id>

This can be done by adding a Name tag to the snapshot.

Scale down handlers need to cater for non-existent replication/flush agent

Similar to other message-reprocessing issues, it is possible that a scale down event deletes a replication/flush agent and then (e.g. due to a connectivity issue) fails to proceed with the next steps, which causes the message to stay in the queue and be reprocessed.

When Orchestrator processes the message again, it will try to delete an agent that doesn't exist and AEM will respond with 403, which then causes an error on the Orchestrator side.
This should be modified so that Orchestrator treats 403 as 'node doesn't exist', logs a message, and then proceeds with the rest.

Note: 403 handling will also be added to ruby_aem layer across various resource handling.
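
A sketch of the proposed handling, assuming the delete call can be reduced to a function that returns the HTTP status code. The class, enum and method names are hypothetical and do not reflect the actual client API.

```java
import java.util.function.IntSupplier;

// Hypothetical illustration of treating 403 as "agent already gone".
class IdempotentDeleteSketch {

    enum Outcome { DELETED, ALREADY_ABSENT }

    // Runs the delete call and maps its status code to an outcome.
    // Any status other than 2xx or 403 is still treated as an error.
    static Outcome deleteAgent(IntSupplier deleteCall) {
        int status = deleteCall.getAsInt();
        if (status >= 200 && status < 300) {
            return Outcome.DELETED;
        }
        if (status == 403) {
            System.out.println("Agent does not exist, nothing to delete");
            return Outcome.ALREADY_ABSENT;
        }
        throw new IllegalStateException("Unexpected status " + status);
    }
}
```

Either outcome lets the scale down handler continue with its remaining steps, so a reprocessed message no longer fails at the delete.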

Error starting Orchestrator with latest master as of 13/10/2020

Starting the compiled orchestrator code with the latest master of 13/10/2020 fails with the following error message:

2020-10-13 08:13:58 [main] INFO  c.s.aemorchestrator.AemOrchestrator - Starting AemOrchestrator v2.0.2-SNAPSHOT on ip-10-0-15-185.ap-southeast-2.compute.internal with PID 14082 (/opt/shinesolutions/aem-orchestrator/aem-orchestrator.jar started by aem-orchestrator in /opt/shinesolutions/aem-orchestrator)
2020-10-13 08:13:58 [main] DEBUG c.s.aemorchestrator.AemOrchestrator - Running with Spring Boot v2.1.6.RELEASE, Spring v5.1.8.RELEASE
2020-10-13 08:13:58 [main] INFO  c.s.aemorchestrator.AemOrchestrator - No active profile set, falling back to default profiles: default
2020-10-13 08:14:04 [main] DEBUG c.s.a.c.ProxyConfig$$EnhancerBySpringCGLIB$$a603fa68 - https_proxy environment variable not found, no proxy details set
2020-10-13 08:14:04 [main] ERROR o.s.b.w.e.tomcat.TomcatStarter - Error starting Tomcat context. Exception: org.springframework.beans.factory.BeanCreationException. Message: Error creating bean with name 'servletEndpointRegistrar' defined in class path resource [org/springframework/boot/actuate/autoconfigure/endpoint/web/ServletEndpointManagementContextConfiguration$WebMvcServletEndpointManagementContextConfiguration.class]: Bean instantiation via factory method failed;
 nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.boot.actuate.endpoint.web.ServletEndpointRegistrar]: Factory method 'servletEndpointRegistrar' threw exception;
 nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'infoEndpoint' defined in class path resource [org/springframework/boot/actuate/autoconfigure/info/InfoEndpointAutoConfiguration.class]: Bean instantiation via factory method failed;
 nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.boot.actuate.info.InfoEndpoint]: Factory method 'infoEndpoint' threw exception;
 nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'infoActuator': Injection of resource dependencies failed;
 nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'envValues' defined in class path resource [com/shinesolutions/aemorchestrator/config/AemConfig.class]: Unsatisfied dependency expressed through method 'envValues' parameter 0;
 nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'awsHelperService': Injection of resource dependencies failed;
 nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'amazonEC2Client' defined in class path resource [com/shinesolutions/aemorchestrator/config/AwsConfig.class]: Unsatisfied dependency expressed through method 'amazonEC2Client' parameter 1;
 nested exception is org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'awsClientConfig' defined in class path resource [com/shinesolutions/aemorchestrator/config/AwsConfig.class]: Unsatisfied dependency expressed through method 'awsClientConfig' parameter 0;
 nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No qualifying bean of type 'com.shinesolutions.aemorchestrator.model.ProxyDetails' available: expected at least 1 bean which qualifies as autowire candidate. Dependency annotations: {}
2020-10-13 08:14:04 [main] DEBUG c.s.a.c.ProxyConfig$$EnhancerBySpringCGLIB$$a603fa68 - https_proxy environment variable not found, no proxy details set
2020-10-13 08:14:04 [main] ERROR o.s.b.d.LoggingFailureAnalysisReporter - 

***************************
APPLICATION FAILED TO START
***************************

Description:

Parameter 0 of method awsClientConfig in com.shinesolutions.aemorchestrator.config.AwsConfig required a bean of type 'com.shinesolutions.aemorchestrator.model.ProxyDetails' that could not be found.

The following candidates were found but could not be injected:
	- User-defined bean method 'proxyDetails' in 'ProxyConfig' ignored as the bean value is null


Action:

Consider revisiting the entries above or defining a bean of type 'com.shinesolutions.aemorchestrator.model.ProxyDetails' in your configuration.

NoSuchElementException error when scaling publish up

This error shows up on Publish instance scale up event:

2017-02-23 17:28:54 [SessionCallBackSchedulerThread-1] ERROR c.s.a.h.AutoScalingEventHandler - Failed to execute autoscaling:EC2_INSTANCE_LAUNCH action for auto scaling group name: stackprefix-aem-publish-stack-PublishAutoScalingGroup-KML9H6V7L7M9
java.util.NoSuchElementException: No value present
        at java.util.Optional.get(Optional.java:135)
        at com.shinesolutions.aemorchestrator.service.AemInstanceHelperService.getPublishIdToSnapshotFrom(AemInstanceHelperService.java:184)
        at com.shinesolutions.aemorchestrator.service.AemInstanceHelperService$$FastClassBySpringCGLIB$$a28d9dcf.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:721)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
        at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:122)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:656)
        at com.shinesolutions.aemorchestrator.service.AemInstanceHelperService$$EnhancerBySpringCGLIB$$b1e78141.getPublishIdToSnapshotFrom(<generated>)
        at com.shinesolutions.aemorchestrator.actions.ScaleUpPublishAction.execute(ScaleUpPublishAction.java:55)
        at com.shinesolutions.aemorchestrator.handler.AutoScalingEventHandler.handleEvent(AutoScalingEventHandler.java:38)
        at com.shinesolutions.aemorchestrator.handler.SqsMessageHandler.handleMessage(SqsMessageHandler.java:52)
        at com.shinesolutions.aemorchestrator.service.MessageReceiver.onMessage(MessageReceiver.java:42)
        at com.amazon.sqs.javamessaging.SQSSessionCallbackScheduler.run(SQSSessionCallbackScheduler.java:151)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Prioritising Publish and Publish-Dispatcher instances from same AZ

When Orchestrator is finding a Publish instance candidate to pair with a Publish-Dispatcher instance, it should prioritise an instance from the same AZ. The goal is to reduce the probability of cross-pairing between AZs.

However, if there's no candidate from the same AZ, allow the cross-AZ pairing and log a warning message.
This handles the rare scenario where the number of instances per AZ is unbalanced between the Publish and Publish Dispatcher component layers.
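
The selection rule could look roughly like the following. The Instance type and all names are assumptions made for this illustration, not the Orchestrator's model classes.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration of same-AZ preference with a cross-AZ fallback.
class AzPairingSketch {

    static final class Instance {
        final String id;
        final String availabilityZone;

        Instance(String id, String availabilityZone) {
            this.id = id;
            this.availabilityZone = availabilityZone;
        }
    }

    // Prefers a candidate in the dispatcher's AZ; otherwise falls back to the
    // first candidate from any AZ and logs a warning.
    static Optional<Instance> choosePublish(List<Instance> candidates,
                                            String dispatcherAz) {
        Optional<Instance> sameAz = candidates.stream()
                .filter(c -> c.availabilityZone.equals(dispatcherAz))
                .findFirst();
        if (sameAz.isPresent()) {
            return sameAz;
        }
        Optional<Instance> any = candidates.stream().findFirst();
        if (any.isPresent()) {
            System.err.println("WARN: no candidate in " + dispatcherAz
                    + ", pairing across AZs with " + any.get().id);
        }
        return any;
    }
}
```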

Publisher alarm action should be configurable

Rather than defaulting to terminating an instance when alarm is triggered https://github.com/shinesolutions/aem-orchestrator/blob/master/src/main/java/com/shinesolutions/aemorchestrator/actions/AlarmContentHealthCheckAction.java#L28 , this behaviour should be configurable to either terminate or simply notify / log.

This is due to the possibility of alarms being triggered by either system (known, so should be terminated) or authors (often not known, could be caused by publishing, so might be limited to just notification without termination).
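
One way to model the setting is an enum parsed from a property value, as sketched here. The enum, its values and the fallback rule are assumptions for illustration; no such property exists in the configuration described above.

```java
import java.util.Locale;

// Hypothetical illustration of a configurable alarm action.
class AlarmActionSketch {

    enum AlarmAction { TERMINATE, NOTIFY }

    // Parses the configured value case-insensitively. Falls back to TERMINATE,
    // the current behaviour, when the value is unset or not recognised.
    static AlarmAction fromProperty(String value) {
        if (value == null) {
            return AlarmAction.TERMINATE;
        }
        try {
            return AlarmAction.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return AlarmAction.TERMINATE;
        }
    }
}
```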

Publish snapshot ID source should perform a health check

On publish scale up event, when Orchestrator is trying to identify the source snapshot ID, it currently only checks for an existing instance.
This causes 2 problems:

  1. Snapshot could be done when AEM is starting up, causing the repository to be in an unexpected state.
  2. Double snapshotting as mentioned here #13

The logic has to be improved so that it not only checks for a publish instance ID but also checks that instance's health (via /system/health?tags=shallow).
If it cannot find any healthy publish instance, it should create the SnapshotId tag with an empty value.
If it can find one, it should use that healthy instance as the source of the snapshot.

Relevant method:

public String getPublishIdToSnapshotFrom(String excludeInstanceId) {
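
The proposed selection can be sketched as follows, with the health check and the snapshot call passed in as functions. This is not the body of the method above; the class, method and parameter names are hypothetical. Returning an Optional also makes the "no source found" case explicit for the caller.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical illustration of choosing a healthy snapshot source.
class SnapshotSourceSketch {

    // Returns the first healthy publish instance other than the excluded one,
    // or an empty Optional when there is none.
    static Optional<String> findHealthySource(List<String> publishIds,
                                              String excludeInstanceId,
                                              Predicate<String> isHealthy) {
        return publishIds.stream()
                .filter(id -> !id.equals(excludeInstanceId))
                .filter(isHealthy)
                .findFirst();
    }

    // Value for the SnapshotId tag: the ID of a new snapshot taken from the
    // healthy source, or an empty string when no healthy source exists.
    static String snapshotIdTagValue(Optional<String> source,
                                     Function<String, String> createSnapshot) {
        return source.map(createSnapshot).orElse("");
    }
}
```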

Change the HTTP Method from POST to GET

The Reverse Replication Agents have a setting protocolHTTPMethod which is automatically set to POST.

This setting needs to be set to GET or else we get the following error:

ERROR - reverseReplicationAgent-i-09a9ac5ed462ed006 : Error while polling agent reverseReplicationAgent-i-09a9ac5ed462ed006: com.day.cq.replication.ReplicationException: Unsupported http method POST

Orchestrator unable to delete flush and replication agents

Orchestrator errors with "Method Not Allowed" message when deleting flush and replication agents.

2017-03-05 22:33:40 [SessionCallBackSchedulerThread-1] ERROR c.s.a.a.ScaleDownAuthorDispatcherAction - Failed to delete flush agent for dispatcher id: i-04dc42481fe9c8067, and run mode: author
com.shinesolutions.swaggeraem4j.ApiException: Method Not Allowed
	at com.shinesolutions.swaggeraem4j.ApiClient.handleResponse(ApiClient.java:1047)
	at com.shinesolutions.swaggeraem4j.ApiClient.execute(ApiClient.java:970)
	at com.shinesolutions.swaggeraem4j.ApiClient.execute(ApiClient.java:953)
	at com.shinesolutions.swaggeraem4j.api.SlingApi.deleteAgentWithHttpInfo(SlingApi.java:145)
	at com.shinesolutions.aemorchestrator.aem.FlushAgentManager.deleteFlushAgent(FlushAgentManager.java:40)
	at com.shinesolutions.aemorchestrator.aem.FlushAgentManager$$FastClassBySpringCGLIB$$e599d4da.invoke(<generated>)
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:721)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
	at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:122)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:656)
	at com.shinesolutions.aemorchestrator.aem.FlushAgentManager$$EnhancerBySpringCGLIB$$581ee12e.deleteFlushAgent(<generated>)
	at com.shinesolutions.aemorchestrator.actions.ScaleDownAuthorDispatcherAction.execute(ScaleDownAuthorDispatcherAction.java:34)
	at com.shinesolutions.aemorchestrator.handler.AutoScalingTerminateEventHandler.handleEvent(AutoScalingTerminateEventHandler.java:31)
	at com.shinesolutions.aemorchestrator.handler.SqsMessageHandler.handleMessage(SqsMessageHandler.java:52)
	at com.shinesolutions.aemorchestrator.service.MessageReceiver.onMessage(MessageReceiver.java:42)
	at com.amazon.sqs.javamessaging.SQSSessionCallbackScheduler.run(SQSSessionCallbackScheduler.java:151)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Log message payload at debug level

To assist with troubleshooting, Orchestrator should log the message payload at debug log level.

Even though it's not needed in most happy-path scenarios, it will be very helpful when troubleshooting unexpected errors.

Lack of pairing candidate availability shouldn't cause continual snapshotting

When Orchestrator encounters an error during publish layer scale up, e.g. when it can't find any unpaired publish-dispatcher instance, it currently reprocesses the message, which causes a snapshot to be created again on each retry.

2017-03-07 15:03:28 [SessionCallBackSchedulerThread-1] WARN  c.s.a.actions.ScaleUpPublishAction - Failed to find unpaired publish dispatcher
java.util.NoSuchElementException: No value present
at java.util.Optional.get(Optional.java:135)
at com.shinesolutions.aemorchestrator.service.AemInstanceHelperService.findUnpairedPublishDispatcher(AemInstanceHelperService.java:236)
at com.shinesolutions.aemorchestrator.service.AemInstanceHelperService$$FastClassBySpringCGLIB$$a28d9dcf.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:721)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
at org.springframework.retry.interceptor.RetryOperationsInterceptor$1.doWithRetry(RetryOperationsInterceptor.java:74)
at org.springframework.retry.support.RetryTemplate.doExecute(RetryTemplate.java:276)
at org.springframework.retry.support.RetryTemplate.execute(RetryTemplate.java:157)
at org.springframework.retry.interceptor.RetryOperationsInterceptor.invoke(RetryOperationsInterceptor.java:101)
at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:119)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:656)
at com.shinesolutions.aemorchestrator.service.AemInstanceHelperService$$EnhancerBySpringCGLIB$$b1e78141.findUnpairedPublishDispatcher(<generated>)
at com.shinesolutions.aemorchestrator.actions.ScaleUpPublishAction.execute(ScaleUpPublishAction.java:94)
at com.shinesolutions.aemorchestrator.handler.AutoScalingEventHandler.handleEvent(AutoScalingEventHandler.java:38)
at com.shinesolutions.aemorchestrator.handler.SqsMessageHandler.handleMessage(SqsMessageHandler.java:52)
at com.shinesolutions.aemorchestrator.service.MessageReceiver.onMessage(MessageReceiver.java:42)
at com.amazon.sqs.javamessaging.SQSSessionCallbackScheduler.run(SQSSessionCallbackScheduler.java:151)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Attach content health check alarm to publish instance

The AEM Orchestrator needs to create a CloudWatch alarm (on Publish scale up) which monitors the contentHealthCheck metric being generated by this script. When the alarm triggers, it should add a message on the SQS queue via the SNS topic, which will notify the Orchestrator to terminate the affected publish instance. Termination of the instance should also remove the alarm.

Replication agent state should be persisted during scale up publish action

On scale up publish action, the replication agent is created for the new pair, but it starts with an empty queue. This means we're losing the state of the replication agent on the originating publish instance.

Replication agent state should be treated just like the publish and publish-dispatcher pair's state, along with the publish instance's repository state (via EBS volume).

Error 400 while scaling up publisher

This error shows up when recovering from all publish instances being terminated:

2017-02-23 17:29:24 [SessionCallBackSchedulerThread-1] ERROR c.s.a.h.AutoScalingEventHandler - Failed to execute autoscaling:EC2_INSTANCE_LAUNCH action for auto scaling group name: stackprefix-aem-publish-stack-PublishAutoScalingGroup-KML9H6V7L7M9
com.amazonaws.services.ec2.model.AmazonEC2Exception: The volume 'vol-03d57d541eb57af5a' is 'creating' (Service: AmazonEC2; Status Code: 400; Error Code: IncorrectState; Request ID: 98cbc9d8-db1e-4e3d-a753-101cdc5c4f8e)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1586)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1254)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1035)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:747)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:721)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:704)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:672)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:654)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:518)
        at com.amazonaws.services.ec2.AmazonEC2Client.doInvoke(AmazonEC2Client.java:11840)
        at com.amazonaws.services.ec2.AmazonEC2Client.invoke(AmazonEC2Client.java:11816)
        at com.amazonaws.services.ec2.AmazonEC2Client.createSnapshot(AmazonEC2Client.java:2769)
        at com.shinesolutions.aemorchestrator.service.AwsHelperService.createSnapshot(AwsHelperService.java:209)
        at com.shinesolutions.aemorchestrator.service.AwsHelperService$$FastClassBySpringCGLIB$$e115e7b0.invoke(<generated>)
        at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
        at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:721)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
        at org.springframework.retry.annotation.AnnotationAwareRetryOperationsInterceptor.invoke(AnnotationAwareRetryOperationsInterceptor.java:122)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
        at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:656)
        at com.shinesolutions.aemorchestrator.service.AwsHelperService$$EnhancerBySpringCGLIB$$c62f753a.createSnapshot(<generated>)
        at com.shinesolutions.aemorchestrator.actions.ScaleUpPublishAction.execute(ScaleUpPublishAction.java:63)
        at com.shinesolutions.aemorchestrator.handler.AutoScalingEventHandler.handleEvent(AutoScalingEventHandler.java:38)
        at com.shinesolutions.aemorchestrator.handler.SqsMessageHandler.handleMessage(SqsMessageHandler.java:52)
        at com.shinesolutions.aemorchestrator.service.MessageReceiver.onMessage(MessageReceiver.java:42)
        at com.amazon.sqs.javamessaging.SQSSessionCallbackScheduler.run(SQSSessionCallbackScheduler.java:151)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
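
The trace shows `createSnapshot` being rejected because the EBS volume was still in the `creating` state. One possible mitigation, which the report itself does not propose, is to poll the volume state before requesting the snapshot. The sketch below assumes a caller-supplied state lookup; `VolumeStateWaiter`, `waitUntilSnapshottable`, and its parameters are hypothetical names and not part of the Orchestrator code base.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

public class VolumeStateWaiter {

    // EBS volume states in which a snapshot request is accepted.
    private static final Set<String> READY_STATES =
            new HashSet<>(Arrays.asList("available", "in-use"));

    /**
     * Polls the supplied state lookup until the volume reports a ready state
     * or the attempts are exhausted. The supplier is an assumption: in a real
     * application it would wrap an EC2 DescribeVolumes call.
     *
     * @return true if a ready state was observed, false if attempts ran out
     */
    public static boolean waitUntilSnapshottable(Supplier<String> volumeState,
                                                 int maxAttempts,
                                                 long delayMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (READY_STATES.contains(volumeState.get())) {
                return true;
            }
            // Sleep between polls, but not after the final attempt.
            if (attempt < maxAttempts) {
                Thread.sleep(delayMillis);
            }
        }
        return false;
    }
}
```

A caller would invoke `waitUntilSnapshottable` before `createSnapshot` and skip or defer the snapshot when it returns `false`.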

Orchestrator fails instance scale up due to extra fields

The Orchestrator fails during the scale-up action. The error below is from the Orchestrator log.

2017-09-22 15:53:30 [SessionCallBackSchedulerThread-1] ERROR c.s.a.h.AutoScalingLaunchEventHandler - Failed to execute 'scale up' action
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "InvokingAlarms" (class com.shinesolutions.aemorchestrator.model.Details), not marked as ignorable (2 known properties: "Availability Zone", "Subnet ID"])
at [Source: {"Progress":50,"AccountId":"918473058104","Description":"Launching a new EC2 instance:

It looks like the com.shinesolutions.aemorchestrator.model.Details class should either be annotated with @JsonIgnoreProperties(ignoreUnknown = true) or be extended to accept the new "InvokingAlarms" field.
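
A minimal sketch of the annotation approach follows. The real `Details` class is not shown in this report, so the Java field names here are assumptions; only the two JSON property names ("Availability Zone" and "Subnet ID") come from the log message. `DetailsExample` and `parse` are hypothetical names used for illustration.

```java
import java.io.IOException;

import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DetailsExample {

    // ignoreUnknown = true makes Jackson skip fields such as "InvokingAlarms"
    // instead of throwing UnrecognizedPropertyException.
    @JsonIgnoreProperties(ignoreUnknown = true)
    public static class Details {
        @JsonProperty("Availability Zone")
        public String availabilityZone; // field name is an assumption

        @JsonProperty("Subnet ID")
        public String subnetId;         // field name is an assumption
    }

    // Deserializes the "Details" portion of an auto scaling notification.
    public static Details parse(String json) throws IOException {
        return new ObjectMapper().readValue(json, Details.class);
    }
}
```

With the annotation in place, any further fields that AWS adds to the notification would also be ignored. Extending the class with an explicit `InvokingAlarms` property would only be needed if the Orchestrator has to read that value.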

Get SNS topic ARN

A more efficient and more secure way to retrieve the SNS topic ARN is needed.

Currently, AwsHelperService.getSnsTopicArn retrieves all topics in the account.

A better solution would be to retrieve the ARN from the CloudFormation stack resources, based on the logical ID.
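
The lookup by logical ID could be sketched as follows with the AWS SDK for Java v1 (the `com.amazonaws` packages seen in the stack trace above). The class name `SnsTopicArnLookup`, the method `findTopicArn`, and the idea of passing the stack name and logical ID as arguments are assumptions; how the Orchestrator would obtain those values is not covered here. For an `AWS::SNS::Topic` resource, CloudFormation reports the topic ARN as the physical resource ID.

```java
import com.amazonaws.services.cloudformation.AmazonCloudFormation;
import com.amazonaws.services.cloudformation.model.DescribeStackResourceRequest;
import com.amazonaws.services.cloudformation.model.DescribeStackResourceResult;

public class SnsTopicArnLookup {

    /**
     * Resolves a single stack resource by logical ID and returns its
     * physical resource ID, which for an SNS topic is the topic ARN.
     * Only one stack resource is queried, so no account-wide topic
     * listing is required.
     */
    public static String findTopicArn(AmazonCloudFormation cloudFormation,
                                      String stackName,
                                      String logicalResourceId) {
        DescribeStackResourceRequest request = new DescribeStackResourceRequest()
                .withStackName(stackName)
                .withLogicalResourceId(logicalResourceId);

        DescribeStackResourceResult result = cloudFormation.describeStackResource(request);
        return result.getStackResourceDetail().getPhysicalResourceId();
    }
}
```

Compared with listing every topic, this needs only the `cloudformation:DescribeStackResource` permission on the relevant stack instead of `sns:ListTopics` across the account.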

Change /libs/cq to cq

When creating a replication agent, the Orchestrator currently sets sling:resourceType to /libs/cq/*.
This should be changed to cq/*.
