Coder Social home page Coder Social logo

apache / helix Goto Github PK

View Code? Open in Web Editor NEW
457.0 35.0 218.0 41.44 MB

Mirror of Apache Helix

License: Apache License 2.0

Shell 0.91% Python 1.25% Java 78.37% TypeScript 1.31% HTML 17.08% JavaScript 0.70% SCSS 0.07% CSS 0.28% BitBake 0.03%
helix java big-data cloud

helix's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

helix's Issues

Implement the WAGED rebalancer

Based on the existing components (or the interfaces), we can now implement the initial version of the WAGED rebalancer.

AC:

  • Implement the rebalancer with unit tests.
  • Roughly integrate with the rebalance pipeline. The final integration will be done when all the components implementation is ready.

Fine-grained state transition throttling and respect MIN_ACTIVE replica

MIN_ACTIVE replica is only applied in DelayAutoRebalancer and not respected by throttling logic in Helix. Thus if preference list for partition is: Master, Slave, Slave and current state is Master, Slave, Offline. It will generate a state transition (ST) Offline -> Slave. Even if the MIN_ACTIVE replica number is 2, Helix will treat this Offline -> Slave to be Recovery rebalance. We should fix this.

Also we need the logic to have more fine-grained state transition instead of marking all state transition of a partition to same state transition type. For example, the preference list is Master, Slave, Slave and current state is Slave, Slave, Offline. There will be two STs: Slave -> Master and Offline -> Slave. In current computation logic, they are all Recovery STs because this partition is marked as Recovery partition. With new logic, Slave -> Master is Recovery ST and Offline -> Slave will be Load rebalance when MIN_ACTIVE rebalance set to be 2.

No stats fetched for ResourceConfig through ConfigAccessor

If you read ResourceConfig through ConfigAccessor, the stats are not pulled from ZK. Thus the stats data is incorrect with these objects.

We need to add the logic to the ConfigAccessor, not only for ResourceConfig but also other objects.

TASK: Drop all tasks whose requested states are DROPPED

Upon a Participant disconnect, the Participant would carry over from the last session. This would copy all previous task states to the current session and set their requested states as DROPPED (for INIT and RUNNING states).

It came to our attention that sometimes these Participants experience connection issues and the tasks happen to be in TASK_ERROR or COMPLETED states. These tasks would get stuck on the Participant and never be dropped. This issue proposes to add the logic that would get all tasks whose requested states are DROPPED to be dropped immediately.

See: JobDispatcher.java line 441.

Don't ignore exceptions: adding log messages to help check exceptions.

Ex.:

// ignore if it is not a workflow config.

     try {
        WorkflowConfig config = WorkflowConfig.fromHelixProperty(resource.getValue());
        workflowConfigMap.put(resource.getKey(), config);
      } catch (IllegalArgumentException ex) {
        // ignore if it is not a workflow config.
      }

If you catch an exception, you may not want to ignore it. Adding a log message would help engineers to understand what happens.

Don't log and throw: removing redundant logs right before throwing exception

Same log message should not be logged right while an exception is also thrown. Ex.:

LOG.error("Failed to delete job " + job + " from queue " + queue);

The principal "don't log and throw" would be a nice practice for handling exceptions in Java. We would like to keep the logs clean and helpful without redundancy. An exception message is enough to describe the exceptional event. And the stack trace tells you in which class, method, and line the exception was thrown.

TASK: Execution delay is not respected for the jobs

Current implementation of Task Framework checks if a Job is ready for the scheduling or not. If not, the job is inserted in the inflight jobs without any delay time restrictions which is not desirable.

TASK: Use CurrentState as the source of truth instead of Workflow/JobContext

When there are frequent disconnects to Participants, it is possible to have discrepancy between what's in CurrentState and what's in task contexts. This could leave the task metadata in a bad state where jobs could get stuck. Using CurrentState from all LiveInstances would solve this problem and give the controller the most up-to-date view of all tasks. No extra read would be required because CurrentState is cached.

Replace org.codehaus.jackson with FasterXML/jackson

This issue is the same as https://issues.apache.org/jira/browse/HELIX-747 (it's a bit confusing which issue list should be used for this project as the documentation on https://cwiki.apache.org/confluence/display/HELIX/Contributor+Workflow doesn't seem to be up to date):

The current json lib Helix uses is out of date. We should consider replacing it with a well-maintained lib.

FasterXML/jackson is compatible with the current lib we used. So it could be a good candidate.

The old version of Jackson that Helix currently uses contains multiple CVEs:
CVE-2017-17485, CVE-2017-7525, CVE-2017-15095

Support Property Read API - helix-rest

Currently, there's no endpoint available for users to read propertystore content directly via helix-rest endpoint. Users have to rely on tools to view the data on zookeeper directly.

Set instance_weight = 0 in config doesn't mean no partition assigned to the instance

Problem
The issue is raised by our internal client who sets the instance_weight = 0 in instance/config. Instead of expecting zero partitions assigned to it, based on our Helix controller algorithm, there's still a good chance some partitions assigned to the instance.

Code of issue
After investigation, the issue was due to the current CardDealingAlgorithm. It treats instance weight as a comparator factor or preference, rather than respecting the weight = 0.

https://github.com/apache/helix/blob/d5bf3ad410e0a54b1e79d9570d3e7897c0e2c947/helix-core/src/main/java/org/apache/helix/controller/rebalancer/strategy/crushMapping/CardDealingAdjustmentAlgorithmV2.java

Solution
There's no way to strictly follow all the weight setting of instances now. Because the algorithm is not designed for the purpose. But for the extreme case of instance weight = 0, the current workaround solution is add a simple check to skip the assignment of the instance.

Exclude ANY_INSTANCE for customized sibling checks

Current Helix HealthCheck API checks the ANY_INSTANCE resources, which is not necessary. Since ANY_INSTANCE resources only have single partition with 1 replica, there is no need to check sibling health status.

Need to improve the code style & formatting regulation

The current format file helix-style.xml has very limited configured items. This leaves too much flexibility to Helix developers. We shall add more items to it.

The known missing ones:

  • Wildcard Imports
  • Annotation wrapping
    Could be more.

Please comment on this issue and we will modify the format file accordingly.

Implementation of StateModelDef access in rest 2.0

Current implementation of Rest 2.0 does not support stateModelDef modification. Here, we will implement

delete -- remove the stateModelDef with the input id.

put -- create new statemodeldef if no existing one with same input id

set -- replace the content of node with input id

Make the reservoir used in ZkClientPathMonitor metric configurable

Currently it is hardcoded as a SlidingTimeWindowArrayReservoir which uses a default interval of 1hr. While the reservoir choice is fine (but still should be configurable for upgrade), the window length of 1 hour occupies too much heap space for a busy Prod Helix client. More importantly, one really does not need 1 hour window to compute a histogram stats. Typically 1 min is used.

E.g., Our heapdump here in Uber shows as much as 4G or even higher of heap could be used for these metrics (initiated through ZkHelixPropertyStore). This creates too much memory and GC pressure. As in the manual for the SlidingTimeWindowArrayReservoir, a 1 minute window could used as much as 9M of memory already. So we believe a much lower interval length should be passed in. Even better, the reservoir should be made configurable.

image

Performance issues in `InstancesAccessor#getParallelStoppableInstances`

Problems
The implementation of the method

getZoneBasedInstances(instances, orderOfZone, clusterTopology.toZoneMapping());
for (String instance : zoneBasedInstance) {
StoppableCheck stoppableCheckResult =
instanceService.getInstanceStoppableCheck(clusterId, instance, customizedInput);
if (!stoppableCheckResult.isStoppable()) {
ArrayNode failedReasonsNode = failedStoppableInstances.putArray(instance);
for (String failedReason : stoppableCheckResult.getFailedChecks()) {
failedReasonsNode.add(JsonNodeFactory.instance.textNode(failedReason));
}
} else {
stoppableInstances.add(instance);
}

has several performance issues:

  1. It's not actually run in parallel, think about each instance stoppable check has some lags and these will pile up in a large cluster.
  2. There're potential lots of duplicate requests because the current implementation of InstanceServiceImpl won't memorize the response of each query. In production, the performance will downgrade very badly. The duplicate requests are known mostly coming from partitions stoppable checks
    E.g

public StoppableCheck getInstanceStoppableCheck(String clusterId, String instanceName,
String jsonContent) throws IOException {
LOG.info("Perform instance level helix own health checks for {}/{}", clusterId, instanceName);
Map<String, Boolean> helixStoppableCheck = getInstanceHealthStatus(clusterId, instanceName,
InstanceService.HealthCheck.STOPPABLE_CHECK_LIST);
StoppableCheck result =
new StoppableCheck(helixStoppableCheck, StoppableCheck.Category.HELIX_OWN_CHECK);
if (!result.isStoppable()) {
return result;
}
LOG.info("{} passed helix side health checks", instanceName);
return performCustomInstanceChecks(clusterId, instanceName, getCustomPayLoads(jsonContent));

Solutions

  1. Parallelism executing the instance checks
  2. Cache the result of http request without submitting the same type of http requests again.

Invoke rebalance by "touching" IdealState/ResourceConfig

Current HelixDataAccesor updateProperty uses ZNRecordUpdater. It's merge logic just simply adding all elements when do a merge for ZNRecord. That could cause lot of duplication of listFields.
This impact the invokeRebalanceForResourceConfig. The fix will be implementing a customized updater.
"id" : "Test",
"simpleFields" : {
},
"mapFields" : {
},
"listFields" : {
"0" : [ "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3" ]
}
}

Fix config classes' equals()

ClusterConfig, InstanceConfig, ResourceConfig's equals() methods were only comparing the IDs, which is not desirable because it is possible to have two instances of each with the same IDs but different contents/fields. This diff improves the equals() methods for these configs so that the actual contents are compared as well.

Change the way Helix triggers rebalance in purgeExpiredJobs

Currently, we use ResourceConfigs to trigger a rebalance. But this is causing Helix to write a ResourceConfig back to ZK, which is not desirable when a mixed-mode rebalance is happening (user may create a non-task ResourceConfig for custom preferenceLists).

AC:

  1. Use an alternative way to trigger a controller rebalance.

See the code below:
if (expiredJobs.size() > 0) { // Update workflow context will be in main pipeline not here. Otherwise, it will cause // concurrent write issue. It is possible that jobs got purged but there is no event to // trigger the pipeline to clean context. HelixDataAccessor accessor = manager.getHelixDataAccessor(); List<String> resourceConfigs = accessor.getChildNames(accessor.keyBuilder().resourceConfigs()); if (resourceConfigs.size() > 0) { RebalanceScheduler.invokeRebalanceForResourceConfig(manager.getHelixDataAccessor(), resourceConfigs.get(0)); } else { LOG.warn( "No resource config to trigger rebalance for clean up contexts for" + expiredJobs); }

We have an alternative way to trigger a rebalance instead of having to "touch" a resource config. @lei-xia coded this up as part his P2P messaging fix. Let's use that instead.

Issue when client ONLY sets cluster level ANY throttle config

The current implementation has a loophole when

  1. When client ONLY sets cluster level ANY throttle config (no specific type throttle)
  2. The config won't be respected when charging load_rebalance or recovery_rebalance types

The root cause is the implementation relies on the counters in the hashmap, where each type including ANY is of different entry.

TASK: Fix ZkClient's "Failure to delete..." for task contexts

When we delete a workflow the following log was printed repeatedly (reproducible):

2019/08/06 23:46:58.152 WARN [ZkClient] [HelixController-pipeline-task-XXX] [helix] [] Failed to delete path /XXX/PROPERTYSTORE/TaskRebalancer/WORKFLOW_NAME! org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /XXX/PROPERTYSTORE/TaskRebalancer/WORKFLOW_NAME

We see a reproducible deletion problem with this use case, so this log might be a clue to what is exactly going on.

Leader controller loses all the callback handlers after leadership switch

A problem was found that the leader controller may lose all the callback handlers after leadership switch.
To reproduce the issue, the cluster must be using leader election mode (DistributedLeaderElection). Then frequent leadership switch caused by ZK session expiring may trigger the problem.

The symptom is that, although the leader controller exists, it won't process any ZK notification. So the cluster will not be managed.

Improve WorkflowConfig.fromHelixProperty()

Problem:
fromHelixProperty() throws IllegalArgumentException if the HelixProperty is not a WorkflowConfig.

public static WorkflowConfig fromHelixProperty(HelixProperty property)
throws IllegalArgumentException {
Map<String, String> configs = property.getRecord().getSimpleFields();
if (!configs.containsKey(WorkflowConfigProperty.Dag.name())) {
throw new IllegalArgumentException(
String.format("%s is an invalid WorkflowConfig", property.getId()));
}

And getWorkflows() uses the way of catching an exception to check whether or not the property is a WorkflowConfig.

public Map<String, WorkflowConfig> getWorkflows() {
Map<String, WorkflowConfig> workflowConfigMap = new HashMap<String, WorkflowConfig>();
Map<String, ResourceConfig> resourceConfigMap =
_accessor.getChildValuesMap(_accessor.keyBuilder().resourceConfigs());
for (Map.Entry<String, ResourceConfig> resource : resourceConfigMap.entrySet()) {
try {
WorkflowConfig config = WorkflowConfig.fromHelixProperty(resource.getValue());
workflowConfigMap.put(resource.getKey(), config);
} catch (IllegalArgumentException ex) {
// ignore if it is not a workflow config.

Solutions:

  1. Add boolean isWorkflowConfig(HelixProperty property) to check so catch exception is avoided.
  2. Use Optional. Implement Optional<WorkflowConfig> fromHelixProperty(HelixProperty property)
    Use Optional like below, which might look more intuitive to me than null/exception.
  public static Optional<WorkflowConfig> parseWorkflowConfig(HelixProperty property) {
    Map<String, String> configs = property.getRecord().getSimpleFields();
    if (!configs.containsKey(WorkflowConfigProperty.Dag.name())) {
      return Optional.empty();
    }
    return Optional.of(Builder.fromMap(configs).setWorkflowId(property.getId()).build());
  }
    for (Map.Entry<String, ResourceConfig> resource : resourceConfigMap.entrySet()) {
      String key = resource.getKey();
      ResourceConfig resourceConfig = resource.getValue();
      WorkflowConfig.parseWorkflowConfig(resourceConfig)
           .ifPresent(w -> workflowConfigMap.put(key, x));
    }

Previous discussions in #357:

  1. #357 (comment)
    narendly: As for parseWorkflow - I think it will be cleaner if we could have something like isWorkflowConfig() instead that returns T/F, and based on the result, include it here or not. This way, we do not return null (a code smell), and there are other places in the codebase where we could re-use isWorkflowConfig.

narendly: I don't think using Optional just for the sake of removing null is appropriate. I wouldn't think it would help with performance either. You could call isWorkflowConfig and parse if it is true, and skip if false, thereby doing minimal amount of parsing needed overall.

narendly: I think Optional would be beneficial if you want to have fluid chained-calls (similar to functional programming). Here, it is possible to avoid optionality/returning null by checking for the presence of a field. So using Optional is not necessary because you do not need to introduce null in the first place. It would also incur performance/memory overhead. -Hunter

narendly: Optional is a wrapper for its underlying objects, so it uses more memory and adds the work of boxing/unboxing. I am not claiming that this is significant, but it is a cost we do not have to incur.

  1. #357 (comment)
    i3wangyi: I'm in full support of using Optional.empty() instead of null. There're numerous articles on the internet saying NPE is a million-dollar mistake, the creation of Optional is just for solving the problem and the wrapper always reminds the developer the possibility of the object being empty. One can search for numerous supports online while the overhead of the optional object itself is simply negligible

Welcome to discuss...

java.lang.IllegalStateException: null is thrown when resource is disabled.

When CRUSH based rebalance strategy algorithm is used, if the related resource isn't enabled yet, Helix will throw IllegalStateException. The below is an example from one of the integration tests in Pinot code base:

2019/06/22 09:05:16.929 WARN [BestPossibleStateCalcStage] [HelixController-async_tasks-OfflineClusterIntegrationTest] Event aaadf5d1_DEFAULT : Failed to calculate best possible states for 1 resources.
2019/06/22 09:05:21.954 WARN [AutoRebalancer] [HelixController-pipeline-default-OfflineClusterIntegrationTest] Resource leadControllerResource has tag controller but no configured participants have this tag
2019/06/22 09:05:21.954 ERROR [CRUSHPlacementAlgorithm] [HelixController-pipeline-default-OfflineClusterIntegrationTest] 1 nodes of type INSTANCE were requested but the tree has only 0 nodes!
2019/06/22 09:05:21.955 ERROR [BestPossibleStateCalcStage] [HelixController-pipeline-default-OfflineClusterIntegrationTest] Event 512cdf47_DEFAULT : Error computing assignment for resource leadControllerResource. Skipping.
java.lang.IllegalStateException: null
	at org.apache.helix.controller.rebalancer.strategy.crushMapping.CRUSHPlacementAlgorithm$Selector.select(CRUSHPlacementAlgorithm.java:308) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.crushMapping.CRUSHPlacementAlgorithm.select(CRUSHPlacementAlgorithm.java:119) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.doSelect(CrushRebalanceStrategy.java:174) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.select(CrushRebalanceStrategy.java:140) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.computePartitionAssignment(CrushRebalanceStrategy.java:92) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.computePartitionAssignment(CrushRebalanceStrategy.java:48) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.AbstractEvenDistributionRebalanceStrategy.computePartitionAssignment(AbstractEvenDistributionRebalanceStrategy.java:89) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.strategy.AbstractEvenDistributionRebalanceStrategy.computePartitionAssignment(AbstractEvenDistributionRebalanceStrategy.java:49) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.AutoRebalancer.computeNewIdealState(AutoRebalancer.java:129) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.rebalancer.AutoRebalancer.computeNewIdealState(AutoRebalancer.java:51) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleState(BestPossibleStateCalcStage.java:245) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:121) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:77) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:68) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:640) ~[helix-core-0.9.0.jar:0.9.0]
	at org.apache.helix.controller.GenericHelixController.access$400(GenericHelixController.java:117) ~[helix-core-0.9.0.jar:0.9.0]

This exception should be cleaned up and the behavior should just be the same as the default rebalance strategy.

Related Pinot issue: apache/pinot#4355

Convert LinkedList to ArrayList or ArrayDeque while possible to improve performance.

I find that some code is still using LinkedList instead of ArrayList for the List object. Ex.:

  1. refreshProperties(accessor, new LinkedList<>(reloadKeys), new ArrayList<>(cachedKeys),

I believe ArrayList has better performance over LinkedList in most cases, especially when the dataset is big and we don't need to randomly delete data when using an iterator.

If LinkedList is used as a queue, we might want to look at Deque/ArrayDeque, which has better performance over LinkedList. ex.:

LinkedList<Pair> queue = new LinkedList<Pair>();

In summary, we should avoid using LinkedList as far as possible. Maybe for small dataset, linkedlist and ArrayList have similar performance. If we only target at small dataset in this framework, it should be fine. But how about using ArrayDeque over LinkedList for a Queue implementation?

References:
https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist
http://brianandstuff.com/2016/12/12/java-arraydeque-vs-linkedlist/

Multiple problems with 0.9.0

We see the following issues in Pinot deployment:

  1. Controller callback does not happen on a PARTICIPANT helix manager when leadership changes.
  2. Other participants that host regular partitions do not get a state transition after the situation in previous step happens. The only solution is to restart the leader controller.
  3. Task queues are stuck with multiple tasks in IN_PROGRESS state.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.