Mirror of Apache Helix
License: Apache License 2.0
Issues:
CurrentStateCache, when updating its snapshot, would miss all the existing partitions that have state changes.
RoutingTableProvider callbacks run on the main event thread, and the time spent is not accounted for in the log.
Based on the existing components (or the interfaces), we can now implement the initial version of the WAGED rebalancer.
AC:
MIN_ACTIVE replica is only applied in DelayAutoRebalancer and is not respected by the throttling logic in Helix. Thus, if the preference list for a partition is Master, Slave, Slave and the current state is Master, Slave, Offline, Helix will generate a state transition (ST) Offline -> Slave. Even if the MIN_ACTIVE replica number is 2, Helix will treat this Offline -> Slave as a Recovery rebalance. We should fix this.
We also need logic for more fine-grained state transition classification, instead of marking all state transitions of a partition with the same type. For example, if the preference list is Master, Slave, Slave and the current state is Slave, Slave, Offline, there will be two STs: Slave -> Master and Offline -> Slave. In the current computation logic, both are Recovery STs because the partition is marked as a Recovery partition. With the new logic, Slave -> Master is a Recovery ST and Offline -> Slave is a Load rebalance when MIN_ACTIVE is set to 2.
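A minimal sketch of such per-transition classification, with illustrative names (this is not Helix's actual API): a transition counts as Recovery if it restores a top-state replica or if the partition is below MIN_ACTIVE; otherwise it is a Load rebalance.

```java
import java.util.List;

class StClassifier {
    enum RebalanceType { RECOVERY, LOAD }

    // Classify one transition individually, instead of tagging the whole partition.
    // A transition is Recovery if it brings up the top state, or if the partition
    // currently has fewer than minActive replicas in an active state.
    static RebalanceType classify(List<String> currentStates, int minActive,
                                  String fromState, String toState) {
        long active = currentStates.stream()
                .filter(s -> s.equals("Master") || s.equals("Slave"))
                .count();
        if (toState.equals("Master") || active < minActive) {
            return RebalanceType.RECOVERY;
        }
        return RebalanceType.LOAD;
    }
}
```

With current state Slave, Slave, Offline and MIN_ACTIVE = 2, this classifies Slave -> Master as Recovery and Offline -> Slave as Load, matching the desired behavior above.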
If you read a ResourceConfig through ConfigAccessor, the stats are not pulled from ZK, so the stats data on these objects is incorrect.
We need to add this logic to ConfigAccessor, not only for ResourceConfig but also for the other objects.
Upon a Participant disconnect, the Participant carries over from the last session. This copies all previous task states to the current session and sets their requested states to DROPPED (for INIT and RUNNING states).
It came to our attention that sometimes these Participants experience connection issues while tasks happen to be in TASK_ERROR or COMPLETED states. These tasks get stuck on the Participant and are never dropped. This issue proposes adding logic that immediately drops all tasks whose requested state is DROPPED.
See: JobDispatcher.java line 441.
Ex.:
try {
  WorkflowConfig config = WorkflowConfig.fromHelixProperty(resource.getValue());
  workflowConfigMap.put(resource.getKey(), config);
} catch (IllegalArgumentException ex) {
  // ignore if it is not a workflow config.
}
If you catch an exception, you usually should not ignore it silently. Adding a log message helps engineers understand what happened.
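A minimal sketch of that suggestion, using java.util.logging for illustration (Helix itself uses slf4j; the class and method names below are placeholders):

```java
import java.util.logging.Logger;

class WorkflowConfigParser {
    private static final Logger LOG = Logger.getLogger(WorkflowConfigParser.class.getName());

    // Stand-in for the loop body above: parse stands in for
    // WorkflowConfig.fromHelixProperty(resource.getValue()).
    static boolean tryParse(String resourceName, Runnable parse) {
        try {
            parse.run();
            return true;
        } catch (IllegalArgumentException ex) {
            // No longer silent: record which resource was skipped and why.
            LOG.fine("Resource " + resourceName
                + " is not a workflow config, skipping: " + ex.getMessage());
            return false;
        }
    }
}
```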
Please find the design doc here:
https://github.com/apache/helix/wiki/Design-Proposal---Weight-Aware-Globally-Even-Distribute-Rebalancer
This data provider will help convert Helix cluster status data into the rebalancer-readable Cluster Model object. The result will be used to support rebalance calculation.
AC:
We observed that some jobs whose start time (with an execution delay (DelayTime) set) is greater than the current time get scheduled anyway. Investigate the following two issues (they may be related):
Investigate the behavior described in the title. E.g., if a workflow is stopped, do all its tasks ultimately get updated to the STOPPED state?
The same log message should not be logged right before an exception carrying it is also thrown. Ex.:
The principle of "don't log and throw" is good practice for handling exceptions in Java. We would like to keep the logs clean and helpful, without redundancy. An exception message is enough to describe the exceptional event, and the stack trace tells you in which class, method, and line the exception was thrown.
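A small sketch contrasting the two styles (names are illustrative):

```java
class QuotaChecker {
    static boolean isOverQuota(int used, int limit) {
        return used > limit;
    }

    // Anti-pattern ("log and throw") -- the same message ends up in the log AND in
    // the exception, so every caller up the stack that also logs produces duplicates:
    //
    //   LOG.error("Quota exceeded for " + user);
    //   throw new IllegalStateException("Quota exceeded for " + user);
    //
    // Recommended: throw once. The exception message describes the event, and the
    // stack trace identifies the class, method, and line. Log only where the
    // exception is finally handled.
    static void checkQuota(String user, int used, int limit) {
        if (isOverQuota(used, limit)) {
            throw new IllegalStateException(
                "Quota exceeded for " + user + ": " + used + " > " + limit);
        }
    }
}
```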
In the ZooKeeper client, when a znode doesn't exist (NoNodeException), a readFailure is reported, which shouldn't happen.
The current implementation of the Task Framework checks whether a job is ready for scheduling. If not, the job is inserted into the inflight jobs without any delay-time restrictions, which is not desirable.
This issue tracks the work of refining the POC: a constraint-based rebalance algorithm.
POC code:
During discussions with our potential customers, we noticed that a default instance capacity could be very helpful.
The default capacity would be applied to an instance when the instance-level capacity map is empty or not configured.
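A minimal sketch of the fallback rule, with assumed names (CapacityResolver and its parameters are illustrative, not Helix's actual API):

```java
import java.util.Map;

class CapacityResolver {
    // Use the cluster-level default capacity map whenever the instance-level
    // map is missing or empty; otherwise the instance-level map wins.
    static Map<String, Integer> resolve(Map<String, Integer> instanceCapacity,
                                        Map<String, Integer> clusterDefaultCapacity) {
        if (instanceCapacity == null || instanceCapacity.isEmpty()) {
            return clusterDefaultCapacity;
        }
        return instanceCapacity;
    }
}
```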
When there are frequent disconnects from Participants, it is possible to have a discrepancy between what's in CurrentState and what's in task contexts. This could leave the task metadata in a bad state where jobs could get stuck. Using CurrentState from all LiveInstances would solve this problem and give the controller the most up-to-date view of all tasks. No extra read would be required because CurrentState is cached.
This issue is the same as https://issues.apache.org/jira/browse/HELIX-747 (it's a bit confusing which issue list should be used for this project as the documentation on https://cwiki.apache.org/confluence/display/HELIX/Contributor+Workflow doesn't seem to be up to date):
The current json lib Helix uses is out of date. We should consider replacing it with a well-maintained lib.
FasterXML/jackson is compatible with the current lib we use, so it could be a good candidate.
The old version of Jackson that Helix currently uses contains multiple CVEs:
CVE-2017-17485, CVE-2017-7525, CVE-2017-15095
Currently this test is disabled because it is unstable. It would be better to understand why the MBean does not always show up in the default monitoring, and then fix the test.
Currently, there is no endpoint for users to read PropertyStore content directly via helix-rest. Users have to rely on tools to view the data on ZooKeeper directly.
Go through the logs and change some debug logs to info logs since they might help with debugging.
Problem
The issue was raised by an internal client who set instance_weight = 0 in the instance config. Instead of zero partitions being assigned to it, as expected, under our Helix controller algorithm there is still a good chance that some partitions are assigned to the instance.
Code of issue
After investigation, the issue was traced to the current CardDealingAlgorithm. It treats instance weight as a comparison factor, or preference, rather than strictly respecting weight = 0.
Solution
There is currently no way to strictly follow all instance weight settings, because the algorithm was not designed for that purpose. But for the extreme case of instance weight = 0, the current workaround is to add a simple check that skips assignment to the instance.
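The workaround could look like the following sketch, with illustrative names (this is not the actual CardDealingAlgorithm code): drop weight-0 instances from the candidate set before assignment runs.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ZeroWeightFilter {
    // Respect weight = 0 strictly: such an instance never receives a partition.
    // Instances with no configured weight default to weight 1 here.
    static List<String> eligibleInstances(List<String> instances,
                                          Map<String, Integer> weights) {
        return instances.stream()
                .filter(i -> weights.getOrDefault(i, 1) > 0)
                .collect(Collectors.toList());
    }
}
```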
This links to #320
The current Helix HealthCheck API checks ANY_INSTANCE resources, which is unnecessary. Since ANY_INSTANCE resources have only a single partition with one replica, there is no need to check sibling health status.
The current format file helix-style.xml has very few configured items, which leaves too much flexibility to Helix developers. We should add more items to it.
The known missing ones:
Please comment on this issue and we will modify the format file accordingly.
The current implementation of REST 2.0 does not support stateModelDef modification. Here, we will implement:
delete -- remove the stateModelDef with the input id.
put -- create a new stateModelDef if none exists with the same input id.
set -- replace the content of the node with the input id.
Currently it is hardcoded as a SlidingTimeWindowArrayReservoir, which uses a default interval of 1 hour. While the reservoir choice is fine (though it should still be configurable for upgrades), a 1-hour window occupies too much heap space for a busy production Helix client. More importantly, one does not need a 1-hour window to compute histogram stats; typically 1 minute is used.
E.g., our heap dump here at Uber shows that as much as 4 GB or more of heap can be used for these metrics (initiated through ZkHelixPropertyStore). This creates too much memory and GC pressure. As noted in the manual for SlidingTimeWindowArrayReservoir, even a 1-minute window can use as much as 9 MB of memory. So we believe a much lower interval length should be passed in. Even better, the reservoir should be made configurable.
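To illustrate why window length drives heap usage, here is a toy sliding-time-window reservoir (a sketch, not the Dropwizard implementation Helix uses): every sample inside the window is retained, so a 1-hour window on a busy client holds roughly 60x the samples of a 1-minute window.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class ToyTimeWindowReservoir {
    private final long windowNanos;
    // Each element is [timestampNanos, value]; all samples in the window are kept.
    private final Deque<long[]> samples = new ArrayDeque<>();

    ToyTimeWindowReservoir(long windowNanos) {
        this.windowNanos = windowNanos;
    }

    void update(long nowNanos, long value) {
        samples.addLast(new long[]{nowNanos, value});
        trim(nowNanos);
    }

    int size(long nowNanos) {
        trim(nowNanos);
        return samples.size();
    }

    // Evict samples older than the window; memory is proportional to
    // (update rate) x (window length), which is why the window matters.
    private void trim(long nowNanos) {
        while (!samples.isEmpty() && samples.peekFirst()[0] < nowNanos - windowNanos) {
            samples.removeFirst();
        }
    }
}
```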
Problems
The implementation of the method
has several performance issues: InstanceServiceImpl won't memoize the response of each query. In production, performance will degrade very badly. The duplicate requests are known to come mostly from partition stoppable checks.
Solutions
The current implementation of the Task Framework involves the use of IdealState, which is no longer necessary. IdealState can be removed.
The current HelixDataAccessor updateProperty uses ZNRecordUpdater. Its merge logic simply adds all elements when merging ZNRecords, which can cause a lot of duplication in listFields.
This impacts invokeRebalanceForResourceConfig. The fix is to implement a customized updater. For example:
"id" : "Test",
"simpleFields" : {
},
"mapFields" : {
},
"listFields" : {
"0" : [ "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3", "1", "2", "3" ]
}
}
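A sketch of what the customized updater's merge could do instead, using plain maps to stay self-contained (a stand-in for a DataUpdater over ZNRecord list fields; names are illustrative): merge with set semantics so repeated updates do not duplicate entries.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

class DedupListMerger {
    // Merge the update's list fields into the current record's list fields,
    // dropping duplicates while preserving insertion order.
    static void merge(Map<String, List<String>> current,
                      Map<String, List<String>> update) {
        for (Map.Entry<String, List<String>> e : update.entrySet()) {
            LinkedHashSet<String> merged =
                new LinkedHashSet<>(current.getOrDefault(e.getKey(), List.of()));
            merged.addAll(e.getValue());
            current.put(e.getKey(), new ArrayList<>(merged));
        }
    }
}
```

Applying the same update twice then yields ["1", "2", "3"] instead of the duplicated list shown above.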
ClusterConfig, InstanceConfig, and ResourceConfig's equals() methods were only comparing IDs, which is undesirable because two instances can have the same ID but different contents/fields. This diff improves the equals() methods for these configs so that the actual contents are compared as well.
Currently, we use ResourceConfigs to trigger a rebalance. But this causes Helix to write a ResourceConfig back to ZK, which is undesirable when a mixed-mode rebalance is happening (a user may create a non-task ResourceConfig for custom preferenceLists).
AC:
See the code below:
if (expiredJobs.size() > 0) {
  // Update workflow context will be in main pipeline not here. Otherwise, it will cause
  // concurrent write issue. It is possible that jobs got purged but there is no event to
  // trigger the pipeline to clean context.
  HelixDataAccessor accessor = manager.getHelixDataAccessor();
  List<String> resourceConfigs =
      accessor.getChildNames(accessor.keyBuilder().resourceConfigs());
  if (resourceConfigs.size() > 0) {
    RebalanceScheduler.invokeRebalanceForResourceConfig(manager.getHelixDataAccessor(),
        resourceConfigs.get(0));
  } else {
    LOG.warn(
        "No resource config to trigger rebalance for clean up contexts for" + expiredJobs);
  }
We have an alternative way to trigger a rebalance instead of having to "touch" a resource config. @lei-xia coded this up as part of his P2P messaging fix. Let's use that instead.
The current implementation has a loophole when
The root cause is that the implementation relies on the counters in the hashmap, where each type, including ANY, is a separate entry.
In the REST partition read, we read partition-level health status one by one, which increases the number of calls to ZK. We should improve it with a batch call.
When we delete a workflow, the following log is printed repeatedly (reproducible):
2019/08/06 23:46:58.152 WARN [ZkClient] [HelixController-pipeline-task-XXX] [helix] [] Failed to delete path /XXX/PROPERTYSTORE/TaskRebalancer/WORKFLOW_NAME! org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty for /XXX/PROPERTYSTORE/TaskRebalancer/WORKFLOW_NAME
We see a reproducible deletion problem with this use case, so this log might be a clue to what is exactly going on.
A problem was found that the leader controller may lose all the callback handlers after leadership switch.
To reproduce the issue, the cluster must be using leader election mode (DistributedLeaderElection). Frequent leadership switches caused by ZK session expiry may then trigger the problem.
The symptom is that although the leader controller exists, it does not process any ZK notifications, so the cluster is not managed.
Since the controller will be notified when a ResourceConfig changes, it is not necessary to perform the touch logic once the ResourceConfig is updated. Hence, some of the places issuing the touch logic can be removed.
Problem:
fromHelixProperty() throws IllegalArgumentException if the HelixProperty is not a WorkflowConfig.
helix/helix-core/src/main/java/org/apache/helix/task/WorkflowConfig.java
Lines 318 to 324 in 1063e7a
And getWorkflows() relies on catching that exception to check whether or not the property is a WorkflowConfig.
helix/helix-core/src/main/java/org/apache/helix/task/TaskDriver.java
Lines 770 to 780 in 1063e7a
Solutions:
Add boolean isWorkflowConfig(HelixProperty property) to check first, so catching an exception is avoided. Alternatively, have fromHelixProperty(HelixProperty property) return Optional<WorkflowConfig>:
public static Optional<WorkflowConfig> parseWorkflowConfig(HelixProperty property) {
  Map<String, String> configs = property.getRecord().getSimpleFields();
  if (!configs.containsKey(WorkflowConfigProperty.Dag.name())) {
    return Optional.empty();
  }
  return Optional.of(Builder.fromMap(configs).setWorkflowId(property.getId()).build());
}
for (Map.Entry<String, ResourceConfig> resource : resourceConfigMap.entrySet()) {
  String key = resource.getKey();
  WorkflowConfig.parseWorkflowConfig(resource.getValue())
      .ifPresent(w -> workflowConfigMap.put(key, w));
}
Previous discussions in #357:
narendly: I don't think using Optional just for the sake of removing null is appropriate. I wouldn't think it would help with performance either. You could call isWorkflowConfig and parse if it is true, and skip if false, thereby doing minimal amount of parsing needed overall.
narendly: I think Optional would be beneficial if you want to have fluid chained-calls (similar to functional programming). Here, it is possible to avoid optionality/returning null by checking for the presence of a field. So using Optional is not necessary because you do not need to introduce null in the first place. It would also incur performance/memory overhead. -Hunter
narendly: Optional is a wrapper for its underlying objects, so it uses more memory and adds the work of boxing/unboxing. I am not claiming that this is significant, but it is a cost we do not have to incur.
Comments and discussion are welcome.
We will soon start the development of the new rebalancer. https://github.com/apache/helix/wiki/Design-Proposal---Weight-Aware-Globally-Even-Distribute-Rebalancer
This issue is created for tracking the preparation.
The current AutoFallbackPropertyStore instantiation marks the zkCachePath as null, so the caching functionality does not work for this type of PropertyStore.
When the CRUSH-based rebalance strategy is used, if the related resource isn't enabled yet, Helix will throw an IllegalStateException. Below is an example from one of the integration tests in the Pinot code base:
2019/06/22 09:05:16.929 WARN [BestPossibleStateCalcStage] [HelixController-async_tasks-OfflineClusterIntegrationTest] Event aaadf5d1_DEFAULT : Failed to calculate best possible states for 1 resources.
2019/06/22 09:05:21.954 WARN [AutoRebalancer] [HelixController-pipeline-default-OfflineClusterIntegrationTest] Resource leadControllerResource has tag controller but no configured participants have this tag
2019/06/22 09:05:21.954 ERROR [CRUSHPlacementAlgorithm] [HelixController-pipeline-default-OfflineClusterIntegrationTest] 1 nodes of type INSTANCE were requested but the tree has only 0 nodes!
2019/06/22 09:05:21.955 ERROR [BestPossibleStateCalcStage] [HelixController-pipeline-default-OfflineClusterIntegrationTest] Event 512cdf47_DEFAULT : Error computing assignment for resource leadControllerResource. Skipping.
java.lang.IllegalStateException: null
at org.apache.helix.controller.rebalancer.strategy.crushMapping.CRUSHPlacementAlgorithm$Selector.select(CRUSHPlacementAlgorithm.java:308) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.crushMapping.CRUSHPlacementAlgorithm.select(CRUSHPlacementAlgorithm.java:119) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.doSelect(CrushRebalanceStrategy.java:174) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.select(CrushRebalanceStrategy.java:140) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.computePartitionAssignment(CrushRebalanceStrategy.java:92) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.CrushRebalanceStrategy.computePartitionAssignment(CrushRebalanceStrategy.java:48) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.AbstractEvenDistributionRebalanceStrategy.computePartitionAssignment(AbstractEvenDistributionRebalanceStrategy.java:89) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.strategy.AbstractEvenDistributionRebalanceStrategy.computePartitionAssignment(AbstractEvenDistributionRebalanceStrategy.java:49) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.AutoRebalancer.computeNewIdealState(AutoRebalancer.java:129) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.rebalancer.AutoRebalancer.computeNewIdealState(AutoRebalancer.java:51) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleState(BestPossibleStateCalcStage.java:245) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:121) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:77) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:68) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:640) ~[helix-core-0.9.0.jar:0.9.0]
at org.apache.helix.controller.GenericHelixController.access$400(GenericHelixController.java:117) ~[helix-core-0.9.0.jar:0.9.0]
This exception should be cleaned up and the behavior should just be the same as the default rebalance strategy.
Related Pinot issue: apache/pinot#4355
Currently we can set only one instance group tag for a job.
jobCfg.setInstanceGroupTag("INSTANCEGROUPTAG");
It would be really helpful to support multiple instance group tags for one job, so that a job can run either in group A or in group B. Through this, we can support more advanced scheduling use cases like node affinity in Kubernetes [1].
[1] https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
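One possible matching rule, sketched with illustrative names (today's JobConfig only offers the single-tag setInstanceGroupTag): a job carrying several tags is eligible to run on an instance if any of its tags match, mirroring OR-style node affinity.

```java
import java.util.Collections;
import java.util.Set;

class MultiTagMatcher {
    // An instance is eligible for a job if it carries ANY of the job's tags.
    // An untagged job can run anywhere.
    static boolean isEligible(Set<String> instanceTags, Set<String> jobTags) {
        if (jobTags.isEmpty()) {
            return true;
        }
        return !Collections.disjoint(instanceTags, jobTags);
    }
}
```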
Create a stage after ReadClusterDataStage for cluster change detection.
There are many unused imports in the project; we should remove them to keep the code clean.
In CallbackHandler, the re-subscription of the watcher for CALLBACK happens asynchronously. If the re-subscription happens after the change of the path, then we will miss the change until the next change happens.
Let's do something like this:
thread.setName("TaskStateModelFactory-task_thread" + thread.getId())
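Applied through a ThreadFactory, the naming suggestion might look like this sketch (the class name is illustrative), so every task thread is identifiable in thread dumps:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class NamedTaskThreadFactory implements ThreadFactory {
    private final AtomicInteger counter = new AtomicInteger();

    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        // Each thread created by the task executor gets a recognizable name.
        t.setName("TaskStateModelFactory-task_thread-" + counter.getAndIncrement());
        return t;
    }
}
```

Such a factory can be passed to Executors.newFixedThreadPool(n, factory) so the pool's threads are all named consistently.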
I find that some code still uses LinkedList instead of ArrayList for List objects. Ex.:
I believe ArrayList performs better than LinkedList in most cases, especially when the dataset is big and we don't need to randomly delete data while using an iterator.
If a LinkedList is used as a queue, we might want to look at Deque/ArrayDeque, which performs better than LinkedList. Ex.:
In summary, we should avoid using LinkedList as far as possible. For small datasets, LinkedList and ArrayList may have similar performance, and if we only target small datasets in this framework it should be fine. But how about using ArrayDeque over LinkedList for queue implementations?
References:
https://stackoverflow.com/questions/6163166/why-is-arraydeque-better-than-linkedlist
http://brianandstuff.com/2016/12/12/java-arraydeque-vs-linkedlist/
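A quick illustration of the suggested swap: ArrayDeque offers the same queue operations as LinkedList with a contiguous backing array (better locality, no per-node allocation).

```java
import java.util.ArrayDeque;
import java.util.Deque;

class QueueExample {
    // Enqueue 1..n, then drain FIFO and sum the values.
    static int drainSum(int n) {
        Deque<Integer> queue = new ArrayDeque<>();   // was: new LinkedList<>()
        for (int i = 1; i <= n; i++) {
            queue.addLast(i);                        // enqueue
        }
        int sum = 0;
        while (!queue.isEmpty()) {
            sum += queue.pollFirst();                // dequeue
        }
        return sum;
    }
}
```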
If debugging the NPE is complete, some of the logs for the CRUSH-based algorithm (in the computeResourceBestPossibleState function) can be removed.
We see the following issues in Pinot deployment:
Sometimes ZK hangs in the getWorkflow() call, and we could provide an API with a timeout to handle this at the application level.
AC: