
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

CSS 0.06% Clojure 1.36% Java 98.52% Groovy 0.05%

hystrix's Introduction

Hystrix: Latency and Fault Tolerance for Distributed Systems

Hystrix Status

Hystrix is no longer in active development, and is currently in maintenance mode.

Hystrix (at version 1.5.18) is stable enough to meet the needs of Netflix for our existing applications. Meanwhile, our focus has shifted towards more adaptive implementations that react to an application’s real time performance rather than pre-configured settings (for example, through adaptive concurrency limits). For the cases where something like Hystrix makes sense, we intend to continue using Hystrix for existing applications, and to leverage open and active projects like resilience4j for new internal projects. We are beginning to recommend others do the same.

Netflix Hystrix is now officially in maintenance mode, with the following expectations to the greater community: Netflix will no longer actively review issues, merge pull-requests, and release new versions of Hystrix. We have made a final release of Hystrix (1.5.18) per issue 1891 so that the latest version in Maven Central is aligned with the last known stable version used internally at Netflix (1.5.11). If members of the community are interested in taking ownership of Hystrix and moving it back into active mode, please reach out to [email protected].

Hystrix has served Netflix and the community well over the years, and the transition to maintenance mode is in no way an indication that the concepts and ideas from Hystrix are no longer valuable. On the contrary, Hystrix has inspired many great ideas and projects. We thank everyone at Netflix, and in the greater community, for all the contributions made to Hystrix over the years.

Introduction

Full Documentation

See the Wiki for full documentation, examples, operational details and other information.

See the Javadoc for the API.

Communication

Google Group: HystrixOSS
Twitter: @HystrixOSS
GitHub Issues

What does it do?

1) Latency and Fault Tolerance

Stop cascading failures. Fallbacks and graceful degradation. Fail fast and rapid recovery.

Thread and semaphore isolation with circuit breakers.

2) Realtime Operations

Realtime monitoring and configuration changes. Watch service and property changes take effect immediately as they spread across a fleet.

Be alerted, make decisions, effect change and see results in seconds.

3) Concurrency

Parallel execution. Concurrency aware request caching. Automated batching through request collapsing.

Hello World!

Code to be isolated is wrapped inside the run() method of a HystrixCommand similar to the following:

public class CommandHelloWorld extends HystrixCommand<String> {

    private final String name;

    public CommandHelloWorld(String name) {
        super(HystrixCommandGroupKey.Factory.asKey("ExampleGroup"));
        this.name = name;
    }

    @Override
    protected String run() {
        return "Hello " + name + "!";
    }
}

This command could be used like this:

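String s = new CommandHelloWorld("Bob").execute();           // synchronous, blocks for the result
Future<String> f = new CommandHelloWorld("Bob").queue();     // asynchronous, returns a Future
Observable<String> o = new CommandHelloWorld("Bob").observe(); // reactive, returns an Observable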

More examples and information can be found in the How To Use section.

Example source code can be found in the hystrix-examples module.

Binaries

Binaries and dependency information for Maven, Ivy, Gradle and others can be found at http://search.maven.org.

Change history and version numbers => CHANGELOG.md

Example for Maven:

<dependency>
    <groupId>com.netflix.hystrix</groupId>
    <artifactId>hystrix-core</artifactId>
    <version>x.y.z</version>
</dependency>

and for Ivy:

<dependency org="com.netflix.hystrix" name="hystrix-core" rev="x.y.z" />

If you need to download the jars instead of using a build system, create a Maven pom file like this with the desired version:

<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.netflix.hystrix.download</groupId>
	<artifactId>hystrix-download</artifactId>
	<version>1.0-SNAPSHOT</version>
	<name>Simple POM to download hystrix-core and dependencies</name>
	<url>http://github.com/Netflix/Hystrix</url>
	<dependencies>
		<dependency>
			<groupId>com.netflix.hystrix</groupId>
			<artifactId>hystrix-core</artifactId>
			<version>x.y.z</version>
			<scope/>
		</dependency>
	</dependencies>
</project>

Then execute:

mvn -f download-hystrix-pom.xml dependency:copy-dependencies

It will download hystrix-core-*.jar and its dependencies into ./target/dependency/.

You need Java 6 or later.

Build

To build:

$ git clone git@github.com:Netflix/Hystrix.git
$ cd Hystrix/
$ ./gradlew build

Further details on building can be found on the Getting Started page of the wiki.

Run Demo

To run a demo app do the following:

$ git clone git@github.com:Netflix/Hystrix.git
$ cd Hystrix/
$ ./gradlew runDemo

You will see output similar to the following:

Request => GetUserAccountCommand[SUCCESS][8ms], GetPaymentInformationCommand[SUCCESS][20ms], GetUserAccountCommand[SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][101ms], CreditCardCommand[SUCCESS][1075ms]
Request => GetUserAccountCommand[FAILURE, FALLBACK_SUCCESS][2ms], GetPaymentInformationCommand[SUCCESS][22ms], GetUserAccountCommand[FAILURE, FALLBACK_SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][130ms], CreditCardCommand[SUCCESS][1050ms]
Request => GetUserAccountCommand[FAILURE, FALLBACK_SUCCESS][4ms], GetPaymentInformationCommand[SUCCESS][19ms], GetUserAccountCommand[FAILURE, FALLBACK_SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][145ms], CreditCardCommand[SUCCESS][1301ms]
Request => GetUserAccountCommand[SUCCESS][4ms], GetPaymentInformationCommand[SUCCESS][11ms], GetUserAccountCommand[SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][93ms], CreditCardCommand[SUCCESS][1409ms]

#####################################################################################
# CreditCardCommand: Requests: 17 Errors: 0 (0%)   Mean: 1171 75th: 1391 90th: 1470 99th: 1486 
# GetOrderCommand: Requests: 21 Errors: 0 (0%)   Mean: 100 75th: 144 90th: 207 99th: 230 
# GetUserAccountCommand: Requests: 21 Errors: 4 (19%)   Mean: 8 75th: 11 90th: 46 99th: 51 
# GetPaymentInformationCommand: Requests: 21 Errors: 0 (0%)   Mean: 18 75th: 21 90th: 24 99th: 25 
#####################################################################################

Request => GetUserAccountCommand[SUCCESS][10ms], GetPaymentInformationCommand[SUCCESS][16ms], GetUserAccountCommand[SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][51ms], CreditCardCommand[SUCCESS][922ms]
Request => GetUserAccountCommand[SUCCESS][12ms], GetPaymentInformationCommand[SUCCESS][12ms], GetUserAccountCommand[SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][68ms], CreditCardCommand[SUCCESS][1257ms]
Request => GetUserAccountCommand[SUCCESS][10ms], GetPaymentInformationCommand[SUCCESS][11ms], GetUserAccountCommand[SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][78ms], CreditCardCommand[SUCCESS][1295ms]
Request => GetUserAccountCommand[FAILURE, FALLBACK_SUCCESS][6ms], GetPaymentInformationCommand[SUCCESS][11ms], GetUserAccountCommand[FAILURE, FALLBACK_SUCCESS, RESPONSE_FROM_CACHE][0ms]x2, GetOrderCommand[SUCCESS][153ms], CreditCardCommand[SUCCESS][1321ms]

This demo simulates 4 different HystrixCommand implementations with failures, latency, timeouts and duplicate calls in a multi-threaded environment.

It logs the results of HystrixRequestLog and metrics from HystrixCommandMetrics.

Dashboard

The hystrix-dashboard component of this project has been deprecated and moved to Netflix-Skunkworks/hystrix-dashboard. Please see the README there for more details including important security considerations.

Bugs and Feedback

For bugs, questions and discussions please use the GitHub Issues.

LICENSE

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.


hystrix's Issues

Remove Strategy Injection on HystrixCommand

As part of the opensourcing of Hystrix there was a refactoring of how plugin behavior worked and I ended up making a poor design decision that makes it easy to end up in a bad and non-obvious state.

Namely, when injecting strategies via HystrixCommand.Setter different instances with the same HystrixCommandKey could end up with different strategies.

This sounds "flexible" but becomes very non-obvious, especially when trying to report on metrics.

When showing a dashboard or monitoring something, people expect a command with a given key to have the same properties, metrics, concurrency strategy and so on, not different ones depending on which code path instantiated it.

Worse, the first one to instantiate will currently create a CircuitBreaker for a given HystrixCommandKey.

We don't want multiple circuit-breakers for the same CommandKey just because different strategies were injected in, yet right now that circuit-breaker will use the first properties strategy and thus become confusing if someone does inject multiple different strategies.

Thus, I'm going to make the change in 1.1 to disallow injecting strategies via the HystrixCommand.

This is a "breaking change" if anyone is doing this already but I'm hoping that I'm catching this before anyone has integrated down this path.

If it affects anyone I apologize for this design mistake.

Language Adaptor: Clojure

A hystrix-contrib module to provide idiomatic interfaces for Clojure.

Rolling Metric for Max Concurrent Requests

We already have this for thread executions but it would be good to have for the command in general (thread or semaphore isolated).

The value of this is in tracking down bursts. Even though we are generally tracking metrics at the second or subsecond level the concurrent bursts at the millisecond level can be missed and make it hard to determine where rejections are coming from.
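As a rough illustration of the idea, the sketch below tracks the peak number of in-flight executions between two reads. `MaxConcurrencyTracker` and its methods are hypothetical names, not part of Hystrix, and it uses a simple reset-on-read window rather than the bucketed rolling window Hystrix uses for its other metrics.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Hystrix: records the peak number of
// concurrent executions seen since the last reset.
class MaxConcurrencyTracker {
    private final AtomicInteger current = new AtomicInteger();
    private final AtomicInteger max = new AtomicInteger();

    // Call when an execution starts (thread or semaphore isolated).
    void start() {
        int now = current.incrementAndGet();
        // Raise the recorded peak if this start pushed concurrency past it.
        max.accumulateAndGet(now, Math::max);
    }

    // Call when an execution finishes, whatever the outcome.
    void finish() {
        current.decrementAndGet();
    }

    // Returns the peak for the window that just ended and starts a new
    // window at the current concurrency level. The reset is approximate
    // if executions start at the same instant.
    int getAndResetMax() {
        return max.getAndSet(current.get());
    }
}
```

A metrics poller could call `getAndResetMax()` once per reporting interval, so a millisecond-level burst still shows up even when the interval is a full second.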

Dashboard and Dollar sign

If a HystrixCommand-name (also GroupKey) contains a dollar sign ($) the Dashboard does not display the diagram correctly.

e.g.:
data: {"type":"HystrixCommand","name":"$DomInfo.isFreeFast" ..... }

IllegalStateException: Future Not Started

Originally was issue #80 but that got hijacked by a HystrixCollapser issue that at first was thought to be related or the same but wasn't.

We are seeing this exception in logs. It suggests a concurrency bug.

It's a very small volume of errors - 1 or 2 per hour out of 10s of thousands of executions per instance - but it's concerning that it is happening at all.

Logging was added in pull request #81 to help identify the issue.

Extend caching beyond a single thread

Would it make sense to extend request caching beyond the current thread?

In our use case all operations are performed in response to a state machine event, and the thread that executes each event will likely be different. It would be nice if there were some other way to define a RequestContext outside the initiating thread.

More commonly, we will have a scenario where multiple users will be looking up information relating to their conference, and it would be nice if that information could be cached.

TotalExecutionTime not tracked on queue()

The TotalExecutionTime is tracked on execute() but not on queue().

This is not a problem with the normal executionTime metric around the run() method, but the total end-to-end metric.

For queue() we want to track from the time queue() is invoked until the time the underlying thread completes and a Future.get() could retrieve the value even if it doesn't.

We don't want to track until Future.get() is invoked as that is not the actual processing time of the execution and can be variant depending on what the client code is doing.

This was a known missing feature but I'm marking this as a bug as it should have been there before marking the 1.0 release.
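The intended measurement can be sketched with a plain `ExecutorService`. This is not Hystrix code: `TimedQueue` and `lastTotalNanos` are hypothetical names, and a real implementation would feed a metrics object rather than a single field. The clock starts when `queue()` is called and stops when the worker thread finishes, independent of when (or whether) the caller invokes `Future.get()`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration, not Hystrix code.
class TimedQueue {
    // End-to-end time of the most recently completed task in nanoseconds;
    // -1 until a task has completed.
    final AtomicLong lastTotalNanos = new AtomicLong(-1);

    <T> Future<T> queue(ExecutorService pool, Callable<T> work) {
        // The clock starts when queue() is invoked, so time spent waiting
        // for a free thread is included.
        final long queuedAt = System.nanoTime();
        return pool.submit(() -> {
            try {
                return work.call();
            } finally {
                // The clock stops when the worker completes, not when the
                // caller eventually calls Future.get().
                lastTotalNanos.set(System.nanoTime() - queuedAt);
            }
        });
    }
}
```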

Feature Request: Pluggable Signature Generator/Verifier

Thanks for all the awesome work on Hystrix. I'm currently in the beginning stages of trying it out on a small subset of our Production use cases, and I'm looking forward to reaping the benefits!

However, I'm slightly paranoid and uncomfortable with exposing the streams out of each box just to the entire world (at the application-layer).

I'd like for there to be at least some prima facie way to ensure that a request for the streams comes from a trusted consumer.

An extension of some base class would be able to provide the information requested.

public class HystrixStreamRequestSigner {

    // Whether stream requests must carry a signature (off by default).
    public boolean requireSignature() { return false; }

    // Can assume that we'd possibly use the unix time as the input
    public String getSignatureOutput(final long input) { ... }

    public boolean verifySignature(final long input, final String signed) { ... }
}

(I may be able to contribute some example code if you're interested in seeing further.)

Code Generation for Existing Libraries

A potentially useful hystrix-contrib module would be a standalone utility that could scan public methods on a service class and generate HystrixCommand objects for each of them.

This is the single biggest complaint about adopting an existing library to use Hystrix.

Annotations and AOP have their own issues (https://github.com/Netflix/Hystrix/wiki/FAQ) so this is a possibility – nothing magical but it would reduce tedious labor.

Here is an example service class with a large number of methods that would be really obnoxious to do manually: http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/ec2/AmazonEC2Client.html

MaxConcurrentConnections reached

It would be useful if the dashboard would report if the max number of connections is reached. Currently it just shows as "loading...", whereas the http request is responding with "MaxConcurrentConnections reached: 5"

Mitchell

Property to disable fallbacks

Create a property to disable fallbacks so when run() fails the getFallback() method is not attempted.

There are some types of applications (such as offline batch compute) that would rather fail and retry later than get a fallback.

They may however have HystrixCommand implementations in their codebase (part of libraries, transitive dependencies etc) that have fallbacks implemented.

This property will allow turning off fallbacks for the entire app or for specific commands.

Dashboard: Threshold for Error Percentage Sorting

The dashboard allows sorting by Error Percentage and Error + Volume.

It can be a little jumpy on circuits with small error rates like 0.1, 0.2 that flip back and forth.

Also, sorting a low volume circuit with 0.1 error rate to the top is generally not useful.

We should pursue having a threshold and better handling of jittery circuits at the decimal level (maybe ignore decimals when sorting) so that the sorting is useful without being noisy and jittery.
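The proposed ordering could look something like the following sketch. The dashboard itself is not written in Java, so this only illustrates the logic; `CircuitRow`, `ErrorSort` and the `minRequests` threshold are hypothetical. Circuits below the volume threshold are treated as having no errors for sorting purposes, and the error percentage is truncated to a whole number so that 0.1 and 0.2 no longer swap places.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row of dashboard data.
class CircuitRow {
    final String name;
    final double errorPercentage;
    final long requestCount;

    CircuitRow(String name, double errorPercentage, long requestCount) {
        this.name = name;
        this.errorPercentage = errorPercentage;
        this.requestCount = requestCount;
    }
}

class ErrorSort {
    // Sorts by whole-number error percentage, highest first. Rows with
    // fewer than minRequests requests sort as if they had no errors.
    // Ties are broken by name so the order stays stable between refreshes.
    static List<CircuitRow> sort(List<CircuitRow> rows, long minRequests) {
        List<CircuitRow> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator
                .comparingInt((CircuitRow r) ->
                        r.requestCount >= minRequests ? (int) r.errorPercentage : 0)
                .reversed()
                .thenComparing(r -> r.name));
        return sorted;
    }
}
```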

Counter System: Stuck on it

Hi Ben,

Hystrix is an awesome platform! However, we are keen to know what counter system you are using to store Hystrix metrics. If it is in-house, is there a plan to open source it?

Why does the fallback value enter the cache?

Hello,

I'm really failing to see the usefulness of adding the fallback value in the cache!

In most cases, fallback value is crap. If it is something useful, then the primary-secondary pattern should be followed in order to provide a secondary useful value when the primary call fails.

Is there a way to disable this feature or even better invert its usage and provide a way when someone actually wants the fallback value to enter the cache?

Thread pools lost custom names in opensource refactoring

The threadpools lost their custom names which makes debugging thread-dumps difficult.

Bring back proper naming ...

Dashboard: Hover for full name (when shortened with ellipsis)

A recent change handles long names in the dashboard by putting an ellipsis in them.

To allow someone to see the actual name let's add the ability to hover and see a tooltip or something like that for the full name.

Count BadRequestExceptions thrown

Similar to counting exceptions thrown, a new metric for BadRequestExceptions thrown would be good, so that we can see the user impact of this exception.

Dashboard problem when using Turbine

When I use turbine as an aggregator I do not see command metrics. If I attach directly to each server I do see the command metrics.

Mitchell

HystrixRequestLog: Missing Events and Time on Timeouts

It appears that the request log is showing the time a fallback took instead of execution when a TIMEOUT occurs followed by FALLBACK.

It should show the execution time.

Should we show both the execution and fallback somehow?

Here is an example request log with a timeout occurring on a request with a lot of executions during a latency monkey simulation:

DeviceTypeServiceGetType[SUCCESS][0ms], IdentityCookieAuth[SUCCESS][0ms], SubscriberGetAccount[SUCCESS][5ms], ABCallServiceInternal[SUCCESS][87ms]x2, ABTestGetAllocationMap[SUCCESS][1ms]x2, PlaylistGet[SUCCESS][11ms], ViewingHistoryGetViewingHistory[SUCCESS][29ms], ABTestGetAllocationMap[SUCCESS, RESPONSE_FROM_CACHE][0ms]x306, PlaylistGet[Executed][0ms]x5, ViewingHistoryGetViewingHistory[Executed][0ms]x5, PlaylistGet[SUCCESS, RESPONSE_FROM_CACHE][0ms]x10, NDCMapReadCassandraPersister[SUCCESS][83ms]x2, NDCMapReadEVCachePersister[SUCCESS][79ms]x2, NDCMapReadBulkCassandraPersister[SUCCESS][29ms], NDCMapReadBulkEVCachePersister[SUCCESS][19ms], MapGetLists[TIMEOUT, FALLBACK_SUCCESS][0ms], MapGetLists[TIMEOUT, FALLBACK_SUCCESS, RESPONSE_FROM_CACHE][0ms]x5, VideoMetadataGetEpisodes[SUCCESS][0ms]x28, VideoMetadataGetEpisodes[Executed][0ms]x32, SubscriberGetAccount[SUCCESS, RESPONSE_FROM_CACHE][0ms]x260, QTGetQTVGenres[SUCCESS][0ms]x13, CinematchGetMovieRatings[SUCCESS][7ms], VideoMetadataGetEpisodes[SUCCESS, RESPONSE_FROM_CACHE][0ms]x10, ABTestGetAllocationsCommand[SUCCESS][78ms], VideoHistoryGetBookmarks[SUCCESS, COLLAPSED][179ms]x13, CinematchGetMovieRatings[SUCCESS, RESPONSE_FROM_CACHE][0ms]x18, CinematchGetPredictions[SUCCESS, COLLAPSED][95ms]x17, VideoMetadataGetEpisode[SUCCESS][0ms]x6, QTGetQTVGenres[SUCCESS, RESPONSE_FROM_CACHE][0ms]x6, MapODPGetConfig[TIMEOUT][0ms]

Property to disable percentile calculations

If someone does not want to use percentile calculations there is no point in tracking and calculating them so let's have a property to disable.

Execution Hooks via Plugin

Create an ExecutionHook strategy that can be implemented to do things such as:

- decorate a response
- conditionally throw exceptions, modify responses
- do other forms of logging or diagnostics
- be invoked ...
  - before run()
  - start of run()
  - end of run()
  - after run()
  - start of getFallback()
  - end of getFallback()
  - etc

CollapserMetrics

The HystrixCollapser received some new functionality in recent months that should be exposed in its own metrics but hasn't yet.

Let's expose a new metrics publisher similar to the ones for Command and ThreadPool.

Example metrics:

collapser executions
number of batches
number of collapsed requests per batch
number of responses from cache
number of shards

Question: shortCircuit contributes to error count?

Hey Ben, quick question. (Great lib, BTW!)

I see that when you register a short circuit event through the metrics object, it contributes to the errorPercentage (HystrixCommandMetrics.getHealthCounts), which is in turn used by the circuit breaker (HystrixCircuitBreaker.isOpen) to determine if the error rate is high enough to trip the circuit.

But while the circuit is tripped, every request is registering as a shortCircuit, and thus, as an error (for this health calculation). Doesn't that mean you're artificially inflating how many real "errors" are happening? I guess it doesn't really matter, because as soon as there's one success (via allowSingleTest), you reset the counter anyway, so it didn't matter (to the circuit breaker) what you were recording as metrics while it was tripped.

So from an external perspective, they're errors (the caller didn't get what they wanted), but from the Circuit Breaker's perspective they really might as well be ignored. Am I grokking that right?

Thanks!
Ian

Question: Cache key expiration control

Hi again,

I am trying to figure out how to set an expiration on the cache keys. Is there a way to plug in a custom cache implementation?

Thanks in advance.

Capture exception from run() and expose getter

If a HystrixCommand has a fallback implemented the exception is not thrown. Sometimes there are use cases where application code wants to inspect the exception even though there is a fallback.

It would be something like:

HystrixCommand.getExceptionIfThrown()

Break hystrix-clj out into its own project

I'm really excited to see the Clojure binding by @daveray in contrib. In fact, I'd love to try it out on my own projects. One suggestion I'd like to make, however, is that it be broken out into its own project. This would free it from what I imagine are the "single build tool" constraints that are likely driving the gradle => clojuresque => lein toolchain. As it stands, Leiningen is the de facto standard in the Clojure community and it's dead simple for someone to look at your project.clj, see your version number, group and artifact and add the dependency to their project (assuming you've published the jar to Maven Central). This would allow hystrix-clj to fit better within the Clojure ecosystem and allow developers to treat it the same way they treat every other library, something that may not seem like a big deal, but which definitely lowers the friction coefficient. I believe this could only help increase adoption. I also realize the bits aren't even dry yet, so the last thing I want to do is seem pushy. That being said, I figured I'd suggest this while you're actively working on it and hystrix-clj seems to have some momentum. Finally, thanks for releasing Hystrix!

Question about number and percentile buckets

We are building a Zabbix plugin to snapshot our Hystrix command metrics every minute. I am trying to figure out the best configuration for the following properties:

hystrix.command.default.metrics.rollingStats.timeInMilliseconds = 60000
hystrix.command.default.metrics.rollingStats.numBuckets = 60

hystrix.command.default.metrics.rollingPercentile.timeInMilliseconds = 60000
hystrix.command.default.metrics.rollingPercentile.numBuckets = 60

Should it be 60000/60, or am I miscalculating something? Should I make the window wider than 1 minute to make the metrics/counters more precise?
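The arithmetic behind the question can be written out as a small helper. `RollingWindow.bucketSizeMillis` is a hypothetical name used only for illustration; the assumption, worth checking against the configuration documentation for your version, is that the window must divide evenly into buckets. With the values above, each bucket covers 60000 / 60 = 1000 ms.

```java
// Hypothetical helper for illustration; Hystrix does its own validation internally.
class RollingWindow {
    // Returns the width of one bucket in milliseconds, rejecting windows
    // that do not divide evenly into the requested number of buckets.
    static int bucketSizeMillis(int timeInMilliseconds, int numBuckets) {
        if (numBuckets <= 0 || timeInMilliseconds % numBuckets != 0) {
            throw new IllegalArgumentException(
                    "timeInMilliseconds must divide evenly by numBuckets");
        }
        return timeInMilliseconds / numBuckets;
    }
}
```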

Use logger.error not logger.debug for fallback failure

Change the following line to use logger.error:
https://github.com/Netflix/Hystrix/blob/master/hystrix-core/src/main/java/com/netflix/hystrix/HystrixCommand.java#L1355

Even though the Throwable is captured inside HystrixRuntimeException most code handling the exception is not capable of inspecting it thus it becomes very hard in production operations to determine why a fallback fails.

It should be very rare that a fallback fails (the point of fallback is to do something that can't fail) but when it does it's difficult to track down right now.

HystrixPlugins Bootstrapping Problem - Race Conditions

Currently the HystrixPlugins implementation provides a single globally static mechanism for registering plugins.

It's simple and easy to grok but also has non-obvious issues.

This is a new concern for Hystrix as internally at Netflix there was only ever the "default behavior" but now anything can be customized.

If a system starts up, uses some HystrixCommands and then registers new plugins, the commands executed before will have already initialized metrics, thread-pools, metrics publishers, properties and other such things expected to live for the life of the JVM.

They are cached for a given HystrixCommandKey.

The simple answer is to just clear the caches and let everything initialize again.

That isn't quite so simple though because these aren't just objects that get garbage collected.

We need shutdown hooks then for thread-pools and metrics publishers to unpublish and gracefully cleanup - things that typically don't ever happen.

The possible approaches I've thought of are:

Bootstrap with JVM properties

If properties were injected from the command-line (ie -Dhystrix.plugin.???=???) then we could guarantee that all HystrixCommand executions initialize with the correct values on system startup.

Lifecycle Hooks

All plugin implementations (including the defaults) could have shutdown hooks that get executed when another plugin is registered, and the plugin implementation is responsible for tearing down any resources it allocated (thread-pools, metrics being published, etc).

IllegalStateExceptions to lock down

Allow "locking" the plugin config so that no further plugins can be registered and alter behavior.

If something tries it would get an IllegalStateException.

This would at least protect against unexpected race conditions or malicious overrides of behavior (a bad library for example).

Nothing has been done on this yet as none of the solutions are great so I'm thinking on it more and discussing with others.
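One way the "lock down" approach might look is sketched below. `LockablePluginRegistry` is a hypothetical class, not the HystrixPlugins API. It assumes that the first read of a plugin should pin whatever is in effect at that moment, so that a later registration fails loudly with an IllegalStateException instead of silently changing behavior for commands that have already initialized.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical registry for illustration; it is not the HystrixPlugins API.
class LockablePluginRegistry<T> {
    private final AtomicReference<T> plugin = new AtomicReference<>();

    // Registers a custom implementation. Fails if one was already
    // registered or if the default has already been handed out.
    void register(T impl) {
        if (!plugin.compareAndSet(null, impl)) {
            throw new IllegalStateException("Another plugin was already registered or used.");
        }
    }

    // Returns the registered implementation, or pins and returns the
    // (non-null) default if nothing was registered before first use.
    T getOrDefault(T defaultImpl) {
        plugin.compareAndSet(null, defaultImpl);
        return plugin.get();
    }
}
```

Because both paths go through a single compare-and-set, a registration that races with the first use either wins outright or is rejected; it cannot take effect halfway through startup.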

Is there a way to automatically pass thread static information to the Hystrix Command thread pool?

Hello,

I have three REST-based microservices, each belonging to a logical layer. A web app calls the service in the upper layer, the call goes down to the second layer and finally to the third, where the data is obtained and returned all the way back up. I need to log a common RequestId (a GUID) that identifies a single web request across all layers. The request id is created in the first service and passed as a header from each HTTP client to the next service, and from that service to its next client, using client and servlet filters and a Logback MDC instance. In each layer, if the servlet filter does not receive a RequestId header, it creates one and puts it in the MDC for the current thread. Whenever an HTTP client request is made, the client filter reads the request id from the MDC, creates a header, adds it to the request, and the chain goes on.

If I use a Hystrix Command as a wrapper around the actual HTTP client call, the chain breaks. Apparently, the threads created by a Hystrix Command are not "children" of the thread that called its constructor, and therefore the MDC values do not propagate.

Is there a solution to this problem? I read about the execution hooks. Can they offer a solution?

More info at http://logback.qos.ch/manual/mdc.html#autoMDC and http://logback.qos.ch/manual/mdc.html#managedThreads
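A common pattern for this kind of problem is to capture the caller's context when the task is created and reinstate it on the pooled thread. The sketch below uses a plain `ThreadLocal` as a stand-in for the Logback MDC; `RequestContext` and `ContextPropagation` are hypothetical names. Hystrix exposes a comparable extension point in `HystrixConcurrencyStrategy.wrapCallable`, which may be a suitable place to apply such a wrapper; check the Javadoc for your version.

```java
import java.util.concurrent.Callable;

// Stand-in for a logging MDC: a per-thread request id (hypothetical).
class RequestContext {
    static final ThreadLocal<String> REQUEST_ID = new ThreadLocal<>();
}

class ContextPropagation {
    // Captures the caller's request id now and reinstates it on whichever
    // thread eventually runs the task.
    static <T> Callable<T> wrap(Callable<T> task) {
        // Read on the calling thread, at wrap time.
        final String captured = RequestContext.REQUEST_ID.get();
        return () -> {
            String previous = RequestContext.REQUEST_ID.get();
            RequestContext.REQUEST_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the old value so a pooled thread does not leak
                // one request's id into the next task it runs.
                RequestContext.REQUEST_ID.set(previous);
            }
        };
    }
}
```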

ThreadPool stream should include reportingHosts

The Command stream includes reportingHosts but ThreadPool does not. It needs to for Turbine and the Dashboard to correctly count and display the number of hosts.

More Flexible Circuit-breaker Strategy

Right now the circuit-breaker implementation is fixed.

Expose a CircuitBreakerStrategy to allow custom implementations that can be plugged in.

Yammer Metrics Support

At Simple, we use Yammer's metrics pretty heavily and we're very interested in Hystrix, but would prefer to use the same metrics tools there that we do elsewhere in our services. It would be nice to be able to optionally provide other implementations to HystrixCommand.

I'm happy to do the work if this is something you're OK with.

Explore use of backup requests to handle long-tail latency

I would like to explore the use of backup requests and adding them to Hystrix as an optional tool for dealing with latency.

It has been discussed in several places (particularly referencing Google) such as these:

http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
http://highscalability.com/blog/2012/6/18/google-on-latency-tolerant-systems-making-a-predictable-whol.html
http://www.bailis.org/blog/doing-redundant-work-to-speed-up-distributed-queries/
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Stanford-DL-Nov-2010.pdf
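For readers unfamiliar with the technique, a backup (or "hedged") request can be sketched with `CompletableFuture`. This is only one possible shape, not a proposed Hystrix API: `BackupRequest.call` and its parameters are hypothetical. The primary attempt starts immediately, a second attempt starts only if no result has arrived after `backupAfterMillis`, and whichever attempt completes first decides the result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of a backup request; not part of Hystrix.
class BackupRequest {
    static <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> request,
                                         long backupAfterMillis,
                                         ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        // Primary attempt starts immediately.
        forward(request.get(), result);
        // Backup attempt starts only if nothing has completed by the deadline.
        scheduler.schedule(() -> {
            if (!result.isDone()) {
                forward(request.get(), result);
            }
        }, backupAfterMillis, TimeUnit.MILLISECONDS);
        return result;
    }

    // Copies the outcome of one attempt into the shared result. Later
    // completions are ignored because a CompletableFuture completes once.
    private static <T> void forward(CompletableFuture<T> attempt, CompletableFuture<T> result) {
        attempt.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else {
                result.completeExceptionally(error);
            }
        });
    }
}
```

In this sketch a fast failure of the primary attempt also ends the call, so the backup only helps with slow responses, not with errors. It also does not cancel the losing attempt, which means the backend sees extra load in exchange for lower tail latency.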

Default Collapser scope to REQUEST if using Setter

The default is set to REQUEST via the constructor, but if someone uses the Setter fluent interface it doesn't get defaulted.

Unable to configure Hystrix dashboard

Hi,

I am new to Hystrix and am trying to configure the Hystrix dashboard for a sample application that I am developing. I went through the documentation available in the GitHub Wiki and was able to configure the application properly in my local environment. When I hit the Hystrix stream URL "http://hostname:port/application/hystrix.stream", it returns the JSON data. But if I use the same URL in the Hystrix Dashboard, no data is displayed.

I traced the request and found that the link below fails without returning any response.

http://localhost:8090/hystrix-dashboard/proxy.stream?origin=http://localhost:8090/showcase/hystrix.stream

Error message:

EventSource's response has a MIME type ("text/plain") that is not "text/event-stream". Aborting the connection.
Connection was closed on error: [object Event]

I searched the internet for this issue but could not find a solution. In addition, when the Hystrix dashboard cannot establish the connection, it quickly uses up all the available connections (possibly the metrics poller) and displays a 503 error page.

Can you please help me with the problems described above? Thanks.

-Suresh

Asynchronous Executables

Support wrapping asynchronous clients and network access.

Concepts right now are HystrixSelector and/or HystrixFuture objects that could wrap asynchronous calls.

Thread isolation would obviously not make sense in these cases, but semaphore isolation to limit concurrent access, timeouts (such as an existing Future.get being wrapped), callback timeouts, circuit-breakers, metrics, monitoring, properties all make sense.

Mocking HystrixCommands

Hello,

I'm trying to unit test a REST resource that uses a client access library of HystrixCommands. To do so, I need to mock a HystrixCommand, and especially its execute() method. However, execute() is marked "final" and therefore cannot be mocked using EasyMock or Mockito. Is there a solution for this? Would you propose another way?
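One approach that avoids mocking the final method is to put a small seam between the resource and the command, so the test supplies a stub instead of a command. The sketch below uses hypothetical names (`GreetingResource`, `greetingCall`) and relies only on `java.util.function.Supplier`; it is a general testing pattern rather than anything Hystrix-specific.

```java
import java.util.function.Supplier;

// Hypothetical resource that depends on "something that produces the
// result" rather than on a concrete command class.
class GreetingResource {
    private final Supplier<String> greetingCall;

    GreetingResource(Supplier<String> greetingCall) {
        this.greetingCall = greetingCall;
    }

    String greet() {
        return "Greeting: " + greetingCall.get();
    }
}

// Production wiring would create a fresh command per call, for example:
//   new GreetingResource(() -> new CommandHelloWorld("Bob").execute());
// A unit test passes a stub and never touches the command:
//   new GreetingResource(() -> "Hello Bob!");
```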

Dashboard: Detail View

Add detail view (didn't make it from internal Netflix build to opensource version yet) that contains:

current properties and realtime view of when they change across cluster
other metrics that don't show on the summary
hooks into historical datasources for line graphs of various metrics
- need to determine the right way to provide this hook that is abstract yet simple to make work
per instance view to help isolating if an issue is happening on all instances or just some
- a prototype of a visualization for displaying a large cluster of instances in a single place is here: http://bl.ocks.org/1488375 (I played with it a full year ago and haven't yet gotten to implementing it)

Include more info when collapsed requests remain in queue

When a collapser is shut down and requests remain in the queue, a warning message is logged. Include more details about the remaining requests (class, etc.) to assist in debugging.

Memory leak in Tomcat

After stopping, undeploying or redeploying a Hystrix-powered webapp in Tomcat, we see errors such as the following:

SEVERE: The web application [/api-0.0.1-SNAPSHOT] appears to have started a thread named [hystrix-ProductsGroup-1] but has failed to stop it. This is very likely to create a memory leak.

Additionally, using JConsole we can see that the threads (for example hystrix-ProductsGroup-1) are still alive.

We tried to control thread destruction by setting withKeepAliveTimeMinutes as described here https://github.com/Netflix/Hystrix/wiki/Configuration , but found that the parameter is no longer available (1.2.5).

As a result, after a number of deploys, Tomcat ends up failing with a PermGen error.

Can Hystrix remove these threads? Is there an alternative way we can get rid of the old threads?

Thanks in advance.

Ensure Throwable/Error Propagation from Child Threads

If a Throwable (java.lang.Error for example) is thrown from a child thread (or NIO) we should ensure it is propagated up past Hystrix into the host application.
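To illustrate the concern with plain java.util.concurrent (this is not Hystrix's internal code): a Throwable thrown on a worker thread reaches the caller wrapped in an ExecutionException, so the caller has to unwrap it and rethrow an Error as an Error if it is to reach the host application. The class and method names below are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class ErrorPropagation {
    // Runs the task on a child thread and rethrows, on the calling thread,
    // whatever the task threw.
    static <T> T callAndPropagate(ExecutorService pool, Callable<T> task) throws Exception {
        Future<T> future = pool.submit(task);
        try {
            return future.get();
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof Error) {
                // e.g. StackOverflowError: rethrow unchanged so it is not
                // mistaken for an ordinary command failure.
                throw (Error) cause;
            }
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            // A bare Throwable subclass: keep the wrapper.
            throw e;
        }
    }
}
```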

Sharded command execution?

Explore adding the ability to model HystrixCommand executions after the backend service sharding strategy so that semaphores can be used to restrict concurrent access to any given shard instead of 1 bad shard saturating all available resources for that service.

Here's a use case to demonstrate why each shard can't just be another HystrixCommand semaphore/threadpool at the top level:

1 HystrixCommand using 5-10 threads with typical max concurrency of 2-3 threads
80+ shards on backend service
1 shard becoming latent can saturate the 5-10 threads and prevent any other shards from receiving traffic

Dynamically creating 80 different HystrixCommands is not a valid approach: the resulting metrics explosion and the operational overhead are not feasible.

However, we could allow something such as:

ShardKey getShard(argument) method that returns a key for a given argument
per shard semaphore that restricts concurrent access to each shard

In practice we could then have the above scenario configured such as:

10 threads
80 shards
limit of 2 concurrent per shard
up to 5 shards can become latent before the entire service is failed
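The proposal above could be sketched roughly as follows. This is only an illustration of the idea using java.util.concurrent.Semaphore, not an existing Hystrix API; ShardLimiter, its execute method and the modulo-based getShard are assumptions made for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;

class ShardLimiter {
    private final int maxConcurrentPerShard;
    private final ConcurrentMap<String, Semaphore> semaphores =
            new ConcurrentHashMap<String, Semaphore>();

    ShardLimiter(int maxConcurrentPerShard) {
        this.maxConcurrentPerShard = maxConcurrentPerShard;
    }

    // Hypothetical counterpart of getShard(argument): maps an argument to a shard key.
    static String getShard(int customerId, int shardCount) {
        return "shard-" + Math.floorMod(customerId, shardCount);
    }

    // Runs the call only if the shard has a free permit; otherwise returns the
    // fallback immediately instead of tying up another thread on a latent shard.
    <T> T execute(String shardKey, Callable<T> call, T fallback) throws Exception {
        Semaphore permits = semaphores.computeIfAbsent(
                shardKey, key -> new Semaphore(maxConcurrentPerShard));
        if (!permits.tryAcquire()) {
            return fallback;
        }
        try {
            return call.call();
        } finally {
            permits.release();
        }
    }
}
```

With new ShardLimiter(2) in front of a 10-thread pool, a single latent shard can hold at most 2 threads, which is where the figure of 5 latent shards before the whole service fails comes from (10 / 2).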

Dashboard: Dynamically Determine Scale of Circle Radius

Right now the circles are hardcoded to size themselves within a domain of 0-400rps.

They should instead dynamically set this domain to a floor/ceiling derived from the data being streamed -- or at least be easily configurable if that doesn't work well.

Dashboard: Filter by name and request volume

Add ability to UI to filter by circuit name and request volume.

For example, when 100 circuits are showing, filter out all with < 1rps.

RequestLog: Reduce Chance of Memory Leak

Need to evaluate if "hystrix.command.default.requestLog.enabled" should default to false instead of true.

It can potentially cause memory leaks if the request context is not set up correctly, so it may be better for it to be opt-in. Alternatively, we could:

find a way to only use it if we can determine the context is setup right
print warning if the size of the log exceeds some large value (1000+ ?)

Add "throws Exception" to HystrixCommand run() method

First of all, Hystrix is an incredible library; really enjoying my initial experimentation with it, and I was able to make use of it very quickly.

I wanted to suggest changing the declaration of the HystrixCommand run method from:

protected String run();

to:

protected String run() throws Exception;

Currently, calling any method inside the run implementation that throws a checked exception requires that the exception be caught and re-thrown as an unchecked exception so that Hystrix will detect a failure.

As an example, the following can throw an IOException, which produces a compile-time error within the HystrixCommand run method unless it is caught and re-thrown as an unchecked exception:

        URL url = new URL(targetURL);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
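For reference, the workaround described above looks roughly like this. StringCommand is a hypothetical stand-in for a command base class whose run() declares no checked exceptions (it is not the real HystrixCommand), and FetchStatusCommand is an invented example that wraps the two lines above.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical stand-in for a command base class whose run() cannot throw
// checked exceptions.
abstract class StringCommand {
    protected abstract String run();

    String execute() {
        return run();
    }
}

class FetchStatusCommand extends StringCommand {
    private final String targetURL;

    FetchStatusCommand(String targetURL) {
        this.targetURL = targetURL;
    }

    @Override
    protected String run() {
        try {
            URL url = new URL(targetURL);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            try {
                return String.valueOf(connection.getResponseCode());
            } finally {
                connection.disconnect();
            }
        } catch (IOException e) {
            // Boilerplate the proposal would remove: the checked exception has
            // to be wrapped so that it can leave run() and be seen as a failure.
            throw new RuntimeException(e);
        }
    }
}
```

If run() were declared with "throws Exception", the try/catch wrapper around the IOException would no longer be needed.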

HystrixCollapser: Response Not Received

Symptoms:

sometimes collapsed requests do not receive their response from the executing batch
collapsed requests that don't receive responses would block calling threads for 15+ seconds
latency in batch being executed

Causes:

the only cause that was reproducible (by using real production traffic) is high load average on a box causing the timer thread to become very latent in triggering batches (seconds, not just 10s or 100s of milliseconds).
various concurrency bugs that made the handling of this latency bad such as resulting in long running blocked threads

This issue was originally titled "IllegalStateException: Future Not Started" and was thought to be related to this HystrixCollapser problem. That turned out not to be the case, but by then the discussion had been taken over by the collapser problem, so the original bug was re-created as #113. The first 2 comments, including pull request #81, relate to #113 and not to this HystrixCollapser issue.

Hystrix ThreadPools are not shutdown, system hangs.

Hi,
We are having an issue with Hystrix where the thread pools are not shut down, so when our process tries to terminate it can't, because threads are still in a waiting state. According to the logs, 2 HystrixCommands ran, an exception was thrown in run(), and the fallback was executed. Our workaround is to keep track of the Hystrix thread pools and, when the main process terminates (we hook in via a listener), shut them all down via com.netflix.hystrix.HystrixThreadPool.Factory.getInstance(). Unfortunately the Factory class is package-private, which forces us to create our class in the com.netflix package. Can you suggest a better way of dealing with this issue, or provide a way to gracefully shut down all thread pools? We would settle for making com.netflix.hystrix.HystrixThreadPool.Factory public.
Here is the full thread dump:

2013-02-07 10:01:11
Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.6-b04 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000002dd1000 nid=0x31c2 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"DestroyJavaVM" prio=10 tid=0x00007f905db76000 nid=0x4ef0 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"hystrix-TaxCalculationGroupKey-1" prio=10 tid=0x0000000003a55800 nid=0x3486 waiting on condition [0x00007f9059952000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000079678a558> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
        at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:925)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

"hystrix-WarehouseSourcingGroupKey-1" prio=10 tid=0x0000000003245000 nid=0x3485 waiting on condition [0x00007f9059bc2000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000079678d948> (a java.util.concurrent.SynchronousQueue$TransferStack)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:458)
        at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
        at java.util.concurrent.SynchronousQueue.take(SynchronousQueue.java:925)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1043)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1103)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

"Java2D Disposer" daemon prio=10 tid=0x00007f90556a1800 nid=0x5c43 in Object.wait() [0x00007f905adee000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x000000078e2bf1b0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
        - locked <0x000000078e2bf1b0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
        at sun.java2d.Disposer.run(Disposer.java:145)
        at java.lang.Thread.run(Thread.java:722)

"MultiThreadedHttpConnectionManager cleanup" daemon prio=10 tid=0x00007f905d601000 nid=0x535d in Object.wait() [0x00007f90605dd000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000007867b8750> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
        - locked <0x00000007867b8750> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ReferenceQueueThread.run(MultiThreadedHttpConnectionManager.java:1122)

"ActiveMQ Scheduler" daemon prio=10 tid=0x00007f905d927800 nid=0x5359 in Object.wait() [0x00007f90609e1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.util.TimerThread.mainLoop(Timer.java:526)
        - locked <0x000000078aef0380> (a java.util.TaskQueue)
        at java.util.TimerThread.run(Timer.java:505)

"Service Thread" daemon prio=10 tid=0x0000000001e0e800 nid=0x4eff runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x0000000001e0c800 nid=0x4efe waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x0000000001e09800 nid=0x4efd waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x0000000001e07000 nid=0x4efc runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x0000000001dbb800 nid=0x4efb in Object.wait() [0x00007f9063932000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
        - locked <0x0000000780116298> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x0000000001db4000 nid=0x4efa in Object.wait() [0x00007f9063a33000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x0000000780115ca8> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=10 tid=0x0000000001dac000 nid=0x4ef9 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x0000000001d28800 nid=0x4ef1 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000001d2a800 nid=0x4ef2 runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000001d2c800 nid=0x4ef3 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000001d2e000 nid=0x4ef4 runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000001d30000 nid=0x4ef5 runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000001d32000 nid=0x4ef6 runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000001d33800 nid=0x4ef7 runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x0000000001d35800 nid=0x4ef8 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f905c013800 nid=0x4f00 waiting on condition

JNI global references: 234

Regards
Denis
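In the dump above, the two hystrix-* threads carry no "daemon" marker and are parked in ThreadPoolExecutor.getTask, which is consistent with idle non-daemon pool workers keeping the JVM alive. In plain java.util.concurrent terms there are two general remedies, sketched below: create the workers as daemon threads, or shut the pool down explicitly on exit. This is not the Hystrix API; PoolShutdown and its methods are hypothetical names used for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolShutdown {
    // Remedy 1: daemon workers. Idle daemon threads do not block JVM exit.
    static ThreadPoolExecutor newDaemonPool(final String groupName, int size) {
        final AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "demo-" + groupName + "-" + counter.incrementAndGet());
                t.setDaemon(true);
                return t;
            }
        };
        return new ThreadPoolExecutor(size, size, 1, TimeUnit.MINUTES,
                new SynchronousQueue<Runnable>(), factory);
    }

    // Remedy 2: explicit shutdown. Stops accepting work, waits for running
    // tasks, then interrupts whatever is still running after the timeout.
    static boolean shutdownGracefully(ExecutorService pool, long timeout, TimeUnit unit)
            throws InterruptedException {
        pool.shutdown();
        if (pool.awaitTermination(timeout, unit)) {
            return true;
        }
        pool.shutdownNow();
        return pool.awaitTermination(timeout, unit);
    }
}
```

A listener or JVM shutdown hook could call shutdownGracefully for each pool the application has created.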

hystrix.stream holds connection open if no metrics

If a server has not yet had any command executions to generate metrics the hystrix.stream servlet will loop infinitely waiting on metrics.

Since it keeps looping with no data it will never attempt writing to the outputstream.

Thus, it will never close the connection even if the client disconnects as it won't get an IOException or call close().

This will keep the servlet inside the try/catch and never count down the counter.
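One way to make a disconnected client detectable, sketched below with plain java.io and not taken from the hystrix.stream servlet code, is to write a small keep-alive on iterations that have no metrics, so that a closed connection surfaces as an IOException and the connection counter is released in a finally block. The names (MetricsStreamLoop, stream, connections) and the maxIterations bound are assumptions made so the example terminates; a line starting with ":" is a comment in the server-sent events format and is ignored by EventSource clients.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

class MetricsStreamLoop {
    // Returns the number of iterations completed before the client went away
    // or maxIterations was reached.
    static int stream(Queue<String> metrics, OutputStream out, AtomicInteger connections,
                      int maxIterations, long delayMs) throws InterruptedException {
        connections.incrementAndGet();
        int iterations = 0;
        try {
            while (iterations < maxIterations) {
                String json = metrics.poll();
                // Always write something, even with no data, so that a dead
                // connection is noticed on this iteration.
                String event = (json == null) ? ": ping\n\n" : "data: " + json + "\n\n";
                out.write(event.getBytes(StandardCharsets.UTF_8));
                out.flush();
                iterations++;
                Thread.sleep(delayMs);
            }
        } catch (IOException clientGone) {
            // The client disconnected: leave the loop.
        } finally {
            // Always count the connection back down.
            connections.decrementAndGet();
        }
        return iterations;
    }
}
```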
