
hpnl's People

Contributors

haodongt, jiafuzha, sfblackl-intel, svalat, tanghaodong25

hpnl's Issues

wrong parameter when pushSendBuffer

In the pushSendBuffer call below, `con` should be passed as `*(jlong*)&con`, the same way it is passed to regCon and returned at the end, rather than as the raw pointer.

```cpp
JNIEXPORT jlong JNICALL Java_com_intel_hpnl_core_RdmService_get_1con(JNIEnv *env, jobject obj, jstring ip_, jstring port_, jlong nativeHandle) {
  // Recover the native service pointer stored in the Java long handle.
  ExternalRdmService *service = *(ExternalRdmService**)&nativeHandle;
  const char *ip = (*env).GetStringUTFChars(ip_, 0);
  const char *port = (*env).GetStringUTFChars(port_, 0);
  RdmConnection *con = (RdmConnection*)service->get_con(ip, port);
  if (!con) {
    // No connection available: let Java grow the buffer pool, then retry once.
    (*env).CallVoidMethod(obj, reallocBufferPool);
    con = (RdmConnection*)service->get_con(ip, port);
    if (!con) {
      return -1;
    }
  }
  (*env).CallVoidMethod(obj, regCon, *(jlong*)&con);

  std::vector<Chunk*> send_buffer = con->get_send_buffer();
  int chunks_size = send_buffer.size();
  for (int i = 0; i < chunks_size; i++) {
    // Bug: the raw pointer `con` is passed here; it should be *(jlong*)&con as above.
    (*env).CallVoidMethod(obj, pushSendBuffer, con, send_buffer[i]->buffer_id);
  }
  return *(jlong*)&con;
}
```

Cache and reference java connection

When a CQ event arrives, we invoke the callback methods registered in the Java connection object. To determine which connection the callbacks belong to, we currently walk from cqservice -> eqservice -> connection pool -> connection via the connection id. To shorten the call stack and eliminate this lookup, we can cache the Java connection object in the C++ FIConnection during initialization. Because the FIConnection is associated with the data chunk that is passed along with the event, we can get the Java connection from the FIConnection directly and call its methods. Note that the global reference must be deleted explicitly when the connection shuts down; otherwise, memory leaks.
Similarly, when returning a receive buffer, the data chunk can be taken from the event, so there is no need to look it up in the external CQ service.
Here is sample code.
cache Java connection
```cpp
JNIEXPORT void JNICALL Java_test_jni_Connection_init(JNIEnv *env, jobject thisObj, jobject conn){
  // Promote the local reference so it stays valid after this native call returns.
  jobject globalConn = env->NewGlobalRef(conn);
  FIConnection *fiConn = new FIConnection();
  // Store the reference itself, not the address of the local variable globalConn.
  fiConn->set_context(globalConn);
  _set_self(env, thisObj, fiConn);
}
```
reference Java connection
```cpp
JNIEXPORT void JNICALL Java_test_jni_Connection_sayHello(JNIEnv *env, jobject thisObj){
  FIConnection *fiConn = (FIConnection*)_get_self(env, thisObj);
  jmethodID mid = _get_callback_method_id(env);
  int v = 100;
  // The context holds the global reference cached in init.
  jobject conn = static_cast<jobject>(fiConn->get_context());
  env->CallIntMethod(conn, mid, v);
}
```
delete global reference in shutdown method
```cpp
JNIEXPORT void JNICALL Java_test_jni_Connection_deleteGlobalRef(JNIEnv *env, jobject thisObj){
  FIConnection *fiConn = (FIConnection*)_get_self(env, thisObj);
  // Release the reference cached in init, then clear it so it cannot be used again.
  env->DeleteGlobalRef(static_cast<jobject>(fiConn->get_context()));
  fiConn->set_context(nullptr);
}
```
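The dispatch path described in this issue (chunk -> FIConnection -> cached Java connection, with no connection-id lookup) can be modeled without JNI. This is a sketch under assumptions: `JavaConnStub`, `FIConnectionSketch`, `ChunkSketch`, `on_completion` and `shutdown_connection` are hypothetical names, the `void*` context stands in for the jobject global reference, and clearing it at shutdown corresponds to the DeleteGlobalRef step.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the Java connection object held by a global reference.
struct JavaConnStub {
  std::string name;
  std::vector<int> received;  // records callback arguments for inspection
  void handle_callback(int v) { received.push_back(v); }
};

// Sketch of an FIConnection that caches an opaque context at init time.
struct FIConnectionSketch {
  void *context = nullptr;
  void set_context(void *ctx) { context = ctx; }
  void *get_context() const { return context; }
};

// Each data chunk carries a back-pointer to its connection.
struct ChunkSketch {
  FIConnectionSketch *con;
  int buffer_id;
};

// Completion event: go straight from the chunk to the cached Java connection.
void on_completion(const ChunkSketch &chunk) {
  auto *java_conn = static_cast<JavaConnStub *>(chunk.con->get_context());
  if (java_conn != nullptr) {  // context is cleared at shutdown
    java_conn->handle_callback(chunk.buffer_id);
  }
}

// Shutdown: drop the cached context (in JNI, right after DeleteGlobalRef).
void shutdown_connection(FIConnectionSketch &con) { con.set_context(nullptr); }
```

Events for different connections reach the right object purely through the chunk's back-pointer, and events that arrive after shutdown are ignored instead of touching a released reference.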

Java API: shutdown issue

A flag is needed to mark the EQ service's status once the EqService::shutdown function has executed.
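A minimal sketch of such a flag, assuming shutdown may be requested from more than one thread: a `std::atomic<bool>` flipped with `compare_exchange_strong` makes shutdown idempotent, and other operations can check it before touching released resources. `EqServiceSketch` and its `connect` method are hypothetical and are not HPNL's EqService API.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical EQ service that records whether shutdown() has already run.
class EqServiceSketch {
 public:
  // Returns true only for the call that actually performed the shutdown.
  bool shutdown() {
    bool expected = false;
    if (!shut_down_.compare_exchange_strong(expected, true)) {
      return false;  // already shut down: nothing to release twice
    }
    // ... release EQ resources here ...
    return true;
  }

  bool is_shutdown() const { return shut_down_.load(); }

  // Example operation guarded by the flag; returns -1 once shut down.
  int connect() {
    if (is_shutdown()) return -1;
    return 0;  // placeholder for a real connection attempt
  }

 private:
  std::atomic<bool> shut_down_{false};
};
```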

make library loading configurable

For now, the path of the shared library is fixed to a system directory when the JVM loads it. This can be inconvenient in some cases, especially when deploying HPNL in a large cluster.

Fortunately, the library search path can be configured with the environment variable LD_LIBRARY_PATH, and the JVM uses it too when libraries are loaded with System.loadLibrary instead of System.load. For example, instead of
System.load("/usr/local/lib/libhpnl.so")
run "export LD_LIBRARY_PATH=/usr/local/lib" before starting the JVM, copy the libfabric.so files and libhpnl.so to this folder, and change the code to
System.loadLibrary("hpnl")

Content sent via sendBuf/sendBufTo is not correct

Take sendBuf as an example: the buffer parameter is not stored in the fi_context2. Because fi_send is asynchronous, the memory behind the buffer pointer may be released after sendBuf returns, even though the pointer has already been passed to fi_send. The content referenced by the pointer when the send actually happens may therefore differ from the content at the time sendBuf was called, so incorrect data is sent.

int RdmConnection::sendBuf(const char* buffer, int buffer_size)
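One possible fix, sketched under assumptions: copy the caller's bytes into a per-send context that stays alive until the completion event, so the caller may reuse or free its buffer as soon as sendBuf returns. In real code the context would embed the fi_context2 passed to fi_send and the copy would need to live in registered memory; here `SendContext`, `FakeTransport` and `send_buf` are hypothetical, and the asynchronous send is simulated with a queue that is drained later.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <string>
#include <vector>

// Hypothetical per-send context that owns a copy of the payload.
struct SendContext {
  std::vector<char> payload;
};

// Simulated asynchronous transport: send() only queues the request and the
// bytes "go out" later in complete_one(), like fi_send followed by a CQ event.
class FakeTransport {
 public:
  void send(std::unique_ptr<SendContext> ctx) { pending_.push_back(std::move(ctx)); }

  // Completes the oldest pending send and returns the bytes that were sent.
  // The context, and its payload copy, is destroyed when this function returns.
  std::string complete_one() {
    std::unique_ptr<SendContext> ctx = std::move(pending_.front());
    pending_.pop_front();
    return std::string(ctx->payload.begin(), ctx->payload.end());
  }

 private:
  std::deque<std::unique_ptr<SendContext>> pending_;
};

// sendBuf-style entry point: copy the caller's bytes before returning.
void send_buf(FakeTransport &t, const char *buffer, int buffer_size) {
  auto ctx = std::make_unique<SendContext>();
  ctx->payload.assign(buffer, buffer + buffer_size);
  t.send(std::move(ctx));
}
```

The copy costs one memcpy per send; the alternative is to keep the caller's buffer itself referenced from the context and release it only on completion, which avoids the copy but ties the buffer's lifetime to the network.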

Make EqThread and CqThread as daemon thread

The driver and executors need to exit when the application is done, but the JVM exits only when all remaining threads are daemon threads. Neither EqThread nor CqThread is a daemon thread, so they prevent the driver and executors from exiting.

One task per event instead of long-running thread for all events

Each HPNL client has two threads, one for the EQ and one for the CQ. When we run a shuffle in a large cluster, e.g. 1,000 nodes, each node could end up with 2,000 threads, which is too many.

For RPC this is acceptable, since each node mainly talks to the driver. Shuffle is different, because every node talks to many other nodes.
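A rough sketch of the alternative, assuming events can be handled independently of one another: one fixed-size pool per node runs one task per event, so the thread count no longer grows with the number of peers. `EventPool` is a hypothetical class; it only shows how event handling is scheduled, not how events are polled from libfabric.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical fixed-size pool shared by all connections on a node.
class EventPool {
 public:
  explicit EventPool(std::size_t n_threads) {
    for (std::size_t i = 0; i < n_threads; ++i) {
      workers_.emplace_back([this] { run(); });
    }
  }

  // Stops accepting work, lets queued tasks finish, then joins the workers.
  ~EventPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto &w : workers_) w.join();
  }

  // Submit the handling of one EQ/CQ event as a task.
  void submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  std::size_t thread_count() const { return workers_.size(); }

 private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (stopping_ && tasks_.empty()) return;
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run the event handler outside the lock
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stopping_ = false;
};
```

For example, 50 simulated clients submitting 20 events each are served by 4 worker threads in total, instead of 100 dedicated polling threads.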

cache method ID in JNI

HPNL calls many Java methods from JNI. On each call, the JNI code first gets the Java class and the method ID, and then calls the actual method by that ID. To improve performance, we can look up the method ID on the first call and cache it for later use. Here is the sample code.
cache method id
```cpp
static jmethodID _get_callback_method_id(JNIEnv *env){
  static int init = 0;
  static jmethodID callbackId;
  if(!init){
    // First call only: resolve the class and method, then reuse the ID.
    jclass jc = (*env).FindClass("test/jni/Connection");
    callbackId = (*env).GetMethodID(jc, "handleCallback", "(I)I");
    init = 1;
  }
  return callbackId;
}
```
reference method id
```cpp
JNIEXPORT void JNICALL Java_test_jni_Connection_sayHello(JNIEnv *env, jobject thisObj){
  jmethodID mid = _get_callback_method_id(env);
  int v = 100;
  (*env).CallIntMethod(thisObj, mid, v);
}
```
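The `static int init` flag in the sample is an unsynchronized check-then-set, so two threads making the first call at the same time race on it. A minimal sketch of an alternative, assuming C++11 or later: let a function-local static hold the cached ID, since the language guarantees its initializer runs exactly once even under concurrent first calls. `expensive_lookup` is a hypothetical stand-in for FindClass plus GetMethodID; it returns an `int` instead of a `jmethodID` so the sketch compiles without jni.h.

```cpp
#include <cassert>

// Counts how often the expensive lookup actually runs.
static int g_lookup_calls = 0;

// Hypothetical stand-in for FindClass + GetMethodID.
static int expensive_lookup() {
  ++g_lookup_calls;
  return 7;  // pretend method ID
}

// Cached accessor: the static is initialized once; concurrent first callers
// wait for that initialization, so no hand-written init flag is needed.
static int cached_method_id() {
  static const int id = expensive_lookup();
  return id;
}
```

In the JNI version, the same shape would be a `static const jmethodID` initialized by a helper that calls FindClass and GetMethodID. Only the jmethodID is worth caching this way: the jclass returned by FindClass is a local reference, valid only during the current native call unless it is promoted with NewGlobalRef.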

Prevent EqThread and CqThread from Dying when Non-fatal Exception Occurs

Currently, any exception thrown by a callback or by the network layer can kill EqThread and CqThread, after which no thread polls for events or executes callbacks. If an exception is non-fatal, the threads should keep running to handle later events, and the exception can be reported to a higher layer through a mechanism such as an error callback or handler.
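A minimal sketch of the polling loop body, assuming fatal and non-fatal errors can be told apart by exception type. EqThread and CqThread are Java threads, so this C++ version only illustrates the pattern; `FatalError`, `poll_events` and the `on_error` handler are hypothetical names.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type for errors that must stop the polling thread.
struct FatalError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Runs the callback for each polled event. A non-fatal exception is reported
// through on_error and polling continues; a FatalError still propagates.
// Returns the number of events whose callback completed normally.
int poll_events(const std::vector<int> &events,
                const std::function<void(int)> &callback,
                const std::function<void(int, const std::string &)> &on_error) {
  int ok = 0;
  for (int ev : events) {
    try {
      callback(ev);
      ++ok;
    } catch (const FatalError &) {
      throw;  // fatal: let the thread stop
    } catch (const std::exception &e) {
      on_error(ev, e.what());  // non-fatal: notify the higher layer, keep polling
    }
  }
  return ok;
}
```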
