intel-bigdata / hpnl
High Performance Network Library for RDMA
License: Apache License 2.0
The MLX provider of libfabric uses libucs, which causes a segmentation fault when running a Scala program.
In the pushSendBuffer call below, con should be passed as "*(jlong*)&con", the same form used in the regCon call and in the return statement.
`JNIEXPORT jlong JNICALL Java_com_intel_hpnl_core_RdmService_get_1con(JNIEnv *env, jobject obj, jstring ip_, jstring port_, jlong nativeHandle) {
  ExternalRdmService *service = *(ExternalRdmService**)&nativeHandle;
  const char *ip = (*env).GetStringUTFChars(ip_, 0);
  const char *port = (*env).GetStringUTFChars(port_, 0);
  RdmConnection *con = (RdmConnection*)service->get_con(ip, port);
  if (!con) {
    (*env).CallVoidMethod(obj, reallocBufferPool);
    con = (RdmConnection*)service->get_con(ip, port);
    if (!con) {
      return -1;
    }
  }
  (*env).CallVoidMethod(obj, regCon, *(jlong*)&con);
  std::vector<Chunk*> send_buffer = con->get_send_buffer();
  int chunks_size = send_buffer.size();
  for (int i = 0; i < chunks_size; i++) {
    // Reported problem: con is passed here as a raw pointer instead of a jlong.
    (*env).CallVoidMethod(obj, pushSendBuffer, con, send_buffer[i]->buffer_id);
  }
  return *(jlong*)&con;
}`
It has been over two years since the last update.
Is this project still alive? @zhouyuan @Jian-Zhang
When a CQ event arrives, we invoke the callback methods registered on the Java connection object. To determine which connection the callbacks belong to, we currently walk from cqservice -> eqservice -> connection pool -> connection via the connection id. To shorten the call stack and eliminate the lookup, we can cache the Java connection object in C++'s FIConnection during initialization. FIConnection is associated with the data chunk that is passed along with the event, so we can get the Java connection from FIConnection directly and make the method call. Note that the global reference must be deleted explicitly when the connection shuts down; otherwise the Java connection object can never be garbage-collected and memory leaks.
By the way, when returning a receive buffer, the data chunk can be obtained from the event, so there is no need to look it up from the external CQ service.
Here is sample code.
Cache the Java connection:
JNIEXPORT void JNICALL Java_test_jni_Connection_init(JNIEnv *env, jobject thisObj, jobject conn) {
  // A global reference stays valid after this native call returns.
  jobject globalConn = env->NewGlobalRef(conn);
  FIConnection *fiConn = new FIConnection();
  // Store the reference itself, not the address of the local variable.
  fiConn->set_context(globalConn);
  _set_self(env, thisObj, fiConn);
}
Reference the Java connection:
JNIEXPORT void JNICALL Java_test_jni_Connection_sayHello(JNIEnv *env, jobject thisObj) {
  FIConnection *fiConn = (FIConnection*)_get_self(env, thisObj);
  jmethodID mid = _get_callback_method_id(env);
  int v = 100;
  jobject conn = static_cast<jobject>(fiConn->get_context());
  env->CallIntMethod(conn, mid, v);
}
Delete the global reference in the shutdown method:
JNIEXPORT void JNICALL Java_test_jni_Connection_deleteGlobalRef(JNIEnv *env, jobject thisObj) {
  FIConnection *fiConn = (FIConnection*)_get_self(env, thisObj);
  // Release the reference that was cached in init.
  env->DeleteGlobalRef(static_cast<jobject>(fiConn->get_context()));
}
The EQ service needs a flag that records its status after the EqService::shutdown function has executed.
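One way such a status flag could look on the Java side is sketched below. `ServiceLifecycle` is a hypothetical stand-in, not an HPNL class, and the "release" step is only a counter. The sketch assumes the goal is to make shutdown idempotent and to let other operations reject calls after shutdown.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a service such as EqService; not HPNL's actual class.
final class ServiceLifecycle {
  private final AtomicBoolean stopped = new AtomicBoolean(false);
  private int releaseCount = 0; // how many times resources were "released"

  // Returns true only for the single call that actually performs the shutdown.
  boolean shutdown() {
    if (!stopped.compareAndSet(false, true)) {
      return false; // already shut down, skip a second release
    }
    releaseCount++; // placeholder for freeing native resources
    return true;
  }

  boolean isStopped() {
    return stopped.get();
  }

  // Operations check the flag before touching any native state.
  void connect() {
    if (stopped.get()) {
      throw new IllegalStateException("service is shut down");
    }
  }

  int releaseCount() {
    return releaseCount;
  }
}
```

With `compareAndSet`, two threads calling `shutdown()` concurrently cannot both pass the check, so the release step runs at most once.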
Currently, the path of the shared library is fixed to a system directory when the JVM loads it. This can be inconvenient, especially when deploying HPNL in a large cluster.
Fortunately, the library search path can be configured through the LD_LIBRARY_PATH environment variable, and the JVM honors it when libraries are loaded with System.loadLibrary instead of System.load. For example, copy the libfabric .so files and libhpnl.so to /usr/local/lib and run
export LD_LIBRARY_PATH=/usr/local/lib
before starting the JVM. Then change
System.load("/usr/local/lib/libhpnl.so")
to
System.loadLibrary("hpnl")
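A small helper can make a failed lookup easier to diagnose on a cluster node. The sketch below is illustrative only: `NativeLoader` and `tryLoad` are hypothetical names, and it assumes that reporting the effective `java.library.path` is useful (on Linux with HotSpot that property normally includes the LD_LIBRARY_PATH entries).

```java
// Hypothetical helper; not part of HPNL.
final class NativeLoader {
  // Returns null on success, or a diagnostic message if the library is not found.
  static String tryLoad(String libName) {
    try {
      System.loadLibrary(libName); // resolved through java.library.path
      return null;
    } catch (UnsatisfiedLinkError e) {
      return "cannot load " + System.mapLibraryName(libName)
          + " from java.library.path=" + System.getProperty("java.library.path");
    }
  }
}
```

A caller would invoke `NativeLoader.tryLoad("hpnl")` once at startup and log the returned message if it is not null.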
Take sendBuf as an example: the buffer parameter is not stored in fi_context2. Because fi_send is asynchronous, the memory behind the buffer pointer may be released after sendBuf returns, while the send is still in flight. The content referenced by the pointer may then differ from the content at the time sendBuf was called, so incorrect data is sent.
int RdmConnection::sendBuf(const char* buffer, int buffer_size)
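The underlying idea, keeping the payload alive until the completion for that send arrives, can be modeled in a few lines. The sketch below is a Java model for illustration only; the real fix would live in the native sendBuf path, and `PendingSends`, `submit`, and `complete` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of an in-flight send table; not HPNL code.
final class PendingSends {
  private final Map<Long, byte[]> inFlight = new HashMap<>();
  private long nextId = 0;

  // Copies the payload and keeps the copy until complete(id) is called.
  synchronized long submit(byte[] payload) {
    long id = nextId++;
    inFlight.put(id, payload.clone());
    return id;
  }

  // Called when the completion for this send arrives; releases the copy.
  synchronized byte[] complete(long id) {
    return inFlight.remove(id);
  }

  synchronized int inFlightCount() {
    return inFlight.size();
  }
}
```

Because the table owns its own copy, the caller may reuse or free its buffer immediately after `submit` without affecting the data that is still being sent.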
The driver and executors need to exit when the application is done. The JVM exits on its own only when all remaining threads are daemon threads. Neither EqThread nor CqThread is a daemon thread, which prevents the driver and executors from exiting.
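Marking a thread as a daemon is a single call on `java.lang.Thread`, made before the thread starts. The helper below is a minimal sketch; `DaemonThreads` is a hypothetical name, and how EqThread and CqThread are actually constructed in HPNL is not shown here.

```java
// Hypothetical helper; not part of HPNL.
final class DaemonThreads {
  // Creates a named thread that will not keep the JVM alive on its own.
  static Thread newDaemon(String name, Runnable body) {
    Thread t = new Thread(body, name);
    t.setDaemon(true); // must be called before start()
    return t;
  }
}
```

A thread class that extends `Thread` could equally call `setDaemon(true)` in its constructor.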
For each HPNL client there are two threads, one for the EQ and one for the CQ. When we shuffle in a large cluster, e.g. 1000 nodes, there could be 2000 threads on each node, which is too many.
For RPC this is fine, since each node mainly talks to the driver. Shuffle is different.
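One possible direction is to let a fixed number of poller threads serve all client connections, so the thread count no longer grows with the number of peers. The sketch below only models the assignment of connections to pollers. `PollerGroup` is a hypothetical name, and whether HPNL's EQ and CQ handling can be shared this way is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of connection-to-poller assignment; not HPNL code.
final class PollerGroup {
  private final List<List<Integer>> assigned = new ArrayList<>();
  private int next = 0;

  PollerGroup(int pollerCount) {
    if (pollerCount <= 0) {
      throw new IllegalArgumentException("pollerCount must be positive");
    }
    for (int i = 0; i < pollerCount; i++) {
      assigned.add(new ArrayList<>());
    }
  }

  // Assigns a connection to a poller round-robin and returns the poller index.
  synchronized int register(int connectionId) {
    int idx = next;
    next = (next + 1) % assigned.size();
    assigned.get(idx).add(connectionId);
    return idx;
  }

  synchronized int connectionsOn(int poller) {
    return assigned.get(poller).size();
  }

  int pollerCount() {
    return assigned.size();
  }
}
```

With two pollers and 1000 registered connections, each poller in this model serves 500 connections while the number of threads stays at two.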
In EqService.java, the number of connection attempts depends on the value of the "worker_num" parameter. But in the native code, ExternalEqService.cc, "worker_num" is hard-coded to 1. If worker_num > 1, connecting from EqService.java causes a native code error and crashes the JVM.
HPNL calls many Java methods from JNI. On each call, the native code first looks up the Java class and method ID and then invokes the method through the method ID. To improve performance, we can look up the method ID once and cache it for later use. Here is sample code.
Cache the method ID:
static jmethodID _get_callback_method_id(JNIEnv *env) {
  static int init = 0;
  static jmethodID callbackId;
  if (!init) {
    // Look up the method once; later calls reuse the cached ID.
    jclass jc = (*env).FindClass("test/jni/Connection");
    callbackId = (*env).GetMethodID(jc, "handleCallback", "(I)I");
    cout << "here" << endl;  // debug output, printed only on the first call
    init = 1;
  }
  return callbackId;
}
Reference the method ID:
JNIEXPORT void JNICALL Java_test_jni_Connection_sayHello(JNIEnv *env, jobject thisObj) {
  jmethodID mid = _get_callback_method_id(env);
  int v = 100;
  (*env).CallIntMethod(thisObj, mid, v);
}
Currently, any exception from callback execution or from the network can kill EqThread and CqThread, after which no thread polls events or executes callbacks. The threads should keep running for later events if the exception is non-fatal. The exception can be reported to a higher layer through a mechanism such as an error callback/handler.
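The loop structure being asked for can be sketched as follows. `ResilientLoop` and `runAll` are hypothetical names, events are modeled as plain `Runnable`s, and the sketch assumes that "non-fatal" maps to `RuntimeException` while `Error`s still propagate and stop the thread.

```java
import java.util.function.Consumer;

// Hypothetical model of an event loop with an error handler; not HPNL code.
final class ResilientLoop {
  // Runs every event; a failing event is reported and does not stop the loop.
  // Returns the number of events that completed without an exception.
  static int runAll(Iterable<Runnable> events, Consumer<Exception> onError) {
    int handled = 0;
    for (Runnable event : events) {
      try {
        event.run();
        handled++;
      } catch (RuntimeException e) {
        onError.accept(e); // report to the higher layer and continue
      }
    }
    return handled;
  }
}
```

The `onError` consumer plays the role of the error callback/handler, so the caller decides whether to log, retry, or close the affected connection.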