We believe there is a synchronization issue in the test_target_private.F90 test case that may lead to inconsistent results.
Below is a code snippet from that test case:
    !$omp parallel private(p_val, fp_val) shared(actualThreadCnt)
      fp_val = omp_get_thread_num() + 2
      p_val = omp_get_thread_num() + 1
      actualThreadCnt = omp_get_num_threads()
      !$omp target map(tofrom:compute_array) map(to:fp_val) private(p_val)
        p_val = fp_val - 1
        compute_array(p_val,:) = 100
        p_val = p_val + 99
      !$omp end target
      IF (p_val == omp_get_thread_num() + 1) THEN
        compute_array(p_val,:) = compute_array(p_val,:) + 1
      END IF
    !$omp end parallel
Here compute_array is mapped as tofrom for the target region, so every host thread created by the parallel region copies compute_array to and from the device. Each thread also updates the same array on the host after its own target region completes. Because the threads are not synchronized between the end of the target region and that host update, a copy back triggered by one thread can overwrite a host update already made by another, which produces incorrect results.
For example, if the parallel region has spawned two threads, t0 (thread ID 0) and t1 (thread ID 1), then:
for t0 - in target region, array is assigned as -> compute_array(1, :) = 100
for t1 - in target region, array is assigned as -> compute_array(2, :) = 100
Since there is no guarantee on the order of kernel completion, let's assume t1's kernel completes first.
for t1 - after target execution, array is updated as -> compute_array(2, :) = compute_array(2, :) + 1, which is 101
Now, if t0's kernel completes at this point, the tofrom mapping transfers t0's device copy of compute_array back to the host, overwriting the values updated by t1. Hence the check on compute_array will fail.
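A target construct without nowait blocks only the thread that encounters it; OpenMP does not imply a barrier across the threads of the enclosing parallel region at the end of a target construct. The test case therefore needs to be modified to add the required synchronization.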
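One possible way to add that synchronization is an explicit barrier between the end of the target region and the host-side update, so that no thread touches compute_array on the host until every thread's target region (including its copy back) has finished. The following is only a minimal sketch under stated assumptions, not necessarily the fix the test suite should adopt: the program name, the array dimensions (num_rows, num_cols), the num_threads clause, the atomic write on actualThreadCnt, and the final check are all made up for illustration and are not taken from test_target_private.F90.

```fortran
program target_private_barrier_sketch
  use omp_lib
  implicit none
  integer, parameter :: num_rows = 4   ! assumed thread count, not from the test
  integer, parameter :: num_cols = 16  ! assumed row length, not from the test
  integer :: compute_array(num_rows, num_cols)
  integer :: p_val, fp_val, actualThreadCnt, row, errors

  compute_array = 0
  actualThreadCnt = 0
  errors = 0

  !$omp parallel num_threads(num_rows) private(p_val, fp_val) shared(actualThreadCnt, compute_array)
    fp_val = omp_get_thread_num() + 2
    p_val = omp_get_thread_num() + 1
    ! Assumption: atomic write avoids concurrent plain stores to the shared counter.
    !$omp atomic write
    actualThreadCnt = omp_get_num_threads()

    !$omp target map(tofrom:compute_array) map(to:fp_val) private(p_val)
      p_val = fp_val - 1
      compute_array(p_val,:) = 100
      p_val = p_val + 99
    !$omp end target

    ! Added synchronization: every thread waits here until all threads
    ! have left their target region, so all copies back to the host are
    ! done before any host-side update of compute_array starts.
    !$omp barrier

    IF (p_val == omp_get_thread_num() + 1) THEN
      compute_array(p_val,:) = compute_array(p_val,:) + 1
    END IF
  !$omp end parallel

  ! Each row owned by a thread should now hold 100 (device) + 1 (host).
  do row = 1, actualThreadCnt
    if (any(compute_array(row,:) /= 101)) errors = errors + 1
  end do
  if (errors /= 0) error stop "compute_array check failed"
  print *, "compute_array check passed"
end program target_private_barrier_sketch
```

The sketch needs to be built with OpenMP enabled (and offloading enabled if the target region should actually run on a device). The barrier is placed outside the IF block so that every thread in the team encounters it.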