Coder Social home page Coder Social logo

Comments (23)

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln Thanks for raising this.

There are multiple things going on here that do not add up. It may help if you can send the whole log to me privately.

Issue 1. How are you generating the tokens for the POST and subsequent GET request?
The first byte only appears to be random.
I would recommend using something like coap_session_new_token(3).

Issue 2. The retransmit queue timeout times are not expected - e.g. 346ms, 11094ms and 250ms. I would be expecting these to be somewhere between 2000ms and 3000ms (ACK_TIMEOUT to ACK_TIMEOUT * ACK_RAMDOM_FACTOR as per RFC7252 Section 4.2 for the first transmit of a CON pdu.

Are you modifying the ACK_TIMEOUT and/or ACK_RANDOM_FACTOR values ?

What OS are you using?

Does your host platform have a stable clock source?

Issue 3. You report a response with mid of 0xc451, but have not provided the request that uses that mid. This response has a code of 2.01, so it is not a response to a GET.

What was the request using mid 0xc451?

Is this a broken server responding to request with mid 0xc44f with a mid of 0xc451?

Issue 4. Where is the response to GET request with mid 0xc450 - is that missing as well?

Other questions:

Have you tried running your code using valgrind or similar to see if you are getting random memory corruptions?

Have you tried running your client code against the libcoap examples coap-server to see if that highlights anything?

Notes:
lg_xmit 200a0d30 will only get released once all the large blocked data has be successfully sent.
lg_crcv 200a0f40 gets deleted after an idle timeout, unless PR #948 is included in the code.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln Do you have any further information to help get this resolved?

from libcoap.

jqiaobln avatar jqiaobln commented on August 17, 2024
  1. I use a random generator which seeds from device's time.
  2. The ACK_TIMEOUT is changed to 10 seconds
  3. FreeRTOS
  4. The device has stable time source
  5. The log file is missing the request for mid 0xc451. I included this response to show that device had network connection at that moment.
  6. The response for 0xc450 is missing.
  7. Cannot run valgrind on the current firmware.
  8. Our client code is highly customized; hard to compare with libcoap examples.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln Thanks for this. If you could send me a copy of the full logs (at log level 7) privately, that would be helpful. Email address is in, for example, include/coap3/block.h as the second copyright entry. Do you know what version of CoAP code the server is running?

from libcoap.

jqiaobln avatar jqiaobln commented on August 17, 2024

Email sent.

The server is using Eclipse Californium.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

Have you changed RFC7252 NSTART to 2 from 1?

from libcoap.

jqiaobln avatar jqiaobln commented on August 17, 2024

Yes

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

Ok. Just trying to work out deviations from standard default values, but NSTART should not be an issue here.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

Thanks for the full logs. I will use abstracts here.

**  <->  UDP : mid=0xc447: added to retransmit queue (12031ms)
**  <->  UDP : mid=0xc447: removed 1
**  <->  UDP : mid=0xc448: added to retransmit queue (12969ms)
**  <->  UDP : mid=0xc449: added to retransmit queue (1011ms)
**  <->  UDP : mid=0xc44a: delayed
**  <->  UDP : mid=0xc44b: delayed
**  <->  UDP : mid=0xc448: removed 1
**  <->  UDP : mid=0xc44a: transmitted after delay
**  <->  UDP : mid=0xc44a: added to retransmit queue (1473ms)
**  <->  UDP : mid=0xc44c: delayed
**  <->  UDP : mid=0xc449: removed 1
**  <->  UDP : mid=0xc44b: transmitted after delay
**  <->  UDP : mid=0xc44b: added to retransmit queue (14064ms)
**  <->  UDP : mid=0xc44a: removed 2
**  <->  UDP : mid=0xc44c: transmitted after delay
**  <->  UDP : mid=0xc44c: added to retransmit queue (13022ms)
**  <->  UDP : mid=0xc44b: removed 2
**  <->  UDP : mid=0xc44c: removed 1
**  <->  UDP : mid=0xc44d: added to retransmit queue (12500ms)
**  <->  UDP : mid=0xc44e: added to retransmit queue (346ms)
**  <->  UDP : mid=0xc44d: removed 1
**  <->  UDP : mid=0xc44e: removed 1
**  <->  UDP : mid=0xc44f: added to retransmit queue (11094ms)
**  <->  UDP : mid=0xc450: added to retransmit queue (250ms)
**  <->  UDP : mid=0xc451: removed 2
**  <->  UDP : mid=0xc453: added to retransmit queue (13282ms)
**  <->  UDP : mid=0xc452: removed 2
**  <->  UDP : mid=0xc453: removed 1
**  <->  UDP : mid=0xc454: added to retransmit queue (12344ms)

Several things to note here.

As NSTART > 1, the added to retransmit queue entries of less than 10000ms are reporting the time to transmit after the previous entry has expired. I think that the number reported should be the actual delay time for that specific entry and will create a fix.

The difference between removed 1 and removed 2 is that the queued entry is not the first in the queue for removed 2.

Each PDU created for transmission gets a unique message id - in the case of libcoap, this is monotonically incrementing, done by coap_new_message_id() in the coap_new_pdu() function. As per coap_pdu_init(3) (if you are using this as an alternative) which is used by coap_new_pdu(), you should be using coap_new_message_id() to generate the correct message id.

Looking at your log output, there is no creation for mid=0xc451 (added or delayed). This could possibly be because you are using NON, but the response packet is ACK, so it was not a NON request. The same is true for mid=0xc452. Just a response in both cases, which appears to be against a POST (response code 2.01).

v:1 t:CON c:GET i:c44d {dd2300d0} [ Uri-Host:B, Block2:0/_/1024 ]
PDU presented by app.
v:1 t:CON c:POST i:c44e {fb2300d0} [ Uri-Host:B ] :: binary data length 1488
v:1 t:CON c:POST i:c44e {fb2300d0} [ Uri-Host:B, Block1:0/M/1024, Size1:1488, Request-Tag:0xacf445e6 ] :: binary data length 1024
v:1 t:ACK c:2.05 i:c44d {dd2300d0} [ Content-Format:application/octet-stream, Size1:200 ] :: binary data length 200
v:1 t:ACK c:2.31 i:c44e {fb2300d0} [ Block1:0/M/1024 ]
v:1 t:CON c:POST i:c44f {020000dd2300d2} [ Uri-Host:B, Block1:1/_/1024, Size1:1488, Request-Tag:0xacf445e6 ] :: binary data length 464
v:1 t:CON c:GET i:c450 {c72700d0} [ Uri-Host:B, Block2:0/_/1024 ]
v:1 t:ACK c:2.01 i:c451 {5c2800d0} [ Content-Format:text/plain ] :: 'Successfully written'
v:1 t:CON c:GET i:c453 {bb2900d0} [ Uri-Host:B, Block2:0/_/1024 ]
v:1 t:ACK c:2.01 i:c452 {0f290000} [ Content-Format:text/plain ] :: 'Successfully written'
v:1 t:ACK c:2.05 i:c453 {bb2900d0} [ Content-Format:application/octet-stream, Size1:200 ] :: binary data length 200

Also for mid c451 and c452 the tokens in the response do not match any request. This implies to me for whatever reason the the 2 POST requests have not been logged - but somehow coap_new_message_id() has been called as otherwise all the mids are in sequence.

Are you using the same CoAP session when talking to hosts A and B?

Is your client using libcoap as a single threaded, not multi-threaded (which is not safe) library?

Can you send me privately the changes you made to libcoap, as well as the copy of coap-config.h you are using so I can try to track down why there is no logging (and possibly processing) for mid c451 and c452? Thanks.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

Are you able to shed any light on the entry Total: 255 bytes (with a short amount of actual data) and what it likely to be? It is at the point where there is the 2 missing mids.

v:1 t:CON c:GET i:c450 {c72700d0} [ Uri-Host:B, Block2:0/_/1024 ]
**  <->  UDP : mid=0xc450: added to retransmit queue (250ms)
2022-10-17 14:35:25 CoapManager> Sent CoAP request: 0xC450
2022-10-17 14:35:25 DTLSManager> network send 162 bytes
17FEFD000100000000000A0095000100000000000AD511144B2B072735253C808E5F85EA8467F7A500EB7C998DE5A13875B455F9AB8889E9A2F6AE1DDB73E3F4EF54856AFB8658C7FA6BBA36857753DB5E191918277020A65F04B0503019C0DD6E3DD8CD
Total: 162 bytes
C5E3848C9FD1CAEA82                                        <--------------------------
Total: 255 bytes                                           <--------------------------
2022-10-17 14:35:25 DTLSManager> network recv 67 bytes
17FEFD000100000000000C0036000100000000000C243B9F194A99733303F00494D1065476A5321CC4D027DCB39599DBD481C33C49F7887827DBB97D2605C8C602430A
Total: 67 bytes
*   <->  UDP : received 30 bytes
v:1 t:ACK c:2.01 i:c451 {5c2800d0} [ Content-Format:text/plain ] :: 'Successfully written'
**  <->  UDP : mid=0xc451: removed 2
2022-10-17 14:35:25 CoapManager> Response: 2.1
2022-10-17 14:35:26 Received message = 3
*   <->  UDP : sent 125 bytes
v:1 t:CON c:GET i:c453 {bb2900d0} [ Uri-Host:B, Block2:0/_/1024 ]

from libcoap.

jqiaobln avatar jqiaobln commented on August 17, 2024

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

The output can be stressed sometimes and miss a few lines.

Yes, it would be good to see the server side logs if you still have them for this issue - then I can fill in some of the missing output to see what the code is not correctly detecting the end of the transmission of a large body of data.

I don;t think it is an issue here, but you appear to be randomizing just one byte of the token which has the potential to occasional have duplication of token for 2 requests and then get responses matching the wrong token (yes, mid also does de-duplication, but a non-piggybacked (separate) response will have different mids, but the same token.)

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

As your client application is dropping some logging lines - is it possible that it is also dropping the lg_xmit deletion entry (and possibly the lg_crcv deletion entry)?

It may be worth trying out out PR #948 (waiting to be merged) as this cleans up the lg_crcv promptly.

As you have ACK_TIMEOUT set to 10 seconds instead of 2, then the initial retry if there is a packet loss will be somewhere between 10 and 15 seconds. On a lossy network waiting for retries, as the number of outstanding CON requests waiting for a response is limited to the NSTART value, a lot of transmissions potentially will get delayed.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln Can you please try the latest develop branch code to see if you still have issues.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

Private update from @jqiaobln, providing the full logs with the comment

Line 1494: GET block 0 of cid 18, token c2800b28
…
Line 1993: GET block 2 of cid 18, mid d34b, token 03000061c00515
Line 2028: Retransmission of GET block 2 of cid 18
Line 2049: ACK block 2 of cid 18, mid d34b, token 03000061c00515
Line 2053: GET block 3 of cid 18, mid d34d, token 04000061c00515
Line 2134: ACK block 2 of cid 18, mid d34b, token 03000061c00515
Line 2137: ACK the token c2800b28 with 1KB data
Line 2192: ACK block 3 of cid 18, mid d34d, token 04000061c00515
Line 2196: ACK the token c2800b28 with 4KB data
 
I think the problem happened at lines 2134 and 2137.

Thanks for this. Line 2134 is a duplicate response following the re-transmitted duplicate request at line 2049.

libcoap handles duplicate responses at the start of handle_response() by matching the current mid with the last seen mid. But in your case as NSTART is > 1, there is an intervening set of requests/responses using mid d34c between lines 1993 and 2134, so the last seen mid was d34c, not dc4b at line 2134 and so duplication detection failed at this point.

The duplication response should also have been picked up in the ACK -> Get next block code in coap_handle_response_get_block() (the GET is correctly not re-requested for Block no 3) but erroneously passed the ACK up to the application (line 2137). I am assuming at this point that COAP_BLOCK_SINGLE_BODY is set when calling coap_context_set_block_mode().

I will look at getting a fix for this block logic issue.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

As a separate note, I am surprised to see the Size1 option in your detailed logs coming back from the server as per RFC7959 Section 4

Similarly, the Size1 Option may be used for two purposes:

   o  In a request carrying a Block1 Option, to indicate the current
      estimate the client has of the total size of the resource
      representation, measured in bytes ("size indication").

   o  In a 4.13 response, to indicate the maximum size that would have
      been acceptable [RFC7252], measured in bytes.

Furthermore, it would be helpful to have the ETag option in the server's response as per RFC7959 Section 2.4 to make sure libcoap re-assembles the correct set of blocks.

   Block-wise transfers can be used to GET resources whose
   representations are entirely static (not changing over time at all,
   such as in a schema describing a device), or for dynamically changing
   resources.  In the latter case, the Block2 Option SHOULD be used in
   conjunction with the ETag Option [RFC7252, Section 5.10.6], to
   ensure that the blocks being reassembled are from the same version of
   the representation: The server SHOULD include an ETag Option in each
   response.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln See PR #982 for a fix for handling the duplicate response as seen in your logs.

from libcoap.

jqiaobln avatar jqiaobln commented on August 17, 2024

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

I will give it a try ASAP. Do you recommend any other fixes I should merge as well?

The other fix would be #948 which is now merged into the develop branch.

from libcoap.

jqiaobln avatar jqiaobln commented on August 17, 2024

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln This looks like a bug. There currently is no API to track back to the original token.

I will look at getting a fix out for this,

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln See #990 for a fix (now merged into develop branch) to making sure that the NACK handler gets the PDU with the corrected token. Please let us know if this fixes what you were seeing.

from libcoap.

mrdeep1 avatar mrdeep1 commented on August 17, 2024

@jqiaobln Can this issue now be closed off?

from libcoap.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.