The Cozmo mobile app stores audio files in .wem and .bnk formats. Both are <a href="ht

Cozmo protocol audio format about pycozmo HOT 10 CLOSED

zayfod commented on August 26, 2024

Cozmo protocol audio format

from pycozmo.

Comments (10)

gimait commented on August 26, 2024

Hi! I have been looking into this a bit and I think I figured out how the transfer of audio data happens between the app and the robot.

I believe that the transferred chunks correspond to 22 kHz, 8-bit, unsigned PCM samples, which are transferred as follows:

1. When a new sample is ready, it is put in a queue and its transmission starts.
2. The sample is then resent periodically until a Keyframe message is received from the robot. I am now fairly sure that these Keyframe messages are used to confirm that a new frame has been received.
3. When the app receives a Keyframe, it transmits the next sample.
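
The resend-until-acknowledged pattern in the steps above can be sketched as follows. This is only an illustration of the idea, not pycozmo code: `StopAndWaitSender`, its `send` callable, and the timing values are all hypothetical, and the real resend interval used by the app is not known from this thread.

```python
import threading


class StopAndWaitSender:
    """Resend one chunk until an acknowledgement arrives (hypothetical helper)."""

    def __init__(self, send, resend_interval=0.01, max_attempts=50):
        self._send = send                        # callable that transmits one chunk
        self._resend_interval = resend_interval  # assumed value, in seconds
        self._max_attempts = max_attempts
        self._acked = threading.Event()

    def on_keyframe(self):
        """Call this whenever a Keyframe message is received from the robot."""
        self._acked.set()

    def send_chunk(self, chunk):
        """Transmit a chunk repeatedly; return how many transmissions it took."""
        self._acked.clear()
        for attempt in range(1, self._max_attempts + 1):
            self._send(chunk)
            # Wait for the acknowledgement; on timeout, loop and resend.
            if self._acked.wait(self._resend_interval):
                return attempt
        raise TimeoutError("no Keyframe received for this chunk")
```

In use, the receive path would call `on_keyframe()` and the playback loop would call `send_chunk()` once per audio chunk, so the next chunk is only sent after the previous one has been confirmed.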

This is more or less the pattern I could see looking at the packets transferred during this data transfer, however, there is an extra detail that makes this work and that I wanted to discuss before starting to prepare a fix for the audio.

To prevent packet loss, the app sends each audio packet several times (until a Keyframe is sent back). The OutputAudio messages don't have an id, so instead of checking an id to avoid repeating chunks, the app does one of the following (I'm not sure which):

- It either repeats the whole Frame that contains the OutputAudio,
- or it uses the ack bytes as a message identifier, keeping them the same even when new information is included in the Frame.

In any case, I believe that some changes to ClientConnection and SendThread are needed to allow this, so I wanted to ask whether you have any ideas or suggestions about how to proceed.
Some of the changes I would make to fix this are:

- Refactor SendThread so that it can send several packets together in one message (limited by a max_frame_size), and include a way to resend messages and/or specify the first bytes of the Frame. For this, it would make sense for the thread to use a separate timer for sending messages instead of the current 'queue.get' method.
- Add callbacks to ClientConnection to handle received Keyframes and control the audio transfer.
- Add a new file containing methods to read audio files.

I'm starting to work on a PR for this now, and I'll submit it once I have it ready to go.
Please, let me know your thoughts, and if this makes sense. I would like to know also if you have plans to improve/change stuff around the communication, so I can write things in the same direction.

zayfod commented on August 26, 2024

I also suspect that the audio is encoded as 22 kHz, 8-bit, unsigned, mono PCM and is transferred in 744 sample chunks. I am confident that a sample value of 0x80 represents silence, which seems to confirm this suspicion.
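
A small sketch of what this framing would imply, assuming that "22 kHz" means 22050 Hz (an assumption, the thread does not give the exact rate). The helper names are hypothetical. It splits raw unsigned 8-bit PCM into 744-sample chunks, pads the last chunk with the 0x80 silence value, and computes how long one chunk lasts.

```python
SAMPLE_RATE_HZ = 22050  # assumption: "22 kHz" taken to mean 22050 Hz
CHUNK_SAMPLES = 744     # chunk size reported in this thread
SILENCE = 0x80          # midpoint of unsigned 8-bit PCM


def split_into_chunks(pcm, chunk_samples=CHUNK_SAMPLES):
    """Split unsigned 8-bit mono PCM bytes into fixed-size chunks."""
    chunks = []
    for start in range(0, len(pcm), chunk_samples):
        chunk = pcm[start:start + chunk_samples]
        # Pad a short final chunk with silence so every chunk has equal length.
        chunk += bytes([SILENCE]) * (chunk_samples - len(chunk))
        chunks.append(chunk)
    return chunks


def chunk_duration_ms(chunk_samples=CHUNK_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Playback time of one chunk in milliseconds."""
    return 1000.0 * chunk_samples / sample_rate_hz
```

Under these assumptions one chunk lasts roughly 33.7 ms, which is close to one animation frame at about 30 frames per second.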

What do you call "keyframe message"? PacketType.KEYFRAME?

I do believe that Cozmo animations (body/lift/head movement, face images, backpack LED animations) are synchronized to audio. This means that the robot will buffer a number of animation commands sent with AnimHead, AnimLift, AnimBody, AnimBackpackLights, and will execute them when either OutputAudio or NextFrame is sent. On a separate note, the name of the NextFrame packet is probably incorrect and it should be something like OutputAudioSilence instead.

Without knowing more, I'd argue that SendThread and ClientConnection do not need modification but a new AnimationPlayer (or something like this) is needed on top of ClientConnection. It will register for the ~~AnimationState~~ KEYFRAME packet and drive animation playback based on it. This higher level construct will support animation and/or audio playback.

I assume you are aware of the audio.py example. Would it make sense to modify it so that, instead of sleeping to maintain the playback rate, it waits for PacketType.KEYFRAME?

gimait commented on August 26, 2024

> I also suspect that the audio is encoded as 22 kHz, 8-bit, unsigned, mono PCM and is transferred in 744 sample chunks. I am confident that a sample value of 0x80 represents silence, which seems to confirm this suspicion.

Actually, I think I might have gone a bit too fast with that conclusion... After doing some more tests, I am starting to think that the audio is compressed in some way before it is sent. If you take the audio sent from the app and play it in any audio player as 22 kHz unsigned samples, you will hear mostly noise. Also, when plotting the waveform, it looks like it has been compressed. I am looking into compression and encoding algorithms now to see if I can figure out how to get it to work.
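
One way to run this kind of listening test is to wrap the captured payload bytes in a WAV container and open the result in an ordinary audio player. The sketch below uses only Python's standard `wave` module; the function name and the 22050 Hz rate are assumptions, and extracting the payload bytes from a capture is left out.

```python
import io
import wave


def u8_wav_bytes(payload, sample_rate_hz=22050):
    """Wrap raw bytes as a mono, 8-bit unsigned PCM WAV file and return it."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(1)           # 1 byte per sample; 8-bit WAV is unsigned
        wav.setframerate(sample_rate_hz)
        wav.writeframes(payload)      # bytes are written as-is, no conversion
    return buffer.getvalue()
```

Writing the returned bytes to a `.wav` file is enough to listen to them. If the payload is really plain PCM it should sound right; if it is encoded in some other way it will sound distorted or noisy, as described above.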

> What do you call "keyframe message"? PacketType.KEYFRAME?

Yes, sorry, that is what I meant.

> I do believe that Cozmo animations (body/lift/head movement, face images, backpack LED animations) are synchronized to audio.

I am not sure about this, that is partly why I wanted to ask before submitting any changes.
From looking at the network traffic, I concluded that the PacketType.KEYFRAME is related to the audio synchronization. At first, I thought that the app would use some identifier/timestamp like the num_audio_frames_played from the AnimationState to do this.

However, after trying to use this, I was still having problems with the synchronization (although I might recheck this). There was also the fact that the number of PacketType.KEYFRAME fits exactly with the number of audio frames sent, while the AnimationState does not necessarily fit with the number of audio frames or with the timing of the transmission. What is more, it is possible to reproduce the audio without specifying any animations by simply using PacketType.KEYFRAME to confirm that an audio frame was received.
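
The packet-count comparison described above can be reproduced with a simple tally. The sketch assumes a capture has already been decoded into a list of packet type names, which is a hypothetical representation, as are the names used in the example.

```python
from collections import Counter


def count_packet_types(packet_names):
    """Tally how often each packet type name occurs in a decoded capture."""
    return Counter(packet_names)


# Hypothetical decoded capture, in arrival order.
capture = ["OutputAudio", "Keyframe", "AnimationState", "OutputAudio", "Keyframe"]
counts = count_packet_types(capture)
audio_matches_keyframes = counts["OutputAudio"] == counts["Keyframe"]
```

If the counts of OutputAudio and Keyframe packets match across several captures while AnimationState does not, that supports the interpretation of Keyframe as a per-frame acknowledgement.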

Because of all this, now I am almost sure that the PacketType.KEYFRAME are used to confirm that a new frame containing an OutputAudio packet was received, and is the key to reproduce audio on the robot. If this is how it works, maybe the audio-movement synchronization is done by the app, I don't really know.

> Without knowing more, I'd argue that SendThread and ClientConnection do not need modification but a new AnimationPlayer (or something like this) is needed on top of ClientConnection. It will register for the AnimationState packet and drive animation playback based on it. This higher level construct will support animation and/or audio playback.

I think small changes are needed in SendThread to allow resending frames. This is necessary because Cozmo uses the header of the frame to avoid playing repeated frames. It also makes sense to allow resending frames to prevent data loss during communication. As for ClientConnection, I agree, I think it can be left untouched, and I am now trying to keep it without changes. Instead, I'm creating an "audio manager" class that takes care of sending the audio and managing the synchronization. And yeah, the class I am working on does very much the same as the audio.py example.

zayfod commented on August 26, 2024

Agreed. There seems to be some sort of light compression. Interestingly, when plain PCM audio is played with the audio.py example, the quality is poor but the audio is understandable...

One experiment that I have been thinking about is to pick a Cozmo .wem file, play it with the app, and record the OutputAudio packets. The .wem file can be converted to .ogg/.wav and then compared to the processed audio sent to Cozmo. This could provide more information about what is going on.

Thoughts?

Matching the number of OutputAudio packets and KEYFRAME packets is a great observation!

Yes, you are right that SendThread is missing frame retransmission. This is completely independent from audio and will be a great general communication robustness improvement.

gimait commented on August 26, 2024

That's a good idea, I'll try that and see if I find some encoding that fits. Do you have any idea on what kind of algorithms might be used? I have been reading a little about it and I can't find any obvious encoding that would fit with the data sent. I think I'll start trying some random methods and see what happens, but I'd appreciate any input :)

zayfod commented on August 26, 2024

I have no idea about the audio encoding format. I've done some research also but I have not been able to make progress for now.

The Android app uses libogg but it is not clear whether this is for its own purposes or for the communication with the robot.

It is possible that this is a non-standard encoding, similar to the non-standard image transfer format for the camera, referred to internally as "minicolor" / "minigray". That format is basically JPEG with the header removed to save bandwidth.

gimait commented on August 26, 2024

That's okay, thanks! I'll spend some time going through .ogg and some of the most common codecs, maybe one will give something that looks like the signal transmitted. Let's see :)

gimait commented on August 26, 2024

So, it turns out that the encoding was u-law (µ-law).
I implemented a solution in #20. It seems to work well, but I'd like some feedback on the way I fixed the frame repetition in SendThread and on the structure I chose for the AudioManager class.
Let me know if you have any suggestions!
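
For readers unfamiliar with u-law, here is a minimal sketch of the standard G.711 µ-law companding of one 16-bit signed sample into one byte, with the matching decoder. It is not taken from the PR, and the exact variant the robot expects (for example, whether the bits are inverted as in G.711) is an assumption that this thread does not settle.

```python
BIAS = 0x84   # 132, added before finding the segment (G.711 convention)
CLIP = 32635  # largest magnitude that still fits after adding the bias


def linear_to_ulaw(sample):
    """Encode one 16-bit signed PCM sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # G.711 transmits the bitwise complement of sign/exponent/mantissa.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(byte):
    """Decode one G.711 mu-law byte back to a 16-bit signed PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude
```

The encoding is logarithmic, so quiet samples keep fine resolution while loud samples are quantized coarsely. That would be consistent with the earlier observations that the payload sounds noisy when played as plain PCM and that the waveform looks compressed.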

zayfod commented on August 26, 2024

This is very exciting! I've actually experimented with WAVE files in u-law and a-law format but for some reason decided that was not the right encoding. Maybe I did not get some of the other properties right.

I'll comment on the PR directly.

zayfod commented on August 26, 2024

Resolved with #20
