w3c / mediacapture-extensions
Extensions to Media Capture and Streams by the WebRTC Working Group
Home Page: https://w3c.github.io/mediacapture-extensions/
License: Other
Following on https://github.com/w3c/mediacapture-main/issues/739, the current API makes it difficult for web developers to select constraints when they are tied to each other.
It also makes it hard for web developers to select the particular native presets for which they could expect the best performance, since the user agent would then limit processing such as downsampling.
One possibility would be to expose native camera presets to web developers so that they can more easily generate the constraints they pass to applyConstraints.
A preset could be defined as a set of constraints (width, height, frame rate, pixel format...) with discrete values.
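To illustrate, here is a minimal sketch of how an app could use such presets, assuming a hypothetical array of {width, height, frameRate} preset objects were exposed (no such surface is specified yet):

```javascript
// Hedged sketch: pick the native preset closest to what the app wants,
// then turn it into exact constraints for applyConstraints().
// The shape of `presets` is hypothetical — nothing like it is specified yet.
function pickPreset(presets, ideal) {
  const score = p =>
    Math.abs(p.width - ideal.width) / ideal.width +
    Math.abs(p.height - ideal.height) / ideal.height +
    Math.abs(p.frameRate - ideal.frameRate) / ideal.frameRate;
  return presets.reduce((best, p) => (score(p) < score(best) ? p : best));
}

function presetToConstraints(p) {
  // Exact values: the preset is native, so no downsampling should be needed.
  return {
    width: { exact: p.width },
    height: { exact: p.height },
    frameRate: { exact: p.frameRate },
  };
}
```

With such discrete presets in hand, the application no longer has to guess which constraint combinations the camera can satisfy natively.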
At the moment, echo cancellation is defined on a MediaStreamTrack, but the source of the signal to be cancelled is not specified, leaving this up to the implementation.
Since most cases of echo come from a specific output device creating echo into an input device, it makes sense to specify, for a given input (made visible as a MediaStreamTrack), that it be echo-cancelled against the output that the application thinks will most affect it. Most of the time this will be the system default output device, but sometimes (as with headphones on a non-default device with mechanical-path echo) the right device is something else.
This seems to be addressable with a means of specifying which output device the input is to be echo-cancelled against; the most logical source of such identifiers is the output device ID from [[mediacapture-output]].
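A sketch of what this could look like from the application side, assuming a hypothetical echoCancellation constraint shape with an outputDeviceId member (no such shape is specified anywhere); the device list is the plain result of enumerateDevices():

```javascript
// Hedged sketch: choose which audiooutput device to echo-cancel against,
// and build a hypothetical constraint object naming it by its
// [mediacapture-output] deviceId. `outputDeviceId` is an invented name.
function echoCancellationConstraintFor(devices, preferredId) {
  const outputs = devices.filter(d => d.kind === "audiooutput");
  const chosen =
    outputs.find(d => d.deviceId === preferredId) ??
    outputs.find(d => d.deviceId === "default") ??
    outputs[0];
  if (!chosen) return { echoCancellation: true }; // fall back to plain AEC
  return { echoCancellation: { outputDeviceId: chosen.deviceId } };
}
```

The app would then pass the result to track.applyConstraints(); the fallback chain mirrors the "most of the time it's the system default" observation above.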
Tagging @o1ka
Hi all,
I am relatively new to the media capture API. Recently I found that after obtaining the MediaStream object, even if the input device changes, the metadata of the already-obtained MediaStream object is not updated. For example, if I do the following,
It seems the only way to observe this is to listen to the "MediaDevices.ondevicechange" event. And it seems video conferencing websites re-call getUserMedia when this happens. Then they can obtain a new stream with correct metadata.
Is this an intended behaviour?
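A sketch of the workaround this implies today: diff two enumerateDevices() snapshots on "devicechange" to find what changed, then re-call getUserMedia as needed. The diff helper is plain logic; the event wiring (browser-only) is shown in comments:

```javascript
// Sketch: since already-obtained track metadata doesn't update, the app has
// to diff enumerateDevices() snapshots itself to see what happened.
function diffDevices(before, after) {
  const ids = list => new Set(list.map(d => d.deviceId));
  const beforeIds = ids(before);
  const afterIds = ids(after);
  return {
    added: after.filter(d => !beforeIds.has(d.deviceId)),
    removed: before.filter(d => !afterIds.has(d.deviceId)),
  };
}

// In the page (browser-only):
// navigator.mediaDevices.addEventListener("devicechange", async () => {
//   const now = await navigator.mediaDevices.enumerateDevices();
//   const { added, removed } = diffDevices(previous, now);
//   previous = now;
//   // If the current track's device was removed, re-call getUserMedia here
//   // to obtain a new stream with correct metadata.
// });
```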
Chrome is currently exposing the type (builtin, bluetooth, maybe usb) in MediaDeviceInfo.label.
This seems OK to expose like this, but it is difficult for web applications to use, if they would like to do so. Parsing the label might be hard, and localisation might make it even more difficult.
It might be interesting to split that information into its own attribute, whose value could be an enumeration. One potential use case would be a website showing the current device in use as an icon (instead of a label) according to this information.
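For illustration, here is the kind of fragile label parsing a dedicated attribute would replace (the string matching is a guess, and it breaks under localisation — which is the point):

```javascript
// Sketch of the fragile status quo: guessing the transport type by parsing
// the localised label string, exactly what an enumerated attribute would fix.
function transportFromLabel(label) {
  const l = label.toLowerCase();
  if (l.includes("bluetooth")) return "bluetooth";
  if (l.includes("usb")) return "usb";
  if (l.includes("built-in") || l.includes("builtin")) return "builtin";
  return "unknown"; // breaks as soon as the label is localised differently
}
```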
I couldn't find anything in the specification regarding the origin that a track is attributed to. I suspect that all browsers have settled on a model that is sensible, but the spec should make a few things clear:
MediaStreamTrack objects are only readable by the origin that requested them, unless other constraints cause them to gain a different origin (the peerIdentity constraint for WebRTC does this).
MediaStreamTrack objects can be rendered if they belong to another origin, but only their size is known.
We need to decide what the rules are for constraints on cross-origin tracks. I think that if the model for transfer is that tracks are copied when transferred, then constraints can be both read and written, just as we permit a site to read and write constraints on peerIdentity-constrained tracks.
We need to consider what happens to synchronization of playback for mixed-origin MediaStreamTrack objects. Do we consider clock skew from a particular source to be something that we should protect? Whatever the decision, this is part of the set of things that we need to be very clear on.
Work progresses on transferring tracks between origins, which I think is OK, but this is groundwork for that.
The best text we have is in the from-element spec, which is honestly a little on the light side.
This came up in w3c/mediacapture-screen-share#53.
The following editorial issues have been identified:
Couldn't find "PermissionDescriptor" in this document or other cited documents: [dom], [html], [infra], [mediacapture-streams], [permissions], [webaudio], and [webidl]. See search matches for "PermissionDescriptor" or Learn about this error. Occurred at: 1. (Plugin: "core/xref").
Couldn't find "Initialize the underlying source" in this document or other cited documents: [dom], [html], [infra], [mediacapture-streams], [permissions], [webaudio], and [webidl]. See search matches for "Initialize the underlying source" or Learn about this error. Occurred at: 1. (Plugin: "core/xref").
Couldn't find "tieSourceToContext", for "MediaStreamTrack", in this document or other cited documents: [dom], [html], [infra], [mediacapture-streams], [permissions], [webaudio], and [webidl]. See search matches for "tieSourceToContext" or Learn about this error. Occurred at: 1. (Plugin: "core/xref").
Bad reference: [GETUSERMEDIA] (appears 3 times) (Plugin: "core/render-biblio").
Bad reference: [permissions] (appears 1 times) (Plugin: "core/render-biblio").
Bad reference: [RFC2119] (appears 1 times) (Plugin: "core/render-biblio").
Bad reference: [RFC8174] (appears 1 times) (Plugin: "core/render-biblio").
Bad reference: [HTML] (appears 0 times) (Plugin: "core/render-biblio").
Bad reference: [mediacapture-streams] (appears 0 times) (Plugin: "core/render-biblio").
Bad reference: [infra] (appears 0 times) (Plugin: "core/render-biblio").
Bad reference: [webidl] (appears 0 times) (Plugin: "core/render-biblio").
rvfc (requestVideoFrameCallback) can be used to grab information about video frames and do canvas painting.
FaceDetection metadata can be useful in that context.
There can be different approaches to how we could expose FaceDetection metadata in that context:
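As a sketch of the canvas-painting case, assuming face bounding boxes arrived normalised to [0, 1] (one possible metadata shape, not a specified one), mapping them to canvas pixels inside an rvfc callback could look like:

```javascript
// Sketch: map normalised face bounding boxes onto canvas pixel coordinates.
// The boundingBox shape here is an assumption for illustration only.
function faceRectsToCanvas(faces, width, height) {
  return faces.map(({ boundingBox: b }) => ({
    x: Math.round(b.x * width),
    y: Math.round(b.y * height),
    width: Math.round(b.width * width),
    height: Math.round(b.height * height),
  }));
}

// In the page (browser-only), inside a requestVideoFrameCallback callback:
// const rects = faceRectsToCanvas(metadata.detectedFaces ?? [],
//                                 canvas.width, canvas.height);
// for (const r of rects) ctx.strokeRect(r.x, r.y, r.width, r.height);
```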
It seems several interpretations of how to implement getCapabilities() are possible:
We need to add normative steps or deprecate it in favor of an async API.
In hindsight, in-content device selection was a mistake. It's
The PING outlines the way forward in w3c/mediacapture-main#640 (comment):
Privacy-by-default flow:
- Initially, the site has access to no devices or device labels
- site asks for category (or categories) of device
- browser prompts user for one, many or all devices
- site gains access to only the device, and device label, of the hardware the user selects.
That's an in-chrome picker ("in-chrome" = implemented in the browser). In-chrome pickers
w3c/mediacapture-main#644 is my proposal for reshaping getUserMedia
to serve this need, as well as solve w3c/mediacapture-main#648.
Many APIs providing a MediaStreamTrack are limited by SecureContext in WebIDL, such as enumerateDevices, getUserMedia, and getDisplayMedia. Other APIs do not require a SecureContext, e.g. captureStream.
Which of the following is the correct restriction for transferring a MediaStreamTrack?
This might be useful in case identified in w3c/mediacapture-screen-share#158.
If we go with media capture insertable streams, JavaScript could potentially shim such a postMessage by getting access to individual frames and sending them through postMessage to recreate a MediaStreamTrack.
Implementing transfer in the user agent could make it easier for developers and potentially more efficient.
Infrared cameras are common on phones, and are typically included in enumerateDevices() w3c/mediacapture-main#553.
They're rarely desirable except for special purposes, and browser vendors occasionally get bug reports where an infrared camera is chosen by default on some phones. The fix for dealing with them is usually to put them after the first non-infrared front camera and first non-infrared back camera in the list, and to label them as "(infrared)".
But since they're special-purpose, should we let apps constrain them out, using e.g. {infrared: {exact: false}}
(or in)?
Regarding Transferable MediaStreamTrack, I think there is only one reasonable approach for handling delivery of media (video frames or audio data) during transfer: media stops flowing before postMessage returns on the sending side and does not start flowing until the time the transferred MediaStreamTrack is connected to a sink on the receiving side. All intermediate frames/data are dropped.
A consequence is that media won't be delivered while a transfer is in progress, which may be surprising to some developers.
There are a few unreasonable (IMO) approaches too:
If I'm correct and there's only one reasonable approach here based on this and other specs, then perhaps nothing needs to be added to the specification for Transferable MediaStreamTrack. However, if it's likely that different UAs would choose different approaches to media delivery, we should consider nailing this down so that developers don't start relying on behavior that's not specified, e.g. if a UA provides (best-effort) continuous media delivery during a transfer.
We have all heard “eyes are the window to the soul” and their importance in effective communication. The disparity of locations of the subject and the camera make it hard to have eye contact during the video call. Recent consumer-level platforms have been able to solve the eye gaze correction problem, more often employing custom AI accelerators on the client platforms. The ability to render the gaze corrected face would help in a realistic imitation of real-world communication in an increasingly virtual world and undoubtedly be a welcome feature for the WebRTC developer community, something native platforms have been offering for some time.
Microsoft eloquently blogged about EyeGazeCorrection for their Surface lineup. Media Foundation already has a KSCAMERA_EXTENDEDPROP_EYEGAZECORRECTION_ON property starting from Windows 11, provided there is driver support.
Apple's FaceTime already has something very similar in the form of Attention Correction on devices running iOS 14.0 or later.
Strawman Proposal
<script>
const videoStream = await navigator.mediaDevices.getUserMedia({
  video: true,
});

// Show camera video stream to the user.
const video = document.querySelector("video");
video.srcObject = videoStream;

// Get video track capabilities.
const videoTrack = videoStream.getVideoTracks()[0];
const capabilities = videoTrack.getCapabilities();

async function applyEyegazeCorrection() {
  try {
    await videoTrack.applyConstraints({
      eyegazeCorrection: true,
    });
  } catch (err) {
    console.error(err);
  }
}

// Check whether eyegazeCorrection is supported before applying it.
if (capabilities.eyegazeCorrection) {
  applyEyegazeCorrection();
}
</script>
The current specification supports channelCount
https://rawgit.com/w3c/mediacapture-main/master/getusermedia.html#def-constraint-channelCount
but this is not sufficient when the track contains more than two channels. In those cases, a channelLayout is also required [1].
The purpose of prompting and the user picking is...
My gut-reaction to the user making the choice is that we don't need a lot of constraints anymore.
But there is still value in specifying desired resolution and frame rate. If the application only wants X then exceeding X is just wasting resources. For example if the application is happy with VGA 20 fps then it wastes resources to open the camera at UltraHD 60 fps.
But what if your device(s) can't do what the application asks for?
Example 1: I have a single device and it can only do 30 fps but the application is asking for 60.
I would argue that 30 fps is better than no camera whatsoever.
I would also argue that if the request rejects because of over-constraining, then we are exposing unnecessary information to the application.
Example 2: Front/back camera or multiple cameras. E.g. I have two cameras, one pointing at me and one pointing at my living room.
Maybe one of the cameras can do HD and the other can't and the application is asking for HD. When it was the application's job to do the picking for you, it made a lot of sense to rule out which device to pick. If the user is picking anyway, I'm not sure it is valid to rule out options. In getDisplayMedia() we purposefully prevented the application from influencing selection, ensuring that we only provide fingerprinting surface to whether audio, video and display surfaces are present.
I don't see why getUserMedia(), in a world where device picking is not the application's job, would be any different from getDisplayMedia(). I don't think it is valid to rule out one camera or the other. It is the user's decision whether to show their face or their living room.
Example 3: Audio+video? No, only audio? Re-prompt!
Today, getUserMedia() asks for the kinds of media that were specified. And they're required: with "audio+video" you either give both or none. So the application may ask for both only to have a mute button later (unnecessarily opening both camera and microphone, which is not ideal for privacy), or it asks, gets rejected, and then asks again. Or the application asks the user, in an application-specific UI, which kinds to pass in to getUserMedia(), doing some of the choosing for the user outside the browser UI.
Discussion:
After merging #59, we have a constraint section between sections related to the UA device picker.
We should probably move powerEfficientPixelFormat close to background blur.
And move the Algorithms/Examples section to be a subsection of the UA device picker section.
Hello there,
I'm currently working with the media capture API in different browsers. While working with MediaStreamTracks, I've noticed that the stop method doesn't provide a way to get notified when the track is actually stopped by the underlying user agent.
This can cause issues on platforms where a camera can only be opened once by the browser API.
My proposal for a solution would be to make the .stop() method async by returning a promise.
The promise would be resolved once the user agent has made sure that the media track is stopped.
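Until something like that exists, a common workaround is to retry getUserMedia until the platform has actually released the device. A sketch with the capture call injected, so only the retry logic is shown (the NotReadableError name is what implementations typically throw while a device is still held):

```javascript
// Sketch of today's workaround: since stop() gives no completion signal,
// retry the (injected) capture call until the device is actually released.
async function reopenWhenReleased(getMedia, { retries = 10, delayMs = 200 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await getMedia();
    } catch (e) {
      // Anything other than "device busy" is a real failure — rethrow it.
      if (e.name !== "NotReadableError") throw e;
    }
    await new Promise(r => setTimeout(r, delayMs));
  }
  throw new Error("device still busy after stop()");
}

// In the page (browser-only):
// track.stop();
// const stream = await reopenWhenReleased(
//   () => navigator.mediaDevices.getUserMedia({ video: true }));
```

A promise-returning stop() would make this polling unnecessary.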
Currently, when you want to ask for a specific resolution using height, width, aspectRatio and resizeMode, there is no way to specify that you do not want to crop.
Situation:
You want to use 640x360 resolution. The camera supports 640x480, 1280x720, and 1920x1080.
Despite resizeMode being "none" or "crop-and-scale", Chrome for instance will take the 4:3 640x480 and crop it to a 16:9 ratio, which effectively zooms the image, leaving barely enough room for someone's head to fit.
There is no way, using the spec, to insist that you want to scale down the nearest 16:9 resolution (1280x720) to 640x360.
Even if you say you want an exact aspect ratio of 16:9, it still chooses the crop technique.
So the spec needs to be precise enough to entice the browser developers to implement something that allows this specificity.
Perhaps a resize mode of "scale-only" or "preserve-aspect-ratio", or just some interpretation of a combination of existing params, such as "crop-and-scale" when used with an exact aspect ratio limiting the source video to those modes which meet the required aspect ratio.
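A sketch of the selection logic a "scale-only" mode would imply: keep only native modes that already match the requested aspect ratio, then scale down from the smallest one that covers the target (the mode list shape is illustrative, not from any spec):

```javascript
// Sketch: candidate selection for a hypothetical "scale-only" resize mode.
// Only native modes matching the target aspect ratio qualify; scaling down
// from the smallest covering mode avoids both cropping and upscaling.
function pickScaleOnlySource(nativeModes, target) {
  const aspect = m => m.width / m.height;
  const candidates = nativeModes
    .filter(m => Math.abs(aspect(m) - aspect(target)) < 0.01)
    .filter(m => m.width >= target.width && m.height >= target.height)
    .sort((a, b) => a.width - b.width);
  return candidates[0] ?? null; // null: only cropping could satisfy this
}
```

With the modes from the situation above, a 640x360 target would select 1280x720 and scale it down, rather than cropping 640x480.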
As supplement to a previous ticket I posted a broader API suggestions, and it was suggested these would be a good 'issue' in their own right for discussion.
I've tried to holistically address a number of issues around permissions, fingerprinting and device selection.
I'm afraid I'm very much just a user of these APIs on a practical level, with a day job, so that doesn't leave me a lot of time to comb the specifications or be intricately in tune with the details of every discussion. Though I have been reading here to try and get up to speed with the issues, to do my best to ensure this is relevant.
I do hope you will see this from a distance: reining back complexity and special cases, not adding more, and staying within the constraints of the existing API. I hope you don't mind considering broader goals from an outside contributor.
Calls to getUserMedia that do not specify a device ID (or specify "default") would be governed by a "permission to use your camera/microphone" dialogue provided by the browser:
And then, independently a permissions flag (looks like [[canExposeDeviceInfo]]?):
And that's it!
What my goals are in the above proposal:
One thing the above embodies, which seems to be a point of contention in the other issues (such as #6), is whether deviceID is a first-class consideration. Whilst it's clever to try and channel deviceID as just another constraint, the separation of 'what' media the client has access to, versus 'how' it samples that media, could be inevitable.
It is good to remember that not all apps are standard video conferencing apps; WebAudio apps for productivity already use multiple devices concurrently.
https://w3c.github.io/mediacapture-extensions/#transferable-mediastreamtrack does not specify any restrictions on transfer of a MediaStreamTrack.
When investigating the possibility of implementing this extension, we found that there were fundamental differences in the complexity of transferring within an Agent Cluster (basically same-origin iframes and DedicatedWorker) and transferring outside an Agent Cluster.
It's fairly obvious that there are potential use cases for cross-cluster transfer, but it's also obvious that supporting them is costly.
Should we restrict transfer of MediaStreamTracks to within an Agent Cluster, or should we require universal transferability?
Following on #69 and media capture transform, face detection metadata could be made available to mediastreamtrack transforms.
There are a few possibilities we could envision. The following come to mind:
Face Detection on Video Conferencing.
Support WebRTC-NV use cases like Funny Hats, etc
On the client side, developers have to use computer vision libraries (OpenCV.js / TensorFlow.js) either with a WASM (SIMD+threads) or a GPU backend for acceptable performance. Many developers would resort to cloud-based solutions like the Face API from Azure Cognitive Services or Face Detection from Google Cloud's Vision API. On modern client platforms, we can save a lot of data movement and even on-device computation, for free, by leveraging the work the camera stack / Image Processing Unit (IPU) already does to improve image quality.
Prior Work
WICG has proposed the Shape detection API which enables Web applications to use a system-provided face detector, but the API requires that the image data be provided by the Web application itself. To use the API, the application would first need to capture frames from a camera and then give the data to the Shape detection API. This may not only cause extraneous computation and copies of the frame data, but may outright prevent using the camera-dedicated hardware or system libraries for face detection. Often the camera stack performs face detection in any case to improve image quality (like 3A algorithms) and the face detection results could be made available to applications without extra computation.
Many platforms offer a camera API which can perform face detection directly on image frames from the system camera. The face detection can be assisted by the hardware which may not allow applying the functionality to user-provided image data or the API may prevent that.
Platform Support
OS | API | FaceDetection |
---|---|---|
Windows | Media Foundation | KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION |
ChromeOS/Android | Camera HAL3 | STATISTICS_FACE_DETECT_MODE_FULL, STATISTICS_FACE_DETECT_MODE_SIMPLE |
Linux | GStreamer | facedetect |
macOS | Core Image / Vision | CIDetectorTypeFace, VNDetectFaceRectanglesRequest |
ChromeOS + Android
Chrome OS and Android provide the Camera HAL3 API for any camera user. The API specifies a method to transfer various image-related metadata to applications. One metadata type contains information on detected faces. The API allows selecting the face detection mode with
STATISTICS_FACE_DETECT_MODE | Returns |
---|---|
STATISTICS_FACE_DETECT_MODE_FULL | face rectangles, scores, and landmarks including eye positions and mouth position |
STATISTICS_FACE_DETECT_MODE_SIMPLE | only face rectangles and confidence values |
In Android, the resulting face statistics are parsed and stored in the Face class.
Windows
Face detection is performed in DeviceMFT on the preview frame buffers. The DeviceMFT integrates the face detection library and turns on features when requested by the application. Face detection is enabled with the property ID KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION. When enabled, the face detection results are returned using the metadata attribute MF_CAPTURE_METADATA_FACEROIS, which contains, for each face, the face coordinates:
typedef struct tagFaceRectInfo {
RECT Region;
LONG confidenceLevel;
} FaceRectInfo;
The API also supports blink and smile detection which can be enabled with property IDs KSCAMERA_EXTENDEDPROP_FACEDETECTION_BLINK
and KSCAMERA_EXTENDEDPROP_FACEDETECTION_SMILE
.
macOS
Apple offers face detection using Core Image CIDetectorTypeFace or Vision VNDetectFaceRectanglesRequest.
Strawman proposal
<script>
// Check if face detection is supported by the browser.
const supports = navigator.mediaDevices.getSupportedConstraints();
if (!supports.faceDetection) {
  throw new Error("Face detection is not supported");
}

// Open camera with face detection enabled and show it to the user.
const stream = await navigator.mediaDevices.getUserMedia({
  video: { faceDetection: true }
});
const video = document.querySelector("video");
video.srcObject = stream;

// Get face detection results for the latest frame.
const [videoTrack] = stream.getVideoTracks();
const settings = videoTrack.getSettings();
if (settings.faceDetection) {
  const detectedFaces = settings.detectedFaces;
  for (const face of detectedFaces) {
    console.log(
      `Face @ (${face.boundingBox.x}, ${face.boundingBox.y}),` +
      ` size ${face.boundingBox.width}x${face.boundingBox.height}`);
  }
}
</script>
From https://github.com/alvestrand/mediacapture-transform/issues/59 (see link for earlier discussion):
As requested, this is to discuss our upcoming proposal, which I've written up as a standards document, with 3 examples.
This brief explainer isn't a substitute for that doc or the slides, but walks through a 41-line fiddle:
Since tracks are transferable, instead of creating all tracks ahead of time and transferring their streams, we simply transfer the camera track to the worker:

const stream = await navigator.mediaDevices.getUserMedia({video: {width: 1280, height: 720}});
video1.srcObject = stream.clone();
const [track] = stream.getVideoTracks();
const worker = new Worker(`worker.js`);
worker.postMessage({track}, [track]);

...and receive a processed track in return:

const {data} = await new Promise(r => worker.onmessage = r);
video2.srcObject = new MediaStream([data.track]);

The worker pipes the camera track.readable through a video processing step into a writable VideoTrackSource source, whose resulting source.track it transfers back with postMessage.

// worker.js
onmessage = async ({data: {track}}) => {
  const source = new VideoTrackSource();
  self.postMessage({track: source.track}, [source.track]);
  await track.readable
    .pipeThrough(new TransformStream({transform: crop}))
    .pipeTo(source.writable);
};

This avoids exposing data on the main thread by default. The source (and the real-time media pipeline) stays in the worker, while its source.track (a control surface) can be transferred to the main thread. track.clone() inherits the same source. This aligns with mediacapture-main's sources and sinks model, which separates a source from its track, and makes it easy to extend the source interface later.

The slides go on to show how clone() and applyConstraints() used in the worker help avoid needing the tee() function.
Please conduct further WG discussion here, so that discussion is tracked.
Capturing in compressed pixel formats such as MJPEG adds CPU overhead because the browser has to convert (decompress) every frame before delivery to the MediaStreamTrack and beyond.
An application that cares about both quality and performance might ask with non-required constraints for Full HD. If the user has a USB 3.0 camera, Full HD might be delivered without any compression overhead. Great! But if the user has a USB 2.0 camera, due to bus limitations, Full HD would (on cameras available for testing) be captured in MJPEG, adding this overhead. The application pays a performance debt, even though it might have been just as happy if it got HD frames at a lower cost.
TL;DR: Should we add a {video:{avoidCapturingExpensivePixelFormats:true}}
constraint? I'm not married to the name :)
Frames are captured in one format, typically NV12 (420v), YUY2 (yuvs) or MJPEG (dmb1) and then converted. Chromium traditionally converts to I420 (y420) as this format is widely supported by encoders, though it is possible to have other destination pixel formats (e.g. NV12 is supported by some encoders which could allow for a zero-conversion pipeline in WebRTC).
While YUY2 to I420 is fairly cheap, MJPEG to I420 isn't as cheap.
I set up a thin "capture and convert" demo (code) and measured the CPU usage (utilization percentage normalized by CPU frequency, using Intel Power Gadget and a script, to obtain a sense of the "absolute" amount of work performed).
Here is the result of capturing in various formats*, converting to I420 at 30 fps and measuring the CPU and power consumption.
* Caveat: NV12 and YUY2 are captured with the built-in MacBook Pro camera and MJPEG is captured using an external Logitech Webcam C930e. The external webcam could contribute to some of the added CPU usage and power consumption, so it would be good to compare MJPEG on webcam with YUY2 on the same webcam, but the majority of the work is in the pixel conversions.
Capture Format | Resolution | Normalized CPU Usage [M cycles/s] | Power Consumption [Watt] |
---|---|---|---|
NV12 (420v) | 640x480 (VGA) | 26.51 | 3.10 |
... | 1280x720 (HD) | 28.94 | 3.23 |
YUY2 (yuvs) | 640x480 (VGA) | 20.57 | 2.98 |
... | 1280x720 (HD) | 30.97 | 3.31 |
MJPEG (dmb1) | 640x480 (VGA) | 52.85 | 4.85 |
... | 1280x720 (HD) | 67.99 | 5.28 |
... | 1920x1080 (Full HD) | 102.41 | 6.27 |
Note: I am not measuring the entire browser, I am only measuring a demo that does capturing and conversion.
In this example...
Add a new video constraint, e.g. BooleanConstraint avoidCapturingExpensivePixelFormats, that if true allows the browser to skip pixel formats of a device that are deemed inefficient (e.g. MJPEG) if that same device supports capturing in other pixel formats.
On Logitech Webcam C930e, where Full HD is only available as MJPEG but 1024x576 and below is available as YUY2, getUserMedia would pick a lower resolution but avoid MJPEG.
P.S. This could result in a tradeoff between frame rate and resolution, more discussion needed.
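A sketch of what the proposed constraint could mean during device-mode selection; the mode-list shape and the notion of "expensive" (here just MJPEG) are illustrative, not specified:

```javascript
// Sketch: selection logic for a hypothetical
// avoidCapturingExpensivePixelFormats constraint. If the device offers any
// cheap format, expensive ones are skipped; otherwise all modes stay eligible.
const EXPENSIVE_FORMATS = new Set(["MJPEG"]);

function pickCaptureMode(modes, avoidExpensive) {
  let pool = modes;
  if (avoidExpensive && modes.some(m => !EXPENSIVE_FORMATS.has(m.format))) {
    pool = modes.filter(m => !EXPENSIVE_FORMATS.has(m.format));
  }
  // Prefer the largest remaining resolution (frame-rate trade-offs ignored
  // here; see the P.S. above).
  return pool.reduce((a, b) => (b.width * b.height > a.width * a.height ? b : a));
}
```

On the Logitech Webcam C930e example above, this picks 1024x576 YUY2 instead of Full HD MJPEG when the constraint is set, and Full HD MJPEG otherwise.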
Given recent discussions in the WG, the proposal for doing this that has been debated since November has been reallocated to the following URL:
https://alvestrand.github.io/mediacapture-transform/
This proposal handles both video and audio, and exposes the resulting WHATWG Stream object on the same context as the MediaStreamTrack, both on the main thread and on worker threads.
During the WebRTC WG's interim meeting of January 2023, I presented a proposal for an API which will auto-pause tracks under certain conditions and fire an event notifying the app that this has happened. This prevents incorrectly processed frames from being placed on the wire before the web application has time to process the event. Resources on this proposal include:
I believe configuration changes are sub-case here, which is why I proposed:
enum PauseReason {
  "top-level-navigation",
  "surface-switch",
  "config-change"
};
I think if we adopt some variant of my proposal, we'll end up with a more useful and general API than the configurationchange event. (It is still on my backlog to respond to the feedback given during the meeting, but I sensed the room as mostly supportive of the general thrust of that proposal.)
Wdyt? @eehakkin? Others?
Sub-issue of w3c/mediacapture-main#826.
track.getSettings() already gives us the setting (assuming w3c/mediacapture-main#906 is fixed), but it's still useful to know what the actual frame rate is as it could be lower if...
Edit: New proposal:
Previous proposal: track.framesEmitted + track.framesDropped = framesCaptured
Ideally getUserMedia would require a user gesture similarly to getDisplayMedia.
This is not web-compatible, as many pages call getUserMedia on page load or shortly after.
It would still be nice to define web-compatible heuristics where user gesture could be enforced.
There is the potential for many different configuration changes. What if an application is only interested in some? For example, we see this demo code by François, which currently reads:
function configurationChange(event) {
const settings = event.target.getSettings();
if ("backgroundBlur" in settings) {
log(`Background blur changed to ${settings.backgroundBlur ? "ON" : "OFF"}`);
}
}
This assumes that the only possible event is to blur, or else the following would be possible:
Background blur changed to OFF
Background blur changed to OFF
Background blur changed to OFF
A likely future modification of the code would be:
function configurationChange(event) {
if (event.whatChanged != "blur") {
return;
}
const settings = event.target.getSettings();
if ("backgroundBlur" in settings) {
log(`Background blur changed to ${settings.backgroundBlur ? "ON" : "OFF"}`);
}
}
However, depending on what changes and how often, this could mean that the event handler is invoked in vain 99% of the time, possibly multiple times per second, needlessly wasting CPU on processing the event in JS code.
I think a more reasonable shape for the API would be to expose a subscribe-method.
track.subscribeToConfigChanges(eventHandler, ["blur", "xxx", "yyy", "zzz"]);
Wdyt?
(Some prior art here is in w3c/mediacapture-screen-share#80. I propose my change as incremental progress over it. CC @beaufortfrancois, @guidou, @eehakkin and @youennf)
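For comparison, the proposed subscribe-method can be approximated today with a shim that diffs only the subscribed settings across configurationchange events, so the app-level handler fires only when a subscribed key actually changed (the method name follows the proposal; everything else is an assumption):

```javascript
// Hedged sketch: approximate subscribeToConfigChanges(handler, keys) on top
// of the existing "configurationchange" event by diffing subscribed settings.
function subscribeToConfigChanges(track, handler, keys) {
  const pick = settings =>
    Object.fromEntries(keys.map(k => [k, settings[k]]));
  let last = JSON.stringify(pick(track.getSettings()));
  track.addEventListener("configurationchange", event => {
    const now = JSON.stringify(pick(track.getSettings()));
    if (now !== last) { // only fire for changes to the subscribed keys
      last = now;
      handler(event);
    }
  });
}
```

The difference versus a native subscribe-method is that the shim still wakes up JS for every event; a built-in filter could avoid dispatching entirely.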
The new backgroundBlur and powerEfficientPixelFormat constraints are not written in the same way.
We should probably converge on a single template and use it consistently for both (as well as for future new constraints like voiceIsolation).
The current approach suggests testing all possible constraint combinations and using the combination that has the lowest fitness distance.
This is difficult to implement and sometimes provides interesting but perhaps unexpected results.
It would be good to try providing a simpler algorithm which could hopefully be implemented consistently.
User agent mute-toggles for camera & mic can be useful, yielding enhanced privacy (no need to trust the site) and quick access (a sneeze coming on, or a family member walking into frame?).
It's behind a pref in Firefox (privacy.webrtc.globalMuteToggles in about:config) because:
[Image titled: "Am I muted?"]
We determined we can only solve the double-mute problem by involving the site, which requires standardization.
The idea is:
The first point requires no spec change: sites can listen to the mute and unmute events on the track (but they don't).
The second point is key: if the user sees the site's button turn to "muted", they'll expect to be able to click it to unmute.
This is where it gets tricky, because we don't want to allow sites to unmute themselves at will, as this defeats any privacy benefits.
The proposal here is:
partial interface MediaStreamTrack {
  undefined unmute();
};

It would throw InvalidStateError unless the document has transient activation, is fully active, and has focus. User agents may also throw NotAllowedError for any reason, but if they don't then they must unmute the track (which will fire the unmute event).
This should let user agents that wish to do so develop UX without the double-mute problem.
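A sketch of how a site's mute button could use the proposed unmute(). showUnmuteHint is a hypothetical app helper, and the error handling mirrors the InvalidStateError / NotAllowedError behaviour described above:

```javascript
// Hedged sketch: wire the site's mute button to the proposed track.unmute(),
// which may throw without transient activation / focus. showUnmuteHint() is
// a hypothetical app helper, not part of any spec.
function onMuteButtonClicked(track, showUnmuteHint) {
  if (!track.muted) return; // nothing to do; muting stays a UA/site concern
  try {
    track.unmute(); // proposed API; fires the "unmute" event on success
  } catch (e) {
    if (e.name === "InvalidStateError" || e.name === "NotAllowedError") {
      showUnmuteHint(); // ask the user to unmute via the browser UI instead
    } else {
      throw e;
    }
  }
}
```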
We know that noise cancellation can be quite effective in many scenarios.
However, noise cancellation is, by default, somewhat restrictive in what it considers "noise", in order to lessen the chance that it damps out content the recipient wants to hear.
There are quite powerful algorithms out there that allow better noise removal if we're more sure what the recipient wants to hear - such as removing anything that does not form part of a human voice.
This behavior is sometimes desirable (such as in person to person conversation), and sometimes very undesirable (such as when playing music to each other).
Suggestion: Add a new constraint "voiceIsolation" (values true & false) that, when true, tries to isolate the human voice and remove all other parts of the audio signal.
This may also enable features such as directionality (beam-forming) that attempt to take signal only from the direction from which a human voice is detected.
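A usage sketch for the suggested constraint; `voiceIsolation` is the name proposed in this issue, not an existing constraint, so the helper only requests it when the user agent reports support:

```javascript
// Build audio constraints that add the proposed voiceIsolation
// constraint (hypothetical name, per this issue) only when supported,
// so older browsers are unaffected.
function buildAudioConstraints(supported) {
  const constraints = { echoCancellation: true, noiseSuppression: true };
  if (supported.voiceIsolation) {
    constraints.voiceIsolation = true;
  }
  return constraints;
}

// Browser-only usage sketch:
// const supported = navigator.mediaDevices.getSupportedConstraints();
// const stream = await navigator.mediaDevices.getUserMedia({
//   audio: buildAudioConstraints(supported),
// });
```

An app playing music would pass voiceIsolation: false (or omit it) instead, matching the "sometimes very undesirable" case above.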
A new Apple feature called "mic mode" apparently permits specifying some kinds of audio processing for microphone devices; this was called to Chrome's attention in https://bugs.chromium.org/p/chromium/issues/detail?id=1282442
Is this something that could be useful to expose in the WebRTC API? Are there adaptations that could be made to accommodate this without changing the API?
Assigning to @youennf for comment.
@annevk says most objects that are transferable are also serializable.
MediaStreamTrack has a custom clone() method today, like VideoFrame has. But VideoFrame is serializable while MediaStreamTrack is not.
After looking over our custom clone algorithm in w3c/mediacapture-main#821, it seems semantically compatible to me, and since tracks are already transferable, I think it would make sense to make them serializable as well. This should simplify our algorithms, and make the following work:
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();
const clone = structuredClone(track);
Today this produces "DataCloneError: The object could not be cloned." in Firefox Nightly.
FaceDetection metadata is one example of VideoFrame segmentation, which is useful for:
It occurs to me that rather than defining FaceDetection metadata, we might instead define Segmentation metadata, with a type field of "face detection".
From https://github.com/w3c/mediacapture-main/issues/669#issuecomment-605114117:
@henbos Specifically on removing required constraints, note that Chrome today implements info.getCapabilities(), which gives the site capability information about all devices after gUM.
That API exists to allow a site to enforce its constraints while building a picker, or to choose another device outright. Most sites enforce some constraints.
That API is also a trove of fingerprinting information.
Luckily, "user-chooses" provides feature parity with this, without the massive information leak:
await getUserMedia({video: constraints, semantics: "user-chooses"});
So merging w3c/mediacapture-main#667 would let us retire info.getCapabilities(), provided we leave constraints alone. 🎉
So far, it looks like only Chrome/Edge implement it (WPT)
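The picker-building pattern that info.getCapabilities() enables (and that makes it a fingerprinting trove) looks roughly like this; the `caps` shape mirrors what InputDeviceInfo.getCapabilities() returns in Chrome, and the helper names are illustrative:

```javascript
// Check whether a device's reported capability ranges can satisfy the
// app's wanted values. Only width/height are checked in this sketch.
function canSatisfy(caps, wanted) {
  const fitsRange = (range, value) =>
    range && range.min <= value && value <= range.max;
  return (!wanted.width || fitsRange(caps.width, wanted.width)) &&
         (!wanted.height || fitsRange(caps.height, wanted.height));
}

// Browser-only usage sketch (after a successful getUserMedia):
// const devices = await navigator.mediaDevices.enumerateDevices();
// const usable = devices.filter(d =>
//   d.kind === "videoinput" &&
//   canSatisfy(d.getCapabilities(), {width: 1280, height: 720}));
```

Note that every capability range the site reads this way is also a fingerprinting bit, which is the trade-off the issue describes.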
Current browsers do not add pages that have MediaStreamTrack to the b/f cache.
It would be good to get consensus on how we could end up doing so.
A possibility is for local MediaStreamTracks to be ended when the page enters the b/f cache.
This would simulate capture failures, which can always happen.
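An app could approximate this behavior itself today by ending its capture tracks when the page may enter the back/forward cache. The pagehide event and its persisted flag are standard; the policy itself is a sketch:

```javascript
// End local capture tracks when the page might be placed in the
// back/forward cache, simulating a capture failure as described above.
function stopCaptureOnPageHide(event, tracks) {
  if (event.persisted) {        // page may enter the b/f cache
    for (const track of tracks) {
      track.stop();
    }
  }
}

// Browser-only wiring:
// window.addEventListener("pagehide", e =>
//   stopCaptureOnPageHide(e, stream.getTracks()));
```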
It is not clear from the spec whether transferring a MediaStreamTrack preserves the subtype.
For instance, are we expecting transferring CanvasCaptureMediaStreamTrack to end up being a CanvasCaptureMediaStreamTrack?
Probably, but we would need to state something like this in the CanvasCaptureMediaStreamTrack spec.
We could also state in the mediacapture-extensions transfer section that subtypes are expected to be preserved, and refer to the extension specs to define the additional steps required.
As pointed out by @aboba, we could improve the spec by adding examples and motivation.
Google Meet, Microsoft Teams, Zoom, and virtually every video-conferencing application these days offer background concealment (blur or replacement) so that users can minimize distractions and keep the focus on the subject. Most web-based apps use some form of AI inference to implement this feature; for example, Jitsi uses Meet's model and TensorFlow Lite's WASM backend [commit].
The popularity of this feature natively, and the use of NN frameworks to implement it on the web platform, warrant a discussion of whether it makes sense to bring it to the Web Platform (WebRTC) in a shape that benefits everyone without each application bringing its own framework, leveraging underlying platform support that in many cases may be accelerated by VPUs or other ASIC processors.
Media Foundation has added support for background segmentation using properties like KSCAMERA_EXTENDEDPROP_BACKGROUNDSEGMENTATION_BLUR
starting with Windows 11, when the underlying driver supports it. By encapsulating the (preferably ASIC-)hardware-accelerated inference work in the driver and leveraging standard platform APIs, we avoid reinventing the wheel for every web application.
Apple's Segmentation Matte in Portrait Mode captures are essentially a general framework to implement many new features, one of which can be Background Replacement.
Open questions:
- Do we try to compose a single API for Background Blur (BB) and Background Replacement (BR)?
- Blur level? // range [0, 1]
- Background media type? // image, video, 3d_animation
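One illustrative shape for a combined BB/BR constraint, purely to make the open questions concrete; `backgroundEffect`, `blurLevel`, and `media` are hypothetical names from this issue, not from any spec:

```javascript
// Build a hypothetical background-effect constraint value covering both
// blur (with a level) and replacement (with a media type).
function makeBackgroundEffect(kind, options = {}) {
  if (kind === "blur") {
    // Clamp the blur level into the [0, 1] range discussed above.
    const level = Math.min(1, Math.max(0, options.level ?? 1));
    return { backgroundEffect: "blur", blurLevel: level };
  }
  if (kind === "replace") {
    // media: "image" | "video" | "3d_animation"
    return { backgroundEffect: "replace", media: options.media ?? "image" };
  }
  return { backgroundEffect: "none" };
}

// Browser-only usage sketch (hypothetical constraint):
// await track.applyConstraints({
//   advanced: [makeBackgroundEffect("blur", { level: 0.5 })],
// });
```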
Currently the transfer spec only changes the interface definition for MediaStreamTrack, making it Transferable and adding Worker to the Exposed list. If we keep worker exposure, some other interfaces need to be exposed on workers as well, including:
Since the supported Face Detection models may vary by camera, the API proposed in PR 89 can potentially give varying results depending on the hardware. This will impose a support burden on applications, which could need to maintain a camera blacklist.
Such a list would be difficult to develop without the ability to identify the camera hardware, which in turn could be considered a fingerprinting risk.
These issues do not arise for applications utilizing an existing face detection model written for an ML platform, since those models will yield the same results, albeit with better or worse performance depending on the (GPU) hardware. The variance of results therefore represents a disincentive to use of the proposed APIs.
The discussion around adding backgroundBlur to the spec uncovered a pattern that I think we may see more often.
With backgroundBlur, we handled this by saying:
That way, an app can detect whether the function is on or off, and whether it can change it or not.
It might be worth pulling out this pattern as a documented pattern, so that other features that behave like this can refer to it instead of re-explaining it for every constraint.
(I note that the description in the spec for backgroundBlur doesn't go into details on this pattern either. Might be worth improving.)
Thoughts?
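The pattern can be expressed in code: a boolean capability reported as a list of allowed values tells the app both the current state and whether it can change it. The helper names below are illustrative; the capability/settings shapes match how backgroundBlur is surfaced:

```javascript
// Interpret a boolean capability reported as a sequence of allowed
// values, e.g. [true] (on, not changeable) vs [false, true] (changeable).
function analyzeBoolCapability(allowedValues, currentSetting) {
  return {
    active: currentSetting === true,
    canChange: Array.isArray(allowedValues) && allowedValues.length > 1,
  };
}

// Browser-only usage sketch:
// const caps = track.getCapabilities();   // e.g. { backgroundBlur: [true] }
// const settings = track.getSettings();   // e.g. { backgroundBlur: true }
// const { active, canChange } =
//   analyzeBoolCapability(caps.backgroundBlur, settings.backgroundBlur);
```

Documenting this once, as proposed above, would let future constraints of the same shape simply reference it.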
From a stackoverflow question, newer phones have multiple back cameras. How would an app distinguish between them, specifically: avoid or pick the often unsuitable wide-lens camera?
While getUserMedia does support device selection using constraints, I see no reliable difference in constrainable properties between a wide-lens camera and its regular counterpart.
Proposal: A new focalLength constraint.
This would be the distance between sensor and lens (I'm no photographer or device expert, but this seems to often be an inherent property of the lens, e.g. on the Samsung S10).
This would be different from the existing (and similar-sounding) focusDistance, which is the distance from lens to object.
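A sketch of how an app might use such a constraint, assuming it existed and also appeared in capabilities; focalLength is NOT a real constraint today, so everything here is hypothetical:

```javascript
// Pick the back camera with the longest focal length, on the assumption
// that the wide-angle lens has the shortest one. `cameras` entries are
// hypothetical: [{ deviceId, focalLength }] with focalLength in mm.
function pickNonWideCamera(cameras) {
  return cameras.reduce((best, c) =>
    c.focalLength > best.focalLength ? c : best);
}

// Browser-only usage sketch (hypothetical constraint):
// const chosen = pickNonWideCamera(candidates);
// const stream = await navigator.mediaDevices.getUserMedia({
//   video: { facingMode: "environment",
//            focalLength: { ideal: chosen.focalLength } },
// });
```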
Visit a new web site in Chrome, Safari, or Edge, & do something requiring camera + microphone:
They'll say the site wants to "use your camera and microphone", without saying which ones (USB webcams, headsets, modern phones https://github.com/w3c/mediacapture-main/issues/655). If it's wrong, you'll need to correct it after the fact.
Browsers may not even choose the same camera and microphone, a web compat issue (e.g. headset detection).
Firefox is different, showing which camera and microphone will be used, even letting you change it (within the constraints of the app):
But not everyone with multiple devices uses Firefox.
It would be better if users got to choose based on how many devices they have, not what browser they use, and maybe regardless of permission if an app is this indecisive on subsequent visits.
It also feels like this should be an app decision, not a browser trait.
Proposal A: w3c/mediacapture-main#644 (comment) would fix this, prompting on indecision or lack of permission. But it may not be web compatible at this point.
Proposal B: Add a new getUserMedia boolean that enables the w3c/mediacapture-main#644 (comment) behavior:
await navigator.mediaDevices.getUserMedia({video: true, chosen: true});
"chosen" means both tracks must be chosen by the user (or app), not the user agent.
In the interest of web compat, Firefox would remove its picker unless chosen is true, giving web users the same experience across browsers.
Proposal C: Same as B, but with new method:
await navigator.mediaDevices.chooseUserMedia({video: true});
Incidentally, this would be the same API used to replace in-content device selection https://github.com/w3c/mediacapture-main/issues/652.
Currently the API in PR 78 proposes to provide hw acceleration for Face Detection based on camera driver support. Tying support for accelerated Face Detection to support in a camera driver seems unlikely to provide wide coverage, since it is likely to only be supported on new camera models. Acceleration using more commonly available hardware (such as GPUs) will be likely to have wider coverage, which is why ML development tools such as tensorflow.js utilize this technique. It is also why GPU acceleration is mentioned in WebRTC-NV Uses Cases Section 3.6.
Face Detection APIs that do not achieve wide coverage will be frustrating for applications that do not wish to develop their own face detection models. These applications will either need to operate without face detection support if it is not available, or they will need to include their own face detection capabilities - which will lessen the need for the proposed APIs.
Those applications that include their own face detection models will also probably choose to forgo the proposed APIs, choosing instead to leverage GPU-based acceleration approaches supported by Web ML platforms such as Tensorflow.js.
References: https://lists.w3.org/Archives/Public/public-webrtc/2023Jan/0047.html
Hi all,
It seems that currently the mute event and the muted attribute of MediaStreamTrack don't reflect OS-level settings. For example, if a user mutes the mic in the OS settings, there is no way for web apps to get notified about it. How do you think we could make OS-level muting also trigger the "mute" event?
Thanks!