w3c / mediacapture-extensions
Extensions to Media Capture and Streams by the WebRTC Working Group
Home Page: https://w3c.github.io/mediacapture-extensions/
License: Other
Following on https://github.com/w3c/mediacapture-main/issues/739, the current API makes it difficult for web developers to select constraints when they are tied to each other.
It also makes it hard for web developers to select the particular native presets for which they could expect the best performance, since the user agent would then limit processing such as downsampling.
One possibility would be to expose native camera presets to web developers so that they can more easily generate the constraints they pass to applyConstraints.
A preset could be defined as a set of constraints (width, height, frame rate, pixel format...) with discrete values.
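To illustrate, here is a minimal sketch of how an app could use such presets, assuming a hypothetical array of {width, height, frameRate} preset objects were exposed (no such surface is specified yet):

```javascript
// Hedged sketch: pick the native preset closest to what the app wants,
// then turn it into exact constraints for applyConstraints().
// The shape of `presets` is hypothetical — nothing like it is specified yet.
function pickPreset(presets, ideal) {
  const score = p =>
    Math.abs(p.width - ideal.width) / ideal.width +
    Math.abs(p.height - ideal.height) / ideal.height +
    Math.abs(p.frameRate - ideal.frameRate) / ideal.frameRate;
  return presets.reduce((best, p) => (score(p) < score(best) ? p : best));
}

function presetToConstraints(p) {
  // Exact values: the preset is native, so no downsampling should be needed.
  return {
    width: { exact: p.width },
    height: { exact: p.height },
    frameRate: { exact: p.frameRate },
  };
}
```

With such discrete presets in hand, the application no longer has to guess which constraint combinations the camera can satisfy natively.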
At the moment, echo cancellation is defined on a MediaStreamTrack, but the source of the signal to be cancelled is not specified, leaving this up to the implementation.
Since most cases of echo come from a specific output device creating echo into an input device, it makes sense to specify, for a given input (made visible as a MediaStreamTrack), that it be echo-cancelled against the output that the application thinks will most affect it. Most of the time this will be the system default output device, but sometimes (as with headphones on a non-default device with mechanical-path echo) the right device is something else.
This seems to be addressable with a means of specifying which output device the input is to be echo-cancelled against; the most logical source of such identifiers is the output device ID from [[mediacapture-output]].
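A sketch of what this could look like from the application side, assuming a hypothetical echoCancellation constraint shape with an outputDeviceId member (no such shape is specified anywhere); the device list is the plain result of enumerateDevices():

```javascript
// Hedged sketch: choose which audiooutput device to echo-cancel against,
// and build a hypothetical constraint object naming it by its
// [mediacapture-output] deviceId. `outputDeviceId` is an invented name.
function echoCancellationConstraintFor(devices, preferredId) {
  const outputs = devices.filter(d => d.kind === "audiooutput");
  const chosen =
    outputs.find(d => d.deviceId === preferredId) ??
    outputs.find(d => d.deviceId === "default") ??
    outputs[0];
  if (!chosen) return { echoCancellation: true }; // fall back to plain AEC
  return { echoCancellation: { outputDeviceId: chosen.deviceId } };
}
```

The app would then pass the result to track.applyConstraints(); the fallback chain mirrors the "most of the time it's the system default" observation above.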
Tagging @o1ka
Hi all,
I am relatively new to the media capture API. Recently I found that after obtaining the MediaStream object, even if the input device changes, the metadata of the already-obtained MediaStream object is not updated. For example, if I do the following,
It seems the only way to observe this is to listen to the "MediaDevices.ondevicechange" event. And it seems video conferencing websites re-call getUserMedia when this happens. Then they can obtain a new stream with correct metadata.
Is this an intended behaviour?
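A sketch of the workaround this implies today: diff two enumerateDevices() snapshots on "devicechange" to find what changed, then re-call getUserMedia as needed. The diff helper is plain logic; the event wiring (browser-only) is shown in comments:

```javascript
// Sketch: since already-obtained track metadata doesn't update, the app has
// to diff enumerateDevices() snapshots itself to see what happened.
function diffDevices(before, after) {
  const ids = list => new Set(list.map(d => d.deviceId));
  const beforeIds = ids(before);
  const afterIds = ids(after);
  return {
    added: after.filter(d => !beforeIds.has(d.deviceId)),
    removed: before.filter(d => !afterIds.has(d.deviceId)),
  };
}

// In the page (browser-only):
// navigator.mediaDevices.addEventListener("devicechange", async () => {
//   const now = await navigator.mediaDevices.enumerateDevices();
//   const { added, removed } = diffDevices(previous, now);
//   previous = now;
//   // If the current track's device was removed, re-call getUserMedia here
//   // to obtain a new stream with correct metadata.
// });
```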
Chrome is currently exposing the type (builtin, bluetooth, maybe usb) in MediaDeviceInfo.label.
This seems OK to expose like this, but it is difficult for web applications to use, if they would like to do so. Parsing the label might be hard, and localisation might make it even more difficult.
It might be interesting to split that information into its own attribute, whose value could be an enumeration. One potential use case would be a website showing the current device in use as an icon (instead of a label) according to this information.
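For illustration, here is the kind of fragile label parsing a dedicated attribute would replace (the string matching is a guess, and it breaks under localisation — which is the point):

```javascript
// Sketch of the fragile status quo: guessing the transport type by parsing
// the localised label string, exactly what an enumerated attribute would fix.
function transportFromLabel(label) {
  const l = label.toLowerCase();
  if (l.includes("bluetooth")) return "bluetooth";
  if (l.includes("usb")) return "usb";
  if (l.includes("built-in") || l.includes("builtin")) return "builtin";
  return "unknown"; // breaks as soon as the label is localised differently
}
```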
I couldn't find anything in the specification regarding the origin that a track is attributed to. I suspect that all browsers have settled on a model that is sensible, but the spec should make a few things clear:
MediaStreamTrack objects are only readable by the origin that requested them, unless other constraints cause them to gain a different origin (the peerIdentity constraint for WebRTC does this).
MediaStreamTrack objects can be rendered if they belong to another origin, but only their size is known.
We need to decide what the rules are for constraints on cross-origin tracks. I think that if the model for transfer is that tracks are copied when transferred, then constraints can be both read and written, just as we permit a site to read and write constraints on peerIdentity-constrained tracks.
We need to consider what happens to synchronization of playback for mixed-origin MediaStreamTrack objects. Do we consider clock skew from a particular source to be something that we should protect? Whatever the decision, this is part of the set of things that we need to be very clear on.
Work progresses on transferring tracks between origins, which I think is OK, but this is groundwork for that.
The best text we have is in the from-element spec, which is honestly a little on the light side.
This came up in w3c/mediacapture-screen-share#53.
The following editorial issues have been identified:
Couldn't find "PermissionDescriptor" in this document or other cited documents: [dom], [html], [infra], [mediacapture-streams], [permissions], [webaudio], and [webidl]. See search matches for "PermissionDescriptor" or Learn about this error. Occurred at: 1. (Plugin: "core/xref").
Couldn't find "Initialize the underlying source" in this document or other cited documents: [dom], [html], [infra], [mediacapture-streams], [permissions], [webaudio], and [webidl]. See search matches for "Initialize the underlying source" or Learn about this error. Occurred at: 1. (Plugin: "core/xref").
Couldn't find "tieSourceToContext", for "MediaStreamTrack", in this document or other cited documents: [dom], [html], [infra], [mediacapture-streams], [permissions], [webaudio], and [webidl]. See search matches for "tieSourceToContext" or Learn about this error. Occurred at: 1. (Plugin: "core/xref").
Bad reference: [GETUSERMEDIA] (appears 3 times) (Plugin: "core/render-biblio").
Bad reference: [permissions] (appears 1 times) (Plugin: "core/render-biblio").
Bad reference: [RFC2119] (appears 1 times) (Plugin: "core/render-biblio").
Bad reference: [RFC8174] (appears 1 times) (Plugin: "core/render-biblio").
Bad reference: [HTML] (appears 0 times) (Plugin: "core/render-biblio").
Bad reference: [mediacapture-streams] (appears 0 times) (Plugin: "core/render-biblio").
Bad reference: [infra] (appears 0 times) (Plugin: "core/render-biblio").
Bad reference: [webidl] (appears 0 times) (Plugin: "core/render-biblio").
rvfc (requestVideoFrameCallback) can be used to grab information about video frames and do canvas painting.
FaceDetection metadata can be useful in that context.
There can be different approaches to how we could expose FaceDetection metadata in that context:
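As a sketch of the canvas-painting case, assuming face bounding boxes arrived normalised to [0, 1] (one possible metadata shape, not a specified one), mapping them to canvas pixels inside an rvfc callback could look like:

```javascript
// Sketch: map normalised face bounding boxes onto canvas pixel coordinates.
// The boundingBox shape here is an assumption for illustration only.
function faceRectsToCanvas(faces, width, height) {
  return faces.map(({ boundingBox: b }) => ({
    x: Math.round(b.x * width),
    y: Math.round(b.y * height),
    width: Math.round(b.width * width),
    height: Math.round(b.height * height),
  }));
}

// In the page (browser-only), inside a requestVideoFrameCallback callback:
// const rects = faceRectsToCanvas(metadata.detectedFaces ?? [],
//                                 canvas.width, canvas.height);
// for (const r of rects) ctx.strokeRect(r.x, r.y, r.width, r.height);
```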
It seems several interpretations of how to implement getCapabilities() are possible:
We need to add normative steps or deprecate it in favor of an async API.
In hindsight, in-content device selection was a mistake. It's
The PING outlines the way forward in w3c/mediacapture-main#640 (comment):
Privacy-by-default flow:
- Initially, the site has access to no devices or device labels
- site asks for category (or categories) of device
- browser prompts user for one, many or all devices
- site gains access to only the device, and device label, of the hardware the user selects.
That's an in-chrome picker ("in-chrome" = implemented in the browser). In-chrome pickers
w3c/mediacapture-main#644 is my proposal for reshaping getUserMedia
to serve this need, as well as solve w3c/mediacapture-main#648.
Many APIs providing a MediaStreamTrack are limited by SecureContext in WebIDL, such as enumerateDevices, getUserMedia, and getDisplayMedia. Other APIs do not require a SecureContext, e.g. captureStream.
Which of the following is the correct restriction for transferring a MediaStreamTrack?
This might be useful in case identified in w3c/mediacapture-screen-share#158.
If we go with media capture insertable streams, JavaScript could potentially shim such a postMessage by getting access to individual frames and sending them through postMessage to recreate a MediaStreamTrack.
Implementing transfer in the user agent could make it easier for developers and potentially more efficient.
Infrared cameras are common on phones, and are typically included in enumerateDevices() w3c/mediacapture-main#553.
They're rarely desirable except for special purposes, and browser vendors occasionally get bug reports where an infrared camera is chosen by default on some phones. The fix for dealing with them is usually to put them after the first non-infrared front camera and first non-infrared back camera in the list, and to label them as "(infrared)".
But since they're special-purpose, should we let apps constrain them out, using e.g. {infrared: {exact: false}}
(or in)?
Regarding Transferable MediaStreamTrack, I think there is only one reasonable approach for handling delivery of media (video frames or audio data) during transfer: media stops flowing before postMessage returns on the sending side and does not start flowing until the time the transferred MediaStreamTrack is connected to a sink on the receiving side. All intermediate frames/data are dropped.
A consequence is that media won't be delivered while a transfer is in progress, which may be surprising to some developers.
There are a few unreasonable (IMO) approaches too:
If I'm correct and there's only one reasonable approach here based on this and other specs, then perhaps nothing needs to be added to the specification for Transferable MediaStreamTrack. However, if it's likely that different UAs would choose different approaches to media delivery, we should consider nailing this down so that developers don't start relying on behavior that's not specified, e.g. if a UA provides (best-effort) continuous media delivery during a transfer.
We have all heard “eyes are the window to the soul” and their importance in effective communication. The disparity of locations of the subject and the camera make it hard to have eye contact during the video call. Recent consumer-level platforms have been able to solve the eye gaze correction problem, more often employing custom AI accelerators on the client platforms. The ability to render the gaze corrected face would help in a realistic imitation of real-world communication in an increasingly virtual world and undoubtedly be a welcome feature for the WebRTC developer community, something native platforms have been offering for some time.
Microsoft eloquently blogged about EyeGazeCorrection for their Surface lineup. Media Foundation already has a KSCAMERA_EXTENDEDPROP_EYEGAZECORRECTION_ON property starting from Windows 11, provided there is driver support.
Apple's FaceTime already has something very similar in the form of Attention Correction on devices running iOS 14.0 or later.
Strawman Proposal
<script>
const videoStream = await navigator.mediaDevices.getUserMedia({
  video: true,
});

// Show camera video stream to the user.
const video = document.querySelector("video");
video.srcObject = videoStream;

// Get video track capabilities.
const videoTrack = videoStream.getVideoTracks()[0];
const capabilities = videoTrack.getCapabilities();

async function applyEyegazeCorrection() {
  try {
    await videoTrack.applyConstraints({
      eyegazeCorrection: true,
    });
  } catch (err) {
    console.error(err);
  }
}

// Check whether eyegazeCorrection is supported before applying it.
if (capabilities.eyegazeCorrection) {
  applyEyegazeCorrection();
}
</script>
The current specification supports channelCount
https://rawgit.com/w3c/mediacapture-main/master/getusermedia.html#def-constraint-channelCount
but this is not sufficient when the track contains more than two channels. In those cases, a channelLayout is also required [1].
The purpose of prompting and the user picking is...
My gut-reaction to the user making the choice is that we don't need a lot of constraints anymore.
But there is still value in specifying desired resolution and frame rate. If the application only wants X then exceeding X is just wasting resources. For example if the application is happy with VGA 20 fps then it wastes resources to open the camera at UltraHD 60 fps.
But what if your device(s) can't do what the application asks for?
Example 1: I have a single device and it can only do 30 fps but the application is asking for 60.
I would argue that 30 fps is better than no camera whatsoever.
I would also argue that if the request rejects because of over-constraining, then we are exposing unnecessary information to the application.
Example 2: Front/back camera or multiple cameras. E.g. I have two cameras, one pointing at me and one pointing at my living room.
Maybe one of the cameras can do HD and the other can't and the application is asking for HD. When it was the application's job to do the picking for you, it made a lot of sense to rule out which device to pick. If the user is picking anyway, I'm not sure it is valid to rule out options. In getDisplayMedia() we purposefully prevented the application from influencing selection, ensuring that we only provide fingerprinting surface to whether audio, video and display surfaces are present.
I don't see why getUserMedia(), in a world where device picking is not the application's job, would be any different from getDisplayMedia(). I don't think it is valid to rule out one camera or the other. It is the user's decision whether to show their face or their living room.
Example 3: Audio+video? No, only audio? Re-prompt!
Today, getUserMedia() asks for the kinds of media that were specified. And they're required: with "audio+video" you either give both or none. So the application may ask for both only to have a mute button later (unnecessarily opening both camera and microphone, which is not ideal for privacy), or it asks, gets rejected, and then asks again. Or the application asks the user, in an application-specific UI, which kinds to pass in to getUserMedia(), doing some of the choosing for the user outside the browser UI.
Discussion:
After merging #59, we have a constraint section between sections related to the UA device picker.
We should probably move powerEfficientPixelFormat close to background blur.
And move the Algorithms/Examples section to be a subsection of the UA device picker section.
Hello there,
I'm currently working with the media capture API in different browsers. While working with MediaStreamTracks, I've noticed that the stop method doesn't provide a way to get notified when the track is actually stopped by the underlying user agent.
This can cause issues on platforms where a camera can only be opened once by the browser API.
My proposal for a solution would be to make the .stop() method async by returning a promise.
The promise would be resolved once the user agent has made sure that the media track is stopped.
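Until something like that exists, a common workaround is to retry getUserMedia until the platform has actually released the device. A sketch with the capture call injected, so only the retry logic is shown (the NotReadableError name is what implementations typically throw while a device is still held):

```javascript
// Sketch of today's workaround: since stop() gives no completion signal,
// retry the (injected) capture call until the device is actually released.
async function reopenWhenReleased(getMedia, { retries = 10, delayMs = 200 } = {}) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await getMedia();
    } catch (e) {
      // Anything other than "device busy" is a real failure — rethrow it.
      if (e.name !== "NotReadableError") throw e;
    }
    await new Promise(r => setTimeout(r, delayMs));
  }
  throw new Error("device still busy after stop()");
}

// In the page (browser-only):
// track.stop();
// const stream = await reopenWhenReleased(
//   () => navigator.mediaDevices.getUserMedia({ video: true }));
```

A promise-returning stop() would make this polling unnecessary.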
Currently, when you want to ask for a specific resolution using height, width, aspectRatio and resizeMode, there is no way to specify that you do not want to crop.
Situation:
You want to use 640x360 resolution. The camera supports 640x480, 1280x720, and 1920x1080.
Despite resizeMode being "none" or "crop-and-scale", Chrome for instance will take the 4:3 640x480 and crop it to a 16:9 ratio, which effectively zooms the image, leaving barely enough room for someone's head to fit.
There is no way, using the spec, to insist that you want to scale down the nearest 16:9 resolution (1280x720) to 640x360.
Even if you say you want an exact aspect ratio of 16:9, it still chooses the crop technique.
So the spec needs to be precise enough to entice the browser developers to implement something that allows this specificity.
Perhaps a resize mode of "scale-only" or "preserve-aspect-ratio", or just some interpretation of a combination of existing params, such as "crop-and-scale" when used with an exact aspect ratio limiting the source video to those modes which meet the required aspect ratio.
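A sketch of the selection logic a "scale-only" mode would imply: keep only native modes that already match the requested aspect ratio, then scale down from the smallest one that covers the target (the mode list shape is illustrative, not from any spec):

```javascript
// Sketch: candidate selection for a hypothetical "scale-only" resize mode.
// Only native modes matching the target aspect ratio qualify; scaling down
// from the smallest covering mode avoids both cropping and upscaling.
function pickScaleOnlySource(nativeModes, target) {
  const aspect = m => m.width / m.height;
  const candidates = nativeModes
    .filter(m => Math.abs(aspect(m) - aspect(target)) < 0.01)
    .filter(m => m.width >= target.width && m.height >= target.height)
    .sort((a, b) => a.width - b.width);
  return candidates[0] ?? null; // null: only cropping could satisfy this
}
```

With the modes from the situation above, a 640x360 target would select 1280x720 and scale it down, rather than cropping 640x480.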
As supplement to a previous ticket I posted a broader API suggestions, and it was suggested these would be a good 'issue' in their own right for discussion.
I've tried to holistically address a number of issues around permissions, fingerprinting and device selection.
I'm afraid I'm very much just a user of these APIs on a practical level, with a day job, so that doesn't leave me a lot of time to comb the specifications or be intricately in tune with the details of every discussion. Though I have been reading here to try and get up to speed with the issues, to do my best to ensure this is relevant.
I do hope you will see this from a distance: reining back complexity and special cases, not adding more, and staying within the constraints of the existing API. I hope you don't mind considering broader goals from an outside contributor.
Calls to getUserMedia that do not specify a device ID (or specify "default") would be governed by a "permission to use your camera/microphone" dialogue provided by the browser:
And then, independently a permissions flag (looks like [[canExposeDeviceInfo]]?):
And that's it!
What my goals are in the above proposal:
One thing the above embodies, which seems to be a point of contention in the other issues (such as #6), is whether deviceID is a first-class consideration. Whilst it's clever to try and channel deviceID as just another constraint, the separation of 'what' media the client has access to, versus 'how' it samples that media, could be inevitable.
It is good to remember that not all apps are standard video conferencing apps; WebAudio apps for productivity already use multiple devices concurrently.
https://w3c.github.io/mediacapture-extensions/#transferable-mediastreamtrack does not specify any restrictions on transfer of a MediaStreamTrack.
When investigating the possibility of implementing this extension, we found that there were fundamental differences in the complexity of transferring within an Agent Cluster (basically same-origin iframes and DedicatedWorker) and transferring outside an Agent Cluster.
It's fairly obvious that there are potential use cases for cross-cluster transfer, but it's also obvious that supporting them is costly.
Should we restrict transfer of MediaStreamTracks to within an Agent Cluster, or should we require universal transferability?
Following on #69 and media capture transform, face detection metadata could be made available to mediastreamtrack transforms.
There are a few possibilities we could envision. The following come to mind:
Face Detection on Video Conferencing.
Support WebRTC-NV use cases like Funny Hats, etc
On the client side, developers have to use computer vision libraries (OpenCV.js / TensorFlow.js) either with a WASM (SIMD+threads) or a GPU backend for acceptable performance. Many developers would resort to cloud-based solutions like the Face API from Azure Cognitive Services or Face Detection from Google Cloud's Vision API. On modern client platforms, we can save a lot of data movement and even on-device computation, for free, by leveraging the work the camera stack / Image Processing Unit (IPU) already does to improve image quality.
Prior Work
WICG has proposed the Shape detection API which enables Web applications to use a system-provided face detector, but the API requires that the image data be provided by the Web application itself. To use the API, the application would first need to capture frames from a camera and then give the data to the Shape detection API. This may not only cause extraneous computation and copies of the frame data, but may outright prevent using the camera-dedicated hardware or system libraries for face detection. Often the camera stack performs face detection in any case to improve image quality (like 3A algorithms) and the face detection results could be made available to applications without extra computation.
Many platforms offer a camera API which can perform face detection directly on image frames from the system camera. The face detection can be assisted by the hardware which may not allow applying the functionality to user-provided image data or the API may prevent that.
Platform Support
OS | API | FaceDetection |
---|---|---|
Windows | Media Foundation | KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION |
ChromeOS/Android | Camera HAL3 | STATISTICS_FACE_DETECT_MODE_FULL, STATISTICS_FACE_DETECT_MODE_SIMPLE |
Linux | GStreamer | facedetect |
macOS | Core Image / Vision | CIDetectorTypeFace, VNDetectFaceRectanglesRequest |
ChromeOS + Android
Chrome OS and Android provide the Camera HAL3 API for any camera user. The API specifies a method to transfer various image-related metadata to applications. One metadata type contains information on detected faces. The API allows selecting the face detection mode with
STATISTICS_FACE_DETECT_MODE | Returns |
---|---|
STATISTICS_FACE_DETECT_MODE_FULL | face rectangles, scores, and landmarks including eye positions and mouth position |
STATISTICS_FACE_DETECT_MODE_SIMPLE | only face rectangles and confidence values |
In Android, the resulting face statistics are parsed and stored in the Face class.
Windows
Face detection is performed in DeviceMFT on the preview frame buffers. The DeviceMFT integrates the face detection library and turns on features when requested by the application. Face detection is enabled with the property ID KSPROPERTY_CAMERACONTROL_EXTENDED_FACEDETECTION. When enabled, the face detection results are returned using the metadata attribute MF_CAPTURE_METADATA_FACEROIS, which contains, for each face, the face coordinates:
typedef struct tagFaceRectInfo {
RECT Region;
LONG confidenceLevel;
} FaceRectInfo;
The API also supports blink and smile detection which can be enabled with property IDs KSCAMERA_EXTENDEDPROP_FACEDETECTION_BLINK
and KSCAMERA_EXTENDEDPROP_FACEDETECTION_SMILE
.
macOS
Apple offers face detection using Core Image CIDetectorTypeFace or Vision VNDetectFaceRectanglesRequest.
Strawman proposal
<script>
// Check if face detection is supported by the browser.
const supports = navigator.mediaDevices.getSupportedConstraints();
if (!supports.faceDetection) {
  throw new Error("Face detection is not supported");
}

// Open camera with face detection enabled and show it to the user.
const stream = await navigator.mediaDevices.getUserMedia({
  video: { faceDetection: true }
});
const video = document.querySelector("video");
video.srcObject = stream;

// Get face detection results for the latest frame.
const [videoTrack] = stream.getVideoTracks();
const settings = videoTrack.getSettings();
if (settings.faceDetection) {
  const detectedFaces = settings.detectedFaces;
  for (const face of detectedFaces) {
    console.log(
      `Face @ (${face.boundingBox.x}, ${face.boundingBox.y}),` +
      ` size ${face.boundingBox.width}x${face.boundingBox.height}`);
  }
}
</script>
From https://github.com/alvestrand/mediacapture-transform/issues/59 (see link for earlier discussion):
As requested, this is to discuss our upcoming proposal, which I've written up as a standards document, with 3 examples.
This brief explainer isn't a substitute for that doc or the slides, but walks through a 41-line fiddle:
Since tracks are transferable, instead of creating all tracks ahead of time and transferring their streams, we simply transfer the camera track to the worker:

const stream = await navigator.mediaDevices.getUserMedia({video: {width: 1280, height: 720}});
video1.srcObject = stream.clone();
const [track] = stream.getVideoTracks();
const worker = new Worker(`worker.js`);
worker.postMessage({track}, [track]);

...and receive a processed track in return:

const {data} = await new Promise(r => worker.onmessage = r);
video2.srcObject = new MediaStream([data.track]);

The worker pipes the camera track.readable through a video processing step into a writable VideoTrackSource source, whose resulting source.track it transfers back with postMessage.

// worker.js
onmessage = async ({data: {track}}) => {
  const source = new VideoTrackSource();
  self.postMessage({track: source.track}, [source.track]);
  await track.readable
    .pipeThrough(new TransformStream({transform: crop}))
    .pipeTo(source.writable);
};

This avoids exposing data on the main thread by default. The source (and the real-time media pipeline) stays in the worker, while its source.track (a control surface) can be transferred to the main thread. track.clone() inherits the same source. This aligns with mediacapture-main's sources and sinks model, which separates a source from its track, and makes it easy to extend the source interface later.

The slides go on to show how clone() and applyConstraints() used in the worker help avoid needing the tee() function.
Please conduct further WG discussion here, so that discussion is tracked.
Capturing in compressed pixel formats such as MJPEG adds CPU overhead because the browser has to convert (decompress) every frame before delivery to the MediaStreamTrack and beyond.
An application that cares about both quality and performance might ask with non-required constraints for Full HD. If the user has a USB 3.0 camera, Full HD might be delivered without any compression overhead. Great! But if the user has a USB 2.0 camera, due to bus limitations, Full HD would (on cameras available for testing) be captured in MJPEG, adding this overhead. The application pays a performance debt, even though it might have been just as happy if it got HD frames at a lower cost.
TL;DR: Should we add a {video:{avoidCapturingExpensivePixelFormats:true}}
constraint? I'm not married to the name :)
Frames are captured in one format, typically NV12 (420v), YUY2 (yuvs) or MJPEG (dmb1) and then converted. Chromium traditionally converts to I420 (y420) as this format is widely supported by encoders, though it is possible to have other destination pixel formats (e.g. NV12 is supported by some encoders which could allow for a zero-conversion pipeline in WebRTC).
While YUY2 to I420 is fairly cheap, MJPEG to I420 isn't as cheap.
I set up a thin "capture and convert" demo (code) and measured the CPU usage (utilization percentage normalized by CPU frequency, using Intel Power Gadget and a script, to obtain a sense of the "absolute" amount of work performed).
Here is the result of capturing in various formats*, converting to I420 at 30 fps and measuring the CPU and power consumption.
* Caveat: NV12 and YUY2 are captured with the built-in MacBook Pro camera and MJPEG is captured using an external Logitech Webcam C930e. The external webcam could contribute to some of the added CPU usage and power consumption, so it would be good to compare MJPEG on webcam with YUY2 on the same webcam, but the majority of the work is in the pixel conversions.
Capture Format | Resolution | Normalized CPU Usage [M cycles/s] | Power Consumption [Watt] |
---|---|---|---|
NV12 (420v) | 640x480 (VGA) | 26.51 | 3.10 |
... | 1280x720 (HD) | 28.94 | 3.23 |
YUY2 (yuvs) | 640x480 (VGA) | 20.57 | 2.98 |
... | 1280x720 (HD) | 30.97 | 3.31 |
MJPEG (dmb1) | 640x480 (VGA) | 52.85 | 4.85 |
... | 1280x720 (HD) | 67.99 | 5.28 |
... | 1920x1080 (Full HD) | 102.41 | 6.27 |
Note: I am not measuring the entire browser, I am only measuring a demo that does capturing and conversion.
In this example...
Add a new video constraint, e.g. BooleanConstraint avoidCapturingExpensivePixelFormats, that if true allows the browser to skip pixel formats of a device that are deemed inefficient (e.g. MJPEG) if that same device supports capturing in other pixel formats.
On Logitech Webcam C930e, where Full HD is only available as MJPEG but 1024x576 and below is available as YUY2, getUserMedia would pick a lower resolution but avoid MJPEG.
P.S. This could result in a tradeoff between frame rate and resolution, more discussion needed.
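A sketch of what the proposed constraint could mean during device-mode selection; the mode-list shape and the notion of "expensive" (here just MJPEG) are illustrative, not specified:

```javascript
// Sketch: selection logic for a hypothetical
// avoidCapturingExpensivePixelFormats constraint. If the device offers any
// cheap format, expensive ones are skipped; otherwise all modes stay eligible.
const EXPENSIVE_FORMATS = new Set(["MJPEG"]);

function pickCaptureMode(modes, avoidExpensive) {
  let pool = modes;
  if (avoidExpensive && modes.some(m => !EXPENSIVE_FORMATS.has(m.format))) {
    pool = modes.filter(m => !EXPENSIVE_FORMATS.has(m.format));
  }
  // Prefer the largest remaining resolution (frame-rate trade-offs ignored
  // here; see the P.S. above).
  return pool.reduce((a, b) => (b.width * b.height > a.width * a.height ? b : a));
}
```

On the Logitech Webcam C930e example above, this picks 1024x576 YUY2 instead of Full HD MJPEG when the constraint is set, and Full HD MJPEG otherwise.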
Given recent discussions in the WG, the proposal for doing this that has been debated since November has been reallocated to the following URL:
https://alvestrand.github.io/mediacapture-transform/
This proposal handles both video and audio, and exposes the resulting WHATWG Stream object on the same context as the MediaStreamTrack, both on the main thread and on worker threads.
During the WebRTC WG's interim meeting of January 2023, I presented a proposal for an API which will auto-pause tracks under certain conditions and fire an event notifying the app that this has happened. This prevents incorrectly processed frames from being placed on the wire before the web application has time to process the event. Resources on this proposal include:
I believe configuration changes are sub-case here, which is why I proposed:
enum PauseReason {
  "top-level-navigation",
  "surface-switch",
  "config-change"
};
I think if we adopt some variant of my proposal, we'll end up with a more useful and general API than the configurationchange event. (It is still on my backlog to respond to the feedback given during the meeting, but I sensed the room as mostly supportive of the general thrust of that proposal.)
Wdyt? @eehakkin? Others?
Sub-issue of w3c/mediacapture-main#826.
track.getSettings() already gives us the setting (assuming w3c/mediacapture-main#906 is fixed), but it's still useful to know what the actual frame rate is as it could be lower if...
Edit: New proposal:
Previous proposal: track.framesEmitted + track.framesDropped = framesCaptured
Ideally getUserMedia would require a user gesture similarly to getDisplayMedia.
This is not web-compatible, as many pages call getUserMedia on page load or shortly after.
It would still be nice to define web-compatible heuristics where user gesture could be enforced.
There is the potential for many different configuration changes. What if an application is only interested in some? For example, we see this demo code by François, which currently reads:
function configurationChange(event) {
const settings = event.target.getSettings();
if ("backgroundBlur" in settings) {
log(`Background blur changed to ${settings.backgroundBlur ? "ON" : "OFF"}`);
}
}
This assumes that the only possible event is to blur, or else the following would be possible:
Background blur changed to OFF
Background blur changed to OFF
Background blur changed to OFF
A likely future modification of the code would be:
function configurationChange(event) {
if (event.whatChanged != "blur") {
return;
}
const settings = event.target.getSettings();
if ("backgroundBlur" in settings) {
log(`Background blur changed to ${settings.backgroundBlur ? "ON" : "OFF"}`);
}
}
However, depending on what changes and how often, this could mean that the event handler is invoked in vain 99% of the time, possibly multiple times per second, needlessly wasting CPU on processing the event in JS code.
I think a more reasonable shape for the API would be to expose a subscribe-method.
track.subscribeToConfigChanges(eventHandler, ["blur", "xxx", "yyy", "zzz"]);
Wdyt?
(Some prior art here is in w3c/mediacapture-screen-share#80. I propose my change as incremental progress over it. CC @beaufortfrancois, @guidou, @eehakkin and @youennf)
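For comparison, the proposed subscribe-method can be approximated today with a shim that diffs only the subscribed settings across configurationchange events, so the app-level handler fires only when a subscribed key actually changed (the method name follows the proposal; everything else is an assumption):

```javascript
// Hedged sketch: approximate subscribeToConfigChanges(handler, keys) on top
// of the existing "configurationchange" event by diffing subscribed settings.
function subscribeToConfigChanges(track, handler, keys) {
  const pick = settings =>
    Object.fromEntries(keys.map(k => [k, settings[k]]));
  let last = JSON.stringify(pick(track.getSettings()));
  track.addEventListener("configurationchange", event => {
    const now = JSON.stringify(pick(track.getSettings()));
    if (now !== last) { // only fire for changes to the subscribed keys
      last = now;
      handler(event);
    }
  });
}
```

The difference versus a native subscribe-method is that the shim still wakes up JS for every event; a built-in filter could avoid dispatching entirely.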
The new backgroundBlur and powerEfficientPixelFormat constraints are not written in the same way.
We should probably converge on a single template and use it consistently for both (as well as for future new constraints like voiceIsolation).
The current approach suggests testing all possible constraint combinations and using the combination that has the lowest fitness distance.
This is difficult to implement and sometimes provides interesting but perhaps unexpected results.
It would be good to try providing a simpler algorithm which could hopefully be implemented consistently.
User agent mute-toggles for camera & mic can be useful, yielding enhanced privacy (no need to trust the site) and quick access (a sneeze coming on, or a family member walking into frame?).
It's behind a pref in Firefox (privacy.webrtc.globalMuteToggles in about:config) because:
[Image titled: "Am I muted?"]
We determined we can only solve the double-mute problem by involving the site, which requires standardization.
The idea is:
The first point requires no spec change: sites can listen to the mute and unmute events on the track (but they don't).
The second point is key: if the user sees the site's button turn to "muted", they'll expect to be able to click it to unmute.
This is where it gets tricky, because we don't want to allow sites to unmute themselves at will, as this defeats any privacy benefits.
The proposal here is:
partial interface MediaStreamTrack {
  undefined unmute();
};

It would throw InvalidStateError unless the document has transient activation, is fully active, and has focus. User agents may also throw NotAllowedError for any reason, but if they don't then they must unmute the track (which will fire the unmute event).
This should let user agents that wish to do so develop UX without the double-mute problem.
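A sketch of how a site's mute button could use the proposed unmute(). showUnmuteHint is a hypothetical app helper, and the error handling mirrors the InvalidStateError / NotAllowedError behaviour described above:

```javascript
// Hedged sketch: wire the site's mute button to the proposed track.unmute(),
// which may throw without transient activation / focus. showUnmuteHint() is
// a hypothetical app helper, not part of any spec.
function onMuteButtonClicked(track, showUnmuteHint) {
  if (!track.muted) return; // nothing to do; muting stays a UA/site concern
  try {
    track.unmute(); // proposed API; fires the "unmute" event on success
  } catch (e) {
    if (e.name === "InvalidStateError" || e.name === "NotAllowedError") {
      showUnmuteHint(); // ask the user to unmute via the browser UI instead
    } else {
      throw e;
    }
  }
}
```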
We know that noise cancellation can be quite effective in many scenarios.
However, noise cancellation is, by default, somewhat restrictive in what it considers "noise", in order to lessen the chance that it damps out content the recipient wants to hear.
There are quite powerful algorithms out there that allow better noise removal if we're more sure what the recipient wants to hear - such as removing anything that does not form part of a human voice.
This behavior is sometimes desirable (such as in person to person conversation), and sometimes very undesirable (such as when playing music to each other).
Suggestion: Add a new constraint "voiceIsolation" (values true & false) that, when true, tries to isolate the human voice and remove all other parts of the audio signal.
This may also enable features such as directionality (beam-forming) that attempt to take signal only from the direction from which a human voice is detected.
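A usage sketch for the suggested constraint; `voiceIsolation` is the name proposed in this issue, not an existing constraint, so the helper only requests it when the user agent reports support:

```javascript
// Build audio constraints that add the proposed voiceIsolation
// constraint (hypothetical name, per this issue) only when supported,
// so older browsers are unaffected.
function buildAudioConstraints(supported) {
  const constraints = { echoCancellation: true, noiseSuppression: true };
  if (supported.voiceIsolation) {
    constraints.voiceIsolation = true;
  }
  return constraints;
}

// Browser-only usage sketch:
// const supported = navigator.mediaDevices.getSupportedConstraints();
// const stream = await navigator.mediaDevices.getUserMedia({
//   audio: buildAudioConstraints(supported),
// });
```

An app playing music would pass voiceIsolation: false (or omit it) instead, matching the "sometimes very undesirable" case above.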
A new Apple feature called "mic mode" apparently permits specifying some kinds of audio processing for microphone devices; this was called to Chrome's attention in https://bugs.chromium.org/p/chromium/issues/detail?id=1282442
Is this something that could be useful to expose in the WebRTC API? Are there adaptations that could be made to accommodate this without changing the API?
Assigning to @youennf for comment.
@annevk says most objects that are transferable are also serializable.
MediaStreamTrack has a custom clone() method today, like VideoFrame has. But VideoFrame is serializable while MediaStreamTrack is not.
After looking over our custom clone algorithm in w3c/mediacapture-main#821, it seems semantically compatible to me, and since tracks are already transferable, I think it would make sense to make them serializable as well. This should simplify our algorithms, and make the following work:
const stream = await navigator.mediaDevices.getUserMedia({video: true});
const [track] = stream.getVideoTracks();
const clone = structuredClone(track);
Today this produces "DataCloneError: The object could not be cloned." in Firefox Nightly.
FaceDetection metadata is one example of VideoFrame segmentation, which is useful for:
It occurs to me that rather than defining FaceDetection metadata, we might instead define Segmentation metadata, with a type field of "face detection".
From https://github.com/w3c/mediacapture-main/issues/669#issuecomment-605114117:
@henbos Specifically on removing required constraints, note that Chrome today implements info.getCapabilities(), which gives the site capability information about all devices after gUM.
That API exists to allow a site to enforce its constraints while building a picker, or to choose another device outright. Most sites enforce some constraints.
That API is also a trove of fingerprinting information.
Luckily, "user-chooses" provides feature parity with this, without the massive information leak:
await getUserMedia({video: constraints, semantics: "user-chooses"});
So merging w3c/mediacapture-main#667 would let us retire info.getCapabilities(), provided we leave constraints alone. 🎉
So far, it looks like only Chrome/Edge implement it (WPT)
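The picker-building pattern that info.getCapabilities() enables (and that makes it a fingerprinting trove) looks roughly like this; the `caps` shape mirrors what InputDeviceInfo.getCapabilities() returns in Chrome, and the helper names are illustrative:

```javascript
// Check whether a device's reported capability ranges can satisfy the
// app's wanted values. Only width/height are checked in this sketch.
function canSatisfy(caps, wanted) {
  const fitsRange = (range, value) =>
    range && range.min <= value && value <= range.max;
  return (!wanted.width || fitsRange(caps.width, wanted.width)) &&
         (!wanted.height || fitsRange(caps.height, wanted.height));
}

// Browser-only usage sketch (after a successful getUserMedia):
// const devices = await navigator.mediaDevices.enumerateDevices();
// const usable = devices.filter(d =>
//   d.kind === "videoinput" &&
//   canSatisfy(d.getCapabilities(), {width: 1280, height: 720}));
```

Note that every capability range the site reads this way is also a fingerprinting bit, which is the trade-off the issue describes.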
Current browsers do not add pages that have MediaStreamTrack to the b/f cache.
It would be good to get consensus on how we could end up doing so.
A possibility is for local MediaStreamTracks to be ended when the page enters the b/f cache.
This would simulate capture failures, which can always happen.
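An app could approximate this behavior itself today by ending its capture tracks when the page may enter the back/forward cache. The pagehide event and its persisted flag are standard; the policy itself is a sketch:

```javascript
// End local capture tracks when the page might be placed in the
// back/forward cache, simulating a capture failure as described above.
function stopCaptureOnPageHide(event, tracks) {
  if (event.persisted) {        // page may enter the b/f cache
    for (const track of tracks) {
      track.stop();
    }
  }
}

// Browser-only wiring:
// window.addEventListener("pagehide", e =>
//   stopCaptureOnPageHide(e, stream.getTracks()));
```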
It is not clear from the spec whether transferring a MediaStreamTrack preserves the subtype.
For instance, are we expecting transferring CanvasCaptureMediaStreamTrack to end up being a CanvasCaptureMediaStreamTrack?
Probably, but we would need to state something like this in the CanvasCaptureMediaStreamTrack spec.
We could also state in the mediacapture-extensions transfer section that subtypes are expected to be preserved, and refer to the extension specs to define the additional steps required.
As pointed out by @aboba, we could improve the spec by adding examples and motivation.
Google Meet, Microsoft Teams, Zoom, and virtually every video-conferencing application these days offer background concealment (blur or replacement) so that users can minimize distractions and keep the focus on the subject. Most web-based apps use some form of AI inference to implement this feature; for example, Jitsi uses Meet's model and TensorFlow Lite's WASM backend [commit].
The popularity of this feature natively, and the use of NN frameworks to implement it on the web platform, warrant a discussion of whether it makes sense to bring it to the Web Platform (WebRTC) in a shape that benefits everyone without each application bringing its own framework, leveraging underlying platform support that in many cases may be accelerated by VPUs or other ASIC processors.
Media Foundation has added support for background segmentation using properties like KSCAMERA_EXTENDEDPROP_BACKGROUNDSEGMENTATION_BLUR
starting with Windows 11, when the underlying driver supports it. By encapsulating the (preferably ASIC-)hardware-accelerated inference work in the driver and leveraging standard platform APIs, we avoid reinventing the wheel for every web application.
Apple's Segmentation Matte in Portrait Mode captures are essentially a general framework to implement many new features, one of which can be Background Replacement.
Open questions:
- Do we try to compose a single API for Background Blur (BB) and Background Replacement (BR)?
- Blur level? // range [0, 1]
- Background media type? // image, video, 3d_animation
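One illustrative shape for a combined BB/BR constraint, purely to make the open questions concrete; `backgroundEffect`, `blurLevel`, and `media` are hypothetical names from this issue, not from any spec:

```javascript
// Build a hypothetical background-effect constraint value covering both
// blur (with a level) and replacement (with a media type).
function makeBackgroundEffect(kind, options = {}) {
  if (kind === "blur") {
    // Clamp the blur level into the [0, 1] range discussed above.
    const level = Math.min(1, Math.max(0, options.level ?? 1));
    return { backgroundEffect: "blur", blurLevel: level };
  }
  if (kind === "replace") {
    // media: "image" | "video" | "3d_animation"
    return { backgroundEffect: "replace", media: options.media ?? "image" };
  }
  return { backgroundEffect: "none" };
}

// Browser-only usage sketch (hypothetical constraint):
// await track.applyConstraints({
//   advanced: [makeBackgroundEffect("blur", { level: 0.5 })],
// });
```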
Currently the transfer spec only changes the interface definition for MediaStreamTrack, making it Transferable and adding Worker to the Exposed list. If we keep worker exposure, some other interfaces need to be exposed on workers as well, including:
Since the supported Face Detection models may vary by camera, the API proposed in PR 89 can potentially give varying results depending on the hardware. This will impose a support burden on applications, which could need to maintain a camera blacklist.
Such a list would be difficult to develop without the ability to identify the camera hardware, which in turn could be considered a fingerprinting risk.
These issues do not arise for applications utilizing an existing face detection model written for an ML platform, since those models will yield the same results, albeit with better or worse performance depending on the (GPU) hardware. The variance of results therefore represents a disincentive to use of the proposed APIs.
The discussion around adding backgroundBlur to the spec uncovered a pattern that I think we may see more often.
With backgroundBlur, we handled this by saying:
That way, an app can detect whether the function is on or off, and whether it can change it or not.
It might be worth pulling out this pattern as a documented pattern, so that other features that behave like this can refer to it instead of re-explaining it for every constraint.
(I note that the description in the spec for backgroundBlur doesn't go into details on this pattern either. Might be worth improving.)
Thoughts?
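The pattern can be expressed in code: a boolean capability reported as a list of allowed values tells the app both the current state and whether it can change it. The helper names below are illustrative; the capability/settings shapes match how backgroundBlur is surfaced:

```javascript
// Interpret a boolean capability reported as a sequence of allowed
// values, e.g. [true] (on, not changeable) vs [false, true] (changeable).
function analyzeBoolCapability(allowedValues, currentSetting) {
  return {
    active: currentSetting === true,
    canChange: Array.isArray(allowedValues) && allowedValues.length > 1,
  };
}

// Browser-only usage sketch:
// const caps = track.getCapabilities();   // e.g. { backgroundBlur: [true] }
// const settings = track.getSettings();   // e.g. { backgroundBlur: true }
// const { active, canChange } =
//   analyzeBoolCapability(caps.backgroundBlur, settings.backgroundBlur);
```

Documenting this once, as proposed above, would let future constraints of the same shape simply reference it.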
From a stackoverflow question, newer phones have multiple back cameras. How would an app distinguish between them, specifically: avoid or pick the often unsuitable wide-lens camera?
While getUserMedia does support device selection using constraints, I see no reliable difference in constrainable properties between a wide-lens camera and its regular counterpart.
Proposal: A new focalLength constraint.
This would be the distance between sensor and lens (I'm no photographer or device expert, but this seems to often be an inherent property of the lens, e.g. on the Samsung S10).
This would be different from the existing (and similar-sounding) focusDistance, which is the distance from lens to object.
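A sketch of how an app might use such a constraint, assuming it existed and also appeared in capabilities; focalLength is NOT a real constraint today, so everything here is hypothetical:

```javascript
// Pick the back camera with the longest focal length, on the assumption
// that the wide-angle lens has the shortest one. `cameras` entries are
// hypothetical: [{ deviceId, focalLength }] with focalLength in mm.
function pickNonWideCamera(cameras) {
  return cameras.reduce((best, c) =>
    c.focalLength > best.focalLength ? c : best);
}

// Browser-only usage sketch (hypothetical constraint):
// const chosen = pickNonWideCamera(candidates);
// const stream = await navigator.mediaDevices.getUserMedia({
//   video: { facingMode: "environment",
//            focalLength: { ideal: chosen.focalLength } },
// });
```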
Visit a new web site in Chrome, Safari, or Edge, & do something requiring camera + microphone:
They'll say the site wants to "use your camera and microphone", without saying which ones (USB webcams, headsets, modern phones https://github.com/w3c/mediacapture-main/issues/655). If it's wrong, you'll need to correct it after the fact.
Browsers may not even choose the same camera and microphone, a web compat issue (e.g. headset detection).
Firefox is different, showing which camera and microphone will be used, even letting you change it (within the constraints of the app):
But not everyone with multiple devices uses Firefox.
It would be better if users got to choose based on how many devices they have, not what browser they use, and maybe regardless of permission if an app is this indecisive on subsequent visits.
It also feels like this should be an app decision, not a browser trait.
Proposal A: w3c/mediacapture-main#644 (comment) would fix this, prompting on indecision or lack of permission. But it may not be web compatible at this point.
Proposal B: Add a new getUserMedia boolean that enables the w3c/mediacapture-main#644 (comment) behavior:
await navigator.mediaDevices.getUserMedia({video: true, chosen: true});
"chosen" means both tracks must be chosen by the user (or app), not the user agent.
In the interest of web compat, Firefox would remove its picker unless chosen is true, giving web users the same experience across browsers.
Proposal C: Same as B, but with new method:
await navigator.mediaDevices.chooseUserMedia({video: true});
Incidentally, this would be the same API used to replace in-content device selection https://github.com/w3c/mediacapture-main/issues/652.
Currently the API in PR 78 proposes to provide hw acceleration for Face Detection based on camera driver support. Tying support for accelerated Face Detection to support in a camera driver seems unlikely to provide wide coverage, since it is likely to only be supported on new camera models. Acceleration using more commonly available hardware (such as GPUs) will be likely to have wider coverage, which is why ML development tools such as tensorflow.js utilize this technique. It is also why GPU acceleration is mentioned in WebRTC-NV Uses Cases Section 3.6.
Face Detection APIs that do not achieve wide coverage will be frustrating for applications that do not wish to develop their own face detection models. These applications will either need to operate without face detection support if it is not available, or they will need to include their own face detection capabilities - which will lessen the need for the proposed APIs.
Those applications that include their own face detection models will also probably choose to forgo the proposed APIs, choosing instead to leverage GPU-based acceleration approaches supported by Web ML platforms such as Tensorflow.js.
References: https://lists.w3.org/Archives/Public/public-webrtc/2023Jan/0047.html
Hi all,
It seems that currently the mute event and the muted attribute of MediaStreamTrack don't reflect OS-level settings. For example, if a user mutes the mic in the OS settings, there is no way for web apps to get notified about it. How do you think we could make OS-level muting also trigger the "mute" event?
Thanks!