Transparency note for Azure Video Indexer real-time analysis

An AI system includes not only the technology, but also the people who will use it, the people who will be affected by it, and the environment in which it is deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, what its capabilities and limitations are, and how to achieve the best performance.

What is a transparency note?

Microsoft’s Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who will use or be affected by your system.

Microsoft’s Transparency Notes are part of a broader effort at Microsoft to put our AI principles into practice. To find out more, see the Microsoft AI principles.

The basics of real-time analysis

Azure AI Video Indexer enabled by Arc, as part of an adaptive cloud approach, introduces real-time analysis in private preview. It enables you to extract real-time insights from your live video footage, allowing immediate detection and action. VI real-time analysis offers out-of-the-box insights for your live stream, and the ability to create custom object detection insights using open vocabulary technology. You can view live insights directly on top of your video stream, with bounding boxes highlighting detected objects. You can also save streams and insights as files, as well as upload and index external media files. With Azure AI Video Indexer, you can generate more concise summaries for segments of your recorded video footage, helping you quickly catch up on key events without watching the entire video.

  • Bounding box: A rectangular border used to highlight detected objects in video streams.
  • Insight: The information and knowledge derived from processing and analyzing video and audio files. Insights can include detected objects, people, faces, key frames, and translations or transcriptions.
  • Detection counter: A feature that counts the number of detected instances of an object.
  • Custom insight: A user-defined model for object detection using open vocabulary technology.
  • Preset: A set of AI insights applied to a camera.
  • Confidence score: A measure of the certainty of a detection.
  • Live stream: Real-time video streaming.
  • Media files: Video files uploaded to Azure AI Video Indexer.
  • Insights timeline: A timeline of detected objects in video streams.
  • Counter: A feature that displays the count of detected objects.
  • Retention policy: A policy that defines how long recorded video files are kept.
  • Pin camera: A feature that indicates whether a camera appears on the gallery page.
  • Camera description: A description of the camera view used to improve video summarization.
  • Focus on prompt: A prompt used to specify the type of events to highlight in a video summary.
  • Event summary: A feature that generates summaries of recorded video footage.
  • High-level overview: A general description of activities within a video.
  • Highlights: Specific events identified in a video summary.
  • ONVIF cameras: Cameras that comply with the ONVIF standard.
  • RTCP sender reports: Reports sent by cameras to provide timing information.
  • Continuous video streaming: Video streaming that isn't triggered by motion.
  • Open vocabulary technology: Technology that allows users to define custom models using natural language.

Capabilities

System behavior

People and vehicle detection

Azure AI Video Indexer can detect the appearance of people and vehicles in live video streams. It displays a bounding box around each detection and shows a real-time count of people and vehicles in the frame. Azure AI Video Indexer can also track objects within the camera view and maintain a unique ID for each object. The ID is tracked through visual embeddings and appearance rather than any personal biometric information, so when an object leaves the frame and re-enters, it receives a new ID.

Custom insights

With Azure AI Video Indexer, you can create custom object detection to meet your requirements without coding skills or extensive training on large datasets. Using open vocabulary (OV) technology, you can define custom insights for object detection and then apply them to different live cameras.

Event summary

Azure AI Video Indexer offers more concise summaries for up to six-hour segments of recorded video footage from live cameras. The summary consists of two parts. The first part is a high-level overview that provides a general description of the activities within the video. The second part is a collection of highlights that specifically identify anomalous events and events requested in the "Focus on" field, including their timestamps and ranges.

The summary can help you catch up on the most notable events in the video without having to watch the entire video. It's designed to save you time by digesting long videos and providing you with the gist of a video in a short format.

To identify relevant unusual events in a video, add a "Focus on" prompt describing the type of events you’re interested in, such as "violent behavior". Without specifying the unusual events, you might get low-quality results. Completing the "Camera description" field also produces a more accurate summary, because the model uses the camera’s location context to understand what’s unusual in that scene.

Use cases

Intended uses

Azure AI Video Indexer real-time analysis can be used in multiple scenarios in various industries, such as:

  • Retail - You can use real-time analysis to analyze video footage to help optimize store layouts and improve customer experience and safety. With real-time analysis you can monitor the number of customers in checkout lines in real time, helping retailers act immediately to optimize staffing and reduce wait times.
  • Manufacturing - You can use real-time analysis to help ensure quality control and worker safety through video analysis. For example, real-time analysis can help identify workers who aren’t wearing protective gear, which requires real-time detection of critical events and locating specific moments in video streams.
  • Modern Safety - Azure AI Video Indexer can help detect and identify security and safety issues in recorded video footage from live cameras by using event summary, which identifies events based on the provided "Focus on" prompt.

Considerations when choosing other use cases

  • Avoid using Video Indexer for decisions that might have serious adverse impacts. Because decisions based on incorrect output could cause serious harm, incorporate human oversight over decisions that have the potential for serious impacts on individuals.
  • Event summaries aren’t meant to replace full viewing of the footage, particularly for content where details and nuances are important.

Legal and regulatory considerations. Organizations need to evaluate potential specific legal and regulatory obligations when using any AI services and solutions, which might not be appropriate for use in every industry or scenario. Restrictions might vary based on regional or local regulatory requirements. Additionally, AI services or solutions are not designed for, and may not be used in, ways prohibited by applicable terms of service and relevant codes of conduct.

Limitations

People and vehicle detection limitations

People and vehicle detection refers to the real-time model of Video Indexer that can monitor the appearance of individuals and vehicles within a live video stream.

  • Minimum object size: Real-time analysis only detects objects larger than 35 x 35 pixels. Smaller objects fall below the detection threshold and might be missed.
  • Low-light and weather sensitivity: The detector might not detect objects in dark areas or in poor weather conditions. Extreme weather will likely degrade the quality of outputs. For example, real-time analysis might have difficulty identifying objects in heavy rain and fog. These environments reduce visibility and contrast, making it harder for the model to distinguish object features.
  • Occlusion impact: Occlusions might reduce the quality of results and cause fragmentation—where a single object is assigned multiple tracking IDs over time instead of being consistently tracked as one continuous entity—in object tracking. This happens because the model loses visual continuity of the object.
  • Extreme viewing angles: The detector can miss or misclassify objects when viewed from steep or extreme angles.
  • Track limit per stream: The object tracker supports up to 150 concurrent tracks per video stream. This limit helps maintain system performance and prevents resource overload in high-traffic scenes.
  • Static confidence score display: The confidence score represents how certain the model is that a detected object is correctly identified. It’s a value between 0% and 100%; the higher the score, the more confident the model is in its prediction. The UI only shows the confidence score from the object’s first detection—the moment it initially appears in the frame. However, this score might not reflect the model’s confidence later in the track, especially if visibility improves or worsens. As a result, you might assume the model is still confident in its detection when it’s not—or vice versa. This can lead to misinterpretation of detection quality, especially in long or complex scenes where conditions change over time. In contrast, the API provides updated confidence scores throughout the object’s track, offering a more accurate and dynamic view of how confident the model is at any given moment.
  • Crowd underestimation: In densely populated scenes, the model might fail to detect every individual. Overlapping bodies and limited spacing can confuse the detector, leading to undercounting or missed detections.
  • Re-identification tracking limitation: When an object exits and later re-enters the camera’s view, the tracker assigns it a new ID. This occurs because the ID is tracked through visual embeddings and appearance rather than with any personal biometric information, so the system treats reappearing objects as new unless it can maintain continuous visual tracking.
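Because the UI shows only the first detection's confidence score while the API reports updated scores over a track's lifetime, and re-entering objects receive new IDs, a client consuming the API may want to keep its own per-track confidence history. The following is a minimal sketch under stated assumptions: the field names `detections`, `trackId`, and `confidence` are hypothetical illustrations, not the actual Video Indexer real-time API schema.

```python
# Hypothetical sketch: field names ("detections", "trackId", "confidence")
# are illustrative only, not the real Video Indexer payload schema.
from collections import defaultdict

def latest_confidence_per_track(frames):
    """Keep the most recent and full confidence history for each tracking ID.

    The UI shows only the first detection's confidence; reading the API
    stream lets you see how confidence evolves over an object's lifetime.
    """
    latest = {}
    history = defaultdict(list)
    for frame in frames:
        for det in frame["detections"]:
            track_id = det["trackId"]  # re-entering objects get a NEW ID
            latest[track_id] = det["confidence"]
            history[track_id].append(det["confidence"])
    return latest, history

# Example: one object tracked over two frames (confidence drifting down),
# then leaving the frame and re-entering under a new ID.
frames = [
    {"detections": [{"trackId": "t1", "confidence": 0.92}]},
    {"detections": [{"trackId": "t1", "confidence": 0.78}]},
    {"detections": [{"trackId": "t2", "confidence": 0.85}]},  # re-entry: new ID
]
latest, history = latest_confidence_per_track(frames)
```

Keeping the full history (rather than only the first score, as the UI does) avoids the misinterpretation risk described above when visibility changes mid-track.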

Custom insights limitations

Custom insights refer to the ability to define and detect specific objects in video content using either a text prompt (for example, "vest") or an image example. These insights are powered by open vocabulary models, which allow users to describe what they want to detect without needing to retrain the model. You can create a custom object in Video Indexer and apply it over a live video stream to get detections in real time.

  • Minimum object size: Real-time analysis only detects objects larger than 35 x 35 pixels. Smaller objects fall below the detection threshold and might be missed entirely.
  • Low-light and weather sensitivity: The detector might not detect objects in dark areas or in poor weather conditions. Extreme weather will likely degrade the quality of outputs. For example, real-time analysis might have difficulty identifying objects in heavy rain and fog. These environments reduce visibility and contrast, making it harder for the model to distinguish object features.
  • Occlusion impact: Occlusions might reduce the quality of results and cause fragmentation in object tracking, meaning that a single object is assigned multiple tracking IDs over time instead of being consistently tracked as one continuous entity. This happens because the model loses visual continuity of the object.
  • Extreme viewing angles: The detector can miss or misclassify objects when viewed from steep or extreme angles.
  • Track limit per stream: The object tracker supports up to 150 concurrent tracks per video stream. This limit helps maintain system performance and prevents resource overload in high-traffic scenes.
  • Static confidence score display: The confidence score represents how certain the model is that a detected object is correctly identified. It’s a value between 0% and 100%; the higher the score, the more confident the model is in its prediction. The UI only shows the confidence score from the object’s first detection—the moment it initially appears in the frame. However, this score might not reflect the model’s confidence later in the track, especially if visibility improves or worsens. As a result, you might assume the model is still confident in its detection when it’s not—or vice versa. This can lead to misinterpretation of detection quality, especially in long or complex scenes where conditions change over time. In contrast, the API provides updated confidence scores throughout the object’s track, offering a more accurate and dynamic view of how confident the model is at any given moment.
  • Re-identification tracking limitation: When an object exits and later re-enters the camera’s view, the tracker assigns it a new ID. This occurs because the system treats reappearing objects as new unless it can maintain continuous visual tracking.
  • Color-agnostic insights: Creating custom insights from an image doesn’t identify objects by color. For example, an image of a yellow vest results in detections for vests of any color, not only yellow ones.
  • Training input limitation: You cannot combine an image and text prompt as training data for the same insight. The system currently supports only one input modality per insight definition.

Event summary limitations

Event summary refers to the ability to choose a video segment of recorded video footage from live cameras and generate a concise summary for that segment. The summary consists of two parts. The first part is a high-level overview that provides a general description of the activities within the video. The second part is a collection of highlights that specifically identify anomalous events and events requested in the "Focus on" field, including their timestamps and ranges.

  • AI-generated summaries: Summaries are created by an AI language model to provide a general overview. While designed for clarity and usefulness, they might not fully capture the nuance or intent of the original content.
  • Inconsistent results: Editing the timeframe or re-running the summary might produce different outputs. This is due to the model’s generative nature, which can yield variations even when the input is only slightly changed.
  • Style overlap: The summary style refers to the written tone and formatting used in the generated summaries. The style options (Neutral, Formal, or Casual) influence how the content is phrased. The Neutral summary style might sometimes resemble the Formal style, and the Casual style might include hashtags. These overlaps occur because the stylistic boundaries are not strictly enforced by the model.
  • Unexpected summary lengths: A "Medium" summary might occasionally be shorter than a "Short" one. This happens when the model determines that fewer words are needed to convey the core message in a given context.
  • Short input limitations: Summaries generated from very short videos or timeframes might be inaccurate. With limited content, the model has less context to work with, which can lead to vague or misleading summaries.
  • Detail loss in long videos: Longer videos might result in high-level summaries with fewer details. This is because the model condenses more content into a limited output space, prioritizing breadth over depth.
  • Personal attribute inaccuracies: The characteristics underlying human attributes are complex, and there are cultural and geographical differences that influence how we might perceive and experience another individual's personal characteristics. Summary responses related to the personal attributes of people in images—such as gender or age—might not necessarily accurately indicate the actual characteristics of the individual. These errors might stem from the model’s reliance on visual or contextual cues that might be ambiguous or misleading.
  • Meta-prompt leakage: Occasionally, the summary might include internal instructions (meta-prompts), such as directives to exclude harmful content. This occurs when the model fails to fully separate system-level guidance from user-facing output.
  • Missed short events: The summarization algorithm samples video at one frame per second (1 FPS), which means very brief events might be missed or misinterpreted. Events that occur between sampled frames might not be captured accurately.
  • Inappropriate content handling: If the original video contains inappropriate material, the summary might be incomplete or include disclaimers. In some cases, it might even quote inappropriate content, with or without a warning.
  • Timestamp inaccuracies: If a camera disconnects during recording, the summary might show incorrect timeframes. This happens because the video’s internal timestamps become misaligned due to the interruption.

Evaluations

People and vehicle detection

The system was evaluated through internal testing, including both automated metrics and human judgment across multiple video scenarios. Testing focused on assessing detection performance for people and vehicles in varied environments, using precision and recall as key metrics. The evaluation covered both standard conditions—where camera placement, resolution, and scene setup followed expected guidelines—and more challenging scenarios involving occlusions and suboptimal camera angles.

In standard conditions, the system demonstrated strong performance, with overall high recall and high precision. The scenarios tested included restricted airport zones, city streets, parking lots, indoor pedestrian monitoring, and more. In addition, a soft evaluation was conducted on unlabeled videos captured in extreme weather conditions. When visibility remained sufficient, detection performance was generally stable; however, occlusions, glare, and heavy weather (for example, snow) significantly reduced tracking and recognition accuracy.

Custom insights

The custom insights model was evaluated through internal testing, including manual and automated analysis across multiple datasets and object classes such as employee safety and retail datasets. Testing included measuring detection performance using precision and recall metrics across labeled video samples.

Event summary

The model was evaluated through internal testing, relying on human judgment due to the complexity of the task. The evaluation focused on the system’s ability to identify and localize activities based on camera descriptions and focus prompts, and to generate both high-level summaries and specific highlights. Given the absence of standardized benchmarks for this type of task, a team of independent evaluators from different backgrounds manually reviewed a dataset of ~100 diverse videos. Each evaluator assessed the accuracy and relevance of the generated summaries. The model achieved a high average score, reflecting strong performance across varied scenarios. Additional tradeoffs and limitations are documented above.

System performance

Best practices for people and vehicle detection

  • Ensure full visibility: To best detect and track objects, they should be fully visible (no occlusions), with good lighting.
  • Avoid steep viewing angles: Position cameras to capture objects from more neutral angles. Steep angles might lead to misclassification or missed detections.

Best practices for custom insights

  • Use specific vocabulary: Only use nouns like "dog" or "shopping cart." Avoid adjectives or descriptive words such as "big" or "empty" to ensure clarity and consistency in detection.
  • Group synonyms together: You can include up to 10 related terms for the same object in the "Training data" → "Text" section. For example, to detect computers, use "computer," "laptop," and "PC" to form one detection class.
  • Add insight Name to training data: The AI Insight Name is not automatically included in training data. If your insight is named "computer," make sure to also add "computer" to the training text manually.
  • Create insight per object: Avoid combining multiple object types in one insight. Instead, create separate insights for each object. For example, don’t create a single "Animal detection" insight and then attempt to include 10 different animals that you want to detect. Instead, create a different insight for each animal: Cow detection → "cow", Cat detection → "cat", Elephant detection → "elephant".
  • Avoid ambiguous words: Do not use words with multiple meanings like "bat" or "nail," as they can refer to two different objects.
  • Exclude logical operators: Avoid using logical terms like "and," "or," or "not." Enter each word individually to maintain clarity.
  • Use focused images: Provide an image that includes only the object you want to detect.
  • Ensure image quality: Use high-quality images as training data.
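The text-prompt rules above (up to 10 grouped synonyms, no logical operators, manually adding the insight name to the training terms, one object type per insight) lend themselves to a small client-side check. This is an illustrative sketch only: the dictionary shape and the `validate_insight` helper are assumptions for demonstration, not the actual Video Indexer configuration format.

```python
# Hypothetical sketch: the insight dictionary format and validation helper
# are illustrative assumptions, not the real Video Indexer config schema.
MAX_SYNONYMS = 10
LOGICAL_TERMS = {"and", "or", "not"}

def validate_insight(insight):
    """Apply the text-prompt best practices to a custom-insight definition."""
    terms = [t.lower() for t in insight["training_text"]]
    if len(terms) > MAX_SYNONYMS:
        raise ValueError("at most 10 related terms per insight")
    if LOGICAL_TERMS & set(terms):
        raise ValueError("logical terms like 'and'/'or'/'not' are not allowed")
    if insight["name"].lower() not in terms:
        # The insight name is NOT added to training data automatically;
        # include it yourself.
        terms.append(insight["name"].lower())
    return {"name": insight["name"], "training_text": terms}

# One insight per object type; synonyms grouped into a single detection class.
computer = validate_insight(
    {"name": "computer", "training_text": ["laptop", "PC"]}
)
```

Note how the helper appends "computer" to the training terms, mirroring the best practice that the insight name must be added manually.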

Best practices for event summary

  • Use general terms: When describing extreme events in the "focus on" prompt, use generalized language like "physical altercation" instead of specific terms like "fight". This helps the model better identify the events that interest you in the video.
  • Keep prompts concise: The prompt is limited to 300 characters, so make it clear and to the point. Focus on what makes an event unusual without adding unnecessary detail.
  • Provide a camera description: Complete the "Camera description" field with a concise description of the camera’s location, such as "Parking Lot", "Grocery Shop", or "Factory Production Floor -2"; the description is used during the summary process.
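The practices above (generalized language in the prompt, a 300-character limit, and a concise camera description) can be sketched as a small request builder. This is a hedged illustration: `build_summary_request` and its field names are hypothetical, not the actual Video Indexer event-summary API.

```python
# Hypothetical sketch: the function and field names ("cameraDescription",
# "focusOn") are illustrative only, not the real event-summary API.
MAX_PROMPT_CHARS = 300

def build_summary_request(camera_description, focus_on):
    """Assemble the fields the event-summary best practices recommend."""
    if len(focus_on) > MAX_PROMPT_CHARS:
        raise ValueError("'Focus on' prompt is limited to 300 characters")
    return {
        "cameraDescription": camera_description,  # e.g. "Parking Lot"
        # Prefer generalized language such as "physical altercation"
        # over a specific term like "fight".
        "focusOn": focus_on,
    }

request = build_summary_request("Parking Lot", "physical altercation")
```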

Next steps

Learn more about responsible AI:

Learn more about real-time analysis:

For relevant resources contact us at visupport@microsoft.com.
