Renzo Schindeler

December 4, 2023

Deep-Dive: Automatic Video Enhancement at Scale

In the current digital age, we love to capture and share our daily lives and activities with everyone. Before you share a selfie or summer holiday video on your favourite social media app, you are greeted by a never-ending range of filters and buttons to turn your media into eye candy. Is the lighting in your selfie too bland? Well, let's pump up the contrast and add a nice warm filter! Is your video too long and shaky? Let's crop and stabilise it! These are all steps to improve both the quality and appeal of multimedia. In most social media apps, the user controls these steps manually. Getting the right look can take a lot of time, since you have to tweak all sorts of parameters, especially if you want to share more than one piece of media. Additionally, filters or lighting adjustments that work for one image or video do not always work well for another.

Using the Mingle app, you can share footage of football matches as a spectator. However, footage recorded on your phone does not come close to the quality of the professional cameras used for matches on TV:

  • The action is happening far away from the person recording, so it is hard to see in the video what is going on.

  • It is difficult to predict when an exciting moment will occur during the match, so you may end up recording for a long time while only a small fragment of the video contains interesting content.

  • The video can be subject to poor lighting conditions, shakiness or other factors reducing the image quality.

You have the tools to improve most of these issues, but since a lot can happen during a match, you simply do not have the time to both record and edit footage. It will feel more like a hassle and could even distract you from the most important part: the match itself!

At Mingle, we are continuously working on solutions that make it fun for spectators to share match content while reducing the hassle of editing all your captured footage. In the Video & Machine Learning team, we automate these processes with the aid of cutting-edge artificial intelligence (AI) technologies. Keep on reading to get a glimpse under our hood!

Video enhancement

It all begins with the user uploading an unprocessed video to a match feed. This video ends up in our Video Enhancement (VE) pipeline in the cloud, which is a chain of steps that process and improve your video. Below, I will explain some of the steps we perform in this pipeline.

“An overview of our Video Enhancement pipeline”

Object detection & tracking

The first step in this process is extracting useful information from the video. The goal is to provide users with insightful analytics about how they performed during a match. For instance, during a penalty we can extract the ball speed and direction from the video. In the future, this could extend to other useful and more personal stats, such as player positioning, speed and pass accuracy.

To extract this information, we use an object detection model that detects the locations of important object classes, such as goals, players, referees and balls. Based on the locations and trajectories of these objects over time, we can determine what is happening in a video and extract all sorts of useful statistics.
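
To make this concrete, here is a minimal Python sketch of how ball speed and direction could be derived from a tracked ball. It is an illustration only and not necessarily how the pipeline does it: the `Detection` structure, the `ball_speed_and_direction` function and the fixed `metres_per_pixel` scale are assumptions made for this example, and it assumes a static camera.

```python
from dataclasses import dataclass
from math import atan2, degrees, hypot


@dataclass
class Detection:
    """One detected ball position in a single frame (hypothetical structure)."""
    frame_index: int
    x: float  # centre of the bounding box, in pixels
    y: float


def ball_speed_and_direction(track, fps, metres_per_pixel):
    """Estimate average speed (m/s) and direction (degrees) over a ball track.

    Assumes a static camera and a constant metres-per-pixel scale, which is
    a simplification: real footage has perspective and camera motion.
    """
    if len(track) < 2:
        raise ValueError("need at least two detections")
    first, last = track[0], track[-1]
    elapsed = (last.frame_index - first.frame_index) / fps
    if elapsed <= 0:
        raise ValueError("detections must span a positive time interval")
    dx = (last.x - first.x) * metres_per_pixel
    dy = (last.y - first.y) * metres_per_pixel
    speed = hypot(dx, dy) / elapsed
    # Image y grows downwards, so flip dy to get a conventional angle
    # (0 degrees = towards the right of the frame, 90 degrees = upwards).
    direction = degrees(atan2(-dy, dx))
    return speed, direction
```

Under these assumptions, a ball that moves 300 pixels to the right in 15 frames at 30 fps, with 0.05 metres per pixel, covers 15 metres in half a second, which works out to 30 m/s.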


Over the last year, with the help of an army of data labelers, we have continuously expanded our ball sports data set, dubbed the Mingle Athletics Data Set (MADS). At the time of writing, MADS contains over 273,000 labeled ball-sports pictures. Combined, these pictures contain a total of 2.24 million labeled objects! Using this labeled data, we have trained various object detection models over time. Our biggest challenges involve relatively small objects, such as footballs. However, as MADS grows, it includes more examples of different balls. This added variance helps the model learn a better representation of balls and detect them more accurately.

Automatic zoom

Now that we have extracted useful information from a video, we can use it to improve the video! The second step is to remove irrelevant content from the video. If the action is happening far away from the person recording, a large portion of the frame contains empty or irrelevant space (e.g., the sky or the grass of the football field), while only a tiny portion contains the important content: the match ball, the goals and the players. This is where our object detection data comes into play.

Using a bunch of clever rules, we can determine in most cases where the active match ball & players are in the video. Using this information, we create our own virtual camera dolly to zoom in on the video whenever necessary, by cropping each video frame to only contain these important object classes.

“A video before and after automatically zooming in on the action.”

“Our object detector and automatic zoom in action.”
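
The cropping idea can be sketched in a few lines of Python. This is a simplified illustration under assumed names (`crop_window`, `smooth_crop`, the `margin` and `alpha` parameters), not the actual set of rules used in the pipeline: it takes the bounding boxes of the relevant objects, builds a crop around them that keeps the frame's aspect ratio, and then eases the crop from frame to frame so the virtual camera does not jump.

```python
def crop_window(boxes, frame_w, frame_h, margin=0.2):
    """Return an (x1, y1, x2, y2) crop that contains every box.

    boxes: list of (x1, y1, x2, y2) tuples in pixels.
    The crop keeps the aspect ratio of the frame and stays inside it.
    """
    if not boxes:
        # Nothing to focus on: fall back to the full frame.
        return (0.0, 0.0, float(frame_w), float(frame_h))

    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[2] for b in boxes)
    y2 = max(b[3] for b in boxes)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2

    # Add some breathing room around the action.
    w = (x2 - x1) * (1 + margin)
    h = (y2 - y1) * (1 + margin)

    # Grow the shorter side so the crop matches the frame's aspect ratio.
    aspect = frame_w / frame_h
    if w < h * aspect:
        w = h * aspect
    else:
        h = w / aspect

    # Never crop larger than the frame, and keep the window inside it.
    w, h = min(w, frame_w), min(h, frame_h)
    left = min(max(cx - w / 2, 0.0), frame_w - w)
    top = min(max(cy - h / 2, 0.0), frame_h - h)
    return (left, top, left + w, top + h)


def smooth_crop(previous, target, alpha=0.1):
    """Move the previous crop a fraction `alpha` towards the target crop."""
    return tuple(p + alpha * (t - p) for p, t in zip(previous, target))
```

Calling `smooth_crop` once per frame with a small `alpha` gives a slow, dolly-like motion; a larger `alpha` follows the action more tightly at the cost of a jumpier picture.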

Thumbnail generation

For each video we create a unique thumbnail. This is the picture you see in the video player before playing the video, and it often summarizes or represents the contents of the video. If a single user records many videos from a single angle, it can become hard to tell the videos apart based on the thumbnails, especially when the content is shot from far away. Therefore, we use the same object detection data from earlier to pick shots from the video that contain interesting elements. Based on simple rules (e.g., the shot should contain at least one ball and/or player) and image quality measurements (e.g., how blurry does the shot look?), we select candidate video frames and pick a thumbnail at random. This randomness adds more variation to the thumbnail generation process and makes it easier to tell videos apart in the match feed. Just like during the zooming step, we zoom in on the interesting content in the thumbnail to make it immediately clear to the user what the video is about.
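
A rough Python sketch of this selection step could look as follows. The frame dictionaries, the `pick_thumbnail` function and the `min_sharpness` threshold are hypothetical, and the blur measure used here (the variance of a Laplacian filter, a common sharpness heuristic) is a stand-in chosen for the example rather than the measurement the pipeline necessarily uses.

```python
import random


def laplacian_variance(gray):
    """Sharpness score for a grayscale image given as a list of rows.

    Applies a 4-neighbour Laplacian to every interior pixel and returns
    the variance of the result. Blurry images give low scores.
    """
    height, width = len(gray), len(gray[0])
    values = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            values.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pick_thumbnail(frames, min_sharpness, rng=None):
    """Pick a random frame that passes the content and sharpness rules.

    frames: list of dicts with the (assumed) keys "index", "gray" and
    "labels", where "labels" holds the detected object classes.
    Returns None when no frame qualifies.
    """
    rng = rng or random.Random()
    wanted = {"ball", "player"}
    candidates = [
        frame for frame in frames
        if wanted & set(frame["labels"])
        and laplacian_variance(frame["gray"]) >= min_sharpness
    ]
    return rng.choice(candidates) if candidates else None
```

Passing a seeded `random.Random` as `rng` makes the choice reproducible, which is handy for testing; leaving it out gives a different thumbnail on each run when several frames qualify.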


Now that we have a more engaging video that focuses on the action, it is further processed into different resolutions and bitrates suitable for playback, a process called transcoding. This optimizes the playback experience in the app: if your internet connection slows down or switches networks, the video player switches to a lower quality setting, so you can keep watching with minimal buffering.
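
The player-side half of this can be illustrated with a small Python sketch. The rendition ladder below, the `choose_rendition` function and the `safety` factor are made-up values for illustration, not the resolutions and bitrates actually produced by the pipeline.

```python
# Hypothetical rendition ladder: (name, video bitrate in kbit/s).
RENDITIONS = [
    ("1080p", 5000),
    ("720p", 2800),
    ("480p", 1400),
    ("240p", 400),
]


def choose_rendition(bandwidth_kbps, ladder=RENDITIONS, safety=0.8):
    """Pick the highest-bitrate rendition that fits the measured bandwidth.

    Only a fraction (`safety`) of the bandwidth is budgeted, leaving
    headroom for fluctuations. Falls back to the lowest rendition when
    nothing fits.
    """
    budget = bandwidth_kbps * safety
    for name, bitrate in sorted(ladder, key=lambda r: r[1], reverse=True):
        if bitrate <= budget:
            return name
    return min(ladder, key=lambda r: r[1])[0]
```

With this ladder, a measured bandwidth of 3,000 kbit/s leaves a budget of 2,400 kbit/s, so the sketch selects the 480p rendition rather than 720p.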

Here, your video’s journey through our VE pipeline comes to an end, and the video can now be watched back in the app!

Future developments

At the time of writing, we are working on other tools to improve the quality and appeal of user content. For instance, we are working on automatic color correction & grading, so we can also improve the lighting and stylistic appeal of your videos. Additionally, we are working on other ways to improve your videos, such as video stabilization and event detection. Keep an eye on the Mingle app for these exciting new features in the future!