In a previous post, we discussed the Microsoft Power Platform and low-code development. But how powerful are these low-code solutions? When we combine the Power Platform with Azure Cognitive Services, the citizen developer gains seemingly unprecedented AI and ML capabilities.
Here, we begin a 3-part series where we’ll explore how to use these platforms together to create an image classifier that can determine whether an image contains a man, a woman, or a monkey wearing a business suit, a Kung Fu uniform, or casual wear (e.g., jeans & sneakers). For part 1, we’ll discuss some background information, how to perform the initial required setup operations in Azure, and how we’ve built the model used for our classifier.
In part 2, we’ll show you step-by-step how to integrate your Custom Vision model into a mobile app you build from scratch using the Microsoft Power Apps low-code platform. In part 3, we’ll discuss in detail how to refine your model and improve prediction accuracy.
First, let’s review the technology we’re going to use to accomplish this.
Azure Cognitive Services is a suite of cloud-based artificial intelligence (AI) and machine learning services provided by Microsoft. These services are designed to enable developers to easily integrate various AI capabilities into their applications without the need for extensive expertise in AI or machine learning. Azure Cognitive Services cover a wide range of AI functionalities, including vision, speech, language, knowledge, and search.
Within the Microsoft Azure ecosystem, the Azure AI Vision service offers ready-made models for typical computer vision activities. These tasks include providing descriptive captions and tags for images, identifying and classifying commonplace objects, recognizing landmarks, celebrities, and brands, and detecting adult content. Additionally, Azure AI Vision can be employed to evaluate image characteristics such as color and format, as well as to create intelligently cropped thumbnail images.
Midjourney represents an innovative generative artificial intelligence program and service, developed by the independent research lab Midjourney, Inc., headquartered in San Francisco, California. This cutting-edge technology harnesses the power of natural language descriptions, called "prompts," to translate text into images. This transformative process unfolds seamlessly through user interaction with the AI via a dedicated bot integrated into the popular chat platform Discord. By issuing commands of varying descriptive complexity, users are able to use language to create intricate visual landscapes. The bot returns four unique artistic interpretations based on the supplied text. The user can either upscale one of those images for export (U1-U4) or have the bot generate another set of interpretations based on one of the previously generated images (V1-V4).
Since its unveiling in open beta on July 12, 2022, individuals from many different fields have been tapping into the capabilities of Midjourney to manifest their creative visions into visual expressions, exploring new possibilities and functionality within the realm of digital creativity.
Before we go any further, it is important to note the distinction between object recognition and image classification.
Object recognition encompasses a broad category of computer vision tasks aimed at identifying and locating specific objects within digital images. It focuses on recognizing the presence of particular objects and providing information about their positions or bounding boxes (rectangular borders that surround the object). Common use cases include object tracking, counting objects, autonomous navigation, and augmented reality. Image classification, on the other hand, is used when you want to categorize entire images into predefined classes or labels. Image classification aims to determine what the image represents as a whole, without necessarily specifying the positions of individual objects. Object recognition can be more complex than image classification because it requires identifying multiple objects within an image, often with varying positions, sizes, and orientations.
Image classification is generally considered a relatively simpler task since it focuses on determining the most dominant or representative class for the entire image. Image classification is employed in applications such as image search, recommendation systems, content tagging, and classifying images for medical diagnosis. Which AI Vision task you choose depends on the specific requirements of the application and the nature of the visual data being analyzed.
For this blog post, we are interested in image classification using Azure Custom Vision.
Azure Custom Vision is a cloud-based machine learning service provided by Microsoft Azure. It is designed to enable users to easily build, train, and deploy custom image classification models without requiring deep expertise in machine learning or computer vision. Custom Vision models can be accessed through APIs, SDKs, or a dedicated website.
However, you don’t have to be a data scientist or code-first developer to use these vision models in simple everyday applications. You have the capability to access your image classifier model on your mobile or tablet device via a quickly designed application built in Power Apps, a Microsoft Power Platform low-code application. Integrating Azure Custom Vision into Power Apps typically involves using Power Apps' capabilities to call Azure Custom Vision's REST API. This allows you to leverage your custom image classification or object detection models created with Azure Custom Vision within your own set of Power Apps built specifically for your business.
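Under the hood, the Power Apps connector is essentially making an HTTP POST to the Custom Vision prediction endpoint. As a minimal sketch of that call (the endpoint, key, project ID, and published iteration name below are placeholders you would replace with your own values):

```python
import requests

# Placeholder values -- substitute the details from your own prediction resource.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
PREDICTION_KEY = "<your-prediction-key>"
PROJECT_ID = "<your-project-guid>"
PUBLISHED_NAME = "<your-published-iteration-name>"

url = (
    f"{ENDPOINT}/customvision/v3.0/Prediction/{PROJECT_ID}"
    f"/classify/iterations/{PUBLISHED_NAME}/image"
)
headers = {
    "Prediction-Key": PREDICTION_KEY,
    "Content-Type": "application/octet-stream",
}

# Send the raw image bytes and print each tag with its predicted probability.
with open("test_image.jpg", "rb") as image_file:
    response = requests.post(url, headers=headers, data=image_file.read())
response.raise_for_status()

for prediction in response.json()["predictions"]:
    print(f"{prediction['tagName']}: {prediction['probability']:.2%}")
```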
An image classifier is a piece of software capable of identifying entities within an image. An Azure Custom Vision image classifier is a machine learning model created and trained using Azure Custom Vision, a cloud-based service offered by Microsoft Azure.
This custom image classifier is designed to categorize or classify images into specific predefined categories or classes. You can train custom machine learning models to recognize specific objects, patterns, or attributes unique to your domain. Azure AI Vision can classify images into predefined categories such as “people” or “animals”. Azure Custom Vision image classifiers are applicable to a wide range of real-world applications and are useful for tasks like content moderation, where you want to filter out inappropriate or sensitive content from user-generated images.
Using Azure AI Vision for image classification involves several steps, including setting up an Azure account, creating a custom vision model, training the model, and integrating it into your application. Here's a step-by-step guide on how to use Azure AI Vision for image classification:
1. If you don't already have an Azure account, you'll need to create one. Azure offers a free trial with some credits to get started; you can sign up for the free trial or a paid subscription at https://azure.microsoft.com/free/.
2. An Azure Custom Vision resource is part of Microsoft's Azure cloud computing platform and is specifically designed for building custom machine learning models for image classification and object detection tasks. You will need to create this resource for your model to use during training and after deployment. Log in to your Azure portal: https://portal.azure.com/. Click on "Create a resource". Search for "Custom Vision".
3. Select "Custom Vision" from the list of services, then click "Create" to start creating your Custom Vision resource.
4. You’ll then receive a form to configure your vision resource. Fill in the necessary details, including the resource group, name, region, and pricing tier for the training and prediction resources. Settings used for the application created for this blog post appear below.
5. Choose the pricing tier that suits your needs. You can start with the free tier for small-scale projects. After you have filled in the required fields, you can create the resource by clicking the blue “Review + create” button at the bottom of the page. Once your resource has been created and deployed, navigate to it in the Azure portal.
6. Now that your custom vision resource has been created, view your resource’s Keys and Endpoint page. You will need this information later on when you connect to it via Power Apps.
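If you want to confirm that the key and endpoint work before moving on, a minimal sketch using the azure-cognitiveservices-vision-customvision Python package (with placeholder values) might look like this:

```python
from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from msrest.authentication import ApiKeyCredentials

# Placeholder values copied from the resource's "Keys and Endpoint" page.
TRAINING_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
TRAINING_KEY = "<your-training-key>"

credentials = ApiKeyCredentials(in_headers={"Training-key": TRAINING_KEY})
trainer = CustomVisionTrainingClient(TRAINING_ENDPOINT, credentials)

# A simple sanity check: list the domains available for new projects.
for domain in trainer.get_domains():
    print(domain.type, domain.name)
```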
Now that you have created a resource, you can create the Custom Vision project:
1. In a separate tab, open customvision.ai and sign in with the Microsoft account associated with your Azure subscription.
2. Create a new Custom Vision project within the Azure Custom Vision portal.
3. Specify the name, resource group, and other project details. For our purposes, we used the General (A2) domain. This will create an empty model.
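The same training client can create the project programmatically. In the sketch below, the project name is hypothetical, and we match loosely on "A2" because the exact domain name returned by the service may be formatted slightly differently from how it appears in the portal:

```python
# Assumes the `trainer` client from the previous snippet.
classification_domains = [
    d for d in trainer.get_domains()
    if d.type == "Classification" and "A2" in d.name
]
# Fall back to the service's default domain if no A2 domain is found.
domain_id = classification_domains[0].id if classification_domains else None

project = trainer.create_project(
    "monkey-business-classifier",  # hypothetical project name
    domain_id=domain_id,
)
print("Created project:", project.id)
```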
Now that your empty model has been created, you need to gather a dataset of images with the categories, classes, or labels of interest.
For example, if you want to classify images of animals, you might have classes like cats, dogs, and birds. Another common class used when learning to use image classifiers is fruit (e.g., apples, oranges, bananas).
These items have simple and distinctive characteristics that make them relatively easy for a model to learn to identify, and therefore a natural choice when creating your first Custom Vision model. However, we wanted to make this image classification experiment a little more fun and interesting.
Therefore, we decided to train the model to identify whether the central character in the image was a man, woman, or monkey wearing a business suit, a Kung Fu uniform, or Casual Wear (e.g., jeans and sneakers). There are websites with many datasets available for machine learning projects.
However, given the low likelihood of finding many royalty-free photos of actual monkeys in business suits or casual wear, we chose to use Midjourney via a Discord bot to generate over 2,000 photorealistic images containing our different classes of interest. This allowed us to quickly gather a wide variety of training images over whose content and style we had some degree of control. For more information on Midjourney licensing, image ownership, and usage restrictions, please see their terms of service document.
There are four key factors that must be taken into account when training a classification model to improve prediction accuracy: overfitting, data balance, data quality, and data variety.
The presence of contextual information can help or hinder classification, depending on the classifier's ability to focus on relevant objects. If certain contextual items appear frequently and consistently across different images, the classifier may focus on arbitrary characteristics more than the item you are trying to classify.
For example, since monkeys and people in business suits don’t usually inhabit the same biomes, we wanted to ensure that the classifier wasn’t using arbitrary environmental characteristics that the images had in common. Images containing a tree or other vegetation could be more strongly associated with monkeys whereas desks and books could be more strongly associated with business people.
If, during training, the model used these arbitrary characteristics, it might wrongly predict that a character sitting on a chair in an office must be wearing a business suit, or that a character with vegetation in the background, or one jumping around, must be a monkey. Here are some examples:
Imbalanced class distributions, where some classes have significantly more examples than others, can affect the classifier's ability to generalize to minority classes. In other words, we do not want to have an image sample size of 200 men but only 50 women and 75 monkeys because the model would become much better at identifying men than identifying women and monkeys. It is recommended that any one class shouldn’t have more than a 2:1 ratio to another class (more details available here).
Microsoft recommends using at least 50 images per tag when starting to train your model. Having fewer images increases the likelihood of overfitting, where the classifier picks up on an arbitrary yet common element across images (e.g., trees in the background as opposed to the monkey in the foreground). Although your performance metrics may initially appear promising, your model might encounter challenges when faced with real-world data. We started with approximately 200 images for each two-way class interaction of (Man / Woman / Monkey) by (Business suit / Kung Fu uniform / Casual Wear).
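As a rough illustration (the per-class counts below are hypothetical), a few lines of Python can check both the 50-image minimum and the 2:1 balance guideline before training:

```python
# Hypothetical per-class image counts after quality checks.
counts = {
    "Man in Business suit": 180,
    "Woman in Business suit": 189,
    "Monkey in Business suit": 165,
    "Man in Kung Fu uniform": 172,
    # ... remaining classes ...
}

MIN_IMAGES_PER_TAG = 50   # Microsoft's recommended minimum per tag
MAX_RATIO = 2.0           # keep any class within 2:1 of any other

too_small = [tag for tag, n in counts.items() if n < MIN_IMAGES_PER_TAG]
ratio = max(counts.values()) / min(counts.values())

print("Tags below the recommended minimum:", too_small or "none")
print(f"Largest-to-smallest class ratio: {ratio:.2f}:1 "
      f"({'OK' if ratio <= MAX_RATIO else 'consider rebalancing'})")
```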
After a second round of quality checks, these are the remaining counts:
You will need to include a variety of images in the training set to ensure that your model can generalize well. The accuracy of an image classifier is influenced by various characteristics of the images being classified. Some of the key characteristics that can affect the performance of an image classifier include:
It's important to consider the above Data Variety characteristics when designing and training an image classifier, as well as when evaluating its performance on different datasets and real-world scenarios across a wide range of image variations and conditions. In an attempt to prevent overfitting and to ensure a balanced and diverse training set, we included images of Men/Women/Monkeys wearing a Business suit, a Kung Fu uniform, or Casual Wear that also varied along these six distinct dimensions:
The instructions provided to the Midjourney bot were generally constructed in the following format:
A {body size} {class: man / woman / monkey} with {hair color} hair who is wearing a {clothing color} {class: suit / uniform / casual}. The {class: man / woman / monkey} is {action in scene} [in / on / by / next to] a {secondary object in the scene} while [in / at] [a / an / the] {location}.
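To produce prompts at scale, the template can be filled programmatically. The sketch below uses illustrative word lists rather than the exact values we used for our dataset:

```python
import itertools
import random

# Illustrative option lists -- not the exact values used for the real dataset.
characters = ["man", "woman", "monkey"]
clothing = ["business suit", "Kung Fu uniform", "casual outfit of jeans and sneakers"]
body_sizes = ["slim", "average-build", "heavyset"]
hair_colors = ["black", "brown", "blonde", "gray"]
clothing_colors = ["navy blue", "black", "white", "red"]
actions = ["standing next to", "sitting on", "jumping over", "crawling by"]
scene_objects = ["a water cooler", "a copy machine", "a tree", "a park bench"]
locations = ["an office", "a park", "a beach", "a gym"]

def build_prompt(character, outfit):
    """Fill the template with randomly chosen scene details."""
    return (
        f"Full length body view. A {random.choice(body_sizes)} {character} "
        f"with {random.choice(hair_colors)} hair who is wearing a "
        f"{random.choice(clothing_colors)} {outfit}. The {character} is "
        f"{random.choice(actions)} {random.choice(scene_objects)} "
        f"while in {random.choice(locations)}. "
        "The image should have a realistic photographic quality."
    )

# Generate a handful of prompts for every character/clothing combination.
for character, outfit in itertools.product(characters, clothing):
    for _ in range(3):
        print(build_prompt(character, outfit))
```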
Some specific examples of this format appear below:
The description "Full length body view" was prepended to each instruction sent to the AI bot, but occasionally it was ignored, resulting in some images being only half or three-quarters body shots. Fortunately, Midjourney has post-generation options such as zoom out 1.5x or 2.0x, and pan (left, right, up, down) to bring more of the central item into the image. However, in some instances of using those options, the bot chose not to extend the body in the image but rather strategically placed another object in front of it, such as a table, hot tub, or green bucket of ice.
The description "The image should have a realistic photographic quality." was also added to the end of each instruction. However, some images were generated that looked more like hand-drawn artwork or an animated cartoon character on a realistic background. For the sake of data variety, we included some of those images in the model provided they passed the quality checks discussed in the next section.
Even though we had some creative control over the images, not every image generated was usable. In fact, some images had us scratching our heads. To address potential issues of Image Quality, many images were discarded due to artifacts such as:
Some of the discarded images appear below:
Here are a few exemplars used in training the classifier of a Man / Woman / Monkey in a Business suit while standing next to a water cooler, standing next to a tree, crawling on the ground, or jumping in the air:
Here are a few exemplars used in training the classifier of a Man / Woman / Monkey in a Kung Fu uniform while standing next to a copy machine, standing next to a tree, crawling on the ground, or jumping in the air:
Here are a few exemplars used in training the classifier of a Man / Woman / Monkey in Casual Wear while standing next to a water cooler, standing next to a tree, crawling on the ground, or jumping in the air:
Before training the model, we reviewed how well we addressed the key factors mentioned earlier that are important for prediction accuracy.
To account for overfitting based on the selection of clothing, we varied the background settings (e.g., in an office, next to an elevator, outside in a park) and actions (e.g., jumping, sitting, standing) of the central character (i.e., man, woman, monkey). We recognize that this created some rather absurd images, such as a monkey wearing a Kung Fu uniform standing in an office next to a copy machine or a man wearing a business suit crawling on the ground outside next to a tree. We wanted to see how well the classifier would recognize the central item when out of context. In addition, since monkeys are energetic, we included men and women doing a variety of actions so that the action alone would not be indicative of what the central character was or what it was wearing.
As mentioned earlier, we were training the model with over 150 images for each of the resulting nine two-way class interactions of character and clothing which far exceeded the minimum number of 50 images per class. This kept the balance ratio between classes well under 2:1. We were also interested to see how well the classifier would do overall for the character and apparel class. Therefore, we created roll-up classes for “Man”, “Woman”, and “Monkey” regardless of apparel and roll-up classes for “Business suit”, “Kung Fu uniform”, and “Casual Wear” regardless of character.
From a data balance perspective, each roll-up class contained between 450-500 images, so the ratio between any two of them did not exceed 1.15:1. However, each of the roll-up classes now had a 3:1 ratio to any of its component subclasses, which made it unbalanced from that perspective. Essentially, this meant that the classifier could potentially be more accurate at predicting a “Woman” or “Business suit” roll-up class than at predicting the “Man in Casual Wear” interaction class. We were curious to see if this turned out to be the case and will discuss it in further detail in part 3.
Above we’ve discussed in detail our classes and how we can generate images via Midjourney to train our classifier with. Next, let’s discuss how to add these images to our model:
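If you prefer to script the upload rather than drag and drop images in the Custom Vision portal, a sketch like the following (reusing the training client and project from the earlier snippets, with a hypothetical folder-per-class layout) creates a tag for each class and uploads the images in batches:

```python
import os
from azure.cognitiveservices.vision.customvision.training.models import (
    ImageFileCreateBatch,
    ImageFileCreateEntry,
)

# Hypothetical folder layout: one folder of images per class.
tag_folders = {
    "Monkey in Kung Fu uniform": "images/monkey_kung_fu",
    "Man in Business suit": "images/man_business_suit",
    # ... one entry per class ...
}

for tag_name, folder in tag_folders.items():
    tag = trainer.create_tag(project.id, tag_name)
    entries = [
        ImageFileCreateEntry(
            name=file_name,
            contents=open(os.path.join(folder, file_name), "rb").read(),
            tag_ids=[tag.id],
        )
        for file_name in os.listdir(folder)
    ]
    # The service accepts at most 64 images per batch upload.
    for i in range(0, len(entries), 64):
        batch = ImageFileCreateBatch(images=entries[i : i + 64])
        result = trainer.create_images_from_files(project.id, batch)
        if not result.is_batch_successful:
            print("Some images failed to upload for tag:", tag_name)
```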
Now you can begin training the model.
Once training is complete, you can evaluate the model's performance right away using the provided metrics and by testing images via URL or from your PC. If the model’s performance is not satisfactory, you can retrain it with more or higher quality images or increase your model training time.
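For those who prefer to work outside the portal, training and evaluating an iteration can also be scripted with the same SDK (a sketch, reusing the trainer and project objects from the earlier snippets):

```python
import time

# Kick off training; this can take several minutes.
iteration = trainer.train_project(project.id)
while iteration.status != "Completed":
    time.sleep(15)
    iteration = trainer.get_iteration(project.id, iteration.id)
    print("Training status:", iteration.status)

# Retrieve the overall precision and recall at a 50% probability threshold.
performance = trainer.get_iteration_performance(project.id, iteration.id, threshold=0.5)
print("Precision:", performance.precision)
print("Recall:", performance.recall)
```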
Using the “Quick Test” link, you can individually upload an image that wasn’t part of the training set and see how well the model predicts the classification. As you can see in the examples below, our model identifies the roll-up classes “Man”, “Woman”, “Monkey”, “Business suit”, or “Kung Fu uniform” with a higher probability than the interaction of the two classes, such as “Man wearing a Business suit” or “Monkey in Kung Fu uniform”. This may be due to the disparity in image counts, since there are three times more “Business suit” images than “Man in Business suit” images.
This was not the case, however, for “Man in Casual Wear”. While the classifier correctly predicted the roll-up class “Casual Wear”, it assigned a low probability for the “Man” roll-up class as seen in the examples below:
This unexpected result could be due to the following:
Even though jeans were specified in the instructions to the bot, some men and women were generated wearing shorts in the beach environment. We made an effort to keep the characters as fully clothed as possible. In fact, only 9 out of 189 women in Business suits had uncovered legs. This was an intentional quality check to prevent the classifier from using the presence of naked calves as an indicator of “Woman”.
This leaves us with under-representation or under-training as possible causes. We intend to generate more test and training images and then examine this in more detail in part 3.
Azure provides REST API endpoints that allow you to integrate your trained model into your applications. You can use these endpoints to send image data and get classification results in real-time. Azure also offers SDKs and client libraries in various programming languages to simplify integration. When you are satisfied with the model's performance, click on the performance link and then click on "Publish" to make it available for consumption. When you publish, you have to assign it to the prediction resource you set up at the beginning of the article. You can also rename your model to make it more memorable for when you use it in apps.
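As a rough sketch of that last step, publishing a trained iteration and then calling it with the Python SDK might look like the following; the published model name and prediction resource ID shown here are placeholders:

```python
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient
from msrest.authentication import ApiKeyCredentials

# Placeholder values -- the prediction resource ID is the full Azure resource ID
# of the Custom Vision Prediction resource created earlier.
PREDICTION_RESOURCE_ID = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.CognitiveServices/accounts/<prediction-resource>"
PREDICTION_ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
PREDICTION_KEY = "<your-prediction-key>"
PUBLISHED_NAME = "monkeyBusinessClassifier"  # hypothetical published model name

# Publish the trained iteration so it can be called from apps.
trainer.publish_iteration(project.id, iteration.id, PUBLISHED_NAME, PREDICTION_RESOURCE_ID)

# Call the published model with a local test image.
credentials = ApiKeyCredentials(in_headers={"Prediction-key": PREDICTION_KEY})
predictor = CustomVisionPredictionClient(PREDICTION_ENDPOINT, credentials)

with open("quick_test.jpg", "rb") as image_file:
    results = predictor.classify_image(project.id, PUBLISHED_NAME, image_file.read())

for prediction in results.predictions:
    print(f"{prediction.tag_name}: {prediction.probability:.2%}")
```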
In the next article of this series, we will show you how to integrate the Custom Vision model with a mobile device using a Power Apps Low-Code approach. Once you go through the tutorial, your mobile classifier app will be able to do the following:
https://learn.microsoft.com/en-us/azure/ai-services/custom-vision-service/select-domain
https://www.youtube.com/watch?v=OzMRNVolrKE
https://www.youtube.com/watch?v=P5yKrEfKtEI
https://www.youtube.com/watch?v=92U0uNWepDw&list=PLPoQn6QlsOwMu-XDeh3SZemuTVYt0zVzT
Understanding low-code development applications and their uses, along with the variety of complex AI use cases, can be a struggle.
Turning to technologies that you do not fully grasp is a challenge that can be too hard to overcome alone. The best advice on how to do so effectively is, ironically, to get some good advice. As experienced software and data experts, The Virtual Forge is here to help you understand your business problems, with upfront engagement and guidance for you as the client: what are your problems, and how can we solve them?
Have a project in mind? No need to be shy, drop us a note and tell us how we can help realise your vision.