Evaluating Semantic and Instance Segmentations

Evaluating your machine learning models is a key part of any data science project. A model may reach a high training accuracy, but that doesn’t mean its output is meaningful. In computer vision, semantic and instance segmentation are often used to label the pixels that belong to one or more objects in an image. These techniques are typically implemented using convolutional neural networks (CNNs). Researchers and engineers use several metrics to check that their networks are actually producing quality results. In this article, I will discuss some of the most widely used segmentation evaluation metrics for binary classification. I will assume that the reader is familiar with basic statistical terms such as true positive (TP), true negative (TN), false positive (FP), and false negative (FN). At the end, I will provide Python code to evaluate each of these metrics for your segmentations.


Pixel Accuracy

Pixel accuracy is one of the most basic metrics and is almost always used for evaluating segmentations. Essentially, we are reporting the percentage of pixels that were predicted correctly. The accuracy can be calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Here, we measure the percentage of pixels that the prediction classified correctly, whether as true positives or true negatives. Accuracy is generally not the best metric to rely on, since it can be misleading when the classes are imbalanced. If your images consist mostly of negative (background) pixels, a model that predicts negative almost everywhere will still reach a very high accuracy. It is possible in this scenario that your positive class is not predicted correctly at all.
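To make the class imbalance problem concrete, here is a minimal sketch using NumPy. The function name `pixel_accuracy` and the toy 10x10 masks are my own illustrative assumptions, not part of any particular library.

```python
import numpy as np

def pixel_accuracy(y_true, y_pred):
    """Fraction of pixels whose predicted label matches the ground truth."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    return float(np.mean(y_true == y_pred))

# Hypothetical 10x10 ground truth: only 5 of the 100 pixels are positive.
truth = np.zeros((10, 10), dtype=bool)
truth[0, :5] = True

# A "model" that predicts background for every pixel.
all_background = np.zeros((10, 10), dtype=bool)

print(pixel_accuracy(truth, all_background))  # 0.95, yet no positive pixel was found
```

The prediction misses every positive pixel and still scores 95%, which is why accuracy should be read alongside the other metrics in this article.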

Intersection Over Union (IoU)

The IoU (also known as the Jaccard index) is another popular metric that is often used in research papers for quantitative analysis of segmentations. When we compute the IoU, we are finding the fraction of overlap between the ground truth mask and the predicted mask. We can calculate the IoU using the following equation:

IoU = Area of Overlap / Area of Union = TP / (TP + FP + FN)

The numerator of the expression is simply the area of overlap and the denominator is the area of the union between the two masks. An IoU of 1.0 means that your segmentation is exactly the same as the ground truth, while an IoU of 0 means the two masks do not overlap at all. An IoU value ranging from 0.7 to 1.0 is typically considered good for segmentations, although the acceptable range depends on the task. Here is a visual to further clarify the concept of IoU.
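In code, the overlap and union can be counted directly from two binary masks. This is a minimal sketch: the function name `iou` is hypothetical, and treating two empty masks as a perfect match is an assumption I made to avoid dividing by zero.

```python
import numpy as np

def iou(y_true, y_pred):
    """Intersection over union (Jaccard index) of two binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    # Assumption: if both masks are empty, count them as a perfect match.
    if union == 0:
        return 1.0
    return float(intersection / union)

# Toy example: a 4x4 square and the same square shifted by one pixel.
truth = np.zeros((6, 6), dtype=bool)
truth[1:5, 1:5] = True   # 16 positive pixels
pred = np.zeros((6, 6), dtype=bool)
pred[2:6, 2:6] = True    # 16 positive pixels, 9 of them shared with truth

print(iou(truth, pred))  # 9 / 23, roughly 0.39
```

Note how a shift of a single pixel in each direction already drops the IoU of a small object well below 0.7, so small objects are penalized more heavily than large ones.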

Precision and Recall

Precision tells us what fraction of the predicted objects actually had a matching object present in the ground truth. Precision can easily be calculated using the following formula:

Precision = TP / (TP + FP)

Recall quantifies what fraction of the objects annotated in the ground truth were captured as positive predictions in the segmentation. Recall is computed using the following equation:

Recall = TP / (TP + FN)
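Both quantities can be sketched at the pixel level as follows. I am assuming here that each pixel is treated as one prediction (rather than matching whole objects), and that a metric with a zero denominator is reported as 0.0; the function name `precision_recall` is hypothetical.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Pixel-level precision and recall for two binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # predicted positive, actually positive
    fp = np.sum(~y_true & y_pred)   # predicted positive, actually negative
    fn = np.sum(y_true & ~y_pred)   # predicted negative, actually positive
    # Assumption: report 0.0 when a denominator is zero.
    precision = float(tp / (tp + fp)) if tp + fp > 0 else 0.0
    recall = float(tp / (tp + fn)) if tp + fn > 0 else 0.0
    return precision, recall

truth = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0]])
pred = np.array([[1, 1, 1, 1],
                 [1, 0, 0, 0]])

p, r = precision_recall(truth, pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

In this toy example the prediction has TP = 3, FP = 2, and FN = 1, so it over-segments (lower precision) while still missing one true pixel (recall below 1).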

Here is a visual that concisely describes the overall concept of precision and recall.

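Precision and recall are also used to generate precision-recall curves, while ROC curves plot recall (the true positive rate) against the false positive rate. Both curves are produced by sweeping the threshold applied to the model's predicted probabilities. These graphs allow us to analyze the quality of the classifier across thresholds. The following visual shows sample ROC curves and what these curves indicate about the segmentation: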
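As a rough sketch of how the points on an ROC curve are obtained, the snippet below thresholds a set of predicted scores at several values and records the false positive rate and true positive rate at each one. The function name `roc_points`, the scores, and the thresholds are all made up for illustration; in practice you would pass your model's per-pixel probabilities.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold."""
    y_true = np.asarray(y_true, dtype=bool).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    positives = y_true.sum()
    negatives = (~y_true).sum()
    points = []
    for t in thresholds:
        pred = scores >= t                  # binarize at this threshold
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        tpr = float(tp / positives) if positives > 0 else 0.0
        fpr = float(fp / negatives) if negatives > 0 else 0.0
        points.append((fpr, tpr))
    return points

# Hypothetical labels and predicted probabilities for six pixels.
truth = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

for fpr, tpr in roc_points(truth, scores, thresholds=[1.0, 0.75, 0.5, 0.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) toward (1, 1); a curve that rises quickly toward the top-left corner indicates a classifier that separates the two classes well.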


Here are some Python functions that will allow you to calculate all the metrics discussed in the article.
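The sketch below is one way to write such functions with NumPy. It computes the four confusion counts once and derives accuracy, IoU, precision, and recall from them. The function names (`confusion_counts`, `segmentation_metrics`) and the choice to return 0.0 for a zero denominator are assumptions made for this example.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) pixel counts for two binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return tp, tn, fp, fn

def _safe_divide(numerator, denominator):
    # Assumption: a metric with a zero denominator is reported as 0.0.
    return numerator / denominator if denominator > 0 else 0.0

def segmentation_metrics(y_true, y_pred):
    """Compute accuracy, IoU, precision, and recall from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": _safe_divide(tp + tn, tp + tn + fp + fn),
        "iou": _safe_divide(tp, tp + fp + fn),
        "precision": _safe_divide(tp, tp + fp),
        "recall": _safe_divide(tp, tp + fn),
    }

# Toy 4x4 example: the prediction covers the object plus one extra column.
truth = np.zeros((4, 4), dtype=int)
truth[:2, :2] = 1   # 4 positive pixels
pred = np.zeros((4, 4), dtype=int)
pred[:2, :3] = 1    # 6 predicted pixels: TP=4, FP=2, FN=0, TN=10

for name, value in segmentation_metrics(truth, pred).items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.875, iou: 0.667, precision: 0.667, recall: 1.000
```

The masks are expected to have the same shape, with nonzero values marking the positive class. If your network outputs probabilities, threshold them (for example at 0.5) before calling these functions.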

Final Remarks

I hope this article helped you understand some of the basics behind popular segmentation metrics. If you have any questions or concerns, please comment on the article or send me an email at zeeshanp@berkeley.edu.

I’m a freshman studying CS and Statistics at UC Berkeley. Feel free to contact me at zeeshanp@berkeley.edu.