A box plot is used to visualize the relationship between a continuous variable and a categorical variable. This scenario occurs in both regression and classification, as listed below.

  • Regression: The target variable is continuous, the predictor is categorical
  • Classification: The target variable is categorical, the predictor is continuous

In both scenarios, you are trying to understand whether the two variables are related to each other. A box plot can help answer this.

A box plot shows the distribution of the continuous variable for each category. If the distributions are similar across categories, which means the boxes are aligned at roughly the same level, it indicates no correlation.

Conversely, if the distribution is different for each category, which means the boxes are far from each other, it indicates that there is a correlation between the two variables.

The logic behind this is the ability of the predictor column to separate the values of the target variable into distinct ranges.

Consider an extreme example using fuel types and car prices. If CNG, Diesel, and Petrol cars all have similar prices, you cannot say that a Diesel car will be expensive or that a Petrol car will be cheap. Hence, you cannot use FuelType to predict car prices.
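A minimal sketch of this case, assuming pandas and matplotlib are installed. The DataFrame, its values, and the output file name are hypothetical, chosen only so that every fuel type has the same spread of prices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: every fuel type has the same prices
car_data = pd.DataFrame({
    "FuelType": ["CNG"] * 5 + ["Diesel"] * 5 + ["Petrol"] * 5,
    "CarPrice": [400, 450, 500, 550, 600] * 3,
})

# Draw one box per FuelType category on a shared CarPrice axis
fig, ax = plt.subplots(figsize=(6, 4))
car_data.boxplot(column="CarPrice", by="FuelType", ax=ax)
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
ax.set_title("FuelType vs CarPrice (similar prices)")
ax.set_ylabel("CarPrice")
fig.savefig("fueltype_vs_carprice_similar.png")

# Identical per-category medians are the numeric version of aligned boxes
medians = car_data.groupby("FuelType")["CarPrice"].median()
print(medians)
```

Because the three groups hold the same values, the three boxes sit at the same height in the saved image.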

Sample Output:

Box plot of FuelType vs CarPrices showing no correlation between the two variables

In the above box plot, all the "boxes" are aligned. Hence, FuelType and CarPrices are not correlated, because changing FuelType does not affect the car prices.

Consider another extreme example for the same data. The prices are now different for each category, so the boxes are far from each other, which implies a correlation between the variables: if you change the fuel type, the car prices change as well.
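This second case can be sketched the same way, again assuming pandas and matplotlib. The prices below are hypothetical and deliberately placed in non-overlapping ranges for each fuel type.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: each fuel type occupies its own price range
car_data = pd.DataFrame({
    "FuelType": ["CNG"] * 5 + ["Diesel"] * 5 + ["Petrol"] * 5,
    "CarPrice": [200, 220, 240, 260, 280,
                 800, 850, 900, 950, 1000,
                 500, 520, 540, 560, 580],
})

# Draw one box per FuelType category on a shared CarPrice axis
fig, ax = plt.subplots(figsize=(6, 4))
car_data.boxplot(column="CarPrice", by="FuelType", ax=ax)
fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
ax.set_title("FuelType vs CarPrice (different prices)")
ax.set_ylabel("CarPrice")
fig.savefig("fueltype_vs_carprice_different.png")

# The min/median/max per category show that the ranges do not overlap
price_ranges = car_data.groupby("FuelType")["CarPrice"].agg(["min", "median", "max"])
print(price_ranges)
```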

Sample Output:

Box plot of FuelType vs CarPrices showing correlation between the two variables

Based on the above output, where the boxes are far from each other, you can conclude that there is a correlation between FuelType and CarPrices.
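As a rough numeric counterpart to reading the plot by eye, the sketch below checks whether the boxes (the interquartile ranges) of all categories share a common interval. The helper names and the sample prices are hypothetical, and this overlap check is only a heuristic for "aligned" versus "far apart", not a statistical test.

```python
from statistics import quantiles

def box_edges(values):
    """Return (Q1, Q3), the lower and upper edges of the box."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

def boxes_overlap(values_by_category):
    """True if the boxes of all categories share a common interval."""
    edges = [box_edges(v) for v in values_by_category.values()]
    highest_q1 = max(q1 for q1, _ in edges)
    lowest_q3 = min(q3 for _, q3 in edges)
    return highest_q1 <= lowest_q3

# Hypothetical prices: same range for every fuel type
similar_prices = {
    "CNG": [400, 450, 500, 550, 600],
    "Diesel": [400, 450, 500, 550, 600],
    "Petrol": [400, 450, 500, 550, 600],
}
# Hypothetical prices: a separate range for every fuel type
different_prices = {
    "CNG": [200, 220, 240, 260, 280],
    "Diesel": [800, 850, 900, 950, 1000],
    "Petrol": [500, 520, 540, 560, 580],
}

print(boxes_overlap(similar_prices))    # aligned boxes
print(boxes_overlap(different_prices))  # boxes far from each other
```

The first call reports overlapping boxes and the second reports separated boxes, matching the two extreme plots described above.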


What does the box represent?

The box plot is also known as a box-and-whisker plot. The lines coming out of the box are the tails (whiskers). The box represents the middle 50% of the data, and when there are no outliers, each tail represents roughly 25% of the data. Any data point falling beyond the tails is an outlier.

  • The box in the box plot represents the middle 50% of the data.
  • The green line in the middle of the box represents the median value of the data.
  • The tails on each side of the box represent about 25% of the data each.
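The parts of the box can also be computed by hand. The sketch below assumes the common 1.5 × IQR rule for the tails, which is the default in matplotlib and pandas; the function name and the sample prices are hypothetical.

```python
from statistics import quantiles

def box_plot_stats(values, whisker_factor=1.5):
    """Compute the box, tails, and outliers the way a default box plot draws them."""
    data = sorted(values)
    # "inclusive" uses linear interpolation between the sorted data points
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box, covering the middle 50% of the data
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "tail_low": inside[0],    # lower tail ends at the smallest non-outlier
        "tail_high": inside[-1],  # upper tail ends at the largest non-outlier
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical prices with one unusually expensive car
prices = [400, 450, 500, 550, 600, 1500]
stats = box_plot_stats(prices)
print(stats)
```

Here the price 1500 lies above the upper fence (587.5 + 1.5 × 125 = 775), so it is reported as an outlier and the upper tail stops at 600.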

Understanding the distribution of a continuous variable

Box plots can also be used on their own to understand the distribution of a single continuous variable.
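A minimal sketch for a single variable, assuming pandas and matplotlib are installed. The Series name, its values, and the output file name are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices for a single continuous variable
car_prices = pd.Series([400, 450, 500, 550, 600, 1500], name="CarPrice")

# One box for the whole column, with no grouping by category
fig, ax = plt.subplots(figsize=(4, 4))
car_prices.plot.box(ax=ax)
ax.set_title("Distribution of CarPrice")
fig.savefig("carprice_distribution.png")

# describe() lists the quartiles (25%, 50%, 75%) that the box is drawn from
summary = car_prices.describe()
print(summary)
```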

Sample Output:

Box-plot for a single continuous variable showing its data distribution

Lead Data Scientist

Farukh is an innovator in solving industry problems using artificial intelligence. His expertise is backed by 10 years of industry experience. As a senior data scientist, he is responsible for designing AI/ML solutions that provide maximum gains for clients. As a thought leader, his focus is on solving the key business problems of the CPG industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent Systems. His passion for teaching inspired him to create this website!