Python How to Compare a Discrete Variable With a Continuous Variable With a Graph
A box plot is used when you want to visualize the relationship between a continuous variable and a categorical variable. This scenario occurs in both regression and classification, as listed below.
- Regression: The target variable is continuous, the predictor is categorical
- Classification: The target variable is categorical, the predictor is continuous
In both scenarios, you are trying to understand whether the two variables are related to each other. A box plot can help answer this.
A box plot shows the distribution of the continuous variable for each category. If the distributions are similar across categories, meaning the boxes are aligned, it indicates no correlation.
Conversely, if the distribution differs from category to category, meaning the boxes are far from each other, it indicates a correlation between the two variables.
The logic behind this is the predictor column's ability to separate the values of the target variable into distinct groups.
Consider the data below, which shows an extreme example of fuel types and car prices. If CNG, Diesel, and Petrol cars all have similar prices, you cannot say that a Diesel car will be expensive or that a Petrol car will be cheap. Hence, you will not be able to use FuelType to predict car prices.
```python
# Generating sample data
import pandas as pd

ColumnNames = ['FuelType', 'CarPrice']
DataValues = [
    ['Petrol', 2500],
    ['Petrol', 2000],
    ['Petrol', 1900],
    ['Petrol', 1850],
    ['Petrol', 1600],
    ['Petrol', 1500],
    ['Petrol', 1500],
    ['Diesel', 2500],
    ['Diesel', 2000],
    ['Diesel', 1900],
    ['Diesel', 1850],
    ['Diesel', 1600],
    ['Diesel', 1500],
    ['Diesel', 1500],
    ['CNG', 2500],
    ['CNG', 2000],
    ['CNG', 1900],
    ['CNG', 1850],
    ['CNG', 1600],
    ['CNG', 1500],
    ['CNG', 1500],
]

# Create the Data Frame
CarData = pd.DataFrame(data=DataValues, columns=ColumnNames)
print(CarData.head())

# Generating boxplot for CarPrice Vs FuelType
CarData.boxplot(column='CarPrice', by='FuelType', figsize=(5, 6))
```
Sample Output:
If you observe the above box plot, all the boxes are aligned. Hence, FuelType and CarPrice are not correlated, because changing the FuelType does not affect the car prices.
Consider another extreme example below using the same columns. The prices are now different for each category, so the boxes will be far from each other, which implies a correlation between the variables: if you change the fuel type, the car prices change as well.
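The same conclusion can be checked numerically. The following is a minimal sketch that rebuilds the sample data from the example above and prints the quartiles each box is drawn from; the names `prices`, `summary`, and `medians` are illustrative and not part of the original article.

```python
import pandas as pd

# Same seven prices repeated for every fuel type, as in the example above
prices = [2500, 2000, 1900, 1850, 1600, 1500, 1500]
rows = [[fuel, price] for fuel in ['Petrol', 'Diesel', 'CNG'] for price in prices]
CarData = pd.DataFrame(data=rows, columns=['FuelType', 'CarPrice'])

# Quartiles of CarPrice per fuel type: the box edges (25%, 75%) and the median (50%)
summary = CarData.groupby('FuelType')['CarPrice'].describe()[['25%', '50%', '75%']]
print(summary)

# If every category has the same median, the median lines of the boxes line up
medians = CarData.groupby('FuelType')['CarPrice'].median()
print(medians.nunique() == 1)
```

Every row of `summary` is identical here, which is the numeric counterpart of the aligned boxes.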
```python
# Generating sample data
import pandas as pd

ColumnNames = ['FuelType', 'CarPrice']
DataValues = [
    ['Petrol', 2000],
    ['Petrol', 2100],
    ['Petrol', 1900],
    ['Petrol', 2150],
    ['Petrol', 2100],
    ['Petrol', 2200],
    ['Petrol', 1950],
    ['Diesel', 2500],
    ['Diesel', 2700],
    ['Diesel', 2900],
    ['Diesel', 2850],
    ['Diesel', 2600],
    ['Diesel', 2500],
    ['Diesel', 2700],
    ['CNG', 1500],
    ['CNG', 1400],
    ['CNG', 1600],
    ['CNG', 1650],
    ['CNG', 1600],
    ['CNG', 1500],
    ['CNG', 1500],
]

# Create the Data Frame
CarData = pd.DataFrame(data=DataValues, columns=ColumnNames)
print(CarData.head())

# Generating boxplot for CarPrice Vs FuelType
CarData.boxplot(column='CarPrice', by='FuelType', figsize=(5, 6))
```
Sample Output
Based on the above output, where the boxes are far from each other, you can conclude that there is a correlation between FuelType and CarPrices.
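To make "far from each other" concrete, the sketch below computes the box edges for each fuel type in the second dataset and checks whether any neighbouring boxes overlap. This is only a rough visual heuristic, not a statistical test, and the names `quartiles`, `upper_edges`, `lower_edges`, and `boxes_overlap` are illustrative.

```python
import pandas as pd

ColumnNames = ['FuelType', 'CarPrice']
DataValues = [
    ['Petrol', 2000], ['Petrol', 2100], ['Petrol', 1900], ['Petrol', 2150],
    ['Petrol', 2100], ['Petrol', 2200], ['Petrol', 1950],
    ['Diesel', 2500], ['Diesel', 2700], ['Diesel', 2900], ['Diesel', 2850],
    ['Diesel', 2600], ['Diesel', 2500], ['Diesel', 2700],
    ['CNG', 1500], ['CNG', 1400], ['CNG', 1600], ['CNG', 1650],
    ['CNG', 1600], ['CNG', 1500], ['CNG', 1500],
]
CarData = pd.DataFrame(data=DataValues, columns=ColumnNames)

# Box edges (25%, 75%) and median (50%) per fuel type, ordered by median
quartiles = CarData.groupby('FuelType')['CarPrice'].describe()[['25%', '50%', '75%']]
quartiles = quartiles.sort_values('50%')
print(quartiles)

# Two neighbouring boxes overlap if the lower box's top edge (75%)
# reaches the next box's bottom edge (25%)
upper_edges = quartiles['75%'].to_numpy()[:-1]
lower_edges = quartiles['25%'].to_numpy()[1:]
boxes_overlap = bool((upper_edges >= lower_edges).any())
print(boxes_overlap)
```

For this data no box reaches into its neighbour, which matches what the plot shows.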
What does the box represent?
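The box plot is also known as a box-and-whisker plot. The box spans the first quartile to the third quartile, so it holds the middle 50% of the data. The lines coming out of the box are the whiskers (tails). When there are no outliers, each tail covers roughly the remaining 25% of the data on its side. By default, matplotlib (which pandas uses for plotting) extends each tail only to the farthest data point within 1.5 times the interquartile range of the box edge. Any data points falling beyond the tails are drawn as outliers.
- The box in the box plot represents the middle 50% of the data.
- The green line in the middle of the box represents the median value of the data.
- The tails on each side of the box represent roughly 25% of the data each.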
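The parts of the box can also be computed by hand. The sketch below uses the seven prices from the first example and assumes matplotlib's default whisker rule of 1.5 times the interquartile range; the variable names (`lower_fence`, `whisker_low`, and so on) are illustrative and not part of the original article.

```python
import pandas as pd

prices = pd.Series([2500, 2000, 1900, 1850, 1600, 1500, 1500], name='CarPrice')

# Box edges and the median line
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # height of the box (interquartile range)

# Limits beyond which a point counts as an outlier (assumed default: 1.5 * IQR)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each tail ends at the most extreme data point that is still inside the fences
whisker_low = prices[prices >= lower_fence].min()
whisker_high = prices[prices <= upper_fence].max()

# Points outside the fences are drawn individually as outliers
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high)
print(len(outliers))
```

With these prices there are no outliers, so the tails reach the minimum and maximum values.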
Understanding the distribution of a continuous variable
Box plots can also be used to understand the data distribution of a single continuous variable.
```python
# Generating a boxplot for a single column
CarData.boxplot(column='CarPrice', figsize=(8, 4), vert=False)
```
Sample Output:
Lead Data Scientist
Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!
Source: https://thinkingneuron.com/how-to-visualize-the-relationship-between-a-continuous-and-a-categorical-variable-in-python/