Introduction
Data visualisation is a crucial part of data analysis and a skill every data professional seeks to learn. This is especially true in large organisations, where analysts work with many stakeholders, such as business strategists and business developers, who may not be as tech-savvy as they are. A Data Science Course in Chennai or a similar city that covers data visualisation attracts many professional data analysts because expertise in this discipline lets them present complex datasets and communicate insights effectively. Histograms and box plots are two of the most commonly used tools for understanding the distribution and variability of data. In this article, we will explore advanced techniques for creating and customising histograms and box plots using Matplotlib, Python’s go-to library for data visualisation, together with Seaborn, which is built on top of it.
Why Use Histograms and Box Plots?
Histograms and box plots are two terms you will encounter frequently in any Data Science Course that covers data visualisation. Here is a brief description of each.
Histograms: They provide a visual representation of the distribution of a dataset by dividing the data into bins and counting the number of observations within each bin. This helps in understanding the frequency distribution, skewness, and the presence of outliers.
Box Plots: Box plots, also known as box-and-whisker plots, summarise the distribution of a dataset by displaying the median, quartiles, and potential outliers. They are particularly useful for comparing distributions between different groups or datasets.
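To make the two definitions concrete, here is a minimal sketch of the numbers behind each plot, computed with NumPy alone. The sample data, the seed, and the choice of 10 bins are illustrative assumptions, not values from this article.

```python
import numpy as np

# Illustrative sample: 1,000 draws from a standard normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# Histogram: split the data range into 10 equal-width bins and count
# the observations that fall into each one.
counts, edges = np.histogram(data, bins=10)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    # Every bin is half-open except the last, which includes its right edge.
    print(f"{left:6.2f} to {right:6.2f}: {count}")

# Box plot: the three quartiles define the box and the median line.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box
print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
```

A histogram draws the counts as bars, while a box plot draws the quartiles as a box with a line at the median.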
Setting Up Matplotlib
Before diving into advanced techniques, ensure you have Matplotlib installed, along with NumPy, pandas, and Seaborn, which the examples below also use:
pip install matplotlib numpy pandas seaborn
Then, import the necessary libraries:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Advanced Histogram Techniques
The techniques below, which a standard Data Science Course would typically cover, go beyond a basic single-dataset histogram.
- Creating Overlaid Histograms
Overlaid histograms are useful when you want to compare the distribution of multiple datasets on the same plot.
# Sample data
data1 = np.random.normal(0, 1, 1000)
data2 = np.random.normal(1, 1.5, 1000)
# Plotting overlaid histograms
plt.hist(data1, bins=30, alpha=0.5, label='Dataset 1')
plt.hist(data2, bins=30, alpha=0.5, label='Dataset 2')
plt.legend(loc='upper right')
plt.title('Overlaid Histograms')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
In this example, alpha=0.5 controls the transparency, making it easier to compare the two distributions.
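When each dataset is passed to plt.hist with bins=30 separately, each one gets its own bin edges, and raw counts are hard to compare if the sample sizes differ. The sketch below is one possible refinement: it computes a single set of edges with np.histogram_bin_edges and normalises the bars with density=True. The unequal sample sizes and the seed are assumptions chosen to illustrate the point.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data1 = rng.normal(0, 1, 1000)
data2 = rng.normal(1, 1.5, 400)  # deliberately smaller sample (assumption)

# One set of 30 bins spanning both datasets, so the bars line up.
edges = np.histogram_bin_edges(np.concatenate([data1, data2]), bins=30)

fig, ax = plt.subplots()
# density=True rescales each histogram so its bar areas sum to 1,
# which keeps the shapes comparable despite the different sample sizes.
counts1, _, _ = ax.hist(data1, bins=edges, density=True, alpha=0.5, label="Dataset 1")
counts2, _, _ = ax.hist(data2, bins=edges, density=True, alpha=0.5, label="Dataset 2")
ax.legend(loc="upper right")
ax.set_title("Overlaid Histograms (shared bins, density)")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
```

Call plt.show() afterwards to display the figure. With shared edges, a taller bar in one dataset means a genuinely higher density in that interval, not just a narrower bin.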
- Creating a Density Plot with a Histogram
A density plot superimposed on a histogram provides a smoother representation of the distribution, making patterns easier to identify.
import seaborn as sns
# Sample data
data = np.random.normal(0, 1, 1000)
# Plotting histogram with density plot
sns.histplot(data, kde=True, bins=30)
plt.title('Histogram with Density Plot')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Using Seaborn’s histplot, you can quickly add a kernel density estimate (KDE) to your histogram.
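To show what a kernel density estimate does, here is a minimal hand-rolled sketch using only NumPy and Matplotlib. The helper gaussian_kde_1d is a hypothetical name introduced for illustration, and the bandwidth uses Scott's rule of thumb. Seaborn's own implementation may differ in detail, and this version plots densities rather than counts.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb for one-dimensional data (an assumption here).
        bandwidth = samples.std(ddof=1) * n ** (-1 / 5)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels centred on each sample.
    return kernel.sum(axis=1) / (n * bandwidth)

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)
grid = np.linspace(data.min() - 1, data.max() + 1, 400)
density = gaussian_kde_1d(data, grid)

fig, ax = plt.subplots()
# density=True puts the histogram on the same scale as the KDE curve.
ax.hist(data, bins=30, density=True, alpha=0.5, label="Histogram (density)")
ax.plot(grid, density, label="Gaussian KDE")
ax.set_title("Histogram with Density Plot")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()
```

Call plt.show() to display the result. A smaller bandwidth follows the data more closely but is noisier, while a larger one gives a smoother curve that can hide detail.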
- Customising Histogram Bins
Customising the bin size and edges can reveal finer details in the data distribution.
# Sample data
data = np.random.normal(0, 1, 1000)
# Custom bins
bins = np.linspace(-4, 4, 20)
# Plotting histogram with custom bins
plt.hist(data, bins=bins, edgecolor='black')
plt.title('Histogram with Custom Bins')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Here, we specify custom bin edges using np.linspace, allowing for more control over the histogram’s appearance.
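Besides hand-picked edges, NumPy provides named rules that choose the bin width from the data. The sketch below compares a few of them on the same sample. The sample and seed are illustrative assumptions, and the bin counts will vary with the data.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# Ask NumPy for the bin edges each named rule would choose.
bin_counts = {}
for rule in ("sturges", "scott", "fd", "auto"):
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins, width {edges[1] - edges[0]:.3f}")
```

The same rule names can be passed straight to plt.hist, for example plt.hist(data, bins='fd'), which is a convenient starting point before fine-tuning the edges by hand.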
Advanced Box Plot Techniques
The techniques below, which a standard Data Science Course would typically cover, go beyond a basic single box plot.
- Creating Grouped Box Plots
Grouped box plots are effective for comparing the distribution of different groups side by side.
# Sample data
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(1, 1.5, 100),
                             np.random.normal(2, 0.5, 100)])
})
# Plotting grouped box plots
plt.figure(figsize=(8, 6))
sns.boxplot(x='Group', y='Value', data=data)
plt.title('Grouped Box Plots')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()
This example uses Seaborn’s boxplot to create grouped box plots, which allow for easy comparison between different groups.
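If you prefer to stay within Matplotlib, a similar figure can be sketched by passing one array per group to ax.boxplot. The dictionary of groups below mirrors the sample data above but is built with a seeded generator, which is an assumption made so the example is repeatable.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# One array of values per group (same shapes as the DataFrame example).
groups = {
    "A": rng.normal(0, 1, 100),
    "B": rng.normal(1, 1.5, 100),
    "C": rng.normal(2, 0.5, 100),
}

fig, ax = plt.subplots(figsize=(8, 6))
# ax.boxplot draws one box per array, at x positions 1, 2, 3, ...
result = ax.boxplot(list(groups.values()))
# Label the boxes by hand so the code does not depend on the name of the
# label keyword, which has changed between Matplotlib releases.
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_title("Grouped Box Plots")
ax.set_xlabel("Group")
ax.set_ylabel("Value")
```

Call plt.show() to display the figure. The returned dictionary gives access to the drawn artists (boxes, medians, whiskers, fliers) if you want to restyle them afterwards.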
- Adding Notches to Box Plots
Notched box plots provide a visual indication of the confidence interval around the median, useful for comparing medians between groups.
# Sample data
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(1, 1.5, 100),
                             np.random.normal(2, 0.5, 100)])
})
# Plotting notched box plots
plt.figure(figsize=(8, 6))
sns.boxplot(x='Group', y='Value', data=data, notch=True)
plt.title('Notched Box Plots')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()
Adding the notch=True argument introduces notches in the box plot. As a rule of thumb, if the notches of two boxes do not overlap, their medians are likely to differ; treat this as a visual guide rather than a formal test.
- Displaying Outliers with Box Plots
Box plots automatically show outliers, but you can customise how they are displayed to make them stand out more.
# Sample data
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(1, 1.5, 100),
                             np.random.normal(2, 0.5, 100)])
})
# Plotting box plots with customised outliers
plt.figure(figsize=(8, 6))
# markerfacecolor and markeredgecolor set the marker colour explicitly;
# a plain 'color' key may not recolour the outlier markers
sns.boxplot(x='Group', y='Value', data=data,
            flierprops={'marker': 'o', 'markerfacecolor': 'red',
                        'markeredgecolor': 'red', 'alpha': 0.5})
plt.title('Box Plots with Customised Outliers')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()
Here, the flierprops parameter customises the appearance of outliers, using semi-transparent red circles (marker='o') to make them more noticeable.
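It helps to know which points a box plot treats as outliers. By default, Matplotlib draws whiskers out to 1.5 times the interquartile range beyond the quartiles, and anything further away is plotted as an outlier. The sketch below reproduces that rule with NumPy. The helper name iqr_outliers and the tiny sample are assumptions for illustration.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than `whis` * IQR beyond the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    return values[(values < lower_fence) | (values > upper_fence)]

# Small hand-made sample with one obviously extreme value.
sample = np.array([1, 2, 3, 4, 5, 100])
outliers = iqr_outliers(sample)
print(outliers)
```

Passing a larger whis value to the plotting function widens the whiskers and so marks fewer points as outliers.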
Combining Histograms and Box Plots
In some cases, you may want to use both histograms and box plots together to provide a more comprehensive view of your data distribution.
# Sample data
data = np.random.normal(0, 1, 1000)
# Creating a figure with subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 10))
# Histogram
axs[0].hist(data, bins=30, edgecolor='black')
axs[0].set_title('Histogram')
# Box Plot
axs[1].boxplot(data, vert=False)
axs[1].set_title('Box Plot')
plt.show()
This example creates a figure with two subplots: one for the histogram and one for the box plot, allowing you to analyse the data distribution from multiple perspectives.
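One possible refinement is to place a slim horizontal box plot above the histogram and have the two panels share an x-axis, so the quartiles line up with the bars. The sketch below does this with sharex=True and gridspec_kw. The 1:4 height ratio and the seed are arbitrary assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# Two stacked panels on a shared x-axis; the box plot gets a fifth of the height.
fig, (ax_box, ax_hist) = plt.subplots(
    2, 1, sharex=True, figsize=(8, 6),
    gridspec_kw={"height_ratios": [1, 4]},
)

# vert=False draws a horizontal box; newer Matplotlib releases also
# offer an orientation keyword for the same purpose.
ax_box.boxplot(data, vert=False)
ax_box.set_yticks([])  # the single box needs no y tick
ax_box.set_title("Box Plot and Histogram on a Shared Axis")

ax_hist.hist(data, bins=30, edgecolor="black")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")
```

Call plt.show() to display the figure. Because the x-axis is shared, zooming or changing the limits on one panel updates the other as well.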
Conclusion
Advanced histogram and box plot techniques in Matplotlib offer a powerful toolkit for visualising complex data. Whether you are comparing distributions, identifying outliers, or customising plots for clearer communication, these techniques can help you gain deeper insights into your data. Becoming familiar with these tools, for example by enrolling in a Data Science Course in Chennai or a similar city that teaches advanced visualisation techniques, will leave you better equipped to create effective and informative visualisations that resonate with your audience.
