Palmer Penguins Data Set
Data visualization of the Palmer Penguins data set using a scatter plot.
Reading in the Palmer Penguins Data Set
I will be creating a scatter plot from the Palmer Penguins data set. First, I need to import the pandas, matplotlib, and seaborn libraries.
#importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#using pandas to read in the Palmer Penguins data set
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
Next, I want to take a look at the data to see what columns I can use to create my scatter plot.
penguins.head()
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
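The preview already shows NaN entries (row 3 has no measurements at all), so it helps to count how much is missing before choosing columns. The following is a minimal sketch on a small hand-built frame rather than the full data set: the first five rows copy values from the preview above, and the last row is made up to mimic a "." placeholder in the Sex column. The name `sample` is only for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the full data set: the first five rows copy values from the
# preview above; the sixth row is made up and uses "." as a placeholder for Sex.
sample = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan, 36.7, 44.5],
    "Culmen Depth (mm)": [18.7, 17.4, 18.0, np.nan, 19.3, 15.7],
    "Sex": ["MALE", "FEMALE", "FEMALE", np.nan, "FEMALE", "."],
})

# count the missing entries in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# tally every distinct value in the Sex column, keeping NaN in the tally
sex_counts = sample["Sex"].value_counts(dropna=False)
print(sex_counts)
```

Running the same two calls on `penguins` should show which columns are mostly empty and whether any unexpected labels appear in Sex.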
Selecting a Subset of Columns
I want to use the Culmen Length (mm), Culmen Depth (mm), Sex, Island, and Species columns to create my visualization, so I will keep only these columns from the penguins data set. I will then drop every row that contains a NaN value, along with any row where Sex is recorded as ‘.’, because these rows cannot be placed in my visualization. I also want to clean up the look of the Species column, so I will keep only the first word of each species name.
#select the appropriate columns and resave the penguins data set to only incorporate these columns
cols = ["Species", "Culmen Length (mm)", "Culmen Depth (mm)", "Sex", "Island"]
penguins = penguins[cols]
#drop the NaN values in these columns
penguins = penguins.dropna()
#drop the "." values in the Sex column; copy the result so the Species edit below changes an independent DataFrame
penguins = penguins[penguins["Sex"] != "."].copy()
#edit each Species name to only include the first word
penguins["Species"] = penguins["Species"].str.split().str.get(0)
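To check that the cleaning behaves as intended, the same steps can be bundled into one function and run on a tiny frame. This is only a sketch: `clean_penguins` and `raw` are hypothetical names, and the rows in `raw` are made up in the format of the original data (including an all-NaN Comments column) rather than taken from it.

```python
import numpy as np
import pandas as pd

def clean_penguins(df):
    """Hypothetical helper that applies the selection and cleaning steps above."""
    cols = ["Species", "Culmen Length (mm)", "Culmen Depth (mm)", "Sex", "Island"]
    # select the columns first, then drop rows with NaN in any of them
    out = df[cols].dropna()
    # remove the "." placeholder rows; .copy() gives an independent DataFrame to edit
    out = out[out["Sex"] != "."].copy()
    # keep only the first word of each species name
    out["Species"] = out["Species"].str.split().str.get(0)
    return out

# made-up rows in the format of the original data set
raw = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 3
               + ["Gentoo penguin (Pygoscelis papua)"],
    "Culmen Length (mm)": [39.1, np.nan, 36.7, 44.5],
    "Culmen Depth (mm)": [18.7, np.nan, 19.3, 15.7],
    "Sex": ["MALE", np.nan, "FEMALE", "."],
    "Island": ["Torgersen", "Torgersen", "Torgersen", "Biscoe"],
    "Comments": [np.nan] * 4,
})

cleaned = clean_penguins(raw)
print(cleaned)
```

The order of the steps matters: calling `dropna()` before selecting columns would also test columns such as Comments, which is NaN for most penguins, and would throw away nearly every row.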
Creating the Plot
Next, I want to create the scatter plot using the seaborn package to display the Culmen Length and Culmen Depth for each penguin. I will use faceting to split the plot into one panel per combination of Island and Sex, and I will color each data point according to the penguin’s species.
#set the theme and color palette of the plot
sns.set(context = "talk", palette = "Set1", style = "whitegrid")
#create the plot with Culmen Length on the x-axis and Culmen Depth on the y-axis. each row of scatter plots
#is separated by Sex and each column is separated by Island. the data points are different colors depending
#on the species of the penguin.
sns.relplot(data=penguins,
            x="Culmen Length (mm)",
            y="Culmen Depth (mm)",
            row="Sex",
            col="Island",
            hue="Species",
            alpha=.5)
#add the overall title
plt.subplots_adjust(top=0.9)
plt.suptitle("Palmer Penguins")
plt.show()
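An alternative way to add the title is to keep the object that `sns.relplot` returns, a seaborn `FacetGrid`, and work with its figure directly instead of going through `plt`. The sketch below assumes seaborn 0.11.2 or newer (where the grid exposes a `figure` attribute) and uses a tiny made-up frame in the cleaned format, so the numbers are illustrative only.

```python
import pandas as pd
import seaborn as sns

# tiny made-up sample in the cleaned format, just to exercise the plotting call
sample = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "Culmen Length (mm)": [39.1, 36.7, 46.1, 50.0, 46.5, 50.0],
    "Culmen Depth (mm)": [18.7, 19.3, 13.2, 16.3, 17.9, 19.5],
    "Sex": ["MALE", "FEMALE", "FEMALE", "MALE", "FEMALE", "MALE"],
    "Island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe", "Dream", "Dream"],
})

# relplot returns a FacetGrid holding one subplot per (Sex, Island) combination
g = sns.relplot(data=sample,
                x="Culmen Length (mm)",
                y="Culmen Depth (mm)",
                row="Sex",
                col="Island",
                hue="Species",
                alpha=.5)

# leave room at the top of the figure, then add the overall title to it
g.figure.subplots_adjust(top=0.9)
title = g.figure.suptitle("Palmer Penguins")

# the grid of subplots has one row per Sex value and one column per Island
print(g.axes.shape)
```

Holding on to `g` also makes it easy to save the figure later with `g.savefig(...)` or to adjust individual panels through `g.axes`.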