Palmer Penguins Data Set

Data visualization of the Palmer Penguins data set using a scatter plot.

Reading in the Palmer Penguins Data Set

I will be creating a scatter plot from the Palmer Penguins data set. First, I need to import the pandas, matplotlib, and seaborn libraries.

#importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#using pandas to read in the Palmer Penguins data set
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Next, I want to take a look at the data to see what columns I can use to create my scatter plot.

penguins.head()
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
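
Since some of the rows above contain NaN, it can help to count the missing values in each candidate column before choosing which ones to plot. Here is a minimal sketch of that check. The `sample` data frame is a hand-made stand-in that copies three rows (0, 1, and 3) from the output above; it is not the full data set.

```python
import numpy as np
import pandas as pd

# hand-made stand-in: rows 0, 1, and 3 of the head() output, three columns only
sample = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5, np.nan],
    "Culmen Depth (mm)": [18.7, 17.4, np.nan],
    "Sex": ["MALE", "FEMALE", np.nan],
})

# count the NaN values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

The same call should work on the real data, for example `penguins.isna().sum()`.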

Selecting a Subset of Columns

I will use the Culmen Length (mm), Culmen Depth (mm), Sex, Island, and Species columns for my visualization, so I will keep only these columns from the penguins data set. I will then drop every row that contains a NaN value, as well as every row where Sex is recorded as ‘.’, because these rows cannot be placed meaningfully on the plot. I also want to clean up the look of the Species column, so I will keep only the first word of each species name (for example, “Adelie” instead of “Adelie Penguin (Pygoscelis adeliae)”).

#select the appropriate columns and resave the penguins data set to only incorporate these columns
cols = ["Species", "Culmen Length (mm)", "Culmen Depth (mm)", "Sex", "Island"]
penguins = penguins[cols]

#drop the NaN values in these columns
penguins = penguins.dropna()

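#drop the rows with "." in the Sex column; .copy() makes an independent data frame
#so that editing the Species column afterwards does not raise a SettingWithCopyWarning
penguins = penguins[penguins["Sex"] != "."].copy()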

#edit each Species name to only include the first word
penguins["Species"] = penguins["Species"].str.split().str.get(0)
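
To see what these cleaning steps do, here is a small sketch that applies them to a made-up data frame. The `toy` data frame and its four rows are hypothetical and only meant to show each step; the species strings follow the format seen in the data above.

```python
import numpy as np
import pandas as pd

# hypothetical four-row example: one "." and one NaN in the Sex column
toy = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)",
                "Gentoo penguin (Pygoscelis papua)",
                "Chinstrap penguin (Pygoscelis antarctica)",
                "Adelie Penguin (Pygoscelis adeliae)"],
    "Sex": ["MALE", ".", "FEMALE", np.nan],
})

# drop the row with NaN, then the row with "."
toy = toy.dropna()
toy = toy[toy["Sex"] != "."].copy()

# split each name on whitespace and keep the first word
toy["Species"] = toy["Species"].str.split().str.get(0)
print(toy)
```

Only the first and third rows survive, and their Species values become "Adelie" and "Chinstrap".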

Creating the Plot

Next, I will use the seaborn package to create a scatter plot of Culmen Depth against Culmen Length for each penguin. I will use faceting to separate the penguins by Sex (rows) and Island (columns), and I will color each data point by the penguin’s species.

#set the theme and color palette of the plot
sns.set(context = "talk", palette = "Set1", style = "whitegrid")

#create the plot with Culmen Length on the x-axis and Culmen Depth on the y-axis. each row of scatter plots
#is separated by Sex and each column is separated by Island. the data points are different colors depending
#on the species of the penguin.
sns.relplot(data=penguins, 
            x="Culmen Length (mm)",
            y="Culmen Depth (mm)",
            row = "Sex",
            col = "Island",
            hue = "Species",
            alpha=.5)

#add the overall title 
plt.subplots_adjust(top=0.9)
plt.suptitle("Palmer Penguins")

plt.show()

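[Figure output_8_0.png: scatter plots of Culmen Depth (mm) against Culmen Length (mm), with rows separated by Sex and columns separated by Island, colored by Species, under the title “Palmer Penguins”.]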
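
As a side note, sns.relplot returns a FacetGrid object, so the title can also be set through that object instead of through plt. Here is a minimal sketch of this alternative. It assumes seaborn 0.11.2 or later, where, as far as I know, the grid exposes a `figure` attribute (older versions call it `fig`). The `toy` data frame and its values are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# hypothetical data: two species, two islands, both sexes (values are illustrative)
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "Culmen Length (mm)": [39.1, 39.5, 46.1, 50.0],
    "Culmen Depth (mm)": [18.7, 17.4, 13.2, 16.3],
    "Sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "Island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe"],
})

#keep the FacetGrid that relplot returns
g = sns.relplot(data=toy,
                x="Culmen Length (mm)",
                y="Culmen Depth (mm)",
                row="Sex",
                col="Island",
                hue="Species",
                alpha=0.5)

#leave room at the top, then add the overall title on the grid's own figure
g.figure.subplots_adjust(top=0.9)
title = g.figure.suptitle("Palmer Penguins")
```

Working through `g` makes it explicit which figure receives the title, which can matter when more than one figure is open.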

Written on January 14, 2022