Viewing and Understanding DataFrames Using Pandas

After reading tabular data as a DataFrame, you would need to have a glimpse of the data. You can either view a small sample of the dataset or a summary of the data in the form of summary statistics.

How to View Data Using .head() and .tail()

You can view the first few or last few rows of a DataFrame using the .head() or .tail() methods, respectively. You can specify the number of rows through the n argument (the default value is 5).

  df.head()
  

First five rows of the DataFrame (df) using .head()

  df.tail(n=10)
  

Last 10 rows of the DataFrame using .tail()

Understanding Data Using .describe()

The .describe() method prints the summary statistics of all numeric columns, such as count, mean, standard deviation, range, and quartiles of numeric columns.

  df.describe()
  

Get summary statistics with .describe()

It gives a quick look at the scale, skew, and range of numeric data.

You can also modify the quartiles using the percentiles argument. Here, for example, we’re looking at the 30%, 50%, and 70% percentiles of the numeric columns in DataFrame df.

  df.describe(percentiles=[0.3, 0.5, 0.7])
  

Get summary statistics with specific percentiles

You can also isolate specific data types in your summary output by using the include argument. Here, for example, we’re only summarizing the columns with the integer data type.

  df.describe(include=[int])
  

Get summary statistics of integer columns only

Similarly, you might want to exclude certain data types using the exclude argument.

  df.describe(exclude=[int])
  

Get summary statistics of non-integer columns only

Often, practitioners find it easy to view such statistics by transposing them with the .T attribute.

  df.describe().T
  

Transpose summary statistics with .T

Understanding Data Using .info()

The .info() method is a quick way to look at the data types, missing values, and data size of a DataFrame. Here, we’re setting the show_counts argument to True, which gives an overview of the total non-missing values in each column. We’re also setting memory_usage to True, which shows the total memory usage of the DataFrame elements. When verbose is set to True, it prints the full summary from .info().

  df.info(show_counts=True, memory_usage=True, verbose=True)
  

Understanding Your Data Using .shape

The number of rows and columns of a DataFrame can be identified using the .shape attribute of the DataFrame. It returns a tuple (row, column) and can be indexed to get only rows or only columns count as output.

  df.shape # Get the number of rows and columns
df.shape[0] # Get the number of rows only
df.shape[1] # Get the number of columns only
  

Get All Columns and Column Names

Calling the .columns attribute of a DataFrame object returns the column names in the form of an Index object. As a reminder, a pandas index is the address/label of the row or column.

  df.columns
  

Output of columns:

It can be converted to a list using the list() function.

  list(df.columns)
  

Checking for Missing Values in Pandas with .isnull()

The sample DataFrame does not have any missing values. Let’s introduce a few to make things interesting. The .copy() method makes a copy of the original DataFrame. This is done to ensure that any changes to the copy don’t reflect in the original DataFrame. Using .loc (to be discussed later), you can set rows two to five of the Pregnancies column to NaN values, which denote missing values.

  df2 = df.copy()
df2.loc[2:5, 'Pregnancies'] = None
df2.head(7)
  

Rows 2 to 5 are NaN

You can check whether each element in a DataFrame is missing using the .isnull() method.

  df2.isnull().head(7)
  

Given it’s often more useful to know how much missing data you have, you can combine .isnull() with .sum() to count the number of nulls in each column.

  df2.isnull().sum()
  
  Pregnancies                 4
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
  

You can also do a double sum to get the total number of nulls in the DataFrame.

  df2.isnull().sum().sum()
  
  4
  

Learn How To Build AI Projects

Now, if you are interested in upskilling in 2024 with AI development, check out this 6 AI advanced projects with Golang where you will learn about building with AI and getting the best knowledge there is currently. Here’s the link.

Last updated 17 Aug 2024, 12:31 +0200 . history