Introduction to Data Structures in Pandas
This is an introduction to data structures in pandas, a Python package for data analysis.
Pandas is a powerful and widely used data manipulation library for Python. It provides versatile data structures and functions designed to make data analysis simple and efficient. This post introduces the two primary data structures in Pandas: Series and DataFrame. Understanding them is fundamental to using Pandas effectively.
Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to a column in a spreadsheet or a database table. Let’s dive into some key characteristics and functionalities of Series.
Series is ndarray-like
A Pandas Series behaves much like a NumPy ndarray and stores its values in an array. As a result, a Series supports many ndarray capabilities, such as element-wise operations and slicing, and it can be passed to most NumPy functions.
import pandas as pd
import numpy as np
data = np.array([1, 2, 3, 4])
s = pd.Series(data)
print(s)
Output:
0 1
1 2
2 3
3 4
dtype: int64
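As a minimal sketch of the ndarray-like behaviour described above (the variable names are illustrative and not part of the original example), arithmetic and NumPy functions apply element-wise, and slicing keeps the index labels:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 4]))

doubled = s * 2          # element-wise arithmetic, no explicit loop
roots = np.sqrt(s)       # NumPy ufuncs accept a Series and return a Series
middle = s.iloc[1:3]     # positional slice; the original labels 1 and 2 are kept
as_array = s.to_numpy()  # the values as a plain ndarray

print(doubled.tolist())
print(middle.index.tolist())
```

Unlike a bare ndarray, the sliced result still carries its labels, which is what makes the alignment behaviour shown later possible.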
Series is dict-like
A Series can also be thought of as a fixed-size, ordered dictionary. It is an ideal structure for working with time series or other labeled data. You can access elements using labels (indexes) just like you would in a dictionary.
data = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data)
print(s)
print(s['a'])
Output:
a 1
b 2
c 3
dtype: int64
1
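The dictionary analogy extends a little further than label lookup. A short sketch, assuming the same three-element Series as above, of membership tests and safe lookups:

```python
import pandas as pd

s = pd.Series({'a': 1, 'b': 2, 'c': 3})

has_b = 'b' in s          # membership checks the index labels, not the values
missing = s.get('z')      # .get returns None instead of raising KeyError
fallback = s.get('z', 0)  # an explicit default can be supplied

print(has_b, missing, fallback)
```

Plain `s['z']` would raise a `KeyError`, just as a dictionary would.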
Vectorized Operations and Label Alignment with Series
Pandas Series support vectorized operations, which allow you to perform operations on entire arrays without writing explicit loops. This feature leverages the speed of NumPy operations. Additionally, Series automatically aligns data based on the labels during arithmetic operations.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['a', 'b', 'd'])
s = s1 + s2
print(s)
Output:
a 5.0
b 7.0
c NaN
d NaN
dtype: float64
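Labels that appear in only one operand produce NaN, and the result is upcast to float64 so it can hold those missing values. If that is not what you want, one possible approach (a sketch, not the only option) is the `add` method with `fill_value`, or dropping the unmatched labels afterwards:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['a', 'b', 'd'])

# Treat a label missing from one side as 0 before adding
filled = s1.add(s2, fill_value=0)

# Keep only the labels present in both Series
overlap = (s1 + s2).dropna()

print(filled)
print(overlap)
```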
Name Attribute
The name attribute assigns a name to the Series object itself, which can be useful for debugging and for keeping track of data in larger datasets. The index has its own, separate name attribute.
s = pd.Series([1, 2, 3], name="numbers")
print(s)
print(s.name)
Output:
0 1
1 2
2 3
Name: numbers, dtype: int64
numbers
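A brief sketch of how the two names differ, assuming the same Series as above (the names "values" and "row" are arbitrary choices for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="numbers")

renamed = s.rename("values")  # returns a new Series; s keeps its own name
s.index.name = "row"          # the index carries a separate name

print(renamed.name)
print(s.name, s.index.name)
```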
DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used Pandas object. Let’s explore various ways to create DataFrames and their functionalities.
From Dict of Series or Dicts
You can create a DataFrame from a dictionary of Series or dictionaries. Each Series becomes a column in the DataFrame.
data = {
'col1': pd.Series([1, 2, 3]),
'col2': pd.Series([4, 5, 6])
}
df = pd.DataFrame(data)
print(df)
Output:
col1 col2
0 1 4
1 2 5
2 3 6
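When the Series do not share the same index, the resulting DataFrame uses the union of their labels and fills the gaps with NaN. A small sketch with made-up labels:

```python
import pandas as pd

data = {
    'col1': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'col2': pd.Series([4, 5], index=['b', 'c']),  # no value for label 'a'
}
df = pd.DataFrame(data)

print(df)
print(df.dtypes)
```

Here `col2` becomes float64 because it has to hold a missing value for row `a`.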
From Dict of ndarrays / Lists
A DataFrame can also be created from a dictionary of NumPy arrays or lists. The arrays must be of the same length.
data = {
'col1': [1, 2, 3],
'col2': [4, 5, 6]
}
df = pd.DataFrame(data)
print(df)
Output:
col1 col2
0 1 4
1 2 5
2 3 6
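To illustrate the same-length requirement, the sketch below passes custom row labels through the `index` argument and then shows that mismatched lengths raise a `ValueError` (the labels and the `mismatch_error` variable are illustrative only):

```python
import pandas as pd

# Custom row labels; the index must match the length of the lists
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df)

# Lists of different lengths are rejected
try:
    pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5]})
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```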
From Structured or Record Array
DataFrames can be constructed from structured or record arrays. The field names of the array become the column names.
data = np.array([(1, 'A'), (2, 'B'), (3, 'C')], dtype=[('num', 'i4'), ('letter', 'U1')])
df = pd.DataFrame(data)
print(df)
Output:
num letter
0 1 A
1 2 B
2 3 C
From a List of Dicts
Creating a DataFrame from a list of dictionaries is straightforward, with each dictionary representing a row.
data = [
{'a': 1, 'b': 2},
{'a': 3, 'b': 4, 'c': 5}
]
df = pd.DataFrame(data)
print(df)
Output:
a b c
0 1 2 NaN
1 3 4 5.0
From a Dict of Tuples
A dictionary whose values are tuples works the same way as a dictionary of lists: each tuple becomes a column.
data = {
'col1': (1, 2, 3),
'col2': (4, 5, 6)
}
df = pd.DataFrame(data)
print(df)
Output:
col1 col2
0 1 4
1 2 5
2 3 6
From a Series
Creating a DataFrame from a Series is possible and results in a single-column DataFrame.
s = pd.Series([1, 2, 3], name="numbers")
df = pd.DataFrame(s)
print(df)
Output:
numbers
0 1
1 2
2 3
From a List of Namedtuples
You can also construct DataFrames from a list of namedtuples.
from collections import namedtuple
Person = namedtuple('Person', 'name age')
data = [Person('Alice', 25), Person('Bob', 30)]
df = pd.DataFrame(data)
print(df)
Output:
name age
0 Alice 25
1 Bob 30
From a List of Dataclasses
Similarly, DataFrames can be created from a list of dataclasses.
from dataclasses import dataclass
@dataclass
class Person:
    name: str
    age: int
data = [Person('Alice', 25), Person('Bob', 30)]
df = pd.DataFrame(data)
print(df)
Output:
name age
0 Alice 25
1 Bob 30
Alternate Constructors
Pandas offers alternate constructors for DataFrames, such as DataFrame.from_dict and DataFrame.from_records, to accommodate various data formats and structures. The older from_items constructor was removed in pandas 1.0.
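As a rough sketch of two constructors available in current pandas, `DataFrame.from_dict` and `DataFrame.from_records` (the data and the labels `r1`, `r2`, `x`, `y` are made up for illustration):

```python
import pandas as pd

# from_dict with orient='index': the dict keys become row labels
by_row = pd.DataFrame.from_dict(
    {'r1': [1, 2], 'r2': [3, 4]},
    orient='index',
    columns=['x', 'y'],
)

# from_records: a list of tuples, with one field promoted to the index
records = [(1, 'A'), (2, 'B')]
from_rec = pd.DataFrame.from_records(
    records, columns=['num', 'letter'], index='num'
)

print(by_row)
print(from_rec)
```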
Column Selection, Addition, Deletion
Selecting, adding, and deleting columns in a DataFrame are straightforward tasks.
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
# Select column
print(df['col1'])
# Add column
df['col3'] = [5, 6]
print(df)
# Delete column
del df['col2']
print(df)
Output:
0 1
1 2
Name: col1, dtype: int64
col1 col2 col3
0 1 3 5
1 2 4 6
col1 col3
0 1 5
1 2 6
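Beyond item assignment and `del`, a few methods cover the same ground with more control. A minimal sketch, assuming a fresh two-column DataFrame (the column name `mid` is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

df.insert(1, 'mid', [10, 20])       # add a column at a specific position
popped = df.pop('col2')             # remove a column and get it back as a Series
trimmed = df.drop(columns=['mid'])  # returns a new DataFrame; df is unchanged

print(df.columns.tolist())
print(popped.tolist())
print(trimmed.columns.tolist())
```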
Assigning New Columns in Method Chains
You can assign new columns while chaining methods using the assign method.
df = df.assign(col4=lambda x: x['col1'] + x['col3'])
print(df)
Output:
col1 col3 col4
0 1 5 6
1 2 6 8
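Because `assign` returns a new DataFrame rather than modifying the original, calls can be chained, and a later call can use a column created by an earlier one. A sketch with hypothetical column names `total` and `share`:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col3': [5, 6]})

result = (
    df
    .assign(total=lambda x: x['col1'] + x['col3'])   # first derived column
    .assign(share=lambda x: x['col1'] / x['total'])  # uses the column above
)

print(result)
print(df.columns.tolist())  # the original DataFrame is untouched
```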
Indexing / Selection
DataFrames offer robust indexing and selection capabilities. You can use .loc for label-based indexing and .iloc for positional indexing.
print(df.loc[0, 'col1'])
print(df.iloc[0, 0])
Output:
1
1
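The difference between the two indexers is easier to see with non-numeric row labels. The sketch below uses a made-up DataFrame with labels `x`, `y`, `z` and also shows boolean selection:

```python
import pandas as pd

df = pd.DataFrame(
    {'col1': [1, 2, 3], 'col3': [5, 6, 7]},
    index=['x', 'y', 'z'],
)

by_label = df.loc['x':'y', 'col1']  # label slices include both endpoints
by_pos = df.iloc[0:2, 0]            # positional slices exclude the stop
filtered = df[df['col3'] > 5]       # boolean mask keeps the matching rows

print(by_label.tolist())
print(by_pos.tolist())
print(filtered.index.tolist())
```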
Data Alignment and Arithmetic
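DataFrames automatically align data on both the row index and the column labels during arithmetic operations, similar to Series. In the example below the two DataFrames share no column labels, so every cell of the result is NaN.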
df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['b', 'a'])
result = df1 + df2
print(result)
Output:
A B
a NaN NaN
b NaN NaN
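For contrast, here is a sketch in which both DataFrames share the column `A` but list their rows in a different order (the values are made up). Rows are matched by label, not by position:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'A': [10, 20]}, index=['b', 'a'])

aligned = df1 + df2  # row 'a': 1 + 20, row 'b': 2 + 10

print(aligned.loc['a', 'A'])
print(aligned.loc['b', 'A'])
```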
Transposing
Transposing a DataFrame swaps its rows and columns.
print(df.T)
Output:
0 1
col1 1 2
col3 5 6
col4 6 8
Last updated 17 Aug 2024, 12:31 +0200 .