Extending Pandas with PyArrow: Enhanced Functionality and Performance
Pandas, the go-to data manipulation library in Python, can extend its capabilities and improve performance by leveraging PyArrow. PyArrow provides a robust interface for working with Apache Arrow, a columnar in-memory data format optimized for analytics. In this post, we’ll explore how Pandas integrates with PyArrow to offer more extensive data types, improved support for missing data, performant IO operations, and interoperability with other data frame libraries.
PyArrow Functionality in Pandas
PyArrow enhances Pandas’ functionality in several key areas:
- More Extensive Data Types Compared to NumPy: PyArrow supports a broader range of data types, including more complex types such as decimal and map.
- Missing Data Support (NA) for All Data Types: Unlike NumPy, PyArrow provides consistent missing data (NA) support across all data types.
- Performant IO Reader Integration: PyArrow can significantly accelerate IO operations such as reading from CSV or JSON files.
- Interoperability with Other DataFrame Libraries: Based on the Apache Arrow specification, PyArrow facilitates seamless interoperability with other libraries like Polars and cuDF.
Minimum Supported PyArrow Version
To take advantage of these features, make sure the installed PyArrow meets the minimum version supported by your Pandas release. You can install PyArrow using pip:
pip install pyarrow
Data Structure Integration
Creating PyArrow-Backed Pandas Objects
Pandas allows you to create Series, Index, or DataFrame columns backed by PyArrow. This is achieved by passing a PyArrow-backed type to the dtype parameter, for example a string alias made of the PyArrow type name followed by [pyarrow].
Example: Creating a Series with PyArrow
import pandas as pd
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
print(ser)
Output:
0 -1.5
1 0.2
2 <NA>
dtype: float[pyarrow]
Example: Creating an Index with PyArrow
idx = pd.Index([True, None], dtype="bool[pyarrow]")
print(idx)
Output:
Index([True, <NA>], dtype='bool[pyarrow]')
Example: Creating a DataFrame with PyArrow
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
print(df)
Output:
   0  1
0  1  2
1  3  4
Differences Between string[pyarrow] and pd.ArrowDtype(pa.string())
The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow"), which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string()). Operations on the data generally behave the same way, but the return types can differ: pd.StringDtype("pyarrow") can return NumPy-backed nullable types, while pd.ArrowDtype(pa.string()) returns ArrowDtype results.
import pyarrow as pa
import pandas as pd
data = list("abc")
ser_sd = pd.Series(data, dtype="string[pyarrow]")
ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))
print(ser_ad.dtype == ser_sd.dtype) # Output: False
print(ser_sd.str.contains("a"))
print(ser_ad.str.contains("a"))
Output:
False
0 True
1 False
2 False
dtype: boolean
0 True
1 False
2 False
dtype: bool[pyarrow]
Handling More Complex Data Types
For PyArrow types that accept parameters, you can pass a PyArrow type with those parameters into ArrowDtype.
Example: List of Strings
list_str_type = pa.list_(pa.string())
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
print(ser)
Output:
0 ['hello']
1 ['there']
dtype: list<item: string>[pyarrow]
Example: Time and Decimal Types
from datetime import time
from decimal import Decimal
# Time Type
time_type = pd.ArrowDtype(pa.time64("us"))
idx = pd.Index([time(12, 30), None], dtype=time_type)
print(idx)
# Decimal Type
decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
df = pd.DataFrame(data, dtype=decimal_type)
print(df)
Output:
Index([12:30:00, <NA>], dtype='time64[us][pyarrow]')
      0      1
0  3.19   <NA>
1  <NA>  -1.23
Working with PyArrow Arrays and ChunkedArrays
If you already have a PyArrow Array or ChunkedArray, you can construct the associated Pandas objects directly.
Example: Creating a Series from a PyArrow Array
pa_array = pa.array([{"1": "2"}, {"10": "20"}, None], type=pa.map_(pa.string(), pa.string()))
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
print(ser)
Output:
0 [('1', '2')]
1 [('10', '20')]
2 <NA>
dtype: map<string, string>[pyarrow]
Retrieving PyArrow Arrays from Pandas Objects
You can retrieve the underlying PyArrow data from a Pandas Series or Index by calling the PyArrow array constructor, pa.array(), on it.
Example: Retrieving PyArrow Array
ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
print(pa.array(ser))
idx = pd.Index(ser)
print(pa.array(idx))
Output:
<pyarrow.lib.UInt8Array object at 0x7ff2a2968400>
[
1,
2,
null
]
<pyarrow.lib.UInt8Array object at 0x7ff2a2968460>
[
1,
2,
null
]
Converting a PyArrow Table to a DataFrame
You can convert a PyArrow Table to a Pandas DataFrame using the to_pandas() method. Passing types_mapper=pd.ArrowDtype keeps the resulting columns PyArrow-backed.
Example: Converting PyArrow Table
table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])
df = table.to_pandas(types_mapper=pd.ArrowDtype)
print(df)
print(df.dtypes)
Output:
   a
0  1
1  2
2  3
a int64[pyarrow]
dtype: object
PyArrow-Accelerated Operations
Pandas integrates PyArrow to accelerate several operations, including numeric aggregations, arithmetic, logical operations, and more.
Examples of Accelerated Operations
import pyarrow as pa
ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
print(ser.mean()) # Output: -0.6669999808073044
print(ser + ser) # Output: 0 -3.09, 1 0.422, 2 <NA>
print(ser > (ser + 1)) # Output: 0 False, 1 False, 2 <NA>
print(ser.dropna()) # Output: 0 -1.545, 1 0.211
print(ser.isna()) # Output: 0 False, 1 False, 2 True
print(ser.fillna(0)) # Output: 0 -1.545, 1 0.211, 2 0.0
String Operations
ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
print(ser_str.str.startswith("a"))
Output:
0 True
1 False
2 <NA>
dtype: bool[pyarrow]
Datetime Operations
from datetime import datetime
pa_type = pd.ArrowDtype(pa.timestamp("ns"))
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
print(ser_dt.dt.strftime("%Y-%m"))
Output:
0 2022-01
1 <NA>
dtype: string[pyarrow]
PyArrow-Accelerated IO Reading
PyArrow can be used to speed up various IO operations in Pandas, such as reading from CSV or JSON files. You can specify the engine="pyarrow" parameter to utilize PyArrow’s capabilities.
Example: Reading CSV with PyArrow
import io
data = io.StringIO("""a,b,c
1,2.5,True
3,4.5,False
""")
df = pd.read_csv(data, engine="pyarrow")
print(df)
Output:
   a    b      c
0  1  2.5   True
1  3  4.5  False
Returning PyArrow-Backed Data
By default, these readers still return NumPy-backed data, even with engine="pyarrow". To return PyArrow-backed data, use the dtype_backend="pyarrow" parameter.
Example: Returning PyArrow-Backed Data
data = io.StringIO("""a,b,c,d,e,f,g,h,i
1,2.5,True,a,,,,,
3,4.5,False,b,6,7.5,True,a,
""")
df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
print(df_pyarrow.dtypes)
Output:
a int64[pyarrow]
b double[pyarrow]
c bool[pyarrow]
d string[pyarrow]
e int64[pyarrow]
f double[pyarrow]
g bool[pyarrow]
h string[pyarrow]
i null[pyarrow]
dtype: object
Conclusion
Integrating PyArrow with Pandas extends the library’s functionality, improves performance, and enables more complex data manipulations. By leveraging PyArrow, you can handle a wider range of data types, achieve better missing data support, and accelerate various operations and IO tasks. Ensure you have PyArrow installed and explore the enhanced capabilities it brings to your Pandas workflows.
Last updated 17 Aug 2024, 12:31 +0200.