Updated: Aug 15
Jim Dowling is CEO of Hopsworks and an Associate Professor at KTH Royal Institute of Technology. He co-developed the open-source Hopsworks Feature Store platform and leads the featurestore.org community.
Bridging Python and Lakehouse worlds with Arrow Flight and DuckDB
High-throughput Python access to lakehouse data is a challenge for ML, for both model training and model inference. Data volumes are increasing, but Python is also stepping up with new frameworks that scale data processing on a single host, such as Polars, Pandas 2+, and DuckDB.
In this talk we introduce the Arrow Flight and DuckDB service that we built for retrieving data from an offline feature store to Python clients. The service uses DuckDB to read data from files in a lakehouse (Hudi, Parquet, etc.), the Arrow format so that data does not need to be serialized/deserialized, and an Arrow Flight service to transfer Arrow data to clients, avoiding the JDBC/ODBC impedance mismatch (where columnar data is transformed into row-oriented data and back again). Pandas/Polars clients can then read the Arrow data directly: Arrow all the way from the lakehouse to Python.
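To make the JDBC/ODBC impedance mismatch concrete, here is a toy sketch in plain Python. It is not the service described in the talk and uses no Arrow, DuckDB, or JDBC code: the `columns` batch and the helpers `columns_to_rows` and `rows_to_columns` are hypothetical names made up for illustration. It only shows the two pivots that a row-oriented driver forces on data that starts and ends columnar, which is the work an Arrow-native path can skip.

```python
# Toy columnar batch: one Python list per column (illustrative data only).
columns = {"id": [1, 2, 3], "amount": [9.5, 20.0, 3.25]}


def columns_to_rows(cols):
    """Pivot columnar data into row tuples, as a row-oriented driver would."""
    names = list(cols)
    return names, list(zip(*(cols[name] for name in names)))


def rows_to_columns(names, rows):
    """Pivot row tuples back into columns, as a dataframe client must."""
    if not rows:
        return {name: [] for name in names}
    return {name: list(values) for name, values in zip(names, zip(*rows))}


# Row-oriented path: every value is touched twice, once per pivot.
names, rows = columns_to_rows(columns)
round_tripped = rows_to_columns(names, rows)

# Columnar path: the batch is handed over as-is, with no pivot at all.
passed_through = columns

print(rows)
print(round_tripped == passed_through)
```

In this sketch both paths end with the same columns, but the row-oriented one copies every value twice along the way; keeping the data columnar end to end avoids that extra work.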