Ponder: The Efficient Data Scientist
By Assaf Araki
Imagine you could speak your native language (whether English, Arabic, Thai or Zulu) and everyone in the world understood you perfectly and answered in their own native language, which you in turn understood. Much like a native spoken language, each user has a native application programming interface (API). For example, some use pandas, while others use a SQL dialect such as PL/SQL, T-SQL, or Spark SQL. Imagine using your preferred API on top of any compute engine. Wouldn't that be great?
Decoupling API and Compute
At the start, we had data warehouses like Oracle and Teradata, followed by cloud data warehouses like Snowflake and Redshift, all of which bundled storage, compute, and API together. It was like a “no substitutions or additions” disclaimer on a menu. Later came business intelligence (BI) tools like Tableau and Looker, as well as extract, transform and load (ETL) tools like Matillion and dbt, which offered a user-level interface or API while operating on multiple database backends. These tools abstracted away the translation from a single user-level API to many databases and compute engines, allowing you to use your preferred BI or ETL tool on top of various database management systems (DBMS) in the cloud and on-premises. They require users to learn a proprietary API, but once users do so, they can switch between compute engines with a simple configuration change.
End-users invest considerable time and effort in learning one specific API, often because that API is popular in their organization. They want to keep using their preferred API, just as we keep using our native language, and to execute the same code on multiple databases and compute engines.
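The decoupling described in this section can be sketched in a few lines of Python. This is only an illustration, not how any of the tools named above are built: the function names, the `ENGINES` registry, and the sample rows are all hypothetical. One user-facing call, `total_by_name`, runs on either a plain-Python engine or an in-memory SQLite engine, chosen by a single configuration value.

```python
import sqlite3

# Hypothetical sample data: (product name, quantity sold).
ROWS = [("apple", 3), ("pear", 5), ("apple", 4)]


def total_by_name_python(rows):
    """Engine 1: aggregate in plain Python."""
    totals = {}
    for name, qty in rows:
        totals[name] = totals.get(name, 0) + qty
    return totals


def total_by_name_sqlite(rows):
    """Engine 2: translate the same request into SQL and run it in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (name TEXT, qty INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cur = conn.execute("SELECT name, SUM(qty) FROM sales GROUP BY name")
        return dict(cur.fetchall())
    finally:
        conn.close()


# Hypothetical registry: the "simple configuration" that selects an engine.
ENGINES = {"python": total_by_name_python, "sqlite": total_by_name_sqlite}


def total_by_name(rows, engine="python"):
    """The single user-level API; the caller never writes engine-specific code."""
    return ENGINES[engine](rows)
```

Calling `total_by_name(ROWS, engine="sqlite")` instead of `total_by_name(ROWS, engine="python")` changes where the work runs without changing the code the user writes against.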
The Data Scientist Efficiency Gap
Data scientists (DS) still spend most of their time on preprocessing tasks before training a model. Anaconda’s 2020 State of Data Science survey found that 45% of data scientists’ time is spent on data loading and cleaning, also known as “data wrangling,” with data visualization coming in second at approximately 21%. Data scientists’ preferred API for data wrangling is pandas, but pandas runs on a native compute layer that is not scalable. The question is: can we decouple the pandas API (that so many users love) from its compute layer and run it on scalable compute engines?
Ponder addresses this efficiency gap by providing an enterprise-ready version of pandas that is scalable and easy to use. Ponder builds on Modin and Lux, two open source projects created by researchers at the UC Berkeley RISELab, with over 10K GitHub stars and over 2.5M downloads. Modin is a scalable “drop-in replacement” for pandas: it closes the efficiency gap in data wrangling by scaling up to large datasets while asking users to change no more than the import line of their code. Lux is a visualization tool for pandas that closes the efficiency gap in data visualization by automatically identifying visual insights in large and complex datasets. Ponder eliminates the costly process of translating code from data scientists’ preferred API (pandas) to other big data frameworks by allowing them to run pandas for data wrangling and visualization easily and efficiently at scale.
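As a rough sketch of what “drop-in replacement” means in practice: Modin exposes the pandas API under `modin.pandas`, so the usual switch is replacing `import pandas as pd` with `import modin.pandas as pd`. The helper names below (`pick_dataframe_library`, `missing_share`) are hypothetical and not part of Modin, pandas, or Ponder; they only illustrate that the wrangling code itself does not need to know which library it received.

```python
import importlib


def pick_dataframe_library(candidates=("modin.pandas", "pandas")):
    """Return the first importable module from `candidates`, or None.

    Hypothetical helper: prefers Modin and falls back to pandas.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def missing_share(pd, records):
    """Fraction of missing values per column, written against the pandas API.

    `pd` may be pandas or modin.pandas; the body is identical for both.
    """
    df = pd.DataFrame(records)
    return df.isna().mean().to_dict()
```

With either library installed, `missing_share(pick_dataframe_library(), records)` runs the same wrangling code; only the module behind `pd` differs.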
At Intel, we have long collaborated with academic centers of excellence across the globe, including UC Berkeley. In the case of Ponder, this collaboration extends from Intel Labs to engineers in various business units: Intel was an early contributor to Modin and released the Intel® Distribution of Modin, a performant, parallel, and distributed dataframe system.
Ponder: From Research to Industry
Today, I’m proud to join the effort of the Ponder team and continue the journey started in the research lab to bring efficiency and scale to data scientists. Data scientists can swap pandas for Modin as a drop-in replacement and use Lux for better insight into their data. Likewise, thanks to the scalability of Modin, developers can now run the same pandas code in production systems, which accelerates the deployment of models into production and increases both the business impact and the ROI of data science and ML/AI. Ponder is well positioned to meet a pressing need in data teams everywhere, and has demonstrated traction showing that it is the solution to the pervasive issue of lost productivity.