We held a workshop on FireDucks with faculty members from universities around Bangalore. Thank you for joining the discussion.| FireDucks – Posts
In general, a data scientist spends significant effort transforming raw data into a more digestible format before training an AI model or creating visualizations. Traditional tools such as pandas have long been the linchpin of this process, offering powerful capabilities but not without limitations. Because of its single-core implementation and inefficient data structures, we often face performance issues when using pandas on relatively large data, but its performanc...
SWITCHING GROUPBY In this article, we introduce the acceleration techniques for “groupby” used in FireDucks. The groupby operation is one of the most fundamental and important operations in tabular data analysis: we can use it to obtain important statistical properties such as the mean and variance of the data, and we can combine it with other operations to derive new features. FireDucks optimizes groupby based on the characteristics of the data. One such optim...
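The kind of per-group statistics and feature derivation the post mentions can be sketched in plain pandas (the table and column names here are illustrative, not taken from the article):

```python
import pandas as pd

# Toy table; "store" and "sales" are made-up names for illustration.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "sales": [10.0, 14.0, 3.0, 5.0, 7.0],
})

# Per-group statistics: mean and variance, as mentioned in the post.
stats = df.groupby("store")["sales"].agg(["mean", "var"])
print(stats)

# groupby combined with another operation to derive a new feature:
# each row's deviation from its group mean.
df["sales_centered"] = df["sales"] - df.groupby("store")["sales"].transform("mean")
```

FireDucks accepts the same API, so a snippet like this is also what its groupby optimizations would apply to.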
Application example: Spicy MINT at Toyota Technical Development Corporation| fireducks-dev.github.io
In the previous article, we talked about how FireDucks' lazy execution can cache intermediate results in order to avoid recomputing an expensive operation. In today's article, we will focus on the efficient data-flow optimization performed by its JIT compiler. We will first look at some best practices for large-scale data analysis in pandas, and then discuss how FireDucks' lazy execution model takes care of them automatically.| fireducks-dev.github.io
We explore the pitfalls of using the `%%time` magic command in Jupyter and other IPython notebooks to measure the execution time of FireDucks processes.| fireducks-dev.github.io
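The core pitfall is general to any lazily evaluated library: a cell can finish almost instantly because it only *records* the computation, which then runs later when the result is materialized. Here is a toy pure-Python sketch of that effect; the `LazyResult` class is an illustration only, not FireDucks' API:

```python
import time

# Toy stand-in for a lazily evaluated library (NOT FireDucks itself):
# construction only records work; computation happens on demand.
class LazyResult:
    def __init__(self, fn):
        self._fn = fn
        self._value = None
        self._done = False

    def collect(self):  # force evaluation, like printing a lazy frame
        if not self._done:
            self._value = self._fn()
            self._done = True
        return self._value

def expensive_sum(n):
    return sum(i * i for i in range(n))

# Pitfall: timing only the statement that *builds* the computation.
t0 = time.perf_counter()
r = LazyResult(lambda: expensive_sum(2_000_000))
build_time = time.perf_counter() - t0   # near zero: nothing has run yet

# Correct: time until the result is actually materialized.
t0 = time.perf_counter()
r.collect()
eval_time = time.perf_counter() - t0    # includes the real work

print(f"build: {build_time:.6f}s  evaluate: {eval_time:.6f}s")
```

With `%%time` the same trap applies per cell: if the cell only builds the lazy expression, the reported time excludes the actual computation.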
Research suggests that data scientists spend about 45% of their time on data preparation tasks, including loading (19%) and cleaning (26%) the data. pandas is one of the most popular Python libraries for tabular data processing because of its diverse utilities and large community support. However, due to its performance issues with large-scale data processing, there is a strong need in the community for high-performance data-frame libraries. Although there are many alternatives available at t...
FireDucks has a trace function that records how long each operation, such as read_csv, groupby, and sort, takes. This article introduces how to use it. How to output and display trace files: to use the trace function, you do not need to modify your program. Simply set the environment variable as shown below and run the program:

$ FIREDUCKS_FLAGS="--trace=3" python -m fireducks.pandas your_program.py

After setting the environment variable and executin...
We are currently developing a GPU version of FireDucks. FireDucks is built on an architecture that translates programs into an intermediate representation at runtime, optimizes them in that form, and then compiles and executes the intermediate representation for the backend. The currently released CPU version of FireDucks has a CPU backend; for the GPU version, the backend is changed to a GPU. This allows us to reuse the translation to and optimiz...
As described here, FireDucks uses a lazy execution model with define-by-run IR generation. Since FireDucks uses the MLIR compiler framework to optimize and execute its IR, the first step of execution is creating an MLIR function that holds the operations to be evaluated. This article describes how important this function-creation step is for optimization, and thus for performance. In the simple example below, execution of the IR is triggered by the print statement, which calls df2.__repr__(). df0 = pd.
In the previous article, we talked about how FireDucks takes care of projection-pushdown optimization for read_parquet(), read_csv(), etc. In today's article, we will focus on the efficient caching mechanism provided by its JIT compiler. Let's consider the sample query below on the same data used in the previous article:

import pandas as pd
df = pd.read_parquet("sample_data.parquet")
f_df = df.loc[df["a"] > 3, ["x", "y", "z"]]
r1 = f_df.groupby("x")["z"].sum()
print(r1)

When executing th...
The availability of runtime memory is often a challenge when processing larger-than-memory datasets with pandas. To solve the problem, one can either move to a system with a larger memory capacity or switch to an alternative library that supports distributed data processing (such as Dask or PySpark). But did you know that when working with data stored in columnar formats like CSV, Parquet, etc., and only part of the data is to be processed, manual optimization is possible e...
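A minimal sketch of that kind of manual optimization in plain pandas, using `usecols` so that unneeded columns are never materialized in memory (the file and column names are made up for illustration):

```python
import os
import tempfile
import pandas as pd

# Write a small CSV as a stand-in for a wide columnar file.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    f.write("a,b,c\n1,2,3\n4,5,6\n")

# Naive: load everything, even columns we never use.
full = pd.read_csv(path)

# Manual optimization: ask the reader for only the needed columns.
pruned = pd.read_csv(path, usecols=["a", "c"])
print(list(pruned.columns))
```

FireDucks' projection pushdown aims to get this effect automatically, without the user having to list the columns by hand.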
Recently we updated the results of the polars-tpch benchmark on a 4th-generation Xeon processor. The latest results can be found here and below in this article, along with how to reproduce them. For reproducibility, we used AWS EC2 for this evaluation: an m7i.8xlarge instance with an Ubuntu 24.04 image and a 128 GB EBS SSD. This instance includes: a 4th-generation Xeon processor, Intel(R) Xeon(R) Platinum 8488C (32 cores), and 128 GB of memory. Benchmark Result The graph shown bel...
Thank you for your interest in FireDucks. This article describes possible causes of, and remedies for, slow programs using FireDucks. When a pandas program run with FireDucks is slow, the reason may be one of the following: 1. using ‘apply’ or a loop; 2. using a pandas API not implemented in FireDucks. In case 1, changing the pandas program may make it faster. For example:

sum_val = 0
for i in range(len(df)):
    if df["A"][i] > 2:
        sum_val += df["B"][i]

A program using ’loop’...
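The usual remedy for case 1 is to replace the row-by-row loop with a vectorized mask-and-reduce, which both pandas and FireDucks can execute efficiently. A sketch on toy data (the data itself is made up; the loop mirrors the one in the post):

```python
import pandas as pd

# Toy data for illustration.
df = pd.DataFrame({"A": [1, 3, 2, 5], "B": [10, 20, 30, 40]})

# Loop version from the post: conditional sum over rows.
sum_val = 0
for i in range(len(df)):
    if df["A"][i] > 2:
        sum_val += df["B"][i]

# Vectorized equivalent: one boolean mask plus one reduction.
vec_val = df.loc[df["A"] > 2, "B"].sum()
assert vec_val == sum_val
```

The vectorized form also keeps the whole computation inside the library, where FireDucks' compiler can see and optimize it, instead of bouncing through Python element accesses.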