One billion row challenge 😱

The one billion row challenge 😱

Welcome to this new project where we crack open a one-billion-row CSV file. Yes, ONE BILLION!

This is a very fun challenge created by Gunnar Morling.

Your mission, should you choose to accept it, is deceptively simple: write a Java program for retrieving temperature measurement values from a text file and calculating the minimum, mean, and maximum temperatures per weather station. There’s just one caveat: the file has 1,000,000,000 rows!

Instead of using a Java program, we will use TypeScript, of course, with the Simple Data Analysis library (SDA), which I created (give it a ⭐). This is a great opportunity to test how powerful it can be!

If you get stuck at any point in this project, it might be helpful to review the previous lessons explaining the basics of SDA:

We will use Deno and VS Code. Check the Setup lesson if needed.

Let’s code!

Want to know when new lessons are available? Subscribe to the newsletter ✉️ and give a ⭐ to the GitHub repository to keep me motivated! Click here to get in touch.

Setup

To set up everything we need, let’s use setup-sda as in previous lessons.

Create a new folder, open it with VS Code, and run: deno -A jsr:@nshiab/setup-sda

Then run deno task sda to watch main.ts and its dependencies.

A screenshot of VS Code after running setup-sda.

💡

For SDA to work properly, it’s best to have at least version 2.1.9 of Deno. To check your version, you can run deno --version in your terminal. To upgrade it, simply run deno upgrade.

The data

In the original challenge, you have to fork a GitHub repo and run a script to generate a CSV file with 1 billion rows of data. But I did this for you and uploaded the compressed file to my Google Drive.

Download the data from here. It might take a few minutes since it’s around 4 GB.

Then put it in your data folder and unzip it. The uncompressed version of the CSV file is around 13 GB.

To load the data, copy and paste this code into your main.ts. Depending on your computer, it might take a few seconds to a few minutes to run.

main.ts
import { SimpleDB } from "@nshiab/simple-data-analysis";
 
const sdb = new SimpleDB({ logDuration: true });
 
const table = sdb.newTable();
 
await table.loadData("sda/data/measurements.csv");
await table.logTable();
 
await sdb.done();

A screenshot of VS Code after loading one billion rows of data.

Congratulations! You just loaded 1,000,000,000 rows of data! On my MacBook Pro M1 with 16 GB of RAM, it took around 26 seconds. 🥳

Depending on the RAM available on your computer, you might see a .tmp folder appearing in your project. If the data is bigger than your RAM, this folder will be used to process all of it by storing processed chunks in it.

This .tmp folder can become quite big. On my machine, after the first run, it’s around 8 GB.

💡

If you want to clean your cache, run deno task clean. This will remove .tmp. You can also delete it manually, but don’t forget to empty your trash.

The challenge

As you can see above, the data is just a list of cities with temperatures.

Now, the challenge is to compute the minimum, mean, and maximum temperatures per city. The results also need to be sorted alphabetically by city.

The code is very simple with SDA. You don’t even need to sort the rows manually because summarize does it by default.

main.ts
import { SimpleDB } from "@nshiab/simple-data-analysis";
 
const sdb = new SimpleDB({ logDuration: true });
 
const table = sdb.newTable();
await table.loadData("sda/data/measurements.csv");
await table.summarize({
  values: "temp",
  categories: "city",
  summaries: ["min", "mean", "max"],
});
await table.logTable();
 
await sdb.done();

A screenshot of VS Code after aggregating one billion rows of data.

Loading and aggregating the one billion values took less than 50 seconds on my machine with SDA! 🚀

Speeding things up

When you use methods like loadData and summarize, SDA performs various checks to keep the methods easy to use and prevent potential mistakes.

However, when you work with a file this big, these checks can take quite some time…

If you know SQL, you can write and run your own query to do exactly what you want without any extra overhead!

Here’s how to run an SQL query that takes on the challenge. Note that I use LIMIT 10 to return just the first 10 rows of the result, but all computations are performed.

main.ts
import { SimpleDB } from "@nshiab/simple-data-analysis";
 
const sdb = new SimpleDB({ logDuration: true });
 
const result = await sdb.customQuery(`
SELECT city,
    MIN(temp) AS min_temp,
    ROUND(MEAN(temp), 1) AS mean_temp,
    MAX(temp) as max_temp
FROM read_csv('sda/data/measurements.csv')
GROUP BY city
ORDER BY city
LIMIT 10;
`,
  { returnDataFrom: "query" },
);
 
console.table(result);
 
await sdb.done();

A screenshot of VS Code after aggregating one billion rows of data.

And this runs in just… 16 seconds! 💥💥💥

But… how?

While SDA is a TypeScript library, under the hood, it uses DuckDB, a very fast in-process database system written in C++.

When you use methods like loadData or summarize, SDA translates everything into SQL and asks DuckDB to execute it.

In my experience, using SDA is faster than using Python with Pandas or R with the Tidyverse.

To test and compare the library’s performance, I calculated the average temperature per decade and city using the daily temperatures from the Adjusted and Homogenized Canadian Climate Data, which is around 1.7 GB with 22 million rows of data. For more details, check out the library repository.

I ran the same calculations with simple-data-analysis@4.0.1 (Node.js, Bun, and Deno), Pandas (Python), and the tidyverse (R).

Thanks to DuckDB, Simple Data Analysis was the fastest option.

A chart showing the processing duration of multiple scripts in various languages

However, nothing stops you from using DuckDB with Python and R. But you won’t have all the easy-to-use methods and examples I wrote for SDA. 😬

I also tested the geospatial computations, which are still under heavy development with DuckDB. Python and R are still the fastest for now! Again, for more details, check out the repository.

Conclusion

That’s it! I hope this project convinced you of how performant the Simple Data Analysis library is.

Nothing is stopping you from crunching massive datasets with TypeScript now! 💃🕺

Enjoyed this? Want to know when new lessons are available? Subscribe to the newsletter ✉️ and give a ⭐ to the GitHub repository to keep me motivated! Get in touch if you have any questions.