
My browser WASM’t prepared for this. Using DuckDB, Apache Arrow and Web Workers in real life

Przemyslaw Maciolek
November 29, 2023

Introduction

At Motif Analytics, we are building a highly interactive analytics tool that lets users find insights in relatively large datasets, fully in-browser. We also have a more traditional cloud mode, but the “local” (fully in-browser) mode is an important way for our users to experiment with Motif and see whether it matches their needs, without making any commitments on their end. To make that possible, we employ several techniques, including DuckDB WASM, Apache Arrow, and Web Workers.

Over the past months we have learned some upsides and downsides of these technologies. While this experience was gained on our specific use cases, we believe much of it is universal and might interest anyone working on modern, responsive web apps that process non-trivial volumes of data.

What does parallelization have to do with web browsers?

The idea of speeding up processing-intensive analytics tasks by parallelizing them is almost as old as computing itself. It has come a long way, starting with the shared-memory architectures of the 1960s and 1970s, later replaced by clusters, grid computing, and eventually the cloud. I still remember the fun of building academic projects with somewhat forgotten technologies such as PVM or MPI. The world is very different today, but ultimately all data-processing problems are similar at their core: they revolve around breaking large tasks into smaller independent pieces and synchronizing their execution.

The emergence of energy-efficient multi-core CPUs makes it possible to bring heavy computation even closer to the user by running it directly in the web browser. Computing-intensive tasks can be parallelized using Web Workers, which are supported by all modern browsers and are becoming an important tool for writing responsive apps, though with important limitations.
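
For instance, the main thread can spawn a pool of workers sized to the machine. A minimal sketch (the './worker.ts' module is hypothetical and would hold the heavy processing code):

// main.ts — spawn one worker per available core (a minimal sketch)
const cores = navigator.hardwareConcurrency ?? 4;
const workers = Array.from(
  { length: cores },
  // './worker.ts' is a hypothetical module with the heavy processing code
  () => new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' })
);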

Interactive data analysis can be streamlined by using familiar tools (such as describing queries in SQL). Enter DuckDB, an open-source database designed for OLAP (Online Analytical Processing). It is to data warehouses what SQLite is to RDBMSs. It has few external dependencies, a rich SQL dialect (beautifully documented), and runs very fast, largely thanks to leveraging the state of the art in database analytics. DuckDB WASM is a special version of DuckDB, compiled to WebAssembly, which lets it run in the browser’s virtual machine, alongside the page script.
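
Getting an instance running in a page takes only a few lines with the @duckdb/duckdb-wasm package; a sketch following the library’s bundle-selection flow:

import * as duckdb from '@duckdb/duckdb-wasm';

// Pick the WASM bundle that best matches the current browser
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());

// DuckDB WASM itself runs inside a dedicated Web Worker
const worker = new Worker(bundle.mainWorker!);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

const conn = await db.connect();
const answer = await conn.query(`SELECT 42 AS answer`);  // an Arrow table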

These two (Web Workers and DuckDB WASM) can be used together for a more sophisticated processing architecture. For example, some custom, domain-specific processing code can be executed in parallel using Web Workers and then fed to DuckDB WASM, where a convenient SQL query can handle the results and calculate aggregates.

Gluing the two together can be done with Apache Arrow, a language-agnostic framework built around a standardized column-oriented memory format.
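
In practice, a worker can serialize its partition into an Arrow IPC buffer and transfer it to the main thread without copying. A sketch using the apache-arrow JS package (the column contents are made up for illustration):

// worker.ts — build an Arrow table and hand it off as a transferable buffer
import { tableFromArrays, tableToIPC } from 'apache-arrow';

const table = tableFromArrays({
  id: Int32Array.from([1, 2, 3]),
  title: ['Iliad', 'Odyssey', 'Aeneid'],  // made-up sample data
});

const ipc = tableToIPC(table);            // Uint8Array in Arrow IPC format
self.postMessage(ipc, [ipc.buffer]);      // transfer ownership, zero-copy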

To make it even more compelling, the same architecture can be executed in server mode: Node.js worker threads and the DuckDB Node library can be organized in a very similar way, allowing for code reuse.

Wait! There are more use cases! DuckDB can also be used for (fast) interchange between popular data formats, such as Parquet, CSV, or JSON.
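
For example, converting between formats is a one-liner in DuckDB SQL:

-- read Parquet, write CSV, all in one statement
COPY (SELECT * FROM 'lineitem.parquet') TO 'lineitem.csv' (HEADER);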

While this all sounds exciting, it also brings challenges, which we discuss below.

DuckDB WASM Performance

While DuckDB WASM offers practically the same capabilities as any other version of DuckDB, it executes in a stack-based virtual machine rather than as native library code. While WASM is considered fast, it adds overhead compared to running natively. Additionally, DuckDB WASM is currently a single-threaded engine (a remnant of the now-outdated SharedArrayBuffer usage limitations), with experimental COI (cross-origin isolation) support.

Ultimately, this means that queries run much slower than in a native DuckDB instance (executed either through the shell or as a Node/Python/etc. library).

Example 1: TPC-H lineitem query

So what sort of performance difference can we expect from DuckDB WASM vs. native CLI? Here’s an ad-hoc benchmark on my M1 Pro.
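
The query itself is elided in the transcripts below; judging by the result columns, it has the shape of the classic TPC-H Query 1 aggregation, roughly:

-- TPC-H Q1 shape, shown for reference only (the exact query is elided below)
SELECT l_returnflag, l_linestatus,
       sum(l_quantity)                                       AS sum_qty,
       sum(l_extendedprice)                                  AS sum_base_price,
       sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
       count(*)                                              AS count_order
FROM 'lineitem.parquet'
WHERE l_shipdate <= DATE '1998-12-01' - INTERVAL 90 DAY
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;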

Result 1: DuckDB (native lib) CLI 0.9.2

D PRAGMA threads=1;
D .timer on
D select ...
┌──────────────┬──────────────┬──────────┬────────────────────┬─────┐
│ l_returnflag │ l_linestatus │ sum_qty  │   sum_base_price   │ ... │
│   varchar    │   varchar    │  double  │       double       │ ... │
├──────────────┼──────────────┼──────────┼────────────────────┼─────┤
│ A            │ F            │ 380456.0 │  532348211.6500006 │ ... │
│ N            │ F            │   8971.0 │        12384801.37 │ ... │
│ N            │ O            │ 742802.0 │ 1041502841.4499991 │ ... │
│ R            │ F            │ 381449.0 │  534594445.3500006 │ ... │
└──────────────┴──────────────┴──────────┴────────────────────┴─────┘
Run Time (s): real 0.807 user 0.590743 sys 0.035380
Result 2: DuckDB WASM Shell 0.9.2 on Chrome 119.0

duckdb> .timer on
duckdb> select ...
┌──────────────┬──────────────┬─────────┬────────────────────┬─────┐
│ l_returnflag ┆ l_linestatus ┆ sum_qty ┆ sum_base_price     ┆ ... │
╞══════════════╪══════════════╪═════════╪════════════════════╪═════╡
│ A            ┆ F            ┆  380456 ┆  532348211.6499983 ┆ ... │
│ N            ┆ F            ┆    8971 ┆ 12384801.369999997 ┆ ... │
│ N            ┆ O            ┆  742802 ┆ 1041502841.4499979 ┆ ... │
│ R            ┆ F            ┆  381449 ┆  534594445.3499986 ┆ ... │
└──────────────┴──────────────┴─────────┴────────────────────┴─────┘
Elapsed: 03.257 s

We’re getting 0.8s (native lib, single thread) vs. 3.3s (WASM), which makes WASM about 4x slower for this particular case. But hold on. What if we copy the data to a temporary table first?


CREATE TABLE lineitem AS SELECT * FROM 'lineitem.parquet';

This made both cases faster, taking the time down to 0.1s (native lib, single thread) vs. 0.4s (WASM), keeping the 4x factor we observed before.

Example 2: Calling an expensive function

How about another example? We will use the same dataset but construct a different query: one calling a somewhat expensive function that must be executed for every row.
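
Again, the exact query is elided in the transcripts; based on the output, it computes an MD5 hash per row and sorts, roughly along these lines (the choice of hashed column is an assumption):

-- hedged reconstruction: md5() over every row, sorted (column choice assumed)
SELECT md5(l_comment) AS md5_val
FROM lineitem
ORDER BY md5_val
LIMIT 5;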

Result 3: DuckDB (native lib) CLI 0.9.2

D PRAGMA threads=1;
D .timer on
D select ...
┌──────────────────────────────────┐
│             md5_val              │
│             varchar              │
├──────────────────────────────────┤
│ 0000031428caeaafcddcf1a1f5d4ea26 │
│ 000005560d5cc20991b8bb57e64f58a0 │
│ 00000aa836f58524e6ae1d75c9246511 │
│ 000016d700b533970506d1164d0ee003 │
│ 00001921a51a1880a184ed25fc3d95b8 │
└──────────────────────────────────┘
Run Time (s): real 1.679 user 1.533280 sys 0.098910

What if we increase the number of threads? Here’s what happens on my M1 Pro laptop:


D PRAGMA threads=8;
D select ...
┌──────────────────────────────────┐
│             md5_val              │
│             varchar              │
├──────────────────────────────────┤
│ 0000031428caeaafcddcf1a1f5d4ea26 │
│ 000005560d5cc20991b8bb57e64f58a0 │
│ 00000aa836f58524e6ae1d75c9246511 │
│ 000016d700b533970506d1164d0ee003 │
│ 00001921a51a1880a184ed25fc3d95b8 │
└──────────────────────────────────┘
Run Time (s): real 0.313 user 1.742058 sys 0.019078
Result 4: DuckDB WASM Shell 0.9.2 on Chrome 119.0

duckdb> .timer on
duckdb> select ...
┌──────────────────────────────────┐
│ md5_val                          │
╞══════════════════════════════════╡
│ 0000031428caeaafcddcf1a1f5d4ea26 │
│ 000005560d5cc20991b8bb57e64f58a0 │
│ 00000aa836f58524e6ae1d75c9246511 │
│ 000016d700b533970506d1164d0ee003 │
│ 00001921a51a1880a184ed25fc3d95b8 │
└──────────────────────────────────┘
Elapsed: 01.881 s

Now, the durations are almost on par. We’re getting 1.7s (native lib, single thread) vs 1.9s (WASM), which makes for a modest 10% penalty when executing the query as WASM. Quite impressive!

Takeaway

As always, YMMV depending on the properties of the dataset and the query complexity, but differences of this sort are not surprising. Unlike the native version, WASM runs in a virtual machine. Additionally, making it use multiple workers is fairly new (with the experimental next-coi version, the single-thread limitation should be lifted), and I'm not sure it can be recommended yet.

On top of the above, the memory available to WASM is limited by the browser (in the case of Chrome, the limit is currently set at 4GB per tab). This can be learned painfully when trying to load larger datasets or doing non-trivial joins. DuckDB is famous for “never throwing out of memory” (and rightfully so), but WASM cannot yet use local persistence for spilling data while processing a query that doesn’t fit in the available memory. The latter will hopefully change soon, with work underway. Until it does, though, it is just another known WASM limitation.

Despite all of this, DuckDB WASM is currently one of the fastest (if not the fastest) engines for querying fully in-browser. However, when possible, running native code (outside of the browser’s limitations) will always be faster and support much bigger queries.

Web Workers and Schema Consistency

In our architecture, we’re using Web Workers to parallelize a complex task and feed its output to DuckDB. Before we go into specific details, it’s worth noting that using Web Workers is quite different from, say, writing concurrent code in Golang, Java, or Rust. This is nicely discussed by Surma in his article, but TL;DR: Web Workers are largely independent from each other and can only pass byte-array buffers (either as a SharedArrayBuffer or a transferable ArrayBuffer).

This implies that the volume of messages used for synchronization should be fairly low, and special care must be taken when selecting the serialization/deserialization method. It’s another reason why Apache Arrow fits neatly here: its IPC buffers are a perfect match (the vector data model does not even need to be serialized and can be passed as chunks of memory).
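
On the receiving end, the transferred buffer can be wrapped back into a table without copying the data. A sketch (worker wiring as in the earlier snippets):

// main.ts — receive the Arrow IPC buffer from a worker, zero-copy
import { tableFromIPC } from 'apache-arrow';

worker.onmessage = (event: MessageEvent<Uint8Array>) => {
  const table = tableFromIPC(event.data);  // a view over the transferred bytes
  console.log(table.numRows, table.schema.fields.map((f) => f.name));
};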

DuckDB adds only a very small overhead when reading data in Arrow format while executing a query, which makes this a very compelling combination, exactly the one we would hope to use in a low-latency analytical application.

The resulting architecture is straightforward: each Web Worker runs the domain-specific processing on its own partition and emits an Apache Arrow table, and DuckDB then queries those tables as a single logical table.

Before we run the SQL query, we just need to collect the Apache Arrow tables from the workers and pass them to DuckDB, forming a single logical table. The following snippet illustrates the idea (a reconstruction sketch; the worker plumbing is elided and the helper is hypothetical):
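
import * as duckdb from '@duckdb/duckdb-wasm';

// Hedged sketch: 'partitions' holds the Arrow IPC buffers collected from the
// workers (plumbing elided). The first insert creates the table, the rest
// append to it, forming one logical table named 'books'.
async function loadPartitions(db: duckdb.AsyncDuckDB, partitions: Uint8Array[]) {
  const conn = await db.connect();
  try {
    for (const [i, ipcBuffer] of partitions.entries()) {
      await conn.insertArrowFromIPCStream(ipcBuffer, {
        name: 'books',
        create: i === 0,
      });
    }
  } finally {
    await conn.close();
  }
}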

Unfortunately, running it fails with the following error:


Conversion Error: Failed to insert into table 'books': 
    Could not convert string '4th century BC' to DOUBLE
       at O.onMessage (index.js:11583:17)

While each Arrow table is valid on its own and all their column names match, they have (slightly) inconsistent schemas and are thus rejected by DuckDB while executing the insert. As it turns out, this can be tricky even in simpler cases, where a field can be null in some partitions, as seen in ARROW-2860, an issue that is five years old.

Using multiple Arrow tables, where each might contain a separate set of RecordBatches, doesn’t seem to be addressed by the Apache Arrow spec, so it can perhaps be considered a “grey area”; it seems best to do whatever is possible to avoid having to deal with this problem in the first place.

Ultimately, there seems to be no simple generic solution to schema reconciliation. The C++ Arrow API has ConcatenateTables, which can handle some cases when the unify_schemas option is set to true (with more details provided as needed), but it cannot be called from JavaScript (or any other language where FFI is not feasible). There are some exotic initiatives to bridge the gap and expose the richness of the C++ library, like Apache Arrow WebAssembly library implementations, but these do not seem to be actively developed or are based on an older Arrow version.

Eventually, one must handle this case oneself or (preferably) avoid dealing with schema reconciliation in the first place, though that might not always be possible.
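
Short of full reconciliation, a pragmatic option is to fail fast when partition schemas disagree, before DuckDB rejects the insert halfway through. A minimal sketch:

import { Table } from 'apache-arrow';

// Compare every partition's schema against the first one by column name
// and type, and bail out early on any mismatch.
function assertSameSchema(tables: Table[]): void {
  const reference = tables[0].schema;
  for (const table of tables.slice(1)) {
    for (const field of table.schema.fields) {
      const ref = reference.fields.find((f) => f.name === field.name);
      if (!ref || String(ref.type) !== String(field.type)) {
        throw new Error(`Schema mismatch on column '${field.name}'`);
      }
    }
  }
}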

Gaps in the capabilities and… bugs

While DuckDB WASM has been getting a lot of attention recently, it’s perhaps the Python and Java libraries that are used most actively (a year ago, they were getting an order of magnitude more downloads than the Node.js library). As such, probably neither the WASM nor the Node.js DuckDB library should yet be considered a very mature project. The community is very active and the DuckDB team puts good effort into it, but simply put, various gaps are easy to run into.

To give a personal perspective, here are some examples, which I hit recently:

  • Querying some JSON strings or Parquet files throws a “memory access out of bounds” error. It doesn’t seem to happen on DuckDB WASM v1.27.0 or earlier, so downgrading is a viable workaround until the fix is made.
  • Only the first table is registered by register_buffer() - the function’s signature implies it might be possible to pass several Apache Arrow tables, which could form a single logical DuckDB table. That’s not the case, though. The workaround I found is registering each Apache Arrow table separately and concatenating them into a temporary table (see the sketch after this list). Probably a better one would be reconciling all record batches and registering a single Apache Arrow table.
  • Crashing the library on some UNION ALL and ORDER BY queries - the fix for this one is under active development. This leaves the DuckDB Node library in an unrecoverable state, and the whole process needs to be restarted (ouch!). Downgrading to DuckDB v0.8 works as a temporary workaround.
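
The temporary-table workaround mentioned above looks roughly like this (a sketch with the DuckDB Node library; two partitions and the ipcBuffer0/ipcBuffer1 variables are assumed for brevity):

import duckdb from 'duckdb';

const db = new duckdb.Database(':memory:');

// Register each worker's Arrow IPC buffer under its own name,
// then concatenate them into one temporary table.
db.register_buffer('books_part_0', [ipcBuffer0], true);
db.register_buffer('books_part_1', [ipcBuffer1], true);

db.all(
  `CREATE TEMP TABLE books AS
   SELECT * FROM books_part_0
   UNION ALL
   SELECT * FROM books_part_1`,
  (err) => { if (err) throw err; }
);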

All of this might look underwhelming, but it can also be seen as a consequence of the sudden explosion of interest in the project, which needs to go through a teething stage. The community is welcoming contributors, and as with any open-source project, it’s ultimately in the hands of us, the users, to report and help fix the discovered issues.

Also, some of the limitations from just a few months ago are no longer present thanks to the hard work of the team. E.g., it’s now possible to use many of the extensions in DuckDB WASM. I had to write some workarounds for this a few months ago (when I needed better JSON support); these are no longer needed thanks to the JSON extension being enabled in DuckDB WASM (and that’s just one of them). It’s a topic for another post, but DuckDB’s extension architecture is truly amazing.

Wrap up

Building fast in-browser processing stacks can be accelerated by using some of the recently popularized technologies. The recipe is basically: take DuckDB WASM (data interchange and SQL), Web Workers (parallelizing processing-intensive, domain-specific tasks) and glue them together via Apache Arrow.

Unfortunately, they all have limitations, which sometimes require extra effort to make them work together. I believe many of these issues will go away in the upcoming months. For example, DuckDB WASM’s capacity limitations should improve considerably with the addition of OPFS support and the adoption of the COI build. The “teething problems” will hopefully diminish as the community does more testing and the discovered bugs get resolved.

From a developer’s perspective, one can look at this as an opportunity. DuckDB brings many sought-after capabilities and is gaining traction at an accelerating pace. We are going to see even more use cases for it. As of today, it still has some challenges, which makes it a terrific opportunity to jump in, help the community, and learn at the same time.

If you're interested in learning more, you can reach out to us at hello@motifanalytics.com.