This is a first installment in a series of posts I plan on writing related to the basics of analytical SQL databases. This content probably most useful to junior or aspriring analytics engineers.

What is a columnar database?

Most well known modern data warehouses (Snowflake, BigQuery, Redshift, etc) are column-oriented, or just columnar databases. This means that unlike more mainstream SQL databases, data is stored in columns instead of rows. So for any particular table in the database, all the data for a certain column will be stored sequentially.

Why does this matter? Well, this means that typical queries that you would need to do analysis could be much faster on huge datasets.

Take a very simple aggregation such as this:

select sum(amount) from transactions

Let's say you're running this on a really big database table with millions of records. In a columnar database, since all the data in the amount column is optimized to be stored together, this query will be much faster than on a traditional database, especially if you haven't optimized the table for doing aggregations on that field.

We often do a lot of these kinds of aggregations (MIN, MAX, SUM, COUNT, AVG, etc) for analytics work, which columnar databases can do really well.

Another benefit of being columnar is that the database can optimize well for performing as much work in memory as possible, and easily distribute workloads in parallel onto a cluster of computers. In our query above, the DB engine only needs to load up data for the column amount to perform the count, and since it's all stored together, this is easy to do without having to scan the full rows.

If a table has millions or even billions of records, some columnar databases can store segments of all the data in our column on different nodes in of cluster, which can individualy compute the sum of each segement seperately, which can then be tallied into a final sum fairly quickly. In this way you can scale an operation that would have been slow or even impossible on a single machine to a cluster of machines easily.

Why should care about if my database is columnar?

In a lot of ways, when your using a colummar database, the database engine doesn't need you to know much about the fact that it's columnar at all. On the surface, you can access it very much like you would a row-oriented database such as PostgreSQL. At a semantic level, most database entities and queries will appear almost identical. You can also stick to the familiar SQL database nomenclature, using terms such as tables, rows, columns and fields.

Columnar databases do, however, significantly change the way you think about modeling and querying data. Even though you don't have to be concerned about the low-level details (unless you're interested), having a high-level understanding of the important concepts will help guide and inform your decisions on how you ingest, store and query your data.

For instance, in a columnar database, it's fine to have a huge table with many columns that might be sparsely populated. As long as your not selecting a lot of columns at once, and only performing aggregations on certain columns, this will still be super performant.

Since tables can have many columns, you might not need to normalize data as much, which is also a great win for analytics.

What are columnar databases not good for?

The flip side of all of the benefits of a columnar database is that a simple select * query that might only return one or a few rows can be surprisingly slow.

Unlike traditional SQL databases, fields related to the same row in a table might not be stored close to each other, and could often be stored in different 'slices' of your database, which could mean different machines in your cluster. So, in a query, selecting all of your table's columns for any particular row might force the database to find and put together a whole lot of scattered datapoints, which can result in a slower than expected execution time.

That doesn't mean that you can't write fast queries that need access to many or all of the columns in at all, it's still perfectly possible but it might mean that need spend more time thinking about how design and optimize your database.

This is why a columnar database won't make for a great database to power a normal web or client-server application, where you're often reading in all the columns for a small number of rows. Another reason why columnar databases aren't that great for these kinds of applications is that it's not really optimized for writing small batches of records to the database.

But it can perform analysis and number crunching on huge datasets incredibly quickly if you know what your doing!

Implementing the concept with code

Let's make the concept of a column vs row oriented database more concrete by writing some code.

We'll build a very, very basic 'database' implementation in Python.

First we'll create a classic-style in-memory database table to store a list of transactions. Each 'row' is a dictionary with values for each column.

transactions =[
    {"id": 1, "customer_id": 1, "product_id": 1, "amount": 9.99},
    {"id": 2, "customer_id": 1, "product_id": 2, "amount": 4.99},
    {"id": 3, "customer_id": 2, "product_id": 3, "amount": 25.99},   
]

Now we can 'query' our database. Let's say we want to calculate our total revenue, the sum of all the amount columns. This function will loop through each transaction, read the amount and then sum it all together.

def get_total_transaction_revenue():
    """
    Return the total revenue from all transactions
    """
    return sum([t["amount"] for t in transactions])

get_total_transaction_revenue()

40.97

This Python code crudely implements the same thing as this SQL query would do:

select sum(amount) from transactions

Let's benchmark this function. We create a benchmark function that will run our query 1 million times and see how long it takes to run.

def benchmark(func):
    
    import timeit
    num_runs = 10**6
    duration = timeit.Timer(func).timeit(number = num_runs)

    print(f"{num_runs} runs took {duration} seconds")

And we run the benchmark:

benchmark(get_total_transaction_revenue)

1000000 runs took 1.9082819640170783 seconds

Ok, so let's make a columnar version of our 'table'. Instead of starting from scratch I'll convert our existing transactions data structure.

The main difference to note here is that our table is now a dictionary with keys for each column, which maps to a list of values for each row.

transactions_columnar = {
    "id": [t["id"] for t in transactions],
    "customer_id": [t["customer_id"] for t in transactions],
    "product_id": [t["product_id"] for t in transactions],
    "amount": [t["amount"] for t in transactions],
}

transactions_columnar

{'id': [1, 2, 3],
 'customer_id': [1, 1, 2],
 'product_id': [1, 2, 3],
 'amount': [9.99, 4.99, 25.99]}

Now we can build a similar 'query' function to calculate our total revenue on our column-oriented table. Instead of having to loop through each row, we can just sum up the amount column directly

def get_total_transaction_revenue_cols():
    """
    Return the total revenue from all transactions
    """
    return sum(transactions_columnar["amount"])

get_total_transaction_revenue_cols()

40.97

This gives us the same result as we got before.

assert get_total_transaction_revenue() == get_total_transaction_revenue_cols()

Let's benchmark this function then:

benchmark(get_total_transaction_revenue_cols)

1000000 runs took 1.1226091338321567 seconds

Not a massive difference, but we can show that this is a faster implementation.

Even on this toy example we can see the benefit of using a column-oriented datastructure for certain aggregation patterns.