Simple problems at scale: log tailing

Simple problems at scale is a series of mini-blogs that takes trivial-looking problems and discusses how hard these problems can become with larger-scale systems. We are going to start with log tailing. Let's define some concepts first.

Wtf are logs?

Even though it looks like something very easy to define, it's not that simple. For the sake of this article we will treat logs as a series of events stored sequentially on a storage medium. Events can be anything: network requests, sensor readings, GPS coordinates of a moving object, a series of interactions with a mobile phone, etc. At the very core of the definition of logs, data is immutable: it's written only once. In the simplest case, logs are printed to the console by your program. If you graduated college and this is a program that will run for more than 1 hour, it's a good idea to have these logs appended to a file.

Now what is log tailing exactly?

Given the previous definition of logs, we can deduce that most of the time we are actually interested in the most recent data. And even in cases where you are interested in some data in the past, you often express that data as a function of the present. You are interested in data in the logs that is either "Now", "Recent" or "3 days ago". This all means there is much more interest in the tail of the log data than in any other part. The process of log tailing is to fetch the last N events/lines/transactions in a log.
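To make "fetch the last N events" concrete, here is a minimal sketch in Python. The function name tail and its parameters are made up for illustration, and it assumes the log is small enough to be streamed once from the start, which is exactly the assumption that stops holding later in this post.

```python
from collections import deque
from typing import Iterable, List


def tail(events: Iterable[str], n: int = 10) -> List[str]:
    """Return the last n events of a log (hypothetical helper).

    A deque with maxlen=n drops the oldest entry whenever a new one
    arrives, so we never hold more than n events in memory.
    """
    return list(deque(events, maxlen=n))


if __name__ == "__main__":
    log = [f"event {i}" for i in range(100)]
    for line in tail(log, n=3):
        print(line)
```

Memory stays bounded by N, but the whole log is still read front to back, so the cost grows with the size of the log rather than with N.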

An angry commenter will now yawn and say, "Why not just use $ tail?"

How difficult is log tailing?

The very quick answer is, it depends. It depends on the scale of the logs, how they are arranged and how N is a factor in that.

But let's not get ahead of ourselves. Let's discuss how we would tail logs as they grow starting from your console screen.

1- Logs are emitted to your console screen
It's simple. You wrote a cool game and you are logging changes happening in the UI. You can already see the tail of the logs very clearly. Duh!

2- Web server serving a few hundred active users per day
Unix provides a great tool for log tailing from files. Type tail /var/www/log0.txt and you'll have the last 10 lines of that file printed to your console. Assuming your pet photos website got featured on a local lifestyle blog and it's peaking at tens to hundreds of requests per second, you'd want to see that action without typing tail dozens of times. The -f (follow) option will continuously print newer lines to your screen as they are appended. Easy, right?
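For the curious, here is a rough sketch of the two ideas behind tail and tail -f. It is not how the real tool is implemented, just the same trick in Python: read backwards from the end of the file instead of scanning it from the start, then poll for new lines. The names tail_lines and follow, the block size and the polling interval are all assumptions made for this example.

```python
import os
import time
from typing import Iterator, List


def tail_lines(path: str, n: int = 10, block_size: int = 4096) -> List[bytes]:
    """Return the last n lines of a file by reading blocks backwards from the end."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Read earlier and earlier blocks until we hold more than n newlines
        # (enough to guarantee n complete lines) or we reach the file start.
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
        return data.splitlines()[-n:]


def follow(path: str, poll_interval: float = 0.5) -> Iterator[str]:
    """Yield lines appended to the file after we start, polling like tail -f.

    Simplification: a line that is only partially written when we read it
    is yielded as-is, and log rotation is not handled.
    """
    with open(path, "r") as f:
        f.seek(0, os.SEEK_END)  # skip everything already in the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll_interval)  # nothing new yet, wait and retry
```

The nice property of tail_lines is that its cost depends on N and not on the size of the file, which is why tailing a multi-gigabyte file still feels instant.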

3- You've hit your first million users with around 500 QPS
You posted an awesome picture of a cool cat and your website got really famous. Now looking at logs gets problematic exactly when you need them to debug an issue happening in production. Neither tail nor tail -f can help you because of the overwhelming number of requests: lines scroll past faster than you can read them. Files are also getting bigger and they need to be rotated.

Log rotation: A process of isolating, archiving and compressing old logs.
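As a small illustration, Python's standard library ships a size-based rotating handler. The sketch below assumes you control the logging code of the web app; the helper name make_rotating_logger, the logger name and the size limits are made up for the example. Note that RotatingFileHandler only isolates and archives (it renames old files to access.log.1, access.log.2 and so on, and deletes the oldest); it does not compress them.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(
    path: str,
    max_bytes: int = 1_000_000,
    backups: int = 5,
    name: str = "access",  # hypothetical logger name
) -> logging.Logger:
    """Build a logger that rotates its file once it grows past max_bytes."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # When the file would exceed max_bytes, it is renamed to path + ".1",
    # older backups shift up by one, and at most `backups` of them are kept.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_rotating_logger("access.log", max_bytes=10_000, backups=3)
    for i in range(1_000):
        log.info("GET /cats/%d 200", i)
```

Rotation has a side effect on tailing: the last N events may now span the current file and one or more archived ones, so "the tail" is no longer a single seek from the end of one file.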

Logs apocalypse: you have a multi-million user base with >500k QPS.

Here is where everything bends. Your web server is not a single ssh-able machine. Your log is not a single file. Answering the question that Unix tail used to answer, "What are the last N events in my log?", is not easy anymore. It's also time to be skeptical and ask questions like "What do we really need from all those logs?" and "How are we going to use the data?" These lazy-sounding questions are actually very proactive and lead to insight on how to tackle the scale problems that come with that amount of throughput.

Now that you have gigabytes (or petabytes, hypothetical traffic is free anyway) of log entries coming in every second, your logs are being distributed across tens or hundreds of machines. You want a system optimised for writing these logs, with guarantees on ordering and, most importantly, consistency. The system must be able to tail logs from hundreds of machines efficiently while maintaining order. Kafka and LogDevice are two open-source examples of distributed data stores for logs that are built with the principle of immutable writes and tailing in mind.
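To get a feel for what "tail across machines while maintaining order" means, here is a toy sketch. It is not how Kafka or LogDevice work; it just merges per-machine logs by timestamp. It leans on two big assumptions that real systems cannot take for granted: each machine's log is already sorted by time, and the clocks on all machines are comparable. The Event type and the global_tail function are invented for this example.

```python
import heapq
from collections import deque
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    timestamp: float  # assumed comparable across machines
    machine: str
    message: str


def global_tail(per_machine: Iterable[Iterable[Event]], n: int) -> List[Event]:
    """Return the last n events across all machines, ordered by timestamp."""
    # The global last n events must be among each machine's own last n,
    # so only those need to be merged.
    tails = [deque(stream, maxlen=n) for stream in per_machine]
    # heapq.merge does a lazy k-way merge of already-sorted inputs.
    merged = heapq.merge(*tails, key=lambda e: e.timestamp)
    return list(deque(merged, maxlen=n))


if __name__ == "__main__":
    web1 = [Event(1.0, "web1", "GET /"), Event(4.0, "web1", "GET /cats")]
    web2 = [Event(2.0, "web2", "GET /dogs"), Event(5.0, "web2", "POST /login")]
    for event in global_tail([web1, web2], n=3):
        print(event)
```

Even in this toy version the work grows with the number of machines times N, and it says nothing about what happens when a machine is slow, down, or has a drifting clock. Those are the gaps that purpose-built distributed logs are designed to close.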

What's next?

Go through the docs for Kafka and LogDevice to learn about distributed logging. Spin up a local Kafka cluster. I find the Spotify Docker image the fastest way to spin up Kafka locally and play with it.