I've often found myself needing to read and understand structured data from files—whether it's a set of exported table data or large log files. Reading CSV-formatted files has always felt like a tedious task. If I have a database management system (DBMS) installed, I usually import the data into a table and work on it. Otherwise, I’m stuck trying to make sense of it in an office tool.

Fortunately, I eventually discovered DuckDB, a database engine that can query formatted files directly as a data source. It's also incredibly fast. It has many other features beyond that, but since it's lightweight and ships as a single self-contained binary, there's no harm in keeping this powerful tool handy and using it anywhere.
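
For example, once the duckdb binary is on the PATH, a CSV file can be queried directly by name, with no import step (data.csv below is just a placeholder):

$ duckdb
D SELECT * FROM 'data.csv';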

Recently, I needed to analyze access logs of my blog application to see how it was performing. Since I don’t have an ELK stack set up, DuckDB was the best free tool that came to mind, especially since I’m comfortable writing SQL queries.

Log Files

I wrote about how I enabled and formatted access logs for the embedded Tomcat in my blog’s Spring Boot application in this blog post. Below is a sample from those logs (IP addresses masked):

"[13/Oct/2024:19:50:30 +0000]" 0.0.0.0 "1.1.1.1" "GET / HTTP/1.1" 200 8146 15210 14 https-jsse-nio-8443-exec-5 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:30 +0000]" 0.0.0.0 "1.1.1.1" "GET /css/base.css HTTP/1.1" 200 861 3117 2 https-jsse-nio-8443-exec-3 "https://coskun.bidici.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:30 +0000]" 0.0.0.0 "1.1.1.1" "GET /images/logo.svg HTTP/1.1" 200 1176 3250 3 https-jsse-nio-8443-exec-9 "https://coskun.bidici.com/css/base.css" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:31 +0000]" 0.0.0.0 "1.1.1.1" "GET /images/favicon.svg HTTP/1.1" 200 1333 3049 2 https-jsse-nio-8443-exec-8 "https://coskun.bidici.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:31 +0000]" 0.0.0.0 "1.1.1.1" "GET /posts/how-to-enable-access-logs-on-spring-boot-embedded-tomcat/17 HTTP/1.1" 200 11105 24501 24 https-jsse-nio-8443-exec-9 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:31 +0000]" 0.0.0.0 "1.1.1.1" "GET /css/base.css HTTP/1.1" 200 861 3359 3 https-jsse-nio-8443-exec-10 "https://coskun.bidici.com/posts/how-to-enable-access-logs-on-spring-boot-embedded-tomcat/17" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:32 +0000]" 0.0.0.0 "1.1.1.1" "GET /css/post.css HTTP/1.1" 200 53 2943 2 https-jsse-nio-8443-exec-6 "https://coskun.bidici.com/posts/how-to-enable-access-logs-on-spring-boot-embedded-tomcat/17" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:32 +0000]" 0.0.0.0 "1.1.1.1" "GET /images/logo.svg HTTP/1.1" 200 1176 2914 2 https-jsse-nio-8443-exec-2 "https://coskun.bidici.com/css/base.css" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"
"[13/Oct/2024:19:50:32 +0000]" 0.0.0.0 "1.1.1.1" "GET /images/favicon.svg HTTP/1.1" 200 1333 2881 2 https-jsse-nio-8443-exec-5 "https://coskun.bidici.com/posts/how-to-enable-access-logs-on-spring-boot-embedded-tomcat/17" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/126.0.0.0 Safari/537.36"

The log data includes the following columns:

  • Date and time
  • Remote host (Cloudflare IP addresses, since I use Cloudflare as a dynamic DNS solution)
  • Original visitor IP
  • First line of the request
  • HTTP status code of the response
  • Bytes sent
  • Time taken to process the request (in microseconds)
  • Time to first byte (in milliseconds)
  • Thread name handling the request
  • Referrer
  • User agent

Even from the columns, it's obvious how much valuable information is available in these logs. We'll explore some of it together.
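
For context, this layout corresponds to a Tomcat AccessLogValve pattern along the following lines. This is my reconstruction from the columns rather than a copy from the earlier post (%t is the timestamp, %h the remote host, %{CF-Connecting-IP}i the header Cloudflare sets with the visitor's IP, %r the request line, %s the status code, %b bytes sent, %D the processing time, %F the time to first byte, and %I the thread name):

"%t" %h "%{CF-Connecting-IP}i" "%r" %s %b %D %F %I "%{Referer}i" "%{User-Agent}i"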

Import Data From Logs

I used the read_csv function to import data from the formatted log files, since it lets me set the delimiter character, the quote character, header handling, and null padding for missing fields explicitly.

SELECT *
  FROM read_csv(
    '*.log',
    delim = ' ',
    quote = '"',
    header = false,
    null_padding = true
  );

With this query, DuckDB imports the data and displays a sample set beautifully. From this point, I could start running queries against the data as-is. But I'd like to shape it a bit first by taking advantage of DuckDB's functions.

┌──────────────────────┬────────────────┬────────────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┐
│       column00       │    column01    │    column02    │ … │       column08       │       column09       │       column10       │
│       varchar        │    varchar     │    varchar     │   │       varchar        │       varchar        │       varchar        │
├──────────────────────┼────────────────┼────────────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┤
│ [06/Oct/2024:00:09…  │ 83.97.73.245   │ -              │ … │ https-jsse-nio-844…  │ -                    │ Mozilla/5.0 (Windo…  │
│ [06/Oct/2024:00:09…  │ 83.97.73.245   │ -              │ … │ https-jsse-nio-844…  │ https://51.186.159…  │ Mozilla/5.0 (Windo…  │
│ [06/Oct/2024:00:13…  │ 172.71.98.52   │ 178.215.224.43 │ … │ https-jsse-nio-844…  │ -                    │ python-requests/2.…  │
│ [06/Oct/2024:00:13…  │ 172.71.183.181 │ 178.215.224.43 │ … │ https-jsse-nio-844…  │ -                    │ python-requests/2.…  │
│ [06/Oct/2024:00:13…  │ 172.71.183.181 │ 178.215.224.43 │ … │ https-jsse-nio-844…  │ -                    │ python-requests/2.…  │
│          ·           │       ·        │ ·              │ · │  ·                   │ ·                    │ ·                    │
│          ·           │       ·        │ ·              │ · │  ·                   │ ·                    │ ·                    │
│          ·           │       ·        │ ·              │ · │  ·                   │ ·                    │ ·                    │
│ [06/Oct/2024:00:15…  │ 57.151.71.152  │ -              │ … │ https-jsse-nio-844…  │ -                    │ Mozilla/5.0 zgrab/…  │
│ [06/Oct/2024:00:37…  │ 162.158.38.26  │ 52.178.143.183 │ … │ https-jsse-nio-844…  │ -                    │ -                    │
│ [06/Oct/2024:00:37…  │ 162.158.38.26  │ 52.178.143.183 │ … │ https-jsse-nio-844…  │ -                    │ -                    │
│ [06/Oct/2024:00:37…  │ 162.158.38.26  │ 52.178.143.183 │ … │ https-jsse-nio-844…  │ -                    │ -                    │
│ [06/Oct/2024:00:37…  │ 162.158.38.26  │ 52.178.143.183 │ … │ https-jsse-nio-844…  │ -                    │ -                    │
├──────────────────────┴────────────────┴────────────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┤
│ 123417 rows (10 shown)                                                                                     11 columns (6 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
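
As an aside, when the right parser options aren't obvious up front, DuckDB's sniff_csv table function reports the dialect its auto-detector infers for a file (the file name below is just a placeholder):

SELECT * FROM sniff_csv('access_log.2024-10-13.log');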

Polishing the Data

DuckDB offers functions to manipulate data just like any other database. Let's polish our log data a bit to make it clearer and more understandable.

SELECT strptime(trim(column00, '[]'), '%d/%b/%Y:%H:%M:%S %z') AS dateTime,
       column01 AS cloudFlareIP,
       column02 AS originalIP,
       split_part(column03, ' ', 1) AS requestMethod,
       split_part(column03, ' ', 2) AS requestPath,
       split_part(column03, ' ', 3) AS requestHTTPVersion,
       CAST(column04 AS SMALLINT) AS responseCode,
       CASE WHEN column05 = '-' THEN null ELSE CAST(column05 AS INTEGER) END AS responseSize,
       CAST(column06 AS INTEGER) AS requestProcessingTime,
       CASE WHEN column07 = '-' THEN null ELSE CAST(column07 AS INTEGER) END AS timeToFirstByte,
       column08 AS threadName,
       column09 AS referrer,
       column10 AS userAgent
  FROM read_csv(
    '*.log',
    delim = ' ',
    quote = '"',
    header = false,
    null_padding = true
  );
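
To confirm that strptime and the casts produce the types I expect, the same query can be prefixed with DESCRIBE, which prints the resulting schema instead of the rows:

DESCRIBE
  SELECT strptime(trim(column00, '[]'), '%d/%b/%Y:%H:%M:%S %z') AS dateTime,
         ...
    FROM read_csv(
      '*.log',
      ...
    );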

Create Table

While not mandatory, I prefer to create a table from the polished data since I'll be running multiple queries against it.

CREATE TABLE access_logs AS
  SELECT strptime(trim(column00, '[]'), '%d/%b/%Y:%H:%M:%S %z') AS dateTime,
         column01 AS cloudFlareIP,
         column02 AS originalIP,
         ...
    FROM read_csv(
      '*.log',
      ...
    );

That's it. Now I have well-formatted log data to work with.
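
One caveat: by default, the DuckDB CLI runs against an in-memory database, so this table disappears when the session ends. Starting the CLI with a database file keeps it around between sessions (logs.db is an arbitrary name):

$ duckdb logs.db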

Analysis

Response Code Distribution
D SELECT responseCode, count(0) FROM access_logs GROUP BY responseCode;
┌──────────────┬──────────┐
│ responseCode │ count(0) │
│    int16     │  int64   │
├──────────────┼──────────┤
│          200 │     7038 │
│          206 │        2 │
│          302 │     1083 │
│          400 │    85838 │
│          401 │       14 │
│          403 │       75 │
│          404 │    29361 │
│          405 │        3 │
│          500 │        2 │
│          501 │        1 │
├──────────────┴──────────┤
│ 10 rows       2 columns │
└─────────────────────────┘

This information is valuable because it reveals the frequency of each status code, so I can spot unauthorized access attempts and unexpected error counts.
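
A natural follow-up is to see which paths attract all those 404s; a quick sketch against the same table:

D SELECT requestPath, count(*) AS hits FROM access_logs WHERE responseCode = 404 GROUP BY requestPath ORDER BY hits DESC LIMIT 10;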

Request Processing Times in Milliseconds
  • Home Page
D SELECT avg(requestProcessingTime)/1000 FROM access_logs WHERE requestPath = '/' AND responseCode = 200;
┌─────────────────────────────────────┐
│ (avg(requestProcessingTime) / 1000) │
│               double                │
├─────────────────────────────────────┤
│                  20.822404761904764 │
└─────────────────────────────────────┘
  • Post List Page
D SELECT avg(requestProcessingTime)/1000 FROM access_logs WHERE requestPath = '/posts' AND responseCode = 200;
┌─────────────────────────────────────┐
│ (avg(requestProcessingTime) / 1000) │
│               double                │
├─────────────────────────────────────┤
│                  16.556949438202246 │
└─────────────────────────────────────┘
  • Post Pages
D SELECT avg(requestProcessingTime)/1000 FROM access_logs WHERE requestPath LIKE '/posts/%' AND responseCode = 200;
┌─────────────────────────────────────┐
│ (avg(requestProcessingTime) / 1000) │
│               double                │
├─────────────────────────────────────┤
│                   21.12204563758389 │
└─────────────────────────────────────┘

These request processing times tell me that I have no performance issues on the pages I checked, even though I'm processing Markdown on the fly without caching.
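
That said, averages can hide tail latency. DuckDB's quantile_cont accepts a list of quantiles, so the median and the slow end can be checked in one pass; a sketch for the post pages:

D SELECT quantile_cont(requestProcessingTime/1000, [0.5, 0.95, 0.99]) FROM access_logs WHERE requestPath LIKE '/posts/%' AND responseCode = 200;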

Conclusion

In this post, I demonstrated how I use DuckDB to analyze access logs for my blog application. Although I'm only scratching the surface of DuckDB's features, its lightweight design and ease of use make it an excellent tool even for smaller tasks like log analysis. With DuckDB, I get the flexibility of SQL without the overhead of a traditional database setup—making it a go-to choice for quick data exploration and analysis.