Mienxiu

Explicit is Better than Implicit - Part 2: Behaviors

2024-12-04T00:00:00+00:00

“Explicit is better than implicit” is one of my favorite lines from the Zen of Python. In essence, this guiding principle encourages clarity in code by favoring straightforward and easily comprehensible designs over ones that are hidden or implied.

In this series, I’ll explore why explicit designs matter, focusing on two dimensions: intentions and behaviors.

In Part 1: Intentions, we’ll discuss the downsides and potential risks of unclear intentions in code. I will also illustrate how making intentions explicit not only enhances readability but also contributes to better reliability.

In Part 2: Behaviors, we’ll discuss how implicit behaviors can lead to unexpected outcomes through some commonly used programming paradigms and techniques. I will also focus on balancing implicit behavior with explicit clarity.

Aspect-Oriented Programming

Aspect-Oriented Programming (AOP) is a programming paradigm that aims to increase modularity by allowing the separation of cross-cutting concerns. Cross-cutting concerns are aspects of a program that affect multiple components, such as logging or security. AOP achieves this by adding additional behavior (called advice) to existing code (called join points) without modifying the code itself, thereby promoting separation of concerns. (I will clarify these terms with upcoming examples.)

While AOP offers ways to modularize concerns that span multiple parts of an application, it introduces implicit behaviors that can make the code harder to understand and maintain.

Python doesn’t have native AOP support like some other languages (e.g., AspectJ for Java), but we can facilitate AOP-like behavior using decorators and other metaprogramming techniques.

Python Decorators

Python decorators are a language feature. They should not be confused with the decorator pattern from design patterns.

Python decorators enable you to modify the behavior of functions or methods without changing their original code. While they are a powerful and versatile feature, they have this one intrinsic downside: their underlying behavior is not immediately apparent to users. Misuse or overuse of decorators can lead to unexpected outcomes.

Here’s an example to illustrate the misuse of Python decorators.

Consider a scenario where a decorator is used to measure and log the execution time of functions:

import time
from functools import wraps

def time_logger(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        execution_time = end_time - start_time
        print(f"Function '{func.__name__}' executed in {execution_time:.4f} seconds")
        return result, execution_time  # Implicitly modifying the return value
    return wrapper

@time_logger
def make_transaction(api_endpoint):
    ...  # Simulate a network call
    return {"data": "Sample data from {}".format(api_endpoint)}

Let me pause for a moment and clarify the terminology of AOP in this particular context:

cross-cutting concern: performance logging
advice: time_logger
join point: make_transaction

While this seems useful, the decorator implicitly modifies the return value by returning a tuple (result, execution_time) instead of just result. This implicit change can lead to unexpected behavior, especially if the caller of make_transaction does not anticipate the additional execution_time value. It can also break existing code that relies on the original return structure.
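To make the breakage concrete, here is a minimal sketch of a caller written against the undecorated function. The read_data helper and the example URL are hypothetical and exist only for illustration.

```python
import time
from functools import wraps

def time_logger(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        execution_time = time.time() - start_time
        return result, execution_time  # changes the return type to a tuple
    return wrapper

@time_logger
def make_transaction(api_endpoint):
    return {"data": f"Sample data from {api_endpoint}"}

def read_data(response):
    # Hypothetical caller that expects the dict make_transaction appears to return.
    return response["data"]

response = make_transaction("https://example.com/api")
try:
    read_data(response)
    caller_error = None
except TypeError as error:
    caller_error = error  # a tuple cannot be indexed with a string key
```

Nothing at the call site hints that response is now a tuple, so the failure surfaces far from its cause.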

A better way to create decorators is to avoid modifying the wrapped function’s return value within decorators unless it’s clear and expected:

import time
from functools import wraps

def time_logger(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        execution_time = end_time - start_time
        print(f"Function '{func.__name__}' executed in {execution_time:.4f} seconds")
        return result  # Return the original result without changing function's signature.
    return wrapper

@time_logger
def make_transaction(api_endpoint) -> dict:
    ...  # Simulate a network call
    return {"data": "Sample data from {}".format(api_endpoint)}

Another way to introduce confusion is through the use of pointcuts. Consider the following example:

def time_logger(func):
    function_prefix_to_log = "make"  # pointcut specification

    @wraps(func)
    def wrapper(*args, **kwargs):
        func_name: str = func.__name__
        if func_name.startswith(function_prefix_to_log):
            start_time = time.time()
            result = func(*args, **kwargs)
            end_time = time.time()
            execution_time = end_time - start_time
            print(f"Function '{func_name}' executed in {execution_time:.4f} seconds")
        else:
            result = func(*args, **kwargs)
        return result

    return wrapper

In this example, the time_logger decorator applies the advice only to functions whose names start with the make prefix. Its author might have wanted to reduce noise in the logs or focus on a particular subset of functions. Whatever the reason, such selective execution can lead to unintended omissions and debugging challenges. For instance, adding this decorator to a function named create_transaction would silently log nothing, which is surprising to a developer who does not know the decorator’s control flow.
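The omission is easy to reproduce. The sketch below decorates two functions and captures standard output to see which one is actually logged. The function bodies are placeholders, and create_transaction is a hypothetical name chosen to fall outside the prefix.

```python
import io
import time
from contextlib import redirect_stdout
from functools import wraps

def time_logger(func):
    function_prefix_to_log = "make"  # pointcut specification

    @wraps(func)
    def wrapper(*args, **kwargs):
        if not func.__name__.startswith(function_prefix_to_log):
            return func(*args, **kwargs)  # the advice is silently skipped
        start_time = time.time()
        result = func(*args, **kwargs)
        execution_time = time.time() - start_time
        print(f"Function '{func.__name__}' executed in {execution_time:.4f} seconds")
        return result

    return wrapper

@time_logger
def make_transaction():
    return "made"

@time_logger
def create_transaction():  # decorated, but never logged
    return "created"

captured = io.StringIO()
with redirect_stdout(captured):
    make_transaction()
    create_transaction()

logged_lines = captured.getvalue().splitlines()  # one line, for make_transaction only
```

Both functions carry the same decorator, yet only one of them produces a log line.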

This example is just to highlight the potential pitfalls of using pointcuts. In my opinion, such selective modifications are generally only justifiable in specific scenarios, such as framework development, where the benefits of such mechanisms outweigh the risks.

Then, what about the overuse of decorators?

One of the main advantages of decorators is that they help solve the DRY (Don’t Repeat Yourself) problem. However, let’s consider a scenario where reusability isn’t a concern, and you create a decorator specifically for a single function. Here’s an example:

from functools import wraps

def authenticate(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        user = kwargs.get('user')
        if not user or not user.is_authenticated:
            raise PermissionError("User is not authenticated")
        return func(*args, **kwargs)
    return wrapper

@authenticate
def fetch_user_data(user: User) -> dict[str, str]:
    return {"name": user.name, "email": user.email}

In the context of AOP:

cross-cutting concern: security (or authentication)
advice: authenticate
join point: fetch_user_data

This approach not only makes the behavior implicit but also introduces redundancy into the code.
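The hidden logic also carries a subtle trap: the wrapper reads user from kwargs only, so the outcome depends on how the caller passes the argument. The sketch below uses a hypothetical User dataclass as a stand-in for the real type.

```python
from dataclasses import dataclass
from functools import wraps

@dataclass
class User:  # hypothetical stand-in for the post's User type
    name: str
    email: str
    is_authenticated: bool = True

def authenticate(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        user = kwargs.get("user")  # only keyword arguments are inspected
        if not user or not user.is_authenticated:
            raise PermissionError("User is not authenticated")
        return func(*args, **kwargs)
    return wrapper

@authenticate
def fetch_user_data(user: User) -> dict[str, str]:
    return {"name": user.name, "email": user.email}

alice = User(name="Alice", email="alice@example.com")

by_keyword = fetch_user_data(user=alice)  # passes the check

try:
    fetch_user_data(alice)  # same authenticated user, passed positionally
    positional_call_rejected = False
except PermissionError:
    positional_call_rejected = True
```

An authenticated user is rejected simply because the argument was positional, and nothing in the signature of fetch_user_data reveals that requirement.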

In cases like this, it may be better to write the code explicitly without using decorators:

def fetch_user_data(user: User):
    if not user or not user.is_authenticated:
        raise PermissionError("User is not authenticated")
    return {"name": user.name, "email": user.email}

By making the authentication check explicit within the function, the behavior becomes transparent to anyone reading the code. There’s no hidden logic; everything the function does is laid out plainly.

Here are some simple guidelines to minimize the unintended outcomes when using Python decorators:

KISS (Keep it simple, stupid). Limit what decorators do.
Avoid changing the function signature or the return value unless absolutely needed.
Clearly document the behavior of decorators. (Otherwise, the only information available to users for inferring the behavior is the decorator’s name.)

Metaclasses

Python Metaclasses allow you to customize class creation and behavior. While they may be useful in some very special cases (which I don’t really think there are), overusing metaclasses can lead to code that is hard to understand, debug, and maintain as they can introduce hidden behaviors and side effects that are not immediately apparent from the class definition itself.

One example we will look at here is implicitly adding attributes or methods:

class MetaLogger(type):
    def __new__(cls, name, bases, attrs):
        attrs["log_level"] = "INFO"
        attrs["log"] = lambda self, message: print(f"[{self.log_level}] {message}")
        return super().__new__(cls, name, bases, attrs)

class LoggerBase(metaclass=MetaLogger):
    """Base class for all logging-related classes"""
    pass

class ConsoleLogger(LoggerBase):
    pass

class FileLogger(LoggerBase):
    pass

In the context of AOP:

cross-cutting concern: logging behavior
advice: MetaLogger
join point: all classes of LoggerBase

The behavior of a specific logger is as follows:

>>> console_logger = ConsoleLogger()
>>> print(console_logger.log_level)
INFO
>>> console_logger.log("This is a console log.")
[INFO] This is a console log.

One problem is that anyone looking at the ConsoleLogger class cannot directly find the log_level attribute or the log method, as the class’s behavior is not fully described within its own definition.

Another problem is that subclasses cannot modify the log method by defining it in the class body, leading to unexpected outcomes. Let’s say a developer wants ConsoleLogger to prefix each message with a timestamp:

from datetime import datetime

class ConsoleLogger(LoggerBase):
    def log(self, message):
        print(f"{datetime.now()} [{self.log_level}] {message}")

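However, due to the way the metaclass is implemented, the log method defined in the class body is replaced by the one added in the metaclass, because MetaLogger.__new__ also runs for every subclass and assigns attrs["log"] unconditionally. This behavior can confuse developers who assume that defining a log method in the subclass will work as intended.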
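The overwrite can be verified with a short, self-contained sketch that repeats the definitions above and then inspects which function ends up installed on the subclass:

```python
from datetime import datetime

class MetaLogger(type):
    def __new__(cls, name, bases, attrs):
        attrs["log_level"] = "INFO"
        attrs["log"] = lambda self, message: print(f"[{self.log_level}] {message}")
        return super().__new__(cls, name, bases, attrs)

class LoggerBase(metaclass=MetaLogger):
    pass

class ConsoleLogger(LoggerBase):
    def log(self, message):  # discarded when the metaclass builds the class
        print(f"{datetime.now()} [{self.log_level}] {message}")

# The metaclass ran again for ConsoleLogger and replaced the method above.
installed_name = ConsoleLogger.log.__name__  # "<lambda>", not "log"

ConsoleLogger().log("This is a console log.")  # prints without a timestamp
```

The method the developer wrote never makes it into the class, and no error or warning is raised along the way.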

You might argue that this is purely the developer’s fault; however, such implicit behavior introduced by metaclasses violates what’s called the principle of least astonishment, as code should be explicit and predictable, enabling developers to understand its behavior without spending extra time learning the complexities of the metaclass implementation.

Another issue is that the code editor can’t find its references. (This has been confirmed in both VS Code and PyCharm at the time of writing this post.)

A better approach would be just using simple inheritance like below:

class LoggerBase:
    def __init__(self):
        self.log_level = "INFO"

    def log(self, message):
        print(f"[{self.log_level}] {message}")

class ConsoleLogger(LoggerBase):
    pass

class FileLogger(LoggerBase):
    pass

With this approach, the code is simpler, more explicit, and easier to understand, while still providing the same functionality and flexibility as the metaclass approach.
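As a quick check that overriding now behaves the way a reader would expect, here is a minimal sketch. The DebugConsoleLogger subclass and its message prefix are hypothetical.

```python
import io
from contextlib import redirect_stdout

class LoggerBase:
    def __init__(self):
        self.log_level = "INFO"

    def log(self, message):
        print(f"[{self.log_level}] {message}")

class DebugConsoleLogger(LoggerBase):  # hypothetical subclass for illustration
    def __init__(self):
        super().__init__()
        self.log_level = "DEBUG"  # overriding the attribute works

    def log(self, message):  # overriding the method works too
        print(f"console: [{self.log_level}] {message}")

buffer = io.StringIO()
with redirect_stdout(buffer):
    DebugConsoleLogger().log("cache miss")

output = buffer.getvalue()  # "console: [DEBUG] cache miss\n"
```

Both the attribute and the method can be found, and changed, by reading the class definitions alone.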

As a side note, whenever tempted to use metaclasses, remember this:

Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don’t – Tim Peters

Test Fixtures

A test fixture is a setup or environment created to ensure that tests run consistently and reliably. It includes the necessary conditions, data, or objects required for testing, such as initializing databases, creating mock objects, or cleaning up resources after tests.

While test fixtures aren’t directly related to AOP, both serve to modularize concerns. If you’re already thinking in AOP terms, you might see fixtures as a testing-specific AOP pattern for setup and teardown logic.

In-line Setup

The simplest and most explicit way to define test fixtures is through in-line setup. Here’s an example of using this approach:

def test_get_user():
    # Setting up test fixtures
    db = create_database()
    db.add(User(email="test_user@example.com", username="test_user"))

    user = get_user(email="test_user@example.com")
    assert user.email == "test_user@example.com"
    assert user.username == "test_user"
    
    db.teardown()

def test_delete_user():
    db = create_database()
    db.add(User(email="test_user@example.com", username="test_user"))

    user = get_user(email="test_user@example.com")
    delete_user(user)
    user = get_user(email="test_user@example.com")
    assert user is None

    db.teardown()

While in-line setup is straightforward, it leads to repetitive boilerplate when multiple tests rely on the same test fixture. Duplicating fixture code not only clutters the test suite but also complicates maintenance. This is not AOP.

Delegate Setup

To address this in an AOP manner, we can extract the shared fixture logic into a reusable method and inject it into test functions. This approach, known as delegate setup, is shown below using pytest:

import pytest

@pytest.fixture
def setup():
    db = create_database()
    db.add(User(email="test_user@example.com", username="test_user"))
    yield  # Execute the test
    db.teardown()

def test_get_user(setup):
    user = get_user(email="test_user@example.com")
    assert user.email == "test_user@example.com"
    assert user.username == "test_user"

def test_delete_user(setup):
    user = get_user(email="test_user@example.com")
    delete_user(user)
    user = get_user(email="test_user@example.com")
    assert user is None

Here, the setup fixture manages the creation and teardown of resources. While there’s a slight reduction in explicitness, since the setup code no longer lives in the test body, it efficiently eliminates duplication, making tests cleaner and easier to maintain.

Implicit Setup

You can further streamline test fixtures by using implicit setup. By setting autouse=True, the fixture automatically applies to all tests without requiring explicit inclusion:

import pytest

@pytest.fixture(autouse=True)
def setup():
    db = create_database()
    db.add(User(email="test_user@example.com", username="test_user"))
    yield  # Execute the test
    db.teardown()

def test_get_user():
    user = get_user(email="test_user@example.com")
    assert user.email == "test_user@example.com"
    assert user.username == "test_user"

def test_delete_user():
    user = get_user(email="test_user@example.com")
    delete_user(user)
    user = get_user(email="test_user@example.com")
    assert user is None

This approach completely hides the cross-cutting concerns in testing, which are database setup and teardown. While this approach minimizes boilerplate, it introduces potential pitfalls. The implicit nature of the setup can obscure dependencies, leading to unexpected behavior. For instance, imagine a new test test_create_user is added like below:

def create_user(email: str, username: str) -> User:
    if check_duplicate_email(email):
        raise DuplicateEmailError
    ...

def test_create_user():
    user = create_user(email="test_user@example.com", username="test_user")
    assert user.email == "test_user@example.com"

Because the setup fixture already populates the database with a user having the same email, this test will fail with DuplicateEmailError. A developer unaware of the implicit setup might waste time diagnosing the issue.

Another subtle issue is that unintended executions of the setup code can degrade performance across large test suites, especially for tests that don’t need these fixtures.

In scenarios like this, a balance between explicit and implicit behavior often delivers the best results. In my view, the delegate setup effectively strikes this balance in most cases.
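To illustrate why the delegated form avoids the duplicate-email surprise, here is a framework-free sketch that uses a context manager in place of a pytest fixture. Everything in it (InMemoryDatabase, the database helper, the seed_users parameter) is hypothetical and only stands in for the real database code.

```python
from contextlib import contextmanager

class DuplicateEmailError(Exception):
    pass

class InMemoryDatabase:  # hypothetical stand-in for the post's database
    def __init__(self):
        self.users = {}

    def create_user(self, email, username):
        if email in self.users:
            raise DuplicateEmailError(email)
        self.users[email] = username
        return email

@contextmanager
def database(seed_users=()):
    """Create a database, optionally seed it, and clear it afterwards."""
    db = InMemoryDatabase()
    for email, username in seed_users:
        db.create_user(email, username)
    try:
        yield db
    finally:
        db.users.clear()  # teardown

def test_create_user():
    with database() as db:  # the empty starting state is visible in the test
        email = db.create_user("test_user@example.com", "test_user")
        assert email == "test_user@example.com"

def test_create_user_rejects_duplicate_email():
    seed = [("test_user@example.com", "test_user")]
    with database(seed_users=seed) as db:  # the pre-existing user is visible too
        try:
            db.create_user("test_user@example.com", "other_user")
        except DuplicateEmailError:
            return
        raise AssertionError("expected DuplicateEmailError")

test_create_user()
test_create_user_rejects_duplicate_email()
```

Each test states the data it starts with, so a reader does not have to hunt for a fixture that runs behind the scenes.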

So far, we’ve explored examples such as Python decorators, metaclasses, and test fixtures to examine the risks of implicit behavior introduced by AOP. Here are a couple of key takeaways:

When the control flow is obscured, it becomes more difficult to understand the program and can lead to unintended behaviors.
Since every aspect is tightly coupled with all of its join points in a program, any change to it can result in widespread program failures.

Convention over Configuration

Convention over configuration, also known as coding by convention, is a design paradigm that minimizes the number of explicit configurations developers need to make by providing sensible default behaviors or configurations. This approach is especially prevalent in libraries and frameworks across various programming languages. While it simplifies the development process and adheres to principles like DRY (Don’t Repeat Yourself), it also introduces some pitfalls when implicit behaviors clash with developers’ expectations.

Convention-based ORM

As an example, I will use this particular web framework, Django. Django supports database migrations for its object-relational mapping (ORM) models. When defining models, Django provides two implicit default behaviors:

It adds an auto-incrementing primary key field (id) to every table unless explicitly overridden.
It implicitly names database tables based on the model class name unless explicitly specified.

The first behavior is sensible enough for most cases. Most database tables require primary keys, and automating this process removes repetitive code. For most developers, this default is convenient and intuitive.

The second behavior, however, can lead to challenges. Consider the following model definition:

from django.db import models

class ClassSchedule(models.Model):
    day = models.CharField(max_length=10)
    time = models.TimeField()

By default, this creates a table named appname_classschedule. While functional, here are a few downsides:

If your organization requires table names to be in plural forms (e.g., class_schedules), forgetting to explicitly override the default can result in inconsistency with the policy.
The default name classschedule is not easily readable. A more appropriate name would be class_schedule.
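The naming convention itself can be approximated in a few lines of plain Python. This is only a sketch of the rule described above, not Django’s actual implementation. The default_table_name function is hypothetical, and it ignores details such as the truncation of very long names.

```python
def default_table_name(app_label: str, model_class_name: str) -> str:
    """Approximate the default: '<app_label>_<lowercased class name>'."""
    return f"{app_label}_{model_class_name.lower()}"

table_name = default_table_name("appname", "ClassSchedule")
print(table_name)  # appname_classschedule
```

Because the class name is simply lowercased, the word boundary in ClassSchedule is lost, which is where the hard-to-read classschedule comes from.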

To customize the table name, developers must explicitly define it in the model’s Meta class:

class ClassSchedule(models.Model):
    day = models.CharField(max_length=10)
    time = models.TimeField()

    class Meta:
        db_table = "class_schedule"

Although this approach resolves the issue, it adds a little complexity, especially for developers unaware of the default behavior. Failing to address implicit naming conventions may lead to technical debt.

External Dependencies

Another common use case of convention over configuration is the reliance on environment variables for managing application settings. This method is particularly useful for handling sensitive data (e.g., passwords, tokens) or configuring external systems without hardcoding values.

Consider the example of the boto3 library for interacting with AWS services:

import boto3

s3_client = boto3.client(service_name="s3")
s3_client.upload_file(file_name, bucket, object_name)

boto3, by default, implicitly reads AWS credentials from environment variables (e.g., AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) unless they are specified in the code. While this approach simplifies setup where these variables are pre-configured, it introduces a potential issue: if incorrect values are passed through environment variables in production, the system may appear to operate correctly while interacting with unintended resources. Such issues are often hard to detect and debug.

To reduce these risks, it’s better to explicitly manage environment variables within the application. For example:

import os

import boto3

aws_access_key_id = os.getenv("AWS_ACCESS_KEY_ID")
aws_secret_access_key = os.getenv("AWS_SECRET_ACCESS_KEY")
region_name = os.getenv("AWS_DEFAULT_REGION")

s3_client = boto3.client(
    service_name="s3",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
    region_name=region_name,
)
s3_client.upload_file(file_name, bucket, object_name)

With some added verbosity, the application explicitly declares its dependency on external configuration, improving clarity and maintainability.
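One caveat is that os.getenv returns None for an unset variable, and as far as I know boto3 treats a None credential argument as if it had not been passed, so it would quietly fall back to its default lookup. A small helper that fails fast closes that gap. The require_env function below is a hypothetical sketch, not part of boto3 or the standard library.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Intended usage (assumes the variables are set in the deployment environment):
# aws_access_key_id = require_env("AWS_ACCESS_KEY_ID")
# aws_secret_access_key = require_env("AWS_SECRET_ACCESS_KEY")
```

With this in place, a missing setting stops the application at startup instead of leaving it to run with whatever credentials happen to be discoverable.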

Multithreading

Multithreading is another example that leads to non-intuitive behaviors. Its most notable drawback is the occurrence of race conditions. A race condition arises when multiple threads simultaneously access and modify shared data without proper synchronization, which often results in unpredictable and erroneous outcomes.

Here’s an example of a simple banking application that lacks proper synchronization:

from concurrent.futures import ThreadPoolExecutor
from time import sleep

class BankAccount:
    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        temp = self.balance
        sleep(0.1)  # Simulate processing time
        self.balance = temp + amount

At first glance, this code looks straightforward. However, when multiple threads execute the deposit method concurrently, they may interfere with each other. Let’s test this scenario:

def perform_transactions(account):
    account.deposit(100)

account = BankAccount(100)
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(5):
        executor.submit(perform_transactions, account)
print(f"Final balance: {account.balance}")

Output:

Final balance: 200

Although the expected final balance is 600, the actual result is incorrect. Worse yet, the output can vary with each execution. This unpredictability arises because the way multithreading handles shared resources is hidden from the user. In more complex systems, such nondeterministic behavior becomes even more problematic, making bugs harder to identify, reproduce, and fix.

To prevent race conditions, we must manage access to shared resources explicitly like below:

from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from time import sleep

class BankAccount:
    def __init__(self, balance=0):
        self.balance = balance
        self.lock = Lock()  # Explicit lock

    def deposit(self, amount):
        with self.lock:  # Explicitly acquire and release the lock
            temp = self.balance
            sleep(0.1)  # Simulate processing time
            self.balance = temp + amount
            print(f"Deposited {amount}, new balance: {self.balance}")

Here, the Lock object ensures that only one thread can execute the critical section of the deposit method at a time. By explicitly controlling access, the final balance consistently matches expectations.
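Re-running the earlier experiment against the locked version confirms this. The sketch below is self-contained and uses a shorter sleep than the example above, only to keep the run quick.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from time import sleep

class BankAccount:
    def __init__(self, balance=0):
        self.balance = balance
        self.lock = Lock()

    def deposit(self, amount):
        with self.lock:  # only one thread at a time runs this block
            temp = self.balance
            sleep(0.01)  # simulate processing time
            self.balance = temp + amount

account = BankAccount(100)
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(5):
        executor.submit(account.deposit, 100)

print(f"Final balance: {account.balance}")  # Final balance: 600
```

The trade-off is that the five deposits now run one after another, so the lock buys correctness at the cost of parallelism inside the critical section.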

In the context of threads, explicit behaviors are related to sequential computation. And the value of sequential computation is well described here:

Threads discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly non-deterministic, and the job of the programmer becomes one of pruning that nondeterminism. – Edward A. Lee

Conclusion

Implicit behaviors often emerge as a result of specific design choices (e.g. AOP or CoC) we make to address larger problems (e.g. managing cross-cutting concerns or streamlining configurations). When applied appropriately, they help reduce redundancy and simplify complex tasks. However, they also bring costs, including:

Obscured control flow, making it harder to trace program execution.
Increased debugging complexity, as unintended side effects may be difficult to find.
Steeper learning curves, as they often require an in-depth understanding of specific implementations or frameworks.

It’s important to recognize that the costs of implicit behaviors can be higher than anticipated. And, there might be many situations where making implicit behaviors explicit could effectively improve both the productivity and reliability of the code. While implicit behaviors can help streamline development and abstract away some complexity, the value of explicitness should not be underestimated.

References

Explicit is Better than Implicit - Part 1: Intentions

2024-11-21T00:00:00+00:00

In this series, I’ll explore why explicit designs matter, focusing on two dimensions: intentions and behaviors.

Names

Naming in programming plays a fundamental role in making intentions clear. Well-chosen names can make the intent and purpose of the code immediately understandable, reducing the cognitive load on developers. Poor naming, on the other hand, can lead to confusion, errors, and increased time spent debugging or onboarding new team members.

Let’s have a look at the code snippet with implicit function and variable names:

def get_data(id: int) -> Product:
    ...
    return data

def calc(value: int, rate: float) -> float:
    return value * (1 - rate)

def process(id: int) -> float:
    data = get_data(id)
    rate = 0.2
    result = calc(data.price, rate)
    return result

At first glance, neither the variables nor the function names clearly convey their purpose. For example, the function calc() is too vague—what exactly is being calculated? Similarly, process() gives no hint of what kind of “processing” is happening.

This lack of clarity forces readers to spend extra time deciphering the intent of the code to understand what it actually does. We might be able to infer that the get_data function retrieves product information by looking at its return type, which is a Product object. But, what is the rate for? Without additional context or documentation, the code leaves readers to make guesses at best.

However, we should note that guesswork is a dangerous practice in software development. What if you assume the rate represents a discount rate, but it actually refers to a tax rate? Such a misunderstanding could result in unexpected bugs. To avoid these dangers, it’s better to seek clarification or make the code more explicit rather than making guesses.

We can improve the code by using more descriptive function and variable names:

def get_product(id: int) -> Product:
    ...
    return product

def calculate_discounted_price(original_price: int, discount_rate: float) -> float:
    return original_price * (1 - discount_rate)

def get_discounted_price(product_id: int) -> float:
    product = get_product(product_id)
    discount_rate = 0.2
    discounted_price = calculate_discounted_price(product.price, discount_rate)
    return discounted_price

Now, the intentions are made explicit. Even someone unfamiliar with the code can understand what it does at a glance. What makes this improvement even more valuable is that it eliminates the risk of guesswork for readers.

Abbreviations

Let’s take another example that illustrates a different type of potential risk caused by the use of implicit names.

Consider this code snippet:

# Get price
p = product["price"]

...

make_transaction(amount=p)

In this code, the product’s price is stored in a variable named p, and a transaction is made further down the code using this variable.

Now, imagine that the program requirements change so that a reward point should be added to the user based on the product price. A developer might implement this new requirement as follows, not knowing how p is used later in the code:

# Get price
p = product["price"]

...

p = calculate_reward_point(product["price"])  # Accidentally reusing 'p'
user.reward_point += p  # Add reward point

...

make_transaction(amount=p)  # BUG: Make transaction with reward point, not price

In this modified version, p is accidentally overwritten with the reward point instead of keeping the product price. As a result, the make_transaction function mistakenly uses the reward point in place of the actual price, causing a significant bug.

To avoid this issue, more descriptive and intentional naming could help clarify the purpose of each variable:

# Get price
price = product["price"]

...

reward_point = calculate_reward_point(price)
user.reward_point += reward_point  # Add reward point

...

make_transaction(amount=price)

By using explicit names like price and reward_point, the code not only makes each variable’s purpose clear, but also prevents accidental reuse or modification.

Excessive abbreviations also make it hard for developers to follow the code. For instance, naming a variable for “creation datetime” as crtndt is not immediately understandable. While concise is better than verbose, excessive abbreviations only distract and confuse readers. A better choice would be creation_datetime (using snake_case, for example), which clearly conveys its meaning.

Here are some other examples:

implicit	explicit (snake_case)
lctn	location
fxno	fax_number
rgstdt	registration_date
crno	corporate_registration_number

Meanwhile, not all abbreviations need to be avoided. For example, universally understood abbreviations like url for “Uniform Resource Locator” are perfectly fine. These terms are widely recognized and also improve readability.

The principles we’ve discussed so far also apply to class names.

Generic Terms

Naming classes with overly generic terms like Service or Manager can also be problematic. It is often considered an anti-pattern for two main reasons:

They don’t provide enough information about what the class specifically does, making it harder for developers to understand its purpose at a glance.
They are likely to take on too many responsibilities (becoming a so-called god object), leaving the code difficult to maintain, test, and extend.

To improve clarity and maintainability, it’s better to distribute responsibilities across multiple, more focused classes and give them names that clearly reflect their intentions. For example, a TransactionService class could be split into TransactionProcessor, TransactionValidator, and TransactionLogger. This approach eliminates the ambiguity of the original class name, making the roles of the individual classes much more apparent and easier to work with.

Like so, coming up with specific names does more than just improve the readability of the code—it also enhances the overall design.

Distributing responsibilities across multiple classes is generally encouraged by the so-called single responsibility principle (SRP). This principle states that a software module (or class) should have only one responsibility. For a deeper understanding of SRP, you may refer to this post.

Tests

Naming test cases is an often-overlooked but critical aspect of writing maintainable and effective tests.

Consider the following example, where test cases are designed to validate a flight service:

def test_flight0():
    ...

def test_flight2():
    ...

def test_flight1():
    ...

While it’s evident that these tests are related to flights, the actual purpose of each test is hidden. To understand what is being tested, you’d have to look into the implementation details of each test. This lack of clarity can become a pain point when debugging or analyzing test results.

To make matters worse, depending on your testing framework, unclear names may lead to vague test results when something fails—unless verbose output is enabled. For instance, consider the output when test_flight1 fails:

FAILED test.py::test_flight1 - ...

This tells you almost nothing about what went wrong. Was it a booking failure? A cancellation bug? Something else?

Here’s an improved version with explicit naming:

def test_book_flight():
    ...

def test_reschedule_flight():
    ...

def test_cancel_flight():
    ...

With these descriptive names, you know precisely what is being tested at a glance. This clarity extends to test results as well. For example, if the test_cancel_flight test fails, the output becomes self-explanatory:

FAILED test.py::test_cancel_flight - ...

Parameters

Often, some functions demand complex objects as inputs when all they truly need are a few specific attributes or values. This forces clients to provide overly detailed inputs, potentially leading to unnecessary coupling and bloated test setups.

Consider the following example:

product = Product(
    id=1,
    original_price=1000,
    discount_rate=0.1,
    name="Apple",
    status="Available",
    main_image="https://example.com/",
    ...
)

def calculate_discounted_price(product: Product) -> float:
    return product.original_price * (1 - product.discount_rate)

Here, the calculate_discounted_price function takes an entire Product object as input, even though it only uses two attributes: original_price and discount_rate.

I have to say that this approach is not necessarily bad.

However, there are several benefits if we improve the code by explicitly specifying the required inputs like below:

def calculate_discounted_price(original_price: int, discount_rate: float) -> float:
    return original_price * (1 - discount_rate)

First, testing becomes straightforward. When the calculate_discounted_price function requires a Product object, you’re forced to construct a Product object, which may include unnecessary fields or mocked data:

product = Product(
    id=1,
    original_price=1000,
    discount_rate=0.1,
    # ... include just enough to prevent errors
)

assert calculate_discounted_price(product) == 900

But now you no longer need to create a Product object to test the function, because calculate_discounted_price focuses on its own responsibility without depending on Product. For example:

assert calculate_discounted_price(original_price=1000, discount_rate=0.1) == 900

Also, this approach allows the same logic to be reused in different contexts. For example, you can now apply the discount calculation to entirely different entities, like movie tickets:

calculate_discounted_price(movie_ticket.original_price, movie_ticket.discount_rate)

Additionally, explicit parameters help make the function’s intent clearer. Anyone reading the code knows exactly what inputs are required, without having to dig into the Product class definition.

Magic Numbers

In programming, magic numbers refer to numerical or text values with unexplained meaning or multiple occurrences.

Let’s explore an example that illustrates the risks of magic numbers. Consider this code snippet:

# Calculate the future value of the investment
future_value = principal * (1 + 0.05) ** periods

# Calculate the Effective Annual Rate (EAR)
ear = (1 + 0.05 / periods) ** periods - 1

This code performs two calculations: the future value of the investment and the Effective Annual Rate (EAR).

What’s not obvious is the value 0.05, which represents the interest rate. A developer could work out its meaning by looking closely at the context, which is tolerable. However, there is still room for human error. Imagine the interest rate changes to 0.04. A developer tasked with updating the rate might modify only the value in the future_value calculation, unaware that the same value is also used in the ear formula, or assuming that the two instances of 0.05 are unrelated. The result is two calculations that silently disagree about the rate.

The risk of errors introduced by magic numbers can be reduced by replacing them with meaningful constants. Here’s a clearer version:

INTEREST_RATE = 0.05

# Calculate the future value of the investment
future_value = principal * (1 + INTEREST_RATE) ** periods

# Calculate the Effective Annual Rate (EAR)
ear = (1 + INTEREST_RATE / periods) ** periods - 1

In this version, the magic number 0.05 is replaced with a constant, INTEREST_RATE. Now, anyone reading it immediately understands what the number represents, and making changes is centralized and straightforward.
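To see the constant at work, here is a minimal runnable sketch. The function names future_value and effective_annual_rate, the rate parameter, and the sample figures are illustrative assumptions, not part of the original snippet.

```python
INTEREST_RATE = 0.05  # single source of truth for the rate


def future_value(principal: float, periods: int, rate: float = INTEREST_RATE) -> float:
    """Future value of an investment compounded once per period."""
    return principal * (1 + rate) ** periods


def effective_annual_rate(periods: int, rate: float = INTEREST_RATE) -> float:
    """Effective Annual Rate (EAR) for `periods` compounding periods per year."""
    return (1 + rate / periods) ** periods - 1


# Both calculations read the same constant, so a rate change is made in one place.
print(round(future_value(1000, 2), 2))
print(round(effective_annual_rate(12), 4))
```

Editing the single INTEREST_RATE line updates both formulas together, which is exactly the failure mode the two literal 0.05 values left open.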

Not all numbers in code are considered “magic.” In some contexts, numbers are universally understood and don’t require further explanation. Let’s consider this example:

if month == 1:
    ...
elif month == 2:
    ...
elif month == 3:
    ...
...

In this snippet, the numbers 1, 2, and 3 are used to represent months. This is a standard convention, and most developers will instantly recognize their meaning (e.g., 1 for January, 2 for February, etc.). In cases like this, the use of these numbers is justified and does not obscure the intent of the code.

Attempting to replace these values with constants might actually cause unnecessary verbosity:

JANUARY = 1
FEBRUARY = 2
MARCH = 3
...

if month == JANUARY:
    ...
elif month == FEBRUARY:
    ...
elif month == MARCH:
    ...
...

While technically correct, this approach can degrade the overall readability of the codebase.

Exceptions

Using overly generic exceptions for error handling can be problematic, as they hide the underlying cause of the issues for both developers and users.

Take this example:

try:
    make_transaction(amount)
except Exception:
    raise Exception("An error occurred.")

In this code, no matter what actually goes wrong, the error message will say “An error occurred.”

As a result, if the system encounters an issue, developers may spend more time debugging. And users may struggle to understand what went wrong, making it difficult to take appropriate action.

To improve this, developers should catch specific exceptions and provide meaningful error messages that give both developers and users actionable information. When doing so, consider the potential scenarios that could cause failure. For instance, the make_transaction function might fail due to an invalid transaction amount, or a failure of an external payment service. Here’s an improved, more specific version:

try:
    make_transaction(amount)
except ValueError as error:
    # Re-raise with a clearer message; `from error` keeps the original cause.
    raise ValueError("The transaction amount is invalid. Please pass a positive value.") from error
except ConnectionError as error:
    raise ConnectionError("The payment service is unavailable. Please try again later.") from error

With this approach, you make it easier to debug and provide a much better experience for users.
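The pattern can be tried end to end with a small sketch. Here make_transaction is a hypothetical stand-in for the real call, and the pay wrapper is my own name for illustration; the point is that chaining with `from` keeps the original exception available for debugging.

```python
def make_transaction(amount: int) -> None:
    """Hypothetical stand-in for the real transaction call."""
    if amount <= 0:
        raise ValueError(f"amount must be positive, got {amount}")
    # A real implementation would contact the payment service here and
    # could raise ConnectionError if that service is unreachable.


def pay(amount: int) -> str:
    try:
        make_transaction(amount)
    except ValueError as error:
        # `from error` stores the original exception as __cause__.
        raise ValueError(
            "The transaction amount is invalid. Please pass a positive value."
        ) from error
    except ConnectionError as error:
        raise ConnectionError(
            "The payment service is unavailable. Please try again later."
        ) from error
    return "ok"


try:
    pay(-5)
except ValueError as error:
    print(error)                  # user-facing message
    print(repr(error.__cause__))  # low-level detail for developers
```

The user sees the actionable message, while the traceback still shows the underlying failure.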

Comments

Encouraging explicit intentions applies not only to code but also to comments. Comments are particularly useful for explaining what isn’t immediately obvious from the code. However, vague or unclear comments don’t offer much help to readers, often leaving them guessing about the code’s purpose or functionality.

For instance:

try:
    # Wait for 1 second.
    response = requests.get(url="https://example.com", timeout=1)
except requests.exceptions.Timeout as error:
    ...

This comment doesn’t explain why we’re waiting for 1 second, leaving readers with unanswered questions. It might appear to be an arbitrary choice.

But what if, in reality, there was a Service Level Agreement (SLA) that mandates a specific timeout threshold? Without this context, someone might later increase the timeout to 2 seconds, thinking, “1 second is too short” or “why not 2 seconds?”. This is a potential risk posed by lack of clarity about the intention.

Here’s an improved, more explicit version:

try:
    # Wait for 1 second at most to adhere to the SLA.
    response = requests.get(url="https://example.com", timeout=1)
except requests.exceptions.Timeout as error:
    ...

Providing this context helps prevent unintentional changes that could lead to violations of important requirements or expectations.

As a side note for this particular example, if the value feels like a magic number, another option is to replace the comment with a named constant, as shown below:

SLA_TIMEOUT = 1
try:
    response = requests.get(url="https://example.com", timeout=SLA_TIMEOUT)
except requests.exceptions.Timeout as error:
    ...

Both approaches make the intention more explicit.

Meanwhile, redundant comments can degrade the overall readability. It’s essential to strike a balance—write comments that add value by explaining why a piece of code exists or operates in a certain way, rather than stating the obvious.

Conclusion

When intentions are not clear, it forces other developers and even the original author to spend extra time understanding, maintaining, or modifying the code. Worse yet, it can lead developers to do guesswork, potentially introducing unintended bugs. When intentions are explicit, on the other hand, it becomes easier for developers to understand the purpose behind the code and helps prevent unexpected errors.

That said, excessive explicitness can sometimes reduce readability rather than enhance it. Being explicit therefore means conveying the intention without overloading the reader.

When it comes to how explicit the code should be, it’s helpful to remember that code you find clear might still seem implicit to others. Just like good writers understand their audience, good developers who understand others’ perspectives create better software.

Code Coverage: Misusage and Proper Usage

2024-09-12T00:00:00+00:00

TL;DRs

Blindly aiming for high coverage can degrade test quality as developers may prioritize increasing coverage metrics over writing meaningful, effective tests.
What’s more important than the coverage number is how you use code coverage itself.
Code coverage can be effectively used to identify weaknesses in your test design, highlighting areas that need better testing approaches.

In this post, we will discuss code coverage, exploring both its potential for misusage and proper usage.

Code Coverage Basics

Code coverage is a metric used to measure the percentage of your code that is executed when running your tests. It helps you understand how much of your codebase is covered by tests and can identify untested parts of your application. To achieve this, we typically use code coverage tools.

For example, consider the following code and test in Python:

# app.py
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

# test_app.py
from app import add


def test_add():
    assert add(2, 3) == 5

In this example, the test covers the add() function but doesn’t cover the subtract() function.

To visualize this, let’s look at how a code coverage tool like Coverage.py might report the coverage for the above code:

Name	Stmts	Miss	Cover
app.py	4	1	75%
test_app.py	3	0	100%
TOTAL	7	1	86%

In this case, the app.py file has 75% coverage, as 3 out of 4 statements are executed. The figure is 75% rather than 50% because of how Coverage.py counts statements: the def and class lines in a Python file are executed on import, even if the functions themselves are never called. The only missed statement is the return inside subtract().

Consequently, any potential bugs in the subtract() function would go unnoticed.
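The percentages in the report follow from simple arithmetic. The sketch below reproduces them; the helper name coverage_percent is hypothetical, and real tools may round slightly differently.

```python
def coverage_percent(statements: int, missed: int) -> int:
    """Share of executed statements, as a whole-number percentage."""
    executed = statements - missed
    return round(100 * executed / statements)


# (statements, missed) pairs taken from the report above
report = {"app.py": (4, 1), "test_app.py": (3, 0), "TOTAL": (7, 1)}
for name, (statements, missed) in report.items():
    print(name, f"{coverage_percent(statements, missed)}%")
```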

To increase coverage, you would write an additional test, such as:

# test_app.py
from app import add, subtract


def test_add():
    assert add(2, 3) == 5

def test_subtract():
    assert subtract(5, 3) == 2

By adding the test_subtract() function, you increase the code coverage, ensuring that both the add() and subtract() functions are tested. Running the code coverage tool again would likely show 100% coverage, as all functions and code paths are now tested:

Name	Stmts	Miss	Cover
app.py	4	0	100%
test_app.py	5	0	100%
TOTAL	9	0	100%

Types of Code Coverage

There are several types of code coverage, each providing a different perspective on how thoroughly your code is tested. The coverage percentage can vary depending on the type of coverage being measured.

Some of the most common types of code coverage include:

Function coverage
Statement coverage
Branch coverage
Condition coverage

To explain how these types of coverage work, let’s consider the following code:

# app.py
def is_eligible_for_discount(age: int, is_student: bool) -> bool:
    """
    Determines if a person qualifies for a discount based on their age and student status.
    """
    if is_student:
        return True
    if age < 18 or 65 <= age:
        return True
    return False

# test_app.py
from app import is_eligible_for_discount


def test():
    assert is_eligible_for_discount(17, True) is True
    assert is_eligible_for_discount(18, False) is False

In this test, we are checking two scenarios: a 17-year-old student and an 18-year-old non-student. While this test might pass, it doesn’t necessarily mean we’ve achieved high code coverage. Different types of code coverage will reveal different aspects of how thoroughly this function is tested.

Function Coverage

Function coverage measures whether each function in the code has been executed at least once. It answers the question, “Have all the functions in the code been called?”

If you measure the test above, function coverage would be 100% as both is_eligible_for_discount and test functions are called during the test execution.
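One way to picture function coverage is a tracer that records every function entered while a test runs. The sketch below uses Python’s sys.settrace for that; the helper name called_functions is hypothetical, and real coverage tools are considerably more elaborate.

```python
import sys


def is_eligible_for_discount(age: int, is_student: bool) -> bool:
    if is_student:
        return True
    if age < 18 or 65 <= age:
        return True
    return False


def test():
    assert is_eligible_for_discount(17, True) is True
    assert is_eligible_for_discount(18, False) is False


def called_functions(func) -> set[str]:
    """Run `func` and return the names of the Python functions it entered."""
    called: set[str] = set()

    def tracer(frame, event, arg):
        if event == "call":
            called.add(frame.f_code.co_name)
        return None  # no per-line tracing is needed for function coverage

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return called


called = called_functions(test)
print(sorted(called))
```

Both functions are recorded, so function coverage is complete even though one statement inside is_eligible_for_discount never runs. That is why this metric is the coarsest of the four.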

Statement Coverage

Statement coverage measures whether each line (or statement) in the code has been executed. It answers the question, “Have all the lines in the code been executed?”

If you measure the test above, statement coverage for app.py would be about 83% (5 out of 6 statements): the return True statement inside the second if block (if age < 18 or 65 <= age) is never executed, because the only non-student case uses age 18, which falls through to the final return False.

This is the most commonly used type of code coverage because it strikes a good balance between simplicity and practicality. Its straightforward approach makes it accessible even for teams with limited resources or time. This allows them to quickly gain insights into their code’s reliability without the complexity of more advanced coverage types.

Branch Coverage

Branch coverage measures whether each possible branch (or path) in control structures (like if statements) has been executed. It answers the question, “Have all the paths in the code been executed?”

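If you measure the test above, branch coverage for app.py would be about 80% with a tool such as Coverage.py, which combines executed statements and branch outcomes in one figure (5 of 6 statements plus 3 of 4 branch outcomes). Both outcomes of if is_student: are taken, and so is the false outcome of if age < 18 or 65 <= age:, but the true outcome of that second if, where a non-student is under 18 or at least 65, is never taken.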
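A possible way to close this gap is to add a non-student case that satisfies the age condition. The sketch below repeats the function so it runs on its own; the test names are my own, chosen to state what each case checks.

```python
def is_eligible_for_discount(age: int, is_student: bool) -> bool:
    if is_student:
        return True
    if age < 18 or 65 <= age:
        return True
    return False


def test_student_is_eligible():
    assert is_eligible_for_discount(17, True) is True


def test_working_age_non_student_is_not_eligible():
    assert is_eligible_for_discount(18, False) is False


def test_non_student_minor_is_eligible():
    # Takes the true outcome of the second `if`, which the original
    # two assertions never reach.
    assert is_eligible_for_discount(17, False) is True


for case in (
    test_student_is_eligible,
    test_working_age_non_student_is_not_eligible,
    test_non_student_minor_is_eligible,
):
    case()
print("all branch outcomes exercised")
```

With the third test, every statement and both outcomes of each if are exercised at least once.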

Condition Coverage

Condition coverage, also known as predicate coverage, measures whether each boolean sub-expression (condition) in the code has been tested for both true and false outcomes. It answers the question, “Have all the individual conditions been tested for both true and false?”

If you measure the test above, condition coverage for app.py would be only about 67% (4 of 6 condition outcomes). The condition is_student is tested for both true and false, but the sub-conditions in if age < 18 or 65 <= age: are only partially tested: they are evaluated once, for the 18-year-old non-student, and both age < 18 and 65 <= age come out false. Neither is ever observed as true, because the 17-year-old in the test is a student and returns before the age check.

Although the actual coverage percentage might differ slightly depending on the tools and metrics used, this example illustrates how condition coverage can identify untested logical paths in your code, encouraging more comprehensive testing.
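To make the counting concrete, the following sketch records which true/false outcomes each sub-condition takes for a set of inputs. It is a hand-written model of the function’s evaluation order, not how a coverage tool works internally, and the helper names are hypothetical. Under this particular counting (3 conditions, 2 outcomes each), the two original inputs observe 4 of 6 outcomes.

```python
def condition_outcomes(age: int, is_student: bool):
    """Yield (condition, outcome) pairs in the order Python evaluates them."""
    yield ("is_student", is_student)
    if is_student:
        return  # early return: the age check is never reached
    yield ("age < 18", age < 18)
    if age < 18:
        return  # `or` short-circuits: `65 <= age` is not evaluated
    yield ("65 <= age", 65 <= age)


def condition_coverage(cases) -> float:
    """Fraction of the 6 possible outcomes (3 conditions x true/false) observed."""
    seen = set()
    for age, is_student in cases:
        seen.update(condition_outcomes(age, is_student))
    return len(seen) / 6


original = [(17, True), (18, False)]
extended = original + [(17, False), (65, False)]
print(f"{condition_coverage(original):.0%}")
print(f"{condition_coverage(extended):.0%}")
```

Adding a non-student minor and a non-student senior is enough to observe every sub-condition as both true and false.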

Code Coverage vs Test Coverage

While the terms “code coverage” and “test coverage” are often used interchangeably, they refer to different aspects of software testing.

Code coverage specifically measures the percentage of a program’s source code that is executed during testing. It focuses on the technical details of code execution to ensure that all lines, branches, and paths are exercised by the test cases.

On the other hand, test coverage takes a broader perspective. It assesses whether all required features and functionalities of the software are tested according to the defined requirements. Test coverage ensures that the software meets its intended purpose and behaves as expected from a user’s perspective, so to speak.

Target Coverage Values?

(A snippet of a code coverage report from one of the projects I maintain. I usually expect a similar coverage for other projects as well.)

It seems that the most common code coverage targets used in many companies typically fall between 80% and 90%.

Some safety-critical applications require 100% coverage for certain types of tests. For example, in the aerospace industry, software used in flight control systems often mandates 100% branch coverage to ensure that every possible execution path is thoroughly tested.

Benefits

I think having a target coverage value does have its benefits. Based on my experience, developers generally fall into two categories when it comes to testing:

Developers who write tests because they understand and appreciate the value of testing.
Developers who write tests only when required, either by team policy or other mandates.

For the first type, coverage percentage is not really a concern or hassle. I believe measuring their test suite at any point would easily beat 90% coverage. They know what they’re doing.

For the second type of developers, it is unclear how much of the codebase their tests actually cover; it might not even be close to half, which often is a serious issue. Even with 50% coverage, the amount of “properly” tested code could be much lower. For these developers, setting a target coverage can encourage them to write enough tests to provide a baseline level of reliability, though they might still miss covering all critical paths in the code.

Pitfalls

Enforcing minimum coverage requirements also has problems. One issue is that developers often start treating the minimum, say 80%, as the goal. Instead of viewing 80% as a floor, they view it as the most they need to hit, so they don’t go beyond it.

Worse yet, developers may write tests just to meet the target, without considering the quality of those tests. This concern is especially pronounced when aiming for 100% coverage. Such habits can degrade test quality and lead to a false sense of confidence in the reliability of the codebase.

So, how does code coverage lead to bad tests?

Misusage

Let’s explore an example to understand a potential misuse of code coverage.

Consider the following code:

def calculate_discount(price: int, discount_type: str, discount_value: int) -> float:
    """
    Calculate the discounted price based on the given price, discount type, and discount value.
    """
    if discount_type == "fixed":
        discounted_price = price - discount_value
    elif discount_type == "percentage":
        discounted_price = price * discount_value / 100
    else:
        raise ValueError("Invalid discount type")
    return discounted_price

We also have the following test cases to verify this functionality:

def test_calculate_discount_with_fixed_discount():
    assert calculate_discount(100, "fixed", 50) == 50

def test_calculate_discount_with_invalid_discount_type():
    try:
        assert calculate_discount(100, "invalid", 50)
    except ValueError as error:
        assert str(error) == "Invalid discount type"

By running a code coverage tool, we can identify the lines of code that are not executed during testing:

discounted_price = price * discount_value / 100

At this point, if we were to focus solely on achieving 100% code coverage, we might end up with the following test cases:

# Coverage-driven Tests

def test_calculate_discount_with_fixed_discount():
    assert calculate_discount(100, "fixed", 50) == 50

def test_calculate_discount_with_percentage_discount():
    assert calculate_discount(100, "percentage", 50) == 50

def test_calculate_discount_with_invalid_discount_type():
    try:
        assert calculate_discount(100, "invalid", 50)
    except ValueError as error:
        assert str(error) == "Invalid discount type"

With these tests, measuring code coverage will report 100%. But does that mean our code is good to go? Not necessarily.

In fact, there are fundamental flaws in the logic of the calculate_discount function that need addressing:

Logical error: When the discount_type is percentage, the calculation should be:
```
discounted_price = price - (price * discount_value / 100)
```
A fault of omission: The specifications require that the discounted_price must be equal to or greater than 1. This can be addressed by:
```
if discounted_price < 1:
    return 1
```

This is an inherent limitation of code coverage: it only indicates which parts of your code have been executed during testing and does not tell you how the code should be written or what additional logic might be needed.

After all, using code coverage tools only to satisfy a higher coverage can lead to shallow, ineffective tests that miss critical logic and edge cases. What’s more important than the coverage value is how you use code coverage to ensure meaningful and robust testing.
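The sketch below shows why the coverage-driven test passes against the flawed function. It reproduces the flawed version so it runs on its own; the second input is my own choice for illustration.

```python
def calculate_discount(price: int, discount_type: str, discount_value: int) -> float:
    """The flawed version from above, reproduced so the sketch is self-contained."""
    if discount_type == "fixed":
        discounted_price = price - discount_value
    elif discount_type == "percentage":
        discounted_price = price * discount_value / 100
    else:
        raise ValueError("Invalid discount type")
    return discounted_price


# The coverage-driven input passes by coincidence: at 50% off 100, the
# discount amount and the discounted price are both 50.
assert calculate_discount(100, "percentage", 50) == 50

# A different input exposes the bug: 10% off 100 should be 90.
result = calculate_discount(100, "percentage", 10)
print(result, result == 90)
```

The line is executed in both cases, so coverage cannot distinguish the input that hides the bug from the one that reveals it.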

Proper Usage

Improve Testing Strategy

Let’s explore how to use code coverage effectively.

Let’s take another look at our example, where coverage flagged a line that hadn’t been executed:

discounted_price = price * discount_value / 100

This indicates a weakness in our tests. Certain areas are either untested or inadequately tested. And we can do better than what we saw before.

One way to enhance our test suite is by expanding our test cases with various input values:

def test_calculate_discount_with_fixed_discount():
    assert calculate_discount(100, "fixed", 10) == 90
    assert calculate_discount(200, "fixed", 50) == 150

def test_calculate_discount_with_percentage_discount():
    assert calculate_discount(100, "percentage", 10) == 90
    assert calculate_discount(200, "percentage", 50) == 100

def test_calculate_discount_with_invalid_discount_type():
    try:
        assert calculate_discount(100, "invalid", 50)
    except ValueError as error:
        assert str(error) == "Invalid discount type"

In this updated suite, the tests more comprehensively cover both fixed and percentage discount scenarios with various inputs. Note that we have not only added a new test case, test_calculate_discount_with_percentage_discount, but also reinforced the existing test_calculate_discount_with_fixed_discount. As a result, running these updated tests will reveal the logical error in percentage discount calculations.

Based on the test results, we may revise the calculate_discount function as follows:

def calculate_discount(price: int, discount_type: str, discount_value: int) -> float:
    """
    Calculate the discounted price based on the given price, discount type, and discount value.
    """
    if discount_type == "fixed":
        discounted_price = price - discount_value
    elif discount_type == "percentage":
        discounted_price = price - (price * discount_value / 100)
    else:
        raise ValueError("Invalid discount type")
    return discounted_price

As demonstrated, by expanding our test cases, the test suite becomes more effective at identifying errors in the code.

Again, it’s important to note that high code coverage alone doesn’t ensure your tests are meaningful or comprehensive. A more advanced and effective use of code coverage is to focus on whether the tests adequately capture edge cases and fulfill all specified requirements. This approach ensures that your tests go beyond surface-level coverage to provide deeper validation of your code’s behavior in real-world scenarios.

After a thorough examination, we can further improve our tests by adding edge cases and ensuring they reflect the specifications:

def test_calculate_discount_with_fixed_discount():
    assert calculate_discount(100, "fixed", 10) == 90
    assert calculate_discount(200, "fixed", 50) == 150

def test_calculate_discount_with_percentage_discount():
    assert calculate_discount(100, "percentage", 10) == 90
    assert calculate_discount(200, "percentage", 50) == 100

def test_calculate_discount_with_invalid_discount_type():
    try:
        assert calculate_discount(100, "invalid", 50)
    except ValueError as error:
        assert str(error) == "Invalid discount type"

def test_calculate_discount_edge_cases():
    # Edge cases with zero price
    assert calculate_discount(0, "fixed", 20) == 1
    assert calculate_discount(0, "percentage", 20) == 1

    # Edge cases with large discount_value
    assert calculate_discount(100, "fixed", 200) == 1
    assert calculate_discount(100, "percentage", 200) == 1

Now code coverage reaches 100%, and the tests do more than execute every line. The varied inputs already exposed the incorrect logic for percentage discounts, and the edge cases uncover an omission in the code: the discounted_price should not fall below 1.

Finally, we correct the function as follows:

def calculate_discount(price: int, discount_type: str, discount_value: int) -> float:
    """
    Calculate the discounted price based on the given price, discount type, and discount value.
    """
    if discount_type == "fixed":
        discounted_price = price - discount_value
    elif discount_type == "percentage":
        discounted_price = price - (price * discount_value / 100)
    else:
        raise ValueError("Invalid discount type")
    if discounted_price < 1:
        return 1
    return discounted_price

As a side note, reflecting on requirements, as seen here, resembles test-driven development (TDD), where you begin by writing tests that define the expected behavior and then write code to meet those tests. One of the benefits of TDD is naturally achieving high code coverage, aligning with the goals of code coverage tools. However, the two approaches differ in their focus: code coverage emphasizes identifying weaknesses in test design, while TDD focuses on writing tests to guide the development of code that meets specified requirements.

By expanding tests to consider edge cases and various inputs, you naturally improve coverage while ensuring that the function is robust and well-tested. This approach results in a more reliable codebase, reducing the risk of missing important bugs. You can apply this strategy to other parts of the codebase to ensure that the entire application is thoroughly tested.

Find Missed Tests and Obsolete Code

It’s a best practice to include your tests in coverage as it helps find missed tests and outdated code, and ensures your coverage goals are accurate.

For instance, if your testing framework’s discovery conventions require test functions to have a test_ prefix, a misspelled function such as tes_something will silently never run. This leaves gaps in your test coverage, where code you believe is tested is not. When test files are included in the coverage measurement, the body of such a function shows up as unexecuted, which makes the mistake easy to spot.
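The effect of a misspelled prefix can be illustrated with a deliberately simplified model of prefix-based discovery. The discover_tests helper is hypothetical and far cruder than any real framework’s collection logic; the two test functions are placeholders.

```python
def test_book_flight():
    pass


def tes_cancel_flight():  # typo: the prefix is missing its final "t"
    pass


def discover_tests(candidates) -> list[str]:
    """Simplified stand-in for prefix-based discovery: keep `test_*` functions."""
    return [func.__name__ for func in candidates if func.__name__.startswith("test_")]


collected = discover_tests([test_book_flight, tes_cancel_flight])
print(collected)
```

The misspelled function is never collected and no error is reported, so only a coverage report that includes the test file reveals that its body never ran.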

Additionally, coverage reports can help spot obsolete code like unused fixtures. This is important because such code can introduce maintenance costs and potential bugs. By identifying and removing obsolete code, you can simplify your codebase and reduce the risk of unexpected behavior.

Conclusion

While hitting 100% code coverage feels like an accomplishment, it’s crucial to remember that high coverage alone doesn’t mean your tests are truly valuable. Think of code coverage as a tool that tells you what parts of your code are being tested—not how well they’re being tested.

To really get the most out of code coverage, you want to go beyond just covering all your lines of code. The real magic happens when you ask yourself, “Am I testing all the important scenarios? Am I covering edge cases and ensuring my code behaves correctly under all circumstances?”

Writing good tests is no easy task. It requires thoughtful planning, a deep understanding of the code, and the ability to anticipate potential edge cases and scenarios that might break the code. Code coverage tools can help develop this skill by highlighting areas of the code that haven’t been tested, prompting you to think critically about untested paths and edge cases that may otherwise be overlooked.

Exposing Applications Running in EKS Cluster for External Access

2024-08-07T00:00:00+00:00

To expose applications running in a Kubernetes cluster in the cloud, you need an additional component to facilitate external access. For clusters in AWS, AWS Elastic Load Balancers are the components that enable this external access.

You have two options for provisioning AWS Elastic Load Balancers:

Manually create a load balancer and register the target pods of the Service with the target groups yourself.
Install a Service or Ingress controller and let it handle the load balancer corresponding to the Kubernetes Service or Ingress objects.

The first option involves a manual process and has limitations due to the dynamic nature of the Kubernetes system, as the targets can change frequently. The second option is automatic, relieving the operator from managing the load balancers, as this becomes the controller’s responsibility. For these reasons, leveraging controllers is the recommended approach for provisioning load balancers in most cases.

In this post, we will discuss the following topics:

A brief review of AWS Elastic Load Balancers and different options for an EKS cluster.
An overview of available controllers, including how they work, along with their pros and cons.
Utilizing the AWS Load Balancer Controller.

AWS Elastic Load Balancers

AWS Elastic Load Balancing (ELB) is a service provided by Amazon Web Services (AWS) that automatically distributes incoming application or network traffic across multiple targets in one or more Availability Zones. Its load balancers offer high availability and fault tolerance for Kubernetes clusters deployed in AWS.

Load Balancer Types

There are four types of ELBs, each suited for different use cases:

Classic Load Balancers (CLB)
Network Load Balancers (NLB)
Application Load Balancers (ALB)
Gateway Load Balancers (GWLB)

While choosing which load balancer type to use depends on the workload requirements, the most relevant types for EKS clusters are the ALB and NLB:

Application Load Balancer (ALB): Ideal for workloads requiring HTTP/HTTPS load balancing at Layer 7 of the OSI Model. The ALB is provisioned from Ingress resources by an ingress controller and routes HTTP/HTTPS traffic to the corresponding Pods.
Network Load Balancer (NLB): Suitable for TCP/UDP workloads and those needing source IP address preservation at Layer 4 of the OSI Model. NLB is also preferable if a client cannot utilize DNS, as it provides static IPs.

This post focuses on using the Application Load Balancer (ALB) since it is particularly well-suited for managing HTTP/HTTPS traffic within EKS clusters, allowing for advanced routing features and better integration with Kubernetes services.

Target Type

There are two main target types of AWS Load Balancers you can choose when provisioning a load balancer for EKS:

Instance
IP

With the ‘Instance’ target type, the load balancer forwards traffic to the worker node on the NodePort.

This means that traffic from the load balancer is processed by the node’s networking stack, involving iptables rules or similar mechanisms, before being forwarded to the appropriate Service and pod. This additional processing can increase latency and add complexity to monitoring and troubleshooting, as the traffic is first handled by the node before reaching the intended pod.

In contrast, with the ‘IP’ target type, the load balancer forwards traffic directly to the Pod.

This approach bypasses the node’s additional networking layers, simplifying the network path, reducing latency, and making monitoring and troubleshooting more straightforward. It also allows for direct utilization of the load balancer’s health checks, accurately reflecting the pod’s status. Additionally, using ‘IP’ mode can reduce cross-AZ data transfer costs, unlike ‘Instance’ mode, where traffic is routed through Kubernetes NodePort and ClusterIPs.

The recommended target type is ‘IP’ because direct routing avoids additional hops and overhead, providing a more efficient and straightforward traffic flow.

This post focuses on using ‘IP’ target type.

Service and Ingress Controllers

A controller is essentially a control loop that monitors changes to Kubernetes objects and takes corresponding actions, such as creating, updating, or deleting them.

For example, a service controller watches for new Service objects and provisions a load balancer using the cloud provider’s APIs when it detects one with spec.type set to LoadBalancer, as shown below:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: LoadBalancer
  ...

The controller then sets up the load balancer’s listeners and target groups and registers the target pods of the Service.
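The control-loop idea can be sketched in a few lines of Python. This is a toy, in-memory model only: the Service dataclass, the reconcile function, and the dictionary standing in for cloud load balancers are all hypothetical, and no Kubernetes or AWS API is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    type: str
    pod_ips: list[str] = field(default_factory=list)


def reconcile(services: list[Service], load_balancers: dict[str, list[str]]) -> dict[str, list[str]]:
    """One pass of a toy control loop: make load balancers match the Services."""
    desired = {s.name: sorted(s.pod_ips) for s in services if s.type == "LoadBalancer"}
    # Create or update a load balancer for every Service of type LoadBalancer.
    for name, targets in desired.items():
        if load_balancers.get(name) != targets:
            load_balancers[name] = targets
    # Delete load balancers whose Service no longer exists.
    for name in list(load_balancers):
        if name not in desired:
            del load_balancers[name]
    return load_balancers


lbs: dict[str, list[str]] = {}
services = [
    Service("my-service", "LoadBalancer", ["10.0.2.7", "10.0.1.5"]),
    Service("internal", "ClusterIP"),  # ignored: not of type LoadBalancer
]
print(reconcile(services, lbs))
print(reconcile([], lbs))  # the Service was deleted, so its load balancer goes too
```

A real controller reacts to watch events from the API server and calls the cloud provider’s APIs, but the shape is the same: compare desired state with actual state and act on the difference.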

Similarly, an ingress controller watches for any changes in Ingress objects and provisions and updates the load balancers, configuring them according to the rules specified in Ingress resources. Below is an example of Ingress resources:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: my-ingress
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              number: 80

More detailed usage of Ingress will be covered later in this post.

When deciding which controller to use, we will consider the following three options:

AWS Cloud Controller Manager’s Service Controller (legacy, in-tree service controller)
AWS Load Balancer Controller (recommended by AWS)
Ingress-Nginx Controller (open-source implementation of an ingress controller)

While it’s beyond the scope of this discussion, these methods can be used in conjunction with each other depending on specific requirements.

AWS Cloud Controller Manager’s Service Controller (In-tree Service Controller)

AWS Cloud Controller Manager’s Service Controller is also referred to as the in-tree Service Controller as it is integrated into the Kubernetes core codebase. This means that you can use it right away because it is preinstalled with AWS Cloud Controller Manager.

It can provision Classic Load Balancers (CLBs) or Network Load Balancers (NLBs) depending on the load balancer type specified in the Service manifest.

By default, it creates a CLB when it detects a new Kubernetes Service of type LoadBalancer like below:

apiVersion: v1
kind: Service
spec:
  type: LoadBalancer
  ...

To use an NLB instead of a CLB, you need to set the service.beta.kubernetes.io/aws-load-balancer-type annotation to nlb in the Service manifest:

kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  ...

In general, using only Service to provision load balancers is not recommended for the following reasons:

It does not support ALB, which provides Layer 7 (application layer) features.
It only supports the instance target type, which can lead to operational complexity.
Managing multiple Service objects requires a corresponding number of load balancers, each incurring additional infrastructure cost.

This is where the Kubernetes Ingress resource comes in. An Ingress exposes and routes HTTP/HTTPS traffic to services within a Kubernetes cluster. This means that with an ingress controller you can deploy a single load balancer and then route traffic to multiple services.

AWS Load Balancer Controller

The AWS Load Balancer Controller watches both Ingress and Service resources and manages AWS Load Balancers. It can provision both Application Load Balancers (ALBs) and Network Load Balancers (NLBs) and offers more flexibility compared to the in-tree Service Controller.

The following diagram shows an example of using LBC with ip target-type:

A key advantage of the AWS Load Balancer Controller is its seamless integration with Kubernetes, which simplifies the management of NLBs and ALBs through Kubernetes annotations. Unlike the in-tree service controller, this controller supports advanced features such as path-based routing, host-based routing, and other capabilities provided by AWS.

On the other hand, while it may not be a significant drawback, it’s tightly integrated with AWS services, which can make it less portable. This is something to consider if you ever need to move to another cloud provider or adopt a multi-cloud strategy. It may also come with a learning curve, especially for users not familiar with AWS-specific configurations.

Ingress-Nginx Controller

Ingress-Nginx deploys Nginx reverse proxy pods inside the cluster. This reverse proxy routes traffic from outside the cluster to the services. It can also act as an internal layer 7 load balancer.

This option supports a rich feature set of Nginx, a widely used open-source project with a strong community. Since it’s not specific to AWS, it offers flexibility and provides a consistent ingress experience across different cloud providers or even on-premises environments.

On the downside, this approach introduces some operational costs, as cluster operators must monitor, maintain, and scale the underlying resources. Additionally, it may lead to increased infrastructure costs due to the need for dedicated node resources to isolate proxy pods, ensuring high reliability and availability. Lastly, there could be a minor performance overhead from the extra hops involved in routing traffic through the reverse proxy.

Overall Comparison of Load Balancer Controllers

The following table is a summary of different controllers with their pros and cons:

Controller	Pros	Cons
In-tree Service Controller	No installation required	No ALB support; operational cost; infrastructure cost
AWS Load Balancer Controller	Highly scalable; highly available; less operational cost	Less portable
Ingress-Nginx Controller	Rich features of Nginx; highly flexible	Operational cost; latency

All in all, the AWS Load Balancer Controller is a go-to choice if you need a scalable and highly available solution with lower operational costs. If you prefer leveraging Nginx features and require more flexibility in the networking layer, the Ingress-Nginx Controller can be a good option. It’s worth noting that both AWS Load Balancer Controller and Ingress-Nginx Controller are battle-tested and reliable, so you can’t go wrong with either choice.

Now let’s go ahead and focus on the AWS Load Balancer Controller.

AWS Load Balancer Controller

The specific versions I used for the demonstrations in this post are as follows:

EKS: 1.30
AWS LBC: 2.8

Prerequisites for non-EKS clusters

In some cases, you might need to use the AWS Load Balancer Controller for a non-EKS cluster. When this is necessary, there are a few requirements you must meet:

Your public and private subnets must be tagged correctly for successful auto-discovery:
- For private subnets, use the tag kubernetes.io/role/internal-elb with a value of 1.
- For public subnets, use the tag kubernetes.io/role/elb with a value of 1.
- If you specify subnet IDs explicitly in annotations on services or ingress objects, tagging is not required.
For IP targets, ensure that pods have IPs from the VPC subnets. You can use the amazon-vpc-cni-k8s plugin to configure this.
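As a rough illustration of the tagging rules above, the sketch below picks candidate subnets for a load balancer based on its scheme. It is a simplification under my own assumptions: discover_subnets and the dictionary layout are hypothetical, and the real controller's auto-discovery applies additional rules that are not modeled here.

```python
PUBLIC_ROLE_TAG = "kubernetes.io/role/elb"
PRIVATE_ROLE_TAG = "kubernetes.io/role/internal-elb"


def discover_subnets(subnets: list[dict], scheme: str) -> list[str]:
    """Return IDs of subnets tagged for the given load balancer scheme."""
    # internet-facing load balancers use public subnets, internal ones use private subnets
    role_tag = PUBLIC_ROLE_TAG if scheme == "internet-facing" else PRIVATE_ROLE_TAG
    return [
        subnet["id"]
        for subnet in subnets
        if subnet.get("tags", {}).get(role_tag) == "1"
    ]
```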

These requirements are automatically met if you use eksctl or AWS CDK to create your VPC. For guidance on creating an EKS cluster with AWS CDK, check out my previous post on provisioning an Amazon EKS cluster.

Create Deployment and Service (for testing)

To demonstrate how AWS LBC works, we will create Deployment and Service resources using hashicorp/http-echo, a lightweight web server commonly used for testing or demonstration purposes.

Downloading the demo.yaml file and running the command below creates two namespaces, each containing two deployments and two services, to mimic a real-world environment:

kubectl apply -f demo.yaml

You can verify the resources with the following commands:

kubectl get -n demo0 svc,deploy,po to view resources in the demo0 namespace:

NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/echo0   ClusterIP   172.20.207.220   <none>        8000/TCP   42s
service/echo1   ClusterIP   172.20.252.94    <none>        8000/TCP   42s

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/echo0   1/1     1            1           43s
deployment.apps/echo1   1/1     1            1           43s

NAME                         READY   STATUS    RESTARTS   AGE
pod/echo0-5fb8fff5bf-sp4rb   1/1     Running   0          43s
pod/echo1-7588664fd4-46tl7   1/1     Running   0          43s

kubectl get -n demo1 svc,deploy,po to view resources in the demo1 namespace:

$ kubectl get -n demo1 svc,deploy,po
NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/echo2   ClusterIP   172.20.75.47    <none>        8000/TCP   47s
service/echo3   ClusterIP   172.20.47.157   <none>        8000/TCP   47s

NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/echo2   1/1     1            1           48s
deployment.apps/echo3   1/1     1            1           47s

NAME                        READY   STATUS    RESTARTS   AGE
pod/echo2-5bb6bf895-pmrvh   1/1     Running   0          48s
pod/echo3-658d589bb-vh7wb   1/1     Running   0          47s

Each pod should return a simple plain text response of its name, such as echo0 or echo1, when it receives an HTTP request.

Installation

Follow the installation guide to install the AWS Load Balancer Controller.

While the official installation guide suggests using CLI tools such as eksctl and aws to configure IAM roles for service accounts (IRSA), if you’ve deployed your cluster using AWS CDK like I have, you can configure IAM permissions with AWS CDK like this:

import requests
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from aws_cdk import aws_iam as iam
from aws_cdk.lambda_layer_kubectl_v30 import KubectlV30Layer
from constructs import Construct


class EksStack(Stack):
    """
    This stack deploys an EKS cluster to a given VPC.
    """

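    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.Vpc, **kwargs) -> None:
        # Initialize the base Stack before adding any constructs to it
        super().__init__(scope, construct_id, **kwargs)
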
        cluster = eks.Cluster(
            self,
            id="Cluster",
            version=eks.KubernetesVersion.V1_30,
            default_capacity=0,
            kubectl_layer=KubectlV30Layer(self, "kubectl"),
            vpc=vpc,
            cluster_name="my-cluster",
        )

        # ==== Configure IAM for the AWS Load Balancer Controller to have access to the AWS ALB/NLB APIs. ====
        # Download IAM policy
        url = "https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.8.1/docs/install/iam_policy.json"
        res = requests.get(url=url, timeout=10)
        res.raise_for_status()  # Fail early if the policy document cannot be downloaded
        iam_policy = res.json()

        # Create an IAM policy
        aws_lbc_iam_policy = iam.ManagedPolicy(
            self,
            id="ManagedPolicy1",
            managed_policy_name="AWSLoadBalancerControllerIAMPolicy",
            document=iam.PolicyDocument.from_json(iam_policy),
        )

        # Create a Kubernetes ServiceAccount for the LBC
        service_account = eks.ServiceAccount(
            self,
            id="ServiceAccount",
            cluster=cluster,
            name="aws-load-balancer-controller",
            namespace="kube-system",
        )

        # Attach the IAM policy to the ServiceAccount
        service_account.role.add_managed_policy(aws_lbc_iam_policy)
        # ====

If you’ve installed the AWS LBC with helm correctly, you will see the following output:

NAME: aws-load-balancer-controller
LAST DEPLOYED: Wed Aug  7 22:34:59 2024
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
AWS Load Balancer controller installed!

If everything’s done successfully, you will be able to see the aws-load-balancer-controller deployment with kubectl as shown below:

$ kubectl -n kube-system get deployment.apps/aws-load-balancer-controller
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
aws-load-balancer-controller   2/2     2            2           32s

Ingress

Applying the following Ingress manifest creates an ALB named ingress-demo0 of ip target-type:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: demo0
  name: ingress-demo0
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: ingress-demo0
    alb.ingress.kubernetes.io/scheme: internet-facing # default: internal
    alb.ingress.kubernetes.io/target-type: ip # default: instance
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /echo0
            pathType: Prefix
            backend:
              service:
                name: echo0
                port:
                  number: 8000
          - path: /echo1
            pathType: Prefix
            backend:
              service:
                name: echo1
                port:
                  number: 8000

Here’s a brief explanation of this manifest:

annotations:
- alb.ingress.kubernetes.io/load-balancer-name: Specifies the name of the load balancer as ingress-demo0.
- alb.ingress.kubernetes.io/scheme: Sets the load balancer scheme to internet-facing, making it accessible from the internet.
- alb.ingress.kubernetes.io/target-type: Specifies that the target type is ip, meaning the load balancer targets IP addresses of the pods.
- alb.ingress.kubernetes.io/healthcheck-path: Sets the health check path to /healthcheck, which the ALB will use to verify the health of the targets.
- There are more annotations you can configure, some of which we will cover later in this post.
spec:
- ingressClassName: Indicates that this Ingress resource uses the ALB Ingress controller by specifying alb. You can find its name via kubectl get ingressclass.
- rules: Defines the routing rules:
- For HTTP traffic, it defines two paths:
  - /echo0: Routes traffic to the echo0 service on port 8000.
  - /echo1: Routes traffic to the echo1 service on port 8000.
- Both paths use the Prefix path type, which means that the paths /echo0 and /echo1 and any sub-paths will be matched.
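To illustrate what Prefix matching means for the rules above, here is a small sketch of element-wise prefix matching as I understand it from the Kubernetes Ingress documentation. prefix_matches is a hypothetical helper for illustration only; the ALB evaluates its own listener rule conditions and does not run this code.

```python
def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Check whether request_path falls under rule_path, comparing path elements."""
    # Split on "/" and drop empty elements so "/echo0" and "/echo0/" are equivalent
    rule_parts = [part for part in rule_path.split("/") if part]
    request_parts = [part for part in request_path.split("/") if part]
    # The rule matches when its elements are a leading run of the request's elements
    return request_parts[: len(rule_parts)] == rule_parts
```

With this logic, /echo0 and /echo0/anything match the /echo0 rule, while /echo0bar does not, because matching is done per path element rather than per character.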

The load balancer’s name, which is configured by alb.ingress.kubernetes.io/load-balancer-name, must be unique within your set of ALBs and NLBs for the region. Otherwise, you may encounter conflicts that prevent the load balancer from being created.

The load balancers provisioned by the controller are not automatically removed, even if you delete the cluster. You need to manually clean up these resources to avoid incurring unnecessary costs.

The creation process takes a short period of time. You can get the DNS name of the load balancer with the kubectl command:

$ kubectl get ingress -n demo0
NAME            CLASS   HOSTS   ADDRESS                                                     PORTS   AGE
ingress-demo0   alb     *       ingress-demo0-1667793482.ap-northeast-3.elb.amazonaws.com   80      67s

After provisioning is complete, you can verify it by sending requests as follows:

$ curl http://ingress-demo0-1667793482.ap-northeast-3.elb.amazonaws.com/echo0
echo0
$ curl http://ingress-demo0-1667793482.ap-northeast-3.elb.amazonaws.com/echo1
echo1

So far, the ingress-demo0 ALB has three listener rules:

The first two listener rules are configured based on the paths from the Ingress resource. The last listener rule is a default, which will be covered in a moment. The load balancer can route traffic to specific target groups based on these rules. Each target group then has the IP addresses of target pods as its targets. This is because we’ve created a load balancer of the ip target-type.

For example, the following picture shows a target group for requests whose path is /echo0 or /echo0/*:

The target here is the IP address of a pod of echo0. You can verify this address with the following command:

$ kubectl get endpoints echo0 -n demo0
NAME    ENDPOINTS        AGE
echo0   10.0.3.53:8000   18m

If you scale out the pods, the addresses of the new pods are registered as new targets. Conversely, when you scale in, the addresses of the removed pods are deregistered.

The following command increases the number of pods of echo0 to 2:

kubectl scale deployment echo0 --replicas=2 -n demo0

Once a new pod is created, the endpoint list is updated accordingly:

$ kubectl get endpoints echo0 -n demo0
NAME    ENDPOINTS                        AGE
echo0   10.0.2.185:8000,10.0.3.53:8000   18m

The new IP address is then registered in the target group:

You can also use host-based routing to direct traffic to different services based on the hostname specified in the request.

The following example demonstrates how to set up host-based routing in an Ingress resource:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: demo0
  name: ingress-demo0
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: ingress-demo
    alb.ingress.kubernetes.io/scheme: internet-facing # default: internal
    alb.ingress.kubernetes.io/target-type: ip # default: instance
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck
    alb.ingress.kubernetes.io/group.name: ingress-demo
spec:
  ingressClassName: alb
  rules:
    - host: echo0.host.com
      http:
        paths:
          - path: /echo0
            pathType: Prefix
            backend:
              service:
                name: echo0
                port:
                  number: 8000
    - host: echo1.host.com
      http:
        paths:
          - path: /echo1
            pathType: Prefix
            backend:
              service:
                name: echo1
                port:
                  number: 8000

This way, any request to echo0.host.com/echo0 will be directed to the echo0 service on port 8000. And likewise, any request to echo1.host.com/echo1 will be directed to the echo1 service on port 8000.
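If the hostnames are not registered in DNS yet, one way to try host-based routing is to send the request to the load balancer's DNS name while setting the Host header yourself. The sketch below only builds such a request with the standard library; build_request is a hypothetical helper, and the DNS name in the comment is a placeholder you would replace with the ADDRESS value from kubectl get ingress.

```python
from urllib.request import Request


def build_request(alb_dns_name: str, host: str, path: str) -> Request:
    """Build an HTTP request to the ALB that carries an explicit Host header."""
    return Request(f"http://{alb_dns_name}{path}", headers={"Host": host})


# Usage (requires network access and a provisioned load balancer):
#   from urllib.request import urlopen
#   request = build_request("<your-alb-dns-name>", "echo0.host.com", "/echo0")
#   print(urlopen(request, timeout=5).read().decode())
```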

Again, all of this is done automatically by the ingress controller.

DefaultBackend

If you send requests to paths that are not specified in the rules of the Ingress resource, you will get a “404 Not Found” error by default:

$ curl -i ingress-demo0-1667793482.ap-northeast-3.elb.amazonaws.com
HTTP/1.1 404 Not Found
Server: awselb/2.0
Date: Wed, 07 Aug 2024 13:51:53 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 0
Connection: keep-alive

This can be configured with the DefaultBackend.

For demonstration purposes, let’s create a new resource that returns plain text saying “This is the default server.”:

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: demo0
  name: default-backend
  labels:
    app.kubernetes.io/name: default-backend
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: default-backend
  template:
    metadata:
      labels:
        app.kubernetes.io/name: default-backend
    spec:
      containers:
        - name: default-backend
          image: hashicorp/http-echo
          args:
            - -listen=:8000
            - -text=This is the default server.
          resources:
            limits:
              cpu: 10m
              memory: 50Mi
---
apiVersion: v1
kind: Service
metadata:
  namespace: demo0
  name: default-backend
  labels:
    app.kubernetes.io/name: default-backend
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: default-backend
  ports:
    - protocol: TCP
      appProtocol: http
      port: 8000
      targetPort: 8000

Here’s an example of configuring DefaultBackend to route any traffic that doesn’t match any rule to the default-backend service:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: demo0
  name: ingress-demo0
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: ingress-demo0
    alb.ingress.kubernetes.io/scheme: internet-facing # default: internal
    alb.ingress.kubernetes.io/target-type: ip # default: instance
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck
spec:
  ingressClassName: alb
  defaultBackend:
    service:
      name: default-backend
      port:
        number: 8000
  rules:
    ...

Verify the result with the following command:

$ curl ingress-demo0-1667793482.ap-northeast-3.elb.amazonaws.com
This is the default server.

If you happen to misconfigure the default backend to route to the wrong service, you will get a “503 Service Temporarily Unavailable” error:

HTTP/1.1 503 Service Temporarily Unavailable
Server: awselb/2.0
Date: Wed, 07 Aug 2024 13:54:22 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 30
Connection: keep-alive

Backend service does not exist

Healthcheck

As mentioned earlier, the AWS LBC leverages the Load Balancer’s health checks, which can directly represent the pod’s status.

Health checks of a target group in AWS are important because they ensure traffic is only routed to healthy targets, preventing failed or degraded targets from affecting application performance. This improves overall reliability and user experience by maintaining the high availability and stability of the service.

Aside from the alb.ingress.kubernetes.io/healthcheck-path annotation in the Ingress resource above, there are a few more options we can use to configure how to health check targets:

alb.ingress.kubernetes.io/healthcheck-port (default: traffic-port)
alb.ingress.kubernetes.io/healthcheck-protocol (default: HTTP)
alb.ingress.kubernetes.io/healthcheck-interval-seconds (default: 15)
alb.ingress.kubernetes.io/healthcheck-timeout-seconds (default: 5)
alb.ingress.kubernetes.io/healthy-threshold-count (default: 2)
alb.ingress.kubernetes.io/unhealthy-threshold-count (default: 2)
alb.ingress.kubernetes.io/success-codes (default: 200)

For example, if you increase the alb.ingress.kubernetes.io/healthcheck-interval-seconds value, the frequency of health checks will decrease, which can reduce the load on your application but may also delay the detection of unhealthy targets.

The default values might suffice in most cases; however, tuning these parameters can be beneficial depending on the specific needs of your application. Adjusting health check settings allows for finer control over how quickly unhealthy targets are detected and how aggressively traffic is rerouted, which can be crucial for optimizing performance and ensuring high availability.
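As a back-of-the-envelope way to reason about this trade-off, the sketch below estimates how long it takes to mark a target unhealthy as the check interval multiplied by the number of consecutive failures required. This is my own rough approximation: it ignores the timeout setting and any delay inside AWS, and approx_detection_seconds is a hypothetical helper, not part of any AWS tooling.

```python
def approx_detection_seconds(
    interval_seconds: int = 15,
    unhealthy_threshold_count: int = 2,
) -> int:
    """Roughly estimate the time needed to mark a failing target as unhealthy."""
    # A target must fail this many consecutive checks, one check per interval
    return interval_seconds * unhealthy_threshold_count
```

With the default values listed above, this gives roughly 30 seconds. Doubling the interval to 30 seconds halves the health check traffic but pushes the estimate to about 60 seconds.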

Ingress group

Previously, we deployed four services (echo0, echo1, echo2, and echo3) in two different namespaces: demo0 and demo1. We created an Ingress in the demo0 namespace to route traffic to echo0 and echo1, since Kubernetes Ingress resources can only route traffic to backend services within the same namespace. This means that to route traffic to echo2 and echo3 in the demo1 namespace, we would need to create a new Ingress, resulting in the provisioning of a new load balancer.

While this design might be suitable in certain scenarios, it’s often inefficient from both an infrastructural and operational perspective when dealing with numerous Ingress resources in the cluster, which is a common situation. If you don’t need multiple load balancers for each Ingress resource, it’s better to optimize the setup.

With the alb.ingress.kubernetes.io/group.name annotation, we can use a single load balancer for multiple Ingress resources.

The example below demonstrates this setup:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: demo0
  name: ingress-demo0
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: ingress-demo
    alb.ingress.kubernetes.io/scheme: internet-facing # default: internal
    alb.ingress.kubernetes.io/target-type: ip # default: instance
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck
    alb.ingress.kubernetes.io/group.name: ingress-demo
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /echo0
            pathType: Prefix
            backend:
              service:
                name: echo0
                port:
                  number: 8000
          - path: /echo1
            pathType: Prefix
            backend:
              service:
                name: echo1
                port:
                  number: 8000

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: demo1
  name: ingress-demo1
  annotations:
    alb.ingress.kubernetes.io/load-balancer-name: ingress-demo
    alb.ingress.kubernetes.io/scheme: internet-facing # default: internal
    alb.ingress.kubernetes.io/target-type: ip # default: instance
    alb.ingress.kubernetes.io/healthcheck-path: /healthcheck
    alb.ingress.kubernetes.io/group.name: ingress-demo
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /echo2
            pathType: Prefix
            backend:
              service:
                name: echo2
                port:
                  number: 8000
          - path: /echo3
            pathType: Prefix
            backend:
              service:
                name: echo3
                port:
                  number: 8000

Keep in mind that adding the alb.ingress.kubernetes.io/group.name annotation will result in replacing the existing load balancer with a new one. This can potentially cause downtime or configuration issues, which is something you might want to avoid in a production environment.

After creating a new Ingress like the example above, we have two ingresses:

$ kubectl get ingress --all-namespaces
NAMESPACE   NAME            CLASS   HOSTS   ADDRESS                                                    PORTS   AGE
demo0       ingress-demo0   alb     *       ingress-demo-1596788157.ap-northeast-3.elb.amazonaws.com   80      92s
demo1       ingress-demo1   alb     *       ingress-demo-1596788157.ap-northeast-3.elb.amazonaws.com   80      87s

Now that the two Ingress resources share the same load balancer, the LB has five listener rules, including the default:

You can verify the results with the following commands (with a new DNS name this time):

$ curl ingress-demo-1596788157.ap-northeast-3.elb.amazonaws.com/echo0
echo0
$ curl ingress-demo-1596788157.ap-northeast-3.elb.amazonaws.com/echo1
echo1
$ curl ingress-demo-1596788157.ap-northeast-3.elb.amazonaws.com/echo2
echo2
$ curl ingress-demo-1596788157.ap-northeast-3.elb.amazonaws.com/echo3
echo3

The priority of the listener rules can be modified using the alb.ingress.kubernetes.io/group.order annotation, which defaults to 0. If you don’t explicitly specify the order, the rule order among Ingresses within the same IngressGroup is determined by the lexical order of each Ingress’s namespace/name.
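The ordering described above can be sketched as a sort with two keys. This is only an illustration under my own assumptions: order_ingresses and the group_order key are hypothetical names, and I am assuming that a smaller order value is evaluated first, so check the controller documentation before relying on it.

```python
def order_ingresses(ingresses: list[dict]) -> list[str]:
    """Return namespace/name identifiers in the order their rules would be evaluated."""

    def sort_key(ingress: dict) -> tuple[int, str]:
        # group.order defaults to 0; ties are broken by the lexical order of namespace/name
        return (ingress.get("group_order", 0), f'{ingress["namespace"]}/{ingress["name"]}')

    return [
        f'{ingress["namespace"]}/{ingress["name"]}'
        for ingress in sorted(ingresses, key=sort_key)
    ]
```

For the two Ingresses in this post, neither of which sets an order, demo0/ingress-demo0 sorts before demo1/ingress-demo1.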

When multiple Ingresses share the same load balancer via the alb.ingress.kubernetes.io/group.name annotation, deleting one of the Ingresses does not remove the load balancer. The load balancer is deleted only when all associated Ingress resources are deleted.

IngressClass

IngressClass is a cluster-wide resource that any Ingress resource across all namespaces can refer to.

To have the ingress controller handle a specific Ingress, we need to specify the ingressClassName as alb. Alternatively, you can skip specifying the ingressClassName by setting the ingressclass.kubernetes.io/is-default-class annotation to true on the alb IngressClass.

Here’s the command to add the annotation to the alb IngressClass:

kubectl annotate ingressclass alb ingressclass.kubernetes.io/is-default-class=true

After this update, any new Ingress resource without an ingressClassName will implicitly use alb as its default IngressClass.

Alternatively, you can create your own IngressClass and configure all Ingress resources to reference that IngressClass by default.

Access control

Access control in AWS is typically managed by security groups. Security groups in AWS act as virtual firewalls for your resources, such as load balancers, controlling both incoming and outgoing traffic.

You can update the security group with the following annotations:

alb.ingress.kubernetes.io/inbound-cidrs
alb.ingress.kubernetes.io/security-group-prefix-lists
alb.ingress.kubernetes.io/listen-ports

By default, the controller will automatically create one security group that allows access from alb.ingress.kubernetes.io/inbound-cidrs and alb.ingress.kubernetes.io/security-group-prefix-lists to the alb.ingress.kubernetes.io/listen-ports. If these annotations are not present, the load balancer allows incoming traffic from any IP address to port HTTP:80 or HTTPS:443, depending on whether alb.ingress.kubernetes.io/certificate-arn is specified.

For example, the following annotations update the security group to allow access from 10.0.0.0/8 or 1.2.3.4/32 to ports HTTP:80, HTTPS:443, and HTTP:8000:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/inbound-cidrs: "10.0.0.0/8,1.2.3.4/32"
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}, {"HTTP": 8000}]'
    ...

After updating the Ingress, the inbound rules will be updated accordingly:

Using existing security groups is also possible through the alb.ingress.kubernetes.io/security-groups annotation. This annotation takes precedence over the alb.ingress.kubernetes.io/inbound-cidrs annotation.
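Returning to the inbound-cidrs example above, the sketch below shows one way to check locally whether a given source IP would fall inside an inbound-cidrs value before you apply it. is_allowed is a hypothetical helper built on Python's ipaddress module. It only models the CIDR check and says nothing about ports, prefix lists, or other security groups.

```python
import ipaddress


def is_allowed(source_ip: str, inbound_cidrs: str = "0.0.0.0/0") -> bool:
    """Check whether source_ip falls inside any CIDR of a comma-separated list."""
    address = ipaddress.ip_address(source_ip)
    # The annotation value is a comma-separated string such as "10.0.0.0/8,1.2.3.4/32"
    networks = [ipaddress.ip_network(cidr.strip()) for cidr in inbound_cidrs.split(",")]
    return any(address in network for network in networks)
```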

Additionally, setting alb.ingress.kubernetes.io/scheme to internal will make your load balancer only accessible from within your VPC. This is useful when you want to restrict access to your application to only resources within your VPC, enhancing security by preventing external access.

For more granular access control, you can consider using a service mesh like Istio.

Access logs

AWS load balancers provide the option to store access logs of all requests made to them, which can be instrumental in diagnosing issues, analyzing traffic patterns, and maintaining security.

To enable the access logs feature, you must have an S3 bucket for storing the logs. You can either create a new bucket or use an existing one. For instructions on creating an S3 bucket, visit this AWS documentation.

If you use AWS CDK, below is an example to provision an S3 bucket for storing access logs of a load balancer for regions available before August 2022:

from aws_cdk import Duration, RemovalPolicy, Stack
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3
from constructs import Construct


class S3Stack(Stack):
    """
    This stack deploys an S3 bucket for storing load balancer access logs.
    """

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Create an S3 bucket for storing ELB access logs
        lb_access_logs_bucket = s3.Bucket(
            self,
            "LBAccessLogBucket",
            auto_delete_objects=True,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            bucket_name="my-elb-access-logs",  # This name must be unique across all existing bucket names in Amazon S3
            encryption=s3.BucketEncryption.S3_MANAGED,  # This is the only server-side encryption option for access logs
            enforce_ssl=True,
            lifecycle_rules=[s3.LifecycleRule(expiration=Duration.days(90))],
            removal_policy=RemovalPolicy.DESTROY,
            versioned=False,  # default
        )

        # Define the bucket policy for ELB permission to write access logs
        elb_account_id = "383597477331"  # Replace it with the ID for your region
        prefix = "my-prefix"
        bucket_policy = iam.PolicyStatement(
            effect=iam.Effect.ALLOW,
            principals=[iam.ArnPrincipal(f"arn:aws:iam::{elb_account_id}:root")],
            actions=["s3:PutObject"],
            resources=[f"{lb_access_logs_bucket.bucket_arn}/{prefix}/AWSLogs/{self.account}/*"],
        )

        # Attach the policy to the bucket
        lb_access_logs_bucket.add_to_resource_policy(bucket_policy)

After provisioning an S3 bucket, add the alb.ingress.kubernetes.io/load-balancer-attributes annotation as shown below:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # Replace `my-elb-access-logs` and `my-prefix` with the values you used to create your S3 bucket.
    alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=my-elb-access-logs,access_logs.s3.prefix=my-prefix
    ...

Once you apply the Ingress, the access logs will be sent to the specified S3 bucket.
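Because the annotation value is a single comma-separated string of key=value pairs, it can be handy to generate it from a dictionary when you template manifests. The helper below is a hypothetical convenience function of my own, not part of the controller or of any AWS tooling.

```python
def build_lb_attributes(attributes: dict[str, str]) -> str:
    """Join load balancer attributes into the key=value,key=value annotation format."""
    return ",".join(f"{key}={value}" for key, value in attributes.items())


annotation_value = build_lb_attributes(
    {
        "access_logs.s3.enabled": "true",
        "access_logs.s3.bucket": "my-elb-access-logs",
        "access_logs.s3.prefix": "my-prefix",
    }
)
```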

For example log entries, visit this AWS documentation.

Note that removing the annotation does not disable access logs. To disable access logs, you need to explicitly set this value to access_logs.s3.enabled=false as shown below:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=false
    ...

Similarly, you can also enable connection logs:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    # Replace `my-connection-log-bucket` and `my-prefix` with the values you used to create your S3 bucket.
    alb.ingress.kubernetes.io/load-balancer-attributes: connection_logs.s3.enabled=true,connection_logs.s3.bucket=my-connection-log-bucket,connection_logs.s3.prefix=my-prefix
    ...
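The annotation value is a comma-separated list of key=value pairs, which is easy to mistype by hand. As a minimal sketch, it could be assembled in Python like this; `load_balancer_attributes` is a hypothetical helper name, and the bucket and prefix are the placeholder values from the access log example above:

```python
def load_balancer_attributes(attributes):
    """Join attributes into the comma-separated key=value form used by the annotation."""
    return ",".join(f"{key}={value}" for key, value in attributes.items())


# Placeholder values taken from the access log example above.
access_log_attributes = load_balancer_attributes(
    {
        "access_logs.s3.enabled": "true",
        "access_logs.s3.bucket": "my-elb-access-logs",
        "access_logs.s3.prefix": "my-prefix",
    }
)
print(access_log_attributes)
```

The resulting string can be pasted as the value of the alb.ingress.kubernetes.io/load-balancer-attributes annotation.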

Because the access log files are stored in S3 in compressed format, they are not convenient to read directly. You typically need analytical tools to process and analyze them.

One strategy to analyze and process access logs is to use Amazon Athena. Amazon Athena allows you to query data directly from S3 using SQL. You can create a table in Athena to map the log data format and then run SQL queries to analyze the logs. For more details, visit this AWS documentation.

Building Faceted Search using Elasticsearch

2024-07-08T00:00:00+00:00

Faceted search

Faceted search is a type of search that enables users to narrow down search results by applying multiple filters based on predefined facets, which are categorical attributes or properties of the items being searched. This method is particularly useful on e-commerce platforms, as it enhances the user experience by allowing users to efficiently refine their search queries based on specific criteria and preferences.

Below is an example of a faceted search interface for a query like ‘phone’ on Amazon:

In the picture above, the left pane shows available facets based on the items being searched. Clicking on any of the values in the facets will narrow down results accordingly. Therefore, a faceted search system’s functions can be broadly divided into two categories as below:

Getting facet distribution: Gathering and organizing data to provide users with a comprehensive overview of the facet distribution of items being searched.
Searching by facets: Narrowing down search results based on specific facets.

In this post, I will explain how to build a faceted search using Elasticsearch, which is a document-oriented database. I will also use the Kibana Console to interact with Elasticsearch. The specific versions of Elasticsearch and Kibana in this tutorial are both 8.13.4.

Relational database vs document-oriented database

Faceted search can indeed be built using a relational database. However, it is more common to use a document-oriented database like Elasticsearch for several reasons:

Performance: Relational databases are not inherently optimized for the complex, multi-dimensional queries typical of faceted search. This can lead to slow response times and high computational costs. Elasticsearch, on the other hand, is designed for such scenarios, providing faster query performance.
Scalability: Elasticsearch scales more effectively horizontally, allowing it to handle large volumes of data and high query loads more efficiently than relational databases.
Full-text search: Faceted search is often used with full-text search, an area where relational databases fall short.

For these reasons, leveraging a document-oriented database like Elasticsearch is often the preferred choice for implementing faceted search.

Modeling data

The faceted search system we are about to build is for an e-commerce platform where a product can have multiple facets and a facet can have multiple values, like the example above. For instance, a phone can have facets like color or capacity, where color can have values such as “black” or “red” and capacity can have values such as “128GB” or “256GB”.

The relationships among product, facet, and value would be described as follows:

Defining fields

First, we are going to have a facets field to store all facets data. Note that while we could explicitly define fields for every possible facet, such as color, capacity, or any other specific attribute, there is no need for such specific fields in this schema. Defining fields explicitly for every attribute would make the schema rigid and difficult to manage as the number of product attributes increases. Instead, we use a flexible approach by implementing generic facets. This allows us to dynamically handle a wide variety of product attributes without altering the schema structure. By leveraging nested fields within the facets property, we can accommodate any number of attributes, providing a scalable and adaptable solution. This approach not only simplifies the indexing process but also ensures that our search system can evolve with the product catalog, accommodating new attributes seamlessly as they arise.

The facets field will contain a list of facets, each of which has code, name, and values. The values will contain a list of values, each of which has code and name. A value here is like an option or attribute associated with the facet.

The reasons for assigning code to name are as follows:

Uniqueness: Codes ensure that each facet can be uniquely identified, avoiding any ambiguity that might arise from identical names. For example, capacity might refer to the storage capacity of a digital device (e.g., 128GB) or the volume capacity of a bottle (e.g., 500ml).
Consistency: Codes allow for consistent reference to facets across different systems and interfaces, making integrations and communications more reliable. This also helps in internationalization where multiple languages should be supported without changing the underlying schema.
Efficiency: Using codes can be more efficient for storage and processing, especially in large datasets where names might be longer and more variable.

Using id instead of code is also possible if the identifier is only used for internal referencing. I chose code to leave room for a more human-readable, descriptive string identifier.

Additionally, the index will also have id and name fields to reflect a real-world application.

Defining field data types

The id is a unique identifier of a document (or product), so it should be of keyword type. The name is the name of a product and should be of text type for full-text search. The facets should be of nested type to handle the multiple attributes associated with each product. The values should be of nested type too. Inside facets and values, the name and code properties are defined as keyword, which is suitable for filter operations, aggregations, and exact matches.

As a result, the index has three properties:

id (keyword)
name (text)
facets (nested)
- code (keyword)
- name (keyword)
- values (nested)
  - code (keyword)
  - name (keyword)

Defining mapping

Below is the complete JSON mapping for our Elasticsearch index:

{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "id": {
                "type": "keyword"
            },
            "name": {
                "type": "text"
            },
            "facets": {
                "type": "nested",
                "properties": {
                    "code": {
                        "type": "keyword"
                    },
                    "name": {
                        "type": "keyword"
                    },
                    "values": {
                        "type": "nested",
                        "properties": {
                            "code": {
                                "type": "keyword"
                            },
                            "name": {
                                "type": "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}

Running Elasticsearch locally

Skip this section unless you want to run Elasticsearch to follow this tutorial in your local environment.

You can run Elasticsearch in your local environment by using Docker Compose. Follow the steps to run Elasticsearch and Kibana:

Save the compose file in your directory.
Run docker-compose up in the same directory to run Elasticsearch and Kibana.
Open your web browser and navigate to http://localhost:5601, which is the Kibana URL.
Log in with the following credentials:
- Username: elastic
- Password: elasticpassword
Click on Dev Tools in the Management section in the side navigation menu.

If everything is done successfully, you should see the Console application:

This will provide an interactive interface where you can send requests to Elasticsearch.

Creating index

To create an index named products, go to the Kibana Console and send a PUT request to the /products path with the mappings we defined:

PUT /products
{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "id": {
                "type": "keyword"
            },
            "name": {
                "type": "text"
            },
            "facets": {
                "type": "nested",
                "properties": {
                    "name": {
                        "type": "keyword"
                    },
                    "code": {
                        "type": "keyword"
                    },
                    "values": {
                        "type": "nested",
                        "properties": {
                            "name": {
                                "type": "keyword"
                            },
                            "code": {
                                "type": "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}

If you see the following response, the index has been successfully created:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "products"
}

Inserting documents

The following tables provide example products and facets for demonstration:

Products:

id	name	facets
1	Phone 1	(Black), (128GB)
2	Phone 2	(Red), (256GB)
3	Phone 3	(Black, Red), (128GB, 256GB)
4	Phone 4	(Black, Blue), (256GB, 512GB)

Facets:

code	name
1	color
2	capacity

Color:

code	name
1	Black
2	Red
3	Blue

Capacity:

code	name
1	128GB
2	256GB
3	512GB

Let’s insert the documents with the following JSON requests:

POST /products/_doc
{
    "id": "1",
    "name": "Phone 1",
    "facets": [
        {
            "name": "color",
            "code": "1",
            "values": [
                {
                    "name": "Black",
                    "code": "1"
                }
            ]
        },
        {
            "name": "capacity",
            "code": "2",
            "values": [
                {
                    "name": "128GB",
                    "code": "1"
                }
            ]
        }
    ]
}

POST /products/_doc
{
    "id": "2",
    "name": "Phone 2",
    "facets": [
        {
            "name": "color",
            "code": "1",
            "values": [
                {
                    "name": "Red",
                    "code": "2"
                }
            ]
        },
        {
            "name": "capacity",
            "code": "2",
            "values": [
                {
                    "name": "256GB",
                    "code": "2"
                }
            ]
        }
    ]
}

POST /products/_doc
{
    "id": "3",
    "name": "Phone 3",
    "facets": [
        {
            "name": "color",
            "code": "1",
            "values": [
                {
                    "name": "Black",
                    "code": "1"
                },
                {
                    "name": "Red",
                    "code": "2"
                }
            ]
        },
        {
            "name": "capacity",
            "code": "2",
            "values": [
                {
                    "name": "128GB",
                    "code": "1"
                },
                {
                    "name": "256GB",
                    "code": "2"
                }
            ]
        }
    ]
}

POST /products/_doc
{
    "id": "4",
    "name": "Phone 4",
    "facets": [
        {
            "name": "color",
            "code": "1",
            "values": [
                {
                    "name": "Black",
                    "code": "1"
                },
                {
                    "name": "Blue",
                    "code": "3"
                }
            ]
        },
        {
            "name": "capacity",
            "code": "2",
            "values": [
                {
                    "name": "256GB",
                    "code": "2"
                },
                {
                    "name": "512GB",
                    "code": "3"
                }
            ]
        }
    ]
}
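The four request bodies above share one structure, so they can also be generated from the example tables. Below is a minimal sketch in Python; `FACETS`, `VALUES`, and `build_product` are hypothetical names introduced for illustration, and the lookup data is copied from the tables above:

```python
import json

# Lookup tables copied from the example tables above: code -> name.
FACETS = {"1": "color", "2": "capacity"}
VALUES = {
    "1": {"1": "Black", "2": "Red", "3": "Blue"},
    "2": {"1": "128GB", "2": "256GB", "3": "512GB"},
}


def build_product(product_id, name, selections):
    """Build one document body; `selections` maps facet code -> list of value codes."""
    return {
        "id": product_id,
        "name": name,
        "facets": [
            {
                "name": FACETS[facet_code],
                "code": facet_code,
                "values": [
                    {"name": VALUES[facet_code][value_code], "code": value_code}
                    for value_code in value_codes
                ],
            }
            for facet_code, value_codes in selections.items()
        ],
    }


# Phone 4: (Black, Blue), (256GB, 512GB)
phone_4 = build_product("4", "Phone 4", {"1": ["1", "3"], "2": ["2", "3"]})
print(json.dumps(phone_4, indent=4))
```

The printed JSON can be used as the body of a POST /products/_doc request.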

After indexing all four documents, you can retrieve all documents from the products index with the request below:

GET /products/_search
{
    "query": {
        "match_all": {}
    }
}

The response JSON should look something like this. (Since the full JSON response is quite extensive, I thought it’d be better to provide an external link rather than including it on this page or using a collapsible section to avoid disrupting the post.)

In order to get the facet distribution, you need to use Elasticsearch’s aggregation function. Below is the query DSL for it:

GET /products/_search
{
    "size": 0,
    "aggs": {
        "facets": {
            "nested": {
                "path": "facets"
            },
            "aggs": {
                "codes": {
                    "terms": {
                        "field": "facets.code"
                    },
                    "aggs": {
                        "names": {
                            "terms": {
                                "field": "facets.name"
                            }
                        },
                        "values": {
                            "nested": {
                                "path": "facets.values"
                            },
                            "aggs": {
                                "codes": {
                                    "terms": {
                                        "field": "facets.values.code"
                                    },
                                    "aggs": {
                                        "names": {
                                            "terms": {
                                                "field": "facets.values.name"
                                            }
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

The aggregations object from the response JSON can be found here.

Note that you don’t have to set size to 0 if you want the search results and the aggregation in a single request. Setting it to 0 here simply omits the search hits so that only the facet distribution is returned.

Breakdown

A query DSL below aggregates on facets.code only:

GET /products/_search
{
    "size": 0,
    "aggs": {
        "facets": {
            "nested": {
                "path": "facets"
            },
            "aggs": {
                "codes": {
                    "terms": {
                        "field": "facets.code"
                    }
                }
            }
        }
    }
}

The outer facets aggregation is a nested aggregation on the facets path. Inside it, codes is a terms aggregation on facets.code, which creates one bucket for each unique value of facets.code.

The aggregations object from the response JSON would be as follows:

{
    "aggregations": {
        "facets": {
            "doc_count": 8,
            "codes": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": "1",
                        "doc_count": 4
                    },
                    {
                        "key": "2",
                        "doc_count": 4
                    }
                ]
            }
        }
    }
}

But we do not know the corresponding facet names yet.

To retrieve the facet names, we will add a sub-aggregation on facets.name within a facets.code aggregation. A sub-aggregation allows you to further break down the results for each bucket of documents. Here’s the query DSL for it:

GET /products/_search
{
    "size": 0,
    "aggs": {
        "facets": {
            "nested": {
                "path": "facets"
            },
            "aggs": {
                "codes": {
                    "terms": {
                        "field": "facets.code"
                    },
                    "aggs": {
                        "names": {
                            "terms": {
                                "field": "facets.name"
                            }
                        }
                    }
                }
            }
        }
    }
}

The aggregations object from the response JSON is as follows:

{
    "aggregations": {
        "facets": {
            "doc_count": 8,
            "codes": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": "1",
                        "doc_count": 4,
                        "names": {
                            "doc_count_error_upper_bound": 0,
                            "sum_other_doc_count": 0,
                            "buckets": [
                                {
                                    "key": "color",
                                    "doc_count": 4
                                }
                            ]
                        }
                    },
                    {
                        "key": "2",
                        "doc_count": 4,
                        "names": {
                            "doc_count_error_upper_bound": 0,
                            "sum_other_doc_count": 0,
                            "buckets": [
                                {
                                    "key": "capacity",
                                    "doc_count": 4
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
}

Now we have the facet distribution by code and name.

Since each facets.code value is unique and maps to exactly one name in our scenario, every code bucket contains exactly one facets.name bucket. Conversely, if a facet code were associated with multiple names (e.g., two names), the number of name buckets would match the number of names (e.g., 2).
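This one-name-per-code assumption can be checked on the aggregation result before relying on it. A minimal sketch, assuming the bucket layout from the response above; `codes_with_ambiguous_names` is a hypothetical helper and the sample buckets are trimmed by hand to the keys it reads:

```python
def codes_with_ambiguous_names(code_buckets):
    """Return the keys of code buckets that do not hold exactly one name bucket."""
    return [
        bucket["key"]
        for bucket in code_buckets
        if len(bucket["names"]["buckets"]) != 1
    ]


# Trimmed copy of the `codes` buckets from the response above.
code_buckets = [
    {"key": "1", "doc_count": 4, "names": {"buckets": [{"key": "color", "doc_count": 4}]}},
    {"key": "2", "doc_count": 4, "names": {"buckets": [{"key": "capacity", "doc_count": 4}]}},
]

ambiguous = codes_with_ambiguous_names(code_buckets)
print(ambiguous)
```

An empty list means every facet code maps to a single name, so taking the first name bucket is safe.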

Similarly, another nested aggregation is added on facets.values, which breaks each facet down further with a terms aggregation on facets.values.code:

GET /products/_search
{
    "size": 0,
    "aggs": {
        "facets": {
            "nested": {
                "path": "facets"
            },
            "aggs": {
                "codes": {
                    "terms": {
                        "field": "facets.code"
                    },
                    "aggs": {
                        "names": {
                            "terms": {
                                "field": "facets.name"
                            }
                        },
                        "values": {
                            "nested": {
                                "path": "facets.values"
                            },
                            "aggs": {
                                "codes": {
                                    "terms": {
                                        "field": "facets.values.code"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Lastly, just as we added a sub-aggregation on facets.name, adding a sub-aggregation on facets.values.name within the facets.values.code aggregation completes the query, giving the facet distribution with all codes and names mapped together.
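Because the code/name pattern repeats at both nesting levels, the complete request can be assembled from one small building block. The sketch below is one way to do it in Python; both function names are hypothetical, and the result is meant to mirror the full query DSL shown at the start of this section:

```python
def code_and_name_terms(prefix):
    """Terms aggregation on `<prefix>.code` with a sub-aggregation on `<prefix>.name`."""
    return {
        "terms": {"field": f"{prefix}.code"},
        "aggs": {"names": {"terms": {"field": f"{prefix}.name"}}},
    }


def build_facet_distribution_request():
    """Assemble the nested facet-distribution aggregation as a Python dict."""
    facet_codes = code_and_name_terms("facets")
    # Add the second nesting level next to the facet-level `names` aggregation.
    facet_codes["aggs"]["values"] = {
        "nested": {"path": "facets.values"},
        "aggs": {"codes": code_and_name_terms("facets.values")},
    }
    return {
        "size": 0,
        "aggs": {
            "facets": {
                "nested": {"path": "facets"},
                "aggs": {"codes": facet_codes},
            }
        },
    }


request_body = build_facet_distribution_request()
print(request_body["aggs"]["facets"]["aggs"]["codes"]["aggs"].keys())
```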

Sorting

You can also sort the values by a specific metric such as count or alphabetical order.

Here’s an example of how you can sort the values by count in ascending order:

"terms": {
    "field": "facets.values.code",
    "order": {
        "_count": "asc"
    }
}

And here’s an example of how you can sort the values alphabetically by key:

"terms": {
    "field": "facets.values.code",
    "order": {
        "_key": "asc"
    }
}

Size

By default, a terms aggregation returns the top 10 buckets. Here’s an example of how you can increase the size to 20:

"terms": {
    "field": "facets.values.code",
    "size": 20
}
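The order and size options can also be set in the same terms aggregation. A small sketch of a helper that assembles such a body in Python; `terms_aggregation` is a hypothetical name, not part of any client library:

```python
def terms_aggregation(field, size=None, order=None):
    """Build a terms aggregation body, adding `size` and `order` only when given."""
    terms = {"field": field}
    if size is not None:
        terms["size"] = size
    if order is not None:
        terms["order"] = order
    return {"terms": terms}


# Up to 20 value buckets, sorted by count in ascending order.
value_codes = terms_aggregation("facets.values.code", size=20, order={"_count": "asc"})
print(value_codes)
```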

Now we know how to obtain facet distribution using Elasticsearch. However, the responses from Elasticsearch might not be suitable for rendering in a faceted search interface on the UI, as they often contain raw, nested data structures and metadata. Therefore, it is advisable to transform these responses into a more user-friendly format.

For example, the aggregations part of the response can be transformed into a simpler, more understandable structure like the following JSON:

[
    {
        "key": "1",
        "name": "color",
        "count": 4,
        "values": [
            {
                "key": "1",
                "name": "Black",
                "count": 3
            },
            {
                "key": "2",
                "name": "Red",
                "count": 2
            },
            {
                "key": "3",
                "name": "Blue",
                "count": 1
            }
        ]
    },
    {
        "key": "2",
        "name": "capacity",
        "count": 4,
        "values": [
            {
                "key": "2",
                "name": "256GB",
                "count": 3
            },
            {
                "key": "1",
                "name": "128GB",
                "count": 2
            },
            {
                "key": "3",
                "name": "512GB",
                "count": 1
            }
        ]
    }
]

The following Python code will help transform the response into a format like the example above:

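# `elasticsearch` is an already configured Elasticsearch client instance, and
# `query` is the facet distribution request body shown earlier, as a Python dict.
response = elasticsearch.search(index="products", body=query)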

facet_distribution = []
for facet_bucket in response["aggregations"]["facets"]["codes"]["buckets"]:
    facet = {
        "key": facet_bucket["key"],
        "name": facet_bucket["names"]["buckets"][0]["key"],
        "count": facet_bucket["doc_count"],
        "values": [],
    }
    for value_bucket in facet_bucket["values"]["codes"]["buckets"]:
        value = {
            "key": value_bucket["key"],
            "name": value_bucket["names"]["buckets"][0]["key"],
            "count": value_bucket["doc_count"],
        }
        facet["values"].append(value)
    facet_distribution.append(facet)

This way, the data becomes easier to understand and use in your faceted search interface.
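To try the transformation without a running cluster, the same logic can be wrapped in a function and fed a hand-written response. In this sketch, `transform_facet_distribution` is a hypothetical name, and `sample_aggregations` is a stand-in for response["aggregations"], trimmed to the color facet and to the keys the function reads:

```python
def transform_facet_distribution(aggregations):
    """Flatten the nested aggregation buckets into a list of facets with their values."""
    facet_distribution = []
    for facet_bucket in aggregations["facets"]["codes"]["buckets"]:
        values = [
            {
                "key": value_bucket["key"],
                "name": value_bucket["names"]["buckets"][0]["key"],
                "count": value_bucket["doc_count"],
            }
            for value_bucket in facet_bucket["values"]["codes"]["buckets"]
        ]
        facet_distribution.append(
            {
                "key": facet_bucket["key"],
                "name": facet_bucket["names"]["buckets"][0]["key"],
                "count": facet_bucket["doc_count"],
                "values": values,
            }
        )
    return facet_distribution


def _bucket(key, name, count):
    """Hand-written code bucket with its single name bucket."""
    return {"key": key, "doc_count": count, "names": {"buckets": [{"key": name, "doc_count": count}]}}


color_bucket = _bucket("1", "color", 4)
color_bucket["values"] = {
    "codes": {"buckets": [_bucket("1", "Black", 3), _bucket("2", "Red", 2), _bucket("3", "Blue", 1)]}
}
sample_aggregations = {"facets": {"codes": {"buckets": [color_bucket]}}}

color_facet = transform_facet_distribution(sample_aggregations)[0]
print(color_facet["name"], [(v["name"], v["count"]) for v in color_facet["values"]])
```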

First, let’s search for products with the color black. The following query DSL searches for documents with facet code 1 (color) and value code 1 (Black):

GET /products/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "1"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "1"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

The response JSON would look something like this.

It has retrieved three documents with the black value in their color facet. Below is the table version of the response JSON for the sake of readability:

id	name	facets
1	Phone 1	(Black), (128GB)
3	Phone 3	(Black, Red), (128GB, 256GB)
4	Phone 4	(Black, Blue), (256GB, 512GB)
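Since this query is deeply nested, it can help to generate it from the selected facet and value codes instead of writing it by hand. A minimal sketch in Python; `facet_clause` and `facet_search` are hypothetical helper names, and the output is intended to match the query DSL above:

```python
def facet_clause(facet_code, value_codes):
    """Nested clause matching one facet code with any of the given value codes."""
    return {
        "nested": {
            "path": "facets",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"facets.code": facet_code}},
                        {
                            "nested": {
                                "path": "facets.values",
                                "query": {"terms": {"facets.values.code": value_codes}},
                            }
                        },
                    ]
                }
            },
        }
    }


def facet_search(selected):
    """Wrap one clause per selected facet in an outer bool/must query."""
    clauses = [facet_clause(code, values) for code, values in selected.items()]
    return {"query": {"bool": {"must": clauses}}}


# Facet code "1" is color and value code "1" is Black.
black_phones = facet_search({"1": ["1"]})
print(black_phones)
```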

Conjunctive facets, also known as regular facets, allow users to filter search results by multiple criteria simultaneously using “AND” logic.

For example, the query DSL below searches for products that have black in the color facet and 512GB in the capacity facet:

GET /products/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "1"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "1"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "2"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "3"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

This time, the response will contain one document with both black and 512GB in its facets:

id	name	facets
4	Phone 4	(Black, Blue), (256GB, 512GB)

Disjunctive facets use “OR” logic.

For example, the query DSL below searches for documents whose color is either red or blue:

GET /products/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "1"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "2",
                                                        "3"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

The response will contain three documents with red or blue as their color facet:

id	name	facets
2	Phone 2	(Red), (256GB)
3	Phone 3	(Black, Red), (128GB, 256GB)
4	Phone 4	(Black, Blue), (256GB, 512GB)

In scenarios where you want to search for products across different facets using OR logic, you can use should instead of must in the query DSL.

For example, the query DSL below searches for documents whose color is black or whose capacity is 512GB:

GET /products/_search
{
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "1"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "1"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "2"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "3"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

The response JSON will contain three documents with black or 512GB in their facets:

id	name	facets
4	Phone 4	(Black, Blue), (256GB, 512GB)
1	Phone 1	(Black), (128GB)
3	Phone 3	(Black, Red), (128GB, 256GB)

You can also combine conjunctive facets and disjunctive facets. Here’s the query DSL that searches documents whose color is either black or red and whose capacity is 128GB. In other words, (color: black or red) and (capacity: 128GB).

GET /products/_search
{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "1"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "1",
                                                        "2"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "facets",
                        "query": {
                            "bool": {
                                "must": [
                                    {
                                        "term": {
                                            "facets.code": "2"
                                        }
                                    },
                                    {
                                        "nested": {
                                            "path": "facets.values",
                                            "query": {
                                                "terms": {
                                                    "facets.values.code": [
                                                        "1"
                                                    ]
                                                }
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}

The response JSON will contain two documents that match the condition:

id	name	facets
1	Phone 1	(Black), (128GB)
3	Phone 3	(Black, Red), (128GB, 256GB)
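If you assemble these queries in application code, the deep nesting is easier to manage with a small helper. The following is a minimal Python sketch, assuming facet codes and value codes are plain strings as in the queries above. The names facet_clause and build_facet_query are my own for illustration and are not part of any Elasticsearch client.

```python
def facet_clause(facet_code: str, value_codes: list[str]) -> dict:
    """Build one nested clause: the facet code matches AND any listed value code matches."""
    return {
        "nested": {
            "path": "facets",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"facets.code": facet_code}},
                        {
                            "nested": {
                                "path": "facets.values",
                                "query": {
                                    "terms": {"facets.values.code": value_codes}
                                },
                            }
                        },
                    ]
                }
            },
        }
    }


def build_facet_query(selected: dict[str, list[str]]) -> dict:
    """AND across facets ("must"), OR within each facet's values ("terms")."""
    clauses = [facet_clause(code, values) for code, values in selected.items()]
    return {"query": {"bool": {"must": clauses}}}


# (color: black or red) and (capacity: 128GB), using the codes from the query above.
query = build_facet_query({"1": ["1", "2"], "2": ["1"]})
```

The resulting dictionary has the same shape as the request body shown above, so it can be serialized to JSON and sent to the _search endpoint.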

Conclusion

The method outlined in this post is just one of many approaches to building faceted search. There may be better designs and tools depending on your specific use cases and the unique needs of your users. Ultimately, with the right approach and tools, such as Elasticsearch, you can create an intuitive and efficient search system that can significantly enhance the user experience of any search-driven application.

Book Review: Code Simplicity by Max Kanat-Alexander

2024-06-14T00:00:00+00:00

Code Simplicity by Max Kanat-Alexander is an ambitious effort to define a set of principles, rules, and laws for software development. In my opinion, this is a bold attempt, considering that software development is too complex and dynamic to establish fixed rules and laws. Yet, what Max tries to achieve is commendable and thought-provoking, as having general rules and laws could indeed help guide developers towards creating more maintainable and efficient code.

I actually read this book a year ago, and I have since tried to apply some of the rules and laws it presents. Now, I think it’s a good time to write a review. This review will cover some topics I found particularly interesting and include some of my favorite quotes from the book.

Good Programmers and Bad Programmers

In “Code Simplicity”, Max starts by distinguishing between good programmers and bad programmers:

The difference between a bad programmer and a good programmer is understanding. That is, bad programmers don’t understand what they are doing, and good programmers do.

These two lines are the very first lines of the book. While I am not entirely sure if I agree with the statement, as I believe there are many factors that define whether a programmer is good or bad, they remain the most memorable ones after reading the book.

Over the past year, I have deliberately tried to examine if I truly understand what I am doing in my work. More specifically, I have tried to understand the problems I am solving and to consider alternative ways of addressing these problems with simpler designs. This process of understanding undeniably helps in making better design decisions.

However, achieving simpler design cannot be accomplished solely by understanding one’s own actions; it also requires a degree of comprehension of the trade-offs associated with various technologies and design methods. This understanding allows for more informed and balanced decision-making.

I think the importance of understanding what I am doing extends beyond individual capability. It enhances communication with colleagues, leading to improved productivity for the entire team, considering that software development is pretty much a team endeavor.

Here are some additional lines about good programmers and bad programmers:

So, a “good programmer” should do everything in his power to make what he writes as simple as possible to other programmers. A good programmer creates things that are easy to understand, so that it’s really easy to shake out all the bugs.

Programmers who don’t fully understand their work tend to develop complex systems.

If so, why should a good programmer prioritize simplicity? To answer that, we need to understand the purpose of software and its nature stated in the book:

The purpose of software is to help people.

The longer your program exists, the more probable it is that any piece of it will have to change.

It is obvious that the more complex the system, the more difficult it becomes to make changes, and the higher the likelihood of bugs occurring. A complex system with bugs and a rigid design can fail to meet user needs, become a burden for developers, and ultimately lead to its own obsolescence. This is bad software that won’t help people.

From all these rules and laws, we can deduce the following relationship:

bad programmer ~= complexity ~= bugs, cost ~= bad software ~= won't help people

Conversely, we can assert:

good programmer ~= simplicity ~= fewer bugs, less cost ~= good software ~= help people

Therefore, we conclude:

good programmer ~= help people

While I initially questioned my agreement with the opening statements of the book, I would highly agree with this equivalence.

Accordingly, I would redefine a good programmer as follows: “A good programmer efficiently solves problems without needlessly complicating solutions.”

I also want to note that a good programmer should balance input from others with their own understanding and judgment. The following quotes emphasize the balance between collaboration and individual responsibility in decision-making within a team:

A designer should always be willing to listen to suggestions and feedback, because programmers are usually smart people who have good ideas. But after considering all the data, any given decision must be made by an individual, not by a group of people.

I’ve had an experience where I prioritized major feedback over trusting my initial decision regarding my own task. (My design was simpler, so to speak.) Later on, I confirmed my initial choice was indeed the better path forward, as those who initially offered contrary feedback eventually supported my decision after encountering a perplexing problem arising from unnecessary complexity in the design. It’s not that those who offered feedback weren’t good programmers; rather, it’s just that I was the one responsible for the task and understood what I was doing the most. After all, this aligns with a key quality of a good programmer: understanding.

The Equation of Software Design

The following equation describes the desirability of a change:

D = (Vn + Vf) / (Ei + Em)

where each term stands for:

D: the desirability of a change
Vn: the value now
Vf: the future value
Ei: the effort of implementation
Em: the effort of maintenance

Max especially points out the effect of effort of maintenance:

It is more important to reduce the effort of maintenance than it is to reduce the effort of implementation.

This is because maintenance effort accumulates over the lifespan of a system, leading to significant long-term costs and potential complications, whereas implementation effort is a one-time cost.
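To make the effect concrete, here is a small Python sketch of the equation. The numbers are made-up units chosen purely for illustration, and the function name desirability is my own rather than anything from the book.

```python
def desirability(value_now: float, future_value: float,
                 effort_impl: float, effort_maint: float) -> float:
    """D = (Vn + Vf) / (Ei + Em)."""
    return (value_now + future_value) / (effort_impl + effort_maint)


# Both options deliver the same value (hypothetical numbers).
# Option A is quick to build but costly to maintain.
option_a = desirability(value_now=5, future_value=10, effort_impl=2, effort_maint=13)
# Option B takes longer to build but is cheap to maintain.
option_b = desirability(value_now=5, future_value=10, effort_impl=6, effort_maint=4)

print(option_a)  # 1.0
print(option_b)  # 1.5
```

Even though option B costs three times as much to implement, its lower maintenance effort makes it the more desirable change in this example.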

Then how do we reduce the maintenance cost? It all comes down to simplicity again. Among all the terms in the equation above, what simplicity truly solves the most is the effort of maintenance (Em). And this is why we should strive for simplicity in our code and system designs.

And we know the priority equation in terms of value and cost.

priority = value / cost

The following graph represents the priority equation with each value ranging from 0 to 10, for example:

This is a fundamental concept for making strategic decisions in business and project management, which is very obvious.

The problem is that we sometimes fall into the mistake of spending higher effort for lower value. This can happen when we prioritize tasks based on perceived urgency or complexity without a careful evaluation. We all have such experiences where a task initially seemed complex, only to find that as we progressed, its actual complexity did not align with our initial perception. The risk of misallocating priority due to misperceiving complexity would have been reduced if the design were simpler.

Then how do we measure the value of a change? I think measuring the value of a change is relatively more difficult than measuring the cost of a change. However, there are several ways to measure the value of a change:

Key Performance Indicators (KPIs)
User Feedback
A/B Testing

These methods can provide insights into a change’s value, helping to solve the priority equation effectively.

Incremental Development and Design

The most common and disastrous error that programmers make is predicting something about the future when in fact they cannot know.

Here are the three flaws that go against the principles of simplicity:

YAGNI (You Aren’t Gonna Need It)
- Don’t write code until you actually need it, and remove any code that isn’t being used.
Rigid design
- Code should be designed based on what you know now, not on what you think will happen in the future.
Overengineering
- Be only as generic as you know you need to be right now.

These patterns all increase the probability of project failure in that they add “unnecessary” complexity, making the codebase harder to maintain, understand, and extend.

The “necessary” complexity is the inherent complexity that naturally arises from solving a given problem within its specific context and requirements. It includes the essential features and logic needed to achieve the desired functionality without adding any superfluous elements.

Max presents “incremental development and design” as a way to avoid all three of these pitfalls. It’s about developing only essential features in a small and simple manner, and incrementally adding to them. This approach is similar to the concept of MVP (Minimum Viable Product).

So, what exactly is developing only essential features bit by bit? It can be understood by contrasting it with the opposite approach: speculative development. It goes like this:

“Let’s include this feature, and that feature might be necessary too, and in the future, it might evolve like this, so it would be good to develop this in advance, and having that would also be beneficial…”

From my experience, this is one of the most practical methods to avoid unnecessary complexity. If you start by developing only the absolutely necessary features in a small and simple manner, even if it turns out that certain features are actually needed in the future, it won’t be difficult to add them later. In fact, modifying what you’ve already built at that time can incur much higher costs.

One point of contention I found in the book is Max’s somewhat negative stance on rewriting programs to cope with complexity. While he argues that rewriting programs is generally not advisable, in my experience, there have been times when rewriting has been successful. With proper planning and strategy, this approach can be particularly useful in microservices architecture, where restructuring a service can be achieved without significant cost.

How simple do we have to be?

Stupid, Dumb Simple.

Conclusion

I think any designer like me who appreciates the value of simplicity would empathize with many parts of this book. While many things may just seem obvious, it was a good read to bolster the importance of simplicity through the experiences of one brilliant software architect.

Dependency Injection and Dependency Inversion

2024-05-08T00:00:00+00:00

Dependency injection and dependency inversion are two terms that often come together but serve distinct meanings and purposes.

Definitions

The following definitions are sourced from Wikipedia.

Dependency Injection:

In software engineering, dependency injection is a programming technique in which an object or function receives other objects or functions that it requires, as opposed to creating them internally. Dependency injection aims to separate the concerns of constructing objects and using them, leading to loosely coupled programs.

Dependency inversion (often referred to as DIP):

In object-oriented design, the dependency inversion principle is a specific methodology for loosely coupled software modules. When following this principle, the conventional dependency relationships established from high-level, policy-setting modules to low-level, dependency modules are reversed, thus rendering high-level modules independent of the low-level module implementation details.

While numerous articles delve into explaining the concepts of dependency injection and dependency inversion, many of them tend to feel abstract, employing placeholder names like Foo or Bar for example classes, or lacking relevant context. Recognizing this gap, I aim to fill it by providing straightforward examples closely related to real-world scenarios. In this post, we’ll explore what they are, why they matter, and how they are related. I hope that after reading this post, you’ll grasp the definitions provided.

Let’s start with the context.

Context

Imagine an online flight booking application. This app handles flight reservations and processes payments through PayPal, a digital payment platform. The example code for this app contains two classes - PaypalPaymentProcessor and FlightBookingProcessor, each responsible for different parts of the payment process and reservation.

import requests


class PaypalPaymentProcessor:
    """
    A class that handles payment processing via PayPal.
    """

    def process_payment(self, amount: float) -> dict:
        """
        Process a payment via PayPal.
        """
        url = "https://api-m.sandbox.paypal.com/v2/payments"
        res = requests.post(url, json={"transactions": {"amount": amount}})
        return res.json()

class FlightBookingProcessor:
    """
    A class that books a flight.
    """
    def __init__(self):
        self.payment_processor = PaypalPaymentProcessor()

    def book_flight(self, amount: float):
        """
        Book a flight.
        """
        res = self.payment_processor.process_payment(amount)

For clarity, the implementation details are simplified here. But in reality, the process_payment method would handle the intricacies of interfacing with PayPal’s API, sending the necessary payment information, and processing the response accordingly. The book_flight method would orchestrate the booking process, including handling payment transactions with the returned dictionary value of the process_payment method. In practice, both classes are often maintained by dedicated developer teams to separate concerns, allowing for modular development and easier maintenance of the codebase. We will assume such a scenario.

Before diving deeper, let’s take a moment and look at what “dependency” is. In this context, the FlightBookingProcessor acts as the dependent or client, while the PaypalPaymentProcessor serves as a dependency of the FlightBookingProcessor class. More specifically, the PaypalPaymentProcessor is an implicit dependency.

The following diagram describes the relationship between these two classes:

Now, suppose the organization behind this app decides to expand its business to another region. And due to specific requirements in that region, it is required to integrate Stripe as the payment platform. As a user of the FlightBookingProcessor class, there is no way to use Stripe instead of Paypal since they have no control over the dependency. More specifically, the instantiation of the PaypalPaymentProcessor is abstracted away from the user of the FlightBookingProcessor class as follows:

flight_booking_processor = FlightBookingProcessor()
flight_booking_processor.book_flight(amount=800)

To satisfy such requirements without dependency injection, the dev teams of both the PaypalPaymentProcessor and the FlightBookingProcessor would have to modify their components. What does that mean?

Let’s say the dev team of payment service creates a new class for its clients to use Stripe as follows:

class StripePaymentProcessor:
    """
    A class that handles payment processing via Stripe.
    """

    def process_payment(self, amount: float) -> dict:
        """
        Process a payment via Stripe.
        """
        url = "https://api.stripe.com/v2/payments"
        body = {"amount": amount}
        res = requests.post(url, json=body)
        return res.json()

Creating a new StripePaymentProcessor class, which is independent of the PaypalPaymentProcessor class, exemplifies the Single Responsibility Principle (SRP) by adhering to a clear and focused purpose for each class, which is great so far.

However, the existing design of the flight booking application lacks flexibility when integrating a new payment platform. This rigidity arises from the direct instantiation of the PaypalPaymentProcessor within the FlightBookingProcessor class, which tightly couples these two classes together.

Without dependency injection, the developers of FlightBookingProcessor would have to modify their code as well. For example:

class FlightBookingProcessor:
    """
    A class that books a flight.
    """
    def __init__(self, payment_platform: str):
        if payment_platform == "paypal":
            self.payment_processor = PaypalPaymentProcessor()
        elif payment_platform == "stripe":
            self.payment_processor = StripePaymentProcessor()
        else:
            # Fail early instead of leaving payment_processor unset.
            raise ValueError(f"Unsupported payment platform: {payment_platform}")

    def book_flight(self, amount: float):
        """
        Book a flight.
        """
        res = self.payment_processor.process_payment(amount)

In this version, the FlightBookingProcessor class uses a conditional to instantiate the appropriate payment processor based on the payment_platform parameter. This way, users can select the desired payment platform:

# Use Paypal as a payment platform.
flight_booking_processor = FlightBookingProcessor(payment_platform="paypal")
flight_booking_processor.book_flight(amount=800)

# Use Stripe as a payment platform.
flight_booking_processor = FlightBookingProcessor(payment_platform="stripe")
flight_booking_processor.book_flight(amount=800)

While this approach technically satisfies the requirements, it still has design limitations:

It violates the Single Responsibility Principle (SRP), which advocates for classes to have only one reason to change. In this case, the FlightBookingProcessor class is now responsible for both booking flights and selecting the appropriate payment platform based on a string parameter.
This design increases the degree of coupling, as the FlightBookingProcessor class now has a new dependency.
Introducing a new payment platform or modifying an existing one would require developers to update the conditional logic in the constructor of the FlightBookingProcessor class, resulting in increased development cost and reduced readability. Such design can lead to poor extensibility.

This poses another problem in terms of testability. Since the FlightBookingProcessor class directly instantiates the payment processor objects internally, any unit tests for the book_flight method would inherently rely on the functionality of the actual payment processors. This introduces dependencies on external systems, making unit tests difficult to isolate and control. Additionally, testing different scenarios, such as error handling or edge cases, becomes cumbersome with this design.

So, how can we improve the design to minimize coupling and enhance both maintainability and testability?

Dependency Injection

With dependency injection, we can effectively decouple the dependent (the FlightBookingProcessor class) from its dependencies (such as the PaypalPaymentProcessor class), enhancing flexibility and testability within our codebase. There are mainly two methods in Python for applying dependency injection: constructor injection and method injection.

Constructor injection

Constructor injection is considered the most common form of dependency injection.

The code example demonstrating constructor injection is as follows:

class FlightBookingProcessor:
    """
    A class that books a flight.
    """
    
    def __init__(self, payment_processor):
        self.payment_processor = payment_processor

    def book_flight(self, amount: float):
        """
        Book a flight.
        """
        res = self.payment_processor.process_payment(amount)

With this version, the FlightBookingProcessor class is provided the dependency through the constructor instead of internally instantiating it:

# Instantiate dependency
paypal_payment_processor = PaypalPaymentProcessor()
# Inject dependency to constructor
flight_booking_processor = FlightBookingProcessor(paypal_payment_processor)
flight_booking_processor.book_flight(800)

The advantage of this approach is that it forces the injection of necessary dependencies in order to create the client. Once dependencies are injected via the constructor, they are typically immutable for the lifetime of the object. This immutability can help enforce the principle of encapsulation and prevent unintended modifications to dependencies.

Method injection

Method injection is another form of dependency injection.

The code example demonstrating method injection is as follows:

class FlightBookingProcessor:
    """
    A class that books a flight.
    """

    def book_flight(self, amount: float, payment_processor):
        """
        Book a flight.
        """
        res = payment_processor.process_payment(amount)

With this version, the FlightBookingProcessor class is provided the dependency through the method instead of internally instantiating it:

# Instantiate dependency
paypal_payment_processor = PaypalPaymentProcessor()
flight_booking_processor = FlightBookingProcessor()
# Inject dependency to method
flight_booking_processor.book_flight(amount=800, payment_processor=paypal_payment_processor)

The advantage of this approach is that it supports dynamic dependency injection, allowing clients to use different dependencies at runtime. This flexibility can be beneficial in scenarios where dependencies need to be varied dynamically.

Comparing the two methods, constructor injection offers advantages such as explicit dependency declaration and immutability of dependencies. On the other hand, method injection provides dynamic dependency injection. When it comes to the question of which choice is better, it depends on the specific requirements and design of the project. While I use both methods interchangeably, my personal preference is method injection as it allows for better flexibility in managing dependencies over the lifecycle of an object.

Either way, the point is that they all aim to make implicit dependencies explicit.

Advantages

With dependency injection, users of the FlightBookingProcessor gain the flexibility to choose whichever payment platforms they need to use, as demonstrated below:

paypal_payment_processor = PaypalPaymentProcessor()
flight_booking_processor =  FlightBookingProcessor(paypal_payment_processor)
flight_booking_processor.book_flight(amount=800)

stripe_payment_processor = StripePaymentProcessor()
flight_booking_processor = FlightBookingProcessor(stripe_payment_processor)
flight_booking_processor.book_flight(amount=800)

This approach enhances the flexibility and maintainability of your code. Adding a new component no longer necessitates modifications in its dependent components, as seen in the example without dependency injection provided earlier.

Dependency injection also greatly improves testability. By injecting dependencies, you can easily substitute real dependencies with mock objects or stubs during testing. For example, you can create a mock payment processor for testing purposes as follows:

mock_payment_processor = MockPaymentProcessor()
flight_booking_processor =  FlightBookingProcessor(mock_payment_processor)
flight_booking_processor.book_flight(amount=800)

This decouples the FlightBookingProcessor from the implementation details of its dependencies, which would have otherwise been tightly bound without dependency injection applied. As a result, we can write more focused and resilient tests.
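For reference, here is one way such a test double could look. This is a minimal sketch: MockPaymentProcessor, its charged_amounts attribute, and the canned response are hypothetical choices of mine, and the FlightBookingProcessor class is the constructor-injection version repeated only to keep the snippet runnable on its own.

```python
class MockPaymentProcessor:
    """A hypothetical test double: records calls and returns a canned response."""

    def __init__(self):
        self.charged_amounts = []

    def process_payment(self, amount: float) -> dict:
        self.charged_amounts.append(amount)
        return {"status": "approved"}  # made-up response shape


class FlightBookingProcessor:
    """Constructor-injection version, repeated here for self-containment."""

    def __init__(self, payment_processor):
        self.payment_processor = payment_processor

    def book_flight(self, amount: float):
        res = self.payment_processor.process_payment(amount)


def test_book_flight_charges_the_given_amount():
    mock_payment_processor = MockPaymentProcessor()
    flight_booking_processor = FlightBookingProcessor(mock_payment_processor)
    flight_booking_processor.book_flight(amount=800)
    # No network call is made; we only check what the mock recorded.
    assert mock_payment_processor.charged_amounts == [800]


test_book_flight_charges_the_given_amount()
```

Because the mock never touches the network, the test stays fast and deterministic, and error scenarios can be simulated by returning a different canned response or raising an exception from the mock.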

Limitations

Dependency injection is a means, not an end. – Daniel Somerfield

In essence, dependency injection is simply a technique for providing a dependent object with the dependencies it requires to function. However, dependency injection alone may still lack robustness in design if the dependent relies on concrete implementations rather than abstractions.

In the example above, even with dependency injection applied, the FlightBookingProcessor class still depends on concrete implementations like the PaypalPaymentProcessor class (or the StripePaymentProcessor class). The relationship between FlightBookingProcessor and PaypalPaymentProcessor remains unchanged. Specifically, the FlightBookingProcessor class relies on the process_payment method of its injected dependency and the value returned by that method. As a result, any changes to these implementations could potentially break the code.

Let’s see some potential issues, especially in dynamically typed languages like Python.

For instance, imagine a scenario where the process_payment method is changed, perhaps to return a string value instead of the dictionary as it previously did:

class PaypalPaymentProcessor:
    """
    A class that handles payment processing via PayPal.
    """

    def process_payment(self, amount: float) -> str:
        """
        Process a payment via PayPal.
        """
        url = "https://api-m.sandbox.paypal.com/v2/payments"
        res = requests.post(url, json={"transactions": {"amount": amount}})
        return res.json()["payment_id"]

In this scenario, the code would break since the FlightBookingProcessor was designed to work with the dictionary value returned from the process_payment method.

Another flaw in this design emerges when injecting dependencies that the dependent is not compatible with. Consider a scenario where you need to add another payment method, such as SquarePaymentProcessor. If the developer of this new processor is ignorant of the contract between it and its dependent, they might inadvertently design it without the process_payment method, as demonstrated below:

class SquarePaymentProcessor:
    """
    A class that handles payment processing.
    """

    def process(self, amount: float) -> dict:
        url = "https://connect.squareup.com/v2/payments"
        body = {"amount_money": amount}
        res = requests.post(url, json=body)
        return res.json()

In this case, the code would break when attempting to call process_payment on an instance of SquarePaymentProcessor because FlightBookingProcessor expects it to have a specific method, process_payment.
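The failure only shows up at call time, not when the dependency is injected. Here is a minimal sketch of that behavior; to keep it runnable without network access, I replaced the HTTP request in SquarePaymentProcessor with a canned dictionary, which is an assumption of this example.

```python
class SquarePaymentProcessor:
    """Stand-in for the class above; the HTTP call is replaced by a canned dict."""

    def process(self, amount: float) -> dict:
        return {"amount_money": amount}


class FlightBookingProcessor:
    """Constructor-injection version, repeated here for self-containment."""

    def __init__(self, payment_processor):
        self.payment_processor = payment_processor

    def book_flight(self, amount: float):
        res = self.payment_processor.process_payment(amount)


# Injection succeeds silently: nothing checks the dependency's interface here.
flight_booking_processor = FlightBookingProcessor(SquarePaymentProcessor())

try:
    flight_booking_processor.book_flight(amount=800)
except AttributeError as exc:
    # The mismatch surfaces only when book_flight is actually called.
    error_message = str(exc)
    print(error_message)
```

In a real application this error could be raised deep inside a booking flow, long after the incompatible object was wired in.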

These issues highlight the drawback of strict dependence on concrete implementations when injecting dependencies. To address these challenges, we need to invert the dependencies.

Dependency Inversion

The dependency inversion principle (DIP), which is one of the SOLID principles, encourages abstraction and decoupling by ensuring that high-level modules do not directly depend on low-level modules. Instead, both should depend on abstractions. To clarify, a high-level module is responsible for managing the primary logic or main use cases of an application, while a low-level module contains implementation details like API interactions or payment processing.

In our context, the high-level module refers to FlightBookingProcessor while the low-level module refers to PaypalPaymentProcessor. With that in mind, let’s refactor our example code to depend on abstractions rather than concrete implementation by creating an abstract base class for all payment processors:

from abc import ABC, abstractmethod

class PaymentProcessor(ABC):
    """
    An abstract base class that defines the interface for payment processing.
    """

    @abstractmethod
    def process_payment(self, amount: float) -> dict:
        """
        Process a payment and return a response.
        :param amount: The amount to be processed
        :return: A response dict representing the outcome of the payment
        """
        ...

In the example, the PaymentProcessor class provides an abstraction that ensures any payment processing implementation must define the process_payment method. This abstract class acts as the base class upon which various payment processor implementations can be built.
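To see what this enforcement looks like in practice, consider a subclass that forgets to implement process_payment. The sketch below repeats the abstract class so it runs on its own. The SquarePaymentProcessor here is a hypothetical, deliberately incomplete subclass with its HTTP call replaced by a canned dictionary.

```python
from abc import ABC, abstractmethod


class PaymentProcessor(ABC):
    """Same abstract base class as above, repeated for self-containment."""

    @abstractmethod
    def process_payment(self, amount: float) -> dict:
        ...


class SquarePaymentProcessor(PaymentProcessor):
    """Hypothetical subclass that does not implement process_payment."""

    def process(self, amount: float) -> dict:
        return {"amount_money": amount}


try:
    SquarePaymentProcessor()
except TypeError as exc:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    error_message = str(exc)
    print(error_message)
```

The mistake is now caught as soon as the object is created, rather than later when a dependent calls the missing method.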

It’s also a good practice to indicate abstract components. One such way is using an I prefix in their names, such as IPaymentProcessor.

Next, we implement specific payment processors that inherit from PaymentProcessor and provide their own concrete implementations of process_payment. Here’s how PaypalPaymentProcessor is implemented:

class PaypalPaymentProcessor(PaymentProcessor):
    """
    A class that handles payment processing via PayPal.
    """

    def process_payment(self, amount: float) -> dict:
        """
        Process a payment via PayPal.
        """
        url = "https://api-m.sandbox.paypal.com/v2/payments"
        res = requests.post(url, json={"transactions": {"amount": amount}})
        return res.json()

You can apply the same method to StripePaymentProcessor and SquarePaymentProcessor as well:

class StripePaymentProcessor(PaymentProcessor):
    ...

class SquarePaymentProcessor(PaymentProcessor):
    ...

Lastly, refactor the FlightBookingProcessor class to depend on the abstraction PaymentProcessor rather than the concrete PaypalPaymentProcessor:

class FlightBookingProcessor:
    """
    A class that books a flight.
    """
    
    def __init__(self, payment_processor: PaymentProcessor):
        self.payment_processor = payment_processor

    def book_flight(self, amount: float):
        """
        Book a flight.
        """
        res = self.payment_processor.process_payment(amount)

By depending on the abstract PaymentProcessor, FlightBookingProcessor can now utilize any payment processor implementation as long as it adheres to the PaymentProcessor interface. As a result, both FlightBookingProcessor and PaypalPaymentProcessor now depend on the abstraction PaymentProcessor:

One might argue that abstractions can also change, potentially breaking their dependents. That’s correct. If the abstractions need to change, the high-level modules should indeed change accordingly. However, the rationale behind this rule is that the abstractions rarely change compared to the concrete implementations. Moreover, this approach can force developers to carefully design abstractions, ensuring they are well-structured and extensible. This is why adhering to abstractions provides more stability and flexibility in the long run.

All in all, it is not mandatory to define the abstract base class, but it is desirable in order to achieve a cleaner design. – Mariano Anaya, Clean Code in Python

Conclusion (Final thoughts about DIP)

Dependency injection and dependency inversion are both useful guidelines in software design. Nevertheless, I want to point out that we shouldn’t be overly fixated on these principles. In our example, creating abstractions for different payment processors seems like an obvious way to ensure stability and flexibility in our design. But there are situations where a simpler solution will suffice, especially if it’s intended for one-time use and may never require future modification. That is, the overhead of implementing complex abstractions may outweigh the benefits.

One of my favorite design philosophies is ‘Keep It Simple, Stupid’ (KISS), which encourages developers to focus on clear, straightforward solutions that fulfill requirements without unnecessary complexity. While this idea sounds easy to follow, I think it’s more challenging to apply in practice than simply adhering to the design principles I’ve covered in this post. I also believe that experienced developers are more likely to understand the value of KISS and know how to write simple yet maintainable code, as they’ve likely encountered numerous scenarios where simplicity triumphed over complexity. Remember, regardless of the design principle, keeping your code simple and straightforward ensures maintainability and provides enough room for easy refactoring whenever necessary.

Preventing Flaky Tests and Brittle Tests

2024-03-23T00:00:00+00:00

Flaky tests and brittle tests are two common pitfalls that make our tests unreliable and hard to maintain. In this post, I will illustrate these concepts with examples and suggest ways to prevent such problems to ensure the reliability and maintainability of tests.

The examples provided in this post are written in Python, and pytest is used to validate them.

Flaky Tests

Flaky tests are those that exhibit nondeterministic outcomes (pass or fail) across executions under the same conditions. This inconsistency can stem from factors such as random variables, external dependencies, or timing issues.

Using random variables

One of the most common causes of flakiness is probably the use of random variables.

Within the application

The following example uses a random variable within the application:

import random


class SortingHat:
    def get_house(self, name: str) -> str:
        houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
        house = random.choice(houses)
        print(f"The Sorting Hat has chosen {house} for {name}.")
        return house

As you can see, the return value of get_house() is nondeterministic, as it randomly selects and returns a house using the choice() function of the random module.

A possible test case for this application would be written as follows:

def test_get_house():
    sorting_hat = SortingHat()
    house = sorting_hat.get_house("Harry Potter")
    assert house == "Gryffindor"

This test is flaky because it relies on a random value returned by the get_house method of the SortingHat class. It passes when Gryffindor is chosen and fails when any other house is chosen. With four houses, the probability of the test failing is 75%. The more possible values there are, the higher the probability of failure, making the test even more unreliable.

To tackle this problem, you can use mocking or seeding.

The example below uses mocking:

import random

import pytest


def test_get_house(monkeypatch: pytest.MonkeyPatch):
    # Patch random.choice to always return "Gryffindor"
    monkeypatch.setattr(random, "choice", lambda x: "Gryffindor")

    sorting_hat = SortingHat()
    house = sorting_hat.get_house("Harry Potter")
    assert house == "Gryffindor"

By mocking, the get_house method only returns Gryffindor, ensuring predictable behavior during testing.

Or you can use seeding as follows:

import random


class SortingHat:
    def get_house(self, name: str) -> str:
        houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
        # Seed the random number generator so the same name always yields the same house
        random.seed(name + "some salt to make it more random")
        house = random.choice(houses)
        print(f"The Sorting Hat has chosen {house} for {name}.")
        return house

By seeding the random number generator with a consistent value based on input parameters, you can ensure that the test consistently produces the same results across different test runs.

The choice between mocking and seeding depends on factors such as the complexity of your code and the requirements of your applications. While mocking can provide determinism without altering the application code, it may lead to a disconnect between the test and the actual behavior of the system. On the other hand, seeding the random number generator aligns the test with the real behavior of the system but requires modifying the application code.

Tests that pass most of the time yet are still flaky are even harder to track down. Let’s see what this means with another example in a different context.

Within the test

This time, random variables are used in the test, not in the application under test.

Here we have the create_user function, which creates and returns an instance of the User class:

class User:
    def __init__(self, name: str, id: int):
        self.name = name
        self.id = id


def create_user(name: str, id: int) -> User:
    if id <= 0:
        raise ValueError("id must be positive")
    user = User(name, id)
    return user

Let’s say a developer writes a test case to verify if create_user indeed creates a User with the id passed into it:

def test_create_user():
    name = "Mickey"
    id = random.randint(0, 99)
    user = create_user(name, id)

    assert user.name == name
    assert user.id == id

In this scenario, the test’s failure probability is 1%, given that there are 100 potential values generated by random.randint(0, 99). Only one of these values, namely 0, would result in a failed test. Consequently, there’s a high likelihood that the test will pass, creating a false belief in its reliability. By the time the failure occurs, it might not be immediately reproducible, making it more challenging to isolate and fix the underlying issue.
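The arithmetic behind this is worth spelling out. Assuming each run draws its id independently, the chance of seeing at least one failure over n runs is 1 - 0.99^n. The short calculation below evaluates that (the function name is only illustrative):

```python
def probability_of_any_failure(per_run_failure: float, runs: int) -> float:
    """Chance that at least one of `runs` independent runs fails."""
    return 1 - (1 - per_run_failure) ** runs


# A 1% per-run failure rate over 1, 10, and 100 runs
for runs in (1, 10, 100):
    print(runs, round(probability_of_any_failure(0.01, runs), 3))
```

Over a hundred runs the chance of at least one failure is roughly 63%, so the test is more likely than not to fail at some point, yet any single rerun will probably pass.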

To prevent this problem, you can consider using deterministic values for testing instead of relying on random variables. In this case, you could replace the random ID generation with a fixed value or a controlled set of values that cover relevant test cases. This approach ensures that the test consistently evaluates the behavior of the create_user function without introducing unnecessary flakiness from random inputs.
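Here is a minimal sketch of what that could look like. The specific values are my own choice: a few valid ids including the smallest one, plus the invalid boundary that the random test hit only occasionally. With pytest you would more typically express these cases with pytest.mark.parametrize and pytest.raises; plain loops are used here only to keep the sketch free of dependencies.

```python
class User:
    def __init__(self, name: str, id: int):
        self.name = name
        self.id = id


def create_user(name: str, id: int) -> User:
    if id <= 0:
        raise ValueError("id must be positive")
    return User(name, id)


def test_create_user_with_valid_ids():
    # Fixed values, including the smallest valid id
    for id in (1, 50, 99):
        user = create_user("Mickey", id)
        assert user.name == "Mickey"
        assert user.id == id


def test_create_user_rejects_non_positive_ids():
    # The boundary is now tested on purpose, on every run
    for id in (0, -1):
        try:
            create_user("Mickey", id)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for id={id}")
```

The invalid case becomes an explicit, always-executed test instead of an accident that happens once in a hundred runs.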

External dependencies

Relying on external dependencies that are beyond our control can lead to test flakiness.

Here is a simple example that depends on external dependencies:

import requests


def get_response(url: str) -> requests.Response:
    return requests.get(url, timeout=1)

And one can write a test case for this as follows:

def test_get_response():
    url = "https://example.com/"
    response = get_response(url)
    assert response.status_code == 200

This test could become flaky if the response time exceeds 1 second or the external service (https://example.com/) is temporarily down. An unreliable network can cause flakiness as well.

The most general approach to preventing such issues is to use mocks, as follows:

def test_get_response(monkeypatch: pytest.MonkeyPatch):
    url = "https://example.com/"

    # Define a mock response object
    class MockResponse:
        def __init__(self, status_code):
            self.status_code = status_code

    # Define a mock function to replace requests.get()
    def mock_get(*args, **kwargs):
        # Simulate the behavior of the external service
        return MockResponse(200)

    # Use monkeypatch to replace requests.get() with the mock function
    monkeypatch.setattr(requests, "get", mock_get)

    # Call the function under test
    response = get_response(url)

    assert response.status_code == 200

In this test, pytest.MonkeyPatch is used to replace the requests.get() function with a mock function (mock_get). This mock function simulates the behavior of the external service by returning a mock response with 200 status code. As a result, you can isolate the code under test from its dependencies and ensure that the test remains reliable regardless of external factors.
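The same idea also works without pytest's monkeypatch fixture. The sketch below is a standard-library variant, not the code from the example above: get_status is a hypothetical counterpart of get_response built on urllib.request, and unittest.mock.patch replaces urlopen for the duration of the test.

```python
import urllib.request
from unittest import mock


def get_status(url: str) -> int:
    # Hypothetical standard-library counterpart of get_response()
    with urllib.request.urlopen(url, timeout=1) as response:
        return response.status


def test_get_status():
    # Fake response usable as a context manager, reporting status 200
    fake_response = mock.MagicMock()
    fake_response.__enter__.return_value.status = 200

    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
        assert get_status("https://example.com/") == 200

    # The function passed the URL and timeout through to urlopen
    fake_urlopen.assert_called_once_with("https://example.com/", timeout=1)
```

Besides removing the network from the test, the mock lets you assert how the dependency was called, here that the 1-second timeout was passed along.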

Timing issues: Concurrency

Concurrency is also a possible cause of flakiness. Flakiness due to concurrency is subtler and more difficult to track down than other causes of flakiness.

Let’s see how it leads to nondeterministic test results and some ways to prevent them.

Within the application

The application code with concurrency may lead to flakiness in test outcomes.

Let’s say we have a function named get_factorials, which gets the factorial values of the given numbers concurrently using multiple threads as follows:

import concurrent.futures
from math import factorial


def get_factorial(num: int) -> int:
    return factorial(num)


def get_factorials(nums: list[int]) -> list[int]:
    """
    Get factorials of the given numbers concurrently using multiple threads.
    """
    results = []

    # Create a ThreadPoolExecutor with the desired number of threads
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(nums)) as executor:
        # Submit tasks to the executor
        futures = [executor.submit(get_factorial, num) for num in nums]

        # Wait for all tasks to complete and collect the results
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            results.append(result)

    return results

And we have the following test case for get_factorials:

def test_get_factorials():
    nums = [1, 2, 3]
    results = get_factorials(nums)
    assert results == [1, 2, 6]

This test is flaky in that get_factorials does not guarantee the order of results. This is because the execution of threads can be nondeterministic, meaning that the order in which the threads complete their tasks may vary over time.

For this particular example, you may simply sort the results to prevent flakiness in the test:

def test_get_factorials():
    nums = [1, 2, 3]
    results = get_factorials(nums)
    assert sorted(results) == [1, 2, 6]

Other ways to deal with this problem include deterministic thread scheduling or an option to limit concurrency, which reduces the likelihood of flaky test outcomes.

Below is an example that allows an option to limit concurrency:

def get_factorials(nums: list[int], max_workers: int = 4) -> list[int]:
    """
    Get factorials of the given numbers concurrently using multiple threads.
    """
    results = []

    # Create a ThreadPoolExecutor with the desired number of threads
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit tasks to the executor
        futures = [executor.submit(get_factorial, num) for num in nums]

        # Wait for all tasks to complete and collect the results
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            results.append(result)

    return results

def test_get_factorials():
    nums = [1, 2, 3]
    results = get_factorials(nums, max_workers=1)  # Use only 1 thread
    assert results == [1, 2, 6]

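This way, only one thread is deployed, so the tasks run one at a time in submission order. Be aware, though, that as_completed() yields futures that had already finished before it was called in no guaranteed order, so very fast tasks like these may still come back out of order.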
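If callers need the results in the same order as the inputs, another option is to let the executor preserve that order. The following is a sketch of an alternative get_factorials, not the original implementation: it uses executor.map(), which returns results in input order regardless of which thread finishes first.

```python
import concurrent.futures
from math import factorial


def get_factorials(nums: list[int], max_workers: int = 4) -> list[int]:
    """
    Get factorials concurrently, returning results in the order of `nums`.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in input order, regardless of completion order
        return list(executor.map(factorial, nums))


def test_get_factorials():
    assert get_factorials([1, 2, 3]) == [1, 2, 6]
```

With this version, the test needs neither sorting nor a single-threaded mode, because ordering is part of the function's contract.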

Within the test

Tests can become flaky if you use shared resources between multiple tests and run them concurrently without proper synchronization.

Look at the following code:

# product.py
from redis import Redis

redis = Redis()


class Product:
    def __init__(self, id):
        self.id = id

    @property
    def quantity(self):
        return int(redis.get(f"product:{self.id}:quantity") or 0)

    @quantity.setter
    def quantity(self, value):
        redis.set(f"product:{self.id}:quantity", value)

    def add_quantity(self, value):
        self.quantity += value

    def reduce_quantity(self, value):
        self.quantity -= value

    def reset_quantity(self):
        self.quantity = 0

In the example above, we have a Product class that uses Redis to store and manage quantity information for each product. It also has three methods that interact directly with the Redis instance to manipulate the quantity data:

add_quantity
reduce_quantity
reset_quantity

Imagine we write three test cases to validate each method of the Product class as follows:

# test_product.py
import pytest
from product import Product
from redis import Redis

redis = Redis()


class Test:
    @pytest.fixture(autouse=True, scope="function")
    def setup(self):
        # Clean up test data from redis before running tests
        redis.delete("product:1:quantity")

    def test_add_quantity(self):
        product = Product(1)
        product.add_quantity(10)
        assert product.quantity == 10

    def test_reduce_quantity(self):
        product = Product(1)
        product.add_quantity(10)
        product.reduce_quantity(5)
        assert product.quantity == 5

    def test_reset_quantity(self):
        product = Product(1)
        product.add_quantity(10)
        product.reset_quantity()
        assert product.quantity == 0

With the setup method to clean up any data in Redis before running tests, the same state can be provided to each test. This way, the tests will consistently pass without incurring any flakiness if we run them sequentially.

However, if the tests are executed concurrently, there is a possibility of encountering flakiness due to potential race conditions or interference between the tests accessing the shared Redis instance. You can actually observe non-deterministic outcomes at each execution if you run the tests in the example with pytest-xdist, a library that allows for concurrent execution of tests in Python.

One solution to prevent such flakiness due to concurrency within the test is to implement locks to ensure proper isolation of access to shared resources. For example:

import pytest
from product import Product
from redis import Redis

redis = Redis()


class Test:
    @pytest.fixture(autouse=True, scope="function")
    def setup(self):
        # Clean up test data from redis before running tests
        lock = redis.lock("redis_lock", timeout=5)
        with lock:
            redis.delete("product:1:quantity")
            yield  # Release the lock after the test (teardown)

    def test_add_quantity(self):
        product = Product(1)
        product.add_quantity(10)
        assert product.quantity == 10

    def test_reduce_quantity(self):
        product = Product(1)
        product.add_quantity(10)
        product.reduce_quantity(5)
        assert product.quantity == 5

    def test_reset_quantity(self):
        product = Product(1)
        product.add_quantity(10)
        product.reset_quantity()
        assert product.quantity == 0

With this modification, we can ensure that only one test or thread accesses the shared resource at a time, preventing race conditions and guaranteeing consistent results. While using locks adds overhead and may slow down test execution, it can at least reduce the possibility of flakiness caused by concurrent access to resources.
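An alternative that avoids locking altogether is to stop sharing the key in the first place. The sketch below gives every test its own product id, so concurrent tests never touch the same data. InMemoryStore is a hypothetical stand-in for the Redis client so that the example runs without a server, and new_product is an illustrative helper, not part of the code above.

```python
import uuid


class InMemoryStore:
    """Hypothetical stand-in for the Redis client (no server needed)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


store = InMemoryStore()


class Product:
    def __init__(self, id):
        self.id = id

    @property
    def quantity(self):
        return int(store.get(f"product:{self.id}:quantity") or 0)

    @quantity.setter
    def quantity(self, value):
        store.set(f"product:{self.id}:quantity", value)

    def add_quantity(self, value):
        self.quantity += value

    def reduce_quantity(self, value):
        self.quantity -= value


def new_product() -> Product:
    # A fresh random id per test means no two tests share a key
    return Product(uuid.uuid4().hex)


def test_add_quantity():
    product = new_product()
    product.add_quantity(10)
    assert product.quantity == 10


def test_reduce_quantity():
    product = new_product()
    product.add_quantity(10)
    product.reduce_quantity(5)
    assert product.quantity == 5
```

The trade-off is that, against a real Redis instance, the per-test keys accumulate unless each test deletes its own key afterwards.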

Heterogeneous test runners

Flaky outcomes are not necessarily caused by the way we write code; they can also come from the way we organize the test runners that execute our tests, for example in a continuous integration (CI) pipeline. This may not be a problem if you run your tests only in your local environment or another consistent environment, but it can happen where multiple different test runners are used to execute tests.

Specifically, the problem with heterogeneous test runners arises from differences in their configurations, capacity, and dependencies. For instance, one test runner might lack some dependencies to run the tests. In this case, a test may pass on one runner but fail on another, leading to inconsistent results.

(A log snippet from GitHub Actions. We use self-hosted runners to run tests and encountered this issue from time to time: some of the test runners were missing a dependency required to run the tests.)

What’s more, managing multiple test runners is time-consuming and adds complexity to the testing infrastructure. Teams may struggle to synchronize configurations and updates across all environments. Additionally, troubleshooting flaky tests becomes more difficult when multiple runners are involved, as diagnosing the root cause requires understanding how each runner is configured and behaves.

To prevent the problems of heterogeneous test runners, you can standardize your testing environment by adopting a unified approach. This involves ensuring consistency in all dependencies and system configurations across environments, which can lead to deterministic test results.
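Standardization itself happens outside the test code, but a small preflight check can at least turn a missing dependency into one clear failure instead of scattered errors. The following is only a sketch of that idea: REQUIRED_MODULES is a hypothetical list you would replace with the top-level modules your tests import, and the check relies on importlib.util.find_spec returning None when a top-level module cannot be found.

```python
import importlib.util

# Hypothetical list: replace with the top-level modules your tests import
REQUIRED_MODULES = ["json", "math"]


def find_missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found on this runner."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def test_runner_has_required_modules():
    missing = find_missing_modules(REQUIRED_MODULES)
    assert not missing, f"runner is missing dependencies: {missing}"
```

A check like this does not make runners consistent, but it makes an inconsistent runner easy to spot in the logs.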

Note that it’s important to avoid writing flaky tests in the first place because, again, identifying the cause of these issues often leads to a significant reduction in productivity, as there is usually no straightforward way to track down the problem.

Brittle Tests

Brittle tests are those that are sensitive to changes in the application, leading to frequent failures even when unrelated changes are made to the codebase. I am going to explore two causes that can make tests brittle:

over-specifying expected outcomes
relying on extensive and complicated boilerplate

Let’s see how they might look and how to prevent them.

Over-specifying expected outcomes

The problem of over-specifying expected outcomes is that it tightly couples the test with the implementation details of the code, not the behavior.

Without further explanation, let’s look at the following code:

# product.py
from dataclasses import dataclass
from enum import Enum


@dataclass
class Product:
    id: int
    name: str
    price: int


# Simulate product database
product_db = [
    Product(id=1, name="Apple", price=10),
    Product(id=2, name="Banana", price=20),
    Product(id=3, name="Cherry", price=30),
]


class Sort(str, Enum):
    price_asc = "price_asc"
    price_desc = "price_desc"


def get_products(
    id: int | None = None,
    sort: Sort | None = None,
) -> list[Product]:
    """
    Return a list of products based on the provided parameters.
    """
    products = product_db
    if id is not None:
        products = [product for product in products if product.id == id]
    if sort == Sort.price_asc:
        products = sorted(products, key=lambda x: x.price)
    elif sort == Sort.price_desc:
        products = sorted(products, key=lambda x: x.price, reverse=True)

    return products

The Product class has three attributes: id, name, and price. The get_products function returns a list of Product objects based on the provided filter and sort parameters (id and sort).

Now look at the test suite below:

from product import Product, Sort, get_products


class Test:
    def test_get_products_with_no_parameters(self):
        result = get_products()
        assert result == [
            Product(id=1, name="Apple", price=10),
            Product(id=2, name="Banana", price=20),
            Product(id=3, name="Cherry", price=30),
        ]

    def test_get_products_with_id(self):
        result = get_products(id=1)
        assert result == [Product(id=1, name="Apple", price=10)]

    def test_get_products_with_sort_price_asc(self):
        result = get_products(sort=Sort.price_asc)
        assert result == [
            Product(id=1, name="Apple", price=10),
            Product(id=2, name="Banana", price=20),
            Product(id=3, name="Cherry", price=30),
        ]

    def test_get_products_with_sort_price_desc(self):
        result = get_products(sort=Sort.price_desc)
        assert result == [
            Product(id=3, name="Cherry", price=30),
            Product(id=2, name="Banana", price=20),
            Product(id=1, name="Apple", price=10),
        ]

The tests will always pass for now. However, they will fail if, for example, an additional attribute is added to the Product class:

# product.py
@dataclass
class Product:
    id: int
    name: str
    price: int
    quantity: int


# Simulate product database
product_db = [
    Product(id=1, name="Apple", price=10, quantity=30),
    Product(id=2, name="Banana", price=20, quantity=20),
    Product(id=3, name="Cherry", price=30, quantity=10),
]

...

Here I added the quantity attribute to the Product class and updated the product database following the change.

If we run the tests above with pytest, the tests will break as follows:

...
============================================================ short test summary info ============================================================
FAILED test_product.py::test_get_products_with_id - TypeError: Product.__init__() missing 1 required positional argument: 'quantity'
FAILED test_product.py::test_get_products_with_sort_price_asc - TypeError: Product.__init__() missing 1 required positional argument: 'quantity'
FAILED test_product.py::test_get_products_with_sort_price_desc - TypeError: Product.__init__() missing 1 required positional argument: 'quantity'
========================================================== 4 failed, 1 passed in 0.02s ==========================================================

This is because the Product objects created in the test don’t include the quantity value.

In order to prevent such brittleness, we should aim to focus on testing the behavior rather than the implementation details.

Here’s a modified version to test the behavior, not the implementations:

from product import Product, Sort, get_products


class Test:
    def test_get_products_with_no_parameters(self):
        result = get_products()
        assert result == [
            Product(id=1, name="Apple", price=10, quantity=30),
            Product(id=2, name="Banana", price=20, quantity=20),
            Product(id=3, name="Cherry", price=30, quantity=10),
        ]

    def test_get_products_with_id(self):
        result = get_products(id=1)
        assert result[0].id == 1

    def test_get_products_with_sort_price_asc(self):
        result = get_products(sort=Sort.price_asc)
        assert result == sorted(result, key=lambda x: x.price)

    def test_get_products_with_sort_price_desc(self):
        result = get_products(sort=Sort.price_desc)
        assert result == sorted(result, key=lambda x: x.price, reverse=True)

Now the test becomes more resilient to change by focusing on the functionality of each parameter rather than the implementations.

It’s important to mention that test_get_products_with_no_parameters still required modification after changes were made to the system under test (SUT). This is because it’s intended to verify the outcome itself, as developers often need to test not only the behaviors but also the outcomes. In other words, you can segregate the responsibility of verifying the outcome into one test case, separate from the others that focus on behaviors.

Relying on extensive and complicated boilerplate

Extensive and complicated boilerplate can also make tests brittle.

First, let me illustrate how such tests can break due to unrelated changes. Look at the following example:

# customer.py
class Customer:
    def __init__(self, name: str, phone: str):
        self.name = name
        self.phone = phone

# order.py
from customer import Customer


class OrderItem:
    def __init__(self, id: int, name: str, price: float, quantity: int):
        self.id = id
        self.name = name
        self.price = price
        self.quantity = quantity


class Order:
    def __init__(
        self,
        order_id: int,
        customer: Customer,
        order_items: list[OrderItem],
    ):
        self.order_id = order_id
        self.customer = customer
        self.order_items = order_items


class OrderService:
    def __init__(self):
        self.orders = []

    def add_order(self, order: Order):
        self.orders.append(order)

    def get_order_by_id(self, order_id: int) -> Order | None:
        for order in self.orders:
            if order.order_id == order_id:
                return order
        return None

    def get_total_price(self) -> float:
        total_price = 0
        for order in self.orders:
            for order_item in order.order_items:
                total_price += order_item.price * order_item.quantity
        return total_price

Here we have four classes: OrderItem, Order, OrderService, and Customer. They represent entities and services related to managing orders in the system.

An example of a test suite with extensive and complicated boilerplate would be as follows:

# test_order_service.py
import pytest
from customer import Customer
from order import Order, OrderItem, OrderService


class Test:
    @pytest.fixture(autouse=True, scope="function")
    def setup(self):
        order_items = [
            OrderItem(1, "Laptop", 1200, 1),
            OrderItem(2, "Mouse", 10, 1),
            OrderItem(3, "Monitor", 300, 1),
        ]
        customer = Customer("Mickey", "01000000000")
        order = Order(1, customer, order_items)
        order_service = OrderService()
        order_service.add_order(order)
        self.order_service = order_service

    def test_add_order(self):
        order_service = self.order_service

        customer = Customer("Bob", "555-5678")
        order = Order(2, customer, [])
        order_service.add_order(order)
        assert order in order_service.orders

    def test_get_order_by_id(self):
        order_service = self.order_service

        order = order_service.get_order_by_id(1)
        assert order.order_id == 1

    def test_get_total_price(self):
        order_service = self.order_service

        assert order_service.get_total_price() == 1510

At first glance, this test suite seems well-structured and organized. The use of pytest.fixture to set up the order_service ensures that each test function operates on a consistent starting state, promoting test independence and reliability. The tests themselves cover a range of scenarios, including adding orders, retrieving orders by ID, and calculating total prices.

But the extensive boilerplate in the setup method makes the suite vulnerable to brittleness. Suppose we add a new parameter email to the constructor method of the Customer class:

# customer.py
class Customer:
    def __init__(self, name: str, phone: str, email: str):
        self.name = name
        self.phone = phone
        self.email = email

This addition of the email parameter is unrelated to the existing functionality of the OrderService class and the test cases. It doesn’t affect the logic of the methods being tested (add_order, get_order_by_id, get_total_price), but it does alter the structure of the Customer class. As a result, the test suite will break because the Customer objects created in the test suite (setup method) do not include the email attribute:

================================ short test summary info ================================
ERROR test_order_service.py::Test::test_add_order - TypeError: Customer.__init__() missing 1 required positional argument: 'email'
ERROR test_order_service.py::Test::test_get_order_by_id - TypeError: Customer.__init__() missing 1 required positional argument: 'email'
ERROR test_order_service.py::Test::test_get_total_price - TypeError: Customer.__init__() missing 1 required positional argument: 'email'
=================================== 3 errors in 0.18s ===================================

One approach to resolve this problem is to use mocking:

# test_order_service.py
...

class Test:
    @pytest.fixture(autouse=True, scope="function")
    def setup(self):
        order_items = [
            OrderItem(1, "Laptop", 1200, 1),
            OrderItem(2, "Mouse", 10, 1),
            OrderItem(3, "Monitor", 300, 1),
        ]
        customer = object()  # Mock Customer
        order = Order(1, customer, order_items)
        order_service = OrderService()
        order_service.add_order(order)
        self.order_service = order_service

    def test_add_order(self):
        order_service = self.order_service

        customer = object()  # Mock Customer
        order = Order(2, customer, [])
        order_service.add_order(order)
        assert order in order_service.orders

    ...

Now the tests become more focused on the functionality of the OrderService class and more resilient to change, as they don’t depend on the implementation details of the Customer class.

Do note that the misuse of mocks can also degrade the test quality. For instance, the overuse of mocks can make tests overly reliant on the mocked behavior, rather than accurately reflecting the behavior of the real system. To avoid these pitfalls, it’s important to use mocks appropriately with a clear understanding of their purpose and limitations.
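If a bare object() feels too loose, the standard library's unittest.mock offers a middle ground. The sketch below, which uses trimmed-down copies of the classes above, creates the stand-in with mock.Mock(spec=Customer): the mock never calls Customer.__init__, yet it still passes isinstance() checks.

```python
from unittest import mock


class Customer:
    def __init__(self, name: str, phone: str, email: str):
        self.name = name
        self.phone = phone
        self.email = email


class Order:
    def __init__(self, order_id: int, customer: Customer, order_items: list):
        self.order_id = order_id
        self.customer = customer
        self.order_items = order_items


class OrderService:
    def __init__(self):
        self.orders = []

    def add_order(self, order: Order):
        self.orders.append(order)


def test_add_order():
    # The mock never calls Customer.__init__, so new constructor
    # parameters do not break this test
    customer = mock.Mock(spec=Customer)
    assert isinstance(customer, Customer)

    order_service = OrderService()
    order = Order(2, customer, [])
    order_service.add_order(order)
    assert order in order_service.orders
```

Keep in mind that the spec is derived from the class, so attributes assigned only inside __init__ (such as email) would have to be set on the mock explicitly if a test needed them.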

Let’s change the context and look at another scenario where we may have to update the test. This time, the Order class is changed to get a new parameter status:

...

class Order:
    def __init__(
        self,
        order_id: int,
        customer: Customer,
        order_items: list[OrderItem],
        status: str,
    ):
        self.order_id = order_id
        self.customer = customer
        self.order_items = order_items
        self.status = status

...

Adding status is seemingly unrelated to the existing functionality of the OrderService class and the test cases, but the tests will break due to this change. As for whether this change should count as “unrelated” or “breaking”, I don’t have a clear answer. If it’s breaking, we should go back and change the tests. If it’s unrelated, how would you address that? One can try resolving the issue by providing a default value for the status argument in the Order class, but that modification does not actually mitigate the brittleness inherent in the test. (I am not saying it’s a bad idea. It’s rather reasonable, considering that most orders would start with a pending status.) Maybe the change is both unrelated and breaking. Either way, I want to point out that it’s quite difficult to avoid brittle tests entirely in the real world.

Here’s an excerpt from Software Engineering at Google:

The ideal test is unchanging: after it’s written, it never needs to change unless the requirements of the system under test change.

After all, striving for unchanging tests is the key idea to prevent brittle tests as much as possible.

Conclusion

The act of writing tests is desirable, but just writing them isn’t sufficient. Poorly written tests can become hard to maintain, resulting in increased costs and, in the worst-case scenario, leading to the neglect of test maintenance itself. And the larger the organization, the more severe these issues can become. Therefore, it’s important to establish best practices and guidelines for writing resilient tests to prevent flaky tests and brittle tests in the future. I hope this post helps you understand the causes of flaky and brittle tests, enabling you to improve the quality of your tests.

Provision an Amazon EKS cluster with AWS CDK

2023-11-12T00:00:00+00:00

Amazon EKS (Elastic Kubernetes Service) is a fully managed Kubernetes service that simplifies building, securing, operating, and maintaining Kubernetes clusters on AWS.

This post serves as a step-by-step tutorial on provisioning an Amazon EKS cluster within a custom Amazon VPC, utilizing AWS CDK, specifically using L2 constructs.

Previously, I posted a tutorial on creating and configuring an Amazon VPC by using AWS CDK. All the examples in that post are based on L1 constructs to illustrate how they represent AWS CloudFormation.

In practice, however, L2 constructs are generally preferred. They incorporate reasonable defaults and boilerplate based on best practices, alleviating the need for an in-depth understanding of every detail of the underlying AWS resources. They also provide convenience methods to help you work with the resource. For example, you can use bucket.grant_read(user) to grant only the minimal permissions required for user to read from the bucket:

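from aws_cdk import App, Stack
from aws_cdk import aws_iam as iam
from aws_cdk import aws_s3 as s3

stack = Stack(App(), "ExampleStack")  # constructs need a scope and an ID
bucket = s3.Bucket(stack, "Bucket")
user = iam.User(stack, "User")
bucket.grant_read(user)  # grants read-only access to this bucket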

You would otherwise have to manually figure out and write the policy using L1 constructs or CloudFormation, which can be inefficient and daunting if there’s no need to fine-tune the configurations for specific exceptional cases (which I consider a scenario to avoid if possible).

Although the programming language for this tutorial is Python, you can use whichever of the supported programming languages you are most familiar with.

You may incur AWS charges for the resources created by this tutorial. I advise you to consult the AWS Pricing Calculator to estimate the cost.

Prerequisites

You need to have an AWS account and have configured the AWS CLI to interact with AWS.

Here’s a list of requirements and links to install to use the AWS CDK:

Node.js 14.15.0 or later (I recommend using a version manager such as nvm for installation)
Python 3.7 or later including pip (I recommend using a version manager such as pyenv for installation)

After satisfying all the requirements, you need to install some additional packages:

AWS CDK Toolkit (npm install -g aws-cdk)
AWS CDK Library (python -m pip install aws-cdk-lib)
AWS Lambda Layer with kubectl v1.27 (python -m pip install aws-cdk.lambda-layer-kubectl-v27): as we are about to create a Kubernetes cluster of version 1.27, we need to explicitly specify the kubectl version to use. Otherwise, a default layer with kubectl 1.20 will be used.
kubectl

FYI, here are the versions of the requirements that I used to write and test the content in this post:

Python 3.11.4
AWS CDK Toolkit (cdk command) 2.95.0
kubectl 1.27

Although I use the AWS CLI to verify the outcomes in this tutorial, you can also use the AWS Management Console instead.

Bootstrap

You need to provision the resources that the AWS CDK requires to deploy AWS CDK apps into an AWS environment (a combination of an AWS account and Region). This process, called bootstrapping, deploys the CDK toolkit stack as an AWS CloudFormation stack named CDKToolkit.

The simplest method to bootstrap is to use the cdk command:

cdk bootstrap aws://{ACCOUNT-NUMBER}/{REGION}

For example, the following command bootstraps into ap-northeast-3 region with an AWS account number that is 123456789012:

cdk bootstrap aws://123456789012/ap-northeast-3

You can find your account number with aws sts get-caller-identity and your default region with aws configure get region.

To confirm that a new stack named CDKToolkit is created after bootstrapping, run the following command:

aws cloudformation list-stack-resources --stack-name CDKToolkit

Create the AWS CDK app

Below are the steps to create a directory named my-project for our AWS CDK app and generate the code written in Python programming language:

mkdir my-project
cd my-project
cdk init app --language python
(If you have not installed aws-cdk-lib in your environment,) python -m pip install -r requirements.txt

The following tree shows the files and directories created by the cdk init command:

.
├── README.md
├── app.py
├── cdk.json
├── my_project
│   ├── __init__.py
│   └── my_project_stack.py
├── requirements-dev.txt
├── requirements.txt
├── source.bat
└── tests
    ├── __init__.py
    └── unit
        ├── __init__.py
        └── test_my_project_stack.py

You can take a moment to examine each file for an overview of their organization. Here’s a concise explanation of some key files and directories in the structure:

app.py contains the App construct, which represents an entire CDK app. Normally, this file acts as the entrypoint for the cdk command.
The my_project directory is where we are going to write our stacks.
- my_project_stack.py is an automatically generated example stack.
The tests directory is where we are going to write the tests for our app.

A stack is a unit of deployment in the AWS CDK and we are about to create two stacks:

VpcStack: a stack for a VPC on which to provision an Amazon EKS cluster
EksStack: a stack for an Amazon EKS cluster

Since we are going to write our own stacks, you may just remove my_project_stack.py.

Create a VPC

Define the stack

To create a VPC, place the following code in my_project/vpc_stack.py:

# my_project/vpc_stack.py
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class VpcStack(Stack):
    """
    This stack deploys a VPC with six subnets spread across two availability zones.
    """

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_ec2/Vpc.html
        self.vpc = ec2.Vpc(
            self,
            id="Vpc",
            ip_addresses=ec2.IpAddresses.cidr("10.0.0.0/16"),
            max_azs=2,
            # We follow the three-tier architecture
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="Web",
                    subnet_type=ec2.SubnetType.PUBLIC,
                    cidr_mask=24,
                ),
                ec2.SubnetConfiguration(
                    name="Application",
                    subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS,
                    cidr_mask=24,
                ),
                ec2.SubnetConfiguration(
                    name="Database",
                    subnet_type=ec2.SubnetType.PRIVATE_ISOLATED,
                    cidr_mask=24,
                ),
            ],
        )

Here are a few notes on the configuration of VpcStack in the code above:

The VPC meets the subnet requirements of Amazon EKS by employing at least two subnets in different Availability Zones.
The subnet configuration follows the three-tier architecture, which is the most popular implementation of a multi-tier architecture.
Each ec2.SubnetConfiguration creates a subnet in each AZ, so this stack will create 6 subnets (3 subnet groups * 2 AZs), spread across the two availability zones automatically.
By default, a NAT gateway is created in every public subnet for maximum availability. So in our configuration, there will be two NAT gateways, one in Web1 and one in Web2.
I didn’t specify the VPC name so that a generated resource name is used, but you can specify it with the vpc_name parameter. It will be set to VpcStack/Vpc in our configuration.
I intentionally stored a reference to the Vpc construct as an attribute of VpcStack to pass it to the constructor of the EKS stack, which I will create later in this post.

Below is a table of subnets from the above configuration:

Subnet Name	Type	IP Block	AZ
Web1	`PUBLIC`	10.0.0.0/24	#1
Web2	`PUBLIC`	10.0.1.0/24	#2
Application1	`PRIVATE`	10.0.2.0/24	#1
Application2	`PRIVATE`	10.0.3.0/24	#2
Database1	`ISOLATED`	10.0.4.0/24	#1
Database2	`ISOLATED`	10.0.5.0/24	#2
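The arithmetic behind the table can be reproduced with the standard library's ipaddress module. This is only a sketch that mirrors the table by handing out consecutive /24 blocks in declaration order; it is not a description of how the CDK allocates addresses internally, and the variable names are made up for illustration.

```python
import ipaddress

vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
groups = ["Web", "Application", "Database"]  # order of subnet_configuration
az_count = 2  # max_azs

# Take consecutive /24 blocks from the VPC range, one per (group, AZ) pair.
blocks = vpc_cidr.subnets(new_prefix=24)
layout = {
    f"{group}{az}": str(next(blocks))
    for group in groups
    for az in range(1, az_count + 1)
}

for name, cidr in layout.items():
    print(name, cidr)
```

Three subnet groups across two AZs consume six of the 256 available /24 blocks, which leaves plenty of room for more groups later.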

Test the stack

Since infrastructure changes can have a significant impact on our system, testing infrastructure code is as essential as testing any other application code we write.

To ensure our infrastructure behaves as expected, the AWS CDK library provides an assertions module.

Place the following code in tests/unit/test_vpc_stack.py:

# tests/unit/test_vpc_stack.py
import aws_cdk as core
import aws_cdk.assertions as assertions
from my_project.vpc_stack import VpcStack


def test_vpc_created():
    app = core.App()
    vpc_stack = VpcStack(app, "VpcStack")
    template = assertions.Template.from_stack(vpc_stack)

    # For arguments, refer to cdk.out/VpcStack.template.json
    template.resource_count_is("AWS::EC2::Subnet", 6)
    template.resource_count_is("AWS::EC2::NatGateway", 2)
    template.resource_count_is("AWS::EC2::InternetGateway", 1)

    template.has_resource_properties("AWS::EC2::VPC", {"CidrBlock": "10.0.0.0/16"})
    template.has_resource_properties(
        "AWS::EC2::Subnet", {"CidrBlock": assertions.Match.exact("10.0.0.0/24")}
    )
    template.has_resource_properties(
        "AWS::EC2::Subnet", {"CidrBlock": assertions.Match.exact("10.0.1.0/24")}
    )
    template.has_resource_properties(
        "AWS::EC2::Subnet", {"CidrBlock": assertions.Match.exact("10.0.2.0/24")}
    )
    template.has_resource_properties(
        "AWS::EC2::Subnet", {"CidrBlock": assertions.Match.exact("10.0.3.0/24")}
    )
    template.has_resource_properties(
        "AWS::EC2::Subnet", {"CidrBlock": assertions.Match.exact("10.0.4.0/24")}
    )
    template.has_resource_properties(
        "AWS::EC2::Subnet", {"CidrBlock": assertions.Match.exact("10.0.5.0/24")}
    )

The command to run the test:

pytest tests/unit/test_vpc_stack.py

Synthesize the stack

Although you can deploy the stack right away, it is a good practice to synthesize before deploying.

To synthesize the stack, update the app.py file as follows:

import aws_cdk as cdk
from my_project.vpc_stack import VpcStack

ACCOUNT_ID = "123456789012"  # Your AWS account ID

app = cdk.App()
vpc_stack = VpcStack(
    app,
    "VpcStack",
    env=cdk.Environment(account=ACCOUNT_ID, region="ap-northeast-3"),
)

app.synth()

As you can see in the example above, we initialize app from the cdk.App construct and vpc_stack from VpcStack. We also explicitly specify the environment for vpc_stack with cdk.Environment, which is a recommended practice for production stacks.

The following command synthesizes an AWS CloudFormation template for the app:

cdk synth

The CloudFormation template file is created in the cdk.out directory, and this is what cdk will use to deploy the VpcStack.
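A synthesized template is plain JSON, so it is easy to inspect before deploying. The helper below is a small sketch using only the standard library; count_resource_types and load_template are hypothetical names, and the sample template is hand-written for demonstration rather than real synthesized output.

```python
import json
from collections import Counter
from pathlib import Path


def count_resource_types(template: dict) -> Counter:
    """Count resources in a CloudFormation template by their Type."""
    resources = template.get("Resources", {})
    return Counter(resource["Type"] for resource in resources.values())


def load_template(path: str) -> dict:
    """Read a synthesized template, e.g. cdk.out/VpcStack.template.json."""
    return json.loads(Path(path).read_text())


# A tiny hand-written template used only to demonstrate the helper.
sample = {
    "Resources": {
        "VpcA": {"Type": "AWS::EC2::VPC"},
        "SubnetA": {"Type": "AWS::EC2::Subnet"},
        "SubnetB": {"Type": "AWS::EC2::Subnet"},
    }
}
counts = count_resource_types(sample)
print(counts["AWS::EC2::Subnet"])  # 2
```

Running count_resource_types(load_template("cdk.out/VpcStack.template.json")) after cdk synth gives a quick overview of what the stack would create.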

Deploy the stack

Now that everything is ready, issue cdk deploy to create a VPC in our AWS environment:

cdk deploy

The expected modifications will appear in the console, and you will be prompted to deploy the changes:

IAM Statement Changes
...
IAM Policy Changes
...

Do you wish to deploy these changes (y/n)?

Note that deployment may take a few seconds to a few minutes to complete.

After deploying the stack, you can verify a new VPC is created as expected with the following command:

aws ec2 describe-vpcs --filters "Name=tag:Name,Values=VpcStack/Vpc"

Again, you can also use the AWS Management Console to verify it.

Create an EKS cluster

Define the EKS stack

To create an EKS stack, place the following code in my_project/eks_stack.py:

# my_project/eks_stack.py
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from aws_cdk.lambda_layer_kubectl_v27 import KubectlV27Layer
from constructs import Construct


class EksStack(Stack):
    """
    This stack deploys an EKS cluster to a given VPC.
    """

    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.Vpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = eks.Cluster(
            self,
            id="Cluster",
            version=eks.KubernetesVersion.V1_27,
            default_capacity=0,
            kubectl_layer=KubectlV27Layer(self, "kubectl"),
            vpc=vpc,
            cluster_name="my-cluster",
        )

Here are a few notes on the configuration of EksStack in the code above:

Setting default_capacity=0 will prevent the initialization of any worker instances, which we plan to add later.
The Kubernetes version is 1.27.
We employ KubectlV27Layer(self, "kubectl") to use kubectl version 1.27, aligning with the Kubernetes version. If not defined, a default layer containing Kubectl 1.20 and Helm 3.8 will be used.

Test the stack

Place the following code in tests/unit/test_eks_stack.py:

# tests/unit/test_eks_stack.py
import aws_cdk as core
import aws_cdk.assertions as assertions
from my_project.eks_stack import EksStack
from my_project.vpc_stack import VpcStack


def test_eks_created():
    app = core.App()

    vpc_stack = VpcStack(app, "VpcStack")
    eks_stack = EksStack(app, "EksStack", vpc=vpc_stack.vpc)
    template = assertions.Template.from_stack(eks_stack)

    template.has_resource_properties(
        "Custom::AWSCDK-EKS-Cluster", {"Config": {"name": "my-cluster", "version": "1.27"}}
    )

This test verifies that the CloudFormation template synthesized for the stack contains a cluster resource whose name is my-cluster and whose Kubernetes version is 1.27.

To run the test:

pytest tests/unit/test_eks_stack.py

Synthesize the stack

Before synthesizing, update app.py to add a new stack to our app as follows:

# app.py
import aws_cdk as cdk
from my_project.eks_stack import EksStack
from my_project.vpc_stack import VpcStack

ACCOUNT_ID = "123456789012"  # Your AWS account ID

app = cdk.App()
vpc_stack = VpcStack(
    app,
    "VpcStack",
    env=cdk.Environment(account=ACCOUNT_ID, region="ap-northeast-3"),
)
eks_stack = EksStack(
    app,
    "EksStack",
    vpc=vpc_stack.vpc,
    env=cdk.Environment(account=ACCOUNT_ID, region="ap-northeast-3"),
)
eks_stack.add_dependency(target=vpc_stack, reason="We use a custom VPC for the cluster.")

app.synth()

Here’s the command to synthesize the EksStack:

cdk synth EksStack
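The add_dependency call in app.py records that the VPC stack has to exist before the EKS stack. As a rough illustration of the resulting order, here is the same relationship expressed with the standard library's graphlib (Python 3.9 or later). This is plain Python for intuition only, not how the CDK resolves dependencies internally.

```python
from graphlib import TopologicalSorter

# Map each stack to the stacks it depends on, as declared with add_dependency.
dependencies = {
    "VpcStack": set(),
    "EksStack": {"VpcStack"},
}

deploy_order = list(TopologicalSorter(dependencies).static_order())
print(deploy_order)  # ['VpcStack', 'EksStack']
```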

Deploy the stack

Now that we have two stacks in our app, we can deploy only EksStack with the following command:

cdk deploy EksStack

Deploying the eks.Cluster construct may take some time, given the numerous resources it creates in its stack. You can verify these resources by inspecting cdk.out/EksStack.template.json, which is the CloudFormation template synthesized by running cdk synth.

After deploying the stack, you can confirm that a new EKS cluster has been created as expected by using the following command:

aws --no-cli-pager eks describe-cluster --name my-cluster

You can also list all stacks in the app:

cdk ls

Output:

VpcStack
EksStack

Update a kubeconfig file

In order to use kubectl to interact with our cluster, you need to create a masters role and associate it with the system:masters RBAC group, which has super-user access to our cluster. By assuming this role, you can perform actions on your cluster through kubectl.

Update eks_stack.py to create a masters role and add it to the system:masters RBAC group:

# my_project/eks_stack.py
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from aws_cdk import aws_iam as iam
from aws_cdk.lambda_layer_kubectl_v27 import KubectlV27Layer
from constructs import Construct


class EksStack(Stack):
    """
    This stack deploys an EKS cluster to a given VPC.
    """

    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.Vpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = eks.Cluster(
            self,
            id="Cluster",
            version=eks.KubernetesVersion.V1_27,
            default_capacity=0,
            kubectl_layer=KubectlV27Layer(self, "kubectl"),
            vpc=vpc,
            cluster_name="my-cluster",
        )

        # Create a role to interact with the cluster through `kubectl`
        masters_role = iam.Role(
            self,
            "MastersRole",
            role_name="EksMastersRole",
            assumed_by=iam.AnyPrincipal(),  # anyone can assume this role
        )
        cluster.aws_auth.add_masters_role(role=masters_role)
        # To create or update a kubeconfig file, run the following command:
        # aws eks update-kubeconfig --name my-cluster --region ap-northeast-3 --role-arn arn:aws:iam::123456789012:role/EksMastersRole

Update the EksStack by synthesizing and deploying the stack with the following commands:

cdk synth EksStack
cdk deploy EksStack

This creates an IAM role named EksMastersRole which you can assume for cluster authentication.

The next step is to update the kubeconfig file. The command below will automatically create or update the default kubeconfig file ($HOME/.kube/config) by assuming EksMastersRole:

aws eks update-kubeconfig --name my-cluster --region ap-northeast-3 --role-arn arn:aws:iam::123456789012:role/EksMastersRole

If you encounter a permission error like AccessDeniedException in the output, make sure the profile you use with the aws command has the eks:DescribeCluster permission.

The configuration in the example allows anyone to assume the masters role. You can restrict this by changing assumed_by=iam.AnyPrincipal() to assumed_by=iam.ArnPrincipal(some_arn).
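To see what that change means on the IAM side, the sketch below builds the kind of trust policy document a role carries. It is hand-written for illustration and assumes the standard IAM policy grammar; the exact document the CDK synthesizes may differ in detail. trust_policy is a hypothetical helper name, and the user ARN is only an example.

```python
import json


def trust_policy(principal_arn: str = "*") -> dict:
    """Build an IAM trust policy document that lets a principal assume a role.

    "*" corresponds to any AWS principal; pass a specific ARN to narrow it.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Example user ARN, used only for illustration.
restricted = trust_policy("arn:aws:iam::123456789012:user/mickey")
print(json.dumps(restricted, indent=2))
```

The only difference between the open and the restricted variant is the Principal element, which is why narrowing it is a cheap and worthwhile hardening step.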

Now you can use kubectl to communicate with my-cluster:

kubectl get svc

Output:

NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   172.20.0.1   <none>        443/TCP   28m

Add a managed node group

While we can add and use self-managed nodes in the cluster, we are going to leverage Amazon EKS managed node groups. This option offers powerful management features including auto-scaling through EC2 Auto Scaling Groups, node version upgrade, and graceful node termination.

To add a managed node group, update eks_stack.py as follows:

# my_project/eks_stack.py
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from aws_cdk import aws_iam as iam
from aws_cdk.lambda_layer_kubectl_v27 import KubectlV27Layer
from constructs import Construct


class EksStack(Stack):
    """
    This stack deploys an EKS cluster to a given VPC.
    """

    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.Vpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = eks.Cluster(
            self,
            id="Cluster",
            version=eks.KubernetesVersion.V1_27,
            default_capacity=0,
            kubectl_layer=KubectlV27Layer(self, "kubectl"),
            vpc=vpc,
            cluster_name="my-cluster",
        )

        # Create a role to interact with the cluster through `kubectl`
        masters_role = iam.Role(
            self,
            "MastersRole",
            role_name="EksMastersRole",
            assumed_by=iam.AnyPrincipal(),
        )
        cluster.aws_auth.add_masters_role(role=masters_role)
        # To create or update a kubeconfig file, run the following command:
        # aws eks update-kubeconfig --name my-cluster --region ap-northeast-3 --role-arn arn:aws:iam::123456789012:role/EksMastersRole

        cluster.add_nodegroup_capacity(
            id="NodeGroup1",
            min_size=2,  # Since we employ two availability zones
            desired_size=2,
            max_size=4,
            instance_types=[
                ec2.InstanceType.of(
                    instance_class=ec2.InstanceClass.T3,
                    instance_size=ec2.InstanceSize.MEDIUM,
                )
            ],
            disk_size=20,  # default
        )

While the code above deploys two worker instances of the t3.medium type, you have the flexibility to configure your node group according to your preferences.

By default, private subnets are employed for the node group, primarily for security reasons. This setting is one of the sane defaults that Cluster construct offers.

Now update EksStack by synthesizing and deploying the stack with the commands below:

cdk synth EksStack
cdk deploy EksStack

After updating, you can simply confirm the new nodes in the cluster with kubectl:

kubectl get node

Output:

NAME                                            STATUS   ROLES    AGE    VERSION
ip-10-0-2-249.ap-northeast-3.compute.internal   Ready    <none>   118s   v1.27.7-eks-4f4795d
ip-10-0-3-5.ap-northeast-3.compute.internal     Ready    <none>   2m5s   v1.27.7-eks-4f4795d

You can also verify that the two nodes are evenly distributed across two private subnets.

Labels and taints

Labels and node affinity are used to attract desired pods to specific nodes, while taints and tolerations are used to repel unwanted pods. This approach ensures that tenant-specific workloads are exclusively executed on nodes allocated for the respective tenants.

Configuring labels and taints is also possible through the code.

Below is an example of adding a node group with labels (role=backend) and taints (role=backend:NoSchedule):

        ...

        cluster.add_nodegroup_capacity(
            id="NodeGroup2",
            min_size=2,  # Since we employ two availability zones
            desired_size=2,
            max_size=4,
            instance_types=[
                ec2.InstanceType.of(
                    instance_class=ec2.InstanceClass.T3,
                    instance_size=ec2.InstanceSize.MEDIUM,
                )
            ],
            disk_size=20,  # default
            labels={"role": "backend"},
            taints=[
                eks.TaintSpec(
                    effect=eks.TaintEffect.NO_SCHEDULE,
                    key="role",
                    value="backend",
                )
            ],
        )
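To make the effect of the taint concrete, here is a deliberately simplified Python model of how a toleration is matched against a taint. It is a sketch for intuition, not the Kubernetes scheduler: only the NoSchedule effect is modeled, and the function names tolerates and can_schedule are invented for this example.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty or missing effect in the toleration matches any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An Exists toleration without a key matches every taint key.
        return "key" not in toleration or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def can_schedule(pod_tolerations: list[dict], node_taints: list[dict]) -> bool:
    """A pod fits only if every NoSchedule taint is tolerated."""
    return all(
        any(tolerates(toleration, taint) for toleration in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


# The taint applied to NodeGroup2 above.
backend_taint = {"key": "role", "value": "backend", "effect": "NoSchedule"}
backend_pod = [
    {"key": "role", "operator": "Equal", "value": "backend", "effect": "NoSchedule"}
]

print(can_schedule(backend_pod, [backend_taint]))  # True
print(can_schedule([], [backend_taint]))  # False
```

Tolerating the taint only permits scheduling on those nodes. To attract a pod to them, it would also need a node selector or node affinity on the role=backend label.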

View Kubernetes resources in the console

By default, you can’t view the Resources tab or the Nodes section on the Compute tab in the AWS Management Console; the console shows the following error message instead:

Your current IAM principal doesn't have access to Kubernetes objects on this cluster.
This may be due to the current user or role not having Kubernetes RBAC permissions to describe cluster resources or not having an entry in the cluster’s auth config map.

There are two options to make Kubernetes resources visible in the AWS Management Console for you or other users:

Grant permissions to IAM users: This involves creating an entry for each individual IAM user in the aws-auth ConfigMap. However, it can become cumbersome as the number of users requiring access grows.
Grant permissions to IAM roles: This option lets users assume a shared IAM role, which is easier to maintain when multiple users need access.

I will describe both options in this post. You can choose whichever meets your organization’s requirements.

Whatever you choose, it is recommended that you periodically audit the aws-auth ConfigMap to see who has been granted access.

Option a: Grant permissions to IAM users

Firstly, make sure the IAM user you want to allow to view the console has the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListFargateProfiles",
                "eks:DescribeNodegroup",
                "eks:ListNodegroups",
                "eks:ListUpdates",
                "eks:AccessKubernetesApi",
                "eks:ListAddons",
                "eks:DescribeCluster",
                "eks:DescribeAddonVersions",
                "eks:ListClusters",
                "eks:ListIdentityProviderConfigs",
                "iam:ListRoles"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "ssm:GetParameter",
            "Resource": "arn:aws:ssm:*:123456789012:parameter/*"
        }
    ]
}  

The next step is to create a Kubernetes ClusterRole and ClusterRoleBinding that has the necessary permissions to view the Kubernetes resources:

kubectl apply -f https://s3.us-west-2.amazonaws.com/amazon-eks/docs/eks-console-full-access.yaml

The last step is to add the following mappings to the aws-auth ConfigMap:

The EksMastersRole role and the eks-console-dashboard-full-access-group.
The IAM user and the eks-console-dashboard-restricted-access-group.

To do this, open the editor to edit configmap/aws-auth:

kubectl edit -n kube-system configmap/aws-auth

And then, add the following mappings to the existing ones:

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - eks-console-dashboard-full-access-group
      rolearn: arn:aws:iam::123456789012:role/EksMastersRole
      username: EksMastersRole
  mapUsers: |
    - groups:
      - eks-console-dashboard-restricted-access-group
      userarn: arn:aws:iam::123456789012:user/mickey
      username: mickey

Now you can view the Kubernetes resources in the console.

You can also achieve the editing part by updating the EksStack as follows:

# my_project/eks_stack.py
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from aws_cdk import aws_iam as iam
from aws_cdk.lambda_layer_kubectl_v27 import KubectlV27Layer
from constructs import Construct


class EksStack(Stack):
    """
    This stack deploys an EKS cluster to a given VPC.
    """

    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.Vpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = eks.Cluster(
            self,
            id="Cluster",
            version=eks.KubernetesVersion.V1_27,
            default_capacity=0,
            kubectl_layer=KubectlV27Layer(self, "kubectl"),
            vpc=vpc,
            cluster_name="my-cluster",
        )

        # Create a role to interact with the cluster through `kubectl`
        masters_role = iam.Role(
            self,
            "MastersRole",
            role_name="EksMastersRole",
            assumed_by=iam.AnyPrincipal(),
        )
        cluster.aws_auth.add_masters_role(role=masters_role)
        # To create or update a kubeconfig file, run the following command:
        # aws eks update-kubeconfig --name my-cluster --region ap-northeast-3 --role-arn arn:aws:iam::123456789012:role/EksMastersRole

        cluster.add_nodegroup_capacity(
            id="NodeGroup1",
            min_size=2,  # Since we employ two availability zones
            desired_size=2,
            max_size=4,
            instance_types=[
                ec2.InstanceType.of(
                    instance_class=ec2.InstanceClass.T3,
                    instance_size=ec2.InstanceSize.MEDIUM,
                )
            ],
            disk_size=20,  # default
        )

        # Before updating, you should create a Kubernetes
        # `ClusterRole` and `ClusterRoleBinding` that has the necessary
        # permissions to view the Kubernetes resources with the command:
        # kubectl apply -f https://s3.us-west-2.amazonaws.com/amazon-eks/docs/eks-console-full-access.yaml
        cluster.aws_auth.add_role_mapping(
            role=masters_role,
            groups=["eks-console-dashboard-full-access-group"],
        )
        mickey = iam.User.from_user_arn(
            self, "mickey", user_arn=f"arn:aws:iam::{self.account}:user/mickey"
        )
        cluster.aws_auth.add_user_mapping(
            user=mickey,
            groups=["eks-console-dashboard-full-access-group"],
        )

Option b: Grant permissions to IAM roles

To use the masters role to view the console, you can grant permissions to it by updating eks_stack.py as follows:

# my_project/eks_stack.py
from aws_cdk import Stack
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_eks as eks
from aws_cdk import aws_iam as iam
from aws_cdk.lambda_layer_kubectl_v27 import KubectlV27Layer
from constructs import Construct


class EksStack(Stack):
    """
    This stack deploys an EKS cluster to a given VPC.
    """

    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.Vpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        cluster = eks.Cluster(
            self,
            id="Cluster",
            version=eks.KubernetesVersion.V1_27,
            default_capacity=0,
            kubectl_layer=KubectlV27Layer(self, "kubectl"),
            vpc=vpc,
            cluster_name="my-cluster",
        )

        # Create a role to interact with the cluster through `kubectl`
        masters_role = iam.Role(
            self,
            "MastersRole",
            role_name="EksMastersRole",
            assumed_by=iam.AnyPrincipal(),
        )
        cluster.aws_auth.add_masters_role(role=masters_role)
        # To create or update a kubeconfig file, run the following command:
        # aws eks update-kubeconfig --name my-cluster --region ap-northeast-3 --role-arn arn:aws:iam::123456789012:role/EksMastersRole

        cluster.add_nodegroup_capacity(
            id="NodeGroup1",
            min_size=2,  # Since we employ two availability zones
            desired_size=2,
            max_size=4,
            instance_types=[
                ec2.InstanceType.of(
                    instance_class=ec2.InstanceClass.T3,
                    instance_size=ec2.InstanceSize.MEDIUM,
                )
            ],
            disk_size=20,  # default
        )

        masters_role.add_to_policy(
            iam.PolicyStatement(
                actions=["eks:AccessKubernetesApi", "eks:Describe*", "eks:List*"],
                resources=["*"],
            )
        )

Synthesize and deploy the EksStack with the commands below:

cdk synth EksStack
cdk deploy EksStack
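The policy statement in this stack relies on wildcards, so it can be useful to check which of the actions listed under Option a it actually covers. The sketch below approximates IAM's wildcard matching with the standard library's fnmatch; it ignores resources, conditions, and explicit denies, and is_granted is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Action patterns granted to the masters role in the stack above.
granted = ["eks:AccessKubernetesApi", "eks:Describe*", "eks:List*"]

# Actions listed in the IAM user policy under Option a.
required = [
    "eks:ListFargateProfiles",
    "eks:DescribeNodegroup",
    "eks:ListNodegroups",
    "eks:ListUpdates",
    "eks:AccessKubernetesApi",
    "eks:ListAddons",
    "eks:DescribeCluster",
    "eks:DescribeAddonVersions",
    "eks:ListClusters",
    "eks:ListIdentityProviderConfigs",
    "iam:ListRoles",
    "ssm:GetParameter",
]


def is_granted(action: str, patterns: list[str]) -> bool:
    """Check an action against wildcard patterns, ignoring case."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


missing = [action for action in required if not is_granted(action, granted)]
print(missing)  # ['iam:ListRoles', 'ssm:GetParameter']
```

Every eks: action from Option a is covered by the three patterns, while the two actions outside the eks: namespace are not part of this statement.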

To switch to MastersRole (the IAM role named EksMastersRole), navigate to your user name on the navigation bar in the upper right of the console, and choose Switch Role.

Enter Account, Role, and Display Name (optional), and choose Switch Role.

Now you are able to see the Nodes section on the Compute tab in the console.

If MastersRole is configured to be assumed by any principal, any IAM user can switch to the role. You can limit this permission by changing assumed_by=iam.AnyPrincipal() to assumed_by=iam.ArnPrincipal(some_arn).

Destroy

If you want to destroy EksStack for example, issue the following command:

cdk destroy EksStack

Or destroy all stacks in the app:

cdk destroy --all

Conclusion

As a former CloudFormation user, I found that provisioning a new VPC and EKS cluster on AWS used to demand careful consideration and a deep understanding of resource details. In practice, however, there are established best practices and defaults, and writing code by hand to follow them is a waste of time when they can be automated. This is where the AWS CDK can do much more with much less code than CloudFormation.

For a console user, opting for IaC tools like the AWS CDK over the web console surely involves certain trade-offs. Nevertheless, the ease of use and flexibility offered by the AWS CDK can undoubtedly streamline the process, making it a compelling choice for provisioning an EKS cluster. Additionally, it’s remarkably easy to clean up the resources left after deleting the cluster.

Memory Profiling CPython Applications with Memray

2023-10-22T00:00:00+00:00

Memory management in modern programming languages is well-abstracted away from programmers. We do not have direct control over how our programs allocate and free memory for all data structures and objects. And it’s true that we have bigger problems to solve than memory management.

However, we occasionally face situations where our applications slow down or even crash due to inefficient or excessive memory usage. In practice, we mostly identify these issues through monitoring system alerts or by watching the application’s metrics. This is precisely when we should rely on a memory profiler.

What is memory profiling?

Memory profiling is a process of monitoring and analyzing a program’s memory usage during its execution.

The goals of this process are as follows:

Identifying memory leaks: by identifying and addressing memory leaks, we can prevent out-of-memory (OOM) errors, which can be hard to detect.
Cost savings: with efficient memory usage, we can reduce resource costs.
Performance improvement: optimizing data structures and algorithms can help improve performance.

A memory profiler is a tool for precisely these purposes.

Memory management in CPython and caveats when memory profiling

It is worth briefly noting how memory management works in CPython (hereafter Python) before getting into memory profiling.

Python uses the pymalloc allocator as the default memory allocator:

For small objects (<= 512 bytes), it uses memory mappings called “arenas”.
For large objects (> 512 bytes), it delegates to the system allocator (malloc(), calloc(), realloc() and free()).

The pymalloc allocator is optimized for small objects. It creates arenas with a fixed size of 1 MiB on 64-bit platforms or 256 KiB on 32-bit platforms for objects whose sizes are smaller than or equal to 512 bytes. This way, it avoids sending a separate allocation request to the system allocator for every small object, and small objects are likely to be created far more frequently than larger ones.
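As a rough illustration of the 512-byte threshold, the sketch below guesses which allocator would serve an object. The helper name likely_allocator is hypothetical, and sys.getsizeof() only approximates the size of the underlying allocation request, so treat the result as a guide rather than a guarantee.

```python
import sys

# Threshold quoted above; it is an implementation detail of CPython's pymalloc.
SMALL_OBJECT_THRESHOLD = 512


def likely_allocator(obj) -> str:
    """Guess which allocator serves the object's own memory block.

    sys.getsizeof() is only an approximation of the allocation request,
    so this is a rough guide, not a guarantee.
    """
    size = sys.getsizeof(obj)
    return "pymalloc" if size <= SMALL_OBJECT_THRESHOLD else "system allocator"


for sample in (42, "hello", bytes(100), bytes(10_000)):
    print(type(sample).__name__, sys.getsizeof(sample), likely_allocator(sample))
```

Small integers and short strings fall well under the threshold, while the 10,000-byte bytes object is large enough to be handed to the system allocator.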

The diagram below illustrates an example of arenas:

To describe each data structure briefly:

Arenas are subdivided into pools by size class.
Pools are fragmented into fixed-size blocks.
Blocks are the smallest units where small objects are stored.

The important thing here is that while pools are merely marked as empty when none of their blocks are in use, making them available for allocation again, arenas can actually be freed back to the operating system when all of their pools are empty. As an additional note, from Python 3.9, one empty arena may still be kept so as to avoid thrashing in rare cases where a simple loop would create and destroy an arena on every iteration.

You can set the PYTHONMALLOCSTATS environment variable to print statistics of the pymalloc memory allocator every time a new pymalloc object arena is created, and on shutdown.
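One way to try this without touching your shell configuration is to launch a child interpreter with the variable set. The sketch below is a minimal example under that assumption. The statistics are written to stderr, and their exact format is an implementation detail that varies between Python versions.

```python
import os
import subprocess
import sys

# Copy the current environment and switch on pymalloc statistics for the child.
env = dict(os.environ, PYTHONMALLOCSTATS="1")

# Run a trivial program; the statistics are printed to stderr.
result = subprocess.run(
    [sys.executable, "-c", "pass"],
    env=env,
    capture_output=True,
    text=True,
)

stats_lines = result.stderr.splitlines()
# Show only the first few lines; the full report can be long.
for line in stats_lines[:10]:
    print(line)
```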

This gives us a few considerations when memory profiling:

Allocation requests that do not require the system allocator won’t appear in the result of memory profiling.
Allocation requests that require a new arena will appear in the result, but with the size of the arena.
Deleting objects does not always free memory if the arena is still in use, resulting in memory leaks in the result. (a false positive)

In short, it is quite difficult to precisely track all memory allocations and deallocations, especially for small objects. (Unless you disable the pymalloc allocator at runtime with the PYTHONMALLOC=malloc environment variable, which is generally not recommended as it does not represent the actual case.) Memory profiling in Python is most helpful when you need to understand the cause of an observably abnormal memory leak or of excessive memory usage that you have already noticed.

Memray

Memray is a memory profiler for Python, released on April 9, 2022, by Bloomberg.

The examples in this post use Python 3.11.4 and Memray 1.10.0.

Memray only works on Linux and macOS at the time of writing this post.

Installation

You can install it in your virtual environment with pip:

pip install memray

You should then be able to run the memray command in your CLI.

$ memray -V
1.10.0

Basic Usage

I am going to create an empty Python file named app.py to start with:

# app.py

You can track the memory usage of app.py and generate profile results for it with memray run:

$ memray run --output app.bin app.py
Writing profile results into app.bin
[memray] Successfully generated profile results.

You can now generate reports from the stored allocation records.
Some example commands to generate reports:

/.../python3.11 -m memray flamegraph app.bin

I intentionally set the name of the output to app.bin to simplify the example. Without the --output or -o argument, memray picks a default output file name derived from the script name and the process ID.

You can then convert the results into different types of human-readable reports with the following subcommands:

flamegraph
table
tree
summary
stats

I am going to use flamegraph in particular to visualize the results, as it offers an intuitive hierarchical view of function calls, with the width representing how much memory a function call and its children allocate.
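To make the idea of width concrete, here is a minimal sketch of how an inclusive size could be computed for a call tree. The Frame class, the function names, and the byte counts are all hypothetical and chosen only to show the arithmetic. This is not how Memray represents its data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical stand-in for one box in a flame graph."""

    name: str
    own_bytes: int  # memory allocated directly by this call
    children: list["Frame"] = field(default_factory=list)

    def width(self) -> int:
        # A box is as wide as its own allocations plus those of its children.
        return self.own_bytes + sum(child.width() for child in self.children)


# Made-up numbers, chosen only to show the arithmetic.
leaf_a = Frame("load_a", own_bytes=40_000)
leaf_b = Frame("load_b", own_bytes=80_000)
parent = Frame("combine", own_bytes=120_000, children=[leaf_a, leaf_b])

print(parent.width())  # 240000
```

A parent frame is therefore never narrower than any of its children, which is why the boxes shrink as you move away from the root.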

The following command generates an HTML flame graph out of app.bin:

$ memray flamegraph app.bin
Wrote memray-flamegraph-app.html

Here is what our flame graph looks like:

There is not much to see in our first example, but this is just to illustrate how it looks when starting from scratch.

Do also note that your results might look different from the screenshot here because the horizontal ordering in the graph has nothing to do with the passage of time in the application.

You can see the peak memory usage by looking at Stats in the header, which will show the following information:

...
Total number of allocations: 139
Total number of frames seen: 65
Peak memory usage: 66.9 KiB
Python allocator: pymalloc

We can confirm that the peak memory usage of our empty app.py is 66.9 KiB. Bearing in mind that memray itself allocates some amount of memory, it’s safe to assume that the Python interpreter uses a minimum of 66.9 KiB or so, based on our first results.

Now let’s make some observable memory allocations by importing an external package: httpx. You can install httpx with:

$ pip install httpx

And our example is:

# app.py
import httpx

The flame graph looks like the following:

By putting the mouse pointer on the third topmost stack frame which is import httpx, we can see that it requested memory allocations of 9.5 MiB. Under that frame, the child frames include other external libraries that httpx imports (depends on) such as httpcore and certifi. You can expand those frames to look inside by clicking them.

If you are not interested in the memory allocations related to the Python import system, checking Hide Import System Frames gives you a cleaner graph:

Now that we have seen how Python’s import system allocates memory for importing libraries, we will see how the user code appears in the graph. Here is the example:

# app.py
def foo() -> list[int]:
    array = [0 for _ in range(5000)]
    return array

def bar() -> list[int]:
    array = [1 for _ in range(10000)]
    return array

def baz() -> list[int]:
    array = foo() + bar()
    return array

baz()

The flame graph, with the root frame at the top, looks as follows:

baz() called foo() and bar(), and bar() (array = [1 for _ in range(10000)]) is wider than foo() (array = [0 for _ in range(5000)]) because it allocated more memory than foo().

In many cases, the widest box from the end of the stack is probably a good place to start considering optimization.

You can read more information and details about the flame graph reporter in the docs.

Live tracking mode

Memray also allows you to live track the memory usage of an application or an already running process.

You can directly start executing and live-tracking a Python program:

$ memray run --live app.py

Or live track an already running process:

$ memray attach <pid>

Both methods run a terminal user interface (TUI) displaying the memory usage in real time. This is particularly useful for tracking memory allocations of long-running applications such as web applications.

memray attach works by injecting executable code into a running process. As it might lead to process crashes or deadlocks in the attached process, the authors of memray advise using memray attach for debugging purposes only.

Patterns of memory inefficiency

Here are some common patterns of memory inefficiency, which could result in excessive memory usage or memory leaks.

List comprehensions with large iterables

Here’s the simplest example of memory inefficient code:

array = [i for i in range(9999999)]

Storing a lot of values in an array can be memory inefficient when dealing with large datasets or ranges like this example.

We can use memray to see the peak memory usage of this code. The following summary is from stats of its flame graph:

Total number of allocations: 980
Total number of frames seen: 69
Peak memory usage: 396.0 MiB
Python allocator: pymalloc

To improve this code, we can use a generator expression instead of a list comprehension:

array = (i for i in range(9999999))

Generator expressions “yield” values one at a time instead of storing all the values in memory at once. We can see that the peak memory usage when iterating over the generator object array decreases dramatically:

Total number of allocations: 161
Total number of frames seen: 64
Peak memory usage: 66.9 KiB
Python allocator: pymalloc
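The same comparison can be sketched without Memray, using tracemalloc from the standard library. This is only an approximation: tracemalloc measures Python-level allocations, so the absolute numbers will not match the Memray stats above. The helper peak_bytes is a hypothetical name, and N is a smaller, arbitrary size chosen to keep the run fast.

```python
import tracemalloc

N = 200_000  # arbitrary size, smaller than in the post to keep the run quick


def peak_bytes(func) -> int:
    """Run func() and return the peak memory traced by tracemalloc, in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def consume_list() -> int:
    # Builds the whole list in memory before summing it.
    return sum([i for i in range(N)])


def consume_generator() -> int:
    # Produces one value at a time; nothing is stored.
    return sum(i for i in range(N))


list_peak = peak_bytes(consume_list)
gen_peak = peak_bytes(consume_generator)
print(f"list comprehension peak:   {list_peak / 1024:.1f} KiB")
print(f"generator expression peak: {gen_peak / 1024:.1f} KiB")
```

Both functions return the same sum, but only the list version has to hold every element at once.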

Circular references

Circular references can lead to memory leaks because the Python interpreter relies primarily on reference counting to free objects.

Here’s an example:

class Node:
    def __init__(self):
        self.array = [i for i in range(100000)]  # large enough to cause memory allocation
        self.next = None

def create_circular_refrences():
    node1 = Node()
    node2 = Node()
    node1.next = node2
    node2.next = node1

Because node1 and node2 reference each other in a circular manner, their reference counts never drop to 0, so reference counting alone cannot free them. They stay in memory until CPython’s cyclic garbage collector runs.

The following code leads to memory leaks by calling create_circular_refrences 10 times:

for i in range(10):
    create_circular_refrences()

The top of the flame graph page generated from this code includes a chart that visualizes the process’s memory usage over time:

You can observe that the peak memory usage is approximately 77.3 MiB for 10 iterations. As the number of executions increases, the peak memory usage of our process also rises.
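Reference counting is not the only mechanism in CPython: the gc module provides a cycle collector that can reclaim unreachable cycles when it runs. The sketch below uses a stripped-down Node without the large list and shows a cycle surviving reference counting, then being freed by an explicit gc.collect(). The weak reference is only there to observe whether the object is still alive, and automatic collection is paused so that the result does not depend on timing.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.next = None


def make_cycle():
    """Create two nodes that reference each other and return a weak reference."""
    a, b = Node(), Node()
    a.next, b.next = b, a
    return weakref.ref(a)


was_enabled = gc.isenabled()
gc.disable()  # pause automatic collection so only reference counting is at work
try:
    ref = make_cycle()
    alive_before = ref() is not None  # the cycle keeps both nodes alive
    unreachable = gc.collect()        # run the cycle collector explicitly
    alive_after = ref() is not None   # the cycle has been reclaimed
finally:
    if was_enabled:
        gc.enable()

print(alive_before, alive_after)  # True False
```

In a real application the collector runs automatically, but until it does, unreachable cycles like the ones above keep their memory.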

Storing values in class attributes

Here’s an example to illustrate the potential danger of storing values in class attributes:

class Foo:
    data: list = []  # Shared attribute

    def add_val(self, val):
        self.data.append(val)

# Keep storing values in the class attribute
for i in range(1000000):
    foo = Foo()
    foo.add_val(i)
    del foo

In this example, the Foo class has a mutable class attribute data, which is shared among all instances. This attribute grows as add_val is called. Even if we delete the instances, Foo.data remains in memory since it is bound to the Foo class itself, potentially leading to a memory leak over time. Therefore, you should be cautious about using mutable class attributes unless you know what you’re doing.

You can try and see how the heap size of the example above grows over time by generating the flame graph with memray.
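A common way to avoid this is to create the list per instance in __init__, so that it is released together with the instance. The class names SharedFoo and PerInstanceFoo below are hypothetical and are used only to contrast the two layouts.

```python
class SharedFoo:
    data: list = []  # one list shared by every instance

    def add_val(self, val):
        self.data.append(val)


class PerInstanceFoo:
    def __init__(self):
        self.data = []  # a fresh list owned by this instance only

    def add_val(self, val):
        self.data.append(val)


for i in range(1000):
    shared = SharedFoo()
    shared.add_val(i)
    del shared  # the instance goes away, but SharedFoo.data keeps the value

    owned = PerInstanceFoo()
    owned.add_val(i)
    del owned  # the instance and its list are released together

print(len(SharedFoo.data))              # 1000
print(hasattr(PerInstanceFoo, "data"))  # False
```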

Pytest plugin

pytest-memray allows you to activate memray when using pytest.

Installation

You can install pytest-memray with:

pip install pytest-memray

In this post, I am using pytest-memray 1.5.0.

Basic usage

Here’s a test case that executes code from the previous example:

# test.py
def test_create_array():
    array = [i for i in range(9999999)]

You can run the test with the --memray option:

$ pytest --memray test.py

The output shows allocations at the high watermark, the point at which memory usage reached its maximum:

============================= test session starts =============================
platform darwin -- Python 3.11.4, pytest-7.4.2, pluggy-1.3.0
rootdir: /
plugins: memray-1.5.0, anyio-3.7.1
collected 1 item

test.py .                                                                [100%]


================================ MEMRAY REPORT ================================
Allocation results for test.py::test_create_array at the high watermark

         📦 Total memory allocated: 391.0MiB
         📏 Total allocations: 815
         📊 Histogram of allocation sizes: |█|
         🥇 Biggest allocating functions:
                - <listcomp>:/test.py:2 -> 391.0MiB


============================== 1 passed in 0.74s ==============================

What makes this plugin even more powerful is that it can enforce a memory limit on our code. Here’s an example that uses a pytest marker to limit the memory usage to 10 MB:

# test.py
import pytest

@pytest.mark.limit_memory("10 MB")
def test_create_array():
    array = [i for i in range(9999999)]

The output:

============================= test session starts =============================
platform darwin -- Python 3.11.4, pytest-7.4.2, pluggy-1.3.0
rootdir: /
plugins: memray-1.5.0, anyio-3.7.1
collected 1 item

test.py M                                                                [100%]

================================== FAILURES ===================================
______________________________ test_create_array ______________________________
Test was limited to 10.0MiB but allocated 396.0MiB
------------------------------ memray-max-memory ------------------------------
List of allocations:
    - 396.0MiB allocated here:
        <listcomp>:/test.py:6
        ...

================================ MEMRAY REPORT ================================
Allocation results for test.py::test_create_array at the high watermark

         📦 Total memory allocated: 396.0MiB
         📏 Total allocations: 825
         📊 Histogram of allocation sizes: |█|
         🥇 Biggest allocating functions:
                - <listcomp>:/test.py:6 -> 396.0MiB


=========================== short test summary info ===========================
MEMORY PROBLEMS test.py::test_create_array
============================== 1 failed in 0.77s ==============================

As you can see, the test failed because the code allocated more memory (396.0MiB) than it was allowed to use (10.0MiB).

Conclusion

Memory profiling is an important process for identifying memory leaks, inefficient memory usage, and other issues that can degrade the performance of our code.

And we, as software engineers, have access to tools like memray to help us solve these memory-related problems.

Furthermore, by using the pytest-memray plugin, we can proactively manage memory usage when developing memory-intensive applications.
