Scaling the System at AR - Part 4 - Message Queue at AR

The problem of scaling teams

As the product grows bigger and bigger, one team is no longer capable of developing and maintaining the whole system. The organization won’t scale if we operate in a top-down style, where one Product Owner or Tech Leader dictates the decisions of all teams. Sooner or later, the system will become a big monolithic application, where a change in one place can cause cascading effects on all the other teams.

The philosophy that we use for scaling the product is to split it into smaller ones and build multiple autonomous teams, each responsible for one (or more) business domain. Each individual team is free to experiment with new technologies and choose the tech stack that is most suitable for it, without impacting the other teams.

There are, of course, plenty of difficulties when splitting teams

  • How can we make those teams operate independently?
  • How can we make sure that an error in one team doesn’t cause cascading effects on the other teams?
  • How does one team access the data that another team manages?
  • How does one team update the data that belongs to another team?
  • How can one team encapsulate its underlying business logic from the other teams?
  • How do we maintain consistency across teams?
  • What will happen if we add one (or more) teams in the future?

I won’t talk much about process, operations or management skills here, simply because I’m not an expert on those. I will just show some technical techniques related to the Message Queue that we used to support the ultimate goal of splitting up and building small autonomous teams.

How about an API Server for each team?

One simple way that people usually think of is to expose an API Server/Gateway to the other teams in the company. An API Server can help (see the sketch after this list)

  • Limit the permissions of other teams/services on the current team’s resources, allowing the other teams to read/write only certain entities.
  • Prevent the other teams from causing unwanted effects on the internal data structure of the current team.
  • Abstract away the complex underlying business logic of the team.
  • Let each team choose its own technology stack, no matter whether it follows a Monolith or Microservice design, no matter whether it uses Nodejs or C#, as long as the API is a standard one (usually an HTTP API).
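As a rough illustration of those points, here is a minimal ASP.Net Core sketch of such an API Server. All names here are hypothetical, not from the actual AR codebase: the internal entity never leaves the service, and the other teams only ever see a stable, read-only DTO.

using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

// Internal model and storage, owned by the current team only
public class Contact
{
    public string Id { get; set; }
    public string Email { get; set; }
    public string InternalNotes { get; set; } // never exposed to other teams
}

public interface IContactRepository
{
    Task<Contact> FindAsync(string id);
}

// The only fields we decided to publish to the other teams
public class ContactDto
{
    public string Id { get; set; }
    public string Email { get; set; }
}

[ApiController]
[Route("api/contacts")]
public class ContactsController : ControllerBase
{
    private readonly IContactRepository _repository;

    public ContactsController(IContactRepository repository)
    {
        _repository = repository;
    }

    // Other teams can read a contact, but the internal schema stays hidden.
    // No POST/PUT/DELETE is exposed, so they cannot mutate our data directly.
    [HttpGet("{id}")]
    public async Task<ActionResult<ContactDto>> Get(string id)
    {
        var contact = await _repository.FindAsync(id);
        if (contact == null) return NotFound();
        return new ContactDto { Id = contact.Id, Email = contact.Email };
    }
}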
Read more

After nearly 10 years of riding a Wave, I have finally upgraded to a manual-clutch motorbike for the first time. I just got excited and wrote this blog post to show off the bike, nothing more. 🤣

Actually, I have liked manual-clutch bikes for a long time. Back then, I learned to drive a car first, with the full clutch-and-gearbox setup. Only afterwards did I learn to ride a semi-automatic bike, and it suddenly felt like something was missing 😂. At that time I had planned to buy an Exciter 135, but ended up spending the money on a Wave plus a laptop for moving south to attend university. Luckily I didn’t buy it back then, otherwise people would now be saying I steal dogs 😂

10 years have passed, and it’s time to upgrade to something more interesting than the Wave I’ve been riding. Before Tết, Trường had already been eyeing the MT-15, so right after receiving the Tết bonus, it was time to buy it.

The first time I went to see it at the shop, the design looked quite beautiful

Img1

Read more

Mistakes of a Software Engineer - Favor NoSQL over SQL - Part 2

Are those all the problems that I have with NoSQL?

No. But I’m lazy now 😂. So I won’t talk about the problems anymore.

Choosing a Database system from a Product perspective

So, how do I choose a database system from my Product Engineer point of view?

The answer is: It depends. (of course 😂)

The philosophies of the 2 database systems are different.

NoSQL is designed for simple operations with very high read/write throughput. It fits best if you usually read/write whole unstructured objects using very simple queries. There are some specific use cases that you can think of

  • Receive data from another system (an integration API, for example): You can use a NoSQL database as temporary storage to increase API response time and safety by deferring data processing to a later time (see the sketch after this list). Read more: Scaling the System at AR - Part 2 - Message Queue for Integration
  • A backing service to store Message Values for a Message Queue application
  • A caching database where you know exactly what you need and can query easily by key (but even in this case, think of other solutions in SQL first, like materialized views, triggers,…).
  • Others?
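A minimal sketch of that first use case, assuming MongoDB and its official C# driver (collection and field names are hypothetical): the integration endpoint only dumps the raw payload into a collection and responds immediately, deferring the real processing to a background worker.

using System;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public class IntegrationInbox
{
    private readonly IMongoCollection<BsonDocument> _inbox;

    public IntegrationInbox(IMongoDatabase db)
    {
        // Schemaless storage is fine here: we never query inside the payload,
        // we only read the whole document back when processing it later
        _inbox = db.GetCollection<BsonDocument>("integration_inbox");
    }

    public async Task<string> ReceiveAsync(string rawJsonPayload)
    {
        var doc = new BsonDocument
        {
            { "payload", BsonDocument.Parse(rawJsonPayload) },
            { "receivedAt", DateTime.UtcNow },
            { "processed", false }
        };

        // One cheap write, then return: the API stays fast no matter
        // how expensive the actual processing is
        await _inbox.InsertOneAsync(doc);
        return doc["_id"].ToString(); // the driver assigns _id on insert
    }
}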
Read more

Mistakes of a Software Engineer - Favor NoSQL over SQL - Part 1

There is nothing called Schemaless

One argument that people usually make for NoSQL is the flexibility in schema design. They are Schemaless databases, very similar to objects in a programming language. The dynamic schema is said to be very powerful and to support many different use cases, without the limitations of what SQL provides. You can simply throw your objects into it and do anything you want.

Is it really true? For me, it’s NOT.

There is nothing called Schemaless. The schema just moves from the Database level to the Application level. With Document databases, you are actually using an implicit data schema model (compared to the explicit data schema in SQL). Instead of letting the database handle the schema and data types itself, you have to do it yourself in your application code.

  • Some databases support simple schema validation. However, none of them are as complete as what SQL provides. You are limited to just some basic data types (string, number, boolean,…) while SQL offers a wide range of data types for different use cases (for example: varchar, nvarchar, text,… just for string data). Working with strict data types is always easier and more performant.
  • In case you handle the schema yourself, it may be OK if you use a statically-typed programming language (like C#). However, if you are developing your application in Nodejs, Ruby, PHP,… (which is quite common for Startup companies), you have to maintain another layer of schema validation in your application because the data is just runtime objects.

Schema migration is another problem. NoSQL databases encourage you to just throw objects in and not care much about the schema. Some databases don’t even require you to create any table: simply write the data and the tables will be created automatically. When you read/write the data, it is hard to know which version of the schema it is on. You will have to add many if statements just to check for what should have been provided. No matter how carefully you design your application or how strict your coding conventions are, there will always be cases where your code meets data with an outdated schema. For critical business workflows, I want everything to be expressed explicitly, not implicitly.
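To make that concrete, here is a sketch of the defensive reading code you end up writing (a made-up example using MongoDB’s C# driver; all names are hypothetical). The same collection holds documents written by several versions of the application, so every read has to probe for missing or renamed fields.

using MongoDB.Bson;

public class Customer
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public bool IsActive { get; set; }
}

public static class CustomerReader
{
    public static Customer Read(BsonDocument doc)
    {
        var customer = new Customer();

        // v1 documents stored a single "name" field,
        // v2 split it into "firstName" / "lastName"
        if (doc.Contains("firstName"))
        {
            customer.FirstName = doc["firstName"].AsString;
            customer.LastName = doc.GetValue("lastName", "").AsString;
        }
        else if (doc.Contains("name"))
        {
            var parts = doc["name"].AsString.Split(' ');
            customer.FirstName = parts[0];
            customer.LastName = parts.Length > 1 ? parts[1] : "";
        }

        // v1 documents didn't have this field at all
        customer.IsActive = doc.GetValue("isActive", true).ToBoolean();

        return customer;
    }
}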

Read more

What is this? This is just a place to summarize the mistakes that I have found from my experience as a Software Engineer.

The first one that I want to talk about is the war between SQL and NoSQL. This is one of my favorite topics. I can debate this topic for a whole day. 😁

It doesn’t mean I hate NoSQL. I just want to use the right tool for the right job…

This may be true at this time, but who knows what will change in the future? I have faced a lot of problems with the design of NoSQL, especially when it is the main backing service for the whole system. Each database has its own strengths and weaknesses, and we should use it for the most suitable job. At the very least, pick the one that can solve 80% of your problems and live with the other 20% of pain, instead of choosing the one that fits 20% of the use cases and then fixing the problems it brings to the other 80%.

By NoSQL, I mean Document database systems (MongoDB, for example)

A real-life loop

Back in the 1970s, the most popular database used a simple data model called the hierarchical model, which is similar to the JSON model used in document databases today. That model worked well for one-to-many relationships. However, when it came to many-to-many relationships, the design exposed a whole lot of problems, from joining issues to duplicate data management. Two alternative solutions were proposed: the Relational model and the Network model. History shows the obvious outcome: the Relational model won in the long run.

It seems that NoSQL databases today are repeating problems from history that were solved decades ago. Despite the advantages of a more flexible data schema and a better representation of data (more similar to objects in a programming language), the 1970s problems still exist, and developers still have to handle them manually in an awkward way.

Reference: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Read more

Part 2 here: Basic Logging & Debugging in Microservices - Part 2

ElasticSearch for storing the log data

At this point, you may have a good logging library that you built on your own. Your application probably produces a bunch of log entries, and you can debug it using the cat and tail commands. You can also open the log file in your favorite text editor and search for the log entries that you want. However, there should be a better solution than that. When your log file grows to thousands, millions or even billions of lines, debugging manually like that is not feasible.

Fortunately, ElasticSearch is a good choice for organising log entries. Setting up ElasticSearch is quite a straightforward task. There is even a pre-built Docker image for it. Our team at AR doesn’t even care much about optimising it because we rarely face problems with it. Of course, our ElasticSearch instance goes down sometimes, but since it’s not a critical module of the system and doesn’t affect the main application’s status, we don’t need to invest much time in optimising it. What we do is configure a restart policy so that Docker can recover the process after an OOM kill. Another thing we need to do is set up a daily script for deleting old log entries to save disk space (a sketch follows below). You can find installation instructions for ElasticSearch on its official Docker image page
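That daily cleanup can be as simple as deleting the one index that has just fallen out of the retention window. Here is a minimal sketch, assuming the common one-index-per-day naming scheme (logstash-YYYY.MM.DD, which fluentd’s logstash_format option produces); the host, index prefix and retention period are hypothetical.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class LogRetentionJob
{
    private static readonly HttpClient Http = new HttpClient();

    // Run this once per day (cron, a scheduled task, a systemd timer, ...)
    public static async Task DeleteOldIndexAsync(int retentionDays = 30)
    {
        var expired = DateTime.UtcNow.AddDays(-retentionDays);
        var indexName = $"logstash-{expired:yyyy.MM.dd}";

        // ElasticSearch exposes index deletion as a plain HTTP DELETE
        var response = await Http.DeleteAsync($"http://localhost:9200/{indexName}");
        Console.WriteLine($"DELETE {indexName}: {(int)response.StatusCode}");
    }
}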

fluentd for pushing the log data to ElasticSearch

Now you have your ElasticSearch server up and running. The next thing you need to do is push your log data to it. fluentd, an open source data collector for the unified logging layer, allows you to unify data collection and consumption for better use and understanding of data. If your application writes all the log entries to a file or to standard output inside Docker, you can set up fluentd to tail the log files and push each line to ElasticSearch as soon as it is finished (just like the tail command). In case you don’t know, Docker redirects a container’s standard output to a file on disk. You can find a sample fluentd setup here https://github.com/tmtxt/clojure-pedigree/tree/master/images/fluentd. The most important thing that you need to notice is its configuration file, the td-agent.conf file. You will need to modify it to fit your requirements
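For a rough idea of what to expect in there, a minimal td-agent.conf in that spirit might look something like this for a recent fluentd version (paths, tags and hosts here are hypothetical; the linked sample above is the real one):

# Tail the application log file, one JSON object per line
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/log/td-agent/app.log.pos
  tag app.log
  <parse>
    @type json
  </parse>
</source>

# Push every matched entry to ElasticSearch
# (requires the fluent-plugin-elasticsearch plugin)
<match app.**>
  @type elasticsearch
  host localhost
  port 9200
  logstash_format true   # creates daily logstash-YYYY.MM.DD indices
</match>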

Read more

Configuration: it sounds very simple, but many people do it the wrong way.

Configuration is an essential part of any application (the term Configuration here does not include internal application config, such as route path definitions). Configuration varies across deploys and environments; it provides confidential information to the application (such as the database connection string) and defines the way the application should behave (for example, enabling or disabling specific features). It sounds quite simple, but I found that many people did it the wrong way. That’s why we have this blog post.

Take a look at this

Recently, I worked on another project, which was written in ASP.Net Core 3.1. After a while digging into the code, I saw these patterns and felt really annoyed by them. They relate to the way the Configuration is implemented and passed around the application, as well as the way it is consumed.

Here is the Setting interface and its implementations.

public interface ISetting
{
    Env Environment { get; }
    string MailgunApiKey { get; }
    string AppName { get; }
}

public class LocalSetting : ISetting
{
    public Env Environment => Env.Local;
    public string MailgunApiKey => "key-xxxx";
    public string AppName => "Prod Name";
}

public class StagingSetting : ISetting
{
    public Env Environment => Env.Staging;
    public string MailgunApiKey => "key-yyyy";
    public string AppName => "Prod Name";
}

public class ProdSetting : ISetting
{
    public Env Environment => Env.Prod;
    public string MailgunApiKey => "key-zzzz";
    public string AppName => "Prod Name";
}

Here is an app component which uses the Setting object

// Resolve, request and SendEmail come from the surrounding handler context
public async Task Execute()
{
    var setting = Resolve<ISetting>();
    var emailToUse = "";

    switch (setting.Environment)
    {
        case Env.Local:
            emailToUse = "[email protected]";
            break;
        case Env.Staging:
            emailToUse = "[email protected]";
            break;
        case Env.Prod:
            emailToUse = request.ActualEmailAddress;
            break;
        default:
            break;
    }

    await SendEmail(emailToUse, setting.MailgunApiKey);
}
Read more

Scaling the System at AR - Part 3 - Message Queue in general

Continuing from my previous post, I’m going to demonstrate the internal tool that we use at AR to work with the Message Queue. I will also summarize some of our experience when designing a system with a Message Queue.

The Message Queue in AR system

Currently, we have over 100 different types of workers/queues. We have built a tool to manage them efficiently.

The tool allows us to quickly filter for any Subscription (Queue/Worker) on the Message Bus.

Read more

Scaling the System at AR - Part 2 - Message Queue for Integration

In the previous post, I mentioned Message Queues in some specific use cases for Integration components. In this post, I’m going to talk about Message Queues in general and what the workflow looks like at AR.

Message Queues in design

One of the main differences of the AR system is that most of the tasks are background tasks, backed by several Message Queues. There are several reasons for us to choose this design

  • We want to keep the user-facing API and databases simple. This way, the API responds very fast and the app performs more smoothly. That makes a good impression on our users and makes them happier.
  • We can isolate different aspects of the system
    • We can easily limit the resource consumption of the less important tasks (the tasks that are not user-facing or do not need results immediately), for example, the task to log User Activities or the task to export User Data.
    • We can also allocate more resources for, and scale only, the tasks that are critical to the users, for instance, the task to send a Blast email in case of disaster.
    • This is controlled via various parameters when creating the queue and running the worker
      • The number of worker instances running at the same time
      • The number of concurrent messages that a worker instance can pull and process at the same time
      • The delay of the messages that are published to the queue
  • The Message Queue ensures eventual consistency for our system in case of failure. The system is fault-tolerant by design. Even if the database or the network is down, all the tasks are guaranteed to be processed at some point in the future. Moreover, we only need to retry the failed parts, not the full flow.
  • Each worker is a re-usable workflow with the message value as the input data. Want to implement a new feature that re-uses the same flow? Instead of invoking the same functions directly, you just publish a message to the corresponding queue (see the sketch after this list).
  • This allows us to choose a different technology for each worker, depending on the requirements. Most of our workers are written in Nodejs. However, some of them are written in Golang. We also have a team with many C# experts working on Integration projects. There is no problem for us to integrate everything into one and the same workflow.
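A rough sketch of that publish-to-reuse idea, assuming AWS SQS (one of the two services we use, see Part 1) and hypothetical queue/message names: any feature that needs the export flow just publishes a message instead of calling the export functions directly.

using System.Text.Json;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class ExportUserDataPublisher
{
    private readonly IAmazonSQS _sqs = new AmazonSQSClient();
    private const string QueueUrl =
        "https://sqs.us-east-1.amazonaws.com/123456789012/export-user-data";

    public Task TriggerExportAsync(string userId)
    {
        return _sqs.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl = QueueUrl,
            MessageBody = JsonSerializer.Serialize(new { userId }),
            DelaySeconds = 60 // the "delay" knob mentioned in the list above
        });
    }
}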
Read more

It has been one and a half years since my first post about this topic :(

Continuing from my previous post Scaling the System at AR - Part 1 - Data Pre-Computation, this time I’m going to talk about one of the most important components of the AR system: the Message Queue.

A Message Queue is an asynchronous inter-service communication pattern. It is a temporary place to store data while it waits for the message receiver to process it. It encourages decoupling of logic and components in the system, provides a lightweight and unified protocol for communication between different services (written in different languages) and is perfectly suitable for a Microservice design. A good message queue should satisfy these criteria

  • It must be fast and capable of handling a large amount of messages coming in at the same time.
  • It has to ensure the success of message processing. A message must be processed and retried until success. Otherwise, an Error queue (Dead Letter queue) should be provided to store the failed messages for later processing (see the sketch after this list).
  • It is required that each message is processed by one and only one consumer at the same time.
  • The message queue should be independent of any language, allowing various applications written in different languages to send and receive messages without any problem.
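As a minimal consumer sketch against AWS SQS (one of the services mentioned just below; the queue URL and handler are hypothetical): deleting the message only after successful processing gives you the retry-until-success behaviour, because a failed message becomes visible again after the visibility timeout, and a redrive policy moves it to the Dead Letter queue after too many attempts.

using System;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

public class Worker
{
    private readonly IAmazonSQS _sqs = new AmazonSQSClient();
    private const string QueueUrl =
        "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue";

    public async Task RunAsync()
    {
        while (true)
        {
            var response = await _sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = QueueUrl,
                MaxNumberOfMessages = 10, // the concurrency knob
                WaitTimeSeconds = 20      // long polling
            });

            foreach (var message in response.Messages)
            {
                try
                {
                    await ProcessAsync(message.Body);
                    // Ack only on success; SQS delivers at-least-once
                    await _sqs.DeleteMessageAsync(QueueUrl, message.ReceiptHandle);
                }
                catch (Exception ex)
                {
                    // No delete: the message will reappear and be retried
                    Console.Error.WriteLine(ex);
                }
            }
        }
    }

    private Task ProcessAsync(string body) => Task.CompletedTask; // your logic here
}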

Because the Message Queue is so important to us and we have a limited number of developers, we decided to switch to third-party services after several months of doing both dev and ops work with Kafka. Both Google Pub/Sub and AWS SQS offer the service at a relatively cheap price, and you can choose either of them, depending on the Cloud platform that you are using. AWS SQS seems to be better since it offers a lot of functionality around the SQS service itself, for example, mapping Message events to Lambda, which allows us to save a lot of time on the ops side and focus more on our core business value.

Currently, we are running 2 different systems on 2 different Cloud providers and we are using both solutions.

Read more