How MediaMachine uses Traefik for easy-to-maintain request routing

6 min read · Jun 12, 2021

MediaMachine.io is an IaaS platform for user-generated video content, and we use Traefik as the reverse proxy in our network layer. This post details how we leverage Traefik to run a highly available infrastructure that is also easy for us to maintain.

So, what is Traefik?#

Traefik is a popular, open-source reverse proxy written in Go. It is actively maintained by Traefik Labs (formerly called Containous). Traefik lets you expose network resources to the internet (or to other networks) and manage routing based on hostnames, URL paths, and so on.

You can use Traefik to redirect HTTP traffic to another, private IP address, to provide HTTPS access for your customers, or to act as a reverse proxy for web servers sitting in your internal VPC. This is especially useful if you run your infrastructure on a public cloud, such as AWS.

What problems does it solve for us?#

Let’s see how Traefik fits into a modern network plane.

As the name suggests, the most obvious thing we want from a reverse-proxy is to proxy our servers sitting inside our private VPC to the outside world. However, we want to do it in a way that is easy to configure and highly dynamic. We elastically scale our servers based on load and we can’t have a static mapping of external <-> internal server IP pools.

Our wish list for a reverse proxy in a modern public cloud infrastructure stack:

Gated access to internal resources

Isolate internal and external networks
Ability to block unwanted traffic to internal servers from external networks
Simple SSL support (including certificate verification+renewal)
Support for TCP/UDP routing

Dynamic configuration that updates itself as server membership changes

Automatically update routing as servers deploy or autoscaling is triggered
Support for applying middleware and routing logic on the fly

High-Availability and Graceful Failover

Ability to run multiple instances for scalability and load-balancing
Health checks and graceful failover mechanisms

Easy to configure and maintain

Ease of debugging when getting paged at 3 AM
Flexible routing based on rules like URL match, host name, etc.
Support features like A/B testing and canary deployments

Needless to say, performance is a whole category of its own. If the proxy itself is slow, you pay a penalty for every single request passing through it.

The other tools in the reverse-proxy landscape#

There is a rich ecosystem of tools in the reverse-proxy landscape. One of our beloved alternatives is HAProxy, and we’ve had great success with it in other projects.

The other popular players in this space are:

HAProxy
NGINX
Apache
Caddy

PS: Check out this awesome-proxy page.

Let’s look at an AWS ALB setup without Traefik#

You can get started with AWS ALB for simple routing based on host names as well as for easy SSL termination. We prefer to use ALB for SSL because it makes managing certificates via AWS pretty straightforward. If you don’t use ALB, you can definitely use Traefik with its built-in SSL support via Let’s Encrypt, or bring your own certs.

You’ll notice that the configuration rules supported by ALB don’t support very complex setups. For example, we can easily do this:

But the rules aren’t very expressive. They don’t let us do things like fix URLs on the fly, rate-limit requests with custom tuning, or compress responses (more on this later):

There are other limitations too. ALB configuration updates work on a push model, so spinning up a new service means pushing the configuration to ALB to start routing. Wouldn’t it be nice if we were able to deploy a new service and have the routing declared automatically? That’s where Traefik comes in…

Add Traefik for easy, simple routing#

We can configure Traefik to help correct our URL from the example above:

# http routing section
[http]
  [http.routers]
    # Define a connection between requests and services
    [http.routers.api-mediamachine]
      # catch-all rule, assumed here because the original snippet did not show one
      rule = "PathPrefix(`/`)"
      middlewares = ["fix-oops"]
      # forward to the mediamachine api service (declared below)
      service = "api-mediamachine"

  [http.middlewares]
    # Rewrite /oops/... to /fixed/... before the request is forwarded
    [http.middlewares.fix-oops.replacePathRegex]
      regex = "^/oops/(.*)"
      replacement = "/fixed/$1"

  [http.services]
    # Define how to reach an existing service on our infrastructure
    [http.services.api-mediamachine.loadBalancer]
      [[http.services.api-mediamachine.loadBalancer.servers]]
        url = "http://private-ip-addr-in-vpc/"

Traefik gives us a lot of different tools via its middleware implementation. There is some serious firepower available via Traefik middlewares, and they’ve recently added support for plugins too.
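As a minimal sketch of what two of these middlewares look like in dynamic configuration, here are the rate-limiting and compression cases that ALB rules could not express. This assumes Traefik v2-style TOML; the names api-ratelimit and gzip-responses and the numbers are hypothetical, not our production values.

```toml
[http.middlewares]
  # Allow an average of 100 requests per second from a given source,
  # with short bursts of up to 50 requests
  [http.middlewares.api-ratelimit.rateLimit]
    average = 100
    burst = 50

  # Compress responses before they go back to the client
  [http.middlewares.gzip-responses.compress]
```

Either one is attached by adding its name to a router's middlewares list, the same way fix-oops is attached in the routing example.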

For example, using the InFlightReq middleware, we set an upper limit on how many requests can stay in flight to our backend servers. When the threshold is breached, Traefik automatically starts returning 429s to callers, which protects our servers from falling over. If PagerDuty has ever woken you up at 2 AM because of a cascading failure, you'll appreciate the CircuitBreaker middleware.
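A rough sketch of both middlewares, again assuming Traefik v2-style TOML. The middleware names, the limit of 100, and the 30% threshold are illustrative assumptions rather than the values we run with.

```toml
[http.middlewares]
  # Cap the number of simultaneous in-flight requests;
  # requests beyond the cap are answered with a 429
  [http.middlewares.cap-inflight.inFlightReq]
    amount = 100

  # Open the circuit when more than 30% of requests hit network errors,
  # so a struggling backend gets room to recover
  [http.middlewares.backend-breaker.circuitBreaker]
    expression = "NetworkErrorRatio() > 0.30"
```

The right numbers depend on how much concurrency each backend can actually absorb, so treat these as starting points to tune under load.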

Our Traefik setup#

We run Traefik in an autoscaling group that is set to scale up and down based on load. Our fleet runs on a handful of t3.micro instances. Traefik itself is totally stateless, and we pass a templated configuration file to each instance at startup.

Each instance talks to Consul periodically to fetch the latest service metadata and routing membership data. Traefik has built-in support for a bunch of configuration providers, and we use a combination of File and Consul Catalog for dynamic config updates.
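For illustration, a static configuration that enables both providers might look roughly like the following. This is a sketch assuming Traefik v2-style TOML; the file path, refresh interval, and Consul address are placeholders, not our actual settings.

```toml
# Static configuration, read once at startup

[providers.file]
  # Hand-written dynamic configuration (routers, middlewares, services)
  filename = "/etc/traefik/dynamic.toml"
  watch = true

[providers.consulCatalog]
  # How often to poll Consul for service and membership changes
  refreshInterval = "15s"
  # Only route to services that opt in with a traefik.enable=true tag
  exposedByDefault = false

  [providers.consulCatalog.endpoint]
    address = "127.0.0.1:8500"
```

Keeping exposedByDefault off is a conservative choice: a service only becomes routable once it explicitly asks to be.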

Since we run our container workloads on Nomad, this pairing works out really well for us. We configure Nomad to roll out our deploys with a certain delay between each instance so that Traefik can update the membership metadata. This ensures our deploys don’t cause a blip in our service availability.

AWS ALB sits in front of Traefik and performs SSL termination as well as some basic routing — we run a split DNS setup that lets us isolate internal host names and endpoints. We’ll share a runbook soon with a deep dive into dynamic configuration with Consul+Nomad and some other customizations.

At present we don’t send any TCP traffic through Traefik, but we are experimenting with it for managing some database endpoints. For monitoring, we use DataDog, but Traefik can send metrics to any StatsD-enabled metrics ingestion service.
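As a hedged example, pointing Traefik's metrics at a local agent takes only a few lines of static configuration. The address and push interval below are assumptions for a typical agent setup, not our exact values.

```toml
# Static configuration: push metrics to a local Datadog agent
[metrics.datadog]
  address = "127.0.0.1:8125"
  pushInterval = "10s"

# Alternative for a plain StatsD collector:
# [metrics.statsD]
#   address = "127.0.0.1:8125"
```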

Limitations#

One limitation of Traefik is that there is no easy dry run when pushing config changes. We’d love to see a setting to reject bad configuration without impacting existing routing. Maybe this ticket will see some action soon: https://github.com/traefik/traefik/issues/6451. In the meantime, a hacky alternative is to run docker run traefik <config file> and watch for errors.

We’ve also hit a few spots where we could’ve written a custom plugin to solve a problem specific to our setup but the lack of support for private plugins blocked that approach. Currently, only open-source plugins are supported.

Conclusion#

MediaMachine is a platform designed for high throughput video transcoding, intelligent thumbnail generation and video summarization, and Traefik is able to handle all the incoming traffic as well as the inter-service traffic flying through our system with ease.

We’ve been pretty happy with Traefik in our setup so far, and we are loving how it integrates with the rest of our stack: Nomad + Consul + Docker.

As promised, a Runbook with more configuration tips and examples is out!

Originally published at https://mediamachine.io on June 12, 2021.