It's worth noting that distillery is deprecated in favor of mix releases, which don't support relups out of the box, and specifically warn against them due to the complexity involved in writing code to support them correctly.
It's a cool feature that's no doubt amazing for applications that need it, but it brings a fair amount of complexity vs other deployment strategies.
Good point. Someone shared this, in case anyone is wondering:
https://elixirforum.com/t/how-to-tweak-mix-release-to-work-w...
> I’ve spent some time understanding how to do hot code reloading with releases built using mix release, and here I’d like to detail the steps needed, in hopes that it will help someone.
Yeah, note that this article is from 2016. I distinctly remember during that time that these hot-swap deployments were all the rage in the Elixir community, and then fell out of fashion with time.
At kosmi.io we use elixir hot swapping for every small patch/bugfix on the backend. This allows us to deploy updates multiple times a day with 0 disruption.
Allows the clients to remain connected and be none the wiser that there was an update at all.
For larger updates, when in-memory data structures or the supervision tree change, we just do hard restarts.
Would love to know more about how you go about it.
It's a little hacky but I'll try to explain:
* The server runs in a Docker container which has an SSH server installed and running in the background. The reason for SSH is simply that it's what edeliver/distillery uses.
* The CI (a local GitHub runner) runs in a Docker container as well, and handles building and deploying the updated releases when merged to master.
* We use edeliver to deploy the hot upgrades/releases from the CI container to the server container. This happens automatically unless stopped, which we do for larger merges where a restart is needed.
* The whole deployment process is done in a bash script which uses the git hash for versioning, uses edeliver for deploying, and at the end runs the database migrations.
I'm not going to say it's perfect but it's allowed us to move pretty damn fast.
Live updating a drone running Erlang in 10ms while it was flying with no application restart and no loss of state impressed me when I saw it in 2021:
https://www.youtube.com/watch?v=XQS9SECCp1I
But I almost never hear Erlang/Elixir/Gleam folks talk about this benefit of the Erlang VM now, even though it seems fairly unique and interesting. Has the community moved away from it? Is it just not that useful?
A lot of web apps are just well-enough served with a blue-green deployment model. It is less risky.
But if you really need it, it's really great to have that option (e.g. very long running systems which are split in front/back etc), and it can be used in creative ways too (like the Drone example).
Here is a lightning talk I gave about how to use hot-reload for music / MIDI interactions: https://www.youtube.com/watch?v=Z8sGQM6kLvo
Great talk, thanks, nice to see other creative uses. Great idea to add LiveView and SVGs for the keyboard UI.
"…thanks to hot reloading, which — for once — is useful…"
That seems to sum up the sentiment that hot swapping in Erlang has uses but they're generally not aligned with what Erlang is typically employed for. It seems like it would be great for tight game dev loop feedback and iteration too, for example, but that's not a traditional use of Erlang either.
> That seems to sum up the sentiment that hot swapping in Erlang has uses but they're generally not aligned with what Erlang is typically employed for
Actually, I think it is much more common in original Erlang scenarios (including "non-web") where high availability is a prerequisite.
It is in my experience less common in Elixir, which is often more web-oriented (although not exclusively).
Extremely cool, thanks for sharing!
A lot of the GenServer information floating around explains code_change/3, no? That's commonly what you want: a way to handle state propagation when process code is updated in a running system.
Most people are probably running some web services or something and might as well shift machines in and out of a cluster or can wait for old processes to disband on their own, because the new code is backwards compatible with the one in already running processes, and so on.
It can also be relatively hard to do without causing damage to the system. Those who need and can manage it probably don't need it marketed.
Someone put a reply and then deleted it while I wrote a response, and it irks me that it might have been a waste so here's the gist of it:
"Is it just that people are more comfortable with blue-green deploys, or are blue-green deploys actually better?"
It depends. If you can do a blue-green shift where you gradually add 'fresh' servers/VMs/processes and drain the old, that's likely to be most convenient and robust in many organisations. On the other hand, if you rely on long-running processes in a way where changing their PIDs breaks the system, then you pretty much need to update them with this kind of hot patching.
"Does Erlang offer any features to minimize damage here?"
The BEAM allows a lot of things in this area, on pretty much every level of abstraction. If you know what you're doing and you've designed your system to fit well into the provided mechanisms, the platform provides a lot of support for hot patching without sacrificing robustness and uptime. But it's like another layer of possible bugs and risks: it's not just your usual network and application logic that might cause a failure, your handling of updates might itself be a source of catastrophe.
In practice you need to think long and hard about how to deploy, and test thoroughly under very production-like conditions. It helps that you can know for sure what production looks like at any given time: the BEAM VM can tell you exactly what processes it runs, what the application and supervisor trees look like, hardware resource consumption and so on. You can use this information to stage fairly realistic tests with regards to load and whatnot, so if your update for example has an effect on performance and unexpected bottlenecks show up, you might catch it before it reaches your users.
And as anyone can tell you who has updated a profitable, non-trivial production system directly, like a lot of PHP devs of ye olden times, it takes a rather strong stomach even when it works out fine. When it doesn't, you get scars that might never fade.
This is also a reply to that deleted comment, because I had to type it all and also got to go outside and have my European 2 hour long lunch break while doing it.
If you have any kind of state in a gen_server and the state or assumptions about it have changed, you need to write that code_change thingy that migrates the state both ways between two specific versions. If by some chance this function is bugged, then the process is killed (which is okay), so you need to nail down the supervision tree to make things restartable and also not get into restart loops. Remember writing database migrations for Django or whatever ORM of the day? Now do that, but for the memory structures you have.
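A minimal sketch of that callback, assuming a hypothetical gen_server whose state tuple grew a field between two versions:

    %% upgrade: the old state was {state, Count}; give the new field a default
    code_change(_OldVsn, {state, Count}, _Extra) ->
        {ok, {state, Count, #{}}};
    %% downgrade: drop the field the old code can't pattern match on
    code_change({down, _Vsn}, {state, Count, _Cache}, _Extra) ->
        {ok, {state, Count}}.

If a clause like that is bugged, you're back to the killed-process-and-supervision-tree story above.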
Now, while the function is running it can't be updated of course, so you need gen_server to call you back from outside of the module. If you like to save function references instead of process references in your state, you need to figure out which version you will actually be calling.
If you change the size of your record (records are just tuples underneath), then the old record no longer matches your patterns.
Since updates are not atomic, you will have two versions of the code running at the same time, potentially sending messages that old/new stuff does not expect, and both old and new code should not bug out. And if they do bug out, you have been smart enough to figure out how to recover and actually test that.
Then there is this thing: if somehow something from version V-2 is still running after the update to V-1 and you start updating to the latest V, then things happen.
You can deal with all that of course, and Erlang gives you tools and recipes to make it work. Sometimes you have to make it work, because restarting and losing state is not an option. Also, it's probably fun to deal with complex things.
Or you could just do the stupid thing that is good enough and let it crash and restart, instead of figuring out ten different things that could go wrong. Or take a 15-minute maintenance window while your users are all sleeping (yes, not everybody is doing critical infra that runs 24/7, like a Discord group with game memes). Or just do blue-green and sidestep it all completely.
Huh, really? I feel like I see Elixir folks sing the praises of beam pretty regularly. Specifically the OTP supervisor stuff for fault-resistant server deployments. I haven’t looked specifically for that though recently so maybe people are taking it for granted?
This is a talk about a large-scale, resilient Elixir/Erlang deployment in healthcare.
Specifically they talk about running with no down time using hot code reloading here: https://youtu.be/pQ0CvjAJXz4?t=2667 but the whole talk is quite interesting regarding availability.
Warning: the video is quite quiet.
Lisp has had this feature since day 1. But Lisp-like langs like Clojure, Racket, etc. don't have it. This is one of the fundamental features of Common Lisp and I don't know why most other Lisp wannabes don't implement it.
Clojure has it for a large percentage of functionality: things like https://github.com/clojure-emacs/cider depend on it. However, this mostly stays at dev time and isn't used much for releases, which I find a bit funny, because Clojure's functional, data-driven philosophy is great for enabling painless hot-code updates.
Can’t you do something like this with clojure?
I don’t understand the particulars, but one selling point of biff is it’s got built-in support for updating things directly in prod via the REPL.
There’s a fun interview with the biff guy on the podcast “the REPL”. He talks about how much fun it is to develop directly on the prod server, and how horrified people are by it lol.
Came here to say this. In Lisp, you can just compile a function, or load a file and it just works. It's not even sold as a hot feature, not the way Erlang sells it. It's just a feature.
I manage a few websites written in Lisp, and updating them is as simple as push code, recompile and it works.
But what if the system is running and the new function takes different arguments or something? What if there is data loaded in the system, what happens to it?
Simply loading new code is easy, ensuring the whole system works seems to require a bit more effort.
Common Lisp has a bunch of features designed to enable migrating the system. e.g. update-instance-for-redefined-class ( https://www.lispworks.com/documentation/HyperSpec/Body/f_upd... ) lets you write code to update instance data between class versions when a class definition is reloaded.
It turns out, though, that making hot-code reloading work well is mainly a question of how you design your system: designing for hot code reloading isn't all that hard for 90% of cases once you figure out the relevant techniques.
We do this in q/kdb+ systems often for patches. An important thing about these languages is that this kind of workflow is part of the core for solving problems. So when you are building a system one of the aspects of its design will always allow for this update method. Then when you push a patch you both know the impact of the change (because you've tested the exact same steps in a dev/QA/UAT/Beta environment) and the work required to do it safely.
Major releases do go through a full shutdown and release cycle though.
Comment was deleted :(
Do those sites have something like Phoenix LiveView, or is it something ad hoc like a simple SSR template engine? Would be nice to have something to handle migrations in the client-side code to match the server-side API.
Hot code upgrades on the BEAM are awesome, but they're not a piece of cake. If you're also interested in the challenges of making them production safe, I gave a talk about this topic on CodeBEAM Sto earlier this year:
https://youtu.be/epORYuUKvZ0?si=gkVBgrX2VpBFQAk5
OP talks in the summary about the importance of understanding the process. It's very much true, but you need to understand not only the process your tooling provides, but also what's going on in the background and what hasn't been taken care of for you by your tools. I'm afraid these things are rarely understood about hot upgrades, even by experienced Erlang engineers.
"hot deploys on fly.io to a planet-wide cluster, in 3 seconds.": https://x.com/chris_mccord/status/1785678249424461897
I used to work for a company that wanted zero downtime through Erlang's hot code reload feature. While it absolutely works, it requires immense effort and extra code to handle state upgrades and downgrades.
The Big Elixir 2018 - Desmond Bowe - Hot Upgrades Are Not Scary
Great discussion 23 days ago on hot code loading:
I wonder if this kind of thing could be used to make the Elixir REPL a bit more LISPy. I like iex a good deal, but I often find myself wishing I could just easily eval some code or expression in the editor and have it make its way into the REPL context. (yes, I know you can `r` on a module, but that's pretty clunky compared to something like CIDER).
In a distributed setup I imagine there could be cases where you want to atomically hot upgrade multiple VMs at the same time. Is this common in practice, and if so, are there recommended patterns/techniques for doing it?
Erlang does have a mechanism that allows a module to control when it moves from the "old version" to the "new version" of its own code. Calls to the module with the fully qualified name (e.g. `module:function()`) will invoke the "new code" once it's loaded, but calls within that module using only function names (just `function()`) will continue to invoke the "old code".
If the portion of the app you were hot upgrading was an OTP process like a GenServer, you could theoretically wait for some sort of atomic coordination mechanism to make that fully qualified function call after the new code has loaded, at least in theory.
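A minimal sketch of that mechanism, with a hypothetical module:

    -module(upgrader).
    -export([loop/1]).

    loop(State) ->
        receive
            upgrade ->
                %% fully qualified call: continues in the newest loaded version
                upgrader:loop(State);
            {add, N} ->
                %% local call: keeps executing the currently running version
                loop(State + N)
        end.

Sending `upgrade` to the process after loading the new module is the classic way to make it switch over at a moment of your choosing.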
We use hot code reloading at my work, but haven't had a reason to atomically sync the reload. Most of the time it's a tmux session with `synchronize-panes` and that suffices. If your application can handle upgrades within a module smoothly, it's rare to have a need for some sort of cluster-level coordination of a code change, at least one that's atomic.
There can't be anything atomic in a distributed system. You can't even atomically hot upgrade it on a single VM anyway -- you instead load the new version of the module and let the dispatcher know to route new calls into it, the same as you would do with a load balancer and a bunch of load-bearing Docker hosts, just inside your app.
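On a single node, that "load the new version and route new calls into it" step looks roughly like this (a sketch from a remote shell, with a hypothetical module name; relups automate this plus the state migration):

    1> code:soft_purge(my_mod). %% true only if no process still runs the old version
    true
    2> code:load_file(my_mod).  %% the new .beam on the code path becomes current
    {module, my_mod}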
Erlang has a code_change function in OTP that allows a gen_server to update its current state and start using new code. No connections need be broken with clients, no long-running processes need be stopped. Just updated in place.
It's not just a routing change.
It's a routing change in the sense that gen_server is routing function calls to the new module definition. I know about gen_server and code_change; the point was that it's conceptually the same mechanism, just on a different level of abstraction.
Routing in-progress connections to a new module seems a rather different thing to me than merely routing new ones.
I mean, yes, there are cases where you want that. But there's no mechanism for it, because you would have to stop the world, do the load, and then resume.
Even within a single VM, hot loading doesn't stop the world; during the load some schedulers will switch before others. Although there are guarantees that mean when a process runs new code and sends a message to another local process, that process will have the new code available when it reads the message. (It may still be running the old code, depending on how it's called though.)
Dealing with multiple versions active is part of life in most distributed systems though. You can architect it away in some systems, but that usually involves having downtime in maintenance windows.
A typical pattern is making progressive updates, where if you want to change a request, first you deploy a server that can handle old and new requests, then you deploy the client that sends the new request, then you can deploy a server that no longer accepts old requests.
For new replies, if the new reply comes with a new request, that works like above... a client that sent a new request must handle the new reply. Otherwise, update the client to handle either type of reply, then update the server to send the new reply, finally remove handling of the old reply in the clients.
It gets a bit harder if your team dynamics mean one person/group doesn't control both sides... Then you need stats to tell you when all the clients have switched.
Sometimes you do need more of a point-in-time switch. If it needs to be pretty good, you can just set a config through a dist 'broadcast'. If it needs to be better than that, you can have the servers and clients change behavior after a specific time... but make sure you understand the realities of clock synchronization and think about what to do for requests in flight. If that's not good enough, you can drop or buffer requests for a little bit before your target time, make sure there are no in-progress requests, then resume processing requests with the new version.
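As a sketch of that first step, with hypothetical request shapes and a hypothetical find_user helper, the server side of a progressive update can be as mundane as accepting both tuples for a while:

    %% old request shape, kept until all clients have been upgraded
    handle_call({get_user, Id}, _From, State) ->
        {reply, find_user(Id, #{}, State), State};
    %% new request shape with options
    handle_call({get_user, Id, Opts}, _From, State) ->
        {reply, find_user(Id, Opts, State), State}.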
Is this like a similar feature in Smalltalk/Pharo and Lisp?
Yes, the basics are there in Smalltalk and there's more support built into Erlang.
Also:
"Live program changes in the Dart VM"
https://github.com/dart-lang/sdk/blob/main/docs/Hot-reload.m...
"Live reloading for your ESP32"
Does this hot swapping also work for closures?
Erlang doesn't have closures, because Erlang doesn't have variables. The compiler simply desugars it to a partially applied function referenced by its name (yes, those inline functions in fact have names).
If you have something_function, then the first inline function used in it will be -something_function/1-fun-0-, with zero being the index and the captured variable being another argument. Now if you change the host function to have more inlines before it, the indexing will drift.
So I would expect the body of the inline function to still be resolved from the old version of the module, but I didn't actually try.
Source: I did run erlc -S at least once.
Add: now thinking of it, will the call to a local function from the old version of the module ever escape into the new one without first returning back to gen_server and letting it call the new version? Another comment says that calls within the module never do, so the assumption was correct.
Erlang absolutely has closures, you are mistaken. What you are referring to are "function captures", which bind a function reference as a value, and there is no environment to close over with those. However, you can define closures which as you'd expect, can close over bindings in the environment in which the closure is defined.
The interaction between hot reloads and function captures in general is a bit subtle, particularly when it comes to how a function is captured. A fully qualified function capture is reloaded normally, but a capture using just a local name refers to the version of the module at the time it was captured, but is force upgraded after two consecutive hot upgrades, as only two versions of a module are allowed to exist at the same time. For this reason, you have to be careful about how you capture functions, depending on the semantics you want.
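A minimal sketch of the two capture styles (module and helper are hypothetical):

    -module(capture_demo).
    -export([funs/0, helper/1]).

    helper(X) -> X + 1.

    funs() ->
        Pinned = fun helper/1,                 %% tied to the module version current at capture time
        LateBound = fun capture_demo:helper/1, %% re-resolved to the newest loaded version at each call
        {Pinned, LateBound}.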
> but is force upgraded after two consecutive hot upgrades, as only two versions of a module are allowed to exist at the same time.
Force upgraded is maybe misleading. When a module is loaded for the 3rd time, any processes that still have the first version in their stack are killed. That may result in a supervisor restarting them with new code, if they're supervised.
Ah right, good point - I was trying to remember the exact behavior, but couldn't recall if an error is raised (and when), or if the underlying module is just replaced and "jesus take the wheel" after that.
What does it look like? I was talking about this thing:
Val = 1, SumFun = fun(X) -> X + Val end, SumFun(2).
It looks like you define an arity-1 function that captures Val, while in fact you define an arity-2 function and bind 1 as its first argument. Since you can't redefine Val anyway, it's as good as a closure, but technically it doesn't capture the environment. Maybe I'm mistaken and there is another way to express it?
The example you've given here does not work the way you think it does. I would agree however that the mechanics of closure environments is simpler in Erlang due to the fact that values are immutable, as opposed to closures in other languages where mutability must be accounted for.
I would also note that, for the example you've given, the compiler _could_ constant-fold the whole thing away, but for the sake of argument, let's assume that `Val` is an argument to the current function in which `SumFun` is defined, and so the compiler cannot reason about the actual value that was bound.
The closure will be constructed at the point it is captured, using the `make_fun` BIF, with a given number of free var slots (in this case, 1 for the capture of `Val`). `Val` is written to the slot in the closure environment at this time as well. See the implementation of the BIF [here](https://github.com/erlang/otp/blob/6cefa05a2a977864150908feb...) if you are curious.
At runtime, when the closure is executed, the underlying function receives the closure environment, from which it loads any free vars. In my own Erlang compiler, the closure environment was given via pointer, as the first argument to the function, and then instructions were emitted to load free variables relative to that pointer. I believe BEAM does the same thing, though it may differ in the specific details; conceptually that is how it works.
The compiler obviously must generate a new free function definition for closures with free variables (hence the name of the function you see in the interactive shell, or in debug output). The captured MFA of the closure is this generated function. The runtime distinguishes between the two types of closures (function captures vs actual closures) based on the metadata of the func value itself.
Like I mentioned near the top, it's worth bearing in mind that the compiler can also do quite a bit of simplification and optimization during compilation to BEAM - so there may be cases where you end up with a function capture instead of a closure, because the compiler was able to remove the need for the free variable in cases like your example, but I can't recall what erlc specifically does and does not do in that regard.
> let's assume that `Val` is an argument to the current function in which `SumFun` is defined, and so the compiler cannot reason about the actual value that was bound.
That was exactly the case I was talking about, because otherwise there is no need to even make arity 2 function. If the value is known at compile time, the constant is embedded into the body of inlined function.
> At runtime, when the closure is executed, the underlying function receives the closure environment, from which it loads any free vars.
To my understanding, no it doesn't, as the value is resolved when the fun is created, not when the underlying function executes, which the code you linked shows too. I know it uses the "env" as a structure field, but it's partial application, not an actual closure which has access to the parent scope. Consider two counter-examples in Python (assume ret = [] and from functools import partial):
for x in range(1, 10): ret.append(partial(lambda x, y: x * y, x)) # that's what Erlang does: x is bound when the fun is created, so ret[0](2) == 2
for x in range(1, 10): ret.append(lambda y: x * y) # that's an actual closure: x is captured from the parent context, so all the lambdas return 18 when called with 2
But then again, it doesn't matter since variables are assigned only once.
> Like I mentioned near the top, it's worth bearing in mind that the compiler can also do quite a bit of simplification and optimization during compilation to BEAM - so there may be cases where you end up with a function capture instead of a closure, because the compiler was able to remove the need for the free variable in cases like your example, but I can't recall what erlc specifically does and does not do in that regard.
I was looking into it a week ago, and erlc does what I described when it can't figure out the constant at compile time.
Add: while we are at it, BEAM doesn't even know about variables, only values and registers, so it has nothing to capture anyway.
> To my understanding, no it doesn't, as the value is resolved when the fun is created, not when the underlying function executes, which the code you linked shows too. I know it uses the "env" as a structure field, but it's partial application, not an actual closure which has access to the parent scope
The code I linked literally shows that the closed-over terms are written into the closure environment when the fun is created, and if any term is a heap allocated object, it isn't copied into the closure, only the pointer is written into the env. The only reason you can't observe the effects of mutability here is because, unlike Python, there is no way to mutate bindings in Erlang.
Again, this isn't partial application - not in implementation nor in semantics.
> Again, this isn't partial application - not in implementation nor in semantics.
Maybe you will change your opinion if you take a look at the code 'erlc -S' produces for the inline function.
hot reload of code is nothing new nowadays, but people use it only locally during development, for a REPL-like development style.
in actual production, people prefer to operate at the container level + traffic management, and don't touch anything deeper than the container
> in actual production, people prefer to operate at the container level + traffic management, and don't touch anything deeper than the container
How do you think video games like World of Warcraft or Path of Exile deploy restartless hotfixes to millions of concurrent players without killing instances? I don't think it's a matter of "prefer to", it's a matter of "can we completely disrupt the service for users and potentially lose some of the state?" Even if that disruption lasts a mere millisecond, in some contexts it's not acceptable.
Most of those hotfixes are data-driven, as in database updates. The game server just reloads the data; the binary itself is not touched.
I've never seen a game where they hot reload code inside the game server itself; it's usually downtime or rolling updates.
> Most of those hotfixes are data-driven, as in database updates. The game server just reloads the data; the binary itself is not touched.
And since the data from the disk/database (whether it's a Lua table, XML structure, JSON object, or a query) is then represented as a low-level data structure, that's essentially what hot reloading is - you deserialize the new data and hot-swap the pointers, in the simplest terms.
> I've never seen a game where they hot reload code inside the game server itself; it's usually downtime or rolling updates.
In World of Warcraft, you will literally have bosses despawn mid-fight and spawn again with new stats or you will see their health values update mid-fight, all without the players getting interrupted, their spell state getting desynced, or spawned items in the instance disappearing. This can be observed with the release of every single new raid on live streams as Blizzard employees are watching the world first attempts and tweaking/tuning the fights as they happen.
EDIT: Here's such an example: for the majority of the fight the extra tank could keep a spawned monster away from the boss, then mid-fight the monster suddenly started one-shotting the tank, without any disruption of the instance. This was Blizzard's way of addressing a cheese strat, to force the players to do the fight as designed: https://www.youtube.com/watch?v=7gMm60BXAjU
Yes, but again it's not hot swapping code as in Erlang; the C++ code is unchanged, they just change some XML somewhere.
By your definition every CRUD app has hot-reloading capabilities.
> Yes, but again it's not hot swapping code as in Erlang; the C++ code is unchanged, they just change some XML somewhere.
Right, not on the C++ side, but on the Lua side that WoW uses - you load the new gameplay code that pulls the new data, and override the globals with new functions.
Why does the language matter? The C++ side built in the tooling to allow hot swapping, no?
C++ because 99% of the major games are built in that language.
LPMUDs ran almost entirely on hot reloadable code written in a quirky language called LPC, which later inspired the Pike language.
I believe that only the "driver" code, which handles system calls and hosts the LPC interpreter and is written in C, couldn't be hot reloaded; everything else running in the game could be reloaded without restarting the server.
I'd guess in the modern day, there would be some games where Lua scripts can be hot-reloaded like any other data, from a database or object store.
It's a rather fun language and programming environment, I'd recommend playing around with it over doing AoC.
In addition to what most people said, many other game servers simply announce upcoming maintenance work and take the services offline until the patches are deployed.
This way they can properly test everything and roll back any potential fixes if required. Even banking systems regularly go down for maintenance.
WoW restarts every week. Not sure that’s better than zero downtime deployments
That's just how it works when your backend is hybrid software that combines a low-level compiled language with a high-level language that runs in its own VM. You can use the latter for gameplay features and hotfix them on the go, while core changes require a restart. That's also why WoW hotfixes gameplay code on the go, usually every day around an expansion launch, and defers the bulk of backend changes to the next weekly restart, instead of continuously disrupting the game for players.
That’s a very big assumption that they do code hotpatching.
It would seem far more likely they seperate the stateful (database) and stateless layers (game logic) and they just spin up a new instance of the stateless server layer behind a reverse proxy and spin down the old instance. It’s basically how all websites update without down time.
A website that just proxies to another server does not need to do much to restore the previous state to make it look seamless to a user: the client will just perform another GET request that triggers a few SELECT queries. It's far more complex in the context of a video game.
Games do in fact have downtimes on major releases and you have to restart the client too before connecting.
For major patches/backend changes that require recompiling - yes, for gameplay tweaks/hotfixes - no, hot reloading is preferable where possible.
I work at a company that deploys Elixir/Erlang and while we do /prefer/ to push a fully tested build in a new container, sometimes things get nasty and we need to console in and re-define a module in production. It's not a "best practice" but it stems the bleeding while the best practice is going through its test suite.
> in actual production, people prefer to operate at the container level + traffic management, and don't touch anything deeper than the container
Fred Hebert (and many of the folks he has worked with) do not operate that way: <https://ferd.ca/a-pipeline-made-of-airbags.html>
One nice quote (out of many) from the article:
> The thing that stateless containers and kubernetes do is handle that base case of "when a thing is wrong, replace it and get back to a good state." The thing it does not easily let you do is "and then start iterating to get better and better at not losing all your state and recuperating fast".
(And if one wants to argue with that quote, please read the entire essay first. There's important context that's relevant to fully understanding Hebert's opinion here.)
People may "prefer" simply replacing containers, but as some siblings mention, some applications might require more reliability guarantees.
Erlang was originally designed for implementing telephony protocols, where interrupted phone calls were not an acceptable side effect of application updates.
FWIW as soon as you start using containers you should be able to handle those containers spinning up/down. Pretty much the whole point of containers. At which point you don’t need to bother with code hot swapping since you already have a mechanism for newer containers to spin up while older ones spin down.
The sibling post “that’s how they update without downtime” is super naive. It is absolutely not how they do it.
That's kinda what erlang does, just on a different level. Your docker and your load balancer are both inside your app.
If we were to wedge how Erlang does hot code swapping into a container metaphor, then to get what Erlang does, you'd need to have a container per function call.
Given that it would be absurdly wasteful to use OS processes in containers to clone Erlang's code reload system, AnotherGoodName might take ten minutes to watch Erlang: The Movie to get a better sense of the capabilities of that system. The movie is available from many places, including archive.org.
> If we were to wedge how Erlang does hot code swapping into a container metaphor, then to get what Erlang does, you'd need to have a container per function call.
You have a container that responds to HTTP requests sitting behind a load balancer, then you spawn a new container and tell the load balancer to redirect calls to the new one. From the point of view of whoever is calling the load balancer, you have hot swapping. You may even separate containers into logical groups and call it microservices architecture. Or you can define a process as something having a qualified name and a mailbox and sending messages to other processes.
Now reasonable people may disagree about what's wasteful, but the market seems to tolerate places where adding a checkbox to a form is a half-year process involving five different departments, and the market can't be wrong.
Sure, you can shut down and restart your entire application. You could do that back in 1990 without containers, too.
The thing is that Erlang does hot reload at a per-function (or -according to Hebert- sometimes more-fine-grained) level, so nuking the entire program and paying the cost to start it up again is not at all the same thing as -say- using a not-absurdly-priced AWS Lambda [0] or similar to get per-function hot reloading.
By the way, have you read "A Pipeline Made of Airbags"? If not, you should give it a read: <https://ferd.ca/a-pipeline-made-of-airbags.html>. It might be old news to you, maybe, but maybe not.
[0] Good luck finding one, though.
I didn't read that one before, but I share the sentiment. We can't have cool things, and it was all dumbed down, so the worst case became the default mode of operation. This didn't happen specifically with hot reload in Erlang; it happens all the time, at all levels.
Amusingly, this reminds me sort of about the story of a person who joins a new company only to discover that their programming framework is intricately linked to their version control system.
> in actual production, people prefer to operate at the container level + traffic management, and don't touch anything deeper than the container
I mean, this seems to be "best practices" these days, but I certainly don't prefer it. At least the orchestration I use is amazingly slow. And cold loading changes is terrible for long running processes... this makes deployment a major chore.
It's less terrible if you're just doing mostly stateless web stuff, but that's not my world.
In the time it takes to run terraform plan, I could have pushed Erlang code to all my machines, loaded it, and (usually) confirmed I fixed what I wanted to fix.
Low cost of deploy means you can do more updates which means they can be smaller which makes them easier to review.
(2016)
Where do you see that? I couldn't find it.
It's in the URL :D But yeah, the page doesn't make it clear (and some of the embedded JS has a 2020 date suggesting it's received updates).
In the RSS feed too: Wed, 07 Dec 2016 https://kennyballou.com/index.xml
Hidden in plain view! Ok, let's put 2016 above, on the assumption that the edits since then haven't been too major.
i'm so sick of this DevOps bullshit i wonder if there is an alternative language where you can hot swap code and do all the black magic stuff while keeping the reliability and performance of something like Rust.
I am shocked at the idea of anyone implying that Erlang is "unreliable"... its entire reason for existence was to step up the game on reliability.
GP seems to be implying hot swapping, not Erlang, is unreliable. To which, from my experience using it in Erlang, I heartily agree: it is fraught. Inconsistent state across nodes is much harder to reason about. When you _must_ ensure consistency, hot swapping is reckless, especially as org size and product complexity increase.
Leave hot loading to local/development environments, not production deploys.
Loading configs on the fly can also have some of this risk, but it is much easier to reason about typically.
I heard the quote that some 50% of the mobile traffic is handled by Erlang. Somehow the other 50% seems to be doing just fine (except for the usual shitshow on the inside that software is, everywhere, all the time).
> I heard the quote that some 50% of the mobile traffic is handled by Erlang.
Given that you can implement OTP in any language (albeit with varying degrees of difficulty), that's not surprising.
The thing to remember is that Erlang was first used in production in like 1986. Nearly forty years is more than enough time for the biggest good ideas in Erlang to percolate out into non-BEAM systems.
I agree. I don't know much about Erlang but what I've heard seems to indicate it's used for high-uptime systems that handle errors well.
I suspect the causality is reversed. When you have a well-designed telecom system, then something shaped like Erlang happens to be a good tool to create in order to implement it. The tool then keeps you committed to the design choices you made, by being restrictive enough.
As someone who is now in the rust world and very very sympathetic to the Erlang world... you both probably mean completely different things when you say "reliable". The contexts are just world apart.
Can’t wait to hear your take on NodeJS.
It's not well known, but the JVM has very good hot reload support, and is a very reliable and performant platform.
The black magic comes at the cost of not having one streamlined procedure to release stuff.
And to make the black magic work you have to engage with it. Most of the time people don't even bother to write a proper import.meta.hot.accept thingy in JavaScript. Developers simply hate chores, which is evident from their unwillingness to write proper unit tests (despite knowing that tests work), or from writing just enough to get past the coverage cop and ship the build.
A dedicated small team running something like WhatsApp? Sure, look into the arcane and let it look back at you (although high insight makes one more susceptible to madness, you know). But most of the time you will do a better job with PHP in a stupid restartable box behind seven load-balancing proxies.
> The black magic comes at the cost of not having one streamlined procedure to release stuff.
You can also have a streamlined procedure to release stuff. Most changes in my Erlang-based system consist of "push to staging branch, click to deploy and test, pull to master, click deploy button". Can't be simpler than that. Most changes in such systems are also pretty simple. When you need to add something big, typically not many things depend on it, so the deploy is also pretty simple.
> But most of the time you will do a better job with PHP in a stupid restartable box behind seven load-balancing proxies.
Yeah, we're talking about more complicated things here. If you have something simple, you don't need to use Erlang; `python -m http.server` will be even simpler than your PHP in a stupid restartable box, because you don't need a special box, just one small command.
Do you do 100% of deployments using hot reload? If yes, maybe you should share the recipe with everybody else, since the consensus seems to recommend the opposite.
At the very least you will have a different procedure to upgrade the erlang itself, right?
> If you have something simple, you don't need to use Erlang
I think on a spectrum of difficult things there is an area between hosting a static file on an RPi at home and running a massively distributed system full of long-running stateful processes.
> Do you do 100% of deployments using hot reload?
About 99%. We need to restart servers maybe once a year. Maybe next year we will finally migrate from Erlang 21 to the latest. Most "stopping everything" deployments take at most 1s of downtime, like this month when we needed to upgrade the PostgreSQL database by switching it over to a new machine. Having zero downtime there would have taken a little longer to get consistency, but we could spare a second to make it a much simpler task on the database side (it was a restart of the database module; the rest of the Erlang server was unaffected and clients were not disconnected). Otherwise, most deployments are not visible as disconnections; we have a lot of long-running connections.
> At the very least you will have a different procedure to upgrade the erlang itself, right?
Yes.
> I think on a spectrum of difficult things there is an area between hosting a static file on an RPi at home and running a massively distributed system full of long-running stateful processes.
Is PHP good for both? I think PHP is NOT good for long-running stateful processes, but I haven't used it in 10 years. And it probably is not needed for static files.
Sounds like you are doing cool stuff the cool way, which can't be said about everyone.
> Is PHP good for both? I think PHP is NOT good for long-running stateful processes, but I haven't used it in 10 years. And it probably is not needed for static files.
No, of course it isn't. I didn't touch it with a long pole for even more years and don't even want to. And would not argue to do what you are doing in anything that isn't running on BEAM/OTP.
The point I'm trying to make is -- for most of the web stuff, making a transactional response from a short-lived HTTP handler is good enough, and you can do it even in PHP (which is not praise for PHP as a great tool I enjoy using, but the opposite). It would not be the most optimal solution by any metric, nor the most elegant or sophisticated, but it's survivable; it's the lowest bidder.
Erlang is built for reliability. They're chasing nine nines. Everything about the BEAM is built to emphasize that, the design choices, the documentation, the recommended practices.
Erlang is not very fast, but that's not what it was built for.
Is it really BEAM or just OTP? Sure, BEAM gives you processes, network-transparent sends, immutable structures and the linking/monitoring thingy on top, but is that what makes it good for shooting for nines?
I suspect the aura of mysticism around yet another JIT VM is not that warranted.
It is reasonable to conceive of Erlang as encompassing OTP. Perhaps somewhere in the world there is some developer out there hot on Erlang but just hates OTP and doesn't use it, but they must be fairly frustrated at how hard it is to keep OTP out of their code base if they ever need any libraries.
Restarting is arguably the definitive thing that makes Erlang stack the 9s out past what most languages and runtimes can achieve... the thing is, it's more complicated to use in practice than a web page like this makes it look, and it's beyond what most products need. Few applications need the fifth or sixth or seventh nine, and it gets to the point that you can't have it anyhow because your Erlang cluster, no matter how well distributed, itself probably doesn't have 99.99999 availability, and your users probably don't have 99.99999 availability on their own network connection.
It's not impossibly complicated, but it is the sort of thing where, if you want to use the feature, you need to have it sort of constantly in mind as you write the rest of your system, and it's a lot easier even in Erlang to just design the system to take entire nodes down and bring them back up, if not the entire cluster, rather than fuss with hot reloads. I wish Erlang advocates would be more upfront about pitching this as an interesting niche feature, but not really a reason to consider Erlang. Unless you absolutely need it, in which case it can indeed be the thing that puts it on the short list of choices... but as evidenced by the vast, vast majority of software and systems not being on Erlang and managing to get along, there aren't really that many things that need it.
we had this wonderful thing in PHP where you would just save a .php file and bada bing it was LIVE.
what happened? :D
We still have that and it is awesome. PHP is better than ever.
In serious emergencies I even sometimes end up quickly SSH-ing to a prod server and changing the file directly. Which is kind of horrifying, but hey, the customer is happy it got fixed immediately and I get to relax and take my time to write a proper fix. Beats sweating and watching the pipeline build and asking around for people to approve my merge request.
They took away our access after one too many outages :yay:
Sounds like what happened with hot upgrade privileges in some erlang shops too.
I've worked at a smaller project where it worked just fine. The key I think is to have the project be small enough to be able to fit in my head.
That and have an identical test server. I used to make changes locally, test it locally, then make the same change on test, have someone else look at it, get a lgtm, and do the same thing on the production machine. It sounds like a lot of steps but it is pretty straightforward.
Sadly, it probably doesn't work with bigger teams or more complicated projects.
A lot of things work on a small project where everybody knows what they are doing and can keep all the dependencies and data structures in their heads: logging into the production server and editing the PHP file right where Apache is looking at it, and fixing stuff with SQL commands in the production database to address customer complaints.
The big question is what everyone else should be doing, the approach that survives contact with unevenly distributed amounts of technical expertise and fucks given about the result.
Forcing all clients to reload their code at the same time sounds like a bad idea. Allowing different clients to run different incompatible versions of the code at the same time also sounds like a bad idea.
APIs are like database engines; they should rarely change. Making it easy to change them is an anti-pattern.
Engineers don't build bridges with replaceable pillars or skyscrapers with replaceable foundations. When aerospace engineers tried building a plane with replaceable engines, we got Boeing 737 Max...
Engine replacement happens on airplanes fairly frequently. You don't want to scrap an airplane because of a single damaged turbine blade, or even keep it on the ground for longer.
https://jalopnik.com/how-airlines-decide-to-replace-jet-engi....
Yes but the new parts meet the specs of the original design. The design itself isn't flexible. You can't make the engines significantly bigger without significantly revising the blueprint as a whole. That was the Boeing Max lesson. Just changing the software was not enough.
737 MAX had nothing to do with replaceable engines, but with trying to run an ancient airframe with new engines but without necessary upgrades to support the new engines because of costs.
Replaceable at the design level. OMG. Why do I have to explain everything? Clearly I'm talking about blueprints here. Code is a blueprint since you can launch multiple processes/instances running the same code.
And then you went and got even less on track, because offering multiple engines and re-engining aircraft is the norm, sometimes to very different engines (like Russian engines offered as upgrades to old Mirages).
Clearly there is a limit as to how different the engines can be in terms of size, weight, thrust, etc. Still, if they want to add different engines with different characteristics, they need to rerun all the calculations and tests to make sure it works with the frame, wings and everything else. No serious engineer outside of the software realm aims to design silver-bullet solutions. They always aim for a very specific solution.
If they need to adapt the solution later, they know it will involve a lot of re-working and require re-running all the calculations. This is fine. Nobody needs silver bullet solutions.
With software, if you design your API poorly and can't fix it in a backwards-compatible way, you can just release a new API version and migrate over to the new endpoints over time.