## Actuarial Thinking

* By Michael Snoyman
* Written August 2025
* Semi-sequel to [Economic Argument for Functional Programming](https://www.snoyman.com/reveal/economic-argument-functional-programming/)
* Avoiding complex math, focusing on intuition

<img src="/img/actuarial-thinking/endisnigh.png" height="300">

---

## What is an actuary?

* Risk assessment
* Heavy on probability/statistics, also some econ
* Insurance
  * Calculate the price of insurance (how likely are you to crash your car/die?)
  * Calculate reserves for insurance (how much money do we need to hold onto for potential losses?)

---

## What is actuarial thinking?

* There are no facts!
  * We have estimates of likelihoods
  * Estimates are known to be wrong
* We make the best decisions we can with known-incomplete data
* Also somewhat morbid; you'll all understand my sense of humor a bit better soon

---

## People misunderstand risks

* Risks are _possibilities_
  * "Risk of dying some day" is nonsensical: we'll all die!
* Risks may not happen
  * I have a fire extinguisher _just in case_ there's a fire
  * Even if there's never a fire, the extinguisher wasn't wasted money
* In aggregate, we can treat some risks essentially as guarantees
  * If I run 100,000 servers, one of them will have a hardware failure this year
* Unlikely things _will_ happen
  * Coin flip: how likely are 10 heads in a row?
  * How likely is a streak of 10 heads in 1 million tosses?
* I wasted my money on homeowner's insurance because nothing bad happened to my house
  * Do you feel the same way about life insurance?

---

## Coin flip

* I flip a coin 100 times
* How many times should I expect to get heads?
* Thought experiment: what do you think of these scenarios?
  * I got heads 50 times
  * I got heads 51 times
  * I got heads 60 times
  * I got heads 95 times
  * I got heads 100 times

---

## Probability distribution

* Coin flips form a _binomial distribution_
* With large numbers, it approaches the _normal distribution_
* We can ask questions like "how likely is it to get X heads?" (see the sketch below)
* Getting 100 heads == `(0.5)^100` == almost impossible
* So what do we do in that scenario?
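
If you want to check these claims yourself, here's a minimal sketch using only Python's standard library. The exact probabilities are my own computed additions, not figures from the slides:

```python
from math import comb

def prob_heads(k: int, n: int = 100, p: float = 0.5) -> float:
    """Probability of exactly k heads in n flips of a fair coin (binomial)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly 50 heads is the single most likely outcome, yet only ~8%
print(f"P(exactly 50)  = {prob_heads(50):.4f}")                             # ~0.0796
# 60 or more heads is unusual, but hardly impossible
print(f"P(60 or more)  = {sum(prob_heads(k) for k in range(60, 101)):.4f}") # ~0.0284
# 100 heads is (0.5)^100: with a fair coin, effectively impossible
print(f"P(exactly 100) = {prob_heads(100):.2e}")                            # ~7.9e-31

# By contrast, a streak of 10 heads *somewhere* in 1,000,000 tosses is a
# near-certainty: roughly a million chances at a 1-in-1024 event.
```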

---

## Updating priors

* Did I tell you that the coin was a fair coin? Nope!
* Baseline assumption: the coin is fair
* If you get 100 heads, you need to update your priors
* Priors == assumptions about the data
* Our expectations for the future can generally be wrong because either:
  * There was a random unlikely event
  * Our priors were wrong
* Oftentimes, we can't distinguish these two. For example...

---

## Engineering estimates

* I estimate that a task will take 5 days to complete
* It ends up taking 8 days
* Non-actuary thinking: your estimate was wrong!
* Actuarial thinking:
  * How much confidence did I have in that 5-day estimate?
  * Better way to express it: there's a 70% chance that I complete the task in 5 days or less
  * How likely was an 8-day completion, assuming the original estimate was correct?
  * Do I need to revise my estimation process for more accuracy in the future?

---

## Why analyze risk?

* Goal: maximize utility across all possible outcomes
* Yes, this is economics again
* Simple, no-risk case:
  * Buy a hamburger worth 5 HAPPY points for $4
  * Buy a steak worth 6 HAPPY points for $3
  * Obvious: buy the steak
  * It gets more complex if the steak is more expensive; then we need to compare the value of money in HAPPY points
  * All outcomes are guaranteed
* However...

---

## Gambling

You have $100,000 (that's your entire net worth). Let's consider these games.

| Game | Win on heads | Lose on tails |
| --- | --- | --- |
| 1 | $10 | $10 |
| 2 | $100 | $50 |
| 3 | $100,000 | $50,000 |
| 4 | $1,000,000 | $100,000 |

* Whether you play each game depends on _risk aversion_
* Are games 2 and 3 the same?
* Would you rather play game 1 or 4? (See the sketch below.)

```
ExpectedUtility = Sum(Probability(Outcome) * Utility(Outcome))
```

Losing all my money is more than twice as bad as losing half my money! (If I'm risk averse.)
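
To make the risk-aversion point concrete, here's a sketch using log utility, a standard textbook model of risk aversion. The utility function is my modeling choice, not something the games dictate; a different function would give different verdicts:

```python
from math import log

def utility(wealth: float) -> float:
    """Log utility, a common model of risk aversion: gains matter less
    than equal-sized losses, and going broke is infinitely bad."""
    return log(wealth) if wealth > 0 else float("-inf")

NET_WORTH = 100_000

# (win on heads, lose on tails) for each game in the table above
games = {1: (10, 10), 2: (100, 50), 3: (100_000, 50_000), 4: (1_000_000, 100_000)}

for game, (win, lose) in games.items():
    # ExpectedUtility = Sum(Probability(Outcome) * Utility(Outcome))
    eu = 0.5 * utility(NET_WORTH + win) + 0.5 * utility(NET_WORTH - lose)
    print(f"Game {game}: expected utility change {eu - utility(NET_WORTH):+.2e}")

# Game 1: ~zero (a fair bet is slightly negative for a risk-averse player)
# Game 2: positive -> worth playing
# Game 3: exactly break-even under log utility (double or halve your
#         wealth), so no, games 2 and 3 are *not* the same
# Game 4: -inf -> a 50% chance of ruin is never worth it, despite the
#         enormous expected dollar value
```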

---

## DevOps impact example

(Making up numbers.)

* A single AWS AZ has a 1% chance of failing during the course of a year
* Failure of AZs is assumed to be _independent events_
* What's the likelihood of 2 AZs failing?
* What's the likelihood of 3 AZs failing?
* Bad way of expressing DevOps best practices: we deploy to three AZs to ensure we never have downtime
  * Do we really ensure that?
  * Why stop at 3 AZs?
* Good way of expressing this: the probability of three AZs simultaneously failing is so small that we consider this an _acceptable risk_
* SLAs are part of what helps us define acceptable risks

---

## Checkpoint summary

* Proper risk analysis helps us _understand possible outcomes_
* We need to get priors (assumptions) and make them as accurate as possible
* Priors can be refined by new, incoming data
* We aren't "wrong" if reality does not meet expectations; we either had bad priors or an unlikely event happened
* Managing risks doesn't mean avoiding them entirely; it means calculating the impact of different outcomes and making informed decisions

---

## Actuarial thinking in tech

Let's apply these concepts more directly to our own work.

---

## Risk assessment can be expensive

* How likely is it that I lose the coin-flip game after 10 rounds?
  * Define a betting strategy
  * Really easy to calculate the odds
  * Worth doing!
* How likely is it that a regulatory change from the US will result in catastrophic impacts to my business?
  * Much harder to estimate!
  * Requires deep research
* There are different levels of analysis we can perform

---

## Risks on projects

Projects are chock-full of different kinds of risk!

* **Technical Risk**: Bugs or system failures (e.g., a major bug slipping through testing).
* **Personnel Risk**: Team member illness or departure disrupting progress.
* **Hardware Risk**: Server or infrastructure failure causing downtime.
* **Competitive Risk**: A new competitor disrupting your blockchain project's market.
* **Customer Risk**: Changing requirements delaying delivery.
* **Regulatory Risk**: New blockchain regulations impacting project viability.
* **Process Risk**: Workflow bottlenecks or unclear requirements.
* **Financial Risk**: Budget overruns or unexpected cost increases.
* **Communication Risk**: Misaligned teams or stakeholders causing delays.

How do we deal with these?

---

## Identify most likely risks

* Use your own prior experience
* Brainstorm with experienced team members
* Do quick web searches
  * This is a place where AI is _great_
* 25% chance that someone on the team will be sick over the next month: consider that risk
* 0.00000003% chance that AWS will stop supporting Linux: not so important
* Can ignore unimportant things (e.g. 45% chance I'll wake up drowsy one day this week and be 4% less productive)

---

## Identify highest impact risks

* Even unlikely risks, if impactful, should be considered
* 0.5% chance that our servers will have a hardware failure in the next year
  * Not very likely...
  * But can the project tolerate a 0.5% chance of an outage?
* Team member quitting/dying/major brain injury: could be catastrophic
  * Mitigations: documentation, knowledge transfer, code reviews, etc.
* We'll still sometimes ignore impactful risks if they are too unlikely (e.g. we have no project plans for dealing with an alien invasion)

---

## Accept that we have "unknown unknowns"

* Priors: we already accepted that our risk assessments may have the wrong numbers
* We also sometimes will have unknown unknowns: risks we did not consider
* We need to do "due diligence" to find out what these are
* We'll never get all of them
* How much time we spend depends on the importance of the project (upcoming slide)

---

## Accept that mistakes are likely

* We missed something
* Priors need updating
* Could have spent more time analyzing risks early
* Need to compare the cost of analyzing risks vs the likelihood of major negative impact from unknown unknowns

---

## Risk tolerance is project-dependent

* Proof-of-concept demo of a game? Run wild!
* Send a man to the moon? Be careful!
* Common mistake: being too risk averse when it's not needed
* Remember, being risk averse almost always increases costs!
  * Hardware: running in multiple AZs
  * Engineers: writing tests
  * Project: analyzing risks
* Helpful question: what's the worst that could happen?

---

## Wasted work

* Need to solve a problem; we come up with two ideas
* We rough-estimate approach A at 2 days, approach B at 3 days
* We're not feeling great about the estimates
* It would take 4 engineer-hours to make better estimates
* Discuss: what should we do?

---

## Defaults

* Company standard: we like to use PostgreSQL, Rust, Docker, etc.
* Non-risk reason: we build up expertise inside the company, people can swap between projects more easily, etc.
* Risk-related reason: we understand the risks of these tools better, fewer "unknown unknowns"
* Easier to make confident decisions since we know the priors are more accurate

---

## Eliminating risks

* Some risks can be completely eliminated!
* For example: if I use (safe) Rust, I have 0% risk of passing a `String` to a function that needs an `i32`
* Choosing tools that eliminate entire risk categories is great!
* Keep in mind that choosing those tools/processes/etc. may introduce their own costs
* Then we need to compare expected costs vs expected benefits of risk elimination

---

## Flexibility and feedback loops

* Bad way to operate: constantly worrying that we made the wrong decision
* Good way to operate: be responsive to new data
  * "Hey look, we've gone over our estimates for the past 5 sprints; maybe we need to reassess our estimation process?"
  * "We anticipated a 5% chance per day of overwhelming our database server, but it's only happened 0.5% of the time; are we over-provisioned?"
* Remember: it's not about "I made a bad decision"
* It's about: is this random bad luck, or are my priors incorrect?

---

## Conclusions

* We live in a quantum state of all possible outcomes
* Just because our expectations don't happen doesn't mean we were "wrong"
* Life (and engineering) is about making the best decisions given our current data
* Assess the risks, determine how much effort to put into mitigation
* Stay flexible, be willing to change course when new data comes in
* Accept the fact that _mistakes will happen_
* Don't spend more on mitigating risks than the expected cost of those risks (see the sketch below)
  * I don't want to get in a car crash, but I accept that it's a possibility every time I drive
* Insure yourself (through insurance, planning, etc.) against impactful, likely risks
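
One final sketch: the mitigation-vs-expected-cost comparison from the bullet above, with entirely made-up numbers. This is a risk-neutral view; a risk-averse team might still pay extra to mitigate:

```python
def expected_cost(probability: float, impact: float) -> float:
    """Expected cost of a risk = Probability(Event) * Cost(Event)."""
    return probability * impact

# Made-up numbers: a 0.5%-per-year hardware failure causing an outage,
# vs. paying for a multi-AZ deployment that would prevent it.
p_failure = 0.005        # probability of the outage this year
outage_cost = 200_000    # lost revenue + recovery, in dollars
mitigation_cost = 3_000  # extra yearly infrastructure spend, in dollars

risk_cost = expected_cost(p_failure, outage_cost)  # $1,000
print(f"Expected yearly cost of the risk: ${risk_cost:,.0f}")
print(f"Yearly cost of mitigating it:     ${mitigation_cost:,.0f}")
print("Mitigate" if mitigation_cost < risk_cost else "Accept the risk")
```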