We moved from AWS to Hetzner, saved 90%, kept ISO 27001 with Ansible
Earlier this year I led our migration off AWS to European cloud providers (Hetzner + OVHcloud), driven by cost (we cut 90%) and data sovereignty (GDPR + CLOUD Act concerns).
We rebuilt key AWS features ourselves using Terraform for VPS provisioning, and Ansible for everything from hardening (auditd, ufw, SSH policies) to rolling deployments (with Cloudflare integration). Our Prometheus + Alertmanager + Blackbox setup monitors infra, apps, and SSL expiry, with ISO 27001-aligned alerts. Loki + Grafana Agent ship logs to S3-compatible object storage.
The stack includes:
• Ansible roles for PostgreSQL (with automated s3cmd backups + Prometheus metrics)
• Hardening tasks (auditd rules, ufw, SSH lockdown, chrony for clock sync)
• Rolling web app deploys with rollback + Cloudflare draining
• Full monitoring with Prometheus, Alertmanager, Grafana Agent, Loki, and exporters
• TLS automation via Certbot in Docker + Ansible
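For a flavour of the SSL-expiry alerting, here is a minimal sketch of a Prometheus rule built on the Blackbox exporter's probe_ssl_earliest_cert_expiry metric; thresholds and labels are illustrative, not our exact production rules.

```yaml
# prometheus/rules/tls.yml -- illustrative sketch; thresholds and labels are assumptions
groups:
  - name: tls-expiry
    rules:
      - alert: TLSCertificateExpiringSoon
        # Blackbox exporter exposes the certificate's expiry as a Unix timestamp
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"
      - alert: TLSCertificateExpiryCritical
        expr: probe_ssl_earliest_cert_expiry - time() < 3 * 24 * 3600
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "TLS cert for {{ $labels.instance }} expires in under 3 days"
```

Routing the warning and critical severities to different Alertmanager receivers is then just configuration.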
I wrote up the architecture, challenges, and lessons learned: https://medium.com/@accounts_73078/goodbye-aws-how-we-kept-i...
I’m happy to share insights, diagrams, or snippets if people are interested — or answer questions on pitfalls, compliance, or cost modeling.
> We rebuilt key AWS features ourselves
At what cost? People usually exclude the cost of DIY-style hosting, which is usually the most expensive part. Providing 24x7 support for the stuff that you've home-grown alone is probably going to make a large dent in any savings you got by not outsourcing that to Amazon.
> $24,000 annual bill felt disproportionate
That's around 1-2 months of time for a decent devops freelancer. If you underpay your devs, about 1/3rd of an FTE per year. And you are not going to get 24x7 support with such a budget.
This still could make sense. But you aren't telling the full story here. And I bet it's a lot less glamorous when you factor in development time for this.
Don't get me wrong; I'm actually considering making a similar move but more for business reasons (some of our German customers really don't like US hosting companies) than for cost savings. But this will raise cost and hassle for us and I'll probably need some reinforcements on my team. As the CTO, my time is a very scarce commodity, so the absolute worst use of my time would be doing this myself. My focus should be making our company and product better. Your tech stack is fine. Been there, done that. IMHO Terraform is overkill for small setups like this; it fits solidly in the YAGNI category. But I like Ansible.
> Providing 24x7 support for the stuff that you've home-grown alone is probably going to make a large dent in any savings you got by not outsourcing that to Amazon.
I don’t understand why people keep propagating this myth which is mostly pushed by the marketing department of Azure, AWS and GCP.
The truth is that cloud providers don't actually provide 24/7 support for your app. They only ensure that their infrastructure is mostly running, for a very loose definition of 24/7.
You still need an expert on board to ensure you are using them correctly and are not going to be billed a ton of money. You still need people to ensure that your integration with them doesn’t break on you and that’s the part which contains your logic and is more likely to break anyway.
The idea that your cloud bill is your TCO is a complete fabrication and that’s despite said bill often being extremely costly for what it is.
I think both things are true - people overestimate the level of support provided by AWS, but also re-building the laundry list of stuff OP did in-house to save $24k/year seems onerous.
But the idea that AWS provides some sort of white glove 24/7 support is laughable for anyone that's ever run into issues with one of their products...
The only incident where we needed AWS support, we had an engineer on a call with us for several hours (including a shift change on their side). Might've been a one-off, but their support has seemed pretty phenomenal (I also talked with someone who worked there and they thought it was good).
If you pay for enterprise support they will absolutely stay on a call with you during a production outage. Best support I've seen from any of the vendors I've used.
It's mostly stuff they would have done on AWS anyway, and probably with crappy tools/reproducibility.
I love the fact that AWS was willing to make kernel updates to support our use cases within a week of flagging it
Why would cloud providers support anything more than their infrastructure?
The core question is what is the value they bring compared to what they cost.
You will definitely get support reasonably fast if something breaks because of them but that’s not where breakage happens most of the time. The issue will nearly always be with how you use the services. To fix that, you need someone who understands both the tech you use and how it’s offered by your cloud provider. At which point, you have an expert on board anyway so what’s the point of the huge bill?
A hoster will cost you less for most of the benefits. They already offer most of the building blocks required for easy scalability.
The hefty bill is for things like RDS, IAM, Systems Manager, and all the other tools they have. Rebuilding and supporting these is a non-trivial exercise.
It is simpler than it seems. How did people manage a Postgres instance prior to RDS? Of the entire feature list, which parts of RDS do you actually use?
1. Dumping a backup every so often?
2. Exporting its performance via Prometheus, and displaying in a dashboard?
3. Machine disk usage via Prometheus?
4. An Ansible playbook for recovery? Maybe kicking that into effect with an alert triggered from bullets 2 and 3.
5. Restoring the database that you backed up into your staging env, so you get a recurring, frequent check of its integrity.
This would be around 100 to 500 lines of code, much of which an LLM can write for you.
What am I missing?
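To give a sense of scale, bullet 1 might look something like this as an Ansible task file (a rough sketch; it assumes pg_dump and s3cmd are on the host, an s3_backup_bucket variable is defined, and the database name and paths are made up):

```yaml
# roles/postgres_backup/tasks/main.yml -- illustrative sketch, not a production role
- name: Dump the database to a timestamped file
  ansible.builtin.shell: >
    pg_dump --format=custom --file=/var/backups/pg/app-{{ ansible_date_time.date }}.dump app_db
  become: true
  become_user: postgres

- name: Upload the dump to S3-compatible object storage
  ansible.builtin.command: >
    s3cmd put /var/backups/pg/app-{{ ansible_date_time.date }}.dump
    s3://{{ s3_backup_bucket }}/postgres/
  become: true

- name: Remove local dumps older than 7 days
  ansible.builtin.command: find /var/backups/pg -name '*.dump' -mtime +7 -delete
  become: true
```

In practice you would trigger this from a cron job or systemd timer and export a last-success timestamp for Prometheus to alert on, which covers bullets 2-4.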
There is a lot more:
- Aurora to handle our spiky workload (can grow 100x from normal levels at times)
- Zero-ETL into RedShift
- Slow query monitoring, not just metrics but actual query source
- Snapshots to move production data into staging to test queries
Besides this we also use:
- ECS to autoscale the app layer
- S3 + Athena to store and query logs
- Systems Manager to avoid managing SSH keys
- IAM and SSO to control access to the cloud
- IoT to control our fleet of devices
I've never seen how people operate complex infrastructures outside of a cloud. I imagine that using VPSs I would have a dedicated DevOps engineer acting as a gatekeeper to the infrastructure, or I'd get a poorly integrated and insecure mess. With the cloud I have teams rapidly iterating on the infrastructure without waiting on any approvals or reviews. A real-life scenario:
1. Let's use DMS + PG with partitioned tables + Athena
2. A few months later: let's just use Aurora read replicas
3. A few months later: let's use DMS + RedShift
4. A few months later: Zero-ETL + RedShift
I imagine a DevOps engineer would be quite annoyed by such back and forth. Plus, he is busy keeping all the software up to date.
> I’ve never seen how people operate complex infrastructures outside of a cloud
That’s your issue. If all you have is a hammer, everything looks like a nail.
I have the same issue with the juniors we hire nowadays. They have been so brainwashed with the idea that the cloud is the solution and that they can't manage without it that they have no idea what to do other than reach for it.
> I imagine that using VPSs I would have a dedicated DevOps engineer acting as a gatekeeper to the infrastructure, or I'd get a poorly integrated and insecure mess.
And what you describe after this is just having a real mess anyway.
> I imagine a DevOps engineer would be quite annoyed by such back and forth.
I would be quite annoyed by such back and forth even on the cloud. I don’t even want to think about the costs of changing so often.
>That’s your issue. If all you have is a hammer, everything looks like a nail.
While I admit a lack of experience at scale, I have had my share of Linux admin experience and understand how it could be done. My point is that building a comparable environment without the cloud would be much more than just 500 LoC. If you have relevant experience, please share.
>I would be quite annoyed by such back and forth even on the cloud. I don’t even want to think about the costs of changing so often.
In the cloud it took 1-2 weeks per iteration, with several months in between during which we were using the solution. One person did it all; nobody on the team even noticed. Being able to iterate like this is valuable.
I wanted to comment on this but mistakenly put the answer here. Sorry.
https://news.ycombinator.com/item?id=44335920#44346481
>What you see as “rapid iteration” looks a lot like redoing the same work every few months because of shifting cloud-native limitations.
This is not the case. The reason for iteration is the search for solution in the space we don’t know well enough. In this particular case cloud made iteration cheap enough to be practical.
I asked you to think about what it would take to build a well-integrated suite of tools (PG + backups + snapshots + Prometheus + logs + autoscaling for DB and API + SSH key management + SSO into everything). It is a good exercise; if you have ever built and maintained such a suite with uptime and ease of use comparable to AWS, I genuinely would like to hear about it.
In the case of AWS: customer obsession.
It's their first "leadership principle" (their sort of corporate religion, straight from the lips of Jeff himself).
You’re describing exactly the kind of vendor lock-in treadmill I was trying to avoid. What you see as “rapid iteration” looks a lot like redoing the same work every few months because of shifting cloud-native limitations.
Also, the idea that using VPS or non-hyperscaler clouds means “poorly integrated and insecure mess” feels like AWS marketing talking. Good ops doesn’t mean gatekeepers — it means understanding your system so you don’t need to swap out components every quarter because the last choice didn’t scale as promised.
I’d rather spend time building something stable that aligns with my compliance and revenue goals, than chasing the latest AWS feature set. And by the way, someone still has to keep all that AWS software up to date — you’ve just outsourced it and locked yourself into their way of doing it.
AWS features may be expensive to replicate 100% but what if one only needs 80%. One also needs to consider the effort involved in configuring AWS and maintaining the skills for that. Then there are opportunity costs of using e.g. AWS dashboards vs. better ones with grafana etc..
I guess a lot depends on size, diversity and dynamics of the demand. Not every nail benefits from contact with the biggest hammer in the toolbox.
> AWS features may be expensive to replicate 100% but what if one only needs 80%.
You are correct, but I think you're missing the point: my 80% and your 80% don't overlap completely.
It makes sense if you consider there is a risk you might get kicked out by AWS because the US government forces Amazon to close your account. The US is also hinting at going to war against Europe (Greenland), which makes it a bad idea to have any connection to the US.
... and the US just made the EU very unhappy by killing the ICC's Microsoft subscription. Which by the way was hosted on Azure in Europe (meaning local or "sovereign cloud" or whatever they call it provides exactly zero protection against US sanctions).
So no more Microsoft software then?
The EU isn't willing to pay for that. They'll just throw the ICC under the bus, just like they'll throw any EU company that the US sanctions under the bus. That costs less. The EU has a nice name for throwing people under the bus like this: it's called "the peace dividend".
> $24,000 annual bill felt disproportionate
>> That's around 1-2 months of time for a decent devops freelancer. If you underpay your devs, about 1/3rd of an FTE per year. And you are not going to get 24x7 support with such a budget.
In terms of absolute savings, we’re talking about 90% of 24k, that’s about 21.6k saved per year. A good amount, but you cannot hire an SRE/DevOps Engineer for that price; even in Europe, such engineers are paid north of 70k per year.
I personally think the TCO (total cost of ownership) will be higher in the long run, because now every little bit of the software stack has to be managed by their infra team/person, and things are getting more and more complex over time, with updates and breaking changes to come. But I wish them well.
In mid-sized companies, creating/using/maintaining AWS resources nevertheless requires one or more teams of devops/SRE.
In my experience, in the long run, this "managed AWS saved us because we didn't need people" always feels like the typical argument made by SaaS sales people. In reality, many services/SaaS are really expensive, and you probably only need a few features, which you can sometimes roll out yourself.
The initial investment might be higher, but in the long run I think it's worth it. It's a lot like Heroku vs AWS: super expensive, but it allows you to push a POC to production with little knowledge. In this case, it's AWS vs self-hosted or whatever.
Finally, can we quantify the cost of data/information? This company seems to be really "using" this strategy (= everything homemade, you're safe with us) for sales purposes. And it might work, although for the final consumer this might come at a higher price, which ultimately pays for the additional devops needed to maintain the system. So who cares?
How important is it for companies not to be subject to the CLOUD Act or funny stuff like that?
I think many European countries have SRE salaries lower than 70k. How good they are is hard to judge. Our DevOps person likely earns less, but she is just decent, not Google-level.
70k? Just hire in Poland/Czechia/Slovakia for 50% off!
Unless by Europe you mean the Apple feature availability special of UK/Germany/France/Spain/Italy
Spain and Italy are closer to the Poland bracket than the UK/Germany one, possibly even lower for some roles.
My colleagues were talking about salaries in the range of $40-60k... About 8-10 years ago. And I don't think it got any cheaper
Still, it’s highly location-dependent, and mileage varies drastically between countries.
I’m an SWE with a background in maths and CS in Croatia, and my annual comp is less than what you claim here. Not drastically, but comparing my comp to the rest of the EU it’s disappointing, although I am very well paid compared to my fellow citizens. My SRE/devops friends are in a similar situation.
I am always surprised to see such a lack of understanding of economic differences between countries. Looking through Indeed, a McDonald’s manager in the US makes noticeably more than anyone in software in southeast Europe.
Only if you want to hire students. Experienced senior engineers have pretty much the same 70k+ price tag in Poland.
You won’t find anyone competent for that kind of money there.
Isn’t $24k also a naive accounting of the annual cost of AWS in this case? What FTE-equivalent was required to set up the services they use at AWS? What FTE-equivalent is required to keep the annual bill down to $24k from say $48k or $100k?
Before migration (AWS): we had about 0.1 FTE on infra — most of the time went into deployment pipelines and occasional fine-tuning (the usual AWS dance). After migration (Hetzner + OVHcloud + DIY stack): after stabilizing it is still 0.1 FTE (though I was 0.5 FTE for 3-4 months), and it now rests with one person. We didn't hire a dedicated ops person.
I am curious why you think AWS services are more hands-off than a series of VPSs configured with Ansible and Terraform? Especially if you are under ISO 27001 and need to document upgrades anyway.
I was emphasizing that if the new Hetzner expenses are counted naively, then it was also naive to consider that AWS only costs $24k per year.
My point was that AWS is not hands-off. You still have to set it up, you have to keep a close eye on expenses, and Amazon holds your hand less than many people seem to expect.
Your implicit assumption that AWS requires less (expensive) labour is just not true.
I have helped hundreds of people migrate to AWS and never had a single person spend more effort unless they went for an apples to apples disaster. I have only seen this when people take a high overhead tool they don’t understand (eg k8s) and move to cloud services they don’t understand.
Exactly our insight having maintained the same app both places.
> That's around 1-2 months of time for a decent
Presumably they are in Europe, so labour is a few times cheaper here.
> Providing 24x7 support
They are not maintaining the hardware itself, and it's not like Amazon is providing devops for free. Unless you are using mainly serverless stuff, the difference might not be that significant.
Amazon’s effort in making sure things _actually are up_ is fundamentally different than budget clouds.
The systems you design when you have reliable queues, durable storage, etc. are fundamentally different. When you go this path you’re choosing to solve problems that are “solved” for 99.99% of business problems and own those solutions.
Still, things fail. A-tier clouds also fail, and you may still have to design for it. Rule of thumb, if you are capable of rolling out your own version, you'll be far more competent planning for & handling downtime, and will often have full ownership of the solution.
Also, any company with strict uptime requirements will have proper risk analysis in place, outlining the costs of the chosen strategy in case of downtime; these decisions require proper TCO evaluation and risk analysis, they aren't made in a vacuum.
This is a strangely limited view. Cloud providers have done the work of building fault-tolerant distributed systems for many of the _primitives_ with large blast radius on failure.
For example, you'd be hard pressed to find a team building AWS services who is not using SQS and S3 extensively.
Everyone is capable of rolling their own version of SQS. Spin up an API, write a message to an in memory queue, read the message. The hard part is making this system immediately interpretable and getting "put a message in, get a message out" while making the complexities opaque to the consumer.
There's nothing about rolling your own version that will make you better able to plan this out -- many of these lessons are things you only pick up at scale. If you want your time to be spent learning these, that's great. I want my time to be spent building features my customers want and robust systems.
I see where you’re coming from — no doubt, services like SQS and S3 make it easier to build reliable, distributed systems without reinventing the wheel. But for me, the decision to shift to European cloud providers wasn’t about wanting to build my own primitives or take on unnecessary complexity. It was about mitigating regulatory risk and protecting revenue.
When you rely heavily on U.S. hyperscalers in Europe, you’re exposed to potential disruptions — what if data transfer agreements break down or new rulings force major changes? The value of cloud spend, in my view, isn’t just in engineering convenience, but in how it helps sustain the business and unlock growth. That’s why I prioritized compliance and risk reduction — even if that means stepping a little outside the comfort of the big providers’ managed services.
Fred Brooks, the author of The Mythical Man-Month said:
> “Software is ten times easier to write than it was ten years ago, and ten times as hard to write as it will be ten years from now.”
Ansible, Hetzner, Prometheus and object storage will give you RDS if you prompt an LLM, or at least give you the parts of RDS that you need for your use case for a fraction of the cost.
Hetzner is also working on their own managed RDS-style offering. Their own S3 offering is also relatively new. Back then, they also had job openings for DB experts.
https://www.hetzner.com/de/storage/object-storage/
> Don't get me wrong; I'm actually considering making a similar move but more for business reasons (some of our German customers really don't like US hosting companies) than for cost savings
There will be a new AWS European Sovereign Cloud[1] with the goal of being completely US independent and 100% compliant with EU law and regulations.
[1]: https://www.aboutamazon.eu/news/aws/aws-plans-to-invest-7-8-...
> There will be a new AWS European Sovereign Cloud[1] with the goal of being completely US independent
The idea that anything branded AWS can possibly be US independent when push comes to shove is of course pure fantasy.
It's not really the brand that's the problem.
If Amazon partnered with an actually independent European company, provided the software and training, and the independent company set it up and ran it; in case of dispute, Amazon could pull the branding and future software updates, but they wouldn't be able to access customer data without consent and assistance of the other company and the other company would be unlikely to provide that for requests that were contrary to European law. It would still be branded AWS for Europe, and nobody would doubt its independence.
This way, where it's all subsidiaries of Amazon can't be trusted though.
I can guarantee that if you read your comment with Huawei substituted for Amazon, you'd object to that arrangement being okay. Same thing.
But it will check boxes on compliance checklist.
Not on the “political concerns” checklist which is getting more and more important
I don't know, with that argument you can argue that everything is dependent on everything, for instance, the EU automobile industry is hugely dependent on materials and chips from all over the world including US and thus real independence is a pipe dream.
This is one of the reasons we were wondering whether the US can switch off our fighter jets. The ones we own, bought from the US.
The US clearly state that extraterritoriality is fine with them. Depending on the company, one gag order is enough to sabotage a whole company.
> I don't know, with that argument you can argue that everything is dependent on everything
It is. And China has been the only one intelligent enough to have understood this a very long time ago. They also show that while entire independence at their scale may be a pipe dream, getting close to it is feasible.
Of course you know the US government can use many methods to enforce its demands. It makes no sense to use an Amazon-run alternative to Amazon; it's nonsense to join a conversation about migrating away from Amazon and suggest that.
There are still US AWS people the US gov can apply pressure to. Sovereignty requires nothing on US soil, people, infrastructure, entities, etc. What Microsoft and AWS are doing is performance art around “EU sovereignty.”
Our customers across EU (hospitals) are not impressed or interested (n=175). Such a delusional project.
The ICC move by MS made hospitals shift into an even higher gear to prepare off-ramp plans: from private Azure cloud to "let's get out".
Even if you are cloud native, it makes sense to have scaffolding to allow for vendor mitigation, unless you want to tie your entire company's future to the whims of a single company.
Monitoring and persistence layers are cross-cutting and are already an abstraction with an impedance mismatch.
You don't need a full-blown SOA setup, just minimal scaffolding to build on later.
Even if you stick to AWS for the remainder of time, that scaffolding will help when you grow, AWS services change, or you need a multi cloud strategy.
As a CTO, you need to also de-risk in the medium and longer term, and keeping options open is a part of that.
Building tightly coupled systems with lots of leakage is stepping over dollars to pick up pennies unless selling and exiting is your plan for the organization.
The author doesn't mention what they had to write, but typically it is cloud provider implementation details leaking into your code.
Just organizing ansible files in a different way can often help with this.
If I was a CTO who thought this option was completely impossible for my org, I would start on a strategic initiative to address it ASAP.
Once again, you don't need to be able to jump tomorrow, but the belief that a vendor has you locked in would be a serious issue to me.
90% sounds good but the real dollar amount feels low.
Two reasons for this stick out:
- Are the multi-million dollar SV seed rounds distorting what real business costs are? Counting dev salaries etc. (if there is at least one employee) it doesn't seem worth the effort to save $20k - i.e., 1/5 of a dev salary? But for a bootstrapped business $20k could definitely be existential.
- The important number would be the savings as percent of net revenue. Is the business suddenly 50% more profitable? Then it's definitely worth it. But in terms of thinking about positively growing ARR doing cost/benefit on dropping AWS vs. building a new (profitable) feature I could see why it might not make sense.
Edit to add: it's easy to offhand say "oh yeah easy, just get to $2M ARR instead of saving $20k- not a big deal" but of course in the real world it's not so simple and $20k is $20k. The prevalent SV mindset of just spending without thinking too much about profitability is totally delusional except for like 1 out of 10000 startups.
From the blog post: "We are a Danish workforce management company doing employee scheduling." Definitely not a VC-funded SV startup. Probably bootstrapped.
Yes, bootstrapped with our own money. It makes a difference.
If I generalize, I see two kinds of groups for whom this reduction of cost does not matter. The first group are VC-funded, and the second group are in charge of an AWS bill north of a million. We do not have anything in common with these companies, but we have something in common with 80% of readers on this forum and 80% of AWS clients.
It was cool reading your article.
We're also bootstrapped and use Hetzner, not AWS (except for the occasional test), for very much the same reasons as you.
And we are also fully infrastructure as code using Ansible.
We used to be a pure software vendor, but are bringing out a devtool where the free tier runs on Hetzner. But with traction, as we build out higher tier services, it's an open question on what infrastructure to host it on.
There are a kazillion things to consider, not the least of which is where the user wants us to be.
My last contact with AWS support (100€/month tier) was someone feeding me LLM generated slop that contained hallucinations about nonexistent features and configuration options.
This is what I'm wondering too. 90% is a lovely number to throw around but what is the opportunity cost?
> Cost of DIY and support: You’re absolutely right that 24x7 ops could eat up any savings if you built everything from scratch without automation or if you needed dedicated staff watching dashboards all night. In our case:
• We heavily invested upfront in infrastructure-as-code (Terraform + Ansible) so that infra is deterministic, repeatable, and self-healing where possible (e.g. auto-provisioning, automated backup/restore, rolling updates; a rolling-deploy sketch follows below).
• Monitoring + alerting (Prometheus + Alertmanager) means we don’t need to watch screens — we get woken up only if there’s truly a critical issue.
• We don’t try to match AWS’s service level (e.g. RTO of minutes for every scenario) — we sized our setup to our risk profile and customers’ SLAs.
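To illustrate the rolling-update piece, here is a stripped-down sketch of the shape of such a playbook (the host group, artifact path, service name, and health endpoint are made up, and the Cloudflare draining step is only hinted at in a comment; this is not our exact playbook):

```yaml
# deploy.yml -- illustrative sketch of a rolling deploy
- hosts: webapp            # hypothetical inventory group
  serial: 1                # take one node out of rotation at a time
  become: true
  tasks:
    # A real setup would first drain the node at the load balancer / Cloudflare here.
    - name: Deploy the new application artifact
      ansible.builtin.copy:
        src: "build/app.jar"
        dest: "/opt/app/app.jar"

    - name: Restart the application service
      ansible.builtin.systemd:
        name: app
        state: restarted

    - name: Wait until the node reports healthy before moving on
      ansible.builtin.uri:
        url: "http://localhost:8080/actuator/health"   # hypothetical health endpoint
        status_code: 200
      register: health
      retries: 30
      delay: 5
      until: health.status == 200
```

serial: 1 is what makes it rolling; if the health check never comes back, the play stops before touching the next node, and rollback is essentially re-running the play with the previous artifact.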
> True cost comparison:
• The migration was done as part of my CTO role, so no external consulting costs. The time investment paid back within months because the ongoing cost to operate the infra is low (we’re not constantly firefighting).
• I agree that if you had to hire more people just to manage this, it could negate the savings. That’s why for some teams, AWS is still a better fit.
> Business vs. cost drivers: Honestly, our primary driver was sovereignty and compliance — cost savings just made the business case easier to sell internally. Like you, our European customers were increasingly skeptical of US cloud providers, so this aligned with both compliance and go-to-market.
> Terraform / YAGNI: Fair point! Terraform probably is more than we need for the current scale. I went with it partly because it fits our team’s skillset and lets us keep options open as we grow (multi-provider, DR regions, etc).
And finally, because of this, I am posting about it. I am sharing as much as I can and just spreading the word about it. I am simply sharing my experience and knowledge. If you have any questions or want to discuss further, feel free to reach out at jk@datapult.dk!
I think it's indeed the opportunity cost and the commoditization of infrastructure and operational expertise that drive startups to AWS. But over time, as you scale, they can easily become the biggest component of your marginal cost, without an easy exit, because they have locked you in.
https://news.ycombinator.com/item?id=44335920#44336757
Good enough is good enough for most folks. In most cases, downtime is cheaper than higher reliability.
I feel there is a lot of FUD spread whenever someone moves off the cloud, with the inane comparison to the annual wage of a dedicated sysadmin, trying to discourage you from doing a “reckless” migration which will bite you in the ass, your servers will catch fire every day and that it is better to stay within the golden handcuffs of AWS and GCP.
I wonder if it's both Stockholm syndrome and the learned helplessness of developers who cannot imagine having to spend a little more effort and save, like OP, 90% off their monthly bill.
Yeah sure for some use cases AWS is the market leader, but let’s not kid ourselves, 9/10 companies on AWS don’t require more than a few servers and a database.
Well said. It reminds me of a story I heard in a podcast once.
A database administrator for a drug cartel became an informant for the police.
His cartel boss called him in on a weekend due to server errors. He said in the podcast: "I knew I'd been found out, because a database running on Linux never crashes."
Makes you wonder what everyone is telling themselves about the need for RDS..
I kind of cringed reading this article because there is also the cost in downtime which doesn't seem to be considered along with the RTO timelines.
Hetzner has had issues where they just suddenly bring servers down with no notice, sometimes every server attached to an account, because they get a bogus complaint; and in some cases the servers appear to still be up but all your health checks fail, and you are scurrying around trying to find the cause with no visibility or lifeline. All this costs money, a lot of money, and it's unmanageable risk.
For all the risks and talk of compliance, what about the counterparty risk where a competitor (or whoever) sends a complaint from a nonexistent email address which gets your services taken down? Sure, after support gets involved and does their due diligence they see it's falsified and bring things back up, but this may take quite a while.
It takes their support at least 24 hours just to get back to you.
DIY hosting is riddled with so many unmanageable costs I don't see how OP can actually consider this a net plus. You basically are playing with fire in a gasoline refinery, once it starts burning who knows when the fire will go out so people can get back to work.
Totally valid concerns — I don’t disagree that DIY hosting comes with real risks that managed platforms abstract away (but AWS could close your account too).
We didn’t go into this blind though — we spent a lot of time testing scenarios (including Hetzner/OVH support delays) and designing mitigation strategies.
Some of what we do:
• Our infra is spread across multiple providers (Hetzner, OVH) + Cloudflare for traffic management. If Hetzner blackholes us, we can redirect within minutes.
• DB backups are encrypted and replicated nightly to various regions/providers (incl. one outside the primary vendors), with tested restore playbooks.
The key point: no platform is free of counterparty risk — whether that’s AWS pulling a region for legal reasons, or Hetzner taking a server offline. Our approach tries to make the blast radius smaller and the recovery faster, while also achieving compliance and cutting costs substantially (~90% as noted).
DIY is definitely not for everyone — it is more work, but for our particular constraints (cost, sovereignty, compliance) we found it a net win. Happy to share more details if helpful!
Oh, and imagine being kicked out of AWS when you were using Aurora... My certified multi-cloud setup with standard components should not make you cringe.
With respect, there's a big difference between "could close your account" and have "closed people's accounts" temporarily based on unlawful complaints.
I probably won't be responding after this or in the future on HN because I took a significant blast off my karma for keeping it real and providing valuable feedback. You have a lot of people brigading accounts that punish those that provide constructive criticism.
Generally speaking AWS is incentivized to keep your account up so long as there is no legitimate reason for them taking it down. They generally vet claims with a level of appropriate due diligence before imposing action because that means they can keep billing for that time. Spurious unlawful requests cost them money and they want that money and are at a scale where they can do this.
I'm sure you've spent a lot of time and effort on your rollout. You sound competent, but what makes me cringe is the approach you are taking that this is just a technical problem when it isn't.
If you'd done your research you would have run across more than a few incidents where people running production systems had Hetzner either shut them down outright or worse, often in response to invalid legal claims which Hetzner failed to properly vet. There have also been some strange non-deterministic issues that may be related to failing hardware, but maybe not.
Their support often amounts to one response every 24 hours. What happens when the first couple of responses are boilerplate because the tech didn't read or understand what was written? 24 hours + a % chance of skipping the next 24 hours at each step, and no phone support, which is entirely unmanageable. While I realize they do have a customer support line, for most it is an international call and the hours are bankers' hours. If you're in Europe you'll have a much easier time lining up those calls, but anywhere else and you are dealing with international calls with the first chance of the day being midnight.
Having a separate platform for both servers is sound practice, but what happens when the DAG running your logging/notification system is on the platform that fails, but not on a failover? The issues are particularly difficult when half your stack fails on one provider, stale data is replicated over to your good side, and you get nonsensical or invisible failures; and it's not enough to force an automatic failover with traffic management, which is often not granular enough.
It's been a while since I've had to work with Cloudflare traffic management, so this may have become better, but I'm reasonably skeptical. I've personally seen incidents where the RTO for support for direct outages was exceptional, but the RTO for anything above a simple HTTP 200 was nonexistent, with finger pointing, which was pointless because the raw network captures showed the failure at L2/L3 on the provider side, and the provider ignored them. They still argued, and the downtime/outage was extended as a result. Vendor management issues are the worst when contracts don't properly scope and enforce timely action.
Quite a lot of the issues I've seen with various hosting providers OVH and Hetzner included, are related to failing hardware, or transparent stopgaps they've put in place which break the upper service layers.
For example, at one point we were getting what appeared to be stale cache issues coming in traffic between one of a two backend node set on different providers. There was no cache between them, and it was breaking sequential flows in the API while still fulfilling other flows which were atomic. HTTP 200 was fine, AAA was not, and a few others. It appeared there was a squid transparent proxy placed in-line which promptly disappeared upon us reaching out to the platform, without them confirming it happened; concerning to say the least when your intended use of the app you are deploying is knowledge management software with proprietary and confidential information related to that business. Needless to say this project didn't move forward on any cloud platform after that (and it was populated with test data so nothing lost). It is why many of our cloud migrations were suspended, and changed to cloud repatriation projects. Counter-party risk is untenable.
Younger professionals, I've found, view these and related issues solely as technical problems, and they weigh those technical problems higher than the problems they can't weigh, because of a lack of experience and something called the streetlamp effect, an intelligence trap that often arises because they aren't taught a Bayesian approach. There's a SANS CTI presentation on this (https://www.youtube.com/watch?v=kNv2PlqmsAc).
The TL;DR is that a technical professional can see and interrogate just about every device, and that can lead to poor assumptions and an illusion of control, which tends to ignore or dismiss problems when there is no real visibility into how those edge problems can occur (when the low-level facilities don't behave as they should): the class of problems in the non-deterministic failure domain where only guess-and-check works.
The more seasoned tend to focus more on the flexibility needed to mitigate problems that occur from business process failures, such as when a cooperative environment becomes adversarial, which necessarily occurs when trust breaks down with loss, deception, or a breaking of expectations on one parties part. This phase change of environment, and the criteria is rarely reflected or touched on in the BC/DR plans; at least the ones that I've seen. The ones I've been responsible for drafting often include a gap analysis taking into account the dependencies, stakeholder thoughts, and criteria between the two proposed environments, along with contingencies.
This should obviously include legal, to hold people to account when they fail in their obligations, but even that is often not enough today. Legal often costs more than simply taking the loss and walking away, absent a few specific circumstances.
This youthful tendency is what makes me cringe. The worst disasters I've seen were readily predictable to someone with knowledge of the underlying business mechanics, and how those business failures would lead to inevitable technical problems with few if any technical resolutions.
If you were co-locating on your own equipment with physical data center access I'd have cut you a lot more slack, but it didn't seem like you are from your other responses.
There are ways to mitigate counter-party risk while receiving the hosting you need. Compromises in apples to oranges services given the opaque landscape rarely paint an objective view, which is why a healthy amount of skepticism and disagreement is needed to ensure you didn't miss something important.
There's an important difference between constructive criticism intended to reduce adverse cost and consequence, and criticisms that simply aren't based in reality.
The majority of people on HN these days don't seem capable of making that important distinction in aggregate. My relatively tame reply was downvoted by more than 10 people.
These people by their actions want you to fail by depriving you of feedback you can act on.
> A combination of Prometheus, Grafana, and Loki allowed us to replicate — and in some ways exceed — the visibility we had on AWS
Given the existence of these tools, which are fantastic, I'm often stunned at how sluggish and expensive the AWS monitoring stack is, and how lacklustre its UX is.
Monitoring quickly became the most expensive, and most unpleasant part of our AWS experience.
When I discovered that Live Tail (an equivalent of looking at the logs with `tail -f ...`) is paid, I laughed out loud. The most obvious functionality for everyday looking at logs is not free. CW is pain.
I don't mean that in an offensive way, but if your amount of logs is so small that you can live tail them, you don't operate at the scale that AWS cares about.
It's paid because operating that feature at AWS' scale is expensive as hell. Maybe not for your project, but for 90% of their customers it is.
You can specify filters in Live Tail, if I read the docs right. So seeing live logs for one transaction ID or one user should be possible. This is definitely useful. Searching through gigs/s of logs in order to do that could be expensive, sure. This just feels wrong to me anyway. CW is already pricey.
If you were to divide the AWS customer base into a 10% bucket and a 90% bucket, the 90% bucket would not be the ones needing the infinite scale of AWS.
I think the most often mentioned problems are pollution of Hetzner addresses by shady people (might be addressed with "exits" from AWS / Cloudflare) and the fact that you are running on hardware which does tend to fail / need upgrades. Were there any concerns on those from your side?
Also, Loki! How do you handle memory hunger on the Loki reader for those pesky long-range queries, and are there alternatives?
Pollution: We front everything user-facing through Cloudflare, so external users (and bots) don’t interact directly with our Hetzner/OVH IPs. We lock down our IPs at the firewall (ufw + Cloudflare IP allowlisting) so only trusted sources can even connect at L4.
Failures/upgrades: We provision with Terraform, so spinning up replacements or adding capacity is fast and deterministic.
We monitor hardware metrics via Prometheus and node exporter to get early warnings. So far (9 months in) no hardware failure, but it’s a risk we offset through this automation + design.
Apps are mostly data-less and we have (frequently tested) disaster recovery for the database.
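To make the Cloudflare-only ingress above concrete, a minimal Ansible sketch could look like this (it assumes the community.general collection is installed; the cloudflare_ipv4_ranges variable is hypothetical and would in practice be populated from https://www.cloudflare.com/ips-v4):

```yaml
# roles/firewall/tasks/cloudflare_allowlist.yml -- illustrative sketch
- name: Allow HTTPS only from Cloudflare ranges   # cloudflare_ipv4_ranges is a hypothetical variable
  community.general.ufw:
    rule: allow
    port: "443"
    proto: tcp
    src: "{{ item }}"
  loop: "{{ cloudflare_ipv4_ranges }}"
  become: true

- name: Deny all other inbound traffic by default
  community.general.ufw:
    state: enabled
    policy: deny
    direction: incoming
  become: true
```

Anything not matching an allow rule is dropped, so the origin IPs stay effectively invisible to direct scans.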
Loki: We’re handling the memory hunger by
• Distinguishing retention limits and index retention
• Tuning query concurrency and max memory usage via Loki's config + systemd resource limits (sketched below).
• Use Promtail-style labels + structured logging so queries can filter early rather than regex the whole log content.
• Where we need true deep history search, we offload to object store access tools or simple grep of backups — we treat Loki as operational logs + nearline, not as an archive search engine.
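To make the Loki tuning concrete, here is the shape of the settings involved (key names and values vary by Loki version; these are illustrative, not our production config):

```yaml
# loki-config.yml (fragment) -- illustrative; sections/keys depend on your Loki version
limits_config:
  retention_period: 744h             # ~31 days of operational retention; history lives in object storage
  split_queries_by_interval: 30m     # break long-range queries into smaller chunks
  max_query_parallelism: 8

querier:
  max_concurrent: 4                  # cap concurrent query workers per reader

# /etc/systemd/system/loki.service.d/override.conf
# [Service]
# MemoryMax=2G                       # hard ceiling so a heavy query cannot take down the host
```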
Thanks for thorough answer! Seems like you've platformized(!) yourself to an extent, have you considered going full on with k8s on top of metal (their machines) to offset some of the concerns about hardware?
Thanks for the compliment.
We used AWS EKS in the old days and we never liked the extreme complexity of it.
With two Spring Boot apps, a database and Redis running across Ubuntu servers, we found simpler tools to distribute and scale workloads.
Since compute is dirt cheap, we over-provision and sleep well.
We have live alerts and quarterly reviews (just looking at a dashboard!) to assess if we balance things well.
K8s on EKS was not pleasant, I wanna make sure I never learn how much worse it can get across European VPS providers.
Hmm, what was so unpleasant about EKS if you don't mind my asking?
I'm guessing the answer is going to center around the word "complexity" cited in their original comment. That is: I would guess it's YAGNI more than EKS itself
There's an ongoing thread (one of many) exploring the different perspectives on that debate: https://news.ycombinator.com/item?id=44317825
He said "in the old days" so probably before addons, managed nodegroups or auto mode. This must have been hell.
Ah. Yeah I'm setting things up with managed node groups and it doesn't seem so bad so far... Waiting for the other shoe to drop after so much doom saying though haha. Luckily we removed the need for anything stateful, so I can ignore the EBS-CSI shortcomings. Also trying to keep it simple/minimal when it comes to ingress and networking.
Cool. I've never had any issue with the EBS CSI driver itself; the biggest issues were idiosyncrasies of EBS itself, like the mount count limit or availability zone requirements. These need ugly workarounds, like limiting your volumes to a single AZ, so, no HA.
On the other side, their VPC CNI plugin and their ingress controller are pretty much set and forget.
Yeah basically it is that EBS limitation (AZ-specific) combined with autoscaling causing quirky failure modes unless you are careful about how you set things up.
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues...
(Not OP): On the loki question: yeah our project had a similar issue. I did a lot of playing around with the loki configuration, and what you'll discover by reading their blogs on Loki performance is that the indexing settings they recommend are not the ones that are used by default in helm (and probably other deployment configurations). Once I did some reconfiguration, added read specific instances, and implemented their other recommendations - we did see much better performance.
Just remember: their interest is that you buy their cloud service, not in giving an out-of-the-box great experience on their open source stuff.
A good alternative to Loki is Victoria: popular, way more performant, and reputable, but we went with Loki because of the relative size and diversity of maintainers between the two projects. Your points are super valid and we worked around them as mentioned above.
Quickwit is also worth a look, along with its log collector companion Vector. I think at least Vector was a YC company before they got shlorped up by Datadog, but they're still both actively maintained open source.
https://en.wikipedia.org/wiki/Sybil_attack
One of the advantages of more expensive providers seems to be that they have good reputation due to a de facto PoW mechanism.
Depends on the use case, right? I don’t accept traffic from random Hetzner IPs — only Cloudflare’s IPs are allowed.
The only potential indirect risk is if your Hetzner VPS IP range gets blacklisted (because some Hetzner clients abuse it for Sybil attacks or spam).
Or if Hetzner infrastructure was heavily abused, their upstream or internal networking could (in theory) experience congestion or IP reputation problems — but this is very unlikely to affect your individual VPS performance.
This depends on what you are doing on Hetzner and how you restrict access but for an ISO-27001 certified enterprise app, I believe this is extremely unlikely.
For those wondering about ISO 27001 - it's a standard for international security management, and popular in Europe.
However in the US it's not very relevant or even interesting to companies, and some European companies fail to understand that.
SOC 2 is the default and the preferred standard in the US - it's more domestic and less rigid than ISO 27001.
ISO 27001 I wouldn't call rigid; most of the stuff you should be doing anyway if you use any software.
Checking for evidence that you are doing those things is what I would call rigid. SOC 2 as an attestation doesn't require as much documentation.
Sure, it depends on the implementation of your ISMS. Ideally you want to follow the control guidance in 27002. They've done a lot of thinking on this.
Having been through both, I much prefer the "rigid" ISO 27001, as the SOC 2 audits seem to be based on how well you vibed with the auditor and the auditor's competency more than anything. The things they are auditing seem overly broad and open to interpretation, and the auditor's descriptions of your controls can easily be twisted.
Name one big US cloud provider or similar that is not ISO 27001 compliant.
The cloud provider can be compliant without your app being so. Most apps will not pass an ISO audit unless designed to do so.
Same here, but Azure. About 90% saved, with a very similar stack.
It is a classic big-cloud play to make enterprises reliant on competency in their weird service abstractions, which slowly drains away the quite simple ops story an enterprise usually needs.
Can you please elaborate how Azure is cheaper?
”Same here” meaning moving to Hetzner, but from Azure - could’ve made it less ambiguous!
Might throw together a post on it eventually:
https://news.ycombinator.com/context?id=43216847
I think the parent meant that they moved from Azure to Hetzner.
I don't get the numbers. It used to be $24,000/year. You saved 90%, so you're spending $200 a month at Hetzner? That's literally one EPYC server. You really don't need distributed systems for that. Can you talk a bit more about requests per second or number of users?
You can't do a single-server setup for your workloads if you are ISO 27001 compliant, and, further, you must have a separate server for logging and monitoring.
No matter the load, this certification requires some complexity.
Not all employees log in daily. For a scheduling app, most people check a few times a week, but not every day.
Daily active users (DAU) = around 10,000 to 20,000
Peak concurrency (users on at the exact same time) = generally between 1,500 to 2,000 at busy times (like when new schedules drop or at shift start/end times)
Average concurrent users at any random time = maybe 50 to 150
Why cloud costs can add up even for us:
Extensive use of real-time features and complex labour rules mean the app needs to handle a lot of data processing and ultimately sync into salary systems.
An example:
Being assigned to a shift has different implications for every user. It may trigger a nuisance bonus, and such a bonus could further only be triggered in certain cases, for example depending on when the shift was assigned relative to its start time.
Lastly, there is the optimization of a schedule, which is computationally expensive.
Thanks for the answer, makes sense. So you can have a few smaller app and DB dedicated servers, plus a few Hetzner cloud VPS instances to handle backups and monitoring, and object storage to store it all.
It would be interesting to read more about your policy on logging and monitoring and how you've implemented it.
Our app is a lot more demanding (I put 0.5 cores/user, 300 IOPS/user and 20 Mb/s/user as requirements) and I forgot that there are also lighter use cases. We blew through the thousands in free credits on AWS in like 2 months and went immediately to Hetzner.
Read through this list here, it should give you a good sense of what logging and monitoring is sufficient for ISO and valuable to us:
https://news.ycombinator.com/item?id=44335920#44337659
If you have any more questions, just reach out at jk@datapult.dk
Sounds like an interesting use case.
Thanks! My main hurdle towards ISO is a lack of warm bodies. It's difficult to do separation of concerns when you're more or less solo.
$2400, not $200.
$2400/year is $200/month. He is correct.
This stack appears to be a solid choice for building generic CRUD applications, regardless of immediate ISO certification needs. Would it be feasible to package this as a ready-to-use solution for greenfield projects that may pursue ISO certification in the future? Which components would still require manual setup, and why?
Yes, web apps all need logging, performance dashboard, redundancy, DB backups and such.
This could be a stack parametrised with sound defaults, just requiring some Terraform provider credentials as well as a path to an executable web app and a choice of database engine.
ISO readiness built-in and abstracted at the OS level rather than programming language level.
If anyone wants to "assetize" what I built, reach out at jk@datapult.dk. I bring a battle-tested setup that has been ISO certified by independent auditors.
You bring clients, directly or indirectly, with a marketing/growth-hacking mindset.
I love Hetzner, I run my Internet search engine from there: bare metal FTW.
I know OVH and Hetzner get mentioned a lot as European cloud, but I thought I should bring UpCloud [1] to HN's attention. I believe their CPU cores are actual CPU cores and not vCPUs as in a single thread (although I can't find a reference to it, which is annoying).
I also sometimes think OVH and Hetzner are not a fair comparison, as much as I want competition for the hyperscalers. Hetzner uses consumer-grade components with a few server-grade selections.
[1] https://upcloud.com
Do I have Stockholm syndrome for being utterly confused that every single one of these budget providers [that I've seen so far] has no meaningful IaaS IAM offering? I don't mean "yeah, I can log in to the console with username and password," I mean permissions, roles, machine identity, ... you know, who can do what to what, and being able to see those actions in an audit log.
What I find especially bizarre is that OpenStack seems to tick many of those boxes, including https://docs.openstack.org/keystone/2025.1/admin/manage-serv... and https://docs.openstack.org/keystone/2025.1/admin/federation/... so it's not like those providers would have to roll such a control plane from scratch as they seem to have done now
As a concrete example for your link, they cite Crossplane (and good for them) but then the Crossplane provider gets what I can only presume is some random person's console creds https://upcloud.com/docs/guides/getting-started-crossplane/#... and their terraform provider auths the same way
I do see the https://developers.upcloud.com/1.3/3-accounts/#add-subaccoun... and some of its built-in filtering but it doesn't have a good explanation for how it interacts with https://developers.upcloud.com/1.3/18-permissions/#normal-re... . But don't worry, you can't manage any of that with their terraform provider, so I guess upcli it is
You're not wrong.
I like to say that there are no European cloud providers. There are only European hosting providers.
As you say IAM is table stakes for being a real cloud.
I did a successful AWS to Hetzner migration myself once, and I'd like to make a business of "back-to-earth migrations" but clients are hard to find.
Everyone talks about it but none wants to be the first mover.
Offer a contingent payment of 10% (or similar) of the savings per month for one year after the migration. If a migration doesn't make sense then it won't happen and the client doesn't pay (that would be very rare!), but when a migration happens, 10% of the savings is probably more than you would have charged by the hour or quoted as a fixed price.
When the money really starts drying up it will be on the table again; for now it's "someone else's money" at worst and "less profit for the shareholders" at best (which is not an incentive for an engineer on the ground).
There's also a lot of FUD regarding hiring more staff. My observed experience is that hyperscalers need an equivalent number of people on rotation; it's just different skills (learning the intricacies/quirks of different product offerings on the hyperscaler vs CS/operational fundamentals). So everyone is scared to overload their teams with work and potentially need to hire people. You can couple this with the fact that all migrations are up-front expensive and change is bad by default.
There will come a day when there simply isn't enough money to spend 10x the cost on these systems. It will be a sad day for everyone because salaries will be depressed too, and we will pine for the days of shiny tools where we could make lots of work disappear by saying that our time is too expensive to work with such peasant issues.
Part of what I expect to get when I pay AWS is that it reduces my operational burden, and this has been true in my experience. I've almost forgotten about all the prep, the stress, etc. that comes from upgrading deprecated mysql clusters now that I've gotten used to using the AWS managed equivalents.
That is not to say that this aspect alone justifies huge fees, but it does have significant value.
This comment is great. Upgrade processes should be part of your internal processes if you want to get ISO 27001 certification that is not just checking the boxes but actually something you use for more control of your development and release cycle.
AWS RDS does not upgrade major or minor versions of Postgres or, as you mentioned, MySQL; at most it applies patch updates. But these patch updates are easy to do yourself and do not take long to be reminded of in your ISMS and then subsequently carry out.
The purpose of this post is not to justify cloud hyperscalers versus European servers. It is actually a post on how to manage a highly regulated, compliant, and certified server setup yourself outside AWS because so many people just have their ISO certification on AWS infrastructure and once they got that they are never able to leave AWS again.
If you have no client demand and no real need to work on updating your infrastructure yourself, then you can go ahead and not go for an ISO 27001 certification and let AWS RDS update as it pleases. But if you operate a complex beast in a regulated industry such as employment law, finance, and such, then you get some more fun challenges and higher need for control.
I think it would be nice for a European Cloudflare to exist.
No problem! https://bunny.net/about/ Enjoy!
bunny still don't support IPv6 to origin, or else I would have switched.
I've never seen IPv6 used for origins in CF. Why do you need that?
Cloudflare even has a blog post on it.
https://blog.cloudflare.com/amazon-2bn-ipv4-tax-how-avoid-pa...
If you're using a cheap vps provider who charges for ipv4 but gives ipv6 for free it helps.
We're in the process of migrating away from azure. Currently lots of cloudflare, but also some stuff runs on Hetzner.
If I manage to get https://uncloud.run/ or something similar up & running, the platform will no longer matter, whether it's OVH, Hetzner, Azure, AWS, GCP, ... It should all be possible & easy to switch... #FamousLastWords
Yes, it would be nice. Given Cloudflare's dev-friendly branding, for some reason I didn't mind keeping it.
How do you do disk encryption?
Coming from AWS this is simple but I haven't seen how to do it well on hosting providers.
Obviously one can't write the disk encryption key to a boot partition or that undermines the point of it...
ISO27001 doesn't specifically require disk encryption. Rather it requires data on disks be protected according to how it is classified. Disk encryption is one way to achieve this, especially in a shared-hardware environment.
In this case, the disks being in an ISO 27001 data centre with processes in place to ensure erasure during de-provisioning (which Hetzner is, and has) may well also meet these criteria.
I'm involved with a cloud migration myself so I like the topic, but the Medium article contains less information than this "Shown HN" post.
The Medium post is mostly fluff and a lead generator.
The Medium post is more of a high-level case study for a mixed audience (including non-technical decision makers). I intentionally kept the details lighter there, partly to avoid overwhelming readers and partly because the real “meat” (like our Ansible/Terraform patterns, Prometheus config, etc.) is harder to convey in that format without turning it into a giant technical appendix.
I’m happy to share specific configs, diagrams, or lessons learned here on HN if people want — and actually I’m finding this thread a much better forum for that kind of deep dive.
I'll dive into other aspects elsewhere; given what I am sharing here, you can hardly doubt that.
Any particular area you’d like me to expand on? (e.g. how we structured Terraform modules, Ansible hardening, Prometheus alerting, Loki tuning?)
More detail how you tie ISO and Terraform/Ansible would be welcome.
Here you go (and a small alert-rule sketch follows the list below). Write me at jk@datapult.dk if you need more.
A.5.25 Security in development and support processes:
Safe rolling deploy, rollback mechanisms, NGINX health checks, code versioning, Prometheus alerting for deployment issues
A.6.1.2 Segregation of duties:
Separate roles for database, monitoring, web apps; distinct system users
A.8.1.1 Inventory of assets:
Inventory management through Ansible inventory.ini and groups
A.8.2.3 Handling of assets:
Backup management with OVH S3 storage; retention policy for backups
A.8.16 Monitoring activities (audit logging, monitoring):
auditd installed with specific rule sets; Prometheus + Grafana Agent + Loki for system/application/audit log monitoring
A.9.2.1 User registration and de-registration:
ansible_user, restricted SSH access (no root login, pubkey auth), AllowUsers, DenyUsers enforced
A.9.2.3 Management of privileged access rights:
Controlled sudo, audit rules track use of sudo/su; no direct root access
A.9.4.2 Secure log-on procedures:
SSH hardening (no password login, no root, key-based access)
A.9.4.3 Password management system:
Uses Ansible Vault and variables
A.10.1.1 Cryptographic controls policy:
SSL/TLS certificate generation with Cloudflare DNS-01 challenge, enforced TLS on Loki, Prometheus
A.12.1.1 Security requirements analysis and specification:
Tasks assert required variables and configurations before proceeding
A.12.4.1 Event logging:
auditd, Prometheus metrics, Grafana Agent shipping logs to Loki
A.12.4.2 Protection of log information:
Logs shipped securely via TLS to Loki, audit logs with controlled permissions
A.12.4.3 Administrator and operator logs:
auditd rules monitor privileged command usage, config changes, login records
A.12.4.4 Clock synchronization:
chrony installed and enforced on all hosts
A.12.6.1 Technical vulnerability management:
Lynis, Wazuh, vulnerability scans for Prometheus metrics
A.13.1.1 Network controls:
UFW with strict defaults, Cloudflare whitelisting, inter-server TCP port controls
A.13.1.2 Security of network services:
SSH hardening, NGINX SSL, Prometheus/Alertmanager access control
A.13.2.1 Information transfer policies and procedures:
Secure database backups to OVH S3 (HTTPS/S3 API)
A.14.2.1 Secure development policy:
Playbooks enforce strict hardening as part of deploy processes
A.15.1.1 Information security policy for supplier relationships:
OVH S3, Cloudflare services usage with access key/secret controls; external endpoint defined
A.16.1.4 Assessment of and decision on information security events:
Prometheus alert rules (e.g., high CPU, low disk, instance down, SSL expiry, failed backups)
A.16.1.5 Response to information security incidents:
Alertmanager routes critical/security alerts to email/webhook; plans for security incident log webhook
A.17.1.2 Implementing information security continuity:
Automated DB backups, Prometheus backup job monitoring, retention enforcement
A.18.1.3 Protection of records:
Loki retention policy, S3 bucket storage with rotation; audit logs secured on disk
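To make e.g. the A.16.1.4 line concrete, our alert rules look roughly like this (a trimmed sketch, not our exact config; the rule-group name, the 14-day threshold and the label values here are illustrative). probe_ssl_earliest_cert_expiry comes from the Blackbox exporter and up is Prometheus' standard scrape-health metric:

    # prometheus rules file (illustrative sketch)
    groups:
      - name: iso-a16-security-events
        rules:
          - alert: SSLCertExpiringSoon
            # Blackbox exporter exposes the cert expiry as a unix timestamp
            expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
            for: 1h
            labels:
              severity: critical
            annotations:
              summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"
          - alert: InstanceDown
            # Any scrape target that has been unreachable for 5 minutes
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "{{ $labels.instance }} is down"

Alertmanager then routes anything tagged critical to the email/webhook receivers mentioned under A.16.1.5.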
Re: SSH, I'm also going to assume that the public-facing servers don't have SSH exposed publicly, or that it's locked down so it's only accessible via a bastion server, or through a specific internal-only network or VPN/Tailscale etc.?
Yeah, the SSH port isn't publicly exposed
Did you consider on-premises hardware instead of EU Cloud?
We did not consider that because we are a lean, small startup. We barely have a premise to put anything on.
We talked to a few premium hosting vendors in Denmark, and once we added our own redundancy beyond what they guarantee, it actually became more expensive than AWS.
Does anybody care, besides you, that you’re ISO 27001 compliant? I thought SSAE 16 and other SSAE standards were the main things people were concerned with having?
Seems to depend on industry and/or region.
Most of our customers have a hard requirement on ISO 9001. Many on ISO 27001, too. The rest strongly prefers a partner having a plan to get ISO 27001
Pff… You wish. Depending on the sector you are in, ISO 27001 can either be a hard requirement (either directly or through national standards built upon it, like the Dutch healthcare NEN 7510) or completely irrelevant. If this company needs it, you can bet their customers need it — usually because they in turn are required to do so because of regulations.
You must be looking from US perspective. In EU I don't think I've seen any SSAE provided or wanted, and I've seen a bit of medium to big industry.
Might be interesting, but doesn’t seem to be a valid “Show HN”
* - https://news.ycombinator.com/showhn.html
Did you look into prepackaged solutions such as kamal/dokku/caprover for parts of this? What were you missing from those?
I looked at Kamal, Dokku, and CapRover — all great tools if you want to abstract away server management. But for a HIPAA/ISO 27001 certifiable app, I need a higher level of control and auditability across the entire stack.
With Ansible, I can version everything — from server hardening to DB backups — and ensure idempotent, transparent provisioning. I don’t have to reverse-engineer how a PaaS layer configures things under the hood, or worry about opaque defaults that might not meet compliance requirements.
There's nothing wrong with these tools, but once you're in the mood for the ISO certification, and once you start doing these things yourself, they actually seem like a step backwards or add very little value.
I also prefer running my own DB backups rather than relying on magic snapshots — it's easier to integrate with encrypted offsite storage and disaster recovery policies that align with ISO requirements. This lets me lock down the environment exactly as needed, with no surprise moving parts.
Tools like Kamal/Dokku/CapRover shine for fast, developer-friendly deploys, but for regulated workloads, I’ll take boring, explicit, and auditable any day.
Happy for you, don't get me wrong, but your post is not particularly news, I'm guessing everyone on HN knows bare metal/VPS providers are cheaper than AWS/Azure/GCP.
And also lacking a bit in details:
- both technical (e.g. how are you dealing with upgrades or multi-data center fallback for your postgresql), and
- especially business, e.g. what's the total cost analysis including the supplemental labor cost to set this up but mostly to maintain it.
Maybe if you shared your scripts and your full cost analysis, that would be quite interesting.
> Technical
I'm trying to share as much technical detail as I can across this thread. As for your two examples:
System upgrades:
Keep in mind that as per the ISO specification, system upgrades should be applied but in a controlled manner. This lends itself perfectly to the following case that is manually triggered.
Since we take steps to make applications stateless, and Ansible scripts are immutable:
We spin up a new machine with the latest packages and, once it's ready, it joins the Cloudflare load balancer. The old machines are drained and deprovisioned.
In practice, we have a playbook that iterates through our machines and does this one machine at a time before proceeding. Since we have redundancy on components, this creates no downtime. The redundancy in the web application is easy to achieve using the load balancer in Cloudflare. For the Postgres database, it does require that we switch the read-only replica to become the main database.
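Roughly, the shape is as follows. This is a simplified sketch of the one-host-at-a-time drain/update/re-enable pattern, not our exact playbook (the real flow provisions a fresh VM and drains the old one); the Cloudflare pool variables, the JSON body template and the /healthz endpoint are illustrative assumptions:

    # rolling-deploy.yml (illustrative sketch)
    - hosts: webservers
      serial: 1                    # one machine at a time, so the LB always has healthy peers
      become: true
      tasks:
        - name: Drain this origin in the Cloudflare LB pool (assumed pool/token variables)
          ansible.builtin.uri:
            url: "https://api.cloudflare.com/client/v4/accounts/{{ cf_account_id }}/load_balancers/pools/{{ cf_pool_id }}"
            method: PATCH
            headers:
              Authorization: "Bearer {{ cf_api_token }}"
            body_format: json
            body: "{{ lookup('template', 'pool-origin-disabled.json.j2') }}"  # hypothetical template
          delegate_to: localhost
          become: false

        - name: Apply system updates
          ansible.builtin.apt:
            update_cache: true
            upgrade: dist

        - name: Wait for the app to pass its health check again
          ansible.builtin.uri:
            url: "http://{{ inventory_hostname }}/healthz"   # assumed health endpoint
            status_code: 200
          register: health
          retries: 12
          delay: 5
          until: health.status == 200
          delegate_to: localhost
          become: false

        - name: Re-enable the origin (same PATCH with this origin enabled again)
          ansible.builtin.debug:
            msg: "PATCH the pool again with {{ inventory_hostname }} enabled"

The redundancy assumption is what makes serial: 1 safe: the Cloudflare pool keeps serving from the remaining origins while one host is out.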
DB failover:
The database is only written to and read from by our web applications. We have a second VM on a different cloud that has a streaming replica of the Postgres database. It is a hot standby that can be promoted. You can use something like PgBouncer or HAProxy to route traffic from your apps, but our web framework allows for changing the database at runtime.
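Promotion itself is a one-liner at the Postgres level. As an Ansible sketch it could look like this (assuming PostgreSQL 16 on Debian/Ubuntu paths and peer auth for the postgres system user; the inventory group name and paths are illustrative, not our exact setup):

    # promote-standby.yml (illustrative sketch)
    - hosts: db_standby
      become: true
      become_user: postgres
      tasks:
        - name: Promote the hot standby to primary
          ansible.builtin.command: >
            /usr/lib/postgresql/16/bin/pg_ctl promote -D /var/lib/postgresql/16/main

        - name: Confirm the server has left recovery mode
          ansible.builtin.command: psql -tAc "SELECT pg_is_in_recovery();"
          register: recovery
          changed_when: false
          failed_when: recovery.stdout | trim != 'f'

After that, the applications (or PgBouncer/HAProxy, if you go that route) just need to be pointed at the new primary.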
> Business
Before migration (AWS): we had about 0.1 FTE on infra; most of the time went into deployment pipelines and occasional fine-tuning (the usual AWS dance).
After migration (Hetzner + OVHcloud + DIY stack): after stabilizing it is still 0.1 FTE (though I was at 0.5 FTE for 3-4 months), and it now rests with one person. We didn't hire a dedicated ops person.
On scaling, if we grew 5-10x:
* For stateless services, we're confident we'd stay DIY; Hetzner + OVHcloud + automation scales beautifully.
* For stateful services, especially the Postgres database, I think we'd investigate serving clients out of their own DBs in a multi-tenant setup, and if that proved too cumbersome (we would need tenant-specific disaster recovery playbooks), we'd go back to a managed solution quickly.
I can't speak to the FTE toll of cloud vs. a series of VPS servers in the big boys' league (millions of dollars in monthly consumption) or in the tiny league, but in our league it turns out the FTE requirement is the same.
If anyone wants to see my scripts, hit me up at jk@datapult.dk. I'm not sure it'd be a great security posture to hand them out on a public forum.
> We have a second VM on a different cloud that has a streaming replication of the Postgres database.
How did you setup/secure the connection between the clouds?
The cloud network allows the relevant ports only from the respective IPs, and so does UFW, so the servers can communicate with each other in a restricted way. Needless to say, the communication is encrypted with certificates.
Our logging/monitoring server will switch the primary DB in case the original primary DB server goes down. Since we assume there will be downtime, that server is deliberately not hosted in the same place as the primary DB but in the same place as the secondary DB.
We assume that each cloud will go down at some point, but not both at the same time.
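For the curious, the "only the peer's IP on the relevant port" part maps to something like this with the community.general.ufw module (a sketch; the variable name, the replication port and the default-deny policy are assumptions about a fairly standard setup, not our exact rules):

    # firewall-db.yml (illustrative sketch)
    - hosts: db_primary
      become: true
      tasks:
        - name: Default-deny all incoming traffic
          community.general.ufw:
            direction: incoming
            policy: deny

        - name: Allow streaming replication only from the standby on the other cloud
          community.general.ufw:
            rule: allow
            proto: tcp
            port: "5432"
            from_ip: "{{ standby_public_ip }}"   # the single peer allowed to connect

        - name: Enable UFW
          community.general.ufw:
            state: enabled

On the Postgres side, the matching restriction plus TLS lives in pg_hba.conf (hostssl entries scoped to the same peer address).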
Cloudflare could be considered a point of failure and is another level of complexity compared to doing your own LB (the extra part being the external org; extra both in terms of tech and of compliance).
Have you considered doing your own HA load balancing? If yes, what tech options did you consider?
Nice observation.
I took for granted that Hetzner and OVHcloud would be prone to failures due to their bad rep, not my own experience, so I wanted to be able to direct traffic to one if the other was down.
Doing load balancing ourselves in either of the two clouds gave rise to some chicken and egg situations now that we were assuming that one of them could be down (again not my lived experience).
Doing this externally was deliberate and picking something with a better rep than Hetzner and OVHcloud was obvious in that case.
With the rise of Agentic AI, this increasingly feels like the right move, unless AWS drastically lowers their prices.
Agreed, LLMs helped us with this.
How did you decide on Hetzner and OVH and why do you need both?
Have you looked into others as well, like IONOS and Scaleway?
Great question. Technically speaking I might not need both, but I have a gut feeling that one of these cloud providers might not be as hardened as the hyperscalers, and that Russia is just waiting to take one of these two services down. So for maximal resiliency I chose to design for a multi-cloud setup from the beginning.
Scaleway came up but is more expensive. IONOS did not come up in our research.
Part of what we tried to do was to make ourselves independent from traditional cloud services and be really good at doing stuff on a VPS. Once you start doing that, you can actually allow yourself to look more at uptimes and at costs. Also, since we wanted everything to be fully automated, Terraform support was important for us, and OVHcloud and Hetzner had that.
I'm sure there's many great cloud providers out in Europe, but it's hard to vet them to understand if they can meet demand and if they are financially stable. We would want not to keep switching cloud providers. So picking two of the major ones seemed like a safe choice.
What would Russia's interests be in putting these ISPs down, specifically?
Without making it too political or speculating on things I don't know: I, like many other Europeans, have seen plenty of cases of Russia damaging infrastructure in Europe, everything from internet cables on the ocean bed to telcos, water supplies, railways and more. Authorities are asking civilians in Scandinavia to prepare their houses with food and water, and are actively hardening security around critical infrastructure, including its software. I won't comment more on this because it's going to derail this discussion.
Is there a single piece of proof? Like, was some Russian citizen caught ruining an infrastructure project, where it was proved that (a) he is a citizen of Russia or was paid by Russian authorities, and (b) the person in question had indeed done some damage to the infrastructure project?
I don't remember a single such case. I remember reading a lot of speculation like "it's highly likely that it was done by Russians" every single time, without a trace of evidence.
Does it matter, for the average business, whether the infrastructure was brought down by the Russian state or by someone blaming it on the Russians?
It's undeniable that core European infrastructure is currently being targeted.
Okay, I mean, if you want to give poor ol’ Putin the benefit of the doubt, something that looks like a state actor but might theoretically not be Russia is doing a lot of minor to moderate economic sabotage in Europe.
Personally I think the amount of special pleading required to imagine that it is _not_ Russia is a bit much (particularly around the deep sea cable cuts; at that point you’re really claiming that Russia is deniably pretending that it is them, but really it’s someone else), but you do you. It doesn’t change the overarching point; both Hetzner and OVH would be obvious targets for, ah, whoever it is.
Russia likes causing trouble in Europe generally (and elsewhere; the Internet Research Agency was largely targeted at the US, say).
I'm not surprised about 90% of savings. I remember that initially AWS was promoted everywhere as being "cheaper" than your own hardware, colocation or VPS/VDS hosting.
Once I was working in a quite small company (around 100 employees) that hosted everything on AWS. Due to high bills (it's a small company based in Asia) and other problems, I migrated everything to DigitalOcean (we still used AWS for things like SES), and the monthly hosting bill became about 10 times lower. With no other consequences (in other words, it hasn't become less reliable).
I still wonder who calculated that AWS is cheaper than everything else. It's definitely one of the most expensive providers.
Interesting, comparing commodity services (VMs, storage etc) like-for-like, DO has always seemed more expensive than AWS. Do you remember what was the main source of savings?
I don't remember all the details, but the trigger was Amazon RDS: it was complaining every week that our DB had consumed all available space, and we had to increase the size and pay more every week. Our DB and flow of data weren't that big. I spent some time investigating but found nothing: the sum of the sizes of all tables was quite moderate and much less than the size of the storage we were paying Amazon for.
I lacked both the expertise and the time to find out where the wasted space went. After I set up MariaDB on the smallest DigitalOcean droplet, the mysterious storage growth never recurred, and the cheapest droplet had enough capacity to serve our needs for years.
Also, there were 7-10 forgotten "test" server instances and other artifacts (buckets, domains, etc) on Amazon (I believe it's also quite common, especially in bigger companies).
AWS is cheaper under certain, favorable (to them) assumptions, primarily around needing fewer employees to maintain the hardware.
In my mind it's very similar to how people sometimes frame the cost-effectiveness of Apple products.
Like when the 5K iMac originally came out, there were a lot of people claiming it was a good value, because if you bought a 5K display and then built a PC, that would end up being more expensive. So, like for like, Apple was cheaper.
But... that assumed you even needed a 5K display, which were horribly overpriced and rare at the time. As soon as you say "4K is good enough", the cost advantage disappears, and it's not even close.
My memory might be off here but wasn’t the initial AWS "cheaper" promise only made vs buying & maintaining your own hardware?
A single developer in Denmark (it's a Danish company) would easily cost the company around $100K a year.
They might save 90% of their $24K on hardware, but probably spend double that amount on salaries.
This is why AWS ends up being cheaper even if it costs more for the same software (let's be real, it's not at all the same software anyway).
It sounds like if you rent a VPS/VDS in any other place than Amazon, you'll have to hire a separate person to babysit it 24/7. It's not true.
Obviously it's not true, but if you want to put the following on your VPS:
> • Ansible roles for PostgreSQL (with automated s3cmd backups + Prometheus metrics) • Hardening tasks (auditd rules, ufw, SSH lockdown, chrony for clock sync) • Rolling web app deploys with rollback + Cloudflare draining • Full monitoring with Prometheus, Alertmanager, Grafana Agent, Loki, and exporters • TLS automation via Certbot in Docker + Ansible
You'll spend a heck of a lot of time on setting it up originally, and you will spend a lot of time keeping it up-to-date, maintaining it, and fixing the inevitable issues that will occur.
If their bill were $200K a year, why not. But at $24K a year, about 25% of one employee's salary, the bill is negligible and the migration is most likely a bad choice.
People keep comparing cloud costs to employee costs, but I think that’s the wrong metric. The real ratio to look at is cloud spend vs. the revenue you can unlock.
For me, switching from AWS to European providers wasn’t just about saving on cloud bills (though that was a nice bonus). It was about reducing risk and enabling revenue. Relying on U.S. hyperscalers in Europe is becoming too risky — what happens if Safe Harbor doesn’t get renewed? Or if Schrems III (or whatever comes next) finally forces regulators to act?
Being able to stay compliant and protect revenue is worth far more than quibbling over which cloud costs a little less.
Some of these tasks are required when you run your service in Amazon Cloud as well. It's not all free and not all by default. You'll need someone experienced with Amazon Services to set up many of these things in the Amazon cloud as well.
Also, it's not like you need everything you mention and need it immediately.
NTP clock syncing is a part of any Linux distro for the last 20 years if not more.
I don't remember that Amazon automatically locks down SSH (didn't touch AWS for 7-8 years, don't remember such a feature out of the box 8 years ago).
Rolling web app deploys with rollback can be implemented in multiple ways depending on your app, and can be quite easy in some instances. It's also not something Amazon does for you for free; you need to spend effort on the development side anyway, no matter whether you deploy on Amazon or somewhere else. There's no magic bullet that makes automatic rollback free and flawless without development effort.
Exactly. Well said.
A thing we learned in this process is that there are many levels of abstraction at which you can think about rollback, locking down SSH, and so on and so forth.
If your abstraction level is AWS and the big hyperscalers, the answer would be to use Kubernetes, but peeling layers of complexity off that, you could also do it with Docker Compose or even Linux programs that have been battle-tested for decades.
Most ISO-certified companies are not at hyperscale, so here is a fun one: instead of Grafana Agent from 2020, you could most likely get away just fine with rsyslog from 2004 (see the sketch below).
And if you want your EKS cluster to give you insights, you have to configure CloudWatch yourself, so how hands-off is that setup really compared to Ubuntu + Grafana Agent?
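To illustrate the rsyslog point: remote log shipping really is a handful of lines. A sketch, assuming a hypothetical central collector at logs.example.internal on TCP 514 (not our actual setup):

    # ship-syslog.yml (illustrative sketch)
    - hosts: all
      become: true
      tasks:
        - name: Forward all syslog messages to a central collector over TCP
          ansible.builtin.copy:
            dest: /etc/rsyslog.d/90-forward.conf
            content: |
              *.* action(type="omfwd" target="logs.example.internal" port="514"
                         protocol="tcp" action.resumeRetryCount="-1"
                         queue.type="LinkedList" queue.filename="fwd")
          notify: restart rsyslog

      handlers:
        - name: restart rsyslog
          ansible.builtin.service:
            name: rsyslog
            state: restarted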
The OP already said that they spend the same amount of time maintaining this vs AWS
For now :)
If you want me to assess what I would need over the next 5-10 years, I'd start a very different thread here on HN.
The defining conditions are my current setup and business requirements. It works well, and we've resisted pretending that we know where we will be in 5 years.
I am reminded of the 2023 story about the surprisingly simple infra of Stack Overflow[1] and the 2025 story that Stack Overflow is almost dead[2].
Given that the setup works now, one can't just add that it is only working "for now". I see nothing in foreseeable client demand that leads me to think this has been fundamentally architected incorrectly.
[1] https://x.com/sahnlam/status/1629713954225405952
[2] https://blog.pragmaticengineer.com/stack-overflow-is-almost-...
That is absolutely not what I was talking about.
I'm talking about the issues that will happen to your current setup and requirement. Disaster recovery, monitoring, etc.
> Disaster recovery, monitoring, etc
The ISO 27001 audit covers me for exactly that (disaster recovery and monitoring), so that settles it, no?
Also worth noting that these are the two things you don't really get from the hyperscalers. If you want to count on more than their uptime guarantees, you have to roll some DR yourself and while you might think that this is easy, it is not easier than doing it with Terraform and Ansible on other clouds.
I have had my DR and monitoring audited in its AWS and EU version. One was no easier or harder than the other.
But the EU setup gave me a clear answer to clients on the CLOUD Act, Schrems II, GDPR, and Safe Harbor, which is a competitive advantage.
Any reasons to go for certbot instead of Traefik or Caddy?
We use Cloudflare as the WAF and load balancer, which makes Traefik less relevant and Certbot easier to couple in.
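For reference, the issuance itself is roughly this, using the public certbot/dns-cloudflare image (a sketch; the domain, email and credentials path are placeholders, and a real setup also needs a renewal timer and an NGINX reload hook):

    # issue-cert.yml (illustrative sketch)
    - hosts: webservers
      become: true
      tasks:
        - name: Obtain a certificate via a Cloudflare DNS-01 challenge
          ansible.builtin.command: >
            docker run --rm
            -v /etc/letsencrypt:/etc/letsencrypt
            certbot/dns-cloudflare certonly
            --dns-cloudflare
            --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini
            -d app.example.com
            --non-interactive --agree-tos -m ops@example.com

The cloudflare.ini file holds a scoped DNS API token (dns_cloudflare_api_token = ...), which is exactly the kind of secret that belongs in Ansible Vault.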
I moved from managed AWS to unmanaged AWS (Lightsail), decreasing the cost significantly while still staying in the AWS ecosystem. I use S3, Route 53, SES and other cheap services; you could consider this path.
Hetzner's biggest problem is that they can and do terminate a user's account without warning if the user starts using CPU resources very heavily, or for any other reason, and for perfectly legal usage. This can and does happen to people within months. When it happens, consider your data lost and your account blocked. They will offer no explanation whatsoever, and will even send you a bill for the full month. Hetzner simply cannot be trusted, not even a little bit.
As for OVH, they don't do the above, but they have week-long unplanned downtimes, so using them is okay only as an optional resource.
Even so, there are lots of providers that are cheaper than Amazon and won't screw you over.
I have not experienced this, in spite of the rumours online. As I mention in these two comments, given those rumours we decided to design our way around it by assuming that each provider would go down at some point (but not both at the same time).
1. https://news.ycombinator.com/item?id=44335920#44339234
2. https://news.ycombinator.com/item?id=44335920#44337619
Don't depend on a single provider. Always do live replication across providers. OVH downtime is only an issue if your entire company is running on a single server. Split it across a couple zones if this is actually paying bills.
At the very least I backup data to a different provider, so I can restore the services elsewhere, although there may still exist some residual data loss since the last backup. I guess it depends on just how well it's paying the bills.
So OP is now spending 10% of their "$24,000 annual bill", which would be $2,400/year, or $200/month, on infrastructure.
If the whole company can run on $200/month in VPSes, they probably went onto AWS too early.
As I wrote elsewhere in this thread:
Being able to stay compliant and protect revenue is worth far more than quibbling over which cloud costs a little less.
The real ratio to look at is cloud spend vs. the revenue.
For me, switching from AWS to European providers wasn’t just about saving on cloud bills (though that was a nice bonus). It was about reducing risk and enabling revenue. Relying on U.S. hyperscalers in Europe is becoming too risky — what happens if Safe Harbor doesn’t get renewed? Or if Schrems III (or whatever comes next) finally forces regulators to act?