The saddest part about this article being from 2014 is that the situation has arguably gotten worse.
We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
I've done a handful of interviews recently where the 'scaling' problem involves something that comfortably fits on one machine. The funniest one was ingesting something like 1gb of json per day. I explained, from first principles, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days.
The wildest part is they’ll take those massive machines, shard them into tiny Kubernetes pods, and then engineer something that “scales horizontally” with the number of pods.
Maybe you are right about kubernetes, I don't have enough experience to have an opinion. I disagree about containers though, especially the wider docker toolchain.
It is not that difficult to understand a Dockerfile and use containers. Containers, from a developer pov, solve the problem of reliably reproducing development, test and production environments and workloads, and distributing those changes to a wider environment. It is not perfect, it's not 100% foolproof, and it's not without its quirks or learning curve.
However, there is a reason docker has become as popular as it is today (not only containers, but also dockerfiles and docker compose), and that is because it has a good tradeoff between various concerns that make it a highly productive solution.
I can maybe make a case for running in containers if you need some specific security properties but .. mostly I think the proliferation of 'fucked up piles of shit' is the problem.
Containers are just processes plus some namespacing, nothing really stops you from running very huge tasks on Kubernetes nodes. I think the argument for containers and Kubernetes is pretty good owing to their operational advantages (OCI images for distributing software, distributed cron jobs in Kubernetes, observability tools like Falco, and so forth).
So I totally understand why people preemptively choose Kubernetes before they are scaling to the point where having a distributed scheduler is strictly necessary. Hadoop, on the other hand, you're definitely paying a large upfront cost for scalability you very much might not need.
Time to market and operational costs are much higher on kubernetes and containers from many years of actual experience. This is both in production and in development. It’s usually a bad engineering decision. If you’re doing a lift and shift, it’s definitely bad. If you’re starting greenfield it makes sense to pick technology stacks that don’t incur this crap.
It only makes sense if you’re managing large amounts of large siloed bits of kit. I’ve not seen this other than at unnamed big tech companies.
99.9% of people are just burning money for a fashion show where everyone is wearing clown suits because someone said clown suits are good.
Writing software that works containerized isn't that bad. A lot of the time, ensuring cross platform support for Linux is enough. And docker is pretty easy to use. Images can be spun up easily, and the orchestration of compose is simple but quite powerful. I'd argue that in some cases, it can speed up development by offering a standardized environment that can be brought up with a few commands.
Kubernetes, on the other hand, seems to bog everything down. It's quite capable and works well once it's going, but getting there is an endeavor, and any problem is buried under mountains of templatized YAML.
Imagine working on a project for the first time, having a Dockerfile or compose file that works, that just downloads and spins up all dependencies and builds the project successfully. Usually that just works and you get up and running within 30 minutes or so.
On the other hand, how it used to be: having to install the right versions of, for example, redis, postgres, nginx, and whatever unholy mess of build tools is required for this particular hairball, hoping it works on your particular version of Linux. Have fun with that.
Working on multiple projects, over a longer period of time, with different people, is so much easier when setup is just 'docker compose up -d' versus spending hours or days debugging the idiosyncrasies of a particular cocktail that you need to get going.
Thanks. You’ve reassured me that I’m not going mad when I look at our project repo and seriously consider binning the Dockerfile and deploying direct to Ubuntu.
The project is a Ruby on Rails app that talks to PostgreSQL and a handful of third party services. It just seems unnecessary to include the complexity of containers.
I have a lot of years of actual experience. Maybe not as much as you, but a good 12 years in the industry (including 3 at Google, and Google doesn't use Docker, it probably wouldn't be effective enough) and a lot more as a hobbyist.
I just don't agree. I don't find Docker too complicated to get started with at all. A lot of my projects have very simple Dockerfiles. For example, here is a Dockerfile I have for a project that has a Node.JS frontend and a Go backend:
# Stage 1: build the Node.js frontend assets
FROM node:alpine AS npmbuild
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: build the Go backend, pulling in the built frontend assets
FROM golang:1.25-alpine AS gobuilder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
COPY --from=npmbuild /app/dist /app/dist
RUN go build -o /server ./cmd/server

# Final image: nothing but the server binary
FROM scratch
COPY --from=gobuilder /server /server
ENTRYPOINT ["/server"]
It is a glorified shell script that produces an OCI image with just a single binary. There's a bit of boilerplate but it's nothing out of the ordinary in my opinion. It gives you something you can push to an OCI registry and deploy basically anywhere that can run Docker or Podman, whether it's a Kubernetes cluster in GCP, a bare metal machine with systemd and podman, a NAS running Synology DSM or TrueNAS or similar, or even a Raspberry Pi if you build for aarch64. All of the configuration can be passed via environment variables or if you want, additional command line arguments, since starting a container very much is just like starting a process (because it is.)
But of course, for development you want to be able to iterate rapidly, and don't want to be dealing with a bunch of Docker build BS for that. I agree with this. However, the utility of Docker doesn't really stop at building for production either. Thanks to the utility of OCI images, it's also pretty good for setting up dev environment boilerplate. Here's a docker-compose file for the same project:
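(A minimal sketch of what such a compose file could look like for a Go backend plus Node frontend with a Postgres dependency; the service names, ports, dev commands, and the database itself are assumptions for illustration, not the actual manifest.)

# Hypothetical docker-compose.yml sketch, not the real one.
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: dev
    volumes:
      - pgdata:/var/lib/postgresql/data

  backend:
    image: golang:1.25-alpine
    working_dir: /app
    volumes:
      - .:/app
    environment:
      DATABASE_URL: postgres://postgres:dev@db:5432/postgres?sslmode=disable
    # swap in a file watcher here if you want true auto-reload
    command: go run ./cmd/server
    ports:
      - "8080:8080"
    depends_on:
      - db

  frontend:
    image: node:alpine
    working_dir: /app
    volumes:
      - .:/app
    command: sh -c "npm ci && npm run dev"
    ports:
      - "5173:5173"

volumes:
  pgdata: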
And if your application is built from the ground up to handle these environments well, which doesn't take a whole lot (basically, just needs to be able to handle configuration from the environment, and to make things a little neater it can have defaults that work well for development), this provides a one-command, auto-reloading development environment whose only dependency is having Docker or Podman installed. `docker compose up` gives you a full local development environment.
I'm omitting some of the more advanced topics, but these are lightly modified real Docker manifests, mainly just reformatted to fewer lines for HN.
I adopted Kubernetes pretty early on. I felt like it was a much better abstraction to use for scheduling compute resources than cloud VMs, and it was how I introduced infrastructure-as-code to one of the first places I ever worked.
I'm less than thrilled about how complex Kubernetes can be, once you start digging into stuff like Helm and ArgoCD and even more, but in general it's an incredible asset that can take a lot of grunt work out of deployment while providing quite a bit of utility on top.
Is there a book like Docker: The Good Parts that would build a thorough understanding of the basics before throwing dozens of ecosystem brand words at you? How does virtualisation not incur an overhead? How do CPU- and GPU-bound tasks work?
I think the key thing here is the difference between OS virtualization and hardware virtualization. When you run a virtual machine, you are doing hardware virtualization: the hypervisor creates fake devices, like a fake SSD, and your virtual machine's kernel speaks to that fake SSD with the NVMe protocol as if it were a real physical SSD. Those NVMe instructions are then translated by the hypervisor into changes to a file on your real filesystem, so your real/host kernel speaks NVMe again to your real SSD. That is where the virtualization overhead comes in (along with having to run that 2nd kernel). This is somewhat helped by using virtio devices or PCIe pass-through, but it is still significant overhead compared to OS virtualization.
When you run docker/kubernetes/FreeBSD jails/solaris zones/systemd nspawn/lxc you are doing OS virtualization. In that situation, your containerized programs talk to your real kernel and access your real hardware the same way any other program would. The only difference is your process has a flag that identifies which "container" it is in, and that flag instructs the kernel to only show/allow certain things. For example "when listing network devices, only show this tap device" and "when reading the filesystem, only read from this chroot". You're not running a 2nd kernel. You don't have to allocate spare ram to that kernel. You aren't creating fake hardware, and therefore you don't have to speak to the fake hardware with the protocols it expects. It's just a completely normal process like any other program running on your computer, but with a flag.
Docker is just Linux processes running directly on the host as all other processes do. There is no virtualization at all.
The major difference is that a typical process running under Docker or Podman:
- Is unshared from the mount, net, PID, etc. namespaces, so they have their own mount points, network interfaces, and PID numbers (i.e. they have their own PID 1.)
- Has a different root mount point.
- May have resource limits set with cgroups.
(And of course, those are all things you can also just do manually, like with `bwrap`.)
There is a bit more, but well, not much. A Docker process is just a Linux process.
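To make that concrete (the container name is made up, and the second command needs util-linux's unshare):

# A running container's namespaces are visible from the host like any
# other process's:
PID=$(docker inspect -f '{{.State.Pid}}' mycontainer)
sudo ls -l /proc/"$PID"/ns

# And you can do the namespacing by hand, no Docker involved: a shell in
# its own PID and mount namespaces only sees itself.
sudo unshare --pid --fork --mount-proc sh -c 'ps aux'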
So how does accessing the GPU work? Well, sometimes there are more advanced abstractions, presumably for stronger isolation, but generally you can just mount in the necessary device nodes and use the GPU directly, because it's a normal Linux process. This is generally what I do.
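For example, plain device pass-through looks something like this (the image names are made up):

# Pass a GPU's device nodes straight into the container (Intel/AMD via DRI):
docker run --rm --device /dev/dri my-gpu-app

# For NVIDIA, the container toolkit wires up the device nodes and driver
# libraries for you:
docker run --rm --gpus all my-cuda-app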
About 25 years here and 10 years embedded / EE before that.
The problem is that containers are made of images, and both the images and Kubernetes itself are incredibly stateful. They need to be stored. They need to be reachable. They need maintenance. And the control responsibility is inverted. You end up with a few problems which I think are not tenable.
Firstly, the state. Neither docker itself nor etcd behind Kubernetes is particularly good at maintaining state consistently. Anyone who runs a large kubernetes cluster will know that once it's full of state, rebuilding it consistently in a DR scenario is HORRIBLE. It is not just a case of rolling in all your services. There's a lot of state, like storage classes, roles, secrets etc, without which nothing works. Unless you have a second cluster you can tear down and rebuild regularly, you have no idea if you can survive a control plane failure (we have had one of those as well).
Secondly, reachability. The container engine and kubernetes require the ability to reach out and get images. This is such a fucking awful idea from a security and reliability perspective it's unreal. I don't know how people even accept this. Typically your kubernetes cluster or container engine has the ability to just pull any old shit off docker hub. That also couples you to that service being up, available and not subject to the whims of whatever vendor figures they don't want to do their job any more (broadcom for example). To get around this you end up having to cache images which means more infrastructure to maintain. There is of course a whole secondary market for that...
Thirdly, maintenance. We have about 220 separate services. When there's a CVE, you have to rebuild, test and deploy ALL those containers. We can't just update an OS package and bounce services or push a new service binary out and roll it. It's a nightmare. It can take a month to get through this and believe me we have all the funky CD stuff.
And as mentioned above, control is inverted. I think it's utterly stupid on this basis that your container engine or cluster pulls containers in. When you deploy, the relationship should be a push because you can control that and mandate all of the above at once.
In the attempt to solve problems, we created worse ones. And no one is really happy.
I get your points but I'm not sure I agree. Kubernetes is a different kind of difficulty but I don't think it's so different from handling VM fleets.
You can have 220 vms instead and need to update all of them too.
They also are full of state and you will need some kind of automatic deployment (like ansible) to make it bearable, just like your k8s cluster.
If you don't configure the network egress firewall, they can also both pull whatever images/binaries from docker hub/internet.
> To get around this you end up having to cache images which means more infrastructure to maintain
If you're not doing this for your VMs packages and your code packages, you have the same problem anyway.
> When there's a CVE
If there is a CVE in your code, you have to build all your binaries anyway. If it's in the system packages, you have to update all your VMs. Arguably, updating a single container and making a rolling deployment is faster than updating x VMs. In my experience updating VMs was harder and more error prone than updating a service description to bump a container version (you don't just update a few packages, sometimes you need to go from Centos 5 to Centos 7/8 or something and it also takes weeks to test and validate).
I mostly agree with you, with the exception that VMs are fully isolated from one another (modulo sharing a hypervisor), which is both good and bad.
If your K8s cluster (or etcd) shits the bed, everything dies. The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
As to updating VMs, again IME, typically you’d generate machine images with something like Packer + Ansible, and then roll them out with some other automation. Once that infrastructure is built, it’s quite easy, but there are far more examples today of doing this with K8s, so it’s likely easier to do that if you’re just starting out.
> If your K8s cluster (or etcd) shits the bed, everything dies.
When etcd and/or kubelet shits the bed, it shouldn't do anything other than halt scheduling tasks. The actual runtime might vary between setups, but typically containerd is the one actually handling the individual pod processes.
Of course, you can also run Kubernetes pods in a VM if you want to, there have always been a few different options for this. I think right now the leading option is Kata Containers.
Does using Kata Containers improve isolation? Very likely: you have an entire guest kernel for each domain. Of course, the entire isolation domain is subject to hardware bugs, but I think people do generally regard hardware security boundaries somewhat higher than Linux kernel security boundaries.
But, does using Kata Containers improve reliability? I'd bet not, no. In theory it would help mitigate reliability issues caused by kernel bugs, but frankly that's a bit contrived, as most of us never or extremely infrequently experience the type of bug it mitigates. In practice, what happens is that the point of failure switches from being a container runtime like containerd to a VMM like qemu or Firecracker.
> The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
The way I see it, mature code is less likely to have surprise showstopper bugs. However, if we're talking qemu + KVM, that's a code base that is also rather old, old enough that it comes from a very different time and place for security practices. I'm not saying qemu is bad, obviously it isn't, but I do believe that many working in high-assurance environments have decided that qemu's age and attack surface is large enough to have become a liability, hence why Firecracker and Cloud Hypervisor exist.
I think the main advantage of a VMM remains the isolation of having an entire separate guest kernel. Though, you don't need an entire Linux VM with complete PC emulation to get that; micro VMs with minimal PC emulation (mostly paravirtualization) will suffice, or possibly even something entirely different, like the way gVisor is a VMM but the "guest kernel" is entirely userland and entirely memory safe.
I think his point is that instead of hundreds of containers, you can just have a small handful of massive servers and let the multitasking OS deal with it
Containers are too low-level. What we need is a high-level batch job DSL, where you specify the inputs and the computation graph to perform on those inputs, as well as some upper limits on the resources to use, and a scheduler will evaluate the data size and decide how to scale it. In many cases that means it will run everything on a single node, but in any case data devs shouldn't be tasked with making things run in parallel because the vast majority aren't capable and they end up with very bad choices.
And by the way, what I just described is a framework that Google has internally, named Flume. 10+ years ago they had already noticed that devs aren't capable of using Map/Reduce effectively because tuning the parallelism was beyond most people's abilities, so they came up with something much more high-level. Hadoop is still a Map/Reduce clone, thus destined to fail at usability.
Are you saying that running your application in a pile of containers somehow helps that problem ..? It's the same problem as CPU scheduling, we just don't have good schedulers yet.. Lots of people are working on it though
Not really? At the moment it's done by some user-land job scheduler. That could be something container based like k8s, something in-process like ray, or a workload manager like slurm.
This is especially aggravating when the os inside the container and the language runtimes are much heavier than the process itself.
I've seen arguments for nano services (I wouldn't even call them micro services), that completely ignored that part. Split a small service into n tiny services, such that you have 10(os, runtime, 0.5) rather than 2(os, runtime, x).
There is no os inside the container. That's a big part of the reason containerization is so popular as a replacement for heavier alternatives like full virtualization. I get that it's a bit confusing with base image names like "ubuntu" and "fedora", but that doesn't mean that there is a nested copy of ubuntu/fedora running for every container.
To be fair, each of those pods can have dedicated, separate external storage volumes, which may actually help, and it's def easier than maintaining 200 or more iSCSI (or whatever) targets yourself
I recently had to parse 500MB to 2GB daily log files into analytical information for sales. Quick and dirty, the application would have needed 64GB of RAM, and the work laptop only has 48GB. After taking time cleaning it up, it was using under 1GB of RAM and worked faster by only retaining records in RAM, if need be, between each day.
It is not about what you are doing, it is always about how you do it.
This was the same with doing OCR analysis of assembly and production manuals. Quick and dirty, it would have taken over 24 hours of processing time; after moving to semaphores with parallelization it took less than two hours to process all the information.
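That kind of cleanup usually boils down to streaming aggregation; a rough sketch in plain shell (the field positions are invented, not their actual code):

# One pass over the file, holding only one running total per day in memory
# rather than every record:
awk -F'\t' '{ sales[$1] += $5 } END { for (d in sales) print d, sales[d] }' daily.log | sort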
In interviews just give them what they are looking for. Don't overthink it. Interviews have gotten so stupidly standardized as the industry at large copied the same Big Tech DSA/System Design/Behavioral process. And therefore interview processes have long been decoupled from the business reality most companies face. Just shard the database and don't forget the API Gateway
100%. Interviews should be a two-way filter. I’m sympathetic to unemployed-and-just-need-something, but also: boy are there a lot of companies hiring data engineers.
Meh .. I've played that game; it doesn't work out well for anyone involved.
I optimize my answers for the companies I want to work for, and get rejected by the ones I don't. The hardest part of that strategy is coming to terms with the idea that I constantly get rejected by people that I think are mostly <derogatory_words_here>, but I've developed thick skin over the years.
I'd much rather spend a year unemployed (and do a ton of painful interviews) and find a company whose values align with mine, than work for a year on a team I disagree with constantly and quit out of frustration.
The company's values may align to yours, even though they reject you. It's because the interview process doesn't need to have anything to do with their real-world process. Their engineers probe you for the same "best practices" that they themselves were constantly probed for in their own interviews. Interviewing is its very own skill that doesn't necessarily translate into real-life performance.
I agree with your observation. My issue is (from experience) it's really hard to tell from the outside if a team's values align with mine. Many teams talk the talk, but don't walk the walk, as the saying goes. It's just easier to not participate than it is to guess, and be wrong.
I also believe that running a broken interview process actively selects for qualities you actually don't want, so it's much more likely that teams conducting those interviews aren't teams I want to work on.
Edit: As credence for my claims, the best team I've ever worked on was a team I did 90%+ of the hiring for, and we didn't do any of the 'typical' interview bullshit most companies do.
What we did instead was sit people down and have deep technical conversations about systems they'd worked on in the past. The candidate would explain, in as much detail as they could muster, a system they'd worked on in the past, down to the lowest level details. Usually, they would talk to us for at least 20-30 minutes, then we (the interviewers) would pose questions, usually starting with the form 'if we changed X, what effect would it have'. Doing interviews in this style makes a few things immediately obvious:
1. Did the candidate have a deep, systemic understanding of the system they worked on?
2. Does the candidate have a good mental model for evaluating change in the system?
That's how I conduct interviews, and unsurprisingly, when I get interviewed like that, my success rate is 100%. I don't think I've ever done an interview like that which did not result in an offer.
Anyways, there's some rambling and unsolicited opinions for you :)
The interview process determines who gets hired, which determines their real-world process. Even if most of their people were hired under a better system, future hires will come in under this one.
This. Most interviewers don't want to do interviews, they have more important jobs to do (at least, that's what they claim). So they learn questions and approaches from the same materials and guides that are used by candidates. Well, I'm guilty of doing exactly this a few times.
You could have learned this if you were better about collecting requirements. You can tell the interviewer "I'd do it like this for this size data, but I'd do it like this for 100x data. Which size should I design this for?" If they're looking for one direction and you ask which one, interviewers will tell you.
I've done that too and, in my experience, people that ask a scaling question that fits on a single machine don't have the capacity to have that nuanced conversation. I usually try to help the interviewer adjust the scale to something that actually requires many machines, but they usually don't get it.
Said another way, how do you have a meaningful conversation about scaling with a person who thinks their application is huge, but in reality only requires a tiny fraction of a single machine? Sometimes, there's such a massive gulf between perception and reality that the only thing to do is chuckle and move on.
Yes, but then how are these people going to justify the money they're spending on cloud systems?... They need to find reasons to maintain their "investment", otherwise they could be seen as incompetent when their solution is proven to be ineffective. So, they have to show that it was a unanimous technical decision to do whatever they wanted in the first place.
I've actually worked on distributed systems that were so broken, I created a script to connect to prod and just create the report from my laptop. My manager offered to buy me a second laptop for running the report since it was easier than getting approval from the architects to get rid of the distributed report system (it only created that one report).
Yeah, I had this problem a couple of times in startup interviews, where the interviewer asked a question I happened to have expertise in and then disagreed with my answer, and clearly they didn't know all that much about it. It's ok, they did me a favor.
It may or may not be related that the places that this happened were always very ethnically monotone with narrow age ranges (nothing against any particular ethnic group, they were all different ethnic monotones)
> I explained, from first principles, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
Probably a better outcome than being hired onto a team where everyone knows you're technically correct but they ignore your suggestions for some mysterious (to you) reason.
Though I do not know the situation at the firm you were interviewing with, if there is some unexpected increase in data volume, or say a job fails on certain days, or you need to do some sort of historical data load (>= 6 months of 1 gig of data per day), the solution of running it on a single VM might not scale. But again, interviews are partially about problem solving, partially about checking compliance, at least for IC roles (in my anecdotal experience).
That being said, yeah, I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need spark, so the team decision was to fit the smaller square peg into a larger square peg and call it a day. In fact, I had spent time refactoring one particular pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow, but it was rejected, so I stopped caring about engineering hygiene.
You can parse JSON at several GB/s: https://github.com/simdjson/simdjson
And you could scale that by one or two orders of magnitude with thread-based parallelism on recent AMD Epyc or Intel Xeon CPUs. So parsing alone should not pose a problem (maybe even sub-second for 6 months of data). We would need a more precise problem statement to judge whether horizontal scaling is needed.
As other commenters pointed out, 1gb/day isn't a problem for storage and retroactive processing until you get to, like, hundreds of years of data. You can chew through a few hundred TB of JSON data in a day, per core + nvme drive.
Regardless, storage and retroactive processing wasn't part of the problem. The problem was explicitly "parse json records as they come in, in a big batch, and increment some integers in a database".
I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1gb/day. You can do a lot with 10TB of memory and 256 cores.
If we are talking about cloud VMs: sure, their cpu performance is atrocious and io can be horrible. This won't scale to infinity
But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and still about four orders of magnitude if we need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work
I have a funny story I need to tell some day about how I could get a 4GB JSON loaded purely in the browser at some insane speed, by reading the bytes, identifying the "\n" then making a lookup table. It started low stakes but ended up becoming a multi-million internal project (in man-hours) that virtually everyone on the company used. It's the kind of project that if started "big" from the beginning, I'd bet anything it wouldn't have gotten so far.
Edit: I did try JSON.parse() first, which I expected to fail and it did fail BUT it's important that you try anyway.
Yes, but I didn't read the full file, I kept the File reference and read the bytes in pages of 10MB IIRC to find all of the line break offsets. Then used those to slice and only read the relevant parts.
“there’s no wrong answer, we just want to see how you think” gaslighting in tech needs to be studied by the EEOC, Department of Labor, FTC, SEC, and Delaware Chancery Court to name a few
let’s see how they think and turn this into a paid interview
I agree - and it's not just what gets you promoted, but also what gets you hired, and what people look for in general.
You're looking for your first DevOps person, so you want someone who has experience doing DevOps. They'll tell you about all the fancy frameworks and tooling they've used to do Serious Business™, and you'll be impressed and hire them. They'll then proceed to do exactly that for your company, and you'll feel good because you feel it sets you up for the future.
Nobody's against it. So you end up in that situation, which even a basic home desktop would be more than capable of handling.
I have been the first (and only) DevOps person at a couple startups. I'm usually pretty guilty of NIH and wanting to develop in-house tooling to improve productivity. But more and more in my career I try to make boring choices.
Cost is usually not a huge problem beyond seed stage. Series A-B the biggest problem is growing the customer base so the fixed infra costs become a rounding error. We've built the product and we're usually focused on customer enablement and technical wins - proving that the product works 100% of the time to large enterprises so we can close deals. We can't afford weird flakiness in the middle of a POC.
Another factor I rarely see discussed is bus factor. I've been in the industry for over a decade, and I like to be able to go on vacation. It's nice to hand off the pager sometimes. Using established technologies makes it possible to delegate responsibility to the rest of the team, instead of me owning a little rat's nest fiefdom of my own design.
The fact is that if 5k/month infra cost for a core part of the service sinks your VC backed startup, you've got bigger problems. Investors gave you a big pile of money to go and get customers _now_. An extra month of runway isn't going to save you.
The issue is when all the spending gets you is more complexity, maintenance, and you don't even get a performance benefit.
I once interviewed with a company that did some machine learning stuff, this was a while back when that typically meant "1 layer of weights from a regression we run overnight every night". The company asked how I had solved the complex problem of getting the weights to inference servers. I said we had a 30 line shell script that ssh'd them over and then mv'd them into place. Meanwhile the application reopened the file every so often. Zero problems with it ever. They thought I was a caveman.
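For the curious, that kind of script is roughly this shape (hostnames and paths invented; the trick is that mv on the same filesystem is atomic, so the app never reopens a half-written file):

#!/usr/bin/env bash
# Hypothetical sketch of the "ssh the weights over and mv them into place" approach.
set -euo pipefail

WEIGHTS=model_weights.bin
HOSTS=(inference-01 inference-02 inference-03)   # made-up hostnames

for h in "${HOSTS[@]}"; do
  # Copy to a temp name first, then rename; the rename is atomic on the
  # same filesystem, so readers only ever see the old or the new file.
  scp "$WEIGHTS" "$h:/srv/model/.weights.tmp"
  ssh "$h" 'mv /srv/model/.weights.tmp /srv/model/weights.bin'
done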
I work for a small company with a handful of devs. We don't have a dedicated devops person, so I do it all. Everything is self-hosted. Been that way for years. But, yeah, if I go on vacation and something goes screwy, the business is hosed. However, even if it were hosted on AWS or elsewhere, it would not be any better. If anything, it may be worse. Instead of a person being well versed in standards based tech, they'd have to be an AWS expert. Why would we want that?
I have recently started using terraform/tofu and ansible to automate nearly all of the devops operations. We are at a point where Claude Code can use these tools and our existing configs to make configuration changes, debug issues by reviewing logs etc. It is much faster at debugging an issue than I am and I know our stuff inside and out.
I am beginning to think that AI will soon force people to rethink their cloud hosting strategy.
I identify as a caveman and I fucking love it. I build a 250k sloc C++ project hundreds of times a day with a 50 line bash script. Works every time, on any machine, everywhere.
Those scripts have logs, right? Log a hostname and path when they run. If no one thinks to look at logs, then there's a bigger problem going on than a one-off script.
That becomes a problem if you let the shell script mutate into an "everything" script that's solving tons of business problems. Or if you're reinventing kubernetes with shell scripts. There's still a place for simple solutions to simple problems.
You can literally have a 20 line Python script on cron that verifies if everything ran properly and fires off a PagerDuty if it didn't. And it looks like PagerDuty even supports heartbeat so that means even if your Python script failed, you could get alerted.
Which is why you take the time to put usage docs in the repo README, make sure the script is packaged and deployed via the same methods that the rest of the company uses, and ensure that it logs success/failure conditions. That's been pretty standard at every organization I've been at my entire professional career. Anyone who can't manage that is going to create worse problems when designing/building/maintaining a more complex system.
Yah. A lot of the complexity in data movement or processing is unneeded. But decent standardized orchestration, documentation, and change management isn't optional even for the 20 line shell script. Thankfully, that stuff is a lot easier for the 20 line standard shell script.
Or python. The python3 standard library is pretty capable, and it's ubiquitous. You can do a lot in 50-100 lines (counting documentation) with no dependencies. In turn it's easy to plug into the other stuff.
I've seen the ramifications of this "CV first" kind of engineering. Let's just say that it's a bad time when you're saddled with tech debt solely from a handful of influential people that really just wanted to work elsewhere.
This. It is resume-driven development. Especially at startups where the engineers aren't compensated well enough or don't believe the product can succeed.
I’m not hiring anymore, but when I was, all I wanted to find was someone that knew the fundamentals (and was a good ’attitude fit’ as per the similarly titled book). Sorry @wccrawford, I wish we could have more places that value slow, boring tech — aside from banking/insurance?
I have hung on to my job for many years now because of being in a similar situation in regards to trying to do the right thing and the fear of not being hire-able.
There is something wrong with the industry in chasing fads and group think. It has always been this way. Businesses chased Java in the late 90s, early 00s. They chased CORBA, WSDL, ESB, ERP and a host of other acronyms back in the day.
More recently, Data Lake, Big Data, Cloud Compute, AI.
Most of the executives I have met really have no clue. They just go with what is being promoted in the space because it offers a safety net. Look, we are "not behind the curve!". We are innovating along with the rest of the industry.
Interviews do not really test much for ability to think and reason. If you ran an entire ISP, if you figured out, on your own, without any help, how to shard databases, put in multiple layers of redundancy, caching... well, nobody cares now. You had to do it in AWS or Azure or whatever stack they have currently.
Sadly, I do not think it will ever be fixed. It is something intrinsic to human nature.
Yeah, I probably need to push this harder now. I did actually join 1 project recently and got to the point that I felt I could add 1 more common thing to my resume, and that felt good. (Getting something done felt good, too.)
But getting to the point that I feel confident in certain frameworks is going to be hard. I'll figure it out somehow, I'm sure.
This exactly. Actual doers are most of the time not rewarded, meanwhile the AWS senior sucking Jeff's wiener specialist gets a job doing nothing but generating costs and leaves behind more shit after his 3 years, moving up the ladder to some even bigger bs pretend consulting job at an even bigger company. It's the same bs mostly for developers. I rewrite their library from TS to Rust and it gains them 50x performance increases and saves them 5k+ a week over all their compute now, but nobody gives a shit and I do not have a certification for that to show off on my LinkedIn. Meanwhile my PM did nothing, got paid to do some shitty certificate, and then gets the credit and the certificate and pisses off to the next bigger fish collecting another 100k more, meanwhile I get a 1k bonus and a pat on the shoulder. Corporate late stage capitalism is complete fucking bs and I think about becoming a PM as well now. I feel like a fool and betrayed. Meanwhile they constantly threaten to lay off or outsource our team, as they say we are too expensive in a first world country and can easily find just as good people in India etc. What a time to be alive.
If you're willing and able to promote yourself internally, you can make people give a shit, or at least publicly claim they do. That's 260k+ per year, and even big businesses are going to care about that at some level, especially if it's something that can be replicated. Find 10 systems you can do that with, and it's 2.6m+ per year.
But, if you don't want to play the self-promotion game, yeah someone else is going to benefit from your work.
Try Rust? The system programming world isn't very bullshit-infested and Rust is trendy (which is good for a change), also employers can't realistically expect many years of Rust experience.
Need training and something to show? Contribute to some FOSS project.
Yep, and a lot more datasets fit entirely into RAM now. Ignoring the recent price spikes for a moment, 128GB of RAM in a laptop is entirely achievable and not even the limit of what is possible. That was a pipe dream in 2014 when computers with only 4GB were still common. And of course for servers the max RAM is much higher, and in a lot of scenarios streaming data off a fast local SSD may be almost as good.
I have actually worked in a company as a consultant data guy in a non-technical team. I had a 128 GB PC 10 years back and did everything with open source R then, and it worked! The others thought it was wizardry.
I’ve seen this pattern play out before. The pushback on simpler alternatives stems from a legitimate need for short time to market on the demand side of the equation and a lack of knowledge on the supply side. Every time I hear an engineer call something hacky, they are at the edge of their abilities.
systemd would be a derail even if you weren’t misrepresenting the situation at several levels. Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws, whereas in this case it’s a different scenario where the extra functionality is both unneeded and making it worse at the core task.
> Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws
That's funny. I used to have to clean up the messes caused by systemd's design limitations and flaws, until I built my own distro with a sane init system installed.
Many of the noobs groaning about the indignity of shell scripts don't even realize that they could write init 'scripts' in whatever language they want, including Python (the language these types usually love so much, if they do any programming at all.)
I think you’d have a more fruitful discussion if you stopped trying to call people noobs when they don’t agree with you.
For example, I’ve been dealing with SysV since the early 90s and while it’s gotten better since we no longer have to support the really bizarre Unix variants, my problem with init scripts wasn’t “indignity” but the lack of consistency across distributions and versions, which affects anyone shipping software professionally (“can’t do this easily until $distro upgrades coreutils”), and from an operator’s perspective using Python doesn’t make that better because instead of supporting one consistent thing you’d end up with the subset of features each application team felt like implementing, consistent only to the extent that they care to follow other projects. One virtue of systemd is that having a single common way to specify dependencies, restarts, customization, etc. avoids the ops people having to learn dozens of different variations of the same ideas and especially how to deal with their gaps. A few years back, a data center power outage at one place I worked really highlighted that: the systemd-based servers recovered quickly because they actually had working retries; all of the older stuff using SysV had to be manually reviewed because there were all kinds of problems like races on dependencies like DNS or NFS, retry logic which failed hard after a short period of time, failures because a stale PID file wasn’t removed, or cases where a vendor had simply never implemented retries in their init scripts. While in theory you can handle all of those in SysV most people never did.
After a couple decades of that, a lot of us don’t want to spend time on problems Microsoft solved in Bill Clinton’s first term.
I hate to blather on about systemd in this decade but how in the world does creating something completely different than sysv init help people shipping software? Now they have to support yet another init scheme.
Prior to all of the important distributions consolidating on systemd, you had to support each distribution’s convention for customization, overrides, dependencies, conventions for things like changing users or locations for PID files, not to mention the differences in various shell tools.
Nothing insurmountable but it meant init files were inevitably much longer than the corresponding Upstart or systemd files despite doing less, and anytime we shipped a new version you had more testing since you had to implement a lot of functionality which is built in to other things.
I just created my own OS, with my own init system that does things how I think it should be done--and it does it every time, without the bizarre bugs that come from Linux Puttering's shitware code.
It's the same thing any corporation should be doing if they were smart, instead of outsourcing everything to RedHat, Microsoft, Google, etc.
The reality is unit files are more portable than init scripts, regardless of what anyone says.
Systemd unified and simplified administration across a lot of distributions. Before, it was a hodge podge, and there was a lot of knowledge lost going from rhel to Debian.
It's entirely possible that both SysV init and systemd suck for different reasons. I'm still partial to systemd since it takes care of daemons and supervision in a way that init does not, but I'll take s6 or process-compose or even supervisord if I have to. Horses for courses.
I want to love s6 but every time I see the existence of s6-rc-compile I get heated. I'm sure there are excellent reasons behind it but I personally don't want services to work that way.
Yah, that does look awfully baroque. My experience with s6 has largely been minor tweaks to an existing setup where the complexity was hidden away from me. I used to use runit for managing daemons, but nowadays my supervisor of choice is docker compose. process-compose does look enticing though, and the Nix world seems pretty fond of it.
Specifying system processes and their dependencies declaratively, rather than in a tangle of arbitrary executable code, is cleaner, more efficient, easier to use, and more auditable. And that's not even getting into the additional process management duties systemd assumes.
You can write arbitrary scripts into systemd... or like one step removed at most? That's not really a difference unless you have some nuance in mind that I don't.
I honestly do not like systemd, either. It is okay for managing processes but I wish it didn't spread into everything else in the machine.
Or if it must, could it actually work cohesively across their concepts? Would be nice to have an obvious and easy way to run Quadlet as its own user to isolate further, would be nice to have systemd-sysusers present in /etc/subuid so they can run containers.
I like what they are doing with atomic distros. It would be great to have a single file declarative setup for something like running a containerized reverse HTTP proxy with an isolated user. Instead of "atomic" but you manually edit files in /etc after install.
Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads, I imagine it's often faster than the command line tools, and with easier to understand syntax
More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL.
I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. Doing it with a code-based solution (read data into objects in memory) is much more challenging to view the data. Debugging tools to inspect the objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
Yep. It’s literally what SQL was designed for; your business website can be running it… then you write a shell script to also pull some data on a cron. It’s beautiful
Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism, but being streams that process "lazily".
On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.
Pipelines are indeed one flow, and that works most of the time, but shell scripts make parallel tasks easy too. The shell provides tools to spawn subshells in the background and wait for their completion. Then there are utilities like xargs -P and make -j.
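For instance (the file layout and jq field are invented), either of these fans the work out across cores with nothing but the shell:

# Background subshells + wait:
for f in logs/2024-*.json; do
  ( jq -r '.user_id' "$f" | sort -u > "out/$(basename "$f").users" ) &
done
wait

# Or let xargs manage a worker pool, one process per core:
find logs -name '2024-*.json' -print0 | xargs -0 -P "$(nproc)" -n1 gzip -k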
UNIX provides the Makefile as go-to tool if a simple pipeline is not enough. GNUmake makes this even more powerful by being able to generate rules on-the-fly.
If the tool of interest works with files (like the UNIX tools do) it fits very well.
If the tool doesn't work with single files I have had some success in using Makefiles for generic processing tasks by creating a marker file that a given task was complete as part of the target.
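A tiny sketch of that marker-file trick (load_into_db and generate_report are invented commands; recipe lines must be tab-indented):

# The load step has no natural output file, so touch an empty marker that
# records "this already ran" for make's dependency tracking.
.loaded: raw_data.csv
	load_into_db raw_data.csv
	touch .loaded

report.pdf: .loaded
	generate_report --out report.pdf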
I think it’s not so much engineers actually setting up a distributed compute, as it is dropping a credit card into a paid cloud service, which behind the scenes sets up a distributed compute cluster and bills you for the compute in an obfuscated way, then gives a 20% discount + SSO if you sign up for annual enterprise plan.
This kind of practice is insidious because early on, they charge $20/month to get started on the first 100mb of log ingestion, and you can have it up and running in 30 seconds with a credit card. Who would turn that down?
Revisit that set up 2 years later and it’s turned into a 60k/y behemoth that no one can unwind
On the contrary, the key message from the blog post is not to load the entire dataset to RAM unless necessary. The trick is to stream when the pattern works. This is how our field routinely works with files over 100GB.
For a dataset that lives in RAM, the best solutions are DuckDB or clickhouse-local.
Using SQLish data is easier than a bunch of bash scripts and really powerful.
Another alternative is Exasol that is factors (>10x) faster than Clickhouse and scales much better for complex analytics workloads that joins data. There is a free edition for personal use without data limit that can run on any number of cluster nodes.
If you just want to read and analyze single table data, then Clickhouse or DuckDB are perfect.
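For example, a one-off aggregation with the DuckDB CLI is a single pipe (file names and columns are made up):

# Query a pile of CSVs straight from the shell, no cluster required:
echo "
  SELECT status, count(*) AS hits
  FROM read_csv_auto('logs/2024-*.csv')
  GROUP BY status
  ORDER BY hits DESC;
" | duckdb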
The issue is you can run sub-TiB jobs on a few small/standard instances with better tooling. Spark and Hadoop are for when you need multiple machines.
Dbt and airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain and correct data issues and keep your data transforms timely.
edit: a little surprised at multiple downvotes. My point is, you can run airflow and dbt on small instances, and you can do all your data processing on small instances with tools like duckdb or polars.
But it is very useful to use a tool like dbt that allows you to re-build and manage your data in a clear way, or a tool like airflow which lets you specify dependencies for runs.
After say 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to pay off.
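Concretely, that is a one-liner with dbt's graph selectors (the model name is made up):

# Rebuild stg_orders and everything downstream of it:
dbt build --select stg_orders+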
Agreed, airflow and dbt have literally nothing to do with the size of the data and can be useful, or overkill, at any size. Dbt just templates the query strings we use to query the data and airflow just schedules when we query the data and what we do next. The fact that you can fit the whole dataset in duckdb without issue is kind of separate to these tools, we still need to be organised about how and when we query it.
dbt is super useful for building a dag and managing pieces of it that update on different schedules. eg with one dataset that's refreshed monthly and another daily, you can only rebuild the daily one unless the slower-cadence input has a new update.
Because developers are incentivized to have marketable software skills, not marketable "build things that are cheap and profitable" skills.
Moore's law was supposed to make it simpler and cheaper to do more computationally expensive tasks. But in the meantime, everyone kept inflating the difficulty of a task faster than Moore could keep up.
I think some of this is because of the incredible amounts of capital that startups seem to be able to acquire. If startups had to demonstrate profitability before they were given any money to scale, the story would be very different I think.
Our lot burns a fortune on snowflake every month but no one is using it. Not enough data is being piped into it and the shitty old reports we have which just run some SQL work fine.
It looked good on someone’s resume and that was it. They are long gone.
> because setting up a 'Modern Data Stack' is what gets you promoted
It’s not just that, it’s that you better know their specific tech stack to even get hired. It’s a lot of dumb engineering leaders pretending that AWS, Azure and Snowflake are such wildly different ecosystems that not having direct experience in theirs is disqualifying (for pure DE roles, not talking broader sysadmin).
The entire data world is rife with people who don’t have the faintest clue what they’re doing, who really like buzzwords, and who have never thought about their problem space critically.
Well. I try for a middle ground. I am currently ditching both airflow and dbt. In Snowflake, I use scheduled tasks that call stored procedures. The stored procedures do everything I need to do. I even call external APIs like Datadog’s and Okta’s and pull down the logs directly into snowflake. I do try to name my stored procedures with meaningful names. I also add generous comments including urls back to the original story.
I forgot to mention that in Snowflake, besides cron-scheduled tasks, you can add dependent tasks that only run if the previous task succeeded. I have 40 tasks chained together that way. Each of my tasks calls a stored procedure. Within each procedure, I have a try/catch and a catch-all clause that raises an error.
"I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'."
Also seen strange responses from HN commenters when it's mentioned that bash is large and slow compared to ash and bash is better suited for use as an interactive shell whereas ash is better suited for use as a non-interactive shell, i.e., a scripting shell
I also use ash (with tabcomplete) as an interactive shell for several reasons
I see this at work too. They are ingesting a few GB per day but running the data through multiple systems. So the same functionality we delivered with a python script within a week now takes months to develop and constantly breaks.
On the other hand, now we have duckdb for all the “small big data”, and a slew of 10-100x faster than Java equivalent stuff in the data x rust ecosystem, like DataFusion, Feldera, ByteWax, RisingWave, Materialize etc
None of the systems I mentioned existed at the time the article was published. I think the author would love duckdb which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats. It fits in great with other Unix CLI stuff.
Many of the projects I mentioned you could see as a response to OP and the 2015 “Scalability, but at what COST?” paper which benchmarked distributed systems to see how many cores they need to beat a single thread. (https://news.ycombinator.com/item?id=26925449)
> None of the systems I mentioned existed at the time the article was published
So Hadoop was doing distributed compute wrong but now they have it figured out?
The point is that there is enormous overhead and complexity in going distributed in any kind of system. And your computer has a lot of power you probably aren’t maxing out.
> which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats.
Yeah, I'm a big fan of SQLite :). But at analytical workloads like aggregating every row, DuckDB will outperform SQLite by a wide margin. SQLite is great stuff but it’s not a very good data Swiss Army knife because it’s very focused on a single core competency: embeddable OLTP with a simple codebase. DuckDB can read/write many more formats from local disk or via a variety of network protocols. DuckDB also embeds SQLite so you can use it with SQLite DBs as inputs or outputs.
> they were doing distributed compute wrong but now they have it figured out?
Like anything the future is here but it’s unevenly distributed. Frank McSherry, the first author of “Scalability but at what COST” wrote Timely Dataflow as his answer to that question. ByteWax is based on Timely as is Materialize. Stuff is still complex but these more modern systems with performance as their goal are orders of magnitude better than the Hadoop era Java stuff.
I call BS on those Rust 10-100x claims. Rust and Java are roughly equal in performance. It is just that there are a lot of old NoSQL frameworks in Java which are trash. I also checked out those companies, some of which are doing interesting stuff. None claim things are 100x faster because of Rust. You just hurt your credibility when you say such clearly false things. That's how you end up with a Hadoop cluster which is 236x slower than a batch script.
PS: None of the companies you linked seem to be using a datapath architecture, which is the key to the highest level of performance.
It wasn’t my intention to say “this stuff is 100x faster because rust”. DuckDB is C++. My intention was to draw distinction between the Java/Hadoop era of cluster and data systems, and the 2020s era of cluster and data systems, much of which has designs informed by stuff like this article / “Scalability but at what COST?”. I guess instead of “faster” I should say “more efficient”.
For example, the Kafka ecosystem tends to use Avro as the data transfer serialization, which needs a copy/deserialization step before it can be used in application logic. Newer stream systems like Timely tend to use zero-copy-capable data transfer formats (Timely's is called Abomonation), and it's the same idea in Cap'n Proto or FlatBuffers - it's infinitely faster to not copy the data as you decode! In my experience this kind of approach is more accessible in systems languages like C++ or Rust, and harder to do in GC languages where the default approach to memory layout and memory management is “don’t worry about it.”
If airflow is a layer of abstraction something is wrong.
Yes, it is an additional layer, but if your orchestration starts concerning itself with what the tasks are actually doing, then something is wrong. It is not a layer on top of other logic; it is a single layer where you define how to start your tasks, how to tell when something is wrong, and when to run them.
If you don't insist on doing heavy computations within the Airflow worker, it is dirt cheap. If it's something that can easily be done in bash or Python, you can do it within the worker as long as you're willing to throw a minimal amount of hardware at it.
This times a zillion! I think there's been a huge industry push to convince managers and more junior engineers that Spark and distributed tools are the correct way to do data engineering.
I think it's a similar pattern to how web dev influencers have convinced everyone to build huge hydrated-SPA-framework craziness where a static site would do.
My advice to get out of this mess:
- Managers, don't ask for specific solutions (Spark, React). Ask for clever engineers to solve problems and optimise / track what you care about (cost, performance, etc.). You hired them to know best, and they probably do.
- Technical leads, if your manager is saying "what about hyperscale?", you don't have to say "our existing solution will scale forever". It's fine to say, "our pipelines handle datasets up to 20GB, we don't expect to see anything larger soon, and if we do we'll do x/y/z to meet that scale". Your manager probably just wants to know scaling isn't going to crash everything, not that you've optimised the hell out of everything for your Excel spreadsheet processing pipeline.
Here’s the thing though, most companies work with small data. The distribution of data set size follows a power law which means that few engineers get to work with petabyte sized datasets. That said, the job market definitely incentivizes people to have that kind of experience on their resume if they want to keep progressing in salary. This incentivizes over engineering.
> the job market definitely incentivizes people to have that kind of experience on their resume
Yeah, this is sadly often true, but it's also another trap that people don't have to fall into.
I've been involved with hiring data engineers, and I see experience with distributed computing way more often than I see knowledge of simple profiling and debugging tools. But I'd personally value the latter a lot more when interviewing.
Companies that hire for skills they don't need are of course perpetuating this problem, but they're also paying a big "tax" in the sense that they're not hiring for the skills they actually do need.
Absolutely, when I worked at (semi-well-known unicorn) a half-dozen years ago on the data-engineering team the manager told me "Hey we want to use spark next quarter, that's a huge initiative."
And I immediately asked, "in what capacity?" And the answer was don't-know/doesn't-matter, it's just important that we can say we're using it. I really wish I understood where that was coming from (his manager resume-building? somebody getting a kickback?)
The most interesting part is that you can say you're doing/using something entirely independently of whether you actually are. Sure, that's a lie, but so is only using something so you can say you're using it (sure, they admitted to you that was the reason, but that won't be the reason they put on LinkedIn).
It's great to see this post I wrote years ago still being useful for people.
I agree with many here that the situation is arguably worse in many ways. However, along similar lines, I've been pleased to see a move away from cargo culting microservices (another topic I addressed in a separate post on that site).
To all those helping companies and teams improve performance, keep it up! There is hope!
Working on mrjob was a big part of my first job out of college. Fun to see it get mentioned more than ten years later.
What some commenters don't realize about these bureaucratic IO-heavy expensive tools is that sometimes they are used in order to apply a familiar way of thinking, which has Business Benefits. Sometimes you don't know if your task will take seconds, minutes, hours, days, or weeks on one fast machine with a well-thought-out program, but you really need it to take at most hours, and writing well-thought-out-programs takes time you could spend on other stuff. If you know you can scale the program in advance, it's lower risk to just write it as a Hadoop job and be done with it. Also, it helps to have an "easy" pattern for processing Data That Feels Big Even If It Isn't That Big, Although Yelp's Data Actually Was Big. Such was the case with mrjob stuff at Yelp in 2012. They got a lot of mileage out of it!
The other funny thing about mrjob is that it's a layer on Hadoop Streaming, which is a term for when the Java process actually running the Hadoop worker opens a subprocess to your Python script which accepts input on stdin and writes output on stdout, rather than working on values in memory. A high I/O price to pay for the convenience of writing Python!
That's a good point. Hadoop may not be the most efficient way, but when a deliverable is required, Hadoop is a known quantity and really works.
I did some interesting work ten years ago, building pipelines to create global raster images of the entire Open Street Map road network [1]. I was able to process the planet in 25 minutes on a $50k cluster.
I think I had the opposite problem: Hadoop wasn't shiny enough and Java had a terrible reputation in academic tech circles. I wish I'd known about mrjob because that would have kept the Python maximalists happy.
I had lengthy arguments with people who wanted to use Spark which simply did not have the chops for this. With Spark, attempting to process OSM for a small country failed.
Another interesting side-effect of using the map-reduce paradigm was with processing vector datasets. PostGIS took multiple days to process the million-vertex Norwegian national parks, however splitting the planet into data density sensitive tiles (~ 2000 vertices) I could process the planet in less than an hour.
Then Google Earth Engine came along and I had to either use that, or change career. Somewhat ironically GEE was built in Java.
I’ve also seen some Really Really Bad software due to engineers having “Not Invented Here” syndrome. If it takes using big well known frameworks to avoid some of that it’s worth the cost.
A little bit of history related to the article for any who might be interested...
mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.
Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.
I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
The bigness of your data has always depended on what you are doing with it.
Consider the following table of medical surgeries: date, physician_name, surgery_name, success.
"What are the top 10 most common surgeries?" - easy in bash
"Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash
"Which surgeries are most affected by physician experience?" - very hard in bash, requires calculating for every surgery how many times that physician had performed that surgery on that day, then compare low and high experience outcomes.
A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity. A 50GB dataset might be 'bigger' than a 2TB one if you are asking tough questions.
It's easier for a business to say "we use Spark for data processing", than "we build bespoke processing engines on a case by case basis".
50GB and 2TB are both sizes that SQLite supports and could handle. You could probably solve all of the problems you mentioned with simple tools on a single server, in the language of your choice.
When I worked as a data engineer, I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON at 10s of MB/s - creating a huge bottleneck.
By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).
Just how much data do you need when these sort of clustered approaches really start to make sense?
> I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON
Hah, incredibly funny, I remember doing the complete opposite about 15 years ago, some beginner developer had setup a whole interconnected system with multiple processes and what not in order to process a bunch of JSON and it took forever. Got replaced with a bash script + Python!
> Just how much data do you need when these sort of clustered approaches really start to make sense?
I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.
Yeah, I realize now that my comment actually misses the most important point, the "interconnected system with multiple processes" I was talking about was made in C#, that's why the whole "I did the reverse as you" was funny to me in the first place.
I like the peer comment's answer about a processing time threshold (e.g., a day). Another obvious threshold is data that doesn't conveniently fit on local disks. Large scale processing solutions can often process directly from/to object stores like S3. And if it's running inside the same provider (e.g., AWS in the case of S3), data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.
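For example (bucket and key are made up), an object can be streamed straight out of S3 into an ordinary pipeline without ever landing on local disk:

    # The AWS CLI writes to stdout when the destination is "-".
    aws s3 cp s3://my-bucket/logs/2014-06-01.json.gz - | zcat | grep -c '"status": 500'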
> data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.
In practice most AWS instances are 10Gbps capped. I have seen ~5Gbps consistently read from GCS and S3. Nitro-based instances are in theory 100Gbps capable; in practice I've never seen that.
Also, anything under 16 vCPUs generally has baseline / burst bandwidth, with the burst being best-effort, 5-60 minutes.
This has, at multiple companies for me, been the cause of surprise incidents, where people were unaware of this fact and were then surprised when the bandwidth suddenly plummeted by 50% or more after a sustained load.
I remember a panel once at a PyCon where we were discussing, I think, the Anaconda distribution in the context of packaging, and a respected data scientist (whose talks have always been hugely popular) made the point that he doesn't like Pandas because it's not Excel. The latter was his go-to tool for most of his exploratory work. If the data were too big, he'd sample it and things like that, but in the end his work was done in Excel.
Quick Python/bash to cleanup data is fine too I suppose and with LLMs, it's easier than ever to write the quick throwaway script.
I took a biostatistics class. The tools were Excel, R, or Stata.
I think most people used R. Free and great graphing. Though the interactivity of Excel is great for what-ifs. I never got R till I took that class. Though RStudio makes R seem like scriptable Excel.
R/Python are fast enough for most things, though a lot of genomic stuff (BLAST alignments etc.) is in compiled languages.
How do you stream parse json? I thought you need to ingest it whole to ensure it is syntactically valid, and most parsers don't work with inchoate or invalid json? Or at least it doesn't seem trivial.
I don't know what the GP was referring to, but often this is about "JSONL" / "JSON Lines" - files containing one JSON object per line. This is common for things like log files. So, process the data as each line is deserialized rather than deserializing the entire file first.
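As a sketch (the field name is invented), that kind of file streams through ordinary tools one record at a time, so nothing ever has to hold the whole dataset:

    # jq reads one JSON document per line and emits results as it goes.
    zcat logs.jsonl.gz | jq -r '.user_id' | sort | uniq -c | sort -rn | head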
I used Newtonsoft.Json which takes in a stream, and while it can give you objects, it can also expose it as a stream of tokens.
The bulk of the data was in big JSON arrays, so you basically consumed the array-start token, then used the parser to consume one entire object at a time (which the deserializer turned into a C# object), then consumed a comma or the array-end token, until you ran out of tokens.
I had to do it like this because the DSes were running into the problem that some of the files didn't fit into memory. The previous approach took 1 hour and involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10GB, even 64GB of memory wasn't enough and the system started swapping).
It wasn't fast even before swapping (I learned just how slow Python can be), but at that point it basically took a day to run a single experiment. Then the data got turned into a dataframe.
I replaced that part of the Python processing and output a CSV which Pandas could read without having to go through Python code (I guess it has an internal optimized C implementation).
The preprocessor was able to run on the build machines and DSes consumed the CSV directly.
This sounds similar to how in C#/.NET there are (at least) 3 approaches to reading XML: XmlDocument, XPathDocument, or XmlReader. The first 2 are in-memory object models that must parse the entire document to build up an object hierarchy, through which you then access object-oriented representations of XML constructs like elements and attributes. XmlReader is stream-based, where you handle tokens in the XML as they are read (forward-only).
Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET built-in parsing) has a similar token-based reader in addition to the standard (de)serialization to objects approach.
I'm going to go out on a limb and say no - this library seems to do the parsing in Python, and Python is slow, like many times slower than Java, C# or languages in that class - which you find out if you try to do heavy data processing with it, and which is one of the reasons I dislike the language. It's also very hard to parallelize - in C#, if you feed stuff into LINQ and the entries are independent, you can make the work parallel with PLINQ very quickly, while threads aren't really a thing in Python (or at least they weren't back then).
I've seen so many times that data processing quickly became a bottleneck and source of frustration with Python that stuff needed to be rewritten, that I came to not bother writing stuff in Python in the first place.
You can make Python fast by relying on NumPy and pandas with array programming, but it can be quite challenging to format and massage the data so that the things you want can be expressed as array-programming ops; it usually became too much of a burden for me.
I wish Python was at least as fast as Node (which also can have its own share of performance cliffs)
It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.
> native code parsing speedups for most common platforms
Which is to say, roughly analogous to "relying on NumPy". (A well-designed system avoids repeatedly calling from Python to C and prefers to let loops live within the C code; that applies at least as much to tree-like data as array-like data.)
> I wish Python was at least as fast as Node (which also can have its own share of performance cliffs)
> It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.
No guarantees, but have you tried PyPy? It's existed since 2007 and definitely improved over time.
I would say that "performance cliffs" are just endemic to programming. Even in C you find people writing bad algorithms because better ones seem (at least superficially) much harder to write — especially if the good algorithm requires, say, a hash table. (C++ standard library containers definitely ameliorate this effect, but you pay in code complexity, especially where templates are needed.) And on the other hand you sometimes see big improvements from dropping to assembly (cf. ffmpeg).
You assume it is valid, until it isn't and you can have different strategies to handle that, like just skipping the broken part and carrying on.
Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser eagerly spits out a stream of data that matches the query as soon as it finds it.
The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.
There's a whole heap of approaches, each with their own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid json.
You can buffer data, or yield as it becomes available before discarding, or use the visitor pattern, and others.
> Just how much data do you need when these sort of clustered approaches really start to make sense?
You really need an enormous amount of data (or data processing) to justify a clustered setup. Single machines can scale up quite a lot.
It'll cost money, but you can order a 24x128GB RAM, 24x30TB SSD system which will arrive in a few days and give you 3 TB of RAM and 720 TB of (fast) disk. You can go bigger, but it'll be a little exotic and the ordering process might take longer.
If you need more storage/RAM than around that, you need clustering. Or if the processing power of your single system isn't enough, you would need to cluster, but ~256 cores of CPU is enough for a lot of things.
It's not about how much data you have, but also the sorts of things you are running on your data. The cost of joins and group-bys scales much faster than any plain aggregation. Additionally, you have a unified platform where large teams can share code in a structured way for all data processing jobs. It's similar to how companies use k8s as a way to manage the human side of software development, in that sense.
I can, however, say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key areas where we saw rapid improvement was simply using fewer machines: vertically scaled hardware almost always outperformed any sort of distributed system (albeit not always from a price-performance perspective).
The real value often comes from the ability to do retries, leverage leftover underutilized hardware (i.e. spot instances, or your own data center at times when scale is lower), handle hardware failures, etc., all while the full suite of tools above keeps working.
Disagree, though in practice it depends on the query, cardinality of the various columns across table, indices, and RDBMS implementation (so, everything).
A simple equijoin with high cardinality and indexed columns will usually be extremely fast. The same join in a 1:M might be fast, or it might result in a massive fanout. In the case of the latter, if your RDBMS uses a clustering index, and if you’ve designed your schemata to exploit this fact (e.g. a table called UserPurchase that has a PK of (user_id, purchase_id)) can still be quite fast.
Aggregations often imply large amounts of data being retrieved, though this is not necessarily true.
That level of database optimization is rare in practice. As soon as a non-database person gets decision making authority there goes your data model and disk layout.
And many important datasets never make it into any kind of database like that. Very few people provide "index columns" in their CSV files. Or they use long variable length strings as their primary key.
OP pertains to that kind of data. Some stuff in text files.
Unconvinced. Any join needs some kind of seek on the secondary relation's index, or a bunch of state if you're stream-joining and have to build a temporary index of size O(n) until the end of the batch. On the other hand, summing N numbers needs O(1) memory, and if your data is column-shaped it's like one CPU instruction to process 8 rows. In a “big data” context there's usually no traditional b-tree index to join on either. For jobs that process every row in the input set, Mr Join is horrible for perf, to the point that people end up with a dedicated join job/materialized view so downstream jobs don't have to redo the work.
An aggregation is less work than a join. You are segmenting the data in basically the same way under ideal conditions for a join as you are for an aggregation. Think of an aggregation as an inner join against a table of buckets (plus updating a single value instead of keeping a number of copies around). In practice this holds, with aggregation being a linear amount faster than a join over the same data. That delta is the extra work the join needs to do to keep around a list of rows rather than a single value being updated (and staying in cache) repeatedly. Depending on the data this delta might be quite small. But unless you have a very obtuse aggregation function (kurtosis, perhaps), the aggregation will be faster. It's updating a single value vs appending to a list, with the extra memory overhead that introduces.
You didn't need to rewrite it in C# to do that - Python should be able to handle streaming that amount/velocity of data fine, at least through a native extension like msgspec or pydantic. Additionally, you made it much harder for the other data engineers who need to maintain/extend the project in the future to do so.
The C# version is probably far more maintainable and less error-prone than the Python. At least in my experience that's almost always the case.
I've had plenty of Python jobs which run fine for several hours and then break with runtime errors, whereas with C# you can be reasonably sure that if it starts running it will finish running.
Not a language problem, it's a dev culture problem. You can hold your devs accountable for the quality of their code. Stronger typing support via static analysis, as well as runtime validation of untrusted input/data, has really helped Python a lot.
I'm not necessarily the biggest fan of python, but writing a data engineering tool in a non-data engineering focused language seems like a bad decision. Now when the OP leaves the organization is in a much tougher position.
> Now when the OP leaves the organization is in a much tougher position.
Are they really, though? You're assuming their org is unfamiliar with C#. Not all data engineers only know Python. The ones I work with mainly use C# because we all do!
I'm a software and data engineer. I work with C# pretty extensively in my software day job. I've never seen a data engineer job listing mention C#.
Additionally, the way the OP's comment reads, I'm ok with the assumption I made. It reads like it was a unilateral decision on their part and not something that got buy in from the team.
I think many devs learn the trade with Windows and don't get exposure to these tools.
Plus, they require a bit of reading because they operate on a higher level of abstraction than loops and ifs. You get implicit loops, your fields get cut up automatically, and you can apply regexes simultaneously on all fields. So it's not obvious to the untrained eye.
But you get a lot of power and flexibility on the cli, which enable you to rapidly put together an ad hoc solution which can get the job done or at least serve as a baseline before you reach for the big guns.
> The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.
It would be interesting to redo the benchmark but with a (much) larger database.
Nowadays the biggest open data for chess must come from Lichess https://database.lichess.org, with ~7B games and 2.34 TB compressed, ~14TB uncompressed.
If you get all the data on fast SSDs in a single chassis, you probably still beat EMR over S3. But then you have a whole dedicated server to manage your 14TB of chess games.
The "EMR over S3" paradigm is based on the assumption that the data isn't read all that frequently, 1-10x a day typically, so you want your cheap S3 storage but once in a while you'll want to crank up the parallelism to run a big report over longer time periods.
Almost certainly not. You can go on AWS or GCP and spin up a VM with 2.2 TB RAM and 288 vCPUs. Worst case, if streaming the data sequentially isn't fast enough, you can use something like GNU Parallel to launch processes in parallel to use all those 288 cpus. (It's also extremely easy to set up - 'apt install parallel' is about all you need.) That starts to resemble Hadoop, if you squint, except that it's all running on the same machine. As a result, it's going to outperform Hadoop significantly.
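A sketch of that pattern (the script and paths are hypothetical):

    # One worker process per input file, spread across every core of the one big VM.
    ls data/*.json.gz | parallel -j "$(nproc)" \
        'zcat {} | ./process_one.py > out/{/.}.csv'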
The only reason not to do that is if for some reason the workload won't support that kind of out-of-the-box parallelism. But in that case, you'd be writing custom code for Hadoop or Spark anyway, so there's an argument for doing the same to run on a single VM. These days it's pretty easy to essentially vibe code a custom script to do what you need.
At the company I'm with, we use Spark and Apache Beam for many of our large workloads, but that's typically involving data at the petabyte scale. If you're just dealing with a few dozen terabytes, it's often faster and easier to spin up a single large VM. I just ran a process on Friday like that, on a 96-core VM with 350 GB RAM.
It depends on what you were trying to do with the data. Hadoop would never win, but Spark can hold all that data in memory across multiple machines and perform various operations on it.
If all you wanted to do was filter the dataset for certain fields, you can likely do something faster programmatically on a single machine.
The comments here smell of "real engineers use the command line". But I am not sure they have ever actually worked on analysing data beyond using it as a log parser.
Yes Hadoop is 2014.
These days you obviously don't set up a Hadoop cluster. You use the managed service your cloud provider offers (BigQuery or AWS Athena, for example).
Or map your data into DuckDB or use polars if it is small.
It depends. I’ve done plenty of data processing, including at large fortune 10s. Most of the big data could be shrunk to small data if you understood the use case— pre-aggregating, filtering to smaller datasets based on known analysis patterns, etc.
Now, you could argue that that’s cheating a bit and introduces preprocessing that is as complex as running Hadoop in the first place, but I think it depends.
In my experience, though, most companies really don’t have big data, and many that do don’t really need to.
Most companies aren’t fortune 500s.
I used to work at Elastic, and I noticed that most (not all!) of the customers who walked up to me at the conferences were there to ask about datasets that easily fit into memory on a cheap VPS.
> But I am not sure they ever actually worked with analysing data more than using it as a log parser.
It really feels that way. Real data analysis involves a lot more than just grepping logs. And the reason to be wary of starting out unprepared for that kind of analysis is that migrating to a better solution later is a nightmare.
In many ways HN is Reddit in denial at this point :) Comments and upvotes that are based mostly on vibes, with depth and discussion usually happening somewhere towards the middle of the comment tree.
It’s easy to overlook how often straightforward approaches are the best fit when the data and problem are well understood. Large expensive tools can become problems in their own right creating complexity that then requires even more tooling to manage. (Maybe that's the intent?) The issue is that teams and companies often adopt optimization frameworks earlier than necessary. Starting with simpler tools can get you most of the way there and in many cases they turn out to be all that’s needed.
I've contributed to PrestoDB, but the availability of DuckDB and fast multi-core machines with even faster SSDs makes the need for distribution all the more niche; much of it is just cargo-culting Google or Meta.
MapReduce is from a world with slow HDDs, expensive ram, expensive enterprise class servers, fast network.
In that case to get best performance, you’d have to shard your data across a cluster and use mapreduce.
Even in the author's 2014 world of SSDs and multi-core consumer PCs, their aggregate pipeline would be around 2x faster if the work were split across two equivalent machines.
The limit of how much faster distributed computing is comes down to latency more than throughput. I’d not be surprised if this aggregate query could run in 10ms on pre sharded data in a distributed cluster.
Somebody has to go back to first principles. I wrote Pig scripts in 2014 in Palo Alto. Yes, it was shit. IYKYK. But the author, and nearly everybody in this thread, are wrong to generalize.
PCIe would have to be millions of times faster than Ethernet before command line tools are actually faster than distributed computing and I don't see that happening any time soon.
I had a similar, very tiny-scale problem that was solved much quicker by awk+perl. I had a relatively large dataset spread across many YAML files from which I needed to compute a result. It turns out that using yq / jq to perform a query was orders of magnitude slower (>10m) to compute any kind of result. Outputting the data to CSV, then iterating over that, was much, much faster (seconds). Of course, dumping it into SQLite and querying that was nearly instantaneous.
I know it’s not a direct 1:1 comparison, but it brings to mind that solutions that were made common decades ago are still relevant today.
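For what it's worth, the SQLite step can be as small as this (database, file, and column names here are invented):

    # One-off import of the flattened CSV...
    sqlite3 -cmd ".mode csv" data.db ".import flat.csv records"
    # ...then instant ad hoc queries.
    sqlite3 data.db "SELECT category, count(*) FROM records GROUP BY category ORDER BY 2 DESC;"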
The same thing is true with Sqlite vs Postgres. Most startups need Sqlite, not Postgres. Many queries run an order of magnitude faster. Not only is it better for your users, it's life changing to see the test suites (which would take minutes to run) complete in mere seconds
Feels like quibbling over the differences between two databases that are going to act the same for 90% of projects out there doesn't really matter.
If you want speed, just have your database stored in the same place as your application, locally, rather than hopping across the world to retrieve data that can be located next to the code.
That would probably be the easiest thing to do to get real, measured performance gains.
As other commentators pointed out, computers are extremely powerful. This isn't 1995, you can easily host everything in the same local area and get a very responsive application with very minimal needs to worry about resource constraints.
Given how primitive SQLite's optimizer is and how similar the storage and execution engines between the two are in terms of architecture, this seems unlikely to be the norm unless you did something wrong on the Postgres side. (Of course, no RDBMS optimizer will always give the best answer, so there's bound to be such cases.)
I’m curious about the memory usage of the cat | grep part of the pipeline. I think the author is processing many small files?
In which case it makes the analysis a bit less practical, since the main use case I have for fancy data processing tools is when I can’t load a whole big file into memory.
Unix shell pipelines are task-parallel. Every tool gets spun up as its own unix process — think "program" (fork-exec). Standard input and standard output (stdin, stdout) get hooked up to pipes. Pipes are like temporary files managed by the kernel (hand-wave). Pipe buffer size is a few KB. Grep does a blocking read on stdin. Cat writes to stdout. Both on a kernel I/O boundary. Here the kernel can context-switch the process when waiting for I/O.
In the past there was time-slicing. Now with multiple cores and hardware threads they actually run concurrently.
This is very similar to an old-school approach to something like multiple threads, except that processes don't share virtual address spaces in the CPU's memory management unit (MMU).
Further details: look up McIlroy's pipeline design.
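Concretely, in something like the pipeline below, each stage is its own process and all of them run at once; memory use is bounded by the pipe buffers plus whatever each tool keeps for itself (file names are made up):

    # cat streams bytes, grep filters lines, awk keeps only a small table of counts;
    # the kernel schedules all three concurrently and blocks whichever gets ahead.
    cat games/*.pgn | grep "Result" | awk '{ c[$0]++ } END { for (k in c) print c[k], k }'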
Something to note here is that the result of xargs -P is unlikely to be satisfactory, since all of the subprocesses are simply connected to the terminal and stomp over each other's outputs. A better choice would be something like rush or, for the Perl fans, parallel.
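A rough sketch of the difference (inputs are hypothetical):

    # xargs -P runs 8 greps at once, but nothing coordinates their output,
    # so lines from different jobs arrive mixed together.
    printf '%s\n' logs/*.log | xargs -P 8 -n 1 grep -Hc ERROR

    # GNU parallel buffers each job's output and prints it per job, in one piece.
    printf '%s\n' logs/*.log | parallel 'grep -Hc ERROR {}'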
Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations basically _everyone_ using single-instance ClickHouse often runs into if they actually have a large dataset.
Spark and Hadoop have their place, maybe not in rinky dink startup land, but definitely in the world of petabyte and exabyte data processing.
Well, at my old company we had some datasets in the 6-8 PB range, so tell me how we would run analytics on that dataset on an Intel NUC.
Just because you don't have experience of these situations, it doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."
Well yeah, but that's a _very_ different engineering decision with different constraints, it's not fully apples to apples.
Having materialised views increases insert load for every view, so if you want to slice your data in a way that wasn't predicted, or that would have increased ingress load beyond what you've got to spare, say, find all devices with a specific model and year+month because there's a dodgy lot, you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.
And now with things like DuckDB and clickhouse-local you won't have to worry about data processing performance ever again. Just kidding, but especially with ClickHouse it's so much better to handle the large data volume compared to the past, and even a single beefy server is often enough to satisfy all data analytics needs for a moderate-to-large company.
Yes, but you neither raise investor money with an efficient chain of CLI tools that already does the job, nor convince your clients that they are getting value for their money.
It was a fun moment to finally work on a data problem that did not fit on any (practical) machine. I needed about 50TiB of memory to process a multi-PiB set of logs.
It's worth remembering however that even though it's less efficient per-CPU or whatever to split a large task into many smaller tasks, it may be more efficient overall alongside other workloads as you can bin-pack tasks more efficiently on a cluster, not to mention if tasks fail you are retrying less of the overall work.
All this is to say, the article makes a very good point, but doing it all on one machine also has problems. Just don't cargo cult engineering decisions.
The saddest part about this article being from 2014 is that the situation has arguably gotten worse.
We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
I've done a handful of interviews recently where the 'scaling' problem involves something that comfortably fits on one machine. The funniest one was ingesting something like 1gb of json per day. I explained, from first principals, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days.
The wildest part is they’ll take those massive machines, shard them into tiny Kubernetes pods, and then engineer something that “scales horizontally” with the number of pods.
Yeah man, you're running on a multitasking OS. Just let the scheduler do the thing.
Yeah this. As I explain many times to people, processes are the only virtualisation you need if you aren’t running a fucked up pile of shit.
The problem we have is fucked up piles of shit not that we don’t have kubernetes and don’t have containers.
Maybe you are right about kubernetes, I don't have enough experience to have an opinion. I disagree about containers though, especially the wider docker toolchain.
It is not that difficult to understand a Dockerfile and use containers. Containers, from a developer pov, solve the problem of reliably reproducing development, test and production environments and workloads, and distributing those changes to a wider environment. It is not perfect, its not 100% foolproof, and its not without its quirks or learning curve.
However, there is a reason docker has become as popular as it is today (not only containers, but also dockerfiles and docker compose), and that is because it has a good tradeoff between various concerns that make it a highly productive solution.
> problem of reliably reproducing development, test and production environments and workloads
Then again so does a tar file.
Some people might disagree that the problem is "solved" but there you go.
I suggest you read my comment here, which I'd rather not repeat as it's quite a long one https://news.ycombinator.com/item?id=46676676
Hahhah, yuuuup.
I can maybe make a case for running in containers if you need some specific security properties but .. mostly I think the proliferation of 'fucked up piles of shit' is the problem.
Containers are just processes plus some namespacing, nothing really stops you from running very huge tasks on Kubernetes nodes. I think the argument for containers and Kubernetes is pretty good owing to their operational advantages (OCI images for distributing software, distributed cron jobs in Kubernetes, observability tools like Falco, and so forth).
So I totally understand why people preemptively choose Kubernetes before they are scaling to the point where having a distributed scheduler is strictly necessary. Hadoop, on the other hand, you're definitely paying a large upfront cost for scalability you very much might not need.
Time to market and operational costs are much higher on kubernetes and containers from many years of actual experience. This is both in production and in development. It’s usually a bad engineering decision. If you’re doing a lift and shift, it’s definitely bad. If you’re starting greenfield it makes sense to pick technology stacks that don’t incur this crap.
It only makes sense if you’re managing large amounts of large siloed bits of kit. I’ve not seen this other than at unnamed big tech companies.
99.9% of people are just burning money for a fashion show where everyone is wearing clown suits because someone said clown suits are good.
Writing software that works containerized isn't that bad. A lot of the time, ensuring cross platform support for Linux is enough. And docker is pretty easy to use. Images can be spun up easily, and the orchestration of compose is simple but quite powerful. I'd argue that in some cases, it can speed up development by offering a standardized environment that can be brought up with a few commands.
Kubernetes, on the other hand, seems to bog everything down. It's quite capable and works well once it's going, but getting there is an endeavor, and any problem is buried under mountains of templatized YAML.
This, 100%.
Imagine working on a project for the first time and having a Dockerfile or compose file that just works: it downloads and spins up all dependencies and builds the project successfully. Usually that just works and you get up and running within 30 minutes or so.
On the other hand, how it used to be: having to install the right versions of, for example, redis, postgres, nginx, and whatever unholy mess of build tools is required for this particular hairball, hoping it works on your particular (version of) Linux. Have fun with that.
Working on multiple projects, over a longer period of time, with different people, is so much easier when setup is just 'docker compose up -d' versus spending hours or days debugging the idiosyncrasies of a particular cocktail that you need to get going.
Thanks. You’ve reassured me that I’m not going mad when I look at our project repo and seriously consider binning the Dockerfile and deploying direct to Ubuntu.
The project is a Ruby on Rails app that talks to PostgreSQL and a handful of third-party services. It just seems unnecessary to include the complexity of containers.
I have a lot of years of actual experience. Maybe not as much as you, but a good 12 years in the industry (including 3 at Google, and Google doesn't use Docker, it probably wouldn't be effective enough) and a lot more as a hobbyist.
I just don't agree. I don't find Docker too complicated to get started with at all. A lot of my projects have very simple Dockerfiles. For example, here is a Dockerfile I have for a project that has a Node.JS frontend and a Go backend:
It is a glorified shell script that produces an OCI image with just a single binary. There's a bit of boilerplate but it's nothing out of the ordinary in my opinion. It gives you something you can push to an OCI registry and deploy basically anywhere that can run Docker or Podman, whether it's a Kubernetes cluster in GCP, a bare metal machine with systemd and podman, a NAS running Synology DSM or TrueNAS or similar, or even a Raspberry Pi if you build for aarch64. All of the configuration can be passed via environment variables or, if you want, additional command line arguments, since starting a container very much is just like starting a process (because it is). But of course, for development you want to be able to iterate rapidly, and don't want to be dealing with a bunch of Docker build BS for that. I agree with this. However, the utility of Docker doesn't really stop at building for production either. Thanks to the utility of OCI images, it's also pretty good for setting up dev environment boilerplate. Here's a docker-compose file for the same project:
And if your application is built from the ground up to handle these environments well, which doesn't take a whole lot (basically, it just needs to be able to handle configuration from the environment, and to make things a little neater it can have defaults that work well for development), this provides a one-command, auto-reloading development environment whose only dependency is having Docker or Podman installed. `docker compose up` gives you a full local development environment. I'm omitting a bit of more advanced topics, but these are lightly modified real Docker manifests, mainly just reformatted to fewer lines for HN.
I adopted Kubernetes pretty early on. I felt like it was a much better abstraction to use for scheduling compute resources than cloud VMs, and it was how I introduced infrastructure-as-code to one of the first places I ever worked.
I'm less than thrilled about how complex Kubernetes can be, once you start digging into stuff like Helm and ArgoCD and even more, but in general it's an incredible asset that can take a lot of grunt work out of deployment while providing quite a bit of utility on top.
Is there a book like Docker: The Good Parts that would build a thorough understanding of the basics before throwing dozens of ecosystem brand words at you? How does virtualisation not incur an overhead? How do CPU- and GPU-bound tasks work?
> How does virtualisation not incur an overhead?
I think the key thing here is the difference between OS virtualization and hardware virtualization. When you run a virtual machine, you are doing hardware virtualization: the hypervisor creates fake devices, like a fake SSD, and your virtual machine's kernel speaks to that fake SSD with the NVMe protocol as if it were a real physical SSD. Those NVMe instructions are then translated by the hypervisor into changes to a file on your real filesystem, so your real/host kernel then speaks NVMe again to your real SSD. That is where the virtualization overhead comes in (along with having to run that 2nd kernel). This is somewhat helped by using virtio devices or PCIe pass-through, but it is still significant overhead compared to OS virtualization.
When you run docker/kubernetes/FreeBSD jails/solaris zones/systemd nspawn/lxc you are doing OS virtualization. In that situation, your containerized programs talk to your real kernel and access your real hardware the same way any other program would. The only difference is your process has a flag that identifies which "container" it is in, and that flag instructs the kernel to only show/allow certain things. For example "when listing network devices, only show this tap device" and "when reading the filesystem, only read from this chroot". You're not running a 2nd kernel. You don't have to allocate spare ram to that kernel. You aren't creating fake hardware, and therefore you don't have to speak to the fake hardware with the protocols it expects. It's just a completely normal process like any other program running on your computer, but with a flag.
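You can poke at this with plain util-linux, no container engine involved (needs root or user namespaces):

    # Same kernel, same hardware: just a bash process dropped into fresh PID and
    # network namespaces. Inside, it is PID 1 and sees only a bare loopback device.
    sudo unshare --fork --pid --mount-proc --net bash -c 'echo "I am PID $$"; ip link'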
Docker is just Linux processes running directly on the host as all other processes do. There is no virtualization at all.
The major difference is that a typical process running under Docker or Podman:
- Is unshared from the mount, net, PID, etc. namespaces, so they have their own mount points, network interfaces, and PID numbers (i.e. they have their own PID 1.)
- Has a different root mount point.
- May have resource limits set with cgroups.
(And of course, those are all things you can also just do manually, like with `bwrap`.)
There is a bit more, but well, not much. A Docker process is just a Linux process.
So how does accessing the GPU work? Well, sometimes there are more advanced abstractions for the benefit of, I presume, stronger isolation, but generally you can just mount in the necessary device nodes and use the GPU directly, because it's a normal Linux process. This is generally what I do.
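A sketch of both styles (image and command names are invented); the first just passes device nodes through, the second assumes the NVIDIA container toolkit is installed on the host:

    # Plain device passthrough (e.g. an Intel/AMD GPU exposed via /dev/dri):
    docker run --rm --device /dev/dri my-gpu-image ./render_job

    # NVIDIA GPUs via the container toolkit:
    docker run --rm --gpus all my-gpu-image python train.py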
About 25 years here and 10 years embedded / EE before that.
The problem is that containers are made from images, and those and Kubernetes are incredibly stateful. They need to be stored. They need to be reachable. They need maintenance. And the control responsibility is inverted. You end up with a few problems which I think are not tenable.
Firstly, the state. Neither docker itself or etcd behind Kubernetes are particularly good at maintaining state consistently. Anyone who runs a large kubernetes cluster will know that once it's full of state, rebuilding it consistently in a DR scenario is HORRIBLE. It is not just a case of rolling in all your services. There's a lot of state like storage classes, roles, secrets etc which nothing works if you don't have in there. Unless you have a second cluster you can tear down and rebuild regularly, you have no idea if you can survive a control plane failure (we have had one of those as well).
Secondly, reachability. The container engine and Kubernetes require the ability to reach out and get images. This is such a fucking awful idea from a security and reliability perspective it's unreal. I don't know how people even accept this. Typically your Kubernetes cluster or container engine has the ability to just pull any old shit off Docker Hub. That also couples you to that service being up, available, and not subject to the whims of whatever vendor figures they don't want to do their job any more (Broadcom, for example). To get around this you end up having to cache images, which means more infrastructure to maintain. There is of course a whole secondary market for that...
Thirdly, maintenance. We have about 220 separate services. When there's a CVE, you have to rebuild, test and deploy ALL those containers. We can't just update an OS package and bounce services, or push a new service binary out and roll it. It's a nightmare. It can take a month to get through this, and believe me, we have all the funky CD stuff.
And as mentioned above, control is inverted. I think it's utterly stupid on this basis that your container engine or cluster pulls containers in. When you deploy, the relationship should be a push because you can control that and mandate all of the above at once.
In the attempt to solve problems, we created worse ones. And no one is really happy.
I get your points but I'm not sure I agree. Kubernetes is a different kind of difficulty but I don't think its so different from handling VM fleets.
You can have 220 vms instead and need to update all of them too. They also are full of state and you will need some kind of automatic deployment (like ansible) to make it bearable, just like your k8s cluster. If you don't configure the network egress firewall, they can also both pull whatever images/binaries from docker hub/internet.
> To get around this you end up having to cache images which means more infrastructure to maintain
If you're not doing this for your VMs packages and your code packages, you have the same problem anyway.
> When there's a CVE
If there is a CVE in your code, you have to build all you binaries anyway. If it's in the system packages, you have to update all your VMs. Arguably, updating a single container and making a rolling deployment is faster than updating x VMs. In my experience updating VMs was harder and more error prone than updating a service description to bump a container version (you don't just update a few packages, sometimes you need to go from Centos 5 to Centos 7/8 or something and it also takes weeks to test and validate).
I mostly agree with you, with the exception that VMs are fully isolated from one another (modulo sharing a hypervisor), which is both good and bad.
If your K8s cluster (or etcd) shits the bed, everything dies. The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
As to updating VMs, again IME, typically you’d generate machine images with something like Packer + Ansible, and then roll them out with some other automation. Once that infrastructure is built, it’s quite easy, but there are far more examples today of doing this with K8s, so it’s likely easier to do that if you’re just starting out.
> If your K8s cluster (or etcd) shits the bed, everything dies.
When etcd and/or kubelet shits the bed, it shouldn't do anything other than halt scheduling tasks. The actual runtime might vary between setups, but typically containerd is the one actually handling the individual pod processes.
Of course, you can also run Kubernetes pods in a VM if you want to, there have always been a few different options for this. I think right now the leading option is Kata Containers.
Does using Kata Containers improve isolation? Very likely: you have an entire guest kernel for each domain. Of course, the entire isolation domain is subject to hardware bugs, but I think people do generally regard hardware security boundaries somewhat higher than Linux kernel security boundaries.
But, does using Kata Containers improve reliability? I'd bet not, no. In theory it would help mitigate reliability issues caused by kernel bugs, but frankly that's a bit contrived as most of us never or extremely infrequently experience the type of bug that mitigates. In practice, what happens is that the point of failure switches from being a container runtime like containerd to a VMM like qemu or Firecracker.
> The equivalent to that for VMs is the hypervisor dying, but IME it’s far more likely that K8s or etcd has an issue than a hypervisor. If nothing else, the latter as a general rule is much older, much more mature, and has had more time to work out bugs.
The way I see it, mature code is less likely to have surprise showstopper bugs. However, if we're talking qemu + KVM, that's a code base that is also rather old, old enough that it comes from a very different time and place for security practices. I'm not saying qemu is bad, obviously it isn't, but I do believe that many working in high-assurance environments have decided that qemu's age and attack surface is large enough to have become a liability, hence why Firecracker and Cloud Hypervisor exist.
I think the main advantage of a VMM remains the isolation of having an entire separate guest kernel. Though, you don't need an entire Linux VM with complete PC emulation to get that; micro VMs with minimal PC emulation (mostly paravirtualization) will suffice, or possibly even something entirely different, like the way gVisor is a VMM but the "guest kernel" is entirely userland and entirely memory safe.
I think his point is that instead of hundreds of containers, you can just have a small handful of massive servers and let the multitasking OS deal with it
Containers are too low-level. What we need is a high-level batch job DSL, where you specify the inputs and the computation graph to perform on those inputs, as well as some upper limits on the resources to use, and a scheduler will evaluate the data size and decide how to scale it. In many cases that means it will run everything on a single node, but in any case data devs shouldn't be tasked with making things run in parallel because the vast majority aren't capable and they end up with very bad choices.
And by the way, what I just described is a framework that Google has internally, named Flume. 10+ years ago they had already noticed that devs aren't capable of using Map/Reduce effectively because tuning the parallelism was beyond most people's abilities, so they came up with something much more high-level. Hadoop is still a Map/Reduce clone, thus destined to fail at usability.
Disagree.
Different processes can need different environments.
I advocate for something lightweight like FreeBSD jails.
Yes, Sun had the marketing message "The network is the computer" already in the 1980s; we were doing microservices with plain OS processes.
Containers solve:
1. Better TCP port administration with networking layer
2. Clusterfuck that is glibc versions
3. Shipping a Python venv
Can't really speak to (1), but (2) and (3) definitely qualify as 'fucked up piles of shit', which is what he's saying the real problem is.
It's all fun and games until the control plane gets killed by the OOM killer.
Naturally, that detaches all your containers. And there's no seamless reattach after a control plane restart.
Or your CNI implementation is made of rolled up turds and you lose a node or two from the cluster control plane every day.
(Large EKS cluster)
Until you need to schedule GPUs or other heterogenous compute...
Are you saying that running your application in a pile of containers somehow helps that problem ..? It's the same problem as CPU scheduling, we just don't have good schedulers yet.. Lots of people are working on it though
Not really? At the moment it's done by some user-land job scheduler. That could be something container based like k8s, something in-process like ray, or a workload manager like slurm.
This is especially aggravating when the os inside the container and the language runtimes are much heavier than the process itself.
I've seen arguments for nano services (I wouldn't even call them microservices) that completely ignored that part. Split a small service into n tiny services, so that you end up with 10 × (OS, runtime, a sliver of work) rather than 2 × (OS, runtime, real work).
There is no os inside the container. That's a big part of the reason containerization is so popular as a replacement for heavier alternatives like full virtualization. I get that it's a bit confusing with base image names like "ubuntu" and "fedora", but that doesn't mean that there is a nested copy of ubuntu/fedora running for every container.
I had to re-read this a few times. I am sad now.
To be fair, each of those pods can have dedicated, separate external storage volumes, which may actually help, and it's definitely easier than maintaining 200 or more iSCSI (or whatever) targets yourself.
I think my brain hurts
I mean, a large part of the point is that you can run on separate physical machines, too.
I recently had to parse 500MB to 2GB daily log files into analytical information for sales. Quick and dirty, the application would have needed 64GB of RAM, and my work laptop only has 48GB. After taking time to clean it up, it used under 1GB of RAM and ran faster, by only retaining records in RAM between days when it actually needed to.
It is not about what you are doing, it is always about how you do it.
This was the same with doing OCR analysis of assembly and production manuals. Quick and dirty, it would have taken over 24 hours of processing time; after moving to semaphores with parallelization, it took less than two hours to process all the information.
> It is not about what you are doing, it is always about how you do it.
It saddens me to see how the LinkedIn slop style is expanding to other platforms
There is nothing slop style about this. Things often times being more about how you do them is one of the core characteristics of engineering
"It's not about X, it's about Y" is a very common (and tired) LinkedIn trope.
In interviews just give them what they are looking for. Don't overthink it. Interviews have gotten so stupidly standardized as the industry at large copied the same Big Tech DSA/System Design/Behavioral process. And therefore interview processes have long been decoupled from the business reality most companies face. Just shard the database and don't forget the API Gateway
> In interviews just give them what they are looking for
Unless, of course, you have multiple options and you don’t want to work for a company that’s looking for dumb stuff in interviews.
100%. Interviews should be a two-way filter. I’m sympathetic to unemployed-and-just-need-something, but also: boy are there a lot of companies hiring data engineers.
Meh .. I've played that game; it doesn't work out well for anyone involved.
I optimize my answers for the companies I want to work for, and get rejected by the ones I don't. The hardest part of that strategy is coming to terms with the idea that I constantly get rejected by people that I think are mostly <derogatory_words_here>, but I've developed thick skin over the years.
I'd much rather spend a year unemployed (and do a ton of painful interviews) and find a company whose values align with mine, than work for a year on a team I disagree with constantly and quit out of frustration.
The company's values may align to yours, even though they reject you. It's because the interview process doesn't need to have anything to do with their real-world process. Their engineers probe you for the same "best practices" that they themselves were constantly probed for in their own interviews. Interviewing is its very own skill that doesn't necessarily translate into real-life performance.
I agree with your observation. My issue is (from experience) that it's really hard to tell from the outside if a team's values align with mine. Many teams talk the talk, but don't walk the walk, as the saying goes. It's just easier to not participate than it is to guess, and be wrong.
I also believe that running a broken interview process actively selects for qualities you actually don't want, so it's much more likely that teams conducting those interviews aren't teams I want to work on.
Edit: As credence for my claims, the best team I've ever worked on was a team I did 90%+ of the hiring for, and we didn't do any of the 'typical' interview bullshit most companies do.
What we did instead was sit people down and have deep technical conversations about systems they'd worked on in the past. The candidate would explain, in as much detail as they could muster, a system they'd worked on, down to the lowest level details. Usually, they would talk to us for at least 20-30 minutes, then we (the interviewers) would pose questions, usually starting with the form 'if we changed X, what effect would it have'. Doing interviews in this style makes a few things immediately obvious:
1. Did the candidate have a deep, systemic understanding of the system they worked on?
2. Does the candidate have a good mental model for evaluating change in the system?
That's how I conduct interviews, and unsurprisingly, when I get interviewed like that, my success rate is 100%. I don't think I've ever done an interview like that which did not result in an offer.
Anyways, there's some rambling and unsolicited opinions for you :)
The interview process determines who gets hired, which determines their real-world process. Even if most of their people were hired under a better system, future hires will come in under this one.
This. Most interviewers don't want to do interviews; they have more important work to do (at least, that's what they claim). So they learn questions and approaches from the same materials and guides that are used by candidates. Well, I'm guilty of doing exactly this a few times.
Meh. as an interviewer I would always make it clear if we wanted to switch to “let’s pretend it doesn’t fit on a machine now”.
Demonstrating competency is always good.
> but that's not the answer we wanted
You could have learned this if you were better about collecting requirements. You can tell the interviewer "I'd do it like this for this size data, but I'd do it like this for 100x data. Which size should I design this for?" If they're looking for one direction and you ask which one, interviewers will tell you.
I've done that too and, in my experience, people that ask a scaling question that fits on a single machine don't have the capacity to have that nuanced conversation. I usually try to help the interviewer adjust the scale to something that actually requires many machines, but they usually don't get it.
Said another way, how do you have a meaningful conversation about scaling with a person who thinks their application is huge, but in reality only requires a tiny fraction of a single machine? Sometimes, there's such a massive gulf between perception and reality that the only thing to do is chuckle and move on.
The burden of wisdom.
Yes, but then how are these people going to justify the money they're spending on cloud systems?... They need to find reasons to keep their "investment", otherwise they could be seen as incompetent when their solution is shown to be ineffective. So they have to show that it was a unanimous technical decision to do whatever they wanted in the first place.
I've actually worked on distributed systems that were so broken, I created a script to connect to prod and just create the report from my laptop. My manager offered to buy me a second laptop for running the report since it was easier than getting approval from the architects to get rid of the distributed report system (it only created that one report).
Yeah I had this problem at a couple of times in startup interviews where the interviewer asked a question I happened to have expertise in and then disagreed with my answer and clearly they didn't know all that much about it. It's ok, they did me a favor.
It may or may not be related that the places that this happened were always very ethnically monotone with narrow age ranges (nothing against any particular ethnic group, they were all different ethnic monotones)
Hah yeah, that's a funny one, being able to run circles around the interviewer.
> I explained, from first principals, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
Probably a better outcome than being hired onto a team where everyone know you're technically correct but they ignore your suggestions for some mysterious (to you) reason.
Oh, absolutely.
Though I do not know the situation at the firm you were interviewing with, if there is some unexpected increase in data volume, or say a job fails on certain days, or you need to do some sort of historical data load (>= 6 months of 1 gig of data per day), the solution of running it on a single VM might not scale. But again, interviews are partially about problem solving and partially about checking compliance, at least for IC roles (in my anecdotal experience).
That being said, yeah, I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need Spark, so the team decision was to fit the smaller square peg into the larger square peg and call it a day. In fact, I had spent time refactoring one particularly pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow, but it was rejected, so I stopped caring about engineering hygiene.
> https://github.com/simdjson/simdjson
Was not aware of this; it seems it is not available natively in Python, but it looks cool. Will try it out in the future.
As other commenters pointed out, 1GB/day isn't a problem for storage and retroactive processing until you get to, like, hundreds of years of data. You can chew through a few hundred TB of JSON data in a day, per core + NVMe drive.
Regardless, storage and retroactive processing wasn't part of the problem. The problem was explicitly "parse json records as they come in, in a big batch, and increment some integers in a database".
I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1GB/day. You can do a lot with 10TB of memory and 256 cores.
If we are talking about cloud VMs: sure, their cpu performance is atrocious and io can be horrible. This won't scale to infinity
But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and still about four orders of magnitude if we need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work
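To make that concrete, the whole "parse JSON records and increment some integers" job is in the neighborhood of a one-liner. A rough sketch, where jq, the file layout, and the "event" field are my inventions rather than anything from the actual interview:

    # count events per type for one day's worth of newline-delimited JSON
    jq -r '.event' events-2024-01-01.jsonl | sort | uniq -c | sort -rn > daily_counts.txt

On a laptop, a pipeline like that typically gets through a gigabyte in somewhere between seconds and a couple of minutes, and the database part is just loading daily_counts.txt afterwards.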
“6 months of 1 gig of data per day”
Then you would need an enormous 2TB storage device. \s
This kind of bad interview is rife. It’s often more a case of guess what the interviewer thinks than come up with a good solution.
I have a funny story I need to tell some day about how I could get a 4GB JSON loaded purely in the browser at some insane speed, by reading the bytes, identifying the "\n" then making a lookup table. It started low stakes but ended up becoming a multi-million internal project (in man-hours) that virtually everyone on the company used. It's the kind of project that if started "big" from the beginning, I'd bet anything it wouldn't have gotten so far.
Edit: I did try JSON.parse() first, which I expected to fail and it did fail BUT it's important that you try anyway.
Curious about which browser and hardware. In my experience browsers often choke on 0.5GB strings, or decide to kill the tab/proccess.
Yes, but I didn't read the full file, I kept the File reference and read the bytes in pages of 10MB IIRC to find all of the line break offsets. Then used those to slice and only read the relevant parts.
Every one of these cores is really fast, too!
yeah man, computers are completely bananacakes
They wanted to see if you would be on board with their embezzlement scheme.
Yes, yes but how are we going to get HA with one machine..
Fuck off .. your 10 person startup with an MVP and no revenue stream needs customers first..
“there’s no wrong answer, we just want to see how you think” gaslighting in tech needs to be studied by the EEOC, Department of Labor, FTC, SEC, and Delaware Chancery Court to name a few
let’s see how they think and turn this into a paid interview
1gb of json u can do in one parse ¯\_(ツ)_/¯ big batches are fast
I agree - and it's not just what gets you promoted, but also what gets you hired, and what people look for in general.
You're looking for your first DevOps person, so you want someone who has experience doing DevOps. They'll tell you about all the fancy frameworks and tooling they've used to do Serious Business™, and you'll be impressed and hire them. They'll then proceed to do exactly that for your company, and you'll feel good because you feel it sets you up for the future.
Nobody's against it. So you end up in that situation, which even a basic home desktop would be more than capable of handling.
I have been the first (and only) DevOps person at a couple startups. I'm usually pretty guilty of NIH and wanting to develop in-house tooling to improve productivity. But more and more in my career I try to make boring choices.
Cost is usually not a huge problem beyond seed stage. Series A-B the biggest problem is growing the customer base so the fixed infra costs become a rounding error. We've built the product and we're usually focused on customer enablement and technical wins - proving that the product works 100% of the time to large enterprises so we can close deals. We can't afford weird flakiness in the middle of a POC.
Another factor I rarely see discussed is bus factor. I've been in the industry for over a decade, and I like to be able to go on vacation. It's nice to hand off the pager sometimes. Using established technologies makes it possible to delegate responsibility to the rest of the team, instead of me owning a little rat's nest fiefdom of my own design.
The fact is that if 5k/month infra cost for a core part of the service sinks your VC backed startup, you've got bigger problems. Investors gave you a big pile of money to go and get customers _now_. An extra month of runway isn't going to save you.
The issue is when all the spending gets you is more complexity, maintenance, and you don't even get a performance benefit.
I once interviewed with a company that did some machine learning stuff, this was a while back when that typically meant "1 layer of weights from a regression we run overnight every night". The company asked how I had solved the complex problem of getting the weights to inference servers. I said we had a 30 line shell script that ssh'd them over and then mv'd them into place. Meanwhile the application reopened the file every so often. Zero problems with it ever. They thought I was a caveman.
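For anyone curious, that kind of script is roughly the following; the hostnames and paths here are invented, not the originals:

    #!/usr/bin/env bash
    # sketch of the "ship the weights" script: copy to a temp name, then mv into
    # place; mv within one filesystem is atomic, so readers never see a partial file
    set -euo pipefail
    WEIGHTS=model_weights.bin
    for host in infer-01 infer-02 infer-03; do
      scp "$WEIGHTS" "$host:/srv/model/.$WEIGHTS.tmp"
      ssh "$host" "mv /srv/model/.$WEIGHTS.tmp /srv/model/$WEIGHTS"
    done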
I work for a small company with a handful of devs. We don't have a dedicated devops person, so I do it all. Everything is self-hosted. Been that way for years. But, yeah, if I go on vacation and something goes screwy, the business is hosed. However, even if it were hosted on AWS or elsewhere, it would not be any better. If anything, it may be worse. Instead of a person being well versed in standards based tech, they'd have to be an AWS expert. Why would we want that?
I have recently started using terraform/tofu and ansible to automate nearly all of the devops operations. We are at a point where Claude Code can use these tools and our existing configs to make configuration changes, debug issues by reviewing logs etc. It is much faster at debugging an issue than I am and I know our stuff inside and out.
I am beginning to think that AI will soon force people to rethink their cloud hosting strategy.
> They thought I was a caveman.
I identify as a caveman and I fucking love it. I build a 250k sloc C++ project hundreds of times a day with a 50 line bash script. Works every time, on any machine, everywhere.
The issue with solutions like that is usually that people don't know how it works and how to find it if it ever stops working...
Basically, discoverability is where shell scripts fail.
Those scripts have logs, right? Log a hostname and path when they run. If no one thinks to look at logs, then there's a bigger problem going on than a one-off script.
That becomes a problem if you let the shell script mutate into an "everything" script that's solving tons of business problems. Or if you're reinventing kubernetes with shell scripts. There's still a place for simple solutions to simple problems.
That happens naturally as every engineer adds just another feature to it.
You can literally have a 20 line Python script on cron that verifies if everything ran properly and fires off a PagerDuty if it didn't. And it looks like PagerDuty even supports heartbeat so that means even if your Python script failed, you could get alerted.
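Sketched in shell rather than Python, it's about this much work; the output path and the PagerDuty Events v2 routing key are assumptions on my part:

    #!/usr/bin/env bash
    # page if last night's job didn't leave a non-empty output file
    set -euo pipefail
    OUT="/data/reports/$(date +%F).csv"
    if [ ! -s "$OUT" ]; then
      curl -sS https://events.pagerduty.com/v2/enqueue \
        -H 'Content-Type: application/json' \
        -d '{"routing_key":"'"$PD_ROUTING_KEY"'","event_action":"trigger","payload":{"summary":"nightly report missing","source":"cron","severity":"error"}}'
    fi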
Which is why you take the time to put usage docs in the repo README, make sure the script is packaged and deployed via the same methods that the rest of the company uses, and ensure that it logs success/failure conditions. That's been pretty standard at every organization I've been at my entire professional career. Anyone who can't manage that is going to create worse problems when designing/building/maintaining a more complex system.
Yah. A lot of the complexity in data movement or processing is unneeded. But decent standardized orchestration, documentation, and change management isn't optional even for the 20 line shell script. Thankfully, that stuff is a lot easier for the 20 line standard shell script.
Or python. The python3 standard library is pretty capable, and it's ubiquitous. You can do a lot in 50-100 lines (counting documentation) with no dependencies. In turn it's easy to plug into the other stuff.
> Basically discoverability is where shell script fail
No, it's lack of documentation and no amount of $$$$/m enterprise AI solutions (R)(TM) would help you if there is no documentation.
In my experience, that $5k/month easily blows up into $100k/month
I've seen the ramifications of this "CV first" kind of engineering. Let's just say that it's a bad time when you're saddled with tech debt solely from a handful of influential people that really just wanted to work elsewhere.
I'm largely a stranger to the js world but from the outside it sure looks like projects are sharded so as to maximize npm contribution count
This. It is resume-driven development. Especially at startups where the engineers aren't compensated well enough or don't believe the product can succeed.
I'm convinced k8s is a conspiracy by bigtech to suppress startups.
So it's the EJBs of this age then?
I've spent my last 2 decades doing what's right, using the technologies that make sense instead of the techs that are cool on my resume.
And then I got laid off. Now, I've got very few modern frameworks on my resume and I've been jobless for over a year.
I'm feeling a right fool now.
I’m not hiring anymore, but when I was, all I wanted to find was someone that knew the fundamentals (and was a good ’attitude fit’ as per the similarly titled book). Sorry @wccrawford, I wish we could have more places that value slow, boring tech — aside from banking/insurance?
I have hung on to my job for many years now because of being in a similar situation in regards to trying to do the right thing and the fear of not being hire-able.
There is something wrong with the industry in chasing fads and group think. It has always been this way. Businesses chased Java in the late 90s, early 00s. They chased CORBA, WSDL, ESB, ERP and a host of other acronyms back in the day.
More recently, Data Lake, Big Data, Cloud Compute, AI.
Most of the executives I have met really have no clue. They just go with what is being promoted in the space because it offers a safety net. Look, we are "not behind the curve!". We are innovating along with the rest of the industry.
Interviews do not really test much for ability to think and reason. If you ran an entire ISP, if you figured out, on your own, without any help, how to shard databases, put in multiple layers of redundancy, caching... well, nobody cares now. You had to do it in AWS or Azure or whatever stack they have currently.
Sadly, I do not think it will ever be fixed. It is something intrinsic to human nature.
You can fix that with some open source work and home projects.
Then, in the interview, you say the first line of your posting here and the last and then add that you fixed the problem with intensive study.
Yeah, I probably need to push this harder now. I did actually join 1 project recently and got to the point that I felt I could add 1 more common thing to my resume, and that felt good. (Getting something done felt good, too.)
But getting to the point that I feel confident in certain frameworks is going to be hard. I'll figure it out somehow, I'm sure.
This exactly. Actual doers are most of the time not rewarded, meanwhile the AWS-senior-sucking-Jeff's-wiener specialist gets a job doing nothing but generating costs, leaves behind more shit after his 3 years, and moves up the ladder to some even bigger BS pretend consulting job at an even bigger company. It's the same BS mostly for developers.

I rewrote their library from TS to Rust, which gained them a 50x performance increase and saves them 5k+ a week over all their compute now, but nobody gives a shit and I don't have a certification for it to show off on my LinkedIn. Meanwhile my PM did nothing, got paid to do some shitty certificate, takes the credit, and pisses off to the next bigger fish collecting another 100k more, while I get a 1k bonus and a pat on the shoulder.

Corporate late stage capitalism is complete fucking BS and I'm thinking about becoming a PM as well now. I feel like a fool and betrayed. Meanwhile they constantly threaten to lay off or outsource our team, because they say we are too expensive in a first world country and they can easily find people just as good in India etc. What a time to be alive.
> saves them 5k+ a week over all their compute
If you're willing and able to promote yourself internally, you can make people give a shit, or at least publicly claim they do. That's 260k+ per year, and even big businesses are going to care about that at some level, especially if it's something that can be replicated. Find 10 systems you can do that with, and it's 2.6m+ per year.
But, if you don't want to play the self-promotion game, yeah someone else is going to benefit from your work.
Try Rust? The system programming world isn't very bullshit-infested and Rust is trendy (which is good for a change), also employers can't realistically expect many years of Rust experience.
Need training and something to show? Contribute to some FOSS project.
> datasets that often fit entirely in RAM.
Yep, and a lot more datasets fit entirely into RAM now. Ignoring the recent price spikes for a moment, 128GB of RAM in a laptop is entirely achievable and not even the limit of what is possible. That was a pipe dream in 2014 when computers with only 4GB were still common. And of course for servers the max RAM is much higher, and in a lot of scenarios streaming data off a fast local SSD may be almost as good.
Oldie-but-goldy:
https://yourdatafitsinram.net/
You don't really need to ignore the price spikes even. You can still buy more than 128GB of RAM for a machine with the $5k from one of those months.
I have actually worked at a company as a consultant data guy in a non-technical team. I had a 128GB PC 10 years back and did everything with open source R, and it worked! The others thought it was wizardry.
I’ve seen this pattern play out before. The pushback on simpler alternatives seems from a legitimate need for short time to market from the demand some of the equation and a lack of knowledge on the supply side. Every time I hear an engineer call something hacky, they are at the edge of their abilities.
[flagged]
systemd would be a derail even if you weren’t misrepresenting the situation at several levels. Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws, whereas in this case it’s a different scenario where the extra functionality is both unneeded and making it worse at the core task.
> Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws
That's funny. I used to have to clean up the messes caused by systemd's design limitations and flaws, until I built my own distro with a sane init system installed.
Many of the noobs groaning about the indignity of shell scripts don't even realize that they could write init 'scripts' in whatever language they want, including Python (the language these types usually love so much, if they do any programming at all.)
I think you’d have a more fruitful discussion if you stopped trying to call people noobs when they don’t agree with you.
For example, I’ve been dealing with SysV since the early 90s and while it’s gotten better since we no longer have to support the really bizarre Unix variants, my problem with init scripts wasn’t “indignity” but the lack of consistency across distributions and versions, which affects anyone shipping software professionally (“can’t do this easily until $distro upgrades coreutils”), and from an operator’s perspective using Python doesn’t make that better because instead of supporting one consistent thing you’d end up with the subset of features each application team felt like implementing, consistent only to the extent that they care to follow other projects. One virtue of systemd is that having a single common way to specify dependencies, restarts, customization, etc. avoids the ops people having to learn dozens of different variations of the same ideas and especially how to deal with their gaps. A few years back, a data center power outage at one place I worked really highlighted that: the systemd-based servers recovered quickly because they actually had working retries; all of the older stuff using SysV had to be manually reviewed because there were all kinds of problems like races on dependencies like DNS or NFS, retry logic which failed hard after a short period of time, failures because a stale PID file wasn’t removed, or cases where a vendor had simply never implemented retries in their init scripts. While in theory you can handle all of those in SysV most people never did.
After a couple decades of that, a lot of us don’t want to spend time on problems Microsoft solved in Bill Clinton’s first term.
I hate to blather on about systemd in this decade but how in the world does creating something completely different than sysv init help people shipping software? Now they have to support yet another init scheme.
Prior to all of the important distributions consolidating on systemd, you had to support each distribution’s convention for customization, overrides, dependencies, conventions for things like changing users or locations for PID files, not to mention the differences in various shell tools.
Nothing insurmountable but it meant init files were inevitably much longer than the corresponding Upstart or systemd files despite doing less, and anytime we shipped a new version you had more testing since you had to implement a lot of functionality which is built in to other things.
I just created my own OS, with my own init system that does things how I think it should be done--and it does it every time, without the bizarre bugs that come from Linux Puttering's shitware code.
It's the same thing any corporation should be doing if they were smart, instead of outsourcing everything to RedHat, Microsoft, Google, etc.
The reality is unit files are more portable than init scripts, regardless of what anyone says.
Systemd unified and simplified administration across a lot of distributions. Before, it was a hodge podge, and there was a lot of knowledge lost going from rhel to Debian.
It's entirely possible that both SysV init and systemd suck for different reasons. I'm still partial to systemd since it takes care of daemons and supervision in a way that init does not, but I'll take s6 or process-compose or even supervisord if I have to. Horses for courses.
I want to love s6 but every time I see the existence of s6-rc-compile I get heated. I'm sure there are excellent reasons behind it but I personally don't want services to work that way.
Yah, that does look awfully baroque. My experience with s6 has largely been minor tweaks to an existing setup where the complexity was hidden away from me. I used to use runit for managing daemons, but nowadays my supervisor of choice is docker compose. process-compose does look enticing though, and the Nix world seems pretty fond of it.
Specifying system processes and their dependencies declaratively, rather than in a tangle of arbitrary executable code, is cleaner, more efficient, easier to use, and more auditable. And that's not even getting into the additional process management duties systemd assumes.
You can write arbitrary scripts into systemd... or like one step removed at most? That's not really a difference unless you have some nuance in mind that I don't.
I honestly do not like systemd, either. It is okay for managing processes but I wish it didn't spread into everything else in the machine.
Or if it must, could it actually work cohesively across their concepts? Would be nice to have an obvious and easy way to run Quadlet as its own user to isolate further, would be nice to have systemd-sysusers present in /etc/subuid so they can run containers.
I like what they are doing with atomic distros. It would be great to have a single file declarative setup for something like running a containerized reverse HTTP proxy with an isolated user. Instead of "atomic" but you manually edit files in /etc after install.
Eternal September
+1, Insightful
Best reply my LLM had. Sorry.
+5 Troll
Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads, I imagine it's often faster than the command line tools, and with easier to understand syntax.
More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL.
I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. Doing it with a code-based solution (read data into objects in memory) is much more challenging to view the data. Debugging tools to inspect the objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
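A minimal sketch of that pattern (the paths and column names are made up): the "pipeline step" is just a shell script feeding SQL to the duckdb CLI, and you can reopen the same database file interactively later to inspect any intermediate table.

    # load raw logs, then build an aggregate; every intermediate stays queryable
    duckdb warehouse.duckdb <<'SQL'
    CREATE OR REPLACE TABLE raw_logs AS
      SELECT * FROM read_json_auto('logs/2024-*.jsonl');

    CREATE OR REPLACE TABLE daily_counts AS
      SELECT date_trunc('day', ts) AS day, status, count(*) AS n
      FROM raw_logs
      GROUP BY 1, 2;
    SQL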
Yep. It's literally what SQL was designed for; your business website can be running it… then you write a shell script to also pull some data on a cron. It's beautiful.
IMHO the main point of the article is that the typical unix command pipeline IS parallelized already.
The bottleneck in the example was maxing out disk IO, which I don't think duckdb can help with.
Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism, but being streams that process "lazily".
On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.
Pipelines are indeed one flow, and that works most of the time, but shell scripts make parallel tasks easy too. The shell provides tools to spawn subshells in the background and wait for their completion. Then there are utilities like xargs -P and make -j.
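For example (summarize.sh and the file layout are invented):

    # fan out: one job per daily file, at most 8 running at a time
    ls logs/2024-*.jsonl | xargs -n 1 -P 8 ./summarize.sh

    # the same thing with plain background subshells
    for f in logs/2024-*.jsonl; do
      ./summarize.sh "$f" > "$f.summary" &
    done
    wait   # block until every background job has finished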
UNIX provides the Makefile as go-to tool if a simple pipeline is not enough. GNUmake makes this even more powerful by being able to generate rules on-the-fly.
If the tool of interest works with files (like the UNIX tools do) it fits very well.
If the tool doesn't work with single files I have had some success in using Makefiles for generic processing tasks by creating a marker file that a given task was complete as part of the target.
I think it’s not so much engineers actually setting up a distributed compute, as it is dropping a credit card into a paid cloud service, which behind the scenes sets up a distributed compute cluster and bills you for the compute in an obfuscated way, then gives a 20% discount + SSO if you sign up for annual enterprise plan.
This kind of practice is insidious because early on, they charge $20/month to get started on the first 100mb of log ingestion, and you can have it up and running in 30 seconds with a credit card. Who would turn that down?
Revisit that set up 2 years later and it’s turned into a 60k/y behemoth that no one can unwind
On the contrary, the key message from the blog post is not to load the entire dataset to RAM unless necessary. The trick is to stream when the pattern works. This is how our field routinely works with files over 100GB.
Yep. The cloud providers however always get paid, and get paid twice on Sunday when the dev-admins forget to turn stuff off.
It’s the same story as always, just it used to be Oracle certified tech, now it’s the AWS tech certified to ensure you pay Amazon.
For a dataset that lives in RAM, the best solutions are DuckDB or clickhouse-local. Using SQL-ish tooling is easier than a bunch of bash scripts and really powerful.
Though ClickHouse is not limited to a single machine or local data processing. It's a full-featured distributed database.
Another alternative is Exasol that is factors (>10x) faster than Clickhouse and scales much better for complex analytics workloads that joins data. There is a free edition for personal use without data limit that can run on any number of cluster nodes.
If you just want to read and analyze single table data, then Clickhouse or DuckDB are perfect.
Disclaimer: I work at Exasol
This reminds me of this reddit comment from a long time ago: https://www.reddit.com/r/programming/comments/8cckg/comment/...
Airflow and dbt serve a real purpose.
The issue is you can run sub-TiB jobs on a few small/standard instances with better tooling. Spark and Hadoop are for when you need multiple machines.
Dbt and airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain and correct data issues and keep your data transforms timely.
edit: a little surprised at multiple downvotes. My point is, you can run airflow and dbt on small instances, and you can do all your data processing on small instances with tools like duckdb or polars.
But it is very useful to use a tool like dbt that allows you to re-build and manage your data in a clear way, or a tool like airflow which lets you specify dependencies for runs.
After say 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to payoff.
Agreed, airflow and dbt have literally nothing to do with the size of the data and can be useful, or overkill, at any size. Dbt just templates the query strings we use to query the data and airflow just schedules when we query the data and what we do next. The fact that you can fit the whole dataset in duckdb without issue is kind of separate to these tools, we still need to be organised about how and when we query it.
dbt is super useful for building a dag and managing pieces of it that update on different schedules. eg with one dataset that's refreshed monthly and another daily, you can only rebuild the daily one unless the slower-cadence input has a new update.
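Concretely, with dbt's graph selectors (the model name here is made up):

    # rebuild only the daily model and everything downstream of it,
    # leaving the monthly input untouched
    dbt run --select stg_daily_events+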
> a robust bash script
These hardly exist in practice.
But I get what you mean.
You don't. It's bash only because the parent process is bash, but otherwise it's all grep, sort, tr, cut and other textutils piped together.
awk can do some heavy lifting too if the environment is too locked down to import a kitchen sink of python modules.
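The classic sort of awk heavy lifting is a group-by sum with no dependencies at all (the columns here are hypothetical):

    # sum column 2 grouped by column 1 in a CSV
    awk -F',' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' data.csv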
Because developers are incentivized to have marketable software skills. Not marketable "build things that are cheap and profitable" skills.
Moore's law was supposed to make it simpler and cheaper to do more computationally expensive tasks. But in the meantime, everyone kept inflating the difficulty of a task faster than Moore could keep up.
I think some of this is because of the incredible amounts of capital that startups seem to be able to acquire. If startups had to demonstrate profitability before they were given any money to scale, the story would be very different I think.
Our lot burns a fortune on snowflake every month but no one is using it. Not enough data is being piped into it and the shitty old reports we have which just run some SQL work fine.
It looked good on someone’s resume and that was it. They are long gone.
> because setting up a 'Modern Data Stack' is what gets you promoted
It’s not just that, it’s that you better know their specific tech stack to even get hired. It’s a lot of dumb engineering leaders pretending that AWS, Azure and Snowflake are such wildly different ecosystems that not having direct experience in theirs is disqualifying (for pure DE roles, not talking broader sysadmin).
The entire data world is rife with people who don’t have the faintest clue what they’re doing, who really like buzzwords, and who have never thought about their problem space critically.
Well. I try for a middle ground. I am currently ditching both airflow and dbt. In Snowflake, I use scheduled tasks that call stored procedures. The stored procedures do everything I need to do. I even call external APIs like Datadog’s and Okta’s and pull down the logs directly into snowflake. I do try to name my stored procedures with meaningful names. I also add generous comments including urls back to the original story.
I forgot to mention: in Snowflake, besides cron-scheduled tasks, you can add dependent tasks that only run if the previous task succeeded. I have 40 tasks chained together that way. Each of my tasks calls a stored procedure. Within each procedure, I have try/catch and a catch-all clause that raises an error.
"I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'."
Also seen strange responses from HN commenters when it's mentioned that bash is large and slow compared to ash, and that bash is better suited for use as an interactive shell whereas ash is better suited for use as a non-interactive shell, i.e., a scripting shell.
I also use ash (with tabcomplete) as an interactive shell for several reasons
ENG are building what MGMT has told them to build for: the scale they want, not the scale they have.
I see this at work too. They are ingesting a few GB per day but running the data through multiple systems. So the same functionality we delivered with a python script within a week now takes months to develop and constantly breaks.
On the other hand, now we have duckdb for all the “small big data”, and a slew of 10-100x faster than Java equivalent stuff in the data x rust ecosystem, like DataFusion, Feldera, ByteWax, RisingWave, Materialize etc
The point of the article is those don’t actually work that well.
I guarantee those rust projects have spent more time playing with rust and library design than the domain problem they are trying to solve.
None of the systems I mentioned existed at the time the article was published. I think the author would love duckdb which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats. It fits in great with other Unix CLI stuff.
Many of the projects I mentioned you could see as a response to OP and the 2015 “Scalability, but at what COST?” paper which benchmarked distributed systems to see how many cores they need to beat a single thread. (https://news.ycombinator.com/item?id=26925449)
> None of the systems I mentioned existed at the time the article was published
So Hadoop was doing distributed compute wrong but now they have it figured out?
The point is that there is enormous overhead and complexity in doing it in any kind of distributed system. And your computer has a lot of power you probably aren't maxing out.
> which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats.
Do you know about SQLite?
Yeah, I'm a big fan of SQLite :). But at analytical workloads like aggregating every row, DuckDB will outperform SQLite by a wide margin. SQLite is great stuff but it's not a very good data Swiss Army knife because it's very focused on a single core competency: embeddable OLTP with a simple codebase. DuckDB can read/write many more formats from local disk or via a variety of network protocols. DuckDB also embeds SQLite so you can use it with SQLite DBs as inputs or outputs.
> they were doing distributed compute wrong but now they have it figured out?
Like anything the future is here but it’s unevenly distributed. Frank McSherry, the first author of “Scalability but at what COST” wrote Timely Dataflow as his answer to that question. ByteWax is based on Timely as is Materialize. Stuff is still complex but these more modern systems with performance as their goal are orders of magnitude better than the Hadoop era Java stuff.
I call BS on those Rust 10-100x claims. Rust and Java are roughly equal in performance. It is just that there are a lot of old NoSQL frameworks in Java which are trash. I also checked out those companies, some of which are doing interesting stuff. None claim things are 100x faster because of Rust. You just hurt your credibility when you say such clearly false things. That's how you end up with a Hadoop cluster which is 236x slower than a batch script.
PS None of the companies you linked seem to be using a datapath architecture which is the key to the highest level of performance
It wasn’t my intention to say “this stuff is 100x faster because rust”. DuckDB is C++. My intention was to draw distinction between the Java/Hadoop era of cluster and data systems, and the 2020s era of cluster and data systems, much of which has designs informed by stuff like this article / “Scalability but at what COST?”. I guess instead of “faster” I should say “more efficient”.
For example, the Kafka ecosystem tends to use Avro as the data transfer serialization, which needs a copy/deserialization step before it can be used in application logic. Newer stream systems like Timely tend to use zero-copy capable data transfer formats (timely's is called Abomonation), but it's the same idea as CapnProto or Flatbuffers: it's infinitely faster to not copy the data as you decode! In my experience this kind of approach is more accessible in systems languages like C++ or Rust, and harder to do in GC languages where the default approach to memory layout and memory management is "don't worry about it."
happy middle ground: https://www.definite.app/ (I'm the founder).
datalake (DuckLake), pipelines (hubspot, stripe, postgres), and dashboards in a single app for $250/mo.
marketing/finance get dashboards, everyone else gets SQL + AI access. one abstraction instead of five, for a fraction of your Snowflake bill.
If airflow is a layer of abstraction something is wrong.
Yes, it is an additional layer, but if your orchestration starts concerning itself with what the tasks are actually doing, then something is wrong. It is not a layer on top of other logic; it is a single layer where you define how to start your tasks, how to tell when something is wrong, and when to run them.
If you don't insist on doing heavy computations within the airflow worker, it is dirt cheap. If it's something that can easily be done in bash or python, you can do it within the worker as long as you're willing to throw a minimal amount of hardware at it.
This times a zillion! I think there's been a huge industry push to convince managers and more junior engineers that spark and distributed tools are the correct way to do data engineering.
I think it's a similar pattern to web dev, where influencers have convinced everyone to build huge hydrated-spa-framework-craziness where a static site would do.
My advice to get out of this mess:
- Managers, don't ask for specific solutions (spark, react). Ask for clever engineers to solve problems and optimise / track what you care about (cost, performance etc). You hired them to know best, and they probably do.
- Technical leads, if your manager is saying "what about hyperscale?" You don't have to say "our existing solution will scale forever". It's fine to say, "our pipelines handle dataset up to 20GB, we don't expect to see anything larger soon, and if we do we'll do x/y/z to meet that scale". Your manager probably just wants to know scaling isn't going to crash everything, not that you've optimised the hell out of everything for your excel spreadsheet processing pipeline.
Here’s the thing though, most companies work with small data. The distribution of data set size follows a power law which means that few engineers get to work with petabyte sized datasets. That said, the job market definitely incentivizes people to have that kind of experience on their resume if they want to keep progressing in salary. This incentivizes over engineering.
> the job market definitely incentivizes people to have that kind of experience on their resume
Yeah, this is sadly often true, but it's also another trap that people don't have to fall into.
I've been involved with hiring data engineers, and I see experience with distributed computing way more often than I see knowledge of simple profiling and debugging tools. But I'd personally value the latter a lot more when interviewing.
Companies that hire for skills they don't need are of course perpetuating this problem, but they're also paying a big "tax" in the sense that they're not hiring for the skills they actually do need.
yes, but engineers also suck at communicating costs and benefits (and understanding them).
Absolutely, when I worked at (semi-well-known unicorn) a half-dozen years ago on the data-engineering team the manager told me "Hey we want to use spark next quarter, that's a huge initiative."
And I immediately asked, "in what capacity?" And the answer was don't-know/doesn't-matter, it's just important that we can say we're using it. I really wish I understood where that was coming from (his manager resume-building? somebody getting a kickback?)
That's when you rewrite your codebase in the SPARK dialect of Ada and play innocent when your management questions you about it.
They'll never say it's resume building or kickbacks; they'll invent some technical-sounding and/or business reason to achieve the same result.
The most interesting part is that you can say you're doing/using something entirely independent of if you actually are. Sure, that's a lie, but so is only using something so you can say you're using it (sure, they admitted to you that was the reason, but that won't be the reason they put on LinkedIn).
Author here!
It's great to see this post I wrote years ago still being useful for people.
I agree with many here that the situation is arguably worse in many ways. However, along similar lines, I've been pleased to see a move away from cargo culting microservices (another topic I addressed in a separate post on that site).
To all those helping companies and teams improve performance, keep it up! There is hope!
Adam,
Thank you very much!
Been re-reading your post multiple times.
You inspired me to port Waters-Series (kind-of streams) to JavaScript to get pipelining for stream processing.
Working on mrjob was a big part of my first job out of college. Fun to see it get mentioned more than ten years later.
What some commenters don't realize about these bureaucratic IO-heavy expensive tools is that sometimes they are used in order to apply a familiar way of thinking, which has Business Benefits. Sometimes you don't know if your task will take seconds, minutes, hours, days, or weeks on one fast machine with a well-thought-out program, but you really need it to take at most hours, and writing well-thought-out-programs takes time you could spend on other stuff. If you know you can scale the program in advance, it's lower risk to just write it as a Hadoop job and be done with it. Also, it helps to have an "easy" pattern for processing Data That Feels Big Even If It Isn't That Big, Although Yelp's Data Actually Was Big. Such was the case with mrjob stuff at Yelp in 2012. They got a lot of mileage out of it!
The other funny thing about mrjob is that it's a layer on Hadoop Streaming, which is a term for when the Java process actually running the Hadoop worker opens a subprocess to your Python script which accepts input on stdin and writes output on stdout, rather than working on values in memory. A high I/O price to pay for the convenience of writing Python!
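For anyone who never used it, a streaming job invocation looks roughly like this; the jar and HDFS paths vary by install, and mapper.py/reducer.py stand in for the stdin/stdout scripts described above:

    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input  /data/reviews \
      -output /data/review-counts \
      -mapper  mapper.py \
      -reducer reducer.py \
      -file mapper.py -file reducer.py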
That's a good point. Hadoop may not be the most efficient way, but when a deliverable is required, Hadoop is a known quantity and really works.
I did some interesting work ten years ago, building pipelines to create global raster images of the entire Open Street Map road network [1]. I was able to process the planet in 25 minutes on a $50k cluster.
I think I had the opposite problem: Hadoop wasn't shiny enough and Java had a terrible reputation in academic tech circles. I wish I'd known about mrjob because that would have kept the Python maximalists happy.
I had lengthy arguments with people who wanted to use Spark which simply did not have the chops for this. With Spark, attempting to process OSM for a small country failed.
Another interesting side-effect of using the map-reduce paradigm was with processing vector datasets. PostGIS took multiple days to process the million-vertex Norwegian national parks, however splitting the planet into data density sensitive tiles (~ 2000 vertices) I could process the planet in less than an hour.
Then Google Earth Engine came along and I had to either use that, or change career. Somewhat ironically GEE was built in Java.
[1] https://github.com/willtemperley/osm-hadoop
I’ve also seen some Really Really Bad software due to engineers having “Not Invented Here” syndrome. If it takes using big well known frameworks to avoid some of that it’s worth the cost.
A little bit of history related to the article for any who might be interested...
mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.
Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.
I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
The bigness of your data has always depended on what you are doing with it.
Consider the following table of medical surgeries: date, physician_name, surgery_name, success.
"What are the top 10 most common surgeries?" - easy in bash
"Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash
"Which surgeries are most affected by physician experience?" - very hard in bash, requires calculating for every surgery how many times that physician had performed that surgery on that day, then compare low and high experience outcomes.
A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity. A 50GB dataset might be 'bigger' than a 2TB one if you are asking tough questions.
It's easier for a business to say "we use Spark for data processing", than "we build bespoke processing engines on a case by case basis".
50GB and 2TB are both sizes that SQLite supports and could handle. You could probably solve all of the problems you mentioned with simple tools on a single server, in the language of your choice.
Sounds like a good fit for DuckDB.
When I worked as a data engineer, I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON at 10s of MB/s - creating a huge bottleneck.
By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).
Just how much data do you need when these sort of clustered approaches really start to make sense?
> I rewrote some Bash and Python scripts into C# that were previously processing gigabytes of JSON
Hah, incredibly funny, I remember doing the complete opposite about 15 years ago: some beginner developer had set up a whole interconnected system with multiple processes and what not in order to process a bunch of JSON, and it took forever. Got replaced with a bash script + Python!
> Just how much data do you need when these sort of clustered approaches really start to make sense?
I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.
A C# executable is more similar to a bash/Python script than it is to an interconnected system using multiple processes.
Yeah, I realize now that my comment actually misses the most important point, the "interconnected system with multiple processes" I was talking about was made in C#, that's why the whole "I did the reverse as you" was funny to me in the first place.
I like the peer comment's answer about a processing time threshold (e.g., a day). Another obvious threshold is data that doesn't conveniently fit on local disks. Large scale processing solutions can often process directly from/to object stores like S3. And if it's running inside the same provider (e.g., AWS in the case of S3), data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.
> data can often be streamed much faster than with local SSDs. 10GB/s has been available for a decade or more, and I think 100GB/s is available these days.
In practice most AWS instances are 10Gbps capped. I have seen ~5Gbps consistently read from GCS and S3. Nitro-based instances are in theory 100Gbps capable; in practice I've never seen that.
Also, anything under 16 vCPUs generally has baseline / burst bandwidth, with the burst being best-effort, 5-60 minutes.
This has, at multiple companies for me, been the cause of surprise incidents, where people were unaware of this fact and were then surprised when the bandwidth suddenly plummeted by 50% or more after a sustained load.
I remember a panel at a PyCon where we were discussing, I think, the Anaconda distribution in the context of packaging, and a respected data scientist (whose talks have always been hugely popular) made the point that he doesn't like Pandas because it's not Excel. The latter was his go-to tool for most of his exploratory work. If the data were too big, he'd sample it and things like that, but his work ultimately ended up in Excel.
Quick Python/bash to clean up data is fine too, I suppose, and with LLMs it's easier than ever to write a quick throwaway script.
I took a biostatistics class. The tools were Excel, R, or Stata.
I think most people used R. Free and great graphing. Though the interactivity of Excel is great for what-ifs. I never got R till I took that class. RStudio makes R feel like a scriptable Excel.
R/Python are fast enough for most things, though a lot of genomics tools (BLAST alignments, etc.) are written in compiled languages.
Whenever I had to use anaconda it was slow as molasses. Was that ever fixed?
This has been fixed for ages. The dep solver was changed to Libmamba a few years ago.
What tasks were slow?
How do you stream parse json? I thought you need to ingest it whole to ensure it is syntactically valid, and most parsers don't work with inchoate or invalid json? Or at least it doesn't seem trivial.
I don't know what the GP was referring to, but often this is about "JSONL" / "JSON Lines" - files containing one JSON object per line. This is common for things like log files. So, process the data as each line is deserialized rather than deserializing the entire file first.
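For example, a minimal Python sketch of the line-by-line approach (the file name and field are made up for illustration):

    import json

    def total_bytes(path):
        # Stream a JSON Lines file: parse and discard one record at a time,
        # so memory use stays flat no matter how large the file is.
        total = 0
        with open(path, "r", encoding="utf-8") as fp:
            for line in fp:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)        # one small object per line
                total += record.get("bytes", 0)  # aggregate as you go
        return total

    print(total_bytes("events.jsonl"))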
I used Newtonsoft.Json which takes in a stream, and while it can give you objects, it can also expose it as a stream of tokens.
The bulk of the data was in big JSON arrays, so you basically consumed the array start token, then used the parser to consume an entire object, which could be turned into a C# object by the deserializer, then you consumed a comma or end-array token, until you ran out of tokens.
I had to do it like this because DSes were running into the problem that some of the files didn't fit into memory. The previous approach took 1 hour and involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10GB, even 64GB of memory wasn't enough and the system started swapping).
It wasn't fast even before swapping (I learned just how slow Python can be), but then basically it took a day to run a single experiment. Then the data got turned into a dataframe.
I replaced that part of the Python processing and output a CSV which Pandas could read without having to go through Python code (I guess it has an internal optimized C implementation).
The preprocessor was able to run on the build machines and DSes consumed the CSV directly.
This sounds similar to how in C#/.NET there are (at least) 3 methods of reading XML: XmlDocument, XPathDocument, or XmlReader. The first 2 are in-memory object models that must parse the entire document to build up an object hierarchy, through which you then access object-oriented representations of XML constructs like elements and attributes. The XmlReader is stream-based, where you handle tokens in the XML as they are read (forward-only).
Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET built-in parsing) has a similar token-based reader in addition to the standard (de)serialization to objects approach.
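Python's standard library has the same split; a rough sketch of the streaming style with xml.etree.ElementTree.iterparse (the element name here is invented):

    import xml.etree.ElementTree as ET

    def count_records(path):
        # Forward-only, streaming parse: handle each element as it closes,
        # then drop it so memory never grows to the size of the document.
        count = 0
        for event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "record":   # hypothetical element name
                count += 1             # ...or inspect elem.attrib / elem.text
            elem.clear()               # release the element's children
        return count

    print(count_records("big.xml"))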
Would for example https://pypi.org/project/json-stream/ have met your needs?
I'm going to go out on a limb and say no - this library seems to do the parsing in Python, and Python is slow, like many times slower than Java, C#, or languages in that class - which you find out if you try to do heavy data processing with it, which is one of the reasons I dislike the language. It's also very hard to parallelize - in C#, if you feed stuff into LINQ and entries are independent, you can make the work parallel with PLINQ very quickly, while threads aren't really a thing in Python (or at least they weren't back then).
I've seen so many times that data processing quickly became a bottleneck and source of frustration with Python that stuff needed to be rewritten, that I came to not bother writing stuff in Python in the first place.
You can make Python fast by relying on NumPy and pandas for array programming, but formatting and massaging the data so that what you want can be expressed as array-programming ops can be quite challenging; it usually became too much of a burden for me.
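A toy illustration of the difference, with made-up data: the first version dispatches Python bytecode per element, the second keeps the loop inside NumPy's compiled code.

    import numpy as np

    values = np.random.rand(10_000_000)

    # Interpreter loop: one Python-level iteration per element.
    slow = sum(v * v for v in values)

    # Array programming: same result, loop runs in compiled code.
    fast = np.dot(values, values)  # equivalent to (values * values).sum()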
I wish Python was at least as fast as Node (which also can have its own share of performance cliffs)
It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.
From the README, features include:
> native code parsing speedups for most common platforms
Which is to say, roughly analogous to "relying on NumPy". (A well-designed system avoids repeatedly calling from Python to C and prefers to let loops live within the C code; that applies at least as much to tree-like data as array-like data.)
> I wish Python was at least as fast as Node (which also can have its own share of performance cliffs)
> It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.
No guarantees, but have you tried PyPy? It's existed since 2007 and definitely improved over time.
I would say that "performance cliffs" are just endemic to programming. Even in C you find people writing bad algorithms because better ones seem (at least superficially) much harder to write — especially if the good algorithm requires, say, a hash table. (C++ standard library containers definitely ameliorate this effect, but you pay in code complexity, especially where templates are needed.) And on the other hand you sometimes see big improvements from dropping to assembly (cf. ffmpeg).
You assume it is valid until it isn't, and then you can have different strategies to handle that, like just skipping the broken part and carrying on.
Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser is eager to spit out a stream of data that matches the query as soon as you find it.
The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.
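A toy version of such a state machine in Python, assuming the input is one large JSON array of objects (real libraries handle many more edge cases):

    import json

    def stream_objects(fp, chunk_size=64 * 1024):
        # Track string/escape state and brace depth across fixed-size chunks,
        # yielding each top-level object as soon as its closing brace arrives.
        depth, in_string, escaped, buf = 0, False, False, []
        while chunk := fp.read(chunk_size):
            for ch in chunk:
                if in_string:
                    buf.append(ch)
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                    buf.append(ch)
                elif ch == "{":
                    depth += 1
                    buf.append(ch)
                elif ch == "}":
                    depth -= 1
                    buf.append(ch)
                    if depth == 0:
                        yield json.loads("".join(buf))
                        buf = []
                elif depth > 0:
                    buf.append(ch)
                # '[', ']', ',' and whitespace between objects are skipped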
There's a whole heap of approaches, each with their own tradeoffs. But most of them aren't trivial, no. And most end up behaving erratically with invalid json.
You can buffer data, or yield as it becomes available before discarding, or use the visitor pattern, and others.
One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream
https://devblogs.microsoft.com/dotnet/the-convenience-of-sys...
https://learn.microsoft.com/en-us/dotnet/standard/serializat...
> Just how much data do you need when these sort of clustered approaches really start to make sense?
I did not see your comment earlier, but to stay with chess, see https://news.ycombinator.com/item?id=46667287, with ~14TB uncompressed.
It's not humongous and it can certainly fit on disk(s), but not on a typical laptop.
> Just how much data do you need when these sort of clustered approaches really start to make sense?
You really need an enormous amount of data (or data processing) to justify a clustered setup. Single machines can scale up quite a lot.
It'll cost money, but you can order a 24x128GB RAM, 24x30TB SSD system which will arrive in a few days and give you 3 TB of RAM and 720 TB of (fast) disk. You can go bigger, but it'll be a little exotic and the ordering process might take longer.
If you need more storage/RAM than around that, you need clustering. Or if the processing power of a single system isn't enough, you would need to cluster, but ~256 CPU cores is enough for a lot of things.
What motherboard supports this much ram?
This supports up to 48x 256GB DIMMs over two sockets, which I believe is the maximum that EPYC Turin supports: https://www.asrockrack.com/general/productdetail.asp?Model=T...
https://store.supermicro.com/us_en/systems/a-systems/h13-2u-...
I have no experience with these, but lots of good experiences with last decade supermicro systems.
It's not about how much data you have, but also the sorts of things you are running on your data. Joins and group by's scale much faster than any aggregation. Additionally, you have a unified platform where large teams can share code in a structured way for all data processing jobs. It's similar in how companies use k8s as a way to manage the human side of software development in that sense.
I can, however, say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key areas where we saw rapid improvement was simply using fewer machines with vertically scaled hardware, which almost always outperformed any sort of distributed system (albeit not always from a price-performance perspective).
The real value often comes from the ability to do retries, leverage leftover underutilized hardware (i.e. spot instances, or your own data center at times when load is lower), handle hardware failures, etc., all while the full suite of tools above keeps working.
Other way around. Aggregation is usually faster than a join.
Disagree, though in practice it depends on the query, cardinality of the various columns across table, indices, and RDBMS implementation (so, everything).
A simple equijoin with high cardinality and indexed columns will usually be extremely fast. The same join in a 1:M might be fast, or it might result in a massive fanout. In the latter case, if your RDBMS uses a clustering index, and if you’ve designed your schemata to exploit this fact (e.g. a table called UserPurchase that has a PK of (user_id, purchase_id)), it can still be quite fast.
Aggregations often imply large amounts of data being retrieved, though this is not necessarily true.
That level of database optimization is rare in practice. As soon as a non-database person gets decision making authority there goes your data model and disk layout.
And many important datasets never make it into any kind of database like that. Very few people provide "index columns" in their CSV files. Or they use long variable length strings as their primary key.
OP pertains to that kind of data. Some stuff in text files.
How is a proper PK choice a high level of optimization?
Unconvinced. Any join needs some kind of seek on the secondary relation's index, or a bunch of state if you're stream-joining, building a temporary index of size O(n) until the end of the batch. On the other hand, summing N numbers needs O(1) memory, and if your data is column-shaped it’s like one CPU instruction to process 8 rows. In a “big data” context there's usually no traditional B-tree index to join against either. For jobs that process every row in the input set, Mr Join is horrible for perf, to the point that people end up with a dedicated join job/materialized view so downstream jobs don’t have to redo the work.
An aggregation is less work than a join. You are segmenting the data in basically the same way under ideal conditions for a join as you are in an aggregation. Think of an aggregation as an inner join against a table of buckets (plus updating a single value instead of keeping a number of copies around). In practice this holds, with the aggregation being linearly faster than a join over the same data. That delta is the extra work the join needs to do to keep around a list of rows rather than a single value being updated (and staying in cache) repeatedly. Depending on the data, this delta might be quite small. But short of a very obtuse aggregation function (kurtosis, perhaps), the aggregation will be faster. It's updating a single value vs appending to a list, with the extra memory overhead that introduces.
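A small pandas sketch of that comparison on made-up data: the aggregation keeps one running value per bucket, while the join materializes a full row per match before anything gets reduced.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    events = pd.DataFrame({
        "user_id": rng.integers(0, 100_000, size=1_000_000),
        "amount": rng.random(1_000_000),
    })
    users = pd.DataFrame({"user_id": np.arange(100_000), "segment": "a"})

    # Aggregation: one value per group, updated in place.
    per_user = events.groupby("user_id")["amount"].sum()

    # Join: a full row per match is kept around before any reduction.
    joined = events.merge(users, on="user_id", how="inner")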
I'm saying that a smaller amount of data means more compute is required for a join. Sorry if that wasn't clear.
Adam Drake's example (OP) also streams from disk. And the unix pipeline is task-parallel.
Bash is built around streaming though. You have to know how to use the tools to get the gains.
You didn't need to rewrite to C# to do that - Python should be able to handle streaming that amount/velocity of data fine, at least through a native extension like msgspec or pydantic. Additionally, you made it much harder for other data engineers who need to maintain/extend the project in the future to do so.
The C# is probably far more maintainable and less error prone than Python. At least in my experience that's almost always the case.
I've had plenty of Python jobs which run fine for several hours and then break with runtime errors, whereas with C# you can be reasonably sure that if it starts running it will finish running.
Not a language problem, it's a dev culture problem. You can hold your devs accountable for the quality of their code. Stronger typing support via static analysis, as well as runtime validation of untrusted input/data, has really helped Python a lot.
I'm not necessarily the biggest fan of python, but writing a data engineering tool in a non-data engineering focused language seems like a bad decision. Now when the OP leaves the organization is in a much tougher position.
> Now when the OP leaves the organization is in a much tougher position.
Are they really, though? You're assuming their org is unfamiliar with C#. Not all data engineers only know Python. The ones I work with mainly use C# because we all do!
I'm a software and data engineer. I work with C# pretty extensively in my software day job. I've never seen a data engineer job listing mention C#.
Additionally, the way the OP's comment reads, I'm ok with the assumption I made. It reads like it was a unilateral decision on their part and not something that got buy in from the team.
A selection of times it's been previously posted:
(2018, 222 comments) https://news.ycombinator.com/item?id=17135841
(2022, 166 comments) https://news.ycombinator.com/item?id=30595026
(2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.
I think many devs learn the trade with Windows and don't get exposure to these tools.
Plus, they require a bit of reading because they operate on a higher level of abstraction than loops and ifs. You get implicit loops, your fields get cut up automatically, and you can apply regexes simultaneously on all fields. So it's not obvious to the untrained eye.
But you get a lot of power and flexibility on the cli, which enable you to rapidly put together an ad hoc solution which can get the job done or at least serve as a baseline before you reach for the big guns.
> The first thing to do is get a lot of game data. This proved more difficult than I thought it would be, but after some looking around online I found a git repository on GitHub from rozim that had plenty of games. I used this to compile a set of 3.46GB of data, which is about twice what Tom used in his test. The next step is to get all that data into our pipeline.
It would be interesting to redo the benchmark but with a (much) larger database.
Nowadays the biggest open data for chess must come from Lichess https://database.lichess.org, with ~7B games and 2.34 TB compressed, ~14TB uncompressed.
Would Hadoop win here?
If you get all the data on fast SSDs in a single chassis, you probably still beat EMR over S3. But then you have a whole dedicated server to manage your 14TB of chess games.
The "EMR over S3" paradigm is based on the assumption that the data isn't read all that frequently, 1-10x a day typically, so you want your cheap S3 storage but once in a while you'll want to crank up the parallelism to run a big report over longer time periods.
Almost certainly not. You can go on AWS or GCP and spin up a VM with 2.2 TB RAM and 288 vCPUs. Worst case, if streaming the data sequentially isn't fast enough, you can use something like GNU Parallel to launch processes in parallel to use all those 288 cpus. (It's also extremely easy to set up - 'apt install parallel' is about all you need.) That starts to resemble Hadoop, if you squint, except that it's all running on the same machine. As a result, it's going to outperform Hadoop significantly.
The only reason not to do that is if for some reason the workload won't support that kind of out-of-the-box parallelism. But in that case, you'd be writing custom code for Hadoop or Spark anyway, so there's an argument for doing the same to run on a single VM. These days it's pretty easy to essentially vibe code a custom script to do what you need.
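Not the commenter's setup, but a sketch of the same idea in Python: fan per-file work out across every core on the one machine (the file layout and the per-file work are invented).

    import glob
    from concurrent.futures import ProcessPoolExecutor

    def count_matches(path):
        # Arbitrary per-file work; each call runs in its own process.
        with open(path, "r", encoding="utf-8", errors="replace") as fp:
            return sum(1 for line in fp if "[Result" in line)

    if __name__ == "__main__":
        files = glob.glob("games/*.pgn")  # hypothetical input layout
        with ProcessPoolExecutor() as pool:  # defaults to one worker per core
            total = sum(pool.map(count_matches, files))
        print(total)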
At the company I'm with, we use Spark and Apache Beam for many of our large workloads, but that's typically involving data at the petabyte scale. If you're just dealing with a few dozen terabytes, it's often faster and easier to spin up a single large VM. I just ran a process on Friday like that, on a 96-core VM with 350 GB RAM.
It depends on what you were trying to do with the data. Hadoop would never win, but Spark can allow you to hold all that data in memory across multiple machines and perform various operations on it.
If all you wanted to do was filter the dataset for certain fields, you can likely do something faster programmatically on a single machine.
Probably not.
The compressed data can fit onto a local SSD. Decompression can definitely be streamed.
Not only can it be streamed, but lz4 will probably make things quicker.
> Would Hadoop win here?
Hadoop never wins. It's the worst of all possible worlds.
what would you calculate in the data?
I could be tempted to do some fun on an NVL72 ;-)
Tangential, but this reminds me of the older K website, back when it was shakti.com, that had an intro like this in its about section:
1K rows: use excel
1M rows: use pandas/polars
1B rows: use shakti
1T rows: only shakti
Source: https://web.archive.org/web/20230331180931/https://shakti.co...
No joins in that article?
The comments here smell of "real engineers use command line". But I am not sure they have ever actually analysed data beyond using the command line as a log parser.
Yes Hadoop is 2014.
These days you obviously don't set up a Hadoop cluster. You use the managed service your cloud provider offers (BigQuery or AWS Athena, for example).
Or map your data into DuckDB or use polars if it is small.
It depends. I’ve done plenty of data processing, including at large Fortune 10s. Most of the big data could be shrunk to small data if you understood the use case: pre-aggregating, filtering to smaller datasets based on known analysis patterns, etc.
Now, you could argue that that’s cheating a bit and introduces preprocessing that is as complex as running Hadoop in the first place, but I think it depends.
In my experience, though, most companies really don’t have big data, and many that do don’t really need to.
Most companies aren’t Fortune 500s.
I used to work at Elastic, and I noticed that most (not all!) of the customers who walked up to me at the conferences were there to ask about datasets that easily fit into memory on a cheap VPS.
Let your analysts use DuckDB or pandas/polars then instead of quirky command line tools.
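For instance, a minimal DuckDB sketch for querying a CSV in place from Python (file and column names invented):

    import duckdb

    # No cluster, no ingestion step: DuckDB scans the file directly
    # and parallelizes the query across local cores.
    result = duckdb.sql("""
        SELECT result, count(*) AS games
        FROM read_csv_auto('games.csv')
        GROUP BY result
        ORDER BY games DESC
    """).df()  # hand the result to pandas if that's what analysts prefer

    print(result)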
> But I am not sure they have ever actually analysed data beyond using the command line as a log parser.
It really feels that way. Real data analysis involves a lot more than just grepping logs. And the reason to be wary of starting out unprepared for that kind of analysis is that migrating to a better solution later is a nightmare.
In many ways HN is Reddit in denial at this point :) Comments and upvotes that are based mostly on vibes, with depth and discussion usually happening somewhere towards the middle of the comment tree.
Where else would you JOIN in?
I'm about to start looking - this JOINt's way past its prime
It’s easy to overlook how often straightforward approaches are the best fit when the data and problem are well understood. Large expensive tools can become problems in their own right creating complexity that then requires even more tooling to manage. (Maybe that's the intent?) The issue is that teams and companies often adopt optimization frameworks earlier than necessary. Starting with simpler tools can get you most of the way there and in many cases they turn out to be all that’s needed.
I've contributed to PrestoDB, but the availability of DuckDB and of fast multi-core machines with even faster SSDs makes the need for distribution all the more niche; often it's just cargo-culting Google or Meta.
The benefit of PrestoDB is that it can be used without even starting one of these expensive instances, e.g. via AWS Athena.
MapReduce is from a world with slow HDDs, expensive ram, expensive enterprise class servers, fast network.
In that case to get best performance, you’d have to shard your data across a cluster and use mapreduce.
Even in the author's 2014 world of SSDs and multi-core consumer PCs, their aggregate pipeline would be around 2x faster if the work were split across two equivalent machines.
The limit of how much faster distributed computing can be comes down to latency more than throughput. I’d not be surprised if this aggregate query could run in 10ms on pre-sharded data in a distributed cluster.
Confusing the concept and the implementation.
Somebody has to go back to first principles. I wrote Pig scripts in 2014 in Palo Alto. Yes, it was shit. IYKYK. But the author, and nearly everybody in this thread, are wrong to generalize.
PCIe would have to be millions of times faster than Ethernet before command line tools are actually faster than distributed computing and I don't see that happening any time soon.
I had a similar, very tiny-scale problem that was solved much quicker by awk+perl. I had a relatively large dataset across many YAML files from which I needed to compute a result. Turns out that using yq/jq to perform a query was much (orders of magnitude) slower (>10m) to compute any kind of result. Outputting the data into CSV, then iterating over that, was much, much faster (seconds). Of course, dumping it into SQLite and querying that was nearly instantaneous.
I know it’s not a direct 1:1 comparison, but it brings to mind that solutions that were made common decades ago are still relevant today.
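For what it's worth, the SQLite step is only a few lines of Python (table, file, and column names invented for illustration):

    import csv
    import sqlite3

    con = sqlite3.connect("data.db")
    con.execute("CREATE TABLE IF NOT EXISTS hosts (name TEXT, env TEXT, cpu INTEGER)")

    # Load the flattened CSV once...
    with open("hosts.csv", newline="", encoding="utf-8") as fp:
        rows = ((r["name"], r["env"], int(r["cpu"])) for r in csv.DictReader(fp))
        con.executemany("INSERT INTO hosts VALUES (?, ?, ?)", rows)
    con.commit()

    # ...then the kind of ad hoc query that was painfully slow via yq/jq is instant.
    for row in con.execute("SELECT env, count(*), sum(cpu) FROM hosts GROUP BY env"):
        print(row)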
The same thing is true with SQLite vs Postgres. Most startups need SQLite, not Postgres. Many queries run an order of magnitude faster. Not only is it better for your users, it's life-changing to see the test suites (which would take minutes to run) complete in mere seconds.
Feels like quibbling over the differences between two databases that are going to act the same for 90% of the projects out there; it doesn't really matter.
If you want speed, just have your database stored in the same place as your application, locally, rather than hopping across the world to retrieve data that can be located next to the code.
That would probably be the easiest thing to do to get real, measurable performance gains.
As other commentators pointed out, computers are extremely powerful. This isn't 1995, you can easily host everything in the same local area and get a very responsive application with very minimal needs to worry about resource constraints.
> Many queries run an order of magnitude faster.
Given how primitive SQLite's optimizer is and how similar the storage and execution engines between the two are in terms of architecture, this seems unlikely to be the norm unless you did something wrong on the Postgres side. (Of course, no RDBMS optimizer will always give the best answer, so there's bound to be such cases.)
I don't know where this SQLite obsession is coming from these days, but it doesn't help that Cloudflare/Fly/other EngBrands are doing it.
I’m curious about the memory usage of the cat | grep part of the pipeline. I think the author is processing many small files?
In which case it makes the analysis a bit less practical, since the main use case I have for fancy data processing tools is when I can’t load a whole big file into memory.
Memory footprint is tiny:
Unix shell pipelines are task-parallel. Every tool gets spun up as its own unix process — think "program" (fork-exec). Standard input and standard output (stdin, stdout) get hooked up to pipes. Pipes are like temporary files managed by the kernel (hand-wave). Pipe buffer size is a few KB. Grep does a blocking read on stdin. Cat writes to stdout. Both on a kernel I/O boundary. Here the kernel can context-switch the process when waiting for I/O.
In the past there was time-slicing. Now with multiple cores and hardware threads they actually run concurrently.
This is very similar to old-school approach to something like multiple threads, but processes don’t share virtual address spaces in the CPU's memory management unit (MMU).
Further details: look up McIlroy's pipeline design.
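One way to poke at those mechanics from Python: build the cat | grep pair by hand with subprocess (file name invented) and note that only a small kernel pipe buffer ever sits between the two processes.

    import subprocess

    # Two processes joined by a kernel pipe, like `cat huge.log | grep ERROR`.
    # The writer blocks when the pipe is full, the reader when it is empty.
    cat = subprocess.Popen(["cat", "huge.log"], stdout=subprocess.PIPE)
    grep = subprocess.Popen(["grep", "ERROR"], stdin=cat.stdout,
                            stdout=subprocess.PIPE)
    cat.stdout.close()  # so grep sees EOF once cat exits

    matches = sum(1 for _ in grep.stdout)  # consume output incrementally
    grep.wait()
    cat.wait()
    print(matches)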
Something to note here is that the result of xargs -P is unlikely to be satisfactory, since all of the subprocesses are simply connected to the terminal and stomp over each other's outputs. A better choice would be something like rush or, for the Perl fans, parallel.
Great article. Hadoop (and other similar tools) are for datasets so huge they don't fit on one machine.
https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one...
I like this one, where they put a dataset on 80 machines, only for someone to then put the same dataset on one Intel NUC and outperform them in query time.
https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...
Datasets never become big enough…
>Datasets never become big enough…
Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations basically _everyone_ using single-instance ClickHouse runs into if they actually have a large dataset.
Spark and Hadoop have their place, maybe not in rinky dink startup land, but definitely in the world of petabyte and exabyte data processing.
When a single server is not enough, you deploy ClickHouse on a cluster, up to thousands of machines, e.g., https://clickhouse.com/blog/how-clickhouse-powers-ahrefs-the...
Well, at my old company we had some datasets in the 6-8 PB range, so tell me how we would run analytics on that dataset on an Intel NUC.
Just because you don't have experience of these situations, it doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."
These situations are rare, not difficult.
The solutions are well known, even to many non-programmers who actually have that problem.
There are also sensor arrays that write 100,000 data points per millisecond. But again, that is a hardware problem, not a software problem.
Well yeah, but that's a _very_ different engineering decision with different constraints, it's not fully apples to apples.
Having materialised views increases insert load for every view. So if you want to slice your data in a way that wasn't predicted, or that would have increased ingress load beyond what you've got to spare (say, find all devices with a specific model and year+month because there's a dodgy lot), you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.
And we can have pretty fucking big single machines right now
https://yourdatafitsinram.net/
And now with things like DuckDB and clickhouse-local you won't have to worry about data processing performance ever again. Just kidding, but especially with ClickHouse it's so much better to handle the large data volume compared to the past, and even a single beefy server is often enough to satisfy all data analytics needs for a moderate-to-large company.
This has been a recurring theme for ages, with a few companies taking it to extremes; there are people transpiling COBOL to bash, too…
Hadoop, blast from the past
Lecture 2: Command-line Environment of Missing Semester 2026 released today! - https://www.youtube.com/watch?v=ccBGsPedE9Q
Yes, but you neither raise investor money with an efficient chaining of CLI tools that already do the job, nor convince your clients that they're getting value for their money.
This makes me think of Bane's rule, described in this comment here [1]:
Bane's rule, you don't understand a distributed computing problem until you can get it to fit on a single machine first.
[1] https://news.ycombinator.com/item?id=8902739
And since AI agents are extremely good at using them, command-line tools are also probably 235x more effective for your data science needs.
It was a fun moment to finally work on a data problem that did not fit on any (practical) machine. I needed about 50TiB of memory to process a multi-PiB set of logs.
It's worth remembering, however, that even though splitting a large task into many smaller tasks is less efficient per CPU, it may be more efficient overall alongside other workloads, since you can bin-pack tasks more efficiently on a cluster; not to mention that if tasks fail, you are retrying less of the overall work.
All this is to say, the article makes a very good point, but doing it all on one machine also has problems. Just don't cargo cult engineering decisions.
Remember back in the day it being called “Hadpoop”
Highly recommend xsv by BurntSushi [a CSV parser/wrangler written in Rust].
It's retired in favor of qsv and xan: https://github.com/BurntSushi/xsv
And now you can do this with polars in parallel on all your cores and the GPU, using almost the same syntax as in pyspark.
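A minimal polars sketch of that (file and column names invented; in recent releases, collect(engine="gpu") can offload to the GPU if the GPU backend is installed):

    import polars as pl

    # Lazily scan the CSV and aggregate; polars spreads the work
    # across all local cores on its own.
    out = (
        pl.scan_csv("games.csv")
          .group_by("result")
          .agg(pl.len().alias("games"))
          .sort("games", descending=True)
          .collect()
    )
    print(out)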
Earlier in 2010 - http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...
mawk is fast