The 100 cr salary of @zerodhaonline's founders isn't what's eye-opening to me; the tech, and how they operate at such a large scale, is. How do they optimize a database with close to 20TB of data? A 🧵
@zerodhaonline stores financial and transactional data on their PostgreSQL instances: billions of rows that power their Console and its large-scale number crunching and reporting.
At peak traffic, despite a heavy “hot” cache layer, the warehouse DB serves up to 40,000-50,000 queries per second. Instead of normalizing their data, they denormalized it, because JOINs across tables holding billions of rows are difficult to scale.
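Roughly, that means copying the columns a report needs into one wide table instead of joining at read time. A minimal sketch with made-up table and column names (not their actual schema):

```sql
-- Hypothetical, simplified schema for illustration only.
-- Normalized: a report would JOIN trades(trade_id, instrument_id, qty, price, traded_at)
-- with instruments(instrument_id, symbol, exchange) -- painful at billions of rows.
--
-- Denormalized: the instrument attributes a report reads are copied
-- into the trade row itself, so reports scan a single table.
CREATE TABLE trades_denorm (
    trade_id   bigint,
    symbol     text,            -- copied from the instruments table
    exchange   text,            -- copied from the instruments table
    qty        integer,
    price      numeric(12,2),
    traded_at  timestamptz
);
```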
After a lot of experimentation, they partitioned their data by month, grouping records on their timestamps. They even tried per-day partitioning, which significantly boosted bulk-insert speed but slowed queries down.
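For a rough idea, here's what per-month range partitioning looks like in PostgreSQL, again with hypothetical names (the same made-up table as above, now declared as partitioned):

```sql
-- Parent table partitioned by the timestamp column.
CREATE TABLE trades_denorm (
    trade_id   bigint,
    symbol     text,
    qty        integer,
    price      numeric(12,2),
    traded_at  timestamptz NOT NULL
) PARTITION BY RANGE (traded_at);

-- One child partition per month; queries that filter on traded_at
-- only touch the relevant partitions (partition pruning).
CREATE TABLE trades_2023_01 PARTITION OF trades_denorm
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
CREATE TABLE trades_2023_02 PARTITION OF trades_denorm
    FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
```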
Per-month partitioning turned out to be a far better trade-off. They also dropped auto-increment IDs and tuned PostgreSQL parameters that I won't go into in this post; the point is that they turned those knobs to gain performance only after experimenting.
Because @zerodhaonline bulk-inserts and updates billions of rows in batches after market hours, they turned auto-vacuum off. If auto-vacuum kicks in in the middle of a bulk operation, it will lock everything down.
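PostgreSQL lets you switch auto-vacuum off per table (or globally in postgresql.conf). This is the general knob, shown on a hypothetical partition, not necessarily exactly how they did it:

```sql
-- Per-table storage parameter: keep autovacuum from kicking in
-- on this partition mid-bulk-load.
ALTER TABLE trades_2023_01 SET (autovacuum_enabled = false);

-- Or, cluster-wide, in postgresql.conf:
--   autovacuum = off
```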
Instead, they ran a cron job that does a “vacuum analyze” on the partitions. Careful not to over-index their DB, they also used materialized views so that frequent queries run faster and the same computation doesn't happen repeatedly.
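A minimal sketch of both ideas, with hypothetical partition and view names:

```sql
-- Run from a cron job after the nightly bulk load, instead of autovacuum:
VACUUM ANALYZE trades_2023_01;

-- Precompute an expensive aggregate once instead of re-running it
-- for every report query.
CREATE MATERIALIZED VIEW daily_turnover AS
SELECT symbol,
       date_trunc('day', traded_at) AS day,
       sum(qty * price)             AS turnover
FROM trades_denorm
GROUP BY symbol, date_trunc('day', traded_at);

-- Refresh as part of the same post-load job.
REFRESH MATERIALIZED VIEW daily_turnover;
```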
If you found this thread insightful, please RT and like it. It motivates me to bring high-quality stuff and not post random things. Consider following me for more insightful threads! Also, tag @Nithin0dha and @nikhilkamathcio