The Stack That Helped Medium Drive 2.6 Millennia of Reading Time

Background

Medium is a network. It’s a place to share stories and ideas that matter, and it’s where you move thinking forward. People have spent 1.4 billion minutes (about 2.6 millennia) reading on Medium.


We get over 25 million unique readers every month and tens of thousands of posts published each week. But we want Medium to be a place where the measure of success isn’t views, but viewpoints. Where the quality of the idea matters, not the author’s qualifications. A place where conversation pushes ideas forward and words still matter.


I lead the engineering team. I was previously a Staff Software Engineer at Google, where I worked on Google+ and Gmail, and co-founded the Closure project. In past lives I’ve raced snowboards, jumped out of planes, and lived in the jungle.


The Team

I couldn’t be prouder of this team. It’s an awesome bunch of talented, curious, mindful individuals who come together to do great work.


We operate in cross-functional, mission-driven teams, so while some people specialize, everyone should feel able to touch any part of the stack. We believe that exposure to different disciplines makes you a stronger engineer. I’ve written separately about our other values.


The teams have a lot of flexibility in how they organize around their work, but as a company we set quarterly goals and encourage iterative sprints. We use GitHub for code reviews and bug tracking and Google Apps for email, docs, and spreadsheets. We’re heavy users of Slack (and Slack bots), and many teams use Trello.


Initial Stack

We deployed to EC2 from the start. The main app servers were written in Node.js, and we migrated to DynamoDB for the public launch.


There was a Node server that we used for image processing, which delegated the actual hard work to GraphicsMagick, and another server acted as an SQS queue processor for background tasks.
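A background worker like this usually boils down to a dispatch loop: pull a task, hand it to the handler registered for its type, and retry or set it aside on failure. Below is a minimal Python sketch under stated assumptions: an in-memory `queue.Queue` stands in for SQS, and the `TaskProcessor` class, the task type names, and the retry limit are hypothetical. Parking a task after repeated failures loosely mirrors what an SQS redrive policy with a dead-letter queue does.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    kind: str        # hypothetical task type, e.g. "send_email"
    payload: dict
    attempts: int = 0


class TaskProcessor:
    """Pulls tasks from a queue and dispatches each to a registered handler.

    queue.Queue is a stand-in for SQS; a real worker would long-poll SQS and
    delete a message only after its handler succeeds.
    """

    def __init__(self, max_attempts: int = 3) -> None:
        self.queue: "queue.Queue[Task]" = queue.Queue()
        self.handlers: Dict[str, Callable[[dict], None]] = {}
        self.max_attempts = max_attempts
        self.dead_letters: List[Task] = []

    def register(self, kind: str, handler: Callable[[dict], None]) -> None:
        self.handlers[kind] = handler

    def enqueue(self, kind: str, payload: dict) -> None:
        self.queue.put(Task(kind, payload))

    def drain(self) -> int:
        """Process until the queue is empty; return the number of successes."""
        done = 0
        while True:
            try:
                task = self.queue.get_nowait()
            except queue.Empty:
                return done
            task.attempts += 1
            try:
                # An unregistered kind raises KeyError and is treated as a failure.
                self.handlers[task.kind](task.payload)
                done += 1
            except Exception:
                # Re-queue failed tasks until max_attempts, then park them.
                if task.attempts < self.max_attempts:
                    self.queue.put(task)
                else:
                    self.dead_letters.append(task)


if __name__ == "__main__":
    sent: List[str] = []
    proc = TaskProcessor()
    proc.register("send_email", lambda p: sent.append(p["to"]))
    proc.enqueue("send_email", {"to": "reader@example.com"})
    proc.enqueue("unknown_task", {})
    print(proc.drain())  # 1
```

Keeping handlers in a registry means new kinds of background work can be added without touching the loop itself.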


We used SES for email, S3 for static assets, CloudFront as our CDN, and nginx as a reverse proxy. We used Datadog for monitoring and PagerDuty for alerting.


The site used TinyMCE as a foundation for the editor. Before launch we were already using the Closure Compiler and parts of the Closure Library, although our templates were written in Handlebars.


Current Stack

For a site as seemingly simple as Medium, it may be surprising how much complexity is behind the scenes. It’s just a blog, right? You could probably knock something out using Rails in a couple of days. :)


Anyway, enough snark. Let’s start at the bottom.


Production Environment


We are on Amazon’s Virtual Private Cloud. We use Ansible for system management, which allows us to keep our configuration under source control and easily roll out changes in a controlled way.


We have a service-oriented architecture running about a dozen production services (depending on how you count them, and some are more micro than others). Whether we deploy a separate service depends mainly on how specific the work it performs is, how likely dependent changes are to cross service boundaries, and its resource utilization characteristics.


Our main app servers are still written in Node, which allows us to share code between server and client, something we use quite heavily with the editor and post transformations. Node has worked pretty well for us, but we’ve run into performance problems whenever we block the event loop. To alleviate this, we run multiple instances per machine and route expensive endpoints to specific instances, isolating them from the rest. We’ve also hooked into the V8 runtime to get insight into which ticks are taking a long time; generally the culprit is object reification during JSON deserialization.
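Isolating expensive endpoints comes down to a routing decision: requests for a known-slow path go to a dedicated pool of instances, and everything else round-robins across the default pool. The Python sketch below models only that decision. The `InstanceRouter` class, the path prefixes, and the port numbers are hypothetical, and in a real deployment this rule would more likely live in the reverse proxy in front of the Node processes.

```python
import itertools
from typing import Dict, List


class InstanceRouter:
    """Routes requests to pools of app instances running on one machine.

    Expensive endpoints get a dedicated pool, so a request that blocks its
    instance's event loop only delays other expensive requests, not the
    cheap ones served by the default pool.
    """

    def __init__(self, default_pool: List[int], isolated: Dict[str, List[int]]) -> None:
        self._default = itertools.cycle(default_pool)
        # Longest prefix first, so the most specific rule wins.
        self._isolated = sorted(
            ((prefix, itertools.cycle(ports)) for prefix, ports in isolated.items()),
            key=lambda item: len(item[0]),
            reverse=True,
        )

    def pick(self, path: str) -> int:
        """Return the port of the instance that should serve this path."""
        for prefix, pool in self._isolated:
            if path.startswith(prefix):
                return next(pool)
        return next(self._default)


if __name__ == "__main__":
    # Hypothetical layout: three general instances, one reserved for exports.
    router = InstanceRouter(default_pool=[3000, 3001, 3002], isolated={"/export": [3100]})
    for path in ("/p/abc", "/export/123", "/p/def"):
        print(path, "->", router.pick(path))
```

With `/export` mapped to its own instance, a slow export can stall only that instance’s event loop, while ordinary page requests keep rotating through the default pool.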


We have several auxiliary services written in Go. We’ve found Go very easy to build, package, and deploy. We like its type safety without the verbosity and JVM tuning that Java requires. Personally, I’m a fan of opinionated languages in a team environment: they improve consistency, reduce ambiguity, and ultimately give you less rope to hang yourself with.


We now serve static assets using CloudFlare, though we send 5% of traffic to Fastly and 5% to CloudFront to keep their caches warm in case we need to cut over in an emergency. Recently we turned up CloudFlare for application traffic as well, primarily for DDoS protection, but we’ve also been happy with the performance gains.
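One way to implement a 90/5/5 split is to hash a per-request key into buckets and carve the buckets up by weight. This is a hedged sketch, not a description of how Medium actually divides its traffic (splits like this are also commonly done with weighted DNS); `pick_cdn`, `CDN_WEIGHTS`, and the use of a stable client identifier as the key are assumptions.

```python
import hashlib
from typing import List, Tuple

# Weights taken from the text: 90% CloudFlare, 5% Fastly, 5% CloudFront.
CDN_WEIGHTS: List[Tuple[str, int]] = [
    ("cloudflare", 90),
    ("fastly", 5),
    ("cloudfront", 5),
]


def pick_cdn(client_key: str, weights: List[Tuple[str, int]] = CDN_WEIGHTS) -> str:
    """Deterministically assign a client to a CDN in proportion to its weight.

    client_key is a hypothetical stable client identifier; hashing it means
    the same client always lands on the same CDN.
    """
    total = sum(weight for _, weight in weights)
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total  # 0 .. total - 1
    # Walk the weights, subtracting each share until the bucket falls inside one.
    for name, weight in weights:
        if bucket < weight:
            return name
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below total")


if __name__ == "__main__":
    for client in ("client-1", "client-2", "client-3"):
        print(client, "->", pick_cdn(client))
```

Keying on the client rather than on the asset path means each secondary CDN serves the full range of popular assets to its slice of clients, which is what keeps its cache broadly warm for a cutover.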


We use a combination of nginx and HAProxy as reverse proxies and load balancers to cover the Venn diagram of features we need.


We still use Datadog for monitoring and PagerDuty for alerts, but we now heavily use ELK (Elasticsearch, Logstash, Kibana) for debugging production issues.
