Practicing Continuous Deployment by David Cramer of DISQUS¶
Presenters: David Cramer from DISQUS
PyCon 2012 presentation page: https://us.pycon.org/2012/schedule/presentation/12/
Slides:
Video: http://www.youtube.com/watch?v=QGfxLXoMpPk
Video running time: 41:20
What does “ready” mean?¶
- Reviewed by peers
- Passes automated tests – continuous integration is essential
- Some level of QA
┌────────────────┐     ┌────────────────┐
|     Review     |  ←  |     Commit     |
└────────────────┘     └────────────────┘
        ↓                      ↑
┌────────────────┐     ┌────────────────┐
|  Integration   |  →  |  Failed build  |
└────────────────┘     └────────────────┘
        ↓
┌────────────────┐     ┌────────────────┐
|     Deploy     |  →  |   Reporting    |
└────────────────┘     └────────────────┘
        ↓
┌────────────────┐
|    Rollback    |
└────────────────┘
Continuous deployment does not necessarily mean that you deploy all the time or every 5 minutes.
You can deploy as often as you want. The important thing is that you can deploy whenever you want.
The good and the bad¶
The good¶
- Develop features incrementally
- Release frequently
- Smaller doses of QA
The bad¶
- Culture shock
- Stability depends on test coverage
- Initial time investment
Keep development simple¶
- Automate testing of complicated processes and architecture
- Simple can be better than complete
  - Especially for local development
- python setup.py {develop,test}
- Puppet, Chef, Buildout, Fabric, etc.
Automated testing is a requirement. Continuous integration is the basis for all of this.
David feels that packaging your app is essential, because you need things to be repeatable.
Bootstrapping local¶
- Simplify local setup
  - git clone dcramer@disqus:disqus.git
  - make
  - python manage.py runserver
- Need to test dependencies?
  - VirtualBox + vagrant up
Progressive rollout¶
We actively use early versions of features before public release.
At DISQUS, we serve about 12,000 to 15,000 requests/second, and we peak much higher than that.
It’s important that a feature doesn’t take the site down. We want to slowly release features.
(9:32) Feature flippers or switches
Deploy features to portions of a user base at a time to ensure smooth, measurable releases
DISQUS uses a library called Gargoyle for this. It is currently very Django-specific, but they are trying to generalize it to be Django-agnostic and maybe even language-agnostic.
Example:
- Only enable this new feature for internal users.
- OK, now turn it on for 1% of our base.
- Keep bumping up until we know it’s scalable.
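The staged rollout above (internal users first, then 1%, then more) can be sketched without any framework. This is a minimal illustration, not Gargoyle's implementation: the function names, the `internal_users` parameter, and the hashing scheme are all assumptions made for the example. Hashing the feature name together with the user ID gives each user a stable bucket, so raising the percentage only ever adds users.

```python
import hashlib


def bucket_for(user_id, feature):
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hypothetical helper: hashing feature + user keeps the bucket stable
    across requests and independent between features.
    """
    digest = hashlib.sha1(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id, feature, percent, internal_users=frozenset()):
    """Return True if the feature should be on for this user.

    Internal users always see the feature; everyone else sees it only
    if their bucket falls below the current rollout percentage.
    """
    if user_id in internal_users:
        return True
    return bucket_for(user_id, feature) < percent
```

Because a user's bucket never changes, anyone who saw the feature at 1% still sees it at 10%, which keeps the experience consistent while the percentage is bumped up.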
Iterate quickly by hiding features¶
Early adopters are free QA
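from gargoyle import gargoyle

def my_view(request):
    # Serve the new code path only when the 'awesome' switch is active
    # for this request; everyone else keeps the old behavior.
    if gargoyle.is_active('awesome', request):
        return 'new happy version :D'
    else:
        return 'old sad version :('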
New users can check a box to volunteer to test bleeding edge features.
Review all the commits¶
Phabricator - a code review tool open-sourced by Facebook.
(12:40) A code review is done through a commit, which is friendly for developers. You don’t have to use the Web UI.
arc diff runs a set of lints and runs your tests for you.
They’ve released a plugin for nose called quickunit.
Integration == Jenkins¶
Integration requirements¶
- Developers must know when they’ve broken something
  - IRC, Email, IM
- Support proper reporting
  - XUnit, Pylint, Coverage.py
- Painless setup
  - apt-get install jenkins
It’s important for developers to know right away when stuff is broken so they can ideally fix it before they’ve context switched to something else.
Integration issues¶
False positives
- Reporting isn’t accurate
- Services fail (even a third party service)
- Bad tests
Test coverage
- Regressions on untested code
Feedback delay
- Integration tests vs. unit tests
Fixing false positives¶
- Rerun tests several times on failure
- Report continually failing tests
- Replace external service tests with a functional test suite
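The first two bullets above (rerun on failure, report tests that keep failing) could look roughly like the following. This is a hedged sketch, not the tooling DISQUS uses: the `retry_on_failure` decorator and the `failed_attempts` counter are hypothetical names introduced for illustration.

```python
import functools
from collections import Counter

# Failed attempts per test name, so continually failing tests can be reported.
failed_attempts = Counter()


def retry_on_failure(attempts=3):
    """Rerun a test up to `attempts` times before letting the failure through."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return test_func(*args, **kwargs)
                except AssertionError:
                    # Record every failed attempt, even if a rerun passes.
                    failed_attempts[test_func.__name__] += 1
                    if attempt == attempts:
                        raise
        return wrapper

    return decorator
```

A test that passes on a rerun no longer breaks the build, but it still shows up in `failed_attempts`, so flaky tests stay visible instead of being silently masked.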
Maintaining coverage¶
- Raise awareness with reporting
- Fail/alert when coverage drops on a build
- Commit tests with code
- Coverage against commit diff for untested regressions
- Utilize code review
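Two of the ideas above, failing a build when coverage drops and checking coverage against the lines a commit touched, reduce to small comparisons once the numbers are available. The sketch below assumes the coverage percentages and line sets have already been extracted from a coverage report and a diff; the function names and data shapes are hypothetical.

```python
def coverage_gate(previous_percent, current_percent, tolerance=0.0):
    """Return (ok, message); ok is False when coverage dropped by more than `tolerance` points."""
    drop = previous_percent - current_percent
    if drop > tolerance:
        return False, (
            f"coverage dropped {drop:.1f} points "
            f"({previous_percent:.1f}% -> {current_percent:.1f}%)"
        )
    return True, f"coverage OK at {current_percent:.1f}%"


def untested_changed_lines(changed, covered):
    """Return changed lines that no test executed.

    Both arguments map a file path to a set of line numbers: `changed`
    from the commit diff, `covered` from the coverage report.
    """
    untested = {}
    for path, lines in changed.items():
        missing = sorted(lines - covered.get(path, set()))
        if missing:
            untested[path] = missing
    return untested
```

A CI job could fail the build when `coverage_gate` returns `False`, and attach the output of `untested_changed_lines` to the code review.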
Reporting¶
<You> Why is mongodb-1 down?
<Ops> It’s down? Must have crashed again.
Meaningful metrics¶
- Rate of traffic (not just hits!)
- Business vs. system
- Response time (database, web)
- Exceptions
- Social media
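statsd, one of the tools named in the Q&A at the end of the talk, accepts metrics as short text lines over UDP, which makes counters and timings like the ones above cheap to emit. The sketch below formats such lines by hand; the metric names are made up for the example, and the host and port are assumptions (8125 is the conventional statsd port).

```python
import socket


def format_counter(name, value=1):
    """Build a statsd counter line, e.g. for counting exceptions."""
    return f"{name}:{value}|c"


def format_timer(name, millis):
    """Build a statsd timing line, e.g. for database or web response time."""
    return f"{name}:{millis}|ms"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; the sender is not told whether anything received it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send_metric(format_timer("web.response_time", 42))` would record a 42 ms response, assuming a statsd daemon is listening on that address.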
Sentry¶
You can now use it even if you’re not using Django.
It’s designed to receive exceptions and track them.
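To illustrate what "receive exceptions and track them" means in practice, here is a toy in-memory tracker that groups exceptions by type and by the function that raised them. It is only a sketch of the idea and does not use or resemble Sentry's actual API; the `ExceptionTracker` class is hypothetical.

```python
import traceback
from collections import Counter


class ExceptionTracker:
    """Count exceptions grouped by (exception type, raising location)."""

    def __init__(self):
        self.counts = Counter()

    def capture(self, exc):
        """Record one exception and return the key it was grouped under."""
        frames = traceback.extract_tb(exc.__traceback__)
        if frames:
            # The last frame is where the exception was actually raised.
            location = f"{frames[-1].filename}:{frames[-1].name}"
        else:
            location = "<unknown>"
        key = (type(exc).__name__, location)
        self.counts[key] += 1
        return key
```

Grouping repeated errors under one key is what turns a flood of identical tracebacks into a single entry with a count, which is far easier to triage.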
Wrap up¶
Deployment is the least important part of continuous deployment. Everyone solves it differently.
What DISQUS does: ship a relocatable virtualenv as a tarball.
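Shipping each release as a self-contained directory pairs naturally with the rollback approach mentioned in the Q&A, where at one point a rollback was just a symlink swap plus a server restart. The sketch below shows the swap on a POSIX system; the function names and directory layout are assumptions, and restarting the servers is left out.

```python
import os


def activate_release(release_dir, current_link):
    """Point `current_link` at `release_dir` by renaming a temporary symlink over it."""
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clean up a leftover from an interrupted deploy
    os.symlink(release_dir, tmp_link)
    # Renaming over the existing link is atomic on POSIX, so readers
    # always see either the old release or the new one.
    os.replace(tmp_link, current_link)


def rollback(previous_release_dir, current_link):
    """Roll back by re-activating a release that is still on disk."""
    activate_release(previous_release_dir, current_link)
```

Keeping the previous release unpacked on disk is what makes the rollback fast: nothing has to be rebuilt or re-downloaded.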
Getting Started¶
- Package your app
- Value code review
- Ease deployment, fast rollbacks
- Set up automated tests
- Gather some easy metrics
Questions?¶
Code reviews: Before Phabricator, DISQUS used GitHub pull requests, but they found that this did not scale.
(31:49) Selenium tests – we deleted all our Selenium tests. We’re reimplementing some of them, but simpler.
(32:50) How many times a day do you deploy? At minimum, once a day. Lately, it’s been no more than half a dozen times per day.
(33:55) Why do you roll back? Why not fix it and move forward? Sometimes it might take a while to fix it.
(34:30) What do you do about database changes, especially for rollbacks? Google “DISQUS schema changes” or “David Cramer schema changes”.
(35:52) Any code review policies, such as a maximum number of lines or a maximum amount of time until review? The current standard is that at the start of the day and at the end of the day, you must clean your slate. Even this kind of sucks, because you may have to wait a day to get your change reviewed. What we really want is a maximum of 20 minutes, after which an unreviewed change is automatically assigned to someone else.
(37:23) Number of production servers? 200ish. 4 billion pageviews.
(38:00) How long does it take to deploy? Ashamed to admit it. All of our servers are in one location, although we push a lot of stuff to Akamai.
(39:00) One monolithic deploy or many? We’re moving towards SOA and away from monolithic.
(40:15) Can you tell us about your rollback process? At one point, it was just swap the symlink and restart the servers.
(40:33) Business metrics measurements - what tools? Graphite, statsd, porkchop
(41:10) Done.