Video including captions is available here: https://www.almtoolbox.com/blog/recordings-from-gitlab-commit-conference-brooklyn-2019

Thank you for being here. I'm going to talk today about the changes in the process that we established at GitLab on our way to continuous delivery. This talk is titled this way only because it kind of fits within the track; I prefer the name "Kubernetes: The Prequel".

Let me introduce myself first. I'm the Engineering Manager for the delivery team. I've been with GitLab since September 2012, so that's seven years. I got hired as a back-end engineer, and during my tenure at GitLab I changed positions multiple times: I was responsible for the Omnibus package, installation methods and so on, and then recently I moved into the position of managing the delivery team, whose sole responsibility was figuring out how to migrate GitLab.com and all of our release management processes to continuous delivery.

To give you a bit of an idea of what we're going to be talking about today: I'm going to give you a history overview of how release management evolved at GitLab, how the whole team that I'm leading right now and the whole process got created, how we started changing things while we were doing them just to make it more interesting, and finally the results of all of that. If there is anything I want you to get out of this talk, it's that there is no shame in trying things out, seeing how you can change your processes around and how you can leverage your legacy tools, even though everyone screams "this is the best thing, you should be doing this, you should be doing that".

To give you a bit of an idea of how we ended up in the place we started from: from 2011 onwards GitLab has had a monthly release on every 22nd of the month, and we haven't missed a beat in all these years. From 2013, when we formed a company and had more than one engineer, we had a rotating release manager role. This was an engineer who was responsible that month for writing a blog post, tagging a version, pushing it out to the public, even tweeting. In the first couple of years it was only three of us, because we were at five people at that point, and that meant that a lot of the actions we did were manual, mostly because we didn't have the tools. I would just log into a machine, build a package manually, upload it manually, copy the SHA, put it on the blog post and release it. That was the release process.

As we started getting more and more back-end engineers hired, they started getting into that role as well. So what happens usually when you put an engineer on a problem? They find a way to get out of it. What they did was automate some of these tasks, but that still meant that a lot of the tasks we had were "follow this documentation, execute this script, do this, do that", all of it manual, because the release manager role was actually a rotation, so all of the knowledge that gets built up during one month just gets wiped clean for the next one. The idea was also to make sure that we improve things by having a fresh pair of eyes looking at the process.

Now, we actually had one near miss with our release: we almost missed the 22nd deadline back in December 2017. We deployed one day before the release, which was unheard of. I think that was when the alarms started going off that we lose a lot of knowledge by rotating constantly, and, because I'm a loudmouth, I got tasked with investigating what happened and how we can improve this. As I was collecting data it became apparent that we actually needed to spend some effort on forming a team that would lead this change, making sure that we don't have semi-automated things anymore but automated ones, or we were never going to get better at this.
So in July 2018 I got two engineers, and at some point someone at the company said: hey, it's great that you're doing all this automation, and releases kind of go well with Kubernetes, and Kubernetes is this awesome thing, so how about you just do Kubernetes as well? Right, like it's easy. They added two more engineers and said: you're going to be the delivery team, so automate the release and migrate GitLab.com to Kubernetes. GitLab.com has millions of users and a lot of traffic, and that actually is a task on its own; it's a huge challenge. So whenever you see someone screaming "Kubernetes", I hope you're going to remember this, like this fake Dilbert comic.

I accepted the task, the team accepted the task, it was a great challenge. So we set out to see what our requirements were for everything we needed to achieve. GitLab.com is a live system: we can't have any downtime, so in everything we do we need to make sure that GitLab.com stays up. We cannot move the timelines: the 22nd remains the same, engineers need to release code because this is our lifeline, so no delays. We should migrate GitLab.com to Kubernetes in the next three to six months, changing the whole platform, so no time to do this. Great. But the one thing that actually stuck with me the most was the question: is our engineering organization ready for continuous delivery? It's great when you're using all the greatest tools, but how you use them is really, really important. This was the biggest unknown for me; with the other three items I at least knew that I couldn't change them, but could I change the last one?

So what do you actually need to do to prepare your organization for continuous delivery? First of all, your development needs to completely shift left. That means that before things get merged into your main branch, they already need to pass all the testing, verification and security checks. Are your testing systems solid? Do you have end-to-end integration tests, and what do those tests tell you? How are you using data and metrics to inform your deployment decisions? And do you have the capability to react quickly to any sort of change? Unfortunately, for all of these things in 2018 the answer was no. That was my face when I realized this was a humongous challenge, not on the technical side but on the process side.

So now that I understood all of my requirements and the challenges I was going to encounter: what was my team spending their time on? I love pie charts because they don't tell you anything, but this pie chart was created from data we gathered over the 14-day period where development kind of slows down so that we can prepare for a release. My team spent 60% of their time in that 14-day period babysitting deploys. Then 26 percent of our time was related to manual or semi-manual tasks that someone had to do: writing the blog post or helping write it, communicating the changes between people, doing various cherry-picks for P1 problems that a developer found (by the way, if you trust your developers to understand what P1 means, you are fooling yourself). We also had a manual process where release managers had to do some basic QA, which is kind of silly: the release manager goes in, clicks on a button, oh, the button works, great, that's a check done. And GitLab also had a special thing where Community Edition and Enterprise Edition were built from separate repositories, and we had to merge one into the other because Enterprise Edition was a superset of Community Edition; that took ten percent of our time.
So if you take a look at the whole thing: in those 14 days, in two weeks, my team did nothing but sit at the computer and watch, well, paint dry in this case. If we changed the 80% of whatever we were doing during this period, we would be able to make sure that we have no release delays, because we'd be freed up to make sure everything happens on time. If we do deploys quicker and smaller chunks get deployed to production, we ensure that there is no downtime, or at least we would be able to control that better. And if we free up all of that time, we would be able to actually start working on the Kubernetes migration that we set out to do. Another thing that I thought was a really great bonus: while we were doing the changes, we would be able to prepare the organization for the incoming change in process. So that is what we set out to do. If you take a look at the cycle time compression, we set out to go there, but we started going down this route instead.

One of the items we observed is that everything was tied into this simple process: how developers behave, how the product behaves. We had the 7th of the month, which was our feature freeze date. At that point we would branch off from the main line, and we would have a slower-moving branch from which we would do deploys and prepare the release. This reinforced a really great behavior where developers would kind of pile up around that 7th, because "I have time, the 7th is in seven days", and then on the 6th at midnight they would panic-merge things, because they know that if they miss this deadline they have to wait for the next month, but if they get in under it they have a good two weeks to fix any problems that happen.

Now, we are creatures of habit, right? So I thought: what if we (and I didn't think of this, by the way, a lot of companies do this) speed this up and do the same thing, but just more frequently? Right, like, if it hurts, do it more often. So this is the same system, but instead of doing one branch we create three: every week we create a new one. Developers get this similar system of "all right, I have some time to fix things, but I don't have much time, so I'm going to think twice: do I want to spend time panicking and fixing things quickly, or am I going to make sure that things are actually operational before I merge?" It also gave us a bit more time to make sure that, for whatever we deploy this week, we can be certain that by the end of the week the only new thing that is going to be bringing problems is whatever was created with the new branch that is going in.

We also had great help, and that is that we got to use the tool that we built. I mentioned GitLab.com: GitLab.com is one of the biggest instances of GitLab in the world, but we use GitLab to build GitLab, and then we use another GitLab to deploy GitLab. One other advantage we had was access to all of the developers who were working with us, because if we don't get something that we need, they won't be able to deploy their thing. So there was quite a lot of excitement when we came to them and asked: hey, can we improve this feature, how can we get this done better?

So some of the release tool tasks that you saw, the 26% of the time, we automated just by taking them into GitLab CI, triggering things through the API and using scheduled pipelines. If we need to create a branch, it's set in a scheduled pipeline that triggers every Sunday evening. Any P1 item that comes in, we automatically cherry-pick into the branch that is currently active. And we create various issues to track progress through the environments and the QA tasks that need to be executed.
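To make that a bit more concrete, here is a minimal, hypothetical sketch of what such a scheduled branch-cutting job could look like in GitLab CI. This is not our actual release-tools configuration; the job name, the branch naming scheme and the RELEASE_BOT_TOKEN variable are assumptions for illustration:

```yaml
# .gitlab-ci.yml (sketch only). The job runs exclusively from a pipeline
# schedule, e.g. one configured to fire every Sunday evening.
stages:
  - release

cut-auto-deploy-branch:
  stage: release
  image: curlimages/curl:latest
  only:
    - schedules
  script:
    # Create a new branch from master via the GitLab API.
    # RELEASE_BOT_TOKEN and the branch naming scheme are assumptions for this sketch.
    - >
      curl --fail --request POST
      --header "PRIVATE-TOKEN: ${RELEASE_BOT_TOKEN}"
      --data "branch=auto-deploy-$(date +%Y-%m-%d)&ref=master"
      "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/repository/branches"
```

The same pattern, a scheduled or API-triggered CI job wrapping an API call, can cover tasks like cherry-picking and issue creation too, since the GitLab API exposes both.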
And as I said, GitLab.com gets deployed from GitLab, so we had to mirror some projects between instances. I think another thing worth mentioning here is that GitLab CI was the actual glue that made all of this pull in the one direction we wanted. And finally, GitLab ChatOps feels like the underappreciated hero here: a lot of the release tasks got automated just because we got very easy access to everything we had to do by using it through Slack. For example, we connect GitLab ChatOps with Slack; everything is there, it's very convenient, you don't have to change your context.

To explain a tiny bit more how this ended up looking: the happy developer, as you can see on this side, goes through the whole process, right? Review, making sure their pipelines pass, doing some verification through the review apps that we connected to a Kubernetes cluster, and when they're absolutely sure they want to merge this thing, they merge it. Usually that means it's out of their hands; that's the magnifying glass thing. And the thing that you see scrolling here is our production pipeline. What happened was we realized that all of the items we had to do were related to moving the semi-automated tasks into CI. A developer's machine is a machine and a CI machine is a machine, so why not just have it there? It automatically logs things, and the release manager doesn't have to make sure they're looking at the screen for six hours while the whole deploy is going.

Now, one challenge we encountered here was that we had already outgrown the tool we had at that point. So instead of continuing to use it, we decided: what are the top two things we need to do to make sure that we can deploy safely? First, get the package in; second, make sure that it's deployed in a certain order. All right, well, that's easy. And we rewrote the tool we had; rather, we didn't rewrite it, we just wrote a new tool using Ansible, and we placed our CI runner on a bastion host that had access to the infrastructure. That was one of the bigger battles, so to speak, because we had to get sign-off from security to put what is basically a remote code execution machine in our system. We did get to do that, mostly because we got a lot of insight into how we could actually make it happen. And one of the great things that happened was that we now got to connect through all of our environments and do sequential checks as well (sketched below). So what happens is: when developers merge whatever they did, we automatically create a new package, that package gets picked up by our system and deployed on staging, and we got to put automated QA in CI as well. If the automated QA passes, it progresses to the rest of the environments. That meant the 60% of the time that we used to spend is now out there: it just happens, we don't have to do anything with it.

So, the finish line: when we enabled the system, in the same 14-day period we freed up 82% of our time. The CE to EE merge also got automated, and the 0.3% you see there is that sometimes the pipeline fails and we need to check it to see why it failed; that's it. The release tasks remained relatively high; still, the biggest chunk of this 17% is the security releases we need to do.
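And to make the staging-to-production flow I described above a bit more concrete, here is a rough, hypothetical sketch of how a deploy pipeline running on that bastion-hosted runner could be shaped. This is not our actual deployer configuration: the job names, the Ansible playbook and inventory paths, the QA script and the environment names are all assumptions for illustration.

```yaml
# Sketch of a deploy pipeline that picks up a freshly built package
# and walks it through the environments in order, gated by automated QA.
stages:
  - staging
  - qa
  - production

deploy-staging:
  stage: staging
  tags:
    - bastion               # runs on the CI runner placed on the bastion host
  script:
    # Hypothetical playbook and inventory names.
    - ansible-playbook -i inventories/staging deploy.yml -e "package_version=${PACKAGE_VERSION}"
  environment:
    name: staging

automated-qa:
  stage: qa
  script:
    # Placeholder for the automated QA suite run against staging;
    # the pipeline stops here if it fails, so nothing broken moves on.
    - ./run-qa-suite.sh https://staging.example.com

deploy-production:
  stage: production
  tags:
    - bastion
  script:
    - ansible-playbook -i inventories/production deploy.yml -e "package_version=${PACKAGE_VERSION}"
  environment:
    name: production
```

Because each stage only runs if the previous one succeeds, a failed QA job keeps the package out of production without anyone babysitting the pipeline.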
A security release requires a lot of coordination: there are a lot of stakeholders (security teams, development teams, marketing) and a lot of backporting to prior releases, so that remains a big chunk of our time.

In May 2019 we still had the old system and we had around seven deploys that month, which was standard for GitLab.com. In June we cautiously enabled things and we went to 12, I think. And then, and I'm super proud about this, in August this year we had 35 deploys on GitLab.com. That means more than one deploy a day. And did I mention that none of this is using Kubernetes? All of this is using our old legacy system. But what happened with this is that we bought ourselves time, so my team has time to actually work on the migration.

One of the biggest changes, though, was in the habits of the engineering organization. People are thinking twice when they click that merge button; they make sure the integration tests are done before they click it. We have all of the developers on call, because once we enabled the system the old habits just got exposed and we had to call a lot of people to help us out: why is our performance going down, why are we seeing an uptick in errors? Within two months of us enabling the system, the whole engineering organization, or rather the whole development organization, went on call. Suffice to say, I'm not really super popular there anymore. But where I am popular, or rather where my team is popular, everyone is grateful and everyone is excited that when an issue is found, they can fix it really quickly, deploy it really quickly, and within a couple of hours of finding a problem we can push it out to production, which is a huge, huge change.

Why is this "Kubernetes: The Prequel"? Because with us freeing up all of that time, my team migrated one of the services that we had to Kubernetes. So if you go to GitLab.com right now and try to do a docker pull or docker push, that is being served from a Kubernetes cluster. We are successfully using our deploy boards, we are successfully using our monitoring, and we are successfully using our web terminals, because obviously we had time to play a bit more than we usually would.

That's it from my side. I wanted to leave you with a bunch of links: if you're interested in how all of this developed, you can check out the design docs that we wrote, you can check out where that pie chart you saw originally came from, the time-for-release reports, and finally you can follow along with our progress on our Kubernetes migration and the kinds of challenges we ran into when we started it. Thank you. Questions? We've got a few minutes for questions, so I'll bring you the mic if you want to ask anything.

You're talking about how, sort of, CI/CD enables cloud native, right, and you're talking a lot about having to put the entire organization on call and that kind of thing. How do you get buy-in from executives to make those kinds of changes?

So, one thing that I think executives love is dollar numbers. When they see how much time gets spent on busy work, something that does not contribute to the actual goal of the organization, and if you transfer that into a dollar amount and show how you can change that dollar amount to something way less, everyone starts listening. Maybe they don't understand what we actually did here, maybe they do; it totally does not matter. What matters to them is that a developer can fix a problem within a couple of hours instead of two weeks, three weeks, a month.
And then, sort of to follow up on that question: how do you implement CI/CD without having the engineering team be on call all the time, or is CI/CD just a sinister ploy to extract more productivity out of the engineers?

So, I think the on-call was not made to make developers less productive. The on-call was introduced to create some sympathy with the people who are actually managing the infrastructure and are on the forefront, and the idea is not to have developers on call forever; the idea is to teach them how to get themselves out of the on-call rotation: to think about what kinds of problems at scale they need to think about and how to resolve them properly without, yeah, just merging randomly. What I think is happening already, within the two-month period that we've had developers on call, is that there are changes in habit and they're starting to ask the right questions: how can I get access to the production database to understand the scale of the problem that I'm trying to resolve? That is a great question to ask, because now you can provide them the data, they can inform their decision, or rather they can understand how to fix a problem at that scale, and they're not going to get paged, their colleague is not going to get paged, and the organization is getting better with it. And I really do believe that with time we are going to remove the need for developers being on call: even if they are on call, they're not going to get paged. I think that's a great success story.

At this point it sounds like you're still releasing once a month? Yes, correct. Are there any plans to eventually increase the number of times you release?

So, there is a difference between what we do on GitLab.com and what we do for the self-managed release. We already hear from, well, all of you that having a release once a month is great, but your organizations can't update that quickly, so we don't necessarily want to change that cadence. But we still use the same tools to release to self-managed as well. So what is actually happening right now is that we are getting a tried and tested product earlier, and once we actually say "this is the 11.3 release", what the customer is going to get is something that has already run. I wouldn't say bug-free, because that's not possible, but definitely with not as many bugs as you usually have in these types of processes. So the system enabled us to ship faster to GitLab.com with more confidence, and that actually allows us to make sure that we are not going to have more releases for our self-managed customers but actually fewer, because we don't have to create as many patch releases, and we get to focus on doing the new features that we want to show.