Video including captions is available here: https://www.almtoolbox.com/blog/recordings-from-gitlab-commit-conference-brooklyn-2019

Thank you for being here. I'm going to talk today about the changes in the process that we established at GitLab on our way to continuous delivery. This talk is titled this way only because it kind of fits within the track; I prefer the name "Kubernetes: The Prequel".

Let me introduce myself first. I'm the Engineering Manager for the delivery team. I've been with GitLab since September 2012, so that's seven years. I got hired as a back-end engineer, and during my tenure at GitLab I changed positions multiple times: I was responsible for the Omnibus package, installation methods and so on, and then recently I moved into the position of managing the delivery team, whose sole responsibility was figuring out how to migrate GitLab.com and all of our release management processes to continuous delivery.

To give you a bit of an idea of what we're going to be talking about today: I'm going to give you a history overview of how release management evolved at GitLab, how the whole team that I'm leading right now and the whole process got created, how we started changing things while we were doing them just to make it more interesting, and finally the results of all of that. If there is anything I want you to get out of this talk, it's that there is no shame in trying things out, seeing how you can change your processes around and how you can leverage your legacy tools, even though everyone screams "this is the best thing, you should be doing this, you should be doing that".

To give you a bit of an idea of how we ended up in the place we started from: from 2011 onwards GitLab has had a monthly release on every 22nd of the month, and we haven't missed a beat in all these years. From 2013, when we formed a company and had more than one engineer, we had a rotating release manager role. This was an engineer who was responsible that month for writing a blog post, tagging a version, pushing it out to the public, even tweeting. In the first couple of years it was only three of us, because we were at five people at that point, and that meant that a lot of the actions we did were manual, mostly because we didn't have the tools. I would just log into a machine, build a package manually, upload it manually, copy the SHA, put it on the blog post and release it. That was the release process.

As we started getting more and more back-end engineers hired, they started getting into that role as well. So what happens usually when you put an engineer on a problem? They find a way to get out of it. What they did was automate some of these tasks, but that still meant that a lot of the tasks we had were "follow this documentation, execute this script, do this, do that", all of it manual, because the release manager role was actually a rotation, so all of the knowledge that gets built up during one month just gets wiped clean for the next one. The idea was also to make sure that we improve things by having a fresh pair of eyes looking at the process.

Now, we actually had one near miss with our release: we almost missed the 22nd deadline back in December 2017. We deployed one day before the release, which was unheard of. I think that was when the alarms started going off that we lose a lot of knowledge by rotating constantly, and, because I'm a loudmouth, I got tasked with investigating what happened and how we can improve this. As I was collecting data it became apparent that we actually needed to spend some effort on forming a team that would lead this change, making sure that we don't have semi-automated things anymore but automated ones, or we were never going to get better at this.
So in July 2018 I got two engineers, and at some point someone at the company said: hey, it's great that you're doing all this automation, and releases kind of go well with Kubernetes, and Kubernetes is this awesome thing, so how about you just do Kubernetes as well? Right, like it's easy. They added two more engineers and said: you're going to be the delivery team, so automate the release and migrate GitLab.com to Kubernetes. GitLab.com has millions of users and a lot of traffic, and that actually is a task on its own; it's a huge challenge. So whenever you see someone screaming "Kubernetes", I hope you're going to remember this, like this fake Dilbert comic.

I accepted the task, the team accepted the task, it was a great challenge. So we set out to see what our requirements were for everything we needed to achieve. GitLab.com is a live system: we can't have any downtime, so in everything we do we need to make sure that GitLab.com stays up. We cannot move the timelines: the 22nd remains the same, engineers need to release code because this is our lifeline, so no delays. We should migrate GitLab.com to Kubernetes in the next three to six months, changing the whole platform, so no time to do this. Great. But the one thing that actually stuck with me the most was the question: is our engineering organization ready for continuous delivery? It's great when you're using all the greatest tools, but how you use them is really, really important. This was the biggest unknown for me; with the other three items I at least knew that I couldn't change them, but could I change the last one?

So what do you actually need to do to prepare your organization for continuous delivery? First of all, your development needs to completely shift left. That means that before things get merged into your main branch, they already need to pass all the testing, verification and security checks. Are your testing systems solid? Do you have end-to-end integration tests, and what do those tests tell you? How are you using data and metrics to inform your deployment decisions? And do you have the capability to react quickly to any sort of change? Unfortunately, for all of these things in 2018 the answer was no. That was my face when I realized this was a humongous challenge, not on the technical side but on the process side.

So now that I understood all of my requirements and the challenges I was going to encounter: what was my team spending their time on? I love pie charts because they don't tell you anything, but this pie chart was created from data we gathered over the 14-day period where development kind of slows down so that we can prepare for a release. My team spent 60% of their time in that 14-day period babysitting deploys. Then 26 percent of our time was related to manual or semi-manual tasks that someone had to do: writing the blog post or helping write it, communicating the changes between people, doing various cherry-picks for P1 problems that a developer found (by the way, if you trust your developers to understand what P1 means, you are fooling yourself). We also had a manual process where release managers had to do some basic QA, which is kind of silly: the release manager goes in, clicks on a button, oh, the button works, great, that's a check done. And GitLab also had a special thing where Community Edition and Enterprise Edition were built from separate repositories, and we had to merge one into the other because Enterprise Edition was a superset of Community Edition; that took ten percent of our time.
So if you take a look at the whole thing: in those 14 days, in two weeks, my team did nothing but sit at the computer and watch, well, paint dry in this case. If we changed the 80% of whatever we were doing during this period, we would be able to make sure that we have no release delays, because we'd be freed up to make sure everything happens on time. If we do deploys quicker and smaller chunks get deployed to production, we ensure that there is no downtime, or at least we would be able to control that better. And if we free up all of that time, we would be able to actually start working on the Kubernetes migration that we set out to do. Another thing that I thought was a really great bonus: while we were doing the changes, we would be able to prepare the organization for the incoming change in process. So that is what we set out to do. If you take a look at the cycle time compression, we set out to go there, but we started going down this route instead.

One of the items we observed is that everything was tied into this simple process: how developers behave, how the product behaves. We had the 7th of the month, which was our feature freeze date. At that point we would branch off from the main line, and we would have a slower-moving branch from which we would do deploys and prepare the release. This reinforced a really great behavior where developers would kind of pile up around that 7th, because "I have time, the 7th is in seven days", and then on the 6th at midnight they would panic-merge things, because they know that if they miss this deadline they have to wait for the next month, but if they get in under it they have a good two weeks to fix any problems that happen.

Now, we are creatures of habit, right? So I thought: what if we (and I didn't think of this, by the way, a lot of companies do this) speed this up and do the same thing, but just more frequently? Right, like, if it hurts, do it more often. So this is the same system, but instead of doing one branch we create three: every week we create a new one. Developers get this similar system of "all right, I have some time to fix things, but I don't have much time, so I'm going to think twice: do I want to spend time panicking and fixing things quickly, or am I going to make sure that things are actually operational before I merge?" It also gave us a bit more time to make sure that, for whatever we deploy this week, we can be certain that by the end of the week the only new thing that is going to be bringing problems is whatever was created with the new branch that is going in.

We also had great help, and that is that we got to use the tool that we built. I mentioned GitLab.com: GitLab.com is one of the biggest instances of GitLab in the world, but we use GitLab to build GitLab, and then we use another GitLab to deploy GitLab. One other advantage we had was access to all of the developers who were working with us, because if we don't get something that we need, they won't be able to deploy their thing. So there was quite a lot of excitement when we came to them and asked: hey, can we improve this feature, how can we get this done better?

So some of the release tool tasks that you saw, the 26% of the time, we automated just by taking them into GitLab CI, triggering things through the API and using scheduled pipelines. If we need to create a branch, it's set in a scheduled pipeline that triggers every Sunday evening. Any P1 item that comes in, we automatically cherry-pick into the branch that is currently active. And we create various issues to track progress through the environments and the QA tasks that need to be executed.
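To make that a bit more concrete, here is a minimal, hypothetical sketch of what such a scheduled branch-cutting job could look like in GitLab CI. This is not our actual release-tools configuration; the job name, the branch naming scheme and the RELEASE_BOT_TOKEN variable are assumptions for illustration:

```yaml
# .gitlab-ci.yml (sketch only). The job runs exclusively from a pipeline
# schedule, e.g. one configured to fire every Sunday evening.
stages:
  - release

cut-auto-deploy-branch:
  stage: release
  image: curlimages/curl:latest
  only:
    - schedules
  script:
    # Create a new branch from master via the GitLab API.
    # RELEASE_BOT_TOKEN and the branch naming scheme are assumptions for this sketch.
    - >
      curl --fail --request POST
      --header "PRIVATE-TOKEN: ${RELEASE_BOT_TOKEN}"
      --data "branch=auto-deploy-$(date +%Y-%m-%d)&ref=master"
      "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/repository/branches"
```

The same pattern, a scheduled or API-triggered CI job wrapping an API call, can cover tasks like cherry-picking and issue creation too, since the GitLab API exposes both.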
And as I said, GitLab.com gets deployed from GitLab, so we had to mirror some projects between instances. I think another thing worth mentioning here is that GitLab CI was the actual glue that made all of this pull in the one direction we wanted. And finally, GitLab ChatOps feels like the underappreciated hero here: a lot of the release tasks got automated just because we got very easy access to everything we had to do by using it through Slack. For example, we connect GitLab ChatOps with Slack; everything is there, it's very convenient, you don't have to change your context.

To explain a tiny bit more how this ended up looking: the happy developer, as you can see on this side, goes through the whole process, right? Review, making sure their pipelines pass, doing some verification through the review apps that we connected to a Kubernetes cluster, and when they're absolutely sure they want to merge this thing, they merge it. Usually that means it's out of their hands; that's the magnifying glass thing. And the thing that you see scrolling here is our production pipeline. What happened was we realized that all of the items we had to do were related to moving the semi-automated tasks into CI. A developer's machine is a machine and a CI machine is a machine, so why not just have it there? It automatically logs things, and the release manager doesn't have to make sure they're looking at the screen for six hours while the whole deploy is going.

Now, one challenge we encountered here was that we had already outgrown the tool we had at that point. So instead of continuing to use it, we decided: what are the top two things we need to do to make sure that we can deploy safely? First, get the package in; second, make sure that it's deployed in a certain order. All right, well, that's easy. And we rewrote the tool we had; rather, we didn't rewrite it, we just wrote a new tool using Ansible, and we placed our CI runner on a bastion host that had access to the infrastructure. That was one of the bigger battles, so to speak, because we had to get sign-off from security to put what is basically a remote code execution machine in our system. We did get to do that, mostly because we got a lot of insight into how we could actually make it happen. And one of the great things that happened was that we now got to connect through all of our environments and do sequential checks as well (sketched below). So what happens is: when developers merge whatever they did, we automatically create a new package, that package gets picked up by our system and deployed on staging, and we got to put automated QA in CI as well. If the automated QA passes, it progresses to the rest of the environments. That meant the 60% of the time that we used to spend is now out there: it just happens, we don't have to do anything with it.

So, the finish line: when we enabled the system, in the same 14-day period we freed up 82% of our time. The CE to EE merge also got automated, and the 0.3% you see there is that sometimes the pipeline fails and we need to check it to see why it failed; that's it. The release tasks remained relatively high; still, the biggest chunk of this 17% is the security releases we need to do.
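And to make the staging-to-production flow I described above a bit more concrete, here is a rough, hypothetical sketch of how a deploy pipeline running on that bastion-hosted runner could be shaped. This is not our actual deployer configuration: the job names, the Ansible playbook and inventory paths, the QA script and the environment names are all assumptions for illustration.

```yaml
# Sketch of a deploy pipeline that picks up a freshly built package
# and walks it through the environments in order, gated by automated QA.
stages:
  - staging
  - qa
  - production

deploy-staging:
  stage: staging
  tags:
    - bastion               # runs on the CI runner placed on the bastion host
  script:
    # Hypothetical playbook and inventory names.
    - ansible-playbook -i inventories/staging deploy.yml -e "package_version=${PACKAGE_VERSION}"
  environment:
    name: staging

automated-qa:
  stage: qa
  script:
    # Placeholder for the automated QA suite run against staging;
    # the pipeline stops here if it fails, so nothing broken moves on.
    - ./run-qa-suite.sh https://staging.example.com

deploy-production:
  stage: production
  tags:
    - bastion
  script:
    - ansible-playbook -i inventories/production deploy.yml -e "package_version=${PACKAGE_VERSION}"
  environment:
    name: production
```

Because each stage only runs if the previous one succeeds, a failed QA job keeps the package out of production without anyone babysitting the pipeline.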
A security release requires a lot of coordination: there are a lot of stakeholders (security teams, development teams, marketing) and a lot of backporting to prior releases, so that remains a big chunk of our time.

In May 2019 we still had the old system and we had around seven deploys that month, which was standard for GitLab.com. In June we cautiously enabled things and we went to 12, I think. And then, and I'm super proud about this, in August this year we had 35 deploys on GitLab.com. That means more than one deploy a day. And did I mention that none of this is using Kubernetes? All of this is using our old legacy system. But what happened with this is that we bought ourselves time, so my team has time to actually work on the migration.

One of the biggest changes, though, was in the habits of the engineering organization. People are thinking twice when they click that merge button; they make sure the integration tests are done before they click it. We have all of the developers on call, because once we enabled the system the old habits just got exposed and we had to call a lot of people to help us out: why is our performance going down, why are we seeing an uptick in errors? Within two months of us enabling the system, the whole engineering organization, or rather the whole development organization, went on call. Suffice to say, I'm not really super popular there anymore. But where I am popular, or rather where my team is popular, everyone is grateful and everyone is excited that when an issue is found, they can fix it really quickly, deploy it really quickly, and within a couple of hours of finding a problem we can push it out to production, which is a huge, huge change.

Why is this "Kubernetes: The Prequel"? Because with us freeing up all of that time, my team migrated one of the services that we had to Kubernetes. So if you go to GitLab.com right now and try to do a docker pull or docker push, that is being served from a Kubernetes cluster. We are successfully using our deploy boards, we are successfully using our monitoring, and we are successfully using our web terminals, because obviously we had time to play a bit more than we usually would.

That's it from my side. I wanted to leave you with a bunch of links: if you're interested in how all of this developed, you can check out the design docs that we wrote, you can check out where that pie chart you saw originally came from, the time-for-release reports, and finally you can follow along with our progress on our Kubernetes migration and the kinds of challenges we ran into when we started it. Thank you. Questions? We've got a few minutes for questions, so I'll bring you the mic if you want to ask anything.

You're talking about how, sort of, CI/CD enables cloud native, right, and you're talking a lot about having to put the entire organization on call and that kind of thing. How do you get buy-in from executives to make those kinds of changes?

So, one thing that I think executives love is dollar numbers. When they see how much time gets spent on busy work, something that does not contribute to the actual goal of the organization, and if you transfer that into a dollar amount and show how you can change that dollar amount to something way less, everyone starts listening. Maybe they don't understand what we actually did here, maybe they do; it totally does not matter. What matters to them is that a developer can fix a problem within a couple of hours instead of two weeks, three weeks, a month.
And then, sort of to follow up on that question: how do you implement CI/CD without having the engineering team be on call all the time, or is CI/CD just a sinister ploy to extract more productivity out of the engineers?

So, I think the on-call was not made to make developers less productive. The on-call was introduced to create some sympathy with the people who are actually managing the infrastructure and are on the forefront, and the idea is not to have developers on call forever; the idea is to teach them how to get themselves out of the on-call rotation: to think about what kinds of problems at scale they need to think about and how to resolve them properly without, yeah, just merging randomly. What I think is happening already, within the two-month period that we've had developers on call, is that there are changes in habit and they're starting to ask the right questions: how can I get access to the production database to understand the scale of the problem that I'm trying to resolve? That is a great question to ask, because now you can provide them the data, they can inform their decision, or rather they can understand how to fix a problem at that scale, and they're not going to get paged, their colleague is not going to get paged, and the organization is getting better with it. And I really do believe that with time we are going to remove the need for developers being on call: even if they are on call, they're not going to get paged. I think that's a great success story.

At this point it sounds like you're still releasing once a month? Yes, correct. Are there any plans to eventually increase the number of times you release?

So, there is a difference between what we do on GitLab.com and what we do for the self-managed release. We already hear from, well, all of you that having a release once a month is great, but your organizations can't update that quickly, so we don't necessarily want to change that cadence. But we still use the same tools to release to self-managed as well. So what is actually happening right now is that we are getting a tried and tested product earlier, and once we actually say "this is the 11.3 release", what the customer is going to get is something that has already run. I wouldn't say bug-free, because that's not possible, but definitely with not as many bugs as you usually have in these types of processes. So the system enabled us to ship faster to GitLab.com with more confidence, and that actually allows us to make sure that we are not going to have more releases for our self-managed customers but actually fewer, because we don't have to create as many patch releases, and we get to focus on doing the new features that we want to show.