Wednesday, October 05, 2016

GOTO 2016 • Microservices at Netflix Scale: Principles, Tradeoffs & Lessons Learned • R. Meshenberg

Thank you, and thank you for coming to this talk, titled "Microservices at
Netflix Scale". We'll talk about the various definitions of microservices and
we'll talk about Netflix scale, but first I'd like to start by asking you a
question. Did anyone make a large purchase recently? A house, an expensive car?
A lot of you. How many of you made that purchase just looking at the features,
the benefits, and not looking at the price?
All right, good. So the analogy here to microservices is clear. In many
talks and many seminars, microservices are pitched purely for their benefits,
but that's only one side of the story.
These benefits come at certain costs, and so today what I'd like to talk to you
about is Netflix's journey to microservices; the benefits, and I'll reiterate
some of them; the costs; lessons learned,
basically the best practices and anti-patterns we discovered through our
journey to microservices; and some resources that we'd like to make available
to you. First, by way of introduction: Netflix has over 81 million subscribers
all over the world. We're probably the largest internet television network, and
people enjoy over a hundred twenty-five million hours of TV
each day. That translates to a lot of bandwidth; we'll talk about it later.
What I do at Netflix is run several teams under the umbrella of platform
engineering. We effectively make the Lego blocks that all the rest of the
engineering teams at Netflix use to build their own applications, so we
create a common layer to enable teams to move fast.
Netflix runs on microservices; little-known fact. But what does Netflix
scale look like? So, a hundred twenty-five million hours per day translates to
over four billion hours streamed each month.
In other terms, it translates to about one-third of North American downstream
traffic on average.
That's a lot of bits at any point in time. All this traffic is supported by over
500 microservices. The reason I say "over 500" is simply because we don't know
how many services are running at any given time. On any given day we have maybe
between hundreds and thousands of production changes: new services are being
deployed, existing services are being changed, some services are being retired.
There is no central system that gates it all, because we want people to be able
to move fast. For all these services we have an aspirational goal of four nines
of availability; closer to reality would be three and a half nines of
availability, and we'll talk about that. And that runs across 3 AWS regions, two
in North America and one in Europe, across nine availability zones, and we'll
talk about why all of that. On our people side, that effectively
represents a company of about 2,400 people; about half of them are technical,
so about twelve hundred engineers working on all these services. So let's
start with the journey. How did we get to this point? Because we weren't
always in the cloud, weren't always on microservices. First of all, our journey
took seven years.
It didn't happen overnight. We started back in August 2008 and just finished
earlier this year, so it was a long journey. And our journey to microservices
was not really, necessarily, a journey from monolith to microservices per se. It
was triggered by our need to move into the cloud, for various reasons; we've
blogged and talked about it at numerous places, so I'm not going to reiterate
them here. But effectively, as the only way for us to take a very big, complex
system from the data center into a cloud, we couldn't just forklift. We couldn't
just take this big, complex, fragile system, move it to the cloud, and say we're
done. We had to chisel it away piece by piece, effectively microservice by
microservice, and then deploy it in the cloud: re-architected, redesigned,
re-implemented, and put in the cloud. So what did it look like before? In
our own data center we had a monolith. There was this application to which all
the various teams contributed JAR files. It would get baked into a WAR; the WAR
file would go through your normal weekly or bi-weekly test cycle, or release
train cycle, and eventually get deployed on top of a database, a pretty big
RDBMS, Oracle DB in particular, that would take all the traffic. And you can
clearly see what the problems with this are. Besides the velocity, which we'll
talk about separately, even from a reliability perspective this is very fragile.
Any issue that you introduce to your application, whether it's instant or
latent, will be homogeneously represented throughout your farm, so if you put a
bug into one machine, you put a bug into all machines. In addition, this
database does represent a single point of failure. So in a previous slide I
mentioned August 2008. This is exactly what happened in August 2008: we had a
database failure. Our main production database got corrupted, and all of the
Netflix users (most of them were not streaming at the time, they were still
mostly getting DVD-by-mail) got this lovely message: "Sorry, we're working on
fixing our problem and we'll get back to you." That took four days.
Imagine having a four-day outage; that's not pretty. So that was sort of a
triggering event that led to our transformation into microservices, that
led to our transformation into being in the cloud only, and I'm really happy
that transition is done. Before we get into the application and practice, let's
talk about the first principles: effectively, the assumptions that we all made.
Once we started chiseling away those microservices from the monolithic
application, we had to make certain assumptions; we had to prioritize for
something. So let's quickly go through what those assumptions were, and we'll
also talk about why we made those assumptions, because depending on what you're
trying to optimize for, the assumptions could be different, and I didn't want to
pitch particular solutions or frameworks without understanding first
what those assumptions are. So first of all, we really don't like NIH, or "not
invented here". If we're faced with a choice between building something or
buying something (and just to be clear, "buy" doesn't necessarily limit us to
vendor software; it can mean just using off-the-shelf open-source software), we
prefer not to build it. If we can leverage and potentially contribute to
something from open source, we'd rather do that, and so we only reserve
building something in-house for the cases where there are no other solutions
out there that will work for us, particularly at our scale. We've tried
many solutions that unfortunately ended up not working, and we ended up building
a lot of our own software anyway, but at least we tried not to. Most of your
services, as you're breaking up your monolith, should be stateless, with the
caveat, of course, except for the persistence and caching layers. What that
means is you shouldn't rely on sticky sessions. If any particular instance or
node dies, you should be able to simply retry, hit a different instance, and
then proceed with your execution.
It's not enough to say "sure, okay, we'll do that"; you have to prove it, and
the way we prove it at Netflix is by using Chaos Monkey and the like; we'll talk
about that. Scale out versus scale up: you have two choices when you need more
resources. One, you can go to instances or nodes with more resources. Whether
it's CPU, memory, storage, or network, you can always get something that has
more, until you can't, until you're running on the instance that has the most
cores available, that has the most disk available, and what you've done is
architecturally painted yourself into a corner. So what you'd rather do, when
dealing at scale, is scale out. It's so much easier to add new instances,
and especially in the cloud, because it gives you that nice benefit of
elasticity; not infinite elasticity, but elasticity nonetheless. Redundancy and
isolation: we'll talk about it when we talk about resiliency, but it's
really important for resiliency, and these two principles are really common
sense. Redundancy: you make more than one of anything; you simply don't want any
single point of failure. Even though you make more than one of anything, you
still need to isolate the blast radius for any given failure. Even if you have a
hundred fifty nodes in the cluster, if one of the failures can result in a
cascading failure throughout the whole cluster, you haven't isolated the failure,
and so it's important to do that. And of course, destructive testing: you
don't want it to be a one-off; you don't want to just do it once a
quarter, once a release, or once whatever cycle. You want it to be constantly
running; you want to constantly prove that your system can withstand the
failures that occur. Now, it's not if the failure will occur; failure will
occur. It's when it will occur. And chances are (Murphy's law was still working
last I checked) failure will occur in the middle of the night on Saturday night,
when most of your engineers are asleep or drunk or both, and that's not the time
when you want to wake them up and have them deal with failure. And so we
actually put a lot of thought and practice into running destructive tests all
the time during office hours, when engineers are there, they're caffeinated,
they're fully alert, and they can deal with problems right away. So it started
with Chaos Monkey, probably our most famous member of the Simian Army, but it
actually grew into a much bigger thing: there's a whole principle of being able
to destructively prove that our systems are resilient.
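
To make the idea of office-hours destructive testing concrete, here is a minimal Python sketch. It is not Chaos Monkey's actual code; the function names, the weekday window, and the instance list are all assumptions made for illustration. It only decides which instance would be terminated and leaves the termination itself to whatever tooling you use.

```python
import random
from datetime import datetime


def in_office_hours(now: datetime) -> bool:
    """True on weekdays inside a working-hours window (the window is an assumption)."""
    return now.weekday() < 5 and 9 <= now.hour < 15


def pick_victim(instances, now, rng):
    """Return one randomly chosen instance to terminate, or None.

    Nothing is selected outside office hours or when there are no instances,
    so failures are only injected while engineers are around to react.
    """
    if not instances or not in_office_hours(now):
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    cluster = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance ids
    print(pick_victim(cluster, datetime.now(), random.Random()))
```

Passing the clock and the random source in as arguments keeps the selection logic easy to test without touching real infrastructure.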
So how do these first principles, all these assumptions, work in action?
Well, we talked about stateless services. In our ecosystem, in order for a
stateless service to be a good citizen, effectively it just needs to do a few
things. It needs to register with service discovery, and depending on your
ecosystem, service discovery may be anything: it could be DNS, it could be your
own service discovery, anything. It needs to implement an externally callable
health check, so that, whether it's your service discovery or a load balancer or
another external agent, it can verify that your systems are not just there but
actually up and running and functional. And in order to call other services, it
needs to utilize that information and get from service discovery where the other
services are located.
Simple, right? Right, until it fails. So when it fails, all you need to be able
to verify is that the calling service, in this particular diagram service A,
just needs to retry that request, get a different instance of service B, get the
same response, and then move on.
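
The register, look up, and retry flow described above can be sketched in a few lines of Python. This is an illustration only: `Registry`, `InstanceDown`, and `call_with_retry` are hypothetical names, instances are modeled as plain callables, and a real service discovery system would also run health checks and expire dead entries.

```python
from typing import Callable, Dict, List


class InstanceDown(Exception):
    """Raised by an instance that cannot serve the request."""


class Registry:
    """Toy in-memory service discovery: service name -> registered instances."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[Callable[[str], str]]] = {}

    def register(self, service: str, instance: Callable[[str], str]) -> None:
        self._instances.setdefault(service, []).append(instance)

    def lookup(self, service: str) -> List[Callable[[str], str]]:
        return list(self._instances.get(service, []))


def call_with_retry(registry: Registry, service: str, request: str,
                    max_attempts: int = 3) -> str:
    """Try up to max_attempts different instances until one answers."""
    last_error = None
    for instance in registry.lookup(service)[:max_attempts]:
        try:
            return instance(request)
        except InstanceDown as err:
            last_error = err  # this instance is gone; move on to the next one
    raise RuntimeError(f"no instance of {service} answered") from last_error
```

Because the callee keeps no per-client state, the caller gets the same response whichever instance ends up serving the retry.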
In order to do that, you have to kill an instance of service B, or introduce a
network partitioning event between service A and an instance of service B, and
then verify that the result is the same. You have to do it in production; that's
the catch.
I've seen tons of instances where developers and companies claim that "yes,
I've done this testing": I've done this testing on my development machine or
environment, I've done this testing in a test or QA environment, and yet in
production the systems failed in a different way. So if you really want to
prove to yourself that your stateless services are truly stateless, do this in
production. You don't have to do it overnight; you can build confidence. But
eventually, if you really want to do this, you gotta do it in production.
Chaos Monkey, I mentioned it already. That's what it does: it's effectively a
monkey running around the data center (a virtual data center in our case, but a
data center nonetheless) and killing instances randomly. We run it Monday
through Friday, 9 to 3, and each application is subject to it. Initially it was
just stateless only; now stateful applications as well. And you can tune it;
basically, you can tune it from doing it infrequently, maybe once a day, to, if
you're feeling more bold, more frequently, and see how your system reacts.
Data. So I mentioned, in our data
center environment we had these big RDBMS instances running Oracle DBs. When
we started migrating to microservices, to the cloud, we embraced Cassandra as
our main key-value storage, for several reasons. One, it actually runs at
scale; it's one of the bigger-scale
NoSQL engines. It's open source, and that was important to us because very
early on we needed to, and in fact did, contribute quite a bit to Cassandra in
order to make it ready for multi-regional, multi-directional replication. And
in terms of the storage engine, from the perspective of the CAP theorem, it's
available, it tolerates partitions well, and it has tunable consistency. Now, in
many talks about Cassandra you'll hear that it's an eventual consistency engine.
That's actually only a part of the story. With a Cassandra request, whether
you're reading or writing, you can tune the consistency of that request, from
CL ONE, which is pure eventual consistency, to local quorum, all the way to
global quorum. I don't recommend global quorum, and we'll go into those reasons.
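
The difference between these consistency levels comes down to how many replica acknowledgements a request waits for. The Python sketch below works through that arithmetic using Cassandra's quorum rule (half the replication factor, rounded down, plus one). The region names and replication factors are assumptions for illustration, not Netflix's actual settings, and "global quorum" is modeled as Cassandra's plain QUORUM level.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def acks_required(level: str, local_region: str, rf_by_region: dict) -> int:
    """Replica acknowledgements a request waits for at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "LOCAL_QUORUM":
        # Only replicas in the coordinator's own region count.
        return quorum(rf_by_region[local_region])
    if level == "QUORUM":
        # Majority across all regions, so some acks must cross the WAN.
        return quorum(sum(rf_by_region.values()))
    raise ValueError(f"unsupported consistency level: {level}")


if __name__ == "__main__":
    rf = {"us-east": 3, "us-west": 3, "eu-west": 3}  # hypothetical topology
    for level in ("ONE", "LOCAL_QUORUM", "QUORUM"):
        print(level, acks_required(level, "us-east", rf))
```

With three replicas per region, a local quorum needs two acknowledgements that can all come from the same region, while a global quorum needs five, which no single region can supply on its own.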
So this is actually how Cassandra multi-regional replication works. On one
side your client, let's say, writes at local quorum. That means you're
going to get three replicas locally in the region before getting an
ack, before confirming that the write has been made. But what you don't want to
do is wait for that confirmation over the long-distance network across the
regions; your write simply will be too slow, and it's not going to scale. So
what you want to do is replicate the data across regions asynchronously, and
that's actually a change that one of our committers contributed to Cassandra
very early on, many years ago, and now it's the default in the distribution. So
we talked about stateless services, we talked about stateful services; there is
also the issue of billing services, which are probably composed of both, but
they deal with money, so you have to be a little bit more careful. And that's
actually the service that took us the longest to migrate to the cloud, to break
into microservices, and for a good reason. We don't want to overcharge our
users; we also don't want to undercharge our users; we want to do it just right.
So, you know, also in order to be compliant with SOX and PCI and so on, we had
to put these services into a separate account that is limited-access; it's not
as fully open as all the other production systems at Netflix. It should be fully
logged and auditable, and the key there is just to be a little bit more
careful about what you're doing. Okay, so I will go very quickly through the
benefits of microservices, because I'm pretty sure you've heard many of them
already many times, but I think that some of them are worth repeating. But
before that: what are Netflix's priorities? What are we trying to optimize for?
Well, first and foremost, we want to optimize for velocity, for innovation. For
us it's really important that all of our teams that are working on the Netflix
product move fast, and because of that we're actually willing, and so far have
been able, to sacrifice a little bit of reliability. That's why you see
reliability as priority number two, as silly as it may sound, in order to
achieve that high velocity. And efficiency is only third on that list.
So when we were making these trade-offs, when we were trying to decide which way
to optimize, we always optimized for innovation first. And that's exactly the
challenge that you have in the typical monolithic systems and release cycles. In
monolithic systems you have your teams that are producing various components;
they're all developing; eventually they submit their development results into a
test cycle or release train, and eventually, after QA signs off, that gets
released. That creates a lot of tight coupling between the teams, and that just
doesn't work if you're trying to optimize for velocity; it's too slow. Loose
coupling is when you have each team working independently. That means each team
now works on all of the parts of the cycle: they architect, they design, they
develop, they test, they deploy, and they support. In other words, it's
end-to-end ownership. Now,
this is sometimes called the DevOps model. I'm trying not to use this word
because it's getting overused to the point of being a buzzword, but if you think
of end-to-end ownership, your team is responsible for everything from cradle
to grave, and this way their motivations are set up so they write
quality code, but they also write it fast, because engineers, especially great
engineers, are motivated by the results; they're motivated by the impact that
they make. And if each team owns the full cycle, you get something like this,
where all of them are constantly making progress, all of them are constantly
doing something. If you think about it in technical terms, imagine each team as
a thread, and all of them running independently in parallel without a
single gate to block them, to impede their progress. All these teams are
very loosely coupled. We don't have a central place to gate or approve their
releases; they each move on their own cycle, at their own cadence. You also get
the benefit of separation of
concerns. For example, the lowest layer, the infrastructure teams that I work
with, are mostly concerned with availability, scalability, security: the
fundamental qualities that each application should have. So then the teams who
build their applications out of the Lego blocks we provide don't have to
reinvent the wheel; they can build on top of it and leverage what we've already
done. Enough about the benefits, though; let's talk about the costs. Nothing is
free.
First and foremost, microservices are an org change. If you want all your teams
to embrace the full cycle, that means you no longer necessarily have a QA team,
you no longer have an ops team. Now, who wants to hear that their job is going
away or changing, right? These are the harder things; these are the people
things. Emotions are involved; org changes are hard. And so what you need to do
is evolve the organization over time, gradually. You won't be able to
do it overnight. And just to give you a couple of examples of how your practices
may change, depending on what works for you: in our data center
environment many years ago, we had a centralized group that was basically
approving and driving all the releases.
We had an IT group that was
responsible for capacity: budgeting, planning, and executing the capacity.
We had DBAs who were effectively gatekeepers to these big RDBMSs that
we had. We don't have any of that now. If a developer or a team needs resources,
they self-provision through the tools. Now, because they do it through the
tools, it's all transparent; everybody can see what's going on, and the people
who concern themselves with capacity can see a real-time signal of where the
capacity is needed. We don't have an ops team; we have an SRE team that builds a
central set of tools that everybody can leverage to operate their own systems,
but they're not gating anybody. And the DBAs work with other teams to work
out what schemas for the databases would be optimal, but we don't
operate anything for anybody.
So it's a lot of centralized teams who build a support model, who build the
frameworks for other teams to enable them to own the full life cycle. Again,
effectively we're building these Lego blocks, but in order for you to reap that
benefit you have to invest; you have to build the teams that will build these
Lego blocks. Here comes the catch: you have to do it while still running your
old stack, and because migration doesn't happen overnight,
you're going to be living in this dual world for a while. At Netflix we
use a term that this image represents: Roman riding. That's when
you ride two horses, one leg on each.
This guy looks very uncomfortable. Think about it: you're supporting two tech
stacks. That's double the bugs, double the maintenance; sometimes you have to
propagate new features in two places; you have to replicate data, and
multi-master data replication, especially at scale, is no picnic.
No matter what great engineers you have and how much testing you do, it's
not going to be foolproof, and only once you switch over the source of truth to
the new stack and stop that multi-master data replication will you be able to
breathe a sigh of relief. Until then, you're going to be constantly fighting
some kind of battles. So what kind of lessons did we learn through
the journey?
First and foremost, there are a few pieces that are critical to ensure this
loose coupling. When we talk about loose coupling between the various teams
developing microservices, there are really two pieces that are needed. One I
covered on the slide, which is IPC, or RPC (again, interchangeable terms), but
you effectively want to establish a contract, or a language, for any two
services to talk to one another.
This way, when the time comes to develop a new system,
you already know which
language to talk to all these systems, and if you have 500-plus microservices
running, you don't want to know 500 different languages; it just simply doesn't
scale. So IPC is one of those pieces. The other one, equally as important:
you want some homogeneity, you want some consistency in how your applications
are being deployed. For us, initially it was Asgard, which we had to develop
in-house, and then we recently replaced it with another tool that we also open
sourced, called Spinnaker, that automates all of our deployment workflows. The
benefit of that being, again, the same central tool is that people don't have to
reinvent this wheel, and we also get a source-of-truth point of view on what
applications were deployed, when, and how, and that provides invaluable insight.
In a time of crisis, one of the first things you want to see when something
goes south is what was changed most recently. Databases: it's one thing when
your database gets called by one or two monolithic applications; it's completely
different when suddenly it gets pummeled by 500-plus microservices. You want to
protect your databases, and so a pattern that we evolved into over time is that
for the most heavily hit databases (even Cassandra has its limits)
we protect them with a layer of cache. And so we actually use two different
caching technologies within Netflix. One is a key-value add-on built on
top of memcached, called EVCache, and the other one has a richer schema, which
you could use Redis for; it's called Dynomite. But the idea is the same:
what you want to do is, on the read path, hit the cache first and go to the
database only on a cache miss, and if you go to the database on a cache miss, on
the way back you're going to backfill
the cache.
You also want to make sure that in your call graph, right at the top of the call
graph, you have a request-level cache, so that if, for example, at the
beginning of the call you had to fetch the user information to understand the
metadata about that user, you will propagate it downstream, so all the
subsequent services don't have to go and call that database again to
fetch the same metadata over and over and over again. You will find that you
will reduce the load on your databases by at least one order of magnitude, maybe
two, maybe three; it depends on your call graph depth and interconnectedness.
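
A request-level cache can be as simple as a context object created at the top of the call graph and handed to every downstream step. The Python sketch below assumes hypothetical names (`RequestContext`, `handle_request`, `fetch_user`) and models downstream services as in-process functions; across real service boundaries the fetched metadata would travel with the request instead.

```python
class RequestContext:
    """Per-request cache: user metadata is fetched once and reused downstream."""

    def __init__(self, fetch_user):
        self._fetch_user = fetch_user  # stand-in for the database or service call
        self._users = {}

    def user(self, user_id):
        if user_id not in self._users:
            self._users[user_id] = self._fetch_user(user_id)
        return self._users[user_id]


def handle_request(user_id, fetch_user, downstream_steps):
    """Create the context at the top of the call graph and pass it to every step."""
    ctx = RequestContext(fetch_user)
    return [step(ctx, user_id) for step in downstream_steps]
```

With three downstream steps that each need the same user record, the database sees one read instead of three, and the saving grows with the depth and fan-out of the call graph.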
Telemetry: operational visibility matters a lot.
If you're on your monolith, you probably have some good metrics, some kind of
application and system metrics you look into to understand your system's health.
When you're on hundreds of microservices, will your telemetry scale? How many
graphs are you looking at per server, times the number of microservices that you
have? Will you be able to see the forest for the trees? Per server, how much
individual human action is needed when something goes wrong?
How much of that were you able to automate? Just to give you an idea, at
Netflix we generate over 20 million metrics a second on average; that
translates to about 1.7 trillion a day. It's simply unsustainable for
any human to look into all of those metrics and actually be able to
get some signal from that noise, and so most of these metrics never get looked
at, that is, by a human. They get piped into automatic error-detection
algorithms; they get piped into automatic remediation algorithms that will
detect if there is an anomaly and in many cases will correct it without a human
ever being involved. The same goes for your log analysis tools, right? If you
generate logs like us...
A couple of years ago my ex-colleague coined the term that Netflix is a log
generation service that also allows you to watch movies, because we just produce
such an obscene amount of logs. And so you need to be able to find and hone
tools that will allow you to get that signal out of all that noise, and
preferably automate as much of it as you can. So this is what it looks like once
you get into the scale of microservices. You simply don't have the luxury of
having architectural diagrams, because things change all the time, and so what
you'll need is, at runtime, to be able to discern who's calling whom, where
the errors are, where traffic is flowing, whether there is any congestion in the
system. It may look something like this; it may look like something different.
In this particular case you see the traffic starts at the Elastic Load Balancer
for us, then it gets spread through our front line of defense, called Zuul,
which is our front-end proxy. Then your calls may go into playback systems, the
API backend, and this area comprises our edge systems, this kind of front line
of defense. From then on you have middle-tier services, platform services like
caching and the databases, and so on and so forth. But this type of
insight, this type of telemetry, you
have to generate at runtime, because if you just create a static architecture
diagram, the next few releases are going to change things and render it
obsolete. Reliability:
it matters a lot, especially at scale. Failure happens in distributed systems,
and the rate of failure is proportional to the amount of change that you're
pushing through and the scale that you're running at. For us, both are massive.
And so this drives our aspirational goal of four nines of availability.
Unfortunately, that only leaves us 52 minutes of downtime per year; up to now we
haven't been able to consistently achieve that. Now, when Netflix is out these
days, that causes some disappointment, outrage, or withdrawal.
Luckily, some people actually do retain their sense of humor and react a lot
more positively. But availability is important. As much as we'd like to focus
on innovation and velocity only, you can't do it, at least not for long: if your
system becomes unavailable, people just stop using it. Now, in distributed
systems, especially when you have a large number of microservices, you have to
face the fact that you'll see some cascading failures. If you test each
individual microservice that you write,
and each individual service's reliability is two nines, well, it's not great,
but it's not bad. But if your graph is fairly interconnected and you have over
500 of them, the total availability is not going to be very good; in fact,
you're going to be out of service most of the time.
That's not acceptable.
So what you want to do is detect and correct these failures as fast
as you can.
This is where the whole concept of the circuit breaker comes in. Circuit
breakers detect a fault in your electrical systems and trigger a
fallback; in electrical systems it's simply to switch the power off, but in
software we have more options. If you detect a failure in a downstream system,
what you don't want is that same failure to be propagated eventually to
the client, because that's not very helpful, even if you propagate the failure
fast. What you want to do is detect that there is a
problem and figure out whether the problem is with a critical service, in which
case it's game over, or a non-critical service. And if it's a non-critical
service (for example, at Netflix it could be personalization; maybe for whatever
reason we cannot give you the level of personalization that we'd like to),
well, you still probably would enjoy browsing and selecting from a set of
movies and TV shows that's not necessarily personalized to your maximum
liking; you'll probably still find something good to watch. So what you'd
like to do is turn all those failures into fallbacks as immediately as
you can, and this way you can still operate while the team responsible
goes to debug and fix the problem.
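
Here is a minimal Python sketch of a circuit breaker with a fallback, to illustrate the pattern described above. It is not the API of any particular library; the class name, the thresholds, and the injectable clock are assumptions, and it simplifies the recovery step by fully closing the circuit once the cool-down has elapsed.

```python
import time


class CircuitBreaker:
    """Serve a fallback while a dependency is failing, instead of propagating errors."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after  # seconds to wait before trying again
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def _is_open(self):
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._reset_after:
            # Cool-down elapsed: close the circuit and let calls through again.
            self._opened_at = None
            self._failures = 0
            return False
        return True

    def call(self, primary, fallback):
        if self._is_open():
            return fallback()            # short-circuit: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # trip the breaker
            return fallback()
        self._failures = 0
        return result
```

In the personalization example, `primary` would fetch the personalized rows and `fallback` would return a generic, unpersonalized list, so the user still has something to browse.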
So a few years back we actually open sourced the library that enables us to do
just that; it's called Hystrix, and it's been used by many companies. We've been
getting really good feedback and contributions on that. I can't emphasize
this enough: you have to do destructive testing. Failures will happen, and you
only know how your system will react to a failure if you inject it. You gotta
do it in production.
Again, not to start with, but eventually, if you want to prove it, you gotta do
it in production, because failures happen not just at the system or cluster
level but across those clusters, at the network level. Some failures are more
insidious than others.
If something fails, it's actually a lot cleaner: you know it's broken, you know
you're getting an error or a mangled response back, and you can still deal with
it. It's a lot more insidious when the response is returned just too slow, too
late, right at the point where it's going to trigger your timeout. Imagine: it
could cause cascading timeouts; that's just not a pretty scenario. And so we
actually developed a system, internally known as FIT, where we're able to inject
certain faults into individual requests and call paths, and the faults could be
errors, or they could be latencies, or combinations thereof. And when all else
fails, you gotta trigger failure on a more massive scale.
It's not enough to trigger failure on an individual instance or zone or even a
cluster. At Netflix we actually trigger failure exercises in production on a
monthly basis, at least, for full regions. So all of our services run in 3
AWS regions across nine zones. At least once a month we randomly select one of
the regions and we evacuate out of it.
We basically simulate: if that region were to fail, what would our users
see? And our goal, which I'm happy to say right now is the case, is that when we
evacuate out of the region, as a user you will not see a thing.
Netflix services will continue working as if nothing happened. It wasn't always
like that. The first couple of times we ran the simulation we call Chaos Kong,
things were not as fun. We had about 40 engineers in a war-room scenario
debugging for about four hours, trying to figure out various things that went
wrong, and then we fixed those things and tried it again.
That was ten engineers debugging for about two hours, and now it's effectively
a piece of automation, a piece of script that runs at monthly intervals. Most
engineers don't even know when it runs, because the results are so transparent.
It takes a while to get there, but once you get there the rewards are pretty
damn good.
So that's the graph of us, in this particular case, evacuating US East
into the US West and EU regions. You will see on the bottom that, as the
traffic went out to the other regions, the overall traffic did not change. Or,
as this much cooler illustration shows, this is a replay, on a lot faster
timescale, of what exactly happened. The traffic that you see emanating from the
middle circle is the traffic from our users, from the internet; the traffic
across the edges, that's the traffic between the regions. And you will see that
right now we triggered a failure in this region right here, in US East; it
turned red, and we just started proxying traffic to the other two surviving
regions. And as that continues to ramp up in speed and scale, at a certain point
we will flip the switch and you will no longer see any traffic hitting US East,
because all the traffic has been fully redirected to
the other regions.
This is a big hammer; it takes a long time to get there, but if high
availability is your goal, you're going to have to exercise something like this.
Not necessarily this; it's whatever works for you.
Okay, no talk on microservices would be complete without mentioning containers:
latest and greatest, super shiny. But let's get down to basics. Containers
don't make microservices. Containers change the level of encapsulation, of
isolation, from virtual machine to process.
Containers bring you great benefits, specifically for developer velocity. You
can iterate on seconds-long cycles. You get the same artifact that runs on
your development machine, the same as in production. You can do a lot of really
cool magic with containers, as actually the first talk this morning showed. But
it's not a silver bullet, and so make sure that you're using it because it's
the right tool for the job. Besides, to run containers at scale would require
something like this. This slide is intentionally meant to be an information overload.
I'm not going to dwell on details here, but it does require very significant and
complex systems. Now, Google has done amazing work with Kubernetes. It's
matured quite a bit over the last couple of years. If you're thinking of containers,
use the foundational blocks that are available to you; don't write your own.
Unfortunately for us, we started a bit earlier, when a lot of these blocks were
not available, or were not available at our scale. And so, again, our preference is
not to build something we can buy, but in this case we actually ended up building
a lot of pieces in our container runtime called Titus. We had to write our
own custom scheduler and a whole bunch of other things, simply because the
ecosystem was not ready yet for our scale. So let's talk about
some resources that are available to you. Or you can think of it as a
commercial break from the regularly scheduled presentation. Because, by
being a pioneer in the cloud so early, we ended up writing a lot of things,
and because we ended up writing a lot of these kind of foundational pieces, we
open sourced them, and so they are available to you. If you go to
netflix.github.com you will find most of our infrastructure pieces, the tooling
pieces that are available to you to use and contribute to if you choose to. They're
separated into major categories, so it should be fairly easy for you to find
anything that you're looking for there. Don't try to read the small print;
you can just go to the
site and it's all there. But the main pieces that you could benefit
from: Spinnaker is the tool that we open sourced that works for continuous
delivery workflows, for deployments consistently throughout your estate.
For stateless applications, for the common runtime and shared libraries: if
you need service discovery, for example, we have Eureka that you could use. There
are many other alternatives as well, and there are a few IPC components there as
well. For data persistence, we open sourced and contributed components to the Cassandra
ecosystem and components to the Redis ecosystem. If you need to make Redis
distributed, we have the component Dynomite, and EVCache for memcached
client replication and such. For insight, again, if you need telemetry for
distributed systems at scale, large pieces of it are available; you can plug
it into your ecosystem. Security matters a lot, and probably the last thing you
want is your teams to keep reinventing security pieces.
It takes skilled security software engineers to write good crypto, to write
good secure systems that will not fall down under basic attacks. What you'd
like to do is to provide these as either services or shared libraries, so your
application engineers can focus on the application logic. That's their business
impact; it's not reinventing security. So, to wrap up: microservices are good. They
bring great value to development velocity, availability, and many other
dimensions, but they're not free. Microservices at scale, first and
foremost, require organizational change and centralized infrastructure
investment if you want to do it right. It's up to you when you want to make
that investment. Just keep in mind that if you have ten teams working on 10
microservices and you want to introduce a certain centralized change, there is
a certain amount of tax that all 10 teams will have to pay. If you want to do
it later, with perhaps 200 teams writing 200 microservices, your centralized tax
becomes a lot larger. So optimize accordingly.
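To make the "centralized tax" arithmetic concrete, here is a rough sketch. It assumes the cost of adopting a change is the same for every team and simply adds up; the figure of three days per team is invented for illustration and does not come from the talk.

```python
def centralized_change_tax(num_teams, days_per_team):
    """Total engineering days spent when every team must adopt one change."""
    return num_teams * days_per_team


DAYS_PER_TEAM = 3  # assumed figure, not from the talk

early = centralized_change_tax(10, DAYS_PER_TEAM)   # change made with 10 teams
late = centralized_change_tax(200, DAYS_PER_TEAM)   # same change with 200 teams
print(early, late, late // early)  # prints: 30 600 20
```

Under this simple linear model, postponing the change until there are 200 teams makes it 20 times more expensive, before counting any coordination overhead.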
Also, I can't stress this enough: don't do something just because we did it.
What worked for us may not work for you, so be aware of your situation and what
the right trade-offs and optimizations are for you.
We're happy to share tools, but they will only work for you if you're making
similar assumptions and similar trade-offs as we do. I hope this was
helpful. I can take any of your questions now.
Thank you very much. There are some interesting questions from the
audience, and actually one of them caused a little smile on my face. An analog
to the falling tree in the forest: if Chaos Kong kills stuff and nobody
notices it,
how do you know it actually happened? Excellent question.
So just because our users may not be aware that Chaos Kong killed
something, our telemetry is. We have our telemetry and logs, an obscene amount,
that pretty much tells us everything that's happening within our system. Now,
of course, the most obvious side effect, which we never want to see but do see
from time to time, is when Chaos Kong causes user impact. But this is actually
a failure that we can learn from to make our systems more resilient. The
toughest situation is when we kill something and users don't see anything.
It actually takes discipline from all the system developers involved to look
at their logs, at their system health, understand how the systems fared under
the evacuation situation, under the latency situations, and make sure that
there are no bottlenecks that we were close to reaching, that there is no failure
that we just degraded out of while everything seemed peachy but it wasn't.
So there's a little bit of follow-up that happens internally. But first and
foremost, of course, when we run Kong, the goal is that there's no user impact.
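The follow-up described above, checking whether any system came close to a bottleneck during the drill even though users saw nothing, could look something like the sketch below. The service names, the numbers, and the 80% threshold are hypothetical; this is not Netflix's telemetry tooling.

```python
def near_limit(peaks, limits, threshold=0.8):
    """Return (service, utilization) pairs for services whose peak load during
    the drill reached at least `threshold` of their known limit, worst first."""
    flagged = []
    for service, peak in peaks.items():
        utilization = peak / limits[service]
        if utilization >= threshold:
            flagged.append((service, round(utilization, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical peak loads observed during a drill and known capacity limits.
peaks = {"playback": 900, "search": 400, "ratings": 990}
limits = {"playback": 1000, "search": 1000, "ratings": 1000}
print(near_limit(peaks, limits))
```

A service that shows up in this list survived the drill but with little headroom, which is the kind of "everything seemed peachy but it wasn't" case worth a follow-up.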
And who or how do you decide what services and teams need to be
created or go away?
Is that a central architect? Excellent question. So we don't have a central
architect or architectural committees. We really try to embrace
culturally this whole concept of loose coupling and individual, or in this
particular case team, freedom and responsibility. So each team is free but
also responsible for making these calls: when new services need to come into
operation, when certain services need to be retired, and how to deal with the
migration strategies.
I think someone is afraid to do the chaos testing in production, so I think
that's why this question came up: would you also recommend destructive testing
in production on more critical applications, for example financial? Again,
it depends what you're trying to optimize for. It's a really good question,
but at a certain point you need to embrace the fact that failure will happen, and if
you don't test it, it is going to happen anyway. What you're gaining
from running the destructive testing, and yes, in production, is your understanding
of whether your technology is ready to deal with failures. But more importantly,
are your people ready to deal with failures? For example, do they
have runbooks, or will they have to scramble? Do you have enough people who
understand your systems? Do you have enough people who know who to call,
who understand the systems? For this readiness, you can do simulations, you can
run the drills, but nothing beats running in production. And if you want to do it
in production but on a copy of that, let's say because you're building a system in a
shadow mode, that's fine too. But you want to get as close to the real thing as
you're comfortable with.
Please don't do it just because Netflix does it. We got comfortable with this over
years. So to do a fire drill you don't really need to put the building on fire,
that's what you're saying, probably. And now a question that has popped up
more than once, so I really have to ask it: where can we get those awesome
service visualizations? So the service visualizations that I used here in the
videos are called Flux and Flow. They haven't been open sourced yet, but I use
the keyword "yet" because our philosophy is that we generally want to open source
anything that's not proprietary or critical to the business, which these pieces
of infrastructure are not. We'd like to share them. At the same time, I can't
commit on behalf of those teams to any particular timeline, because business
priorities always come first.
All I can tell you is that we would love to open source it, would love to share it
with you and actually get
your feedback and contributions. I just don't know how soon or how that's
going to happen.
I guess we have time for one or two more.
What kind of discipline do you ask of your team members?
Well, that's a loaded question. Nothing really in particular, but if you haven't
seen our culture deck, I highly recommend you go see it; it's available on
SlideShare and on our site. If you've seen it, this whole concept of freedom and
responsibility: the type of developers that are hired and become successful
at Netflix really embrace it. And it doesn't mean that these people don't
make mistakes.
It means that they learn from their mistakes. Mistakes are fine; I mean, that's
how we learn.
You just don't want to keep making the same mistake all over again.
OK, last question. Services belong to teams.
How do you handle new functionality that needs changes in many services? Is there
something to manage this smoothly?
Yes, and it's one of those hard situations where there is no magic
bullet. There are cases, especially with security,
let's say you want to respond to a newly discovered threat, or you need a
critical piece of functionality to be spread throughout all the services, where
a campaign, a centralized campaign, is necessary and you need somebody to
coordinate it. Most of the time for us this is basically the team that is
pushing the change, championing the change. They go and cross-functionally
interface with whoever is required and make that change get into production. Because
most of our teams do implement some sort of CI/CD, where even centralized library
changes get pushed on a regular cadence automatically, it really becomes
a question of the long tail. No matter how much you try to embrace
continuous delivery, there will be a subset of applications that are not
pushed as often, and those will require special attention.
OK, that's it. All right, thank you very much, and please make sure that you vote
for the session.