Wednesday, October 05, 2016

NoSQL Distilled to an hour by Martin Fowler

Okay, well, welcome to the conference. The talk I usually give at these
kinds of conferences is generally an overview of NoSQL databases, based a lot
on what I learned and discovered when writing my book NoSQL Distilled. Well,
since this conference is more aimed at NoSQL databases, I want to
get a bit of a sense as to your level of knowledge so that I can shape things a
little bit in the talk. So how many people is this true for: you use at least
one NoSQL database in some kind of real situation? Oh good, a good number
of you. Okay, how many people haven't used any NoSQL databases at all and
came here to discover when they might be useful? So, a small number. Okay.
And some of you who put your hand up already might put your hand up for this:
you've got a sense of the breadth of the different NoSQL databases and
when to use them in different places.
How many would say they have that? Quite a good number of you.
OK, so that makes a little bit of a difference to how I'll go
through my talk. The talk that I normally give on NoSQL databases
assumes the second question, people who aren't that familiar with them, and
obviously in your case you're a more familiar audience, as I'd suspected
minute macbook matter so i'm going to use very much the same structure of the
talk and the same slides.
Well, one of the main things I'm going to do is talk a little bit about why I
choose to explain things the way I do, because I suspect for many people one of
the challenges is getting other people to understand what NoSQL databases are
about. Even if you're familiar with it yourself, you have to talk about this
kind of thing to other people,
and that's where I think I can perhaps be most useful: in talking about at least
how I approach that, with the hope that you might find that same approach useful.
So for those particularly in the last question,
you're not going to learn anything about NoSQL databases from me, but you
probably knew that anyway. But what you hopefully can learn a little bit about
is at least how I think about explaining the concepts. So as with many things like
this, I do like to have batteries in my clicker. I'd like to start by looking at
some of the history, because as with many things, knowing the history of something
can tell you a good bit about where you are with something, and even if you're
familiar with the ground it's good to know where things came from, because that
often explains why they are as they are now. I'm not going to go back too far,
but I will say that for a lot of us much of this comes with the rise of
relational databases. I am just about old enough to remember when relational
databases were the new hot thing, as my career started, and people arguing about
whether relational databases had a future or not. Some of you might remember
that, but I don't see that much gray hair around, so most of you probably don't.
Relational databases provided a lot of very valuable things for us. They
obviously give us a persistence mechanism, they help with integration, and they
have a standard query approach, so you can query all your different SQL
databases the same way. They provide
transactions, which gives you a useful tool to help with concurrency management,
and they have a great deal of good reporting capabilities. They're not
as fantastic as some people liked to mention early on, but they still have
been a very valuable tool. But they do have their difficulties, and one
particular difficulty, something that was very obvious as I got working with
them, was the fact that whenever you're taking data, and you're thinking of a
typical application where you might be showing some data on a user interface,
the structure of the data in the user interface is different to the way in
which you have to structure the data in the database.
Everything has to be partitioned out into individual rows in tables, because
the tabular nature is the essence of what relational databases are about. You're
always dealing with tables, well, relations if you want to use the
mathematical term, but I think of them always as tables, and that
doesn't necessarily fit how you want to work with things in memory.
Whether it's displaying in a UI or even manipulating in memory, you have to
have richer structures. In particular, hierarchical structures
make a lot of sense for working in memory, but you can't represent those
easily in the database, and this is what's been referred to a lot of times
as an impedance mismatch: what you have in memory and what you have in a
database is different. Now this is particularly talked about in terms of
object-oriented systems; we talk about the object-relational mismatch problem.
But it's not really about objects, it's really the difference between in-memory
data structures and the database, because the same thing happens to you if
you're using functional programming: you've got no easy mapping between the maps
and lists of functional programming and the tables of the database. And even if
you're just using regular data structures in C or something of that kind, you
still have the same mismatch problem.
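
As a rough illustration of that mismatch (not something shown in the talk), here
is a minimal Python sketch. The order data, the two-table layout and the helper
names to_rows and from_rows are all hypothetical, chosen only to show one nested
in-memory structure being split into flat rows and stitched back together.

```python
# In-memory shape: one nested structure built from maps and lists.
order = {
    "id": 1001,
    "customer": "Ann",
    "line_items": [
        {"product": "widget", "quantity": 2, "price": 5.0},
        {"product": "gadget", "quantity": 1, "price": 12.5},
    ],
}


def to_rows(order):
    """Flatten one nested order into rows for two hypothetical tables."""
    order_row = (order["id"], order["customer"])
    item_rows = [
        (order["id"], item["product"], item["quantity"], item["price"])
        for item in order["line_items"]
    ]
    return order_row, item_rows


def from_rows(order_row, item_rows):
    """Rebuild the nested structure; this is the work a mapping layer does."""
    order_id, customer = order_row
    return {
        "id": order_id,
        "customer": customer,
        "line_items": [
            {"product": product, "quantity": quantity, "price": price}
            for (owner_id, product, quantity, price) in item_rows
            if owner_id == order_id
        ],
    }


order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
```

Nothing here depends on objects: the same splitting and rejoining is needed
whether the in-memory side is a class, a dictionary or a struct.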
Now this mismatch problem was the reason why a lot of people felt at a certain
point in time that object databases would be a big thing. Many people, including
myself, felt that object databases would really replace relational.
Well, we know how that turned out. As it happens, the relational databases
still remained dominant. Object databases didn't die out completely, but they
were a tiny, tiny niche and you very, very rarely come across them in practice,
which is sad; I had some good experiences with object databases, but
that was how it worked out. There are many reasons that have been given as to
why it was that object databases,
despite the fact that they didn't suffer from this impedance mismatch problem,
despite the fact that they gave us many useful things that we wanted,
I mean they handled persistence and transactions and things of that kind
just as well as relational databases, why did they not succeed as much as we'd
hoped?
And my hypothesis for this is that it really comes down to the integration
role that relational databases play in many organizations. Many organizations
integrate between different applications by sharing the database, and then
different applications with different SQL go and talk to the same tables.
How many of you work in organizations that do this, that integrate between
applications through a relational database? That's quite familiar to everyone,
and that's where object-oriented databases broke down. They couldn't play in that
kind of space, because the whole point was to not use the
simple relational tabular model, so as a result it just didn't fit. And this is
an important question then for anything else that
wants to take on the role of replacing SQL databases in many organizations:
how are you going to get round this integration problem? I think there is a
route for NoSQL to go that way, and I'll come to that a bit later on, but
this I think is a very central question for anyone thinking of using a
non-relational technology. So relational domination continued. I tell this piece
of history because it helps set an important piece of context that I'll be
using later on. The impedance mismatch problem is, I think, quite a serious one,
and it affects a lot of what we want to do with NoSQL databases, and the fact
is that it wasn't enough to break relational databases because of the role
of integration.
This sets up, I think, an important part of the forces that are in
play before we begin to look at the NoSQL world. But then the question comes:
why has NoSQL got interesting at all? Why did people not look at it and say, oh,
that's just like object databases, I'm not even going to go near there? And the
reason, I think, that we're now beginning to ask about some alternative
to SQL is entirely due to one effect, which is the internet bringing in a
great deal of traffic to certain websites around the world, and I'm
thinking here particularly of the early big ones:
the Googles, the eBays, the Amazons and the like.
Now when you're faced with a lot of traffic,
you've got to grow to support that traffic. So one possibility is to just
build a bigger computer, but the problem is that, for a start, it's expensive,
and secondly, there's only so big you can really grow.
So what everybody did was say, well, instead of having one big computer, let's
have lots and lots and lots of computers. Now at this point we run into the
problem with SQL databases, because SQL databases have really been designed with
the one-computer
notion in mind. They have not been designed with lots and lots of databases
in mind. That doesn't mean that people don't use SQL with lots and lots of
databases. I've talked with people in big organizations,
including at eBay, where they have huge
farms and they're using relational databases, and the common thing that I
hear whenever I hear someone talk about how they've dealt with that is unnatural
acts. They've had to bend SQL databases in so many different
directions that they've lost most of the benefits of using relational databases
in the first place, and that's because the mental model of relational databases
just doesn't fit this distributed world. So two organizations decided to do
something about that.
So we have Google and Amazon, and they produced their own data storage
approaches that work really quite differently to relational databases. Now
these were, at least originally, very closed, very much proprietary systems,
which only really got known about sort of through back channels of conversation
and then through some moderately revealing papers from the two
organizations. But that created the whole burst of interest that leads to
NoSQL. So again, stepping back and saying how I'm explaining this:
I'm explaining what is the crucial force that makes NoSQL different from
object-oriented databases.
It's the high traffic, the need to run on lots and lots of nodes. Now there's an
irony there, because not all NoSQL is about large traffic; in fact I would argue
that most of the action isn't. But that was the force that made people say we
need something other than relational. So the next thing is: where does this
NoSQL word come from? Because one of the things that hopefully everybody has
cottoned on to by now is that NoSQL is a completely arbitrary term. It doesn't
really have any meaning.
And a lot of people don't realize this, so I explain the background. It
really comes from Johan Oskarsson, who was working in London, had been using
Hadoop, and was interested in some of these Bigtable-ish and Dynamo-style
databases. There was some activity going on in San Francisco; he knew a bit
about it but didn't know much, and wanted to find out more:
let's organize a little meeting to get people together and we can exchange
ideas. And of course if you're going to do that in the modern age, what's the
most important thing you have to have in order to make this happen?
The Twitter hashtag. That's it, that's where NoSQL comes from: it's just a
Twitter hashtag for a meeting. There was no clever person who sat down and said
we need a new category of databases and I'm going to define them with these
characteristics or anything like that.
Nothing like that at all, just: oh, here's a hashtag for a meeting. And then
people latched onto it.
There was a bunch of database people who showed up at that meeting, but then a
bunch of people cottoned on and grabbed that NoSQL hashtag and said, well,
that's a good phrase, let's run with it. So it's a completely accidental piece of
terminology, and the fact that we're here at a NoSQL conference is really in a way
rather silly, because we're not talking about anything that's well defined. So
I would like to have a definition of NoSQL if I'm going to
talk about it; if I'm going to write a book about it, I like to define what I'm
talking about. But I think it's impossible to come up with a definition.
All one can do is say: I can observe some common characteristics in the world
that I think are important, the common things that the
databases that are referred to as NoSQL databases tend to have. Notice the very
backwards way I'm approaching this: I'm saying I'm looking out in the world,
noticing that certain databases tend to be called NoSQL databases, either by
themselves or by people working with them, and I ask what are the common features
of these databases, and that gives me a set of characteristics. And these are the
ones I like to pick. Fairly trivially, they're not relational. Notice the
notion is not whether or not they use SQL, it's that they're not relational. The
very first usage of the term NoSQL was to describe a relational database whose
query language wasn't SQL but instead used Unix pipes and filters, totally
unconnected with the NoSQL world that has come since, and many years before it.
Another
feature is that they tend to be open source. Now Bigtable and Dynamo weren't,
and there are certainly databases out there that call themselves NoSQL
that aren't open source, but most of the ones that people talk about are open
source, and I think this is a very interesting part of the whole thing, the
fact that what we see is very much an open-source-driven movement. Cluster
friendliness, the ability to run on large clusters, is another common
characteristic. Many of the NoSQL databases do this very well, but one
particular category, the graph databases, don't tend to do that particularly well.
That's not necessarily a bad thing, that's just a different thing. I define
them as being part of the 21st-century web, and that's partly a timing thing. We
could look at really old data storage technologies like ISAM
and MUMPS that were around before I began, some of these old data storage
technologies that relational replaced.
Could we call them NoSQL? You know, you can't query an ISAM data store with
SQL, so does that mean it's NoSQL?
Well, no, because NoSQL is a term that really goes with what was happening
in the early 2000s, so it really comes out of that period. And then the last
point is they're pretty much all schemaless: they don't have a set schema that
we have to fit within in the data, and that's
a whole interesting characteristic of its own. So here, when explaining
NoSQL databases to people, I really have to stress the fact that there is no
strict definition. It's an accidental term that people kind of latched onto, and
it leads to this very, very mixed set of things, nothing very consistent here and
nothing terribly logical about it. You know, if some student came up with this
they would get an F, right? But we've got it and we're stuck with it and there's
nothing we can do about it.
One could also think about the ties into the whole big data world: Hadoop, data
analytics and things of that kind. I don't tend to go there particularly when I'm
focusing on NoSQL, although there is obviously a big synergy between the two
areas; we see it at this conference and a lot of other conferences. That's
partly because NoSQL databases geared to handle very large amounts
of data are obviously a good fit for many of these kinds of problems, but I think
it's more to do with the fact that when we're trying to analyze complex data in
interesting ways, the relational model is a good tool some of the time but not
the rest of the time, and therefore we need non-SQL approaches to help us in
those other times. And that rather logically moves us to talk about the
data model. This is one of the natural ways of thinking about what's different
in the NoSQL databases: well, we've got different data models. And again, we
haven't really got one data model, we have several of them.
So I bring up the screen with what I would say are probably the most often
talked about NoSQL databases; not a comprehensive list, but just the ones
that at least struck my consciousness as the most commonly talked about. And
they're often divided up in terms of different data models; as we'll see, these
divisions can be somewhat arbitrary. So one of these data models is
key-value. It's fairly straightforward: I have somewhere I put data, I access it
by a key, I look up a value. It's fairly straightforward if you've used DBM or
things like that, a very old type that's been around a long time. Another
category is document databases, where you have some kind of structured data
form. I always find it odd that these are called document databases, because
they're not like any document that I'm used to. They don't look like Excel or
Word or a page on a website or something. They don't have
text in them. I'm used to documents being a
mix of text and data, but no, these are just purely structured data, but we call
them documents for whatever reason. An interesting point about documents is
that they don't have a fixed schema, so you can put any kind of data you like in.
This is obviously always true of a key-value database, because as far as
a key-value database is concerned the value is a completely opaque blob
of data, so obviously it can't have a schema because it's just a big hunk of
data. But even the document databases, even though they've got a structure, still
have no schema to them. But as I always like to point out when we're talking
about schemalessness, and I could spend a lot more time on this, and do in
different talks, but for this one I've had to pare it down to a minimum:
just because there isn't a schema in the database doesn't mean there isn't a
schema in your application. If you're writing code like this, you are assuming
that I have somewhere a field called price and a field called quantity,
otherwise things are going to break. What's happening is I have an implicit
schema. So it's not really schemaless; in a way, your schema is implicit, and
this is actually most of the time a bad thing, because it means if you want to
figure out how to manipulate your database,
you've got to figure out what the schema is. So you're either going to
derive it by looking at the data and trying to think: what do they call it?
Do they call it quantity, do they call it qty?
Or you've got to ferret around in the code to find out where the code's
manipulating the data, so you can see what the implicit schema is. This is not a
good thing.
Now it has some benefits, and there are some very valuable benefits about being
schemaless, but there is definitely a curse that goes with it, and it's something
that tends to be underplayed a lot by people who talk about schema-free
databases. There is definitely a downside.
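
The slide being referred to is not in the transcript, so the following is only
a guess at the kind of code meant: a small Python sketch in which the field
names price and quantity come from the talk, while the record layout and the
helper names order_total and missing_fields are hypothetical.

```python
def order_total(order):
    # Implicit schema: this line assumes every line item carries
    # a "price" field and a "quantity" field.
    return sum(item["price"] * item["quantity"] for item in order["line_items"])


# A record written the way the code above expects.
good = {"line_items": [{"price": 5.0, "quantity": 2}]}

# The same data written by someone who called the field "qty" instead.
# The database would accept it happily; order_total(bad) raises KeyError.
bad = {"line_items": [{"price": 5.0, "qty": 2}]}


def missing_fields(order, required=("price", "quantity")):
    """Make the implicit schema explicit: list items lacking expected fields."""
    return [
        (index, field)
        for index, item in enumerate(order["line_items"])
        for field in required
        if field not in item
    ]
```

A check like missing_fields is one way of writing the schema down in the
application, since the database will not do it.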
You've still got to pay attention to what your schema is. You've still got to
figure out how to make sure that you understand it and allow people to
see how to work with it. And it also makes it an issue in terms of
data migrating over time. I hear people talk about how data migration and
upgrading schemas is easy with NoSQL databases because there's no
schema. Well, actually you've got one or two extra tools in your toolkit that
help a little bit, but most of the problems are still the same. At some point
you're still going to have to do the same kind of data migration things on a
schemaless database that you do with a schema-rich database. And
you know, that kind of stuff tends not to hit people until later on in their
project, but it's a common refrain I hear from people who have been, you know, a
year, year and a half into the project: we thought schemaless was going to mean
we didn't have to do these things, but we did.
So be wary of the schemaless stuff and don't overstress it. So I've talked so far
about two of the four data models, document and key-value, and I've kind of
diverted a little bit to talk about the implicit schema, but now I'm going to go
back to looking at this key-value and document thing, and I'll
immediately point out that this boundary between the two is actually quite a
blurry boundary. In fact you'll hear people describe different databases as
being document or key-value depending on who the speaker is. Some people might
say, oh, Riak is a key-value database, and somebody else says, no, Riak is a
document database, and you go: huh, what's going on, how can they be classified
so arbitrarily? And it's because this boundary line is very blurry. So key-value
databases, for instance, might have metadata that you can attach to the values,
which kind of makes them begin to look a bit like a document, because they're
having all these extra fields of metadata placed on them as well; that's
kind of document-ish.
Similarly, with a document database there's no reason why you can't have some
special field that acts as an ID or a key into that document, and then you find
that most people, when they're accessing the document database, are
actually using the key.
They're not actually using the rest of the document, so they're treating it like
a key-value database even though it's a document database. So things get very
blurry between the two.
I actually don't think the distinction between key-value and document database
is terribly useful. It's a kind of hint as to what to expect, but it's not
necessarily a big factor.
What's interesting is what the two have in common, and the term that I use, and
I'm trying to encourage other people to use, but I certainly use, is
aggregate-oriented database, which is a little bit of a strange term perhaps
if you're not familiar with it. What do I mean by aggregate? Well, I'm taking the
term from Eric Evans, who wrote the book Domain-Driven Design. How many
people have come across Domain-Driven Design? A very good book, not an easy book,
but a very good book. And in Domain-Driven Design, he wrote this book in the
context of: you're building an object-oriented system and you're using relational
databases, so this is long before NoSQL. And he noticed that when you're
operating in that world, you often like to not deal with individual objects but
with clusters of objects that are related together. So if you
want to pull information about an order
from a database, you typically want all the line items on the order as well, you
know, two of this, five of that, six of the other. You pull up the whole aggregate
together. So typically when you're interacting with a database, you don't
want to think about it on an individual object or individual row level; you want
to think about it instead at the aggregate level. And in particular, when
you're using transactions, it's often not a good idea to allow those transactions
to go across these aggregates. So he built up this whole approach of saying
we think in terms of aggregates. Now these aggregates, if you look
at them from his work and you bring them over to the NoSQL world, well, these are
really the values of the key-value database and the documents of the document
database. We've got some kind of hierarchically structured, fairly richly
structured clump of data that you can represent with maps and lists in some kind
of way, and you deal with it by taking a whole aggregate from the store into
memory, or updating it when you push it back. So you think at the aggregate
level when you're talking with your data storage. So it matches very well with the
Domain-Driven Design notion of aggregates, because you save the whole
document or you save the whole value and pull it back. And I think this is the
really interesting common characteristic: the fact that you operate at an
aggregate level rather than at an individual row or table level.
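
To make "operating at the aggregate level" concrete, here is a minimal Python
sketch. AggregateStore is a hypothetical in-memory stand-in written for this
illustration, not the API of any real key-value or document database; the key
format and the order contents are likewise made up.

```python
import json


class AggregateStore:
    """Toy store: each value is one whole aggregate, kept as an opaque blob."""

    def __init__(self):
        self._blobs = {}

    def save(self, key, aggregate):
        # The whole aggregate goes in as a single unit.
        self._blobs[key] = json.dumps(aggregate)

    def load(self, key):
        # The whole aggregate comes back as a single unit.
        return json.loads(self._blobs[key])


store = AggregateStore()
store.save("order:1001", {
    "customer": "Ann",
    "line_items": [
        {"product": "widget", "quantity": 2},
        {"product": "gadget", "quantity": 5},
    ],
})

# Pull the whole order into memory, change it, push the whole order back.
order = store.load("order:1001")
order["line_items"].append({"product": "sprocket", "quantity": 6})
store.save("order:1001", order)
```

There is no row-level or field-level access in this sketch on purpose: the
unit of reading and writing is the order together with its line items.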
And for many applications this is a very natural match. I was talking with
folks at the Guardian newspaper, who have started to use Mongo in a lot of cases
in their work, and they said, well, the document model was a natural match for
us, because the article makes a natural aggregate unit. Instead of having to
worry about how to spread it across all the different database tables, we just
save the article and pull the article back. So in many situations an aggregate
makes a natural choice. If you're using a column-family database, you have a
similar aggregate structure, but now the aggregate is identified by a combination
of the row key and the column family name. So it's a bit more complex, because
column-family databases are a more complex model, but the basic idea of an
aggregate still holds: you still have this unit, larger than a row or an object,
that you're pushing back and forth. And when I talk about NoSQL databases, that's
all I say about column-family databases, because they're quite complicated to
explain, and what I really want people to understand is this
notion of pulling aggregates back and forth. Whether it's a column-family, a
key-value or a document database, that's an important detail if you're actually
sitting working with them, but if you're just trying to get a basic understanding
of what these databases are about, it's the aggregate message that's really
important. And then the thing is that instead of saving into tables, we can
save the whole aggregate onto disk, and this is really valuable when it comes to
distribution, and this is why it ties in with the whole idea of running on
clusters. If you want to run things on clusters, you want to make sure that all
of the aggregate is on the same node. You don't want to be doing a whole bunch of
separate remote calls to gather information for one aggregate. And that's
one of the problems of SQL: it's hard to get everything in terms of one
SQL query; you usually need a few. But the great thing about these things is you
can have an
aggregate, and the database knows about the aggregation. A SQL database
doesn't know that these orders and these line items are connected together in
this way.
It kind of knows, through the referential links and all the rest of it,
but it doesn't really know; it doesn't know how to keep everything together. But
with an aggregate-oriented database the aggregate is very clear, and so as a
result it can be managed across the network. You can put these aggregates on
this machine, those aggregates on that machine, and you know that people are
always going to ask for one aggregate, and therefore you go and find it and bring
it back.
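
One simple way to picture "these aggregates on this machine, those on that
machine" is to pick a node from the aggregate's key. The Python sketch below
assumes plain hash-modulo placement and made-up node names; real databases use
more careful partitioning schemes, so treat this purely as an illustration.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machine names


def node_for(key, nodes=NODES):
    """Choose one node for a key, so a whole aggregate lives in one place."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Each node is modelled as a plain dictionary of key -> aggregate.
cluster = {name: {} for name in NODES}


def put(key, aggregate):
    cluster[node_for(key)][key] = aggregate


def get(key):
    # One lookup on one node returns the order and all of its line items.
    return cluster[node_for(key)][key]


put("order:1001", {"customer": "Ann", "line_items": [{"product": "widget"}]})
put("order:1002", {"customer": "Bob", "line_items": [{"product": "gadget"}]})
found = get("order:1001")
```

Because the key alone decides the node, fetching an aggregate never needs a
second remote call to gather its parts.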
Managing across the cluster is now much, much easier, and that's the heart of the
cluster friendliness: it's the role of the aggregate. But there's a downside.
The downside comes if we think about the order example, when somebody wants to
see a report like this, that says give me the revenue for the products over
different times. And what's happening here is, when you're producing this report,
you no longer want to see an order being an aggregate of line items. The
aggregation shifts and becomes somewhere else. If you've got an aggregate-oriented
database, now your aggregates are completely the wrong way round for what you
want. You're stuffed, is the technical term.
Well, actually you're not stuffed, but you have to start using MapReduce
algorithms, which is effectively the same thing as being stuffed.
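
As a sketch of what that reshuffling involves, here is a single-process Python
imitation of a map step and a reduce step that turns order aggregates into
revenue per product. The sample orders and the function names are hypothetical,
and the time dimension of the report is left out to keep it short; a real
MapReduce job would run these steps across many nodes.

```python
from collections import defaultdict

# Aggregates as stored: each order owns its line items.
orders = [
    {"id": 1, "line_items": [
        {"product": "widget", "price": 5.0, "quantity": 2},
    ]},
    {"id": 2, "line_items": [
        {"product": "widget", "price": 5.0, "quantity": 1},
        {"product": "gadget", "price": 12.5, "quantity": 1},
    ]},
]


def map_order(order):
    """Map step: emit (product, revenue) pairs from one order aggregate."""
    for item in order["line_items"]:
        yield item["product"], item["price"] * item["quantity"]


def reduce_revenue(pairs):
    """Reduce step: regroup the pairs by product and sum the revenue."""
    totals = defaultdict(float)
    for product, revenue in pairs:
        totals[product] += revenue
    return dict(totals)


revenue = reduce_revenue(pair for order in orders for pair in map_order(order))
```

The map step can run next to each aggregate; the cost is in the regrouping,
because the product totals cut across every order.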
But actually that's where MapReduce comes into play, right, because now you have
to reshuffle your aggregates to satisfy somebody else. Now it's a price
worth paying, because if you're dealing with a large enough
amount of data, you can't hold it all in one nice relational structure, not
without those unnatural acts I talked about early on. So you're going to have to
bend somehow, and bending this way is probably better than trying to force the
relational approach. But the really vital message here is: aggregate-oriented
databases work best when you have a clear aggregate that you're
manipulating all the time. As soon as you want to look at some data
in something other than that aggregate, then you're going to have to pay a cost
for the simplicity of the usual aggregate back-and-forth. That's aggregate-oriented
databases, and that's a large chunk of the NoSQL category. But there is one
category that looks a bit different, and that is the graph databases, which are
nothing like aggregate databases whatsoever. In fact I had a couple of
reviewers, when I was writing the NoSQL book, say: why are you talking about
graph databases at all? They're nothing like any of the others; you shouldn't be
calling those NoSQL databases. But you know, whenever you go to a NoSQL
database conference,
there's always some graph database people here, aren't there?
I know you're there, I can see you somewhere, maybe hiding. But you're saying no,
and therefore we go with NoSQL.
Yeah, that's right, and it's because it is a totally arbitrary thing. Now if I had
been coming up with definitions, I would have had aggregate-oriented databases,
and everything would be clear and straightforward, but nobody asked me.
So we get graph databases, and that's a whole different animal in many ways.
Aggregates are about taking all that relational data and building it
into these bigger lumps, managing the clumps, getting stuff together.
Graphs go the other way. They look at relational databases and they say: let's
split those big honking relational rows into even smaller pieces, and let's focus
on lots of relationships between little things. And so you get these complicated
graph structures, and you say, well, if I want to look at a graph like this, I
want to query in terms of the graph. And so they come up with graph query
languages, which are very much dependent on having graphs, and they design the
data storage to be able to navigate through graphs easily. As we know,
people often have the misconception of thinking that relational
databases are about making relations between different pieces of data, but when
you actually want to form relationships with foreign keys,
relational databases struggle, and the more joins you have, the slower your
queries get. It's not as bad as it was when I was younger, when the rule of thumb
was never have more than three joins in a query, because if you did that you knew
that DB2 would kind of grind to a halt.
But it's still an effort to put a lot of joins in. And the graph database people
say: hey, we do queries with hundreds of joins, we don't care, we're join-happy,
join-friendly people.
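
The contrast with joins can be sketched in a few lines of Python. The tiny
friends graph and the function within_hops are made up for this illustration
and do not represent any graph database's query language; the point is only
that following relationships hop by hop is the natural operation.

```python
from collections import deque

# Small things with lots of relationships: who is linked to whom.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
    "eve": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first walk; in SQL each extra hop would be another self-join."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the requested distance
        for neighbour in graph[node]:
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return {name: distance for name, distance in hops.items() if name != start}


nearby = within_hops(friends, "ann", 2)
```

Raising max_hops costs nothing in query complexity here, whereas the
relational equivalent grows by one join per hop.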
And so they do all this stuff. This is, by the way, of course a reason why
there never will be a standard query language for NoSQL databases, because
they're just too different, right? A graph database, you can't query in the same
way you're going to query an aggregate-oriented database. And you could argue that
aggregate-oriented databases don't really need much of a query language
anyway, because the most important thing is getting by the key, and if you
haven't got a key, some metadata-oriented kind of query. You're not doing the big
complicated things that SQL wants to do. So
a query language is less important for aggregate-oriented databases. It's very
important for a graph database; I'd argue that the real strength of
a graph database is what you can do with its query language. But its query
language is always going to be different to a relational or an aggregate-oriented
system. And whatever you do, you're schemaless, because everybody likes
schemaless these days, but I've already ranted about schemalessness, so that's
enough of that.
So that's the first part of how I like to explain NoSQL databases to people:
focusing on the data models and why the data models are different, and the fact
that they are quite different. Now for the second thing I like to talk about,
I like to go into talking about consistency, and I always start off by
bringing up this, which is the common way consistency is described in the NoSQL
world: NoSQL, we're not ACID, we're BASE. BASE is an acronym that is so bad
it makes ACID look meaningful.
What I want to point out is that thinking along these lines is just
not a good idea.
You don't go here. So I start by saying the whole point of relational
databases is you're taking what could have been treated as a simple logical
lump of data, the aggregate, and you're splitting it across lots of rows. And we
have the aggregate-oriented databases etc., and they have advantages because of
the fact that you store the whole aggregate in itself. Obviously graph databases
are different. Graph databases absolutely need to be ACID, so this is the first
reason why NoSQL equals BASE is a misnomer, because the graph database
people, they will do ACID transactions. They have to, because they're breaking
things into little stuff. Whenever you break larger things into little things,
you need this whole kind of ACID transaction thing. So really, when we're
talking about consistency, we kind of tell the graph boys: OK, we're not going to
talk about you,
you're boring, you're the same as everybody else. We'll focus on the
different stuff, which is the aggregate-oriented world. So I made a comment
earlier on the aggregates a transaction bound lineup with transaction boundaries
in the main driven design and here is again a very nice synergy between the
two the aggregate is really day essence of what we're talking about in terms of
transactions and a very important point that i like to stress the people is if
you make atomic updates in aggregate oriented database you have them but you
only have done within one aggregate so what you can't do is say I've got an
order over here in order over there and I want to manipulate both of them and
update them within a single transaction because I'm touching to aggregate I
can't do that i have to only touch one aggregate in order to get an anatomic
update now that's of course less of an issue in many ways because of the fact
that the aggregates are actually structured in themselves but he
obviously is a bit of an issue
But what a lot of people forget is that ACID doesn't help them that much even in
traditional databases; they're just so used to it that they don't
think about it. So the example I like to give is: we've got a typical setup where
we've got somebody using a browser, it's touching a server that's touching a
database, and I've got two users, and they're both accessing the same data. So
let's say, OK, we both get the same order, for instance. Then one of them is
going to update it, then the other person's going to update it, and what
we've got is a potential problem, because you've got lost updates or whatever.
Now transactions don't really solve this problem. I mean, they can, if you open up
the transaction for the entire interaction. How many people would do this in a real
production system with moderate volumes? What a shock,
nobody does, because of course you don't want to have long-running transactions
open while people are going off to lunch.
Not a good idea.
So what we do is we only have the transaction at the point of update. But
of course that leads to exactly the same problem that we'd have by having no
transactions: we can still get somebody overwriting somebody else's update and
still get a lost update. So how do we deal with this? We deal with it all the time.
We use what I call an offline lock, and the basic way of doing that is we get a
version stamp on whatever it is that we're dealing with, and we make sure that
when we update something we supply the version stamp of what we read. That then
increments the version stamp, and when the next person does an update and they
attempt to post, we see this version mismatch, look at the ETags or whatever
it is, whatever mechanism we're using, and we say: oops, we've got some kind of
conflict here, and we deal with it. We actually handle it typically within our
applications, maybe in some kind of framework, an object-relational mapping
library or something, but it's not done by the database, because we can't open the
database transactions that long. And typically a good way of doing it
is again to think in terms of an aggregate as the transaction boundary. So really
in practice there's much less difference in terms of consistency than you might
think. The database can help us to a certain degree: it will enforce that
updating an aggregate is done in an atomic way. That's good. But we have to
manage cross-aggregate things ourselves, and the same technique of
using offline locks is every bit as useful with these databases as it is
when working with a relational database. So the difference in practice doesn't end
up being that much. Transactions are a useful tool, but they don't solve all the
concurrency problems that we have to deal with. So that switch is not
such a big deal as many people think.
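
The version-stamp check described above can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not code from the talk:
AggregateStore and VersionConflict are hypothetical names, the store is an
in-memory dict, and a real database would need to perform the compare-and-write
step atomically.

```python
import copy


class VersionConflict(Exception):
    """Raised when the stored version is not the one the writer read."""


class AggregateStore:
    """Toy store: each key maps to (version, aggregate)."""

    def __init__(self):
        self._data = {}

    def read(self, key):
        # Return the version stamp together with a private copy of the aggregate.
        version, aggregate = self._data.get(key, (0, None))
        return version, copy.deepcopy(aggregate)

    def write(self, key, aggregate, expected_version):
        # Accept the write only if nobody has updated the aggregate since we read it.
        current_version, _ = self._data.get(key, (0, None))
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: read version {expected_version}, store has {current_version}"
            )
        new_version = current_version + 1
        self._data[key] = (new_version, copy.deepcopy(aggregate))
        return new_version
```

If two users both read an order at version 1, the first write moves it to version 2
and the second write, still carrying version 1, raises VersionConflict instead of
silently overwriting. What happens next (retry, merge, ask the user) is left to the
application, as in the talk.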
That's only one form of consistency, though. I like to divide consistency into two
forms: logical consistency, which is what I've just talked about, which can happen
on a single node; if I've only got one computer, I get logical consistency
issues as soon as I get more than one user of that computer. But we have a
different form of consistency problem now,
called replication consistency. The whole beauty of running on a cluster is
that I can send data off to lots of different nodes on the cluster, and if I
can do that, often it's useful to replicate data so that I don't have to
talk to that machine over there if I can get to this machine over here much
quicker. Replication: good thing, we like it.
Nice, fast, rapid response times. But it introduces a whole different
family of consistency problems, and this is unconnected really with ACID, because
this is what happens when you replicate data. It's going to hit you
whatever kind of database technology you're using. So again, I always like to
illustrate problems with examples, so here's an example for this case. So
imagine that me and my co-author Pramod, and he was the one who actually
knows about databases, we're both trying to book a room, going to book the
last room in a hotel, and he's in India, I'm in America, and we're tapping away.
We're talking to our local nodes, which are of course in different countries. So we
both send in our request for a reservation, and the two local nodes then
will communicate and do whatever they need to do to resolve the
problem: OK, we have to sort this out.
So say Pramod gets the room. So now, what happens if I take this connection line
and I break it? Now there's some problem in the internet, India is cut off
from the US, whatever it is, there's a breakage. Now what happens when we both
try to do these things? Well, we actually have two alternatives, two
different things that we can choose to do. One thing is to say: we've got no
connection, so
we can't do anything. We can't guarantee that we don't double-book rooms, so
we're not doing anything at all. No rooms are going to be booked.
That's one option. The second option is to say: yeah, we'll take your room
bookings, no problem, and then you get a double-booked room and you've got to
figure out how to sort it out. And this is consistency,
this is availability, and this is the essence of what it's about.
And the most
important thing to know is you've got to choose between one or the other.
You can't have both. If your network connection goes down you have to decide
which strategy you can operate with. But more important than that is: it's not for
you as a programmer to decide which it is that you're going to go with. It's
actually a business decision.
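
The two options can be written down as an explicit policy that the business
picks, rather than something buried in the code. A minimal Python sketch, with
everything in it assumed for illustration: PartitionPolicy, BookingNode and the
returned status strings are hypothetical, and the normal-case coordination
between nodes is not modelled.

```python
from enum import Enum


class PartitionPolicy(Enum):
    REFUSE = "refuse"  # favour consistency: no bookings while cut off
    ACCEPT = "accept"  # favour availability: book now, reconcile later


class BookingNode:
    def __init__(self, policy):
        self.policy = policy
        self.peer_reachable = True
        self.to_reconcile = []  # bookings taken while partitioned

    def book(self, guest):
        if self.peer_reachable:
            # Normal case: the nodes would coordinate here (not modelled).
            return "confirmed"
        if self.policy is PartitionPolicy.REFUSE:
            # Consistent but unavailable: no risk of a double booking.
            return "unavailable"
        # Available but possibly inconsistent: someone sorts it out later.
        self.to_reconcile.append(guest)
        return "accepted-pending"
```

Switching a hotel from REFUSE to ACCEPT is then a one-line configuration change,
which is the kind of decision that belongs with whoever knows how the business
already copes with double bookings.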
You can't as a programmer say: if the internet connection goes down,
we're completely stopping all our hotel bookings. A lot of business people will
say: the network's going down all the blasted time, we can't do that, we've got to
run a business here.
Actually, I'll tell you, we've double-booked hotel rooms for years. We have
ways of doing it. We have special buffers in the hotel, sets of certain rooms
that we don't book, some people don't turn up, we manage things that way. I mean,
we've done this for decades. Don't you computer people tell me that I suddenly
have to stop making money just because of that. I mean, the canonical case
of course of this is Amazon. They wanted to make sure you can always put stuff in
your shopping cart, because what is the most important activity in America?
Nothing must stop you shopping. So they go ahead and do that. Now there are cases
when you do need the consistency. There are cases when you want to avoid double
booking. But it depends on the business. And by the way, if anybody tells you we
must be absolutely consistent in financial matters, because no financial
institution would allow you to be inconsistent, they've clearly never
worked in the banking sector for real, because, by the way, when you transfer
money from one bank to another they're not using two-phase commit to do that.
They've lived with all of these problems for decades. So the consistency-
availability choice is a business choice, based on what you're going to do when
your communications go down. And this is the problem I have with the CAP theorem.
The CAP theorem is often stated as: we have these three properties, you've got to
pick two of them because you can't have all three. And I think that's really
thinking about the problem the wrong way around.
The way I think about the problem is saying: if you have a system that can
partition, that can have breaks, then you have to choose between consistency and
availability. It's not an issue if you've got a single-node system, right? You're
not worried about whether one bus is going to talk to another. I mean,
it can happen, but it's not on your list of priorities. But if you're
talking across a network, particularly a wide-area network, then you begin to
worry about these things, and then you have to say: what am I going to do when I
partition? I'm going to shut everything down, or I'm going to keep running,
accepting inconsistencies, and deal with them later. You have to choose. And by the
way, it's not a binary choice.
There's a whole gradation of possibilities: you can allow a certain
amount of consistency, a certain amount of availability. When people
start talking about dealing with quorums and things of that kind,
that's all about making a choice in between the two. But there's always some
degree of choice between consistency and availability. Well, actually, most of the
time it
isn't really about availability at all. I mean, at the limit it is, but a lot of the
time it's actually more an issue about response time. Because if I want to book
my room and I'm talking to the American server, it's going to be slow if it has to
talk to the Indian server and the Brazilian server and the Nigerian server
and all the other servers scattered around the world. That slows it down.
Yes, it can make sure everything is consistent, but now I've got a slower
response, and that's a problem, because we know that if people have to wait for a
response they go to somebody else. So often what you're doing is trading off
consistency versus response time, saying: OK, how much am I prepared to deal with
here?
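
The gradation via quorums, and the response-time cost that comes with it, can be
illustrated with a small Python sketch. It is an illustration under assumptions,
not something shown in the talk: the function names are hypothetical, the
latencies are made-up numbers, and the rule that a read quorum r and a write
quorum w overlap on n replicas when w + r > n is the commonly cited quorum
condition rather than any particular product's behaviour.

```python
def reads_see_latest_write(n, w, r):
    """With n replicas, read and write quorums overlap when w + r > n."""
    return w + r > n


def can_write(reachable, w):
    """A write succeeds only if at least w replicas can acknowledge it."""
    return reachable >= w


def write_latency_ms(replica_latencies_ms, w):
    """Replicas are contacted in parallel; we wait for the w-th fastest ack."""
    return sorted(replica_latencies_ms)[w - 1]


# Hypothetical round-trip times from a US node: local, Brazil, Nigeria, India.
latencies = [5, 120, 180, 250]
```

With these numbers, waiting for one acknowledgement takes 5 ms, three take 180 ms
and all four take 250 ms, and a node cut off from its peers can still write when
w is 1 but not when w is 2 or more. Raising w buys consistency and pays for it in
response time and availability.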
And this is really the age-old thing about dealing with the trade-offs
between safety and liveness, which is a classic problem of concurrent
programming. The key thing is to remember that when you have replicated
data, when you have any kind of distributed system, you have this issue
about how to deal with what happens when things break in the network, and you just
have to cope with that. OK, I've only got 10 minutes on
my clock. That's cool. So there's a lot more things I can talk about; this is
only the beginning of discussing issues around consistency and how things have
to change.
I don't have time to go into the other stuff, so I'll have to skip over that.
What I'm trying to do when I do this bit of the talk is I'm trying to bring
out the fact that primarily it's not the ACID/BASE
thing, it's not about some big difference in that sense. Really, when
you're talking about,
as I said, logical consistency, it's really not that
different whether you've got an ACID relational system or an aggregate-oriented one.
I mean, the boundaries are slightly different, but you really pretty much
have to do the same things.
The really more complicated stuff is to do with replication consistency, and that
occurs everywhere. I mean, what is a cache but replicated stuff? All the
consistency issues I just talked about are the same with handling caches, and if
anybody out there doesn't think that they're now having to deal with
replication consistency problems on a website that gets any traffic at all,
they're either doing something really weird or just ignoring all the caches that
are flying all around the internet. We have to worry about this stuff. And of course
cache invalidation is one of the two hard problems in computer science, so it
isn't easy, but we have to deal with it. So we have to think about a whole
different set of consistency issues, and we have to think about them in business
terms. I really want to stress that point: it's a business decision where you draw
your line between consistency and availability.
It's not a technical decision. So those are the two main things that
people need to know about: how to think about consistency differently, and
what are the different data models. At that point you can begin to address the
question of, well, when and why should people consider using a NoSQL system?
Well, obviously, when you want to look at NoSQL, you've got to
say what are the things that drive you towards it, and I think there are two
main drivers. The one that gets a lot of the attention is the one that caused this
whole thing to get interesting in the first place:
the fact that you've got large amounts of data, and that obviously makes you
head towards that direction, and usually towards the aggregate-oriented, running-
on-a-cluster kind of things. That may have been the force that kind of broke
relational's dominance, allowed us to say there are circumstances when we can't use
a relational database. That was the crack. But once you've opened that question up,
once you've said, oh, I don't have to use relational for everything, then other
things begin to slip through the crack, right? So again, the graph
databases are a big example of this. They don't necessarily deal with huge amounts
of data any better, in sheer volume,
than relational databases do, but once you've asked the question "is relational
the right thing for your problem?" and you begin to realize "oh, I've got lots of
interconnected data", the crack is opening up and the graph databases go right
the way through it, because they can see there's an advantage here. And this leads
to the main reason for using NoSQL: that it makes your development easier. And
actually, for most of the places I've talked to that are using NoSQL, it's the
easier development that is driving it rather than the large amounts of data.
So if I look at where we've seen aggregate-oriented databases applied: the
Guardian newspaper, great example. They weren't having data volume problems, they
were just finding it too much of a pain in the neck to deal with a relational
database, and they've got a natural aggregate, so they go for a document database.
Just last week I published on my website a case study of a system we've
been developing in California with the Gap to handle their purchase orders, and
purchase orders are again a natural aggregate. So what they used on that
project was Mongo again, as it turned out, and they said it was great: they
didn't have to worry about all the relational database complications of
mapping data backwards and forwards. They have a purchase order, they
lift it from the database, they manipulate it, they throw it back in the database.
Much easier development; suddenly a whole host of data issues went away.
That's what was driving it. If I look at the projects that we've been doing at
ThoughtWorks, that's the constant picture. There are the occasional cases
where we have to deal with big data sets, but much more often it's the easier
development that is the larger piece. We've used Neo4j a great deal, got quite a few
projects that have used it, and again it's that if you've got graph, interconnected-
style data,
yeah, you can do it in a relational database, but it's such a freaking
pain in the neck you'd be stupid to, now that we've got an option, a decent
database that can actually handle graphs. That's a much better way to go, and we've
done some really sophisticated things with genetic processing and stuff,
really funky stuff, using that kind of thing. But the main reason
that people said we couldn't do this was because databases are used for
integration. And here we've been helped by a happy coincidence, because at the
same time as the interest in NoSQL has been growing, people are beginning to
think: well, instead of integrating through shared databases, we should use
this web service stuff and get all RESTful and URL- and resource-like. And as it
turns out this is perfect, because if you can wrap your data behind some kind of
service like this, then nobody cares what the database technology is anymore.
Suddenly you can use SQL, you can use graphs or aggregate data, it doesn't
matter, as long as you can sort of supply the appropriate resource endpoints and all
the rest of it. And that growth of popularity of using services to
integrate is really, I think, at the heart of what's overcoming that piece of the
problem. So as well as having the hammer that made the crack in terms of large
amounts of data, we've also got the web service integration that's giving us more
options in terms of how we go to communicate. So does this mean the future
is NoSQL? My answer is no. The future is what I call polyglot persistence: the
fact that now we have a choice of data storage technologies to use, of which
relational is one, probably the biggest, and will probably be the biggest for many
years if not indefinitely. But the point is we have a choice. We have to decide
what is the appropriate data storage technique for our application, or indeed
have an application that's got a whole bunch of different techniques and
technologies in place for different parts of it. And that is, I think, where
the future lies. The bad news, in a way, is we've got to start thinking
about it.
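
The point about wrapping data behind a service, so that callers no longer care
which storage technology sits underneath, can be sketched in Python. This is a
hedged illustration only: PurchaseOrderService, DocumentStore and RowStore are
hypothetical names, both stores are in-memory stand-ins rather than real
database clients, and the integer status codes merely imitate HTTP responses.

```python
import copy


class DocumentStore:
    """Stand-in for a document database: the aggregate is stored whole."""

    def __init__(self):
        self.docs = {}

    def put(self, order_id, order):
        self.docs[order_id] = copy.deepcopy(order)

    def get(self, order_id):
        return copy.deepcopy(self.docs.get(order_id))


class RowStore:
    """Stand-in for a relational mapping: header and lines kept separately."""

    def __init__(self):
        self.headers = {}  # order_id -> supplier
        self.lines = []    # (order_id, sku, qty) rows

    def put(self, order_id, order):
        self.headers[order_id] = order["supplier"]
        self.lines = [row for row in self.lines if row[0] != order_id]
        self.lines += [(order_id, l["sku"], l["qty"]) for l in order["lines"]]

    def get(self, order_id):
        if order_id not in self.headers:
            return None
        rows = [{"sku": s, "qty": q} for (oid, s, q) in self.lines if oid == order_id]
        return {"supplier": self.headers[order_id], "lines": rows}


class PurchaseOrderService:
    """The only thing other applications talk to; the store is swappable."""

    def __init__(self, store):
        self.store = store

    def put(self, order_id, order):
        self.store.put(order_id, order)
        return 204

    def get(self, order_id):
        order = self.store.get(order_id)
        return (404, None) if order is None else (200, order)
```

A caller of PurchaseOrderService gets the same responses whichever store is
plugged in, so different parts of an application can pick different storage
technologies without their consumers noticing.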
And of course, when we talk about opportunities, you know the joke: an
opportunity and a problem are the same thing. So there's a whole bunch of issues
that we have to deal with in this future. The biggest one is that we've got
to make decisions now. We can't just say "oh, the corporate standard is Oracle"; we
have to say what is the right database for this kind of problem. So when
thinking about what kind of project a NoSQL database might be useful for,
I use these drivers as my base. I say: well, if you want easier
development, it gives you more rapid time to market, faster cycle time,
new ideas out quicker, because there's less friction in development. So that's one
of the reasons to use NoSQL. And again, if I've got lots and lots of data,
then a data-intensive application is typically what drives it. I also argue
that it's not just that you have to have these properties; I think NoSQL
is more useful for what I call strategic projects. What I mean by this: strategic
projects are projects that will make a real difference to the business or the
underlying organization, as opposed to stuff that just keeps stuff running in
the background. Your payroll is a utility; you just want it to work. You don't
really care whether you do payroll better than your competitors. Strategic stuff is
stuff you want to do better than your competitors. The reason why I think
NoSQL is more suited towards strategic stuff is basically because it's immature.
The tools aren't there, people don't know about it, it's more risky. So if you've
got something that isn't completely important to you but is just a utility,
keep the lights on,
you don't want to take on some newfangled technology that's still in
its early days. So you want to actually focus on where it's going to make a
difference. If you see a business opportunity
where bringing in a graph database is going to allow you to analyze the data
more effectively, more quickly, and give you a competitive advantage, that's where
you want to take it. You don't want to say: oh, what I want to do
is a boring task that no one cares about, let's use some newfangled database that
nobody knows to do it. That's how I look at this and some of the
differences. And I've actually timed myself beautifully to my own time, which was a
10:30 finish, which means I have five minutes, according to your clock, in which I
can avoid answering some questions. So please,
fire away.
Yup. So the question is: why not put triple stores on here?
Well, mainly because most of the stuff that I've seen where we're talking about
NoSQL databases has not been talking much about triple stores. Now there are a
number of these things. They've been around for ages, kind of like object databases
have, and have not gone away, but they didn't tend to get labelled as NoSQL
particularly, and that's the totally arbitrary reason I left them off. I mean,
there's a lot of things that are left off in that discussion. But what it does mean
is, as we make the shift to polyglot persistence, it's beyond what we look at at the
moment, or maybe even in the future, as NoSQL databases. Now, as we say "oh, we have
to now think about choosing what is the appropriate data storage", that raises all
sorts of questions. I mean, I personally have a hope that object
databases have a reappearance, because there are many problems object databases are
very well suited to; it would be nice to see them come back. I also think there
are other things that we often forget about. People heavily underestimate the
file system, which is a really good key-value database with
hierarchically ordered keys. You can do a lot with the file system, and they're
very, very common and quite fast and available, and they've got really
very nice caching these days, very good in most operating systems. People
underestimate what you can do with the file system. I think there's a
particularly interesting class of applications that could use technologies
like Git to provide versioned access to information, when
what you really want is to make sure that every change is properly logged. I mean,
there's a lot of ideas you can take from version control systems, and that
leads you into the whole area of temporal databases and things like that. I'm quite
interested by what Datomic is doing, for instance. There's a whole
universe out there of interesting non-relational database approaches which
are not necessarily classified as NoSQL, and really it's a totally arbitrary
word; they don't care.
What I'm talking about with this talk and the book and things like that is
kind of the first step, the things that are going on most noticeably
under that NoSQL flag. But I think that's just the advance guard of what's
going to be a much broader range of data storage things out there.
No, I don't think the future is some kind of super database that could be both the
graph and the relational database and several other things, because the things are
just different models. And so inevitably
you're never going to have one query language that's going to work on a
relational database, on the graph database and on an aggregate-oriented database,
I mean, without a huge amount of compromises, because you have to think
about things differently.
I mean, with some databases you're seeing this to some degree. So
with Postgres at the moment saying how we can support embedded data and JSON
data and things of that kind, yeah, that'll happen. But I think we have to
think about the different data models as different things, because when the data
model is different enough, when the consistency needs are different enough,
you have to think about them in a different kind of way. And so I think the
idea that there'll be some kind of single super database is unlikely.
The change comes to be driven in a number of different ways. I mean, sometimes it is
very much driven because, well, our data usage isn't really matching what
relational can do, so we have to push for something different.
Sometimes you get the effect which I call the no-DBA effect, which is where:
our database group is such a pain in the neck,
we want to use a non-relational database because that way it's not
classified as a database and we don't have to talk to them. And you
laugh, but I've heard several cases where that is exactly the reason, and as a
result they can get changes in much faster, they can go much more quickly,
and the business is supporting them. And you know, if the business is supporting
it, it doesn't matter what some technology group wants to say. If the business is
determined enough they will go find a way around.
So that kind of happens by some routes as well. So those are, I
guess, the most common signs I've seen so far, between the two.
All right, any other pressing questions?