00:01 When it comes down to modeling with document databases
00:04 you apply a lot of the same thinking as you do with relational databases
00:07 about what the entity should be, and so on.
00:10 However, there's one fundamental question that you often ask
00:13 that really does take some thinking about maybe working through
00:18 some of the guidelines, and that is to embed or not to embed related items.
00:23 So in our previous example, you saw that we had a book
00:26 and the book had ratings embedded within it,
00:28 but we could just as well have the ratings be a separate table
00:30 or the ratings could have even gone into the user object
00:33 about reference back to the book, instead of the reverse.
00:36 So should we embed that ratings, and if we do,
00:40 does it go in books, does it go in users, or does it not go there at all.
00:43 So what I'm going to do, is I'm going to give you some guidelines,
00:46 these are soft rules, we don't have like a really prescriptive way of doing things
00:51 like third normal form here, but some of the thinking there does help;
00:54 so let's get into the rules.
00:56 First of all, the question you want to ask is is that embedded data
00:59 wanted eighty percent of the time that you get the original object;
01:02 do I usually want the rating information when I have the book?
01:08 If it would have resulted in me doing a join in a traditional database
01:11 or going back and doing a second query to Mongo to pull that data out,
01:14 it's very beneficial to have that rating data embedded in the book.
01:19 We designed it that way, so let's suppose like most of our query patterns
01:22 and most the way our application works is
01:25 we want to list the number of ratings, the average number of ratings,
01:29 things like this we want to surface that in almost all the time,
01:32 we want that embedded data when we get a book.
01:35 So that would guide us to embed the data, if this is not true,
01:40 if you only very rarely want that data,
01:42 then you most likely will not want to embed it,
01:45 there's a serious performance cost for what you might think of as dead weight,
01:48 other embedded stuff that comes along with the object
01:51 that you generally don't care about most of the time,
01:54 you can do things like suppress those items coming back,
01:57 so you can basically suppress the ratings object,
02:00 but if you are doing that, it's probably a sign like
02:02 hey maybe I shouldn't really be designing it this way.
02:04 A lot of considerations, but here's the first rule—
02:07 do you want the embedded data most of the time?
02:11 Next, how often do you want the embedded data without the containing document?
02:15 The way our things are structured now is I cannot get the ratings
02:19 without getting the books, I cannot get individual
02:22 ratings without getting all of the ratings.
02:24 So if what I wanted to do was on the user profile page
02:27 show here are all of my individual ratings as a user
02:31 listed on my like favorites page, or things I've rated or something like this,
02:36 that's actually a little bit challenging the way things are written.
02:39 We can definitely do it, and if there's just one
02:41 query we do it that way it's totally fine,
02:43 but this is one of the tensions, you can't get the ratings without getting the books
02:47 you can't get individual ratings, without getting all the other ratings
02:50 from that particular book, there's no way MongoDB
02:53 to actually suppress that, I don't think, like you can suppress the other fields
02:56 we're using a projection right, you get all the ratings, or none of the ratings.
03:00 So how often is it necessary to get a rating without getting a book itself?
03:04 Right, if that's something you want to do often
03:07 or it's a very very hot spot in your application
03:09 maybe again you do not want to embed it,
03:11 if you want the object without the containing document.
03:14 Another really important question to answer is
03:17 is the embedded data a bounded set?
03:19 If it is just a single nested item, fine, that's no problem,
03:22 if it's a list or an array, like we have in the context of ratings,
03:25 how big could the ratings get,
03:28 how many ratings might a book have reasonably speaking;
03:31 if there's ten ratings, it's probably totally fine
03:34 to have the rating data embedded in the book,
03:36 it's nice self contained, you get a little atomicity
03:39 and some nice features of have it embedded there.
03:41 If there's a hundred ratings, maybe it's good,
03:45 if there's a thousand ratings, if there's an unbounded number of ratings
03:48 you do not want to embed it, right so is it a bounded set, first of all
03:53 and related to that, is the bounded set small,
03:55 because every time you get the book back
03:58 you're pulling all of that stuff off disk, possibly out of memory,
04:01 over network for deserialization or serialization
04:04 depending on the side that you're working with.
04:06 So that comes with a cost, and in fact,
04:08 MongoDB puts a limit on the size of these documents,
04:12 you're not allowed to have a document larger than 16 MB,
04:17 in fact, if you try to take a document that's larger than 16 MB
04:20 and save it into MongoDB, even if you pull it back,
04:23 add something it makes it a little bit bigger and you call save
04:26 it's going to totally fail and say no, no, no this is over the limit.
04:29 So this should not be thought of as like a safe upper bound
04:33 this should be thought of as like the absolute limit
04:36 if you've got a document that's ten megabytes,
04:38 it doesn't mean like wow, we're only halfway there, this is amazing or great,
04:41 no, that's a huge performance cost to pull 10 MB over
04:46 every time you need a little bit of something out of there.
04:48 So really, you should aim for a much, much, much smaller thing
04:51 than the upper limit of 16 MB, but the point here is
04:53 there is actually a limit where if this embedded data outgrows that 16 MB
04:59 you just cannot save it back to the database,
05:02 that's a will no longer operate problem,
05:04 is the bound small is more of a performance trade-off type of problem, right,
05:08 but you want to think about these very, very carefully,
05:10 average size of a document is definitely something worth keeping in mind.
05:14 How varied are your queries?
05:17 Do you have like a web app and it asks like maybe ten really common questions
05:21 and you very much know the structure,
05:24 like these are the types of queries my app asks,
05:26 these are the really hot pages and here's what I want to optimize for,
05:29 or is this more of like a bi type thing where people and analysts come along
05:34 and they can ask like almost any sort of reporting question whatsoever;
05:38 it turns out the more focused your queries are,
05:41 the more likely you are to embed data in other things, right,
05:44 if you know that you typically use these things together,
05:47 then embedding them often makes a lot of sense.
05:49 If you're not really sure about the use case,
05:51 it's hard to answer the above questions,
05:53 do you want the data eighty percent of the time, I have no idea,
05:55 there's all sorts of queries, some of the time, right,
05:58 and so the more varied your queries, the more likely you are going to
06:00 tend towards the normalized data, not the embedded modeling data.
06:06 And finally, related to this how varied are your queries
06:09 as are you working with an integration database that lives at the center
06:14 and almost is used for inter-process, inter-application communication
06:17 or is it very focused application database?
06:19 We're going to dig into that idea next.