try_mongo/video_transcripts/3-modeling-documents/2-modeling-guidelines_trans...

00:00 As we discussed, modeling with documents and
00:03 document databases is a little bit more art.
00:05 There's a little bit more flexibility and
00:07 kind of just gut-feel of how you should do it,
00:09 but let me give you some guidelines that will
00:12 give you a clear set of considerations
00:15 as you work on your data models.
00:17 You'll see that the primary question for working
00:20 with document databases is to embed or not to embed.
00:24 That is, when there's a relationship between
00:26 two things in your application.
00:28 Should one of those be a sub-document?
00:31 Should it be contained within the same record
00:33 as the thing it is related to?
00:34 Or should they be two separate collections,
00:36 (what MongoDB calls tables: collections),
00:39 because they're not tabular, right?
00:40 Should those be two separate collections
00:42 that just relate to each other?
00:44 We're going to try to answer this question.
00:46 I'll try to provide you some guidelines
00:48 for answering this question.
00:49 The first one is: is the embedded data
00:52 wanted 80% of the time when you have the outer,
00:56 or other related object?
00:58 Let's go back to our example we just worked with.
01:00 We had a book and the book had ratings.
01:02 To make this concrete, the question here is
01:05 do we care about having information about the ratings
01:08 most of the time when we're working with books?
01:10 So if our website lists books,
01:13 and that listing has a number of ratings
01:16 and the average rating, things like that,
01:17 listed as part of the listing,
01:19 you pull up a book, maybe the ratings,
01:21 and the reviews are shown right there--
01:22 most of the time, when we have the book,
01:24 we have the ratings involved somehow,
01:26 then we would want to embed the ratings within the book.
01:29 That's what we did in our data model.
01:30 We said, yes,
01:31 we do want the ratings most of the time.
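
To make the embedded shape concrete, here is a minimal sketch in plain Python dictionaries rather than a live database. The field names (`isbn`, `title`, `ratings`, `user_id`, `value`) and the `listing_summary` helper are assumptions made for illustration; they are not taken from the course's actual data model.

```python
# Hypothetical book document with its ratings embedded as a list of sub-documents.
# All field names here are illustrative assumptions.
book = {
    "isbn": "1234567890",
    "title": "Example Book",
    "ratings": [
        {"user_id": 1, "value": 5},
        {"user_id": 2, "value": 3},
        {"user_id": 3, "value": 4},
    ],
}


def listing_summary(book):
    """Build what a listing page needs from a single book document."""
    values = [rating["value"] for rating in book["ratings"]]
    average = sum(values) / len(values) if values else None
    return {
        "title": book["title"],
        "rating_count": len(values),
        "average_rating": average,
    }


summary = listing_summary(book)
print(summary)
```

Because the ratings travel with the book, the listing needs only the one document; no second lookup is involved.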
01:34 Now, let's look at this in reverse.
01:36 How often do you want the embedded data
01:39 without the containing document?
01:41 So in the same example, how often is it the case
01:43 that you would like the ratings without the book?
01:49 Where would that show up in our apps?
01:50 So maybe you as a user,
01:52 go to your profile page in the bookstore,
01:55 and there you can see all the books you've rated, right?
01:57 And the details about the ratings.
01:58 You don't actually care about necessarily the books,
02:00 you just want, these are the ratings
02:02 that I've given to things, and here's my comment and so on.
02:05 You don't actually want most of the details
02:07 or maybe any of the details about the book itself.
02:09 So if you're in that sort of situation,
02:12 a lot of the time, you might want to put that
02:14 into a separate collection and not embed it.
02:16 You can still do it.
02:18 You can still do this query and it'll come back very quickly,
02:20 there's ways to work with it.
02:21 But you'll see that you have to do a query in the database
02:24 and then a little bit of filtering on the application side.
02:26 So it's not prohibitive,
02:31 it's not that you can't get the contained document
02:33 without its container,
02:34 but it's a little bit more clumsy.
02:36 So if this is something you frequently want to do,
02:39 then maybe consider not embedding it.
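
The "query, then filter on the application side" pattern can be sketched as follows. This is plain Python standing in for the database, and the sample data and the `ratings_for_user` helper are hypothetical. Against a real MongoDB collection, the first step would roughly correspond to a filter such as `{"ratings.user_id": user_id}`, which returns whole book documents.

```python
# Hypothetical books collection with embedded ratings (illustrative data only).
books = [
    {"title": "Book A", "ratings": [
        {"user_id": 1, "value": 5, "comment": "Loved it"},
        {"user_id": 2, "value": 2, "comment": "Not for me"},
    ]},
    {"title": "Book B", "ratings": [
        {"user_id": 2, "value": 4, "comment": "Solid"},
    ]},
    {"title": "Book C", "ratings": [
        {"user_id": 1, "value": 3, "comment": "Okay"},
    ]},
]


def ratings_for_user(books, user_id):
    # Step 1 (stands in for the database query): select every book that has
    # at least one rating by this user. Whole book documents come back.
    matching_books = [
        book for book in books
        if any(rating["user_id"] == user_id for rating in book["ratings"])
    ]
    # Step 2 (application side): discard the book details and the other
    # users' ratings, keeping only this user's ratings.
    return [
        rating
        for book in matching_books
        for rating in book["ratings"]
        if rating["user_id"] == user_id
    ]


my_ratings = ratings_for_user(books, 1)
print(my_ratings)
```

The second step is the clumsy part: the books and everyone else's ratings are fetched only to be thrown away.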
02:41 Now, is the embedded document a bounded set?
02:43 Let's look at ratings.
02:45 How many ratings might a book have?
02:47 10 ratings, 100 ratings, 1000 ratings.
02:49 Is that number going to just grow,
02:51 the number of ratings that we have?
02:53 If this was page views and details about the browser,
02:57 IP address, and date and time of a page we viewed,
03:00 that would not make a good candidate for embedding,
03:02 like say, for views of a book
03:04 because that could just grow and grow
03:05 as the popularity of your site grows.
03:07 And it could make the document so large
03:09 that when you retrieve it from the database,
03:10 actually the network traffic and the disk traffic
03:12 would be a problem.
03:13 I don't really see that happening with ratings.
03:15 I mean, even on Amazon, super popular books have
03:18 hundreds not millions of ratings.
03:20 So this is probably okay, but if it's an unbounded set,
03:23 you do not want to embed it.
03:25 And the bound should be small, right?
03:27 Maybe millions of views is still bounded, but recording them
03:31 inside of a book would be a really bad idea.
03:33 And the reason is these documents are limited
03:35 to 16 megabytes.
03:37 So no single record in MongoDB can be larger
03:40 than 16 megabytes.
03:41 And this is not a limitation of MongoDB.
03:42 This is them trying to protect you from yourself.
03:45 You do not want to go and just say, query by, say, ISBN,
03:49 and try to pull back a book
03:50 and actually read 100 megabytes off a disk and
03:53 over the network.
03:53 That would destroy the performance of your database.
03:56 So having these very, very large records is a problem.
03:59 So they actually set an upper bound on how large
04:02 that can be.
04:03 And that limit is right now, currently,
04:05 at the time of recording, 16 megabytes.
04:07 But you shouldn't think of 16 megabytes as like,
04:09 well, if it's 10 megabytes, everything's fine.
04:11 We still got a long ways to go.
04:13 No, you should try to keep these,
04:15 in the kilobytes: tens, twenties, hundreds of kilobytes,
04:18 not megabytes because that's going to really hurt your
04:21 database performance unless in some situation,
04:23 it just makes avton of sense to have these
04:25 very large documents.
04:25 So having small bounded sets means that your documents
04:29 won't grow into huge, huge monolithic things
04:32 that are hard to work with.
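
As a rough way to see the difference between a bounded and an unbounded embedded set, the sketch below estimates document sizes. It assumes the length of the JSON encoding is a good enough stand-in for the stored size, which is only an approximation because MongoDB stores BSON, not JSON. The 100 kilobyte "comfortable" threshold and all names are assumptions chosen for illustration, not official guidance.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 megabyte per-document limit
COMFORTABLE_BYTES = 100 * 1024         # hypothetical "keep it in the kilobytes" threshold


def approx_size_bytes(document):
    # Approximation only: JSON length is used as a proxy for the BSON size.
    return len(json.dumps(document).encode("utf-8"))


def size_report(document):
    size = approx_size_bytes(document)
    if size > MAX_DOCUMENT_BYTES:
        return "too large to store"
    if size > COMFORTABLE_BYTES:
        return "storable, but consider a separate collection"
    return "comfortable"


# Bounded set: a popular book with a few hundred ratings.
book_with_ratings = {
    "isbn": "1234567890",
    "ratings": [{"user_id": i, "value": 5} for i in range(300)],
}

# Unbounded set: page views embedded in the book keep growing with traffic.
book_with_views = {
    "isbn": "1234567890",
    "views": [
        {"ip": "203.0.113.7", "browser": "ExampleBrowser/1.0",
         "viewed_at": "2020-01-01T00:00:00Z"}
        for _ in range(20_000)
    ],
}

print(size_report(book_with_ratings))
print(size_report(book_with_views))
```

Even far below the hard limit, the document with embedded views is already in the megabytes, and every read of that book would pay for it.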
04:33 Also, how varied are your queries?
04:35 So one of the things that you do with document databases
04:38 is you try to structure the document to answer
04:41 the most common questions in the most well-structured way.
04:46 So if you're always going to say, I would like to, on my pages,
04:49 show a book and its related ratings,
04:51 you would absolutely put the ratings inside the book
04:53 because that means you just do a query against a book
04:55 and you already have that pre-joined data.
04:58 But if it's sort of a data warehouse
05:00 and you're asking all kinds of questions
05:02 from all different sorts of applications,
05:04 then trying to understand, well, what is the right way
05:06 to build my document so it matches the queries
05:08 that I typically do or the ones that I need to be
05:10 really quick and fast?
05:12 That becomes hard because there's all
05:13 these different queries and
05:15 the way you model for one is actually the opposite
05:18 of the way you model for the other.
05:20 So depending on how focused your application is,
05:22 or how many applications are using the database,
05:24 you'll have a different answer to this question.
05:26 So the more specific your queries are,
05:28 the more likely you are to embed things and structure them
05:31 exactly to match those queries.
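
For contrast with the pre-joined shape, here is a minimal sketch of the non-embedded alternative, where ratings live in their own collection and point back at a book. It uses plain Python data in place of real collections, and the names (`books`, `ratings`, `book_page`) are hypothetical. The point is only that the book page now takes two lookups plus a join in application code.

```python
# Two hypothetical collections: books keyed by ISBN, and ratings that
# reference a book by its ISBN instead of being embedded in it.
books = {
    "111": {"isbn": "111", "title": "Book A"},
    "222": {"isbn": "222", "title": "Book B"},
}
ratings = [
    {"isbn": "111", "user_id": 1, "value": 5},
    {"isbn": "111", "user_id": 2, "value": 4},
    {"isbn": "222", "user_id": 1, "value": 2},
]


def book_page(isbn):
    # Lookup 1: fetch the book itself (copied so the stored record is untouched).
    book = dict(books[isbn])
    # Lookup 2: fetch the ratings that reference this book, then attach them.
    # This is the join that the embedded model would have done ahead of time.
    book["ratings"] = [rating for rating in ratings if rating["isbn"] == isbn]
    return book


page = book_page("111")
print(page)
```

This shape serves many different query patterns reasonably well, at the cost of doing the join yourself for the one query a focused application runs most often.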
05:33 Related to that is: are you working with an
05:36 integration database or an application database?
05:39 We'll get to that next.