Google’s BigTable - GQL Reference:

A GQL query cannot perform a SQL-like “join” query.

A Google employee (second post by “ryan”):

We’ve had good results when we take a step back and think about our data models from a different angle. Most app developers are accustomed to designing SQL data models in a certain way, with a normalized schema, foreign keys, fixed column types, etc. With Google App Engine, you can often get good results by loosening the normalization and other restrictions, and often by doing more work in writes instead of queries.

Joins in CouchDB

If you’d be using an SQL database, you’d obviously have two tables with foreign keys and you’d be using joins. But what would the “obvious” approach in CouchDB look like?

CouchDB first impressions:

Joins, of course, are simply not possible… but in the right situations you wouldn’t need ‘em.

Amazon SimpleDB - A different perspective (in the comments):

You are right as far as I can see, that there is no join syntax in SimpleDB.

Normalization and table joins are so 2007.

Update: for more discussion see the reddit comments.


Rss Commenti

20 Commenti

  1. This article you linked to gives a “solution” to the “join problem” in a very realistic setting by CouchDB’s author himself:
    http://www.cmlenz.net/archives/2007/10/couchdb-joins

    However, the solution, which uses arrays as keys with an arbitrary integer to sort on, rather disgusts me. I don’t even want to imagine this kind of view in a more complex setting…

    Normalization and table joins may be 2007, but automobiles were invented in the previous century and still work very well today :)

    #1 Thomas
  2. In CouchDB indeed there is a (IMO) rather dirty way of doing joins (although I think it only really works with up to two “tables”, although there no such concept in CouchDB).

    In BigTable you simply cannot do it, unless you make everything of the same entity type, then you might be able to perform some dirty magic. But the message is: make documents/entities self-contained as much as possible.

    #2 Zef Hemel
  3. The join isn’t what is important. Using a document oriented database means that you design you tables/domains differently. You flatten them, similar to denormalizing RDBMS tables in order to make queries more efficiently.

    #3 Jack Black
  4. Thomas, you analogy regarding the automobiles couldn’t be more appropriate. But I would re-phrase it this way: Automobiles were invented in the previous century and they work just as badly today.

    Beyond a slight improvement in efficiency, power windows and heated seats, automobiles haven’t changed much at all.

    (BTW: the automobile was invented in the century previous to the previous century-1898)

    #4 Jack Black
  5. i have to agree with you jack on this topic. i guess the rdbms guys, like myself, have just grown to naturally think in relations. usually anything that will be repeated would be put into it’s own table and then joined to the parent table. however if you look at the author of couchdb explanation with putting the comments inside of the blog document, it kinda make sense.

    i think all us sql guys are just in a bit of a shock by the approach that google is taken. by like anything you will just need to rethink the way you do things. personally, i kinda like this approach.

    #5 tony petruzzi
  6. Acting like you’re the first to predict the demise of the relational database is so ignorant.

    Have you ever heard of object databases? They were supposed to kill RDBMSs. No need for ORM Vietnams. No impedance mismatch. Where are OO DBs? Fuckin’ nowhere.

    MySQL, PG, Oracle, and SQL Server aren’t goin’ anywhere, homez.

    #6 some guy
  7. Goodbye, data integrity!

    Goodbye, consistency!

    #7 frank
  8. This is totally true. MySQL fetches rows based on indexed keys really quickly. Doing a JOIN totally slows things down. Normalization = Bad

    #8 Amusing Trousers
  9. “This is totally true. MySQL fetches rows based on indexed keys really quickly. Doing a JOIN totally slows things down. Normalization = Bad”

    Sounds like someone is joining on columns that aren’t indexed. Could also be poor schema design.

    #9 Josh in California
  10. some guy said:

    “MySQL, PG, Oracle, and SQL Server aren’t goin’ anywhere, homez.”

    I’m sayin. MySQL seems to be doing a pretty bang up job FTW.

    #10 Anonymous
  11. I’ve use most of the big relational databases including Oracle, Sybase, Informix, DB2, Teradata, MySQL, SQL Server, Postgres & even Adabas back in the day. Hell, you might as well throw in Access & Dbase just for laughs.

    The funny thing is while I still use Oracle & MySQL daily, the relational db is dying on the web. Hardly no ISP even offer Oracle who somehow let mysql just own the ISP space.

    If Oracle’s RAC is so kick ass, why hasn’t Oracle built a 1,000 node RAC cluster and wiped out Google already? Because it is no match. Probably 80% of all RAC clusters are just little rinky dink two node ones.

    How can Google who knows jack shit about relational databases be the kings of search? Why aren’t Oracle, IBM & Microsoft the kings of search?

    The answer is that relational database are good for somethings, but not for others.

    #11 Nazz
  12. So, okay, we redesign our little database and flatten everything. Now a blog post also contains an array of comments. Then we run into the concurrency problem which, with e.g. Slashdot, is going to be a very real problem.

    How do we deal with it? Require the user to resubmit her comment just because the programmer was lazy? Unacceptable. So the blog engine needs to deal with this: wait a while, exponential fallback or something, then retry.

    This mechanism could in some cases be offloaded to the DBMS, if it allows us to upload some code to it, that should be run as a single transaction. In the example, we would send the DBMS the code that fetches a blog entry, appends a comment and stores the entry back into the database. Only if all steps succeed is this transaction committed. Does CouchDB support something like this?

    #12 Thomas
  13. Thomas, CouchDB is working on some ACID transaction stuff I think. I think there’s a discussion going on on those right now, haven’t really been following it: http://mail-archives.apache.org/mod_mbox/incubator-couchdb-user/200804.mbox/%3C0433F5A7-24B9-4194-91CF-631282DCD975@gmail.com%3E

    #13 Zef Hemel
  14. These databases don’t claim to be a replacement for the general-purpose DBMS. They would obviously be a very bad choice for an ERP system.

    But they do a surprisingly good job for the typical database-backed web site. They are designed with a different set of assumptions and requirements like extreme scalability and distributed operation.

    #14 Oren
  15. I have heard this comment so many times now!
    It is true in some cases but not completely true for other cases.
    First one being the places where you just have to store data and fetch it given the id/handle of the record and show it to the user with almost no data processing on it.

    Second cases turn out to be things like log processing. Actually a place where you need to do a lot of data crunching.
    One guy at work implemented a search log processing system in a object database (lisp based allergo cache by Franz). We had some 2-3 million records in there and needed to pull out some key features by doing some data crunching and math. This thing used to take 15 minutes to process everything and push the data in, another 10 to do data crunching and spit out the features data.
    Just for the kicks I implemented this in MySQL and Ruby. It still took 15 mins to push data in the mysql database (after processing) but the features were being fetched in like 35 seconds odd!

    Obejct DB, AFAIK, is not well suited for the second type of things.

    #15 Piyush
  16. some guy: I’ve used object databases. They’re a dream to work with. They’re hard to get in the door in 2008, kind of like how GC was hard to get in the door in 1996 or HLLs were hard to get in the door in 1982. Things that make programmers more productive all start out fringe, and only get picked up very slowly, unless they’re very backwards-compatible like MS-DOS or C++.

    #16 tim
  17. What CouchDB is really missing now is a bulk GET, doing 10000 http GET requests is so slow.

    #17 Jeffrey Gelens
  18. @tim: Object databases have been trying to get in from the fringe for about ten to twenty years now. That’s not called “slow”, that’s called “dead”. Your post reminds me of Monty Python’s “Dead Parrot” sketch.

    “Look at that CacheDB. Bee-yu-ti-ful plummage, isn’t it?” :)

    #18 Paul Tiseo
  19. To add: BigTable is not a database system in the way people are used to thinking of it or experiencing it, meaning a broad set of services overlaying some subset of the Relational Model.

    It’s a glorified map-like data structure, a.k.a. associative array. From my never-worked-at-Google understanding, it’s a key-value array, where there are basically two keys per value. (Actually, three keys, “row”, “column” and time.) For all this talk, you could represent BigTable in any RDBMS simply by creating one table. Google has multiple BigTables for various of their out-facing applications like Search.

    Even Google states that they built BigTable because their need for data consistency was very weak relative to their tremendous need for performance. This is a very fringe need and why RDBMSes will remain a preferable tool for things like ERPs (as Oren states) and any other complex, transactional web application.

    #19 Paul Tiseo
  20. If Google is intending to provide a data storage/access foundation for an emerging class of Web based applications that offer software as a service the model it uses must address ALL of the requirements of data management.
    ‘Bigtable’ has no inherent provision for storing normalized data. Comments I have seen treat data normalization as a
    hangup for ’squares’ from another generation. The comments suggest that we should just ‘loosen up’ and forget about data normalization. This is silly, dangerous thinking. It is always wonderful when things can be simple but the reality is
    that the world we model with data is rife with complex patterns of one to many relationships. We cannot wish them away. Forty years of experience with data management has taught us that stores of data lose their integrity when redundant ( unnormalized ) is present. During the 1960s database management in its pre-normalized phase was in crisis. Computer scientists identified redundantly recorded data as the culprit and created the theory and practice of normalizing data to remove update anomalies.

    #20 joe

Lascia un Commento