Normalized data is for sissies  OCT 27 2004

I'm not a hot shot programmer by any means, but I've done quite a bit of playing around with getting data in and out of databases programmatically. Something that's always confused me is the near-religion of data normalization among programmers and database admins. Every developer I've ever worked with has told me that when you're building a database, you need to normalize your data -- basically this means organizing your data in such a way that removes redundancy -- and failure to do so would result in public ridicule and possible revocation of access to any computing device. But I've always wondered, given that hard drives are cheap and getting cheaper, what's the problem with using more storage space in exchange for greater speed?

I was delighted when I read Cal Henderson's take on normalized data from a recent talk he gave about Flickr (from page 27):

- Normalised data is for sissies
- Keep multiple copies of data around
- Makes searching faster
- Have to ensure consistency in the application logic

To which I would add: hard drives are cheap.

Cal presents normalization as a trade-off that, depending on your circumstances, might be worth looking at...which is a much more useful way of approaching the situation than what I've typically heard (normalize or die!). Want faster access to your data? Replicate it in the database but be aware that it'll cost you some storage space and you'll need to keep track of the extra data in your application (which can be a pain in the ass). In Flickr's case, they have 13 SELECTs for every INSERT, DELETE, and UPDATE statement hitting their database. Normalization can slow SELECT speed down while denormalization makes your I/D/Us more complicated and slower. Since the application part of Flickr depends so heavily on SELECTs from the database, it makes sense for them to denormalize their data somewhat to speed things up.

Here's an analogy for the smokers in the audience...what sucks worse than realizing you left your lighter at home and you're stuck in traffic on the way to the office? The solution is to buy a bunch of lighters, one for your car, one for your coat pocket, one for the drawer in your office, one for your purse, etc. It's a trade-off. Your initial cash investment is greater (but lighters, like hard drives, are cheap) and you need to be diligent about leaving each lighter in its proper place, but you're never without a lighter when you need one.

There are 85 reader comments

David Schontzler10 27 200412:10PM

And DB designer who is worth their salt knows that normalization isn't always the way to go. Usually, you'll normalize your DB, run some tests, and then de-normalize as needed. Many times, there is no need to de-normalize, but I could definitely see the need for it with large sites like Flickr, eBay, and others.

David Grant20 27 200412:20PM

One down side is that you have to keep track of all 13 lighters, and make sure that they all have the same amount of fluid in them when the fluid in one changes.

Robert Occhialini21 27 200412:21PM

It sounds, to me, like the scalability problems being described in your post, and in his presentation, could easily be rectified through the use of indexes and stored procedures. While normalization can, in some cases, slow down selects, indexes especially would make this issue dissapear.

In some ways, normalizing databases is functionally similar to semantic markup, and we know how you feel about that.

Dennis24 27 200412:24PM

You gloss over the real reason for data normalization by saying, "you'll need to keep track of the extra data". It's not the extra disk space that's the issue.

John Spurlock25 27 200412:25PM

Optimized for SELECT + don't worry about disk space + denormalized to some extent + still ok with those hardcore DBAs = OLAP

Anil Dash26 27 200412:26PM

I think the reason normalization is preached as religion is because it's a rule that you ought not break until you understand where/when to break it. Every discipline has areas like that, where you say "don't ever do x" but you really mean "only do x in particular circumstances that are too complicated for beginners to get a grasp of in a short period of time".

Of course, this is the argument of people who preach abstinence, too, so maybe it's not the best technique.

David Kornahrens26 27 200412:26PM

I agree about the latest Hardrives becoming some-what cheap. As I have a 2 tera that goes down periodically.

chris crippen28 27 200412:28PM

and if you have more then one programmer working on different parts of an application, each programmer then needs to know where to update each table that that holds this "duplicate" information. missing one update could invalidate your entire database.

David Barrett31 27 200412:31PM

"Your initial cash investment is greater (but lighters, like hard drives, are cheap) and you need to be diligent about leaving each lighter in its proper place, but you're never without a lighter when you need one."

True, but you will end up spending more cash in the long run because you'll be smoking more cigarettes!

bryce32 27 200412:32PM

I agree with Dennis. The point of reducing redundancy is reducing the possibility of *conflicting* data, if one element is modified but its duplicates do not. As David Schontzler points out, normalization (to the 3rd form) is not always the way to go, which is why there are actually 3 levels of normalization.

bryce34 27 200412:34PM

s/do not/are not

jal40 27 200412:40PM

Hey, fine by me if you want to risk damaging your data, but ignoring set theory in a platform designed around it seems like a generally bad idea. What I mean by that is that by not normalizing, you're taking responsibility for ensuring the consistency of your data, something that databases are really quite good at, if used properly. Me, I'm lazy, and want the DB to do as much as posible for me.

Sure, I denormalize for speed, but note that denormalizing implies that you have normal data in the first place.

Before preaching "oh, don't worry about it", I suggest you fix a few screwed up schemas full of ambigious data first.

Ian46 27 200412:46PM

There is also often a need to de-normalize when you don't have lots of money and hard drives, etc. My company is building a helpdesk system aimed as small businesses, colleges, etc. Often they'll be running it on very basic hardware which is most likely hosting other apps as well. If a requirement is fast searching then having a 4 way join to get at the data on a loaded server is going to potentially take a long time. De-normalizing the data in key spots can help a great deal in these situations.

jkottke50 27 200412:50PM

You gloss over the real reason for data normalization by saying, "you'll need to keep track of the extra data"

Which is a quote taken out of context...I followed that up with "which can be a pain in the ass"...that's hardly glossing. But the point is that whether to normalize or not is a trade-off between several factors which depends on the situation at hand. And yes, keeping tabs on duplicate data is quite an important factor, probably the most important one in most situations.

I think the reason normalization is preached as religion is because it's a rule that you ought not break until you understand where/when to break it.

Yeah, my developer friends probably kept me in the kiddie pool on this one. I wish they hadn't though...the bigger picture is much more interesting, worthy of discussion, and is an opportunity for real creativity in database/app design.

Maciej Ceg?owski51 27 200412:51PM

The issue is complexity, not disk space. Keeping duplicate data synchronized is of the fastest ways to introduce bugs. That's why denormalization is always a trade-off. Sometimes it's worth it (ease of maintenance, speed issues) but it carries a cost.

Todd W.51 27 200412:51PM

So, if it's not database normalization, what's the explanation of why Flickr is alway slower than snot on an ice fisherman's beard?

Admittedly, I don't know squat about db architecting, but it seems like database normalization is aimed at fixing a lot of problems up front. Keeping track of all the redundant insert/delete/updating leaves a loooooooot of room for error if you (or the several people working on the db) forget something. In the cigarette lighter analogy, it seems like once the db got big enough, you'd have to carry around a crate of lighters and drop them behind you Hansel-and-Gretel-style.

Chad Baker52 27 200412:52PM

I think the obvious point is that it depends on the situation. For the app I work on, we denormalize because we want to capture the data as it is at the time of a transaction. If the master records change after that point, the copies are left alone. For situations where that's not the case, I'd rather keep things normalized and spend a few extra milliseconds (in most cases) to get the data out.

Maciej Cegłowski54 27 200412:54PM

Aargh please make your page utf8

mark18 27 2004 1:18PM

The one true ideal is that there is no one true ideal. Each problem domain requires a unique solution. hopefully that solution is arrived at by learning from the past, but not being blinded by hard line ideology or by hard line reaction to ideology.

Fazal Majid02 27 2004 2:02PM

Denormalization is a perfect example of premature optimization. You should design your data model to be normalized, and denormalize afterwards if you have specific performance problems or physical indexing/clustering/partitioning requirements for it.

There is one justification for denormalization in limited cases, and that is when you have a historical table that should not change when the normalized data changes. For instance, in an invoice table, you do not want the invoices to change retroactively if your pricing changes, so you will copy a snapshot of the pricing data at invoice time. In this case, you specifically do NOT want changes to be propagated, and the two columns are actually slightly different in semantics anyway (frozen vs. mutable values), so strictly speaking this isn't really a form of denormalization.

tyamada07 27 2004 2:07PM

processors and ram memory are cheap also

Tom Harrison27 27 2004 2:27PM

"Data Normalization" is a simplified term used to describe the concept of eliminating redundancy to improve efficiency in the design of a database schema. A more in-depth description of the concept would reveal varying degrees of data normalization, referred to as First, Second and Third Normal Form. Starting at the bottom with First Normal Form, your design simply works around each table in your database representing a unique bit of information such as objects in the system and weak elements that give relational meaning to those objects. Second Normal Form would further "normalize" the data by eliminating redundancy that was missed the first time. Your design would then be considered to be of Third Normal Form when all duplicate data has been omitted at future duplication is not permitted. Often times, Third Normal Form is unattainable, and a solution of Second Normal Form is perfectly acceptable.

There are probably many situations where such an in-depth analysis of a problem is not required to attain a solution, but most systems still benefit the most from this type of architecural process. Implying that the concept of data normalization is one that should be tossed out the window is going overboard in the same way that it would be to say every acceptable schema needs to acheive Third Normal Form. Finally, in an enterprise-level system where a huge amount of data is operated on by large pre-defined sets of queries that work together to build the output, stored procedures would improve the efficiency of those transactions.

Kip Ingram27 27 2004 2:27PM

I love this post. Well said!

gerrard43 27 2004 2:43PM

In my experience, most of the people who dislike normalization have a limited understanding of the capabilites of the various RDBMS out there. For example, normalized views are often a very attractive alternative to denormalization without the added complexity associated with maintaining consistency via application code.

What about the perf impact of locking and concurrency? Is Flickr executing their data modifications in transactions? Are these transactions holding locks on 13 tables while they are executing? How does that impact the perf of the select queries?

I always start out db design by normalizing as much as possible, and then denormalize only when actual testing indicates it will boost performance.

Good comment, Fazal. One of my colleagues always makes fun of me for spending time thinking out how to save cycles in my string manipulation code, while ignoring the fact that the app might be making several remote machine calls via db access or web services ;).

Ian59 27 2004 2:59PM

There are those with a reason to denormalize (usually speed related) but there are also others who are just lazy. I've looked at a scary number of db's where data is stored in a denormalized state because nobody wanted to write the extra lines of app code to insert across multiple tables, check for errors, handle transactions,etc.

Also with the explosion of web development becoming accessible to the masses via easier tools you have alot of folks out there who just don't know any better. Top that off with even more customers who know even less about what to ask when a software company comes in and pitches a product and I imagine that many(most?) web app db's are not normalized appropriately.

Tia01 27 2004 3:01PM

In thinking about our particular situation, we strive to have our data as normalized as much as we can, but with regard to the web servers (hardware), they are in a sense denormalized because they are clustered and that's a conscious decision on our part.

We had the option of buying one superfast kick-ass ridiculous webserver or load balance several pretty fast, fairly kick-ass servers. We opted for the cluster because with our traffic and the content we have, it makes much more sense. If a box fails, it's not catastrophic. In our day to day, yes, it is a pain in the ass to update the code but the heart and soul of our site is our databases and with the amount of new content we are generating daily, it serves us best to address speed issues with relatively cheap hardware than with having to deal with ballooning redundancies in the db.

Jonathan Snook01 27 2004 3:01PM

In reading the slides, it's really hard to get a true sense of the performance issues they're running into. It certainly sounds like MySQL's limitations are what causing the performance issues and that they've used denormalization as a way to solve the problem. These performance issues also seem to be related to doing free text searches as opposed to searches based on primary keys.

I haven't run into anybody that is a hardcore normalization nut with no room to budge on the matter. Most understand that you normalize as far as it makes sense to. It has little to do with hard drive space and more to do with a managable and flexible database design.

Cory02 27 2004 3:02PM

First off, normalize.

Then:

1. Are you able to get acceptable performance without denormalizing (indexes, clusters)?
2. Will denormalization make the system be less reliable?
3. Will system performance after denormalizing still be unacceptable?

Any denormalized database creates administrative overhead. This includes but is not limited to documenting decisions and ensuring data integrity across dupes programatically. Storage shouldn't be a reason for denormalization. If it is a primary justification, it is an excuse for lazy (and often poor) design. Take into account the I/O saved, CPU saved, complexity of update programming, cost of returning to a normalized design and end user performance.

Denormalization is what is says: denormalization. It is a tool in database design, not the design itself. Normalize it first, then look to increase performance by looking at all the tools available to you. Denormalization has higher maintenance costs (often in people time) than other techniques. Furthering what Tom said, most applications won't need analysis after normalization. Chances are, if yours is one of them, you've already got a DBA who knows the trade-offs. If you don't need performance tweaks, the benefits from normalization will far outweigh glorified Excel spreadsheets.

jfrank09 27 2004 3:09PM

If you replace engeneering with dogmas like "normalize or die" or "normalized data is for sissies" you're already screwed.

Bo21 27 2004 3:21PM

Sorry, but this guy has it all wrong. I'm sure it gets the job done for him but his arguments won't work in the real world. Normalization is the BEST way to go. As has been pointed ouot before, normalization ensures correctness, increases flexibility, increases maintainability, and vastly lowers the overall cost-to-change for any big app. If you work in industries that have big ol' 20yr old dbs lying around (ie banking) then this becomes really clear really quickly. Denormalization is an *optimization* and (let's all say it together) optimization is the root of all evil. I always tell people: *never* optimize (unless you really know what you're doing in which case, put in the extra time and fix the design). Also, this whole principle is wrong IMO. It's no secret to anybody that web-sites(services) do 8 reads for every 2 writes. This doesn't mean you should screw up your db schema it means you should invest in CACHING. This has been obvious for the last 5 years: on the web cache is king.

Matthew Smith31 27 2004 3:31PM

Cal Henderson can afford to say things like "Keep multiple copies of data around". If the Flickr DB is inaccurate...who gives a damn? It's a photo sharing site. (With all due respect, of course:))

But in the real world, where you need to generate accurate reports from reliable information about sales, inventory and customers, you favor reliability over performance every time. I don't normalize data because I'm looking to conserve disk space - I do it because I'm trying to preserve consistency and accuracy.

It's already enough that you have to worry whether USERS are entering accurate information. You shouldn't also have to worry about the app keeping track of it consistently across multiple redundant tables.

Mike D35 27 2004 3:35PM

Another complexity-related problem with denormalization is that you're adding new dependencies between your program and the storage component. When you denormalize, you're making the layout dependent on usage. If your usage changes ... uh oh. The design of a reasonably normalized database is independent of how the data will be used.

On complexity, a lot of developers I've worked with take the attitude that that programmers should make the extra effort in order to produce the most efficient program. But, complexity has serious consequences: less portability (see pointer arithmetic), more bugs, and far slower development.

If efficiency was really king, we'd develop web applications in C or even assembler. Cheap disk space, processor time, and memory lets us us high level tools and practices (like normalization) to build bigger and more robust applications that we could otherwise.

since196855 27 2004 3:55PM

Kottke wrote:

the bigger picture is much more interesting, worthy of discussion, and is an opportunity for real creativity in database/app design.

The thing is, there isn't a tremendous amount of room for creativity in database design. Most of the problems have not only been solved, they've been solved for 25 or 30 years. Normalizing your data is still best practice in most cases.

Stephen57 27 2004 3:57PM

I'm not going to pretend to know every aspect of this issue, but normalization seems to be the best way of keeping things consistent. If and only if optimization is the primary concern, then denormalization makes sense. However, this is rarely the case.

It makes more sense for simplicity, accuracy, and data integrity to keep one piece of data in one place. This applies to markup as well: think back to the web before PHP was common (or JSP, or ASP, or whatever) and each and every page needed to be altered to fix one change. Modifying 13 tables or thousands of records is even more nightmarish.

Brian44 27 2004 4:44PM

Denormalization makes the baby Jesus have to stay extra late on Fridays to fix munged data.

Olivier Travers11 27 2004 5:11PM

The primary goal in preventing redundancy through normalization is not to save storage space but to enforce data integrity. What's the impact of corrupt data at Flickr? A mistagged picture maybe - seeing one uncut dick in the middle of a flower gallery is probably as bad as it can get. However not everyone can afford the consequences of a rollback gone awry on the kind of complex I/D/U statements you seem to advocate. I wouldn't like my bank account to suddenly belong to five different people.

Moreover, what is the performance cost of all the triggers (or other mechanism used to similar effect) necessary to enforce integrity constraints within your denormalized database? "Ensure consistency in the application logic" is not going to be free.

You can jump out of a plane without a parachute and say parachutes are for sissies and they're a heavy burden. That might make sense if you're just 10 feet above water and the plane is not moving too fast. In most cases however, I know which side of the "tradeoff" I want to be on... Normalizing or not is a bit more complex than how you described it.

barlow28 27 2004 5:28PM

I read your post, Jason, and was encouraged because the app I'm working on has some complex, slow SELECTS and I've been tempted to denormalize. And then I read all the responses and I'm back to being convinced to normalize as much as possible.

I don't think there has ever been a software project that, when it changed hands, the new coders told the end client how great a job the previous programmers did and what a wise set of choices they made! So I just try to comment any time I'm doing something unorthodox and hope that the next guy will sympathize with me.

dusty46 27 2004 5:46PM

The thing is, there isn't a tremendous amount of room for creativity in database design. Most of the problems have not only been solved, they've been solved for 25 or 30 years. Normalizing your data is still best practice in most cases.

Well said.

jkottke00 27 2004 6:00PM

The thing is, there isn't a tremendous amount of room for creativity in database design. Most of the problems have not only been solved, they've been solved for 25 or 30 years.

And yet, we have this discussion. Can't be all that devoid of creativity...

The primary goal in preventing redundancy through normalization is not to save storage space but to enforce data integrity.

Yeah, I get it...like 20 people have said this. Doesn't anyone read the thread before posting?

r. vaca31 27 2004 6:31PM

As since1968 points out, this is nothing new and exciting in the database world. The OLAP vs. OLTP guys had this particular war already. It started in the 80's and heated up in the 90's with all sorts of new companies appearing with solutions based on new database designs (e.g. Express OLAP, which was later bought up by Oracle) That was an exciting time for databases - with the object/relational wars going on too and the new excitement around spatial DBs. I'm a fan of new ideas but this isnt one of them.

The current discussion has really been about welcoming new users to the complex world of databases (which many programmers take for granted) rather than any creative discussion (Jason, I dont mean that in a snarky way!) The cutting edge stuff in the database world these days is probably around stuff like data/design optimization for hardware clusters.

The system described in the slides is scalable for reads but that isnt hard to do - you can always build downstream read-only slaves and in fact this hub-and-spoke model is how many large businesses provide data access to many of their departments. The thing that is glossed over is that as transaction volume increases (even assuming the percentage of I/U/Ds stays the same) there is no equivalent easy way to replicate the Master. The term "Replication" as it is used in the presentation is not what is meant by Replication in the database world. Creating read-slaves is just a copy. Replication involves multiple, distributed Masters with all the design and conflict resolution that that involves.

Cal does recognize this when he writes "We can scale horizontally, at least for a while"

John R33 27 2004 6:33PM

It's not really a discussion. It's the people that know what they are talking about stating what has been tried-and-true best-practice for decades.

The only reason there are so many posts is that your readership is loyal and want to set you straight. By the way, what happens when a blogger changes his or her opinion?

Olivier Travers35 27 2004 6:35PM

I did read the thread. Just found it fun to bash the message in your head just a little further :-) Or more seriously, to rephrase it a bit less technically and illustrate the point differently.

Mark Beeson35 27 2004 6:35PM

And yet, we have this discussion. Can't be all that devoid of creativity...

Yes, every time a newbie db designer starts asserting statements, we have this discussion. Every single dba has gone through this.

Yes, hard drives are cheap.
Yes, processors are cheap.
Yes, bandwidth is cheap.

Quality of service and scalability, however, are not cheap. Assume the following table:

PHOTOID int NOT NULL PRIMARY KEY,
FILENAME varchar(255) NOT NULL,
FILESYSTEM varchar(255) NOT NULL

vs this table:

PHOTOID int NOT NULL PRIMARY KEY,
FILESYSTEMID int NOT NULL,
FILENAME varchar(255) NOT NULL

Assume you have an index on all three columns. Inserting a row in the first table will have a non-trivial amount of overhead (for updating the index on the FILESYSTEM varchar column, as opposed to updating an int column in the second table). While processors are certainly fast, this sort of wasteful database design lends itself to enormous scalability problems. What happens when 10 inserts per second are happening on that table? The overhead of updating a varchar index will degrade your quality of service, as more and more queries get thrown at that table.

And just throwing another machine at the database isn't exactly that easy. Ask any MySQL developer about creating a cluster of database that can all handle inserts. They'll laugh at you, because (as of version 4.X) it can't be done. You can do it in Oracle, but then you have to deal with the cost of Oracle licenses, which aren't exactly cheap.

If you're concerned that the (select filesystem.filesystemname,photo.* from filesystem,photo where photo.filesystemid = filesystem.filesystemid) query would run slow, don't be. Most modern databases will cache those sorts of joins in memory, or if you're really concerned then just build a view that encompasses that query, and select from the view. You can additionally do logic on the application server to cache those results at the app server layer. Which gets back to my original point-- scalability. A well-built database (with correct normalization) will scale much much better than a database with just flat data.

pickyin54 27 2004 6:54PM

normalization, while heavily touted and preached (and tested if one ever attended one of those programmer interviews), is not always the practical approach and is far away as a compulsory rule.

the actual strength of a database lies its individual design and relationship schema. if the DB designer can look far enough ahead to include flexibility and inter-connectivity in his DB, chances are a little redundancy will not affect the final implementation.

normalization is almost never practiced in lower level DB2 based databases (especially those running on mid-range servers), simply because processing speed and hard disk space factors are not a concern.

Ryland31 27 200411:31PM

Isn't normalization kind of what the R in RDBMS is for? Normalization is pretty much inherent in the idea of a relational database schema. Denormalization is basically working against all the nifty tricks your RDMBS system can do with indexing, replication, referential integrity, and so on.

By the way, what happens when a blogger changes his or her opinion?

There is a theory which states that if any blogger discovers that he is wrong and changes his opinion, that he will instantly disappear and be replaced by another blogger that is even more bizarre and inexplicable. Then there is a theory which states that this has already happened.

Stephen10 28 200412:10AM

I can think of exactly 42 reasons why that theory is correct.

chowder59 28 200412:59AM

this is pretty much what they taught me at uat when i studied database theory. kind of a 'know the rules and then you can break them' kind of deal is not a bad approach.

rg01 28 2004 3:01AM

Any project which has developers making design decisions on normalization is either too small for the normalization/denormalization argument to matter considerably, or it is screwed. A competent software architect will ensure there is no room for a hotshot programmer to put the project at such risk.

dhartung20 28 2004 3:20AM

Well, Codd is part of the reason. His book, the Relational Model, was a Comp.Sci. textbook staple for a decade or two (I think more recent works by his disciple Date now fill that role). For theory, it's dense but fairly well written. DBAs are often CS graduates in comparison to, you know, "coders", and web coders are even less, um, part of the discipline, and less inclined to be acolytes, while more inclined to hackerish get-it-to-work approaches like keeping several copies of a datum around. In my experience, anyway.

Stewart Butterfield57 28 2004 3:57AM

Holy smokes. I'll send this to Cal (I work with him at Flickr).

This was just a throwaway line used to make a point (it was a powerpoint bullet in a talk on PHP, not a position paper on DB design). None of us advocate doing things all crazy and stupid. And selective de-normalization (underscoring the point that things are normalized to begin with) is a design decision.

We're not using a database to store and manipulate relational data, we're using it to make an app run. I mean, of *course* we're doing the former, but only as a means to doing the latter. (Sort of like saying that it's not computer science, it's making a product. Or, I'm not putting one foot in front of the other while continuously falling, I'm walking to the store to get some milk.)

Also - you can do some pretty fancy things with replication: replicated data doesn't need to use the same table types or even necessarily the same tables - you can run a de-normalized mirror of your primary database just for searching, which is pretty nifty.

Also - you don't just go de-normalizing everything, and don't touch things that are hard to reliably manage in business logic. Flickr is definitely not a bank or a weapons control system or anything like that, but people's photos (and their privacy preferences and all of that) are still important. When you have paying customers, it's never really ok to screw up :)

Also: Slow? Is not!

(OK .... sometimes image serving is slow, and we're working on that - it's a symptom of needing to scale faster than we originally imagined. But that has nothing to do with the database - even pages that are doing a lot of DB hits render sub-second.)

mattw58 28 2004 3:58AM

The tech issues around normalized/not data are solvable with code or db features. In my own experience, I've found the social make-up of the team just as important: For a certain type of team, normalized data means hurried bug fixing is less likely to result in screw ups. More here.

Jonas00 28 2004 4:00AM

The whole *point* of using a RDBMS is to keep all logic that ensures consitent data in *one* central point. That's why another mantra for database people is ACID.

I believe most people find it obvious that you don't store data in an RDBMS to conserve disk space. That would be stupid. And neither because you have extreme speed requirements, in which case misusing SQL is the *wrong* thing to do, the right thing would be to look beyond that and using some distributed embedded engine.

Just because this guy made a big shot web site, burning a lot of cash along the way, doesn't make him right.

Stewart Butterfield05 28 2004 4:05AM

Wow - Webb's response right after mine is a much better way of putting it. I shouldn't have said anything ;)

And Jonas - "The whole *point* of using a RDBMS is to keep all logic that ensures consitent data in *one* central point." No, see, that's idiotic. If that was the *point* of you using a RDB then who was paying you? And what for? The *point* comes from the application.

eric pan30 28 2004 5:30AM

Any project which has developers making design decisions on normalization is either too small for the normalization/denormalization argument to matter considerably, or it is screwed. A competent software architect will ensure there is no room for a hotshot programmer to put the project at such risk.

Contrary to what rg assumes, there is no such absolute dichotomy. You can be a hotshot programmer and a terrific DB architect :)/

And you'd normalize your database unless there were specific, compelling scenarios in which denormalization is a better alternative.

Case in point:

Take a database, large number of interrelated tables, optimally indexed. If you wish to make a large number of those fields in those tables searchable, it is not practical in MySQL to do a combination of SELECTs and UNION ALLs across them. Even with UNION ALL, where MySQL is not supposed to sort through the data, each additional UNION ALL adds significant overhead and the time taken for the query increases non-(worse than)-linearly.

For a solution, now we can consider trading storage space for speed, namely by pulling searchable data out of those tables into a UNION'ed table, index the fields on it, and just query THAT when needed. The search table can be recreated whenever need be, and will not be relied upon for other uses, so there is little problem with data inconsistency.

Knowing when there are exceptions to the guidelines of data normalization is the wisest position. Spouting ridiculous and somewhat pretentious (either that, or at least intentionally inciteful) mantras such as "Normalization is for sissies" is as much a admission of newbiehood as is a statement such as "Always normalize, no matter what the situation". Which no one ever really says, anyway. It's a straw man.

And Mr. Kottke, in algorithms class, way back when, we learned that no matter how much you increase the quality and quantity of machines you have working for you, they can never replace a bad algorithm. Unless you can disprove that, I don't think the majority of your veteran DB design readership will find Cal Henderson's presentation very compelling: 13 SELECTs per I/U/D is ridiculous, when you could interrelate all your tables (heh, normalize) such that you don't need a 13 separate database transactions overheading an I/U/D. You might not even need any.

eric pan44 28 2004 5:44AM

PS:

I'm glad Stewart somewhat clarified what they're doing over at Flickr. It appears that it's mostly senor Kottke's take on Cal's presentation that is the problem. There are fewer issues at stake here than this thread would immediately imply; most counterargument is directed to Jason's interpretation and hasty conclusion, (hopefully) rather than a haphazard implementation of a database architecture in use at Flickr.

JayBee22 28 2004 7:22AM

It's been said already - normalisation is only marginally about storage space.

Let me tell you a story. I used to work for an academic institution where the data was (and still is) embarrassingly de-normalised. In fact, I hesitiate to use that phrase as it implies that once upon a time in the dim and distant mists of the past it approached normalisation. I can guarantee that this is as far from the truth as the data is from "normal".

So, that's the setup. Let's chat about the data. It holds student records. So badly that it was possible for students to be marked as graduates WITHOUT THEIR NAME BEING PASSED TO ANYBODY FOR INCLUSION IN THE GRADUATION MAILING LIST. This would never be caught until the student called up to ask why they had their pass results but hadn't been told that the graduation was last week...

They also created a new record for each student FOR EACH YEAR OF STUDY. They did this by duplicating records and changing the year. Wanna change something across a post-doc student's history? That's potentially 10 years worth of massively duplicated data. And they didn't have a system for doing this in a managed way.

Oh, and the real kicker? Each department had rw access to this database, and they all had different interfaces and different limits on what they could and couldn't edit. Want to add a new table to reference a student's background records? Gotta remember to grant the right access to the right people - and if you forget? No checks. And if THEY forget (becuase, of course, they got no training - how do you train someone to nail jelly to the wall?) then potentially everyone's data's screwed.

Trust me, just because denormalisation is possible and - in some, specifically defined and understood circumstances - can actually lead to performance improvements, that doesn't mean we should say "Aha! Someone important said it doesn't matter! So it must be okay?"

I once logged into our server as root and didn't type in anything dangerous, though I was terrified. Does that mean I should start handing out root logins to all our clients as their FTP details? I mean, it would make the password managment process so much simpler, right?

Jeebus41 28 2004 7:41AM

As many others have said, you're completely missing the point here. Assuming that the only reason to normalize is to save disk space is like saying that the only reason you should write valid or semantically meaningful markup is to avoid code bloat.

Howard Fore04 28 2004 8:04AM

Data integrity isn't the only type of integrity that normalization makes easier. Over time nearly all projects (and their databases) change scope and will add tables and/or columns. When you represent the same fact in different tables that may have been added at different times (and perhaps by different people) you greatly increase the probability that what is called "UserID" in one table is called "UserCode" in another. And then it becomes real fun for the uninitiated to come up to speed.

Jenn53 28 2004 9:53AM

As for selects, I can't believe only two people have mentioned views. That seems like an obvious possibility for a solution here. Also, many have mentioned checking your indexes to make sure you have the right ones for the joins. This is good advice. Specifying the type of join might also help, or in radical cases, rethinking your schema completely. In short, before you think about becoming a renegade data cowboy and throwing 30 years of research to the wind, make sure you are aware of SQL's full capabilities. It can do more than just select-insert-update, you know. This is like saying you have a webpage with 10MB of images and you want to speed it up by removing all the whitespace from your HTML. Where is the real problem?

You must realize that denormalizing won't save you any time during inserts and updates, and will likely create extra overhead for those operations. Is this worth it? Before you say yes, test your database's performance, QUANTITATIVELY, before and after denormalizing. It all depends on your particular data and usage, but I think the cases where denormalizing helps are very specific and very rare. And in most of these cases the tiny performance gain isn't even worth the extra bugs and coding hassle.

John Blaze20 28 200410:20AM

If read the (MySQL) intro. to Database Normalization correctly, the author says that you must judge your self about the level of normalization your database requires. There are 5 levels and the author admits to only using the first two levels. I use PostgreSQL and like the MySQL tutorial author, usually normalize to the second level, then start usings the explain tool (MySQL has it also) and work from there...

Jason Wall35 28 200410:35AM

Just as long as we're talking about normalization, don't forget Boyce-Codd Normal Form! I haven't met a DBA yet who uses it. :)

This is the same argument we used to have about tables and css. There will always be purists, and those who simply don't care, and then there is the larger middle ground. Because in the real world, solutions are rarely perfect and as is the case with all design, the end result is usually a compromise between flexibility and speed.

I think Jason had a point in bringing up the fact that denormalization is a valid performance tuning technique. Its not often discussed. Its not so much the conclusion drawn here, but the fact that the discussion was published so nubies can find it later. ;)

greg49 28 200411:49AM

The answer to Cal's analogy is to quit smoking or die.

James MacKenzie52 28 200411:52AM

I am heavily in favor of normalization. Redundant data is problematic with time. You may remember the "real" records a week from now, but what about 6 months from now? Which is the "core" dataset? Can someone new look at the data model and begin to assess the data?

All things aside, knowing with 100% certainly that all unique values are, in fact, unique and all keys relate properly to data in related tables provides a level of certainty that is the whole point of storing data in a relational database. If you have normalized data, you can create complex queries at will without worry. Your results will be accurate and you will save a lot of time spent wondering about the accuracy of your results.

I manage an esoteric contract with an MS Access database (tables and a few stored procedures only) and an ASP front end. The size of the database is not large, but the relationships border on the bizarre at times. This is a reflection of how things are run at the management level and I have to mimic this in a structural way. Without hard-core normalization, I would have shot myself by now trying to roll out user modules. I can confidently use a variety of SQL statements and session variables to fire up all kinds mini-applications that keep my users happy (ish ;-). At times I am required to export data to spreadsheets and always there are questions. Any of these exports I have been able to move on-line in real time cause the questions disappear. This is a human factor. This is what I design for.

So I don't have experience with huge databases and can see a point where performance might be an issue, so I can't rule out de-normalization in these cases. If I had to do that, I would probably allow data retrieval and exporting from the de-normalized data, but never any updates to the core data itself. I may not be reading the issue correctly and I apologize if I have glossed over anyone's valid argument.

Note: I know a lot of you will scoff at MS Access, but by keeping the design tight and the file free of code, I can honestly say that most of you would be very surprised to find out how stable it can be. If you store tables only, you actually rarely, if ever, need to compact/repair your database, especially with a browser-based front end.

Andre Restivo19 28 200412:19PM

Normalization isn't about saving hard disk space, it's about minimizing redundancy.

It's about making sure you don't have information about the same thing in 2 places and with different values.

The lighter analogy doesn't address this issue at all. A better analogy would be having one calendar at home, another one in your car and yet another one at your office so you would always have one at hand. But you would never be sure which one has the correct information.

Per20 28 200412:20PM

Where the h**** is Jeremy when you need him anyway?

Wilhelm23 28 200412:23PM

Stewart Butterfield writes: "We're not using a database to store and manipulate relational data, we're using it to make an app run."

How totally stupid and unuseful can you get? Sure, we could play the 'my abstraction level is superior to yours' game all day ("no you're not, you're using it to allow people to share pictures with each other over the internet! neener neener!") but who cares?

The point of databases from a programmatic view is to provide an optimal data storage experience. In the experience of people who have implemented thousands more databases than you, including web sites on the Internet that are or were hundreds of times more popular than Flickr, data normalization is the best way unless (a) the db implementation is pretty bad (e.g., mysql), or (b) you need to do an optimization and you don't have the option of paying for a db service. Note that of the 58 mysql databases I've put up, only 5 required munging. Flickr-like functionality is utterly not within that realm.

Patrick46 28 200412:46PM

Isn't normalization kind of what the R in RDBMS is for? Normalization is pretty much inherent in the idea of a relational database schema. Denormalization is basically working against all the nifty tricks your RDMBS system can do with indexing, replication, referential integrity, and so on.

Well said. This is kind of a niche arguement, apparently, but lack of normalization makes relational analysis of data REALLY difficult. If you're into that kind of work, of course. For example, let an average user get a copy of Access, and they'll wind up building a database to store all kinds of neat stuff, like serial murder investigations and terrorist money laundering. Problem is, the data is often stored in such a nasty, denormalized, unrelational form that analysis, six degrees of Kevin Bacon style, is practically impossible. We have plenty of data, we're just powerless to do anything useful with it.

r. vaca31 28 2004 1:31PM

I'm glad Stewart somewhat clarified what they're doing over at Flickr. It appears that it's mostly senor Kottke's take on Cal's presentation that is the problem. There are fewer issues at stake here than this thread would immediately imply; most counterargument is directed to Jason's interpretation and hasty conclusion, (hopefully) rather than a haphazard implementation of a database architecture in use at Flickr.

True. This thread is mostly about DB guys defending their turf. I would advise Stewart to stay out since this isnt really about Flickr. :) Based on a set of slides and bare sketches, I don't think anyone here is qualified to judge further on whether they are doing the right thing. Cal is a smart guy and just based on his track record, what they are doing is probably the right decision.

Of course, denormalization is the right thing to do under certain circumstances. I think there is agreement on that too. The only "problem" is that Database Design is a very mature field and Jason stepped right in and made some uninformed comments which he presented as a revelation. Saying that normalization has anything to do with "disk space" is like saying, well, good web design has anything to do with disk space - its a complete non sequitur (Yes, I know its been mentioned thirty times in this thread...but thats the core of why everyone here is so agitated :) )

Chad Baker07 28 2004 2:07PM

Is there any way to normalize these comments?

JayBee07 28 2004 3:07PM

Is there any way to normalize these comments?

You'll have to normalise the commentators first ;)

Mark Beeson20 28 2004 3:20PM

I can't believe only two people have mentioned views.

That's what you get when professional dbas go slumming into Amateur Hour. Most of the neo-dbas around here think of SQL as insert/update/delete/select only because that's all MySQL can do. ;)

But it's good to see that sub-selects finally hit production grade in MySQL. ;)

And that stored procedures, views, pseudo-triggers, etc, for MySQL are planned for late 2005. ;)

Dave Fitch37 28 2004 4:37PM

Arguing about db design to a large degree seems to be missing the point: the real issue is data integrity, and how you manage it. You'd be surprised how much data is rubbish - converting stuff into information is a damn sight easier said than done.

Managing data, particularly ensuring that it is captured appropriately and then KEPT consistent, is often ignored until things are horribly broken. Nothing causes confusion like someone using out-of date data/classifications/structures etc. because they weren't concerned with data integrity/use.

The university example is great, and many of us can tell you similar horror stories...

Ziad47 28 2004 4:47PM

I think the main question is: do you want to use the same structure for the data that is produced everyday (I/U/D) and for the data that is accessed or published or searched ?
In my opinion, the two are meant for different audiences, and should be designed differently (for instance, you'd want to denormalize when building your fulltext indices)

Joey Jojo04 28 2004 5:04PM

My argurment would be that hard drive space may be getting cheaper, but CPU time is getting even CHEAPER, so why not normalize and let the CPU worry about it all. :)

Btw, I like boobies.

Donnie Millar26 28 2004 5:26PM

A quote I picked up from the MySQL Manual.

Normally, you should try to keep all data non-redundant (what is called "third normal form" in database theory). However, do not be afraid to duplicate information or create summary tables if necessary to gain more speed.

Brianiac02 28 2004 9:02PM

As someone who has had to work with a system for K-12 student records (that sounds just like the system JayBee mentioned), and a financial institution, I would ask that, rather than de-normalizing, simply punch the next person who has to interface to the database as hard as you can in the stomach. Let them know the alternative, and they will politely thank you.

chrissy41 29 2004 1:41AM

Wow. All those PhDs at Microsoft, Oracle, IBM and Sybase had it all wrong! Thanks for the enlightenment, tee.

Ben39 29 2004 4:39AM

Hmm, I'm not sure about this whole de-normalise your database idea. Sure, it might make sense in certain cases, but the slowness in query execution for complicated joins which occur as a consequence of a heavily normalised database are generally within the remit of the particular DBMS you're using; how they implement select queries will vary in speed an awful lot between, say, MySQL, PosgreSQL and MS SQL Server.

Plus as Dave Fitch remarks, removing redundant data is only one advantage of a normalised database; duplicated data increases the chances that you're not pulling out consistent or integrally assured data.

I think for the sort of massive systems being mooted, a far better approach would be to abandon relational databases completely and instead use serialised objects. Then you're really talking about massive speed increases.

MarkTAW23 29 200410:23AM

Dear Mr. Kottke.

We would like to ship you a lifetime supply of lighters, however we have 17 different address for you in our database. Can you please verify your shipping address with us?

Sincerely,
Denormalized Lighters, Inc.

Cory08 29 200411:08AM

It should also be said because denormalization increases select speeds, it is often used in data warehousing for the purpose of data mining and finding patterns. My understanding pretty much stops there; I haven't worked with any data warehouses myself.

Brooks15 29 2004 4:15PM

And as you're puffing away sitting in traffic, a lighter sitting in the bin between the front seats, your kid at home is burning down the house with a second, redundant lighter.

As many others have already pointed out, what happens when each location of the "same" data is no longer the same? For one reason or another there will be discrepancies and when that happens how much of your database can you continue to trust?

eric pan12 29 2004 5:12PM

I think it's interesting that Jason refers to this entire debate as a "religious discussion", as if the multitude of examples showing the horrible piles of crap that non-normalized data can be aren't enough to weigh in against his Just-Throw-More-Hardware-And-Brute-Force- At-A-Badly-Organized-Database-And-All-Will-Be-Well hypothesis.

Best of luck in your recovery, Dr. Kottke, but keep in mind that the dbas speaking out here are not adherent to a faith so much as they dislike flawed reasoning. (We could discuss how all science is religion, but that's somewhat beyond the scope of this thread.)

r. vaca47 29 2004 5:47PM

It's not just manageability. With a well distributed schema you can offload different services to different machines rather than relying on one box to do all your transactions. This is what Ebay discovered the hard way in 1999:

"Most of the meltdown was due to a monolithic application addressing everything in a single, monolithic database. One application held all the applications of eBay, with the exception of search.

But it didn't work."

Fazal Majid26 29 2004 8:26PM

MySQL AB's pronouncements on normalization rate the same level of credibility as their earlier ones on transactions ("who needs them?") and referential integrity. Not high at all.

Jock02 29 2004 9:02PM

Heck of a "religious discussion" when no one is disagreeing with anyone save Jason. ;-)

This thread is closed to new comments. Thanks to everyone who responded.

kottke.org

Front page
About + contact
Site archives

Subscribe

Follow kottke.org on Twitter

Follow kottke.org on Tumblr

Like kottke.org on Facebook

Subscribe to the RSS feed

Advertisement

Ads by The Deck

Support kottke.org shop at Amazon

And more at Amazon.com

Looking for work?

More at We Work Remotely

Kottke @ Quarterly

Subscribe to Quarterly and get a real-life mailing from Jason every three months.

 

Enginehosting

Hosting provided EngineHosting