
This just in: Uber rediscovers what all of us database people already knew. Structured data is usually way easier to compress, store, index, and query than unstructured blobs of text, which is why we kept telling you to stop storing JSON in your databases.


There's nothing wrong with storing json in your database if the tradeoffs are clear and it's used in a sensible way.

Having structured data and an additional json/jsonb column where it makes sense can be very powerful. There's a reason every new release of Postgres improves on the performance and features available for the json data type. (https://www.postgresql.org/docs/9.5/functions-json.html)
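
A rough sketch of that hybrid shape (table and column names are made up, and the snippet assumes psycopg2 against a local Postgres): stable, well-known fields get real typed columns, the long tail of attributes goes into a jsonb column.

    import psycopg2
    from psycopg2.extras import Json

    # Hypothetical hybrid schema: well-known fields as typed columns,
    # the variable long tail of attributes in a jsonb column.
    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id         bigserial PRIMARY KEY,
                user_id    bigint      NOT NULL,
                created_at timestamptz NOT NULL DEFAULT now(),
                payload    jsonb       NOT NULL DEFAULT '{}'
            )
        """)
        cur.execute(
            "INSERT INTO events (user_id, payload) VALUES (%s, %s)",
            (42, Json({"source": "mobile", "experiment": "checkout_v2"})),
        )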


> There's nothing wrong with storing json in your database if the tradeoffs are clear and it's used in a sensible way.

Of course. If there were, Postgres wouldn't even support it.

The GP's rant is usually aimed at people who default to JSON instead of thinking about it and maybe coming up with an adequate structure. There are way too many of those people.


Rigid structures and schemas are nice, but having document oriented data also has its advantages.


> having document oriented data also has its advantages.

As the VC said to the founder who wanted to do things the naive way: "well, I've seen quite a few people try that over the last few decades. It'll be interesting to see if you can make it work this time."


If you haven't seen document oriented data work you haven't been looking very hard.


And so are document storage databases.


It can't be. JSON has a huge structural problem: it's an ASCII representation of a schema+value list, where the schema is repeated with each value. It improved on XML because it doesn't repeat the schema twice, at least...
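
A quick way to see that per-value schema repetition (plain Python, sizes will vary): serialize the same 1000 rows once as an array of objects and once column-wise, then compare lengths.

    import json

    # Same data, two layouts: keys repeated in every row vs. keys stated once.
    rows = [{"user_id": i, "country": "FR", "latency_ms": 12.5} for i in range(1000)]
    row_oriented = json.dumps(rows)                                           # keys per element
    column_oriented = json.dumps({k: [r[k] for r in rows] for k in rows[0]})  # keys once
    print(len(row_oriented), len(column_oriented))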

It's nonsensical most of the time: use a table, and transform values on the way out of the DB or in the consumer.

The reason Postgres does it is that lazy developers overused JSON columns, then got fucked and said Postgres is slow (talking from repeated experience here). Yeah, searching a random unstructured blob is slow, surprise.

I don't dislike the idea of storing JSON and structured data together, but... you don't need performance then. Transferring a binary representation of a table and having a binary-to-object converter in your consumer (even Chrome) is several orders of magnitude faster than parsing strings, especially with JSON's vomit of the schema at every value.


> It's nonsensical most of the time

As usual, it comes down to being sensible about how to use a given tool. You can start with a json column and later, when access patterns become clear, split out specific keys that are often accessed or queried on into their own columns.
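
One way that split can look once a key turns out to be hot (table and column names are invented, via psycopg2):

    import psycopg2

    # Promote a frequently-queried key out of the jsonb blob into a real column.
    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("ALTER TABLE events ADD COLUMN IF NOT EXISTS source text")
        cur.execute("UPDATE events SET source = payload->>'source' WHERE source IS NULL")
        # Optionally drop the now-redundant key from the blob:
        cur.execute("UPDATE events SET payload = payload - 'source'")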

Another good use case is data that you want to keep around but don't need to query often: https://supabase.com/blog/audit

> Yeah searching in random unstructured blob is slow, surprise.

If your use case is to search or query on it often, then a jsonb column is the wrong choice. You can have an index on a json key, which works reasonably well, but I'd probably not put it in the hot path: https://www.postgresql.org/docs/current/datatype-json.html#J...
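
For the cases where you do need to query into the blob, two options (again with invented names): an expression index on one key, or a GIN index for containment queries.

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        # B-tree expression index on a single key:
        cur.execute(
            "CREATE INDEX IF NOT EXISTS events_payload_source_idx "
            "ON events ((payload->>'source'))"
        )
        # GIN index that serves containment queries like
        #   WHERE payload @> '{"source": "mobile"}'
        cur.execute(
            "CREATE INDEX IF NOT EXISTS events_payload_gin_idx "
            "ON events USING gin (payload jsonb_path_ops)"
        )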


Couldn't agree more. Not to mention timestamps, which JSON simply doesn't handle and which are essential for event data. The Raijin database has a clever approach to solving the schema rigidity problem: https://raijin.co


Yeah, it's silly; even when Spark is writing unstructured logs, that doesn't mean you can't parse them after the fact and store them in a structured way. Even if it doesn't work for 100% of the cases, it's very easy to achieve for 99% of them, in which case you still keep a "raw_message" column which you can query as text.
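
A rough sketch of that after-the-fact parsing (the regex is an assumption about one plausible log layout, not Spark's actual format): extract the common fields, and keep the original line in raw_message for whatever doesn't match.

    import re

    LINE = re.compile(
        r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
        r"(?P<level>[A-Z]+) (?P<source>\S+): (?P<message>.*)$"
    )

    def parse(line: str) -> dict:
        # Always keep the raw line; add structured fields when the pattern matches.
        row = {"raw_message": line}
        m = LINE.match(line)
        if m:
            row.update(m.groupdict())
        return row

    print(parse("2023-05-01 12:00:01 WARN TaskSetManager: Lost task 3.0 in stage 7.0"))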

Next up: Uber discovers column oriented databases are more efficient for data warehouses.


Word! Storing JSON is so often the most direct and explicit way of accruing technical debt: "We don't really know what structure the data we'll get should have, just specify that it's going to be JSON"...


I like to say that when you try to make a "schemaless" database, you've just made 1000 different schemas instead.


Yeah, "Schemaless" is a total misnomer. You either have "schema-on-write" or "schema-on-read".


Schemaless means there’s no assurance that the stored data matches any consistent schema. You may try to apply a schema on read, but you don’t know if the data being read will match it.


"schema in code" covers all bases.


But if you’re not storing data as JSON, can you really say you’re agile? /s


Look, we'll just get it in this way for now, once it's live we'll have all the time we need to change the schema in the background


We don’t have a use case yet, but let’s just collect all the data and figure out what to do with it later!

It’s funny how these cliches repeat everywhere in the industry, and it’s almost impossible for people to figure this out beforehand. It seems like everyone needs to deal with data lakes (at scale) at least once in their life before they truly appreciate the costs of the flexibility they offer.


The Data Exhaust approach is simultaneously bad and justifiable. You should measure what matters, and think about what you want to measure and why, before collecting data. On the other hand, collecting data in case what you want to measure changes later is usually a lowish-cost way of maybe already having the right data on hand when that happens.


Oh I agree, that's why I was careful to put "at scale" in there -- these types of approaches are typically good when you're still trying to understand your problem domain, and have not yet hit production scale.

But I've met many a customer that's spending 7-figures on a yearly basis on data that they have yet to extract value from. The rationale is typically "we don't know yet what parameters are important to the model we come up with later", but even then, you could do better than store everything in plaintext JSON on S3.


You can't realistically expect every log format to get a custom schema declared for it prior to deployment.


If you never intend to monitor them systematically, absolutely!

If you're a bit serious you can at least impose a date, time to the millisecond, a pointer to the source of the log line, a level, and a message. Let's be crazy and even say the message could have a structure too, but I can feel the weight of effort on your shoulders, so let's say you've already saved yourself the embarrassment a colleague of mine faced when he realized he couldn't give me millisecond timestamps, making a latency calculation over past data impossible.
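
With Python's stdlib logger, that baseline is a few lines of config (the exact format string is just one choice):

    import logging

    # Millisecond timestamp, level, pointer to the emitting source line, then the message.
    logging.basicConfig(
        format="%(asctime)s.%(msecs)03d %(levelname)s %(pathname)s:%(lineno)d %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
        level=logging.INFO,
    )
    logging.info("request handled")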


Sorry if I was ambiguous before. When I said "log format", I was referring to the message part of the log line. Standardized timestamp, line in the source code that emitted the log line, and level are the bare minimum for all logging.

Keeping the message part of the log line's format in sync with some external store is deviously difficult particularly when the interesting parts of the log are the dynamic portions that can take on multiple shapes.


I think it's important to note the difference between unstructured and schemaless. JSON is very much not a blob of text. One layer's structured data is another layer's opaque blob.


It's the same! Look, the syntax of JSON does not impose a structure (what fields, what order, whether I can remove a field), making it dangerous for any stable parsing over time.


It is pretty well known at this point by much of the industry that Uber has the same promo policy incentives as Google. That’s what happens when you ape google.


This just in: “I knew that already hahaha morons” is still as unhelpful and uninteresting a comment as ever.


Maybe people do things like that because:

- their application parses and generates JSON already, so it's low-effort.

- the JSON can have various shapes: database records generally don't do that.

- even if it has the same shape, it can change over time; they don't want to deal with the insane hassle of upgrade-time DB schema changes in existing installations

The alternative to JSON-in-DB is to have a persistent object store. That has downsides too.


It's possibly a hassle to change shape in a DB, but have you lived through changing the shape of data in a JSON store where historical data is important?

You either don't care about the past and can't read it anymore, version your writer and reader each time you realize an address in a new country has yet another frigging field, or parse each JSON value to add the new shape in place in your store (sketched below).

JSON doesn't solve the problem of shape evolution, but it tempts you very strongly into thinking you can ignore it.
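
If the store happens to be a Postgres jsonb column, that in-place option can look something like this (table, column, and key names are made up):

    import psycopg2

    # Backfill a newly-required key into every historical document so old
    # and new values share one shape.
    conn = psycopg2.connect("dbname=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            UPDATE customers
            SET payload = jsonb_set(payload, '{address,province}', '""'::jsonb)
            WHERE payload ? 'address'
              AND NOT payload->'address' ? 'province'
        """)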


You do database migrations or you handle data with a variable shape. Where do you want to put your effort? The latter makes rollbacks easier at least.


But you don't care as much about rollbacks (we rarely roll back a successful migration, only for prod issues in the next few days, and we always prepare for it with a reverse script) as you care about the past: can you read data from 2 years ago? That can matter a lot more, and you must know it's never doable with unstructured data: the code evolves with the new shape, week after week, and you're stuck unearthing the old consumer code to understand the past. It's never perfect, but the belief that data doesn't need structure, nor a well-automated history of migrations, is dangerous.

I've seen myself, anecdotally, that most of the time with JSON data the past is ignored during the ramp-up to scale, and then once the product is alive and popular, problems start arising. Usually the cry is "why the hell don't we have a dumb DB rather than this magic dynamic crap". I now work in a more serious giant corporation where dumb DBs are the default, and it's way more comfortable than I was led to believe when I was younger.


Postgres has excellent JSON support; it's one of my favorite features, and it's a nice middle ground between having a schema for absolutely everything and standing up MongoDB beside it because it's web scale. In our case, we leverage JSON Schema in our application for a big portion of our JSON data and it works great.
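
Roughly what that can look like with Python's jsonschema package (not necessarily what they use; the schema and field names are invented): validate the payload in application code before it ever hits the jsonb column.

    from jsonschema import validate, ValidationError

    EVENT_SCHEMA = {
        "type": "object",
        "required": ["source", "occurred_at"],
        "properties": {
            "source": {"type": "string"},
            "occurred_at": {"type": "string", "format": "date-time"},
            "tags": {"type": "array", "items": {"type": "string"}},
        },
    }

    def is_valid(payload: dict) -> bool:
        try:
            validate(instance=payload, schema=EVENT_SCHEMA)
            return True
        except ValidationError:
            return False

    print(is_valid({"source": "mobile", "occurred_at": "2023-05-01T12:00:00Z"}))  # True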


MongoDB actually doesn't store JSON. It stores a binary encoding of JSON called BSON (Binary JSON) which encodes type and size information.

This means we can encode objects in your program directly into objects in the database. It also means we can natively encode documents, sub-documents, arrays, geo-spatial coordinates, floats, ints and decimals. This is a primary function of the driver.

This also allows us to efficiently index these fields, even sub-documents and arrays.
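
A small round-trip with the bson module that ships in the PyMongo package shows the type preservation (field names here are arbitrary):

    import bson  # part of the PyMongo distribution
    from datetime import datetime, timezone

    doc = {
        "user_id": 42,
        "score": 3.14,
        "created_at": datetime.now(timezone.utc),
        "tags": ["a", "b"],
    }
    raw = bson.encode(doc)                        # bytes, as stored / sent on the wire
    print(type(bson.decode(raw)["created_at"]))   # still a datetime, not a string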

All MongoDB collections are compressed on disk by default.

(I work for MongoDB)


Thanks for the clarification. I appreciate the info, and all-in-all I think MongoDB seems really good these days. Though I did have a lot of problems with earlier versions (which, I acknowledge have mostly been resolved in current versions) that have kinda soured me on the product and I hesitate to reach for it on new projects when Postgres has been rock-solid for me for over a decade. I wish you guys all the best in building Mongo.

Getting back to my main point, though, minus the bit of sarcasm: a general rule of thumb that has served me well is that if you already have something like PostgreSQL stood up, you can generally take it much further than you might initially think before having to complicate your infrastructure by setting up another database (not just MongoDB, but pretty much anything else).


Modern SQL engines can index JSON, though. And it can be structured.



