Lyokolux's blog

About data storage and gathering everything

I read “Design Patterns that encourage junk data” on css-irl.info (great post, by the way) and it reminded me of thoughts I had about data. I want to go in depth, as this topic particularly concerns me.

By default, web services create data and store it indefinitely. A service with temporary data is the exception. The JS Bin example in that post shows this clearly: a lot of data is generated, about 130 GB of text (JavaScript, HTML and CSS), and all of it is kept.

If data can be stored, it is treated as if storage were unlimited. The creator of the data doesn’t see the impact. It’s not for nothing that the term ROT was coined. I wonder: what about all this data generated and stored by default? It feels wrong.

ROT Data

It’s hard to put a name to this behavior. I wanted a word for this limitless data, and I came across the concept of ROT: Redundant, Obsolete or Trivial data. A related concept, dark data, covers useless data or data that doesn’t add value. So there are different kinds of data, and they bring different value. Here we focus on the kinds that have little value.

First, the information may already exist, so there is no need to duplicate it. This is the R of ROT. In this case the data is directly disposable: I don’t need a duplicate of the same picture on my phone, yet it happens with apps that copy untouched files into their own directory. I take Adobe Fill & Sign as an example because it comes to mind. They have their reasons to do so, but it also wastes storage. It could be more efficient.
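To illustrate the redundant case, here is a minimal sketch in Python that spots byte-for-byte copies by hashing file contents. The file names and contents are hypothetical, and real apps may have good reasons to keep their own copy; this only shows how such duplicates can be detected.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    # Keep only the groups with more than one name: those hold redundant copies.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical file names and contents, for illustration only.
files = {
    "DCIM/photo.jpg": b"raw image bytes",
    "some_app/cache/photo.jpg": b"raw image bytes",
    "DCIM/other.jpg": b"different bytes",
}
duplicates = find_duplicates(files)
```

Each group in `duplicates` lists paths that hold the same content, so all but one of them could be dropped or replaced by a reference.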

Secondly, data may expire and stop being relevant. This is the O of ROT. Is it still useful to remember that you connected to this service 10 years ago? In most cases, no, though some services still keep it for legal reasons. Perhaps you logged on last year, and that is the relevant entry. The value of a piece of data can therefore expire over time, or when another piece of data supersedes it. Some of it can be saved for history purposes, but it remains disposable. That history can be built on storage that is better adapted or more efficient than what we currently use. Cold data storage exists for a reason. In most cases, the data expires and we won’t need it again.
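A retention rule for obsolete data can be sketched in a few lines of Python. The `drop_expired` helper, the user names and the two-year retention period are all assumptions made for illustration; a real service would pick a period that matches its legal obligations.

```python
from datetime import datetime, timedelta, timezone


def drop_expired(
    last_logins: dict[str, datetime], now: datetime, keep_for: timedelta
) -> dict[str, datetime]:
    """Return only the entries that are younger than the retention period."""
    cutoff = now - keep_for
    return {user: seen for user, seen in last_logins.items() if seen >= cutoff}


# Hypothetical login records, for illustration only.
now = datetime(2024, 2, 24, tzinfo=timezone.utc)
last_logins = {
    "alice": datetime(2023, 6, 1, tzinfo=timezone.utc),   # logged on last year
    "bob": datetime(2014, 2, 24, tzinfo=timezone.utc),    # ten years ago
}
kept = drop_expired(last_logins, now, keep_for=timedelta(days=2 * 365))
```

Only the recent login survives; the ten-year-old one is discarded, or could be moved to cold storage instead.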

Thirdly, data can be obvious, so we don’t need to store it. This is the T of ROT. Storing the number of letters in a name on its own does not bring much: it can be recomputed from the name at any time. Yet such data exists, and it cannot be reused for anything else.

Preventing ROT data

Good relational database designs have one thing in common: they prevent ROT data. So we can reuse their techniques in our case, web services and users’ data.

Relational databases avoid duplication to keep the data as small as possible. The only duplication allowed is for backups and safety. There is only one source of truth: no useless duplication, no ROT.
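As a small sketch of the single source of truth, here is a hypothetical two-table schema using Python’s built-in `sqlite3` module. The table and column names are made up for the example: the author’s name is stored once and posts only reference it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE posts (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    );
    """
)
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO posts (title, author_id) VALUES (?, ?)",
    [("First post", 1), ("Second post", 1)],
)

# Renaming the author touches one row; every post sees the change.
conn.execute("UPDATE authors SET name = 'Ada L.' WHERE id = 1")
rows = conn.execute(
    "SELECT posts.title, authors.name FROM posts "
    "JOIN authors ON authors.id = posts.author_id ORDER BY posts.id"
).fetchall()
```

Had the name been copied into every post, the rename would have needed one update per copy, with the risk of leaving stale duplicates behind.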

Another powerful design is to store raw data from which several useful values can be derived. Let’s take the obvious example of dates. All date information can be derived from a raw timestamp such as 1708805421455 (milliseconds since the Unix epoch): the year, the month, the day, the hour, the minutes, even the seconds or milliseconds! All this information sits in one timestamp. It matches 2024-02-24T20:10:21.455Z in ISO format, or Saturday, the 24th of February at 8:10 PM (UTC). Look at how useful this timestamp is: we can format the date information any way we want. Web browsers and apps use it for custom displays that match user preferences or app settings.
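The derivation can be checked with a short Python sketch. It assumes the timestamp counts milliseconds since the Unix epoch and works in UTC; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_MS = 1708805421455  # the raw value from the text

# Split into whole seconds and leftover milliseconds to avoid float rounding.
seconds, millis = divmod(TIMESTAMP_MS, 1000)
moment = datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)

# Every field below is derived; none of it needs to be stored.
parts = {
    "year": moment.year,
    "month": moment.month,
    "day": moment.day,
    "hour": moment.hour,
    "minute": moment.minute,
    "second": moment.second,
    "millisecond": moment.microsecond // 1000,
}

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
weekday = WEEKDAYS[moment.weekday()]
iso = moment.isoformat(timespec="milliseconds").replace("+00:00", "Z")
```

Here `iso` is the ISO string and `weekday` the day name, both recomputed on demand from the single stored number.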

Such derived data is disposable.

A compressed version of an image can be deleted: we still have the original. The same can apply to social networks, where posts get buried and no one reads them anymore. They no longer bring value and are often superseded by newer posts.

Sharing resources and information can also avoid duplication. Is a local copy needed? Could a simple service deliver the rarely needed data instead? It always depends on the needs, but the question is rarely considered. Not every company or institution has to hold a complete copy of everything, as long as there are accessible backups. Yet that is what happens today.

Yes, the subject has been ROT up to this point, but there is more. And yes, I found the term while searching for a word as I wrote this post. That led me to data lifetime. The concept is similar to lifetimes in Rust: how long does data live? This feeds all my thoughts: how long does data live, and how long should it?

This is already long enough. It is not good enough for me, but I want to publish, so let’s think about data lifetime another time. It’s better to write and publish than to wait months.