Having spent the last ~8 months at my work grappling with the consequences and downsides of a Data lake, all I want to do is never deal with one again.
Nothing about it was superior or even on par with simply fixing the shortcomings of our existing OLAP database setup.
The data lake is not faster to write to; it’s definitely not faster to read from. Querying with Athena and the like was slow, painful to use, and broke exceedingly often, and it would have meant so much work stapling schemas back on that we would have been net better off just doing things properly from the start and using a database.
The data lake also does not have better access semantics and our implementation has resulted in some of my teammates practically reinventing consistency from first principles. By hand. Except worse.
Save yourself from this pain: find the right database and figure out how to use it, don’t reinvent one from first principles.
Completely agree with you. Data lakes were marketed well because, well... data warehousing is hard, and a lot of work. Data lakes don't make that hard work disappear; they just change how and where it happens.
I've found data lakes complement DW's (in databases) well. Keep the raw data in the lake and query as needed for discovery, and load it into structured tables as the business needs arise.
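That discovery-then-promote pattern is simple enough to sketch. A toy version in plain Python, with a list of JSON lines standing in for the raw lake files and sqlite3 standing in for the warehouse (the records and fields here are made up for illustration):

```python
import json
import sqlite3

# "Data lake": raw, schemaless JSON-lines records, kept as-is.
raw_records = [
    '{"user": "alice", "action": "login", "ts": "2021-01-01T10:00:00"}',
    '{"user": "bob", "action": "purchase", "ts": "2021-01-01T10:05:00", "amount": 42.0}',
]

# "Warehouse": a structured table, created once the business need is clear.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user TEXT, action TEXT, ts TEXT, amount REAL)")

# Promotion step: parse the raw records and load only the fields we've
# decided to model, defaulting the ones that are missing.
for line in raw_records:
    rec = json.loads(line)
    db.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (rec["user"], rec["action"], rec["ts"], rec.get("amount")),
    )

# The structured table now supports ordinary SQL for everyone downstream.
rows = db.execute("SELECT user, action FROM events ORDER BY user").fetchall()
print(rows)  # [('alice', 'login'), ('bob', 'purchase')]
```

The point is that the schema work still happens; the lake just lets you defer it until you know which questions the business actually asks.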
I don't think anyone ever suggested that. The use case for a data lake is precisely the one you describe: it allows you to start collecting data without having to do a lot of work ahead of time, before you know how you actually want to structure things. It allows for schema evolution too. It's not a panacea, it's just a way to avoid the inertia most large data projects have.
Nobody here suggested it, just something I see organizations doing quite often.
(edit: the rationale behind this tends to be that you can avoid the heavy lifting of ETL/transformation logic by just using a data lake - obviously not the case, as most of us know)
I've worked on nearly a dozen Data Lakes. I have never seen nor heard of anyone who said that Data Lakes meant you could avoid ETL. If anything it has necessitated more of it as users expect to join these disparate data sets.
There is, after all, a reason the Data Engineer role became popular just as Data Lakes did.
No. Data lakes were marketed well because they are significantly cheaper and solve long standing problems.
S3 is basically free and has unlimited scalability. Oracle, DB2, HANA, SQL Server etc are ridiculously expensive and struggle under high concurrent load even with QoS in place.
If you're able to solve the problems that you were previously using oracle or SQL Server for with S3, more power to you, but the truth is that to replicate the functionality of that old Oracle server you'll start with S3, but you'll also want some querying (Aurora? RDS? Hbase?), probably some analytics and ingestion (Redshift? Kinesis? Elastic? Hive? Oozie? Airflow?), along with some security now that you've got multiple tools interacting (Ranger? Knox?), probably some load balancing (Zookeeper?), maybe some lineage and data cataloging (Atlas?), etc.
In my experience what starts with "Just throw some data in S3, forget that old crusty expensive server!" ends with 22 technologies trying to cohesively exist because each one provides a small but necessary slice of your platform. Your organization will never be able to find one person who is an expert in all of these (on the contrary, you can find an Oracle, or DB2, or SQL Server expert for half the money) so you end up with seven folks who are each an expert in three of the 22 pieces you've cobbled together, but they all have slightly different ideas on how things should work together, so you end up with a barely functioning platform after a year's worth of work because you didn't want to just start with a $400k license from Oracle.
I think the presumption that's differing here is query workload.
An OLAP database is, in the default case, an always-online instance or cluster, costing fixed monthly OpEx.
Whereas, if your goal in having that database is to do one query once a month based on a huge amount of data, then it will certainly be cheaper to have an analytical pipeline that is "offline" except when that query is running, with only the OLTP stage (something ingesting into S3; maybe even customers writing directly to your S3 bucket at their own Requester-Pays expense) online.
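The economics of that trade-off fit on the back of an envelope. A sketch with hypothetical numbers (the $5/TB-scanned figure matches Athena-style pricing; the always-on instance cost is an assumption for illustration):

```python
# Hypothetical numbers for illustration only.
always_on_monthly = 700.0  # assumed cost of a modest always-on OLAP instance, $/month
per_tb_scanned = 5.0       # Athena-style pay-per-query pricing, $/TB scanned
tb_per_query = 2.0         # data scanned by the big monthly report

# One heavy query per month on the "offline" pipeline:
queries_per_month = 1
offline_cost = queries_per_month * tb_per_query * per_tb_scanned
print(offline_cost)  # 10.0 -- vs 700.0/month for the always-on cluster

# Break-even: how many such queries per month before always-on wins?
break_even = always_on_monthly / (tb_per_query * per_tb_scanned)
print(break_even)  # 70.0
```

Under these assumptions, the always-on database only pays for itself once the query workload is steady and frequent; for the once-a-month scan, pay-per-query storage wins by a wide margin.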
My biggest problem with Oracle is not the database itself. There is no doubt that Oracle is a fine piece of software, and is bullet proof, and has decades of experience built into it.
My problem is the scalability and elasticity of its licensing model. It doesn't meet the needs of today's analytics without spending enormous amounts of money up front.
Nope. One can start easily with Airflow + Spark (EMR) + Presto + S3 and get about 80% of what you'd get from your run-of-the-mill Oracle database. At a fraction of the price, without half the headache in procurement, licensing, or performance tweaking. And with better scalability.
You'd be looking at $M in licenses for anything half-serious based in Oracle tech. Becoming good at replacing Oracle stuff probably has been one of the best paying jobs for a while.
S3 is just storage. It doesn't provide any querying, crawling, metadata, provenance, or other details required for data at scale.
That's why AWS has entire product suites from Athena, Redshift Spectrum, Data Lake Formation, Glue, etc to help companies actually do something with the files stored in S3. And it's often a mess compared to just fixing their processes and ingesting it properly into a SQL data warehouse first.
For smaller use cases data lakes probably don't make sense.
But data lakes have arisen from the enterprise where the centralised data warehouse was the standard for the last few decades. They know how to use a database. They know how to model and schema the data. And they know about all of the problems it has. They didn't buy into the data lake concept because it's trendy.
Fact is that for large enterprises, and for those with problematic data sets (e.g. telemetry), databases simply don't scale. You will always have priority workloads (e.g. reporting) during which users and non-priority ETL jobs come second. And often Data Science use cases are banned altogether.
The reason data lakes make sense is because it is effectively unlimited scalability. You can have as many crazy ETL jobs, inexperienced users, Data Scientists all reading/writing at the same time with no impact.
Generally you want a hybrid model. Databases for SQL users and data lake for everything else.
> Generally you want a hybrid model. Databases for SQL users and data lake for everything else.
I do a mix of data science and software engineering, dealing with the datalake is a nightmare and I avoid it at almost all costs.
You know what the first thing everyone I worked with wanted to do after pointlessly pouring everything into the black hole that was the datalake? Re-implement some kind of SQL (and database semantics) back on top of it again; except now it's worse.
This doesn’t make any sense in the context of tools like Snowflake and BigQuery, where the allocation of compute is separated from the data itself. You can scale each of these use cases independently without cross-domain impact.
The data lake model seems to be more about not wanting to commit to a warehouse (for example: future proofing, looking at non-relational data, etc.).
> The reason data lakes make sense is because it is effectively unlimited scalability. You can have as many crazy ETL jobs, inexperienced users, Data Scientists all reading/writing at the same time with no impact.
Eh, almost all Data Lakes cannot handle small files well. All it takes is for someone to write 100 million tiny files into the Data Lake to make life miserable for everyone else.
Every time I've seen someone do this it was a mistake and quickly resolved. Either you have way too many partitions in a Spark job or you are treating S3 like it's a queue. And if you really do need lots of delta records then just simply have a compaction job.
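For reference, the compaction idea is simple: merge many tiny files into a few files of a target size. In practice you'd do this with a Spark repartition-and-rewrite job; here is a toy stdlib-Python version, with local files standing in for S3 objects (names and sizes are made up):

```python
import tempfile
from pathlib import Path

def compact(src_dir: Path, dst_dir: Path, target_bytes: int) -> int:
    """Merge many tiny record files into a few files of ~target_bytes each.

    Returns the number of output files written.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    batch, size, n_out = [], 0, 0
    for f in sorted(src_dir.glob("*.txt")):
        data = f.read_bytes()
        batch.append(data)
        size += len(data)
        if size >= target_bytes:
            # Batch reached the target size: write one compacted file.
            (dst_dir / f"part-{n_out:05d}.txt").write_bytes(b"".join(batch))
            batch, size, n_out = [], 0, n_out + 1
    if batch:  # flush the final partial batch
        (dst_dir / f"part-{n_out:05d}.txt").write_bytes(b"".join(batch))
        n_out += 1
    return n_out

# Demo: 100 tiny "delta" files become 4 bigger ones.
tmp = Path(tempfile.mkdtemp())
src, dst = tmp / "raw", tmp / "compacted"
src.mkdir()
for i in range(100):
    (src / f"delta-{i:03d}.txt").write_bytes(b"x" * 10)  # 10-byte delta files
n = compact(src, dst, target_bytes=250)
print(n)  # 4
```

Same total bytes, a fraction of the object count — which is the whole trick, since the per-file open/list overhead is what kills readers.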
Well, inexperienced users/data scientists tend not to care about what they write out :)
Nevertheless, my point is that a Data Lake does not offer free unlimited scalability. It takes a lot of effort and good engineering practice to make a Data Lake run smoothly at scale.
Data scientists shouldn't be writing anything to the Data Lake. Data Lakes store raw datasets (sort of like Event Streaming databases store raw events.) In academic terms, they store primary-source data.
Once data has been through some transformations at the hands of a Data Scientist, it's now a secondary source—a report, usually—and exists in a form better suited to living in a Data Warehouse.
Data Lakes need a priesthood to guard their interface, like DBAs are for DBMSes. The difference being that DBAs need to guard against misarchitected read workloads, while the manager of a Data Lake doesn't need to worry about that. They only need to worry about people putting the wrong things (= secondary-source data) into the Data Lake in the first place.
In most Data Lakes I've seen, usually there are specific teams with write privilege to it, where "putting $foo in the Data Lake" is their whole job: researchers who write scrapers, data teams that buy datasets from partners and dump them in, etc. Nobody else in the company needs to write to the Data Lake, because nobody else has raw data; if your data already lives in a company RDBMS, you don't move it from there into the Data Lake to process it; you write your query to pull data from both.
An analogy: there is a city by a lake. The city has water treatment plants which turn lakewater into drinking water and pump it into the city water system. Let's say you want to do an analysis of the lake water, but you need the water more dilute (i.e. with fewer impurities) than the lake water itself is. What would you do: pump the city water supply into the lake until the whole lake is properly dilute? Or just take some lake water in a cup and pour some water from your tap into the cup, and repeat?