He's missing the other half of why 'select *' is bad, which really is about the star: selecting columns by name automatically makes almost all incompatible schema changes cause the query to fail (in an easily detected place, for an obvious reason), while still allowing non-breaking schema changes, like adding or dropping columns you don't use, to behave exactly as before.
The other nice benefit is that, when you list the fields you're retrieving, it functions like a mini-reference as you write/modify your query, so you don't have to keep switching to the schema to see if it's called "id" or "row_id", if it's "package_name" or "pkgname"...
To me, "SELECT *" almost feels like using your variables without declaring them first -- not technically wrong in many languages, but still uncomfortable-feeling.
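That failure mode is easy to reproduce. Here's a sketch using Python's sqlite3 as a stand-in for any engine (the accounts table and its columns are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, name TEXT, balance REAL);
    INSERT INTO accounts VALUES (1, 'alice', 10.0);
""")

# Simulate an incompatible schema change: rebuild the table without 'name'.
conn.executescript("""
    ALTER TABLE accounts RENAME TO accounts_old;
    CREATE TABLE accounts (id INTEGER, balance REAL);
    INSERT INTO accounts SELECT id, balance FROM accounts_old;
    DROP TABLE accounts_old;
""")

# SELECT * still "runs" -- it just silently returns a different shape.
star_rows = conn.execute("SELECT * FROM accounts").fetchall()

# The explicit column list fails loudly, right at the query, for an obvious reason.
error = None
try:
    conn.execute("SELECT id, name FROM accounts")
except sqlite3.OperationalError as e:
    error = str(e)
```

star_rows comes back without complaint, while the named-column query raises "no such column: name" — exactly the easy-to-detect failure described above.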
I don't understand the var hatred in c#. Why type a type name twice? I can understand a function result needing to be explicitly typed but why do this?
StringBuilder stringBuilder = new StringBuilder();
That is horrendously redundant. Var is handy shorthand and should be used!!
'select *' also doesn't always do what you expect.
I recall a problem with some Oracle views that were based on 'select *' queries. What we didn't realize is that the '*' is evaluated when the view is compiled, and if you later added columns to the underlying tables, the view did not automatically include them.
The only thing is you need to be careful about where the contract is.
There's a huge difference between an application executing a sql statement:
SELECT * FROM accounts;
and a UDF like:
CREATE OR REPLACE FUNCTION accounts__list_all()
RETURNS SETOF accounts
LANGUAGE SQL AS
$$
SELECT * FROM accounts ORDER BY account_no;
$$;
In the latter case the select * reduces maintenance points in your db, because you have already tied the return type to the table type. Either you use the * or you list all columns manually, and the latter, since you'd have to repeat the full list every time, makes any schema change incompatible with your API.
Can't believe this partially informed article got on HN. The problem with * is, as you say, that when the schema changes, queries that name their columns will still work, while * combined with access-by-column-number will fail, and not always in a detectable way. Article is BS.
The main danger with mixing SELECT * and ORMs when you only need some particular columns is the performance hit.
Not just on the database itself. You also need to ship surplus data to your application and then instantiate objects with surplus fields. All of this takes bandwidth, RAM and time.
I also once saw someone write idiomatic Ruby that performed a sum over an ActiveRecord .each. In dev, nobody noticed. In production it pulled a 9-million-row table across the wire and iterated over 9 million Ruby objects, all to sum up a few hundred KB of actual data that the database could have summed in a few hundred ms.
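The same trap is easy to show without Rails. A sketch with Python's sqlite3, where a toy orders table stands in for the 9-million-row one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(i * 0.5,) for i in range(10_000)])

# The .each-style version: drag every row into the app, sum it there.
app_side = sum(total for _id, total in conn.execute("SELECT * FROM orders"))

# The sane version: the database sums, and exactly one row comes back.
(db_side,) = conn.execute("SELECT SUM(total) FROM orders").fetchone()
```

Both yield the same number; the difference is that the first materializes every row in the application while the second ships back a single aggregate, which is the work the database was built to do.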
The problem is hoping that ORMs will make those icky databases go away. They don't. They can't.
I guess you could stretch and say he means "myth" like you might use "lore" or "mythology" as in the collected beliefs of a group or people, but that's stretching it, if you ask me.
I'm not a DBA, but I really like reading articles about common pitfalls and mistakes, to broaden my knowledge base. Something like "'SELECT *' and your ORM might be using it for you" would have been a much more informative title to me, without dancing around the clearly intended use of the word "myth," and I personally would have still read it, without expecting content completely at odds with what I would say is the natural interpretation of the title.
His argument seems to be that it's a "myth" because you could do the same thing by explicitly listing out the columns. Which in no way makes the former a myth, and it is just pure clickbait, as you said.
It gets much worse than just single table grab-alls, though. I once worked at a shop where they insisted upon prolifically using "select just about everything from every relational table, comprehensively joined to the n-th level" views as some sort of confused notion about code reuse, such that everything would then select from these master views. The end result was that it was incredibly frustrating trying to resolve performance problems, because literally everything was a performance problem. It was one of those instances where I shook my fists at every naive developer railing about purported premature optimizations.
"Don't go out of your way to optimize prematurely" does not mean "architect your software in the opposite direction to industry common sense". "We will optimize later if needed" does not work at architecture level. That's why you prototype before committing yourself.
"Don't go out of your way to optimize prematurely" does not mean
It doesn't mean anything, because both optimize and premature are in the opinion of the viewer. There is no consensus on this at all, but it has been my experience that when those famous words are uttered, badness is about to occur.
It seems to me that in discussions of NoSQL vs relational databases, one of the main arguments that comes up regarding ease of design and maintenance is "you can just use an ORM". But I see major disadvantage after major disadvantage of using an ORM, and there seems to be some overlap with the disadvantages of NoSQL (by which I am primarily referring to document-oriented databases). I really can't tell, once you factor these disadvantages in, what comes out on top.
The ORM I really want just takes my SQL and a connection to my database at compile or load time, and wires it up to the type system of the language I'm using. All the ones I've tried just seem to embrace the awfulness of the low-level database access APIs and are just trying to replace the good-enough DSL that is SQL with something worse.
In that regard, while not actually an ORM per se, one of the best experiences I've ever had with SQL integration in a language was SAP's ABAP, where column types and data types are (almost?) the exact same thing.
ORMs are just like databases really. There's a good way and there's a bad way.
It's not the ORM that causes problems, it's all about the mindset of thinking that you don't have to worry about anything because the ORM will have your back anyway. I use ActiveRecord in Rails every day and there's a lot of things it just does right. It abstracts certain things I wouldn't want to constantly have to worry about and gives me a fine-grained enough layer of control over the queries that I can actually get all the expressiveness and power I need from the records.
Like they say, don't throw the baby out with the bathwater, my friend.
I don't think one is inherently superior to the other, it just depends on the data. If you have highly structured data for which integrity is mission-critical, you want to be using a RDBMS. If you have unstructured data for which throughput is mission-critical, you'll probably benefit from using a NoSQL system.
e.g. Rails' ActiveRecord allows you to select columns easily in most cases:
User.where(..) # all fields
User.where(..).select(:name, :id) # only two of them
joe.boards.select(:title) # only one in join
joe.boards.pluck(:title) # only column value instead of full objects with one attribute set
Though I recall it's more involved for #belongs_to relations.
You should also be able to set up the select as a default per class, rather than per query.
I suspect, I likely didn't understand your question, could you expand it?
It's possible I'm just ignorant of the capabilities of most ORMs. My only real experience is with the Django ORM. AFAIK, if you select an object, it wants to make sure all the members of the object are populated.
.NETs Entity Framework lets you do this using the fantastic LINQ syntax. Commonly you will say which objects to select, then give a projection into a new anonymous type with only the columns you want.
- returning more data than you need over a network connection can be a bottleneck. An ORM, or Java code written by a programmer who doesn't care, may fetch everything when only a few columns are needed. Tables can be very wide, especially when CLOB/BLOBs are involved
- Stored procedures, functions and triggers that use SELECT * may break horribly when columns are added or removed from tables, views or materialised views
- Table types (collection and record types) will break if any column redefinition occurs
- It shows a lack of understanding of the data by the developer. This may, in turn, cause further issues
Proper schema design and understanding of the data involved are key when working with databases. The existence of NoSQL, the general dumbing down of Computer Science degrees, and the teaching of ORMs have caused endless problems. The need for NoSQL in the first place was possibly caused by the use of high-level programming languages where you didn't need to know the data: you just fetch it all, piss around with it in ruby/whatever, then fetch it again and again until you're done.
I use select * a lot in PostgreSQL. I do this any time I have a relatively self-contained table which can't be further normalized very far, and CRUD queries wrapped behind a stored procedure which returns the table type.
The reason I use select * there is that it reduces a maintenance point when it comes to return types and tables. In that case, I am guaranteed to get a return type which matches the table type. If the two are closely linked, then select * makes a lot of sense, and the alternatives aren't going to do better.
This makes schema changes easier to absorb than having to update the column list in every single query on top of the changes at the higher levels of code.
I would generally agree that an application calling select * from table is a bad thing for the same reasons, but if you are encapsulating your db behind an API, a lot of reasons shift.
I have the opposite preference. If there are a big bag of breakages when I change the schema, at least I see where they are. I'm not a fan of relying solely on tests when the DB itself can notice introduced discrepancies.
The fundamental question is where and what you are doing.
As I say, I use select * a lot, but these are in two groups:
1. In UDF's to ensure the proper return type, and
2. Against UDF's where it is the UDF designer's job to maintain the software interface contract.
In those cases, select * works, but it only does so because you have dependency injection and can decide on a case-by-case basis whether to pass the changes up, wrap them in a view, or the like. Most of the time this is a simple choice. The app needs a new column that we are storing and so.... But in a multi-app environment there may be other choices, and that dependency injection makes all the difference in the world.
EDIT: for example, if you have a UDF returning the type of a given table, then you may want to use select * because you have already decided these are to be tied together. Similarly select * from my_udf() is not a bad thing the way that select * from my_table is.
There used to be a bug in either MySQL or SQL server (can't recall which anymore) where if you used select * in a view and then adjusted the schema the data would be out of line with the columns.
SQL Server behaviour is: if you have a "select *" view and then add more columns to the underlying table, the columns will not show up in the view until the view is refreshed.
That's not too bad, and it does not get the column names mixed up.
That's also the way it works in PostgreSQL. I think the view is actually compiled into a parse tree with named columns, and the view given a defined type iirc.
(Select * from ... in PostgreSQL is actually really useful in anything object-relational, because you get objects of a named type back, btw.)
That's the way it works now. In earlier versions any added columns would still be retrieved, shifting all the other column values to the right, so all the data would wind up having the wrong column names.
Indeed I did - thanks. I just remembered it was the version before 2008, as I left the company in the middle of the upgrade to that one.
The columns being renamed on views bit us big time. We had a large ERP package that we'd built numerous custom views on top of. Whenever we applied an update to the ERP package, it'd add columns and all our views would be borked.
Why is this article on HN? Come on. Select * is just a construct. When is it bad? That would be a better discussion. In any case the article demonstrates a lack of fundamental understanding of relational database technologies such as a clustered index... hm wait, I'm taking the bait. Read a good book on relational databases. SQL for Smarties will get you started.
Is select count(1) from mytable faster than select count(*) from mytable?
It is bad because in a column store each column will be stored in a different location, resulting in seeks on disk or cache misses in memory. In a row store the complete row will often be in a contiguous block of memory. It gets worse when the column store uses dictionary compression and the dictionary is not already cached, since you need one more memory access to a random location. For wide tables the overhead is noticeable and the application should really only retrieve what is necessary.
select * can also be bad if you have any sort of system that expects or uses your columns by index. someone changes up your database, adds a column or some such and all of a sudden your consuming code breaks all over the place.
You can also experience severe performance degradation (both server and client) if someone adds an absolutely huge binary column.
Perhaps an example: a checking_account_register table. Hmm, let's add a new feature: after a paper check is cancelled, we'll scan it into an image file and stick it in the database. Suddenly you get a giant TIFF for each row returned, surprise! Of course a better spot might be a separate scan table linking checks to images of the check (perhaps multiple images, multiple scan attempts, multiple sides of the check, and all that), but for the sake of argument, etc.
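A quick sketch of that surprise, again with Python's sqlite3 (table name, column names, and the ~1 MB scan size are all made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE checks (check_no INTEGER, amount REAL, scan BLOB)")
# Someone "helpfully" added a scanned-image column: about a megabyte per row.
conn.execute("INSERT INTO checks VALUES (?, ?, ?)",
             (101, 49.95, b"\x00" * 1_000_000))

# SELECT * now drags the megabyte of image data along with every row...
wide_row = conn.execute("SELECT * FROM checks").fetchone()

# ...while the explicit column list is unaffected by the new column.
narrow_row = conn.execute("SELECT check_no, amount FROM checks").fetchone()
```

The code that names its columns keeps returning a few bytes per row; every SELECT * caller silently starts paying for the scans whether it wants them or not.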
I think probably more databases get killed by processing load via no WHERE or LIMIT clause than from using a * as a column list... probably. With a close second of gathering way too much data and weeding it out in a HAVING.
Of course, adding large binary blobs to a relational database is almost certainly a bad idea anyway. It’s just not meant to be. You can achieve transactionality by other means and it won’t actually make managing and backing up data easier because suddenly database backups are an awful lot larger.
If you have text fields in your columns and you don't actually need them, SELECT * is indeed bad, as it will cause temporary tables to be created, even if indexes exist for the query.