Myth: Select * is Bad (use-the-index-luke.com)
87 points by fatalmind on Aug 25, 2013 | hide | past | favorite | 67 comments


He's missing the other half of why 'select *' is bad, which really is about the star: Selecting columns by name automatically makes almost all incompatible schema changes cause the query to fail (in an easily detected place, for an obvious reason), while still allowing non-breaking schema changes like adding or dropping columns you don't use to have exactly the same behavior as before.
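The fail-fast behavior is easy to demonstrate. Here's a minimal sketch using Python's sqlite3 with a hypothetical `users` table, where the incompatible schema change is simulated by rebuilding the table without one column:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 'a@example.com')")

# Simulate an incompatible schema change: the email column goes away.
conn.execute("ALTER TABLE users RENAME TO users_old")
conn.execute("CREATE TABLE users AS SELECT id, name FROM users_old")

# SELECT * silently adapts: the query still "works", it just returns
# fewer columns -- the breakage surfaces somewhere downstream.
star_row = conn.execute("SELECT * FROM users").fetchone()

# The explicit column list fails immediately, at the query, for an
# obvious reason: no such column.
try:
    conn.execute("SELECT id, name, email FROM users")
    explicit_error = None
except sqlite3.OperationalError as exc:
    explicit_error = str(exc)

print(star_row)        # (1, 'alice')
print(explicit_error)  # no such column: email
```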


I came here to say the exact same thing.

The other nice benefit is that, when you list the fields you're retrieving, it functions like a mini-reference as you write/modify your query, so you don't have to keep switching to the schema to see if it's called "id" or "row_id", if it's "package_name" or "pkgname"...

To me, "SELECT *" almost feels like using your variables without declaring them first -- not technically wrong in many languages, but still uncomfortable-feeling.


To me, SELECT * is like var in C#


I don't understand the var hatred in C#. Why type a type name twice? I can understand a function result needing to be explicitly typed, but why do this?

StringBuilder stringBuilder = new StringBuilder();

That is horrendously redundant. Var is handy shorthand and should be used!!


To provide an alternative view, in e.g. Haskell the redundancy would be solved the other way round:

    (stringBuilder :: StringBuilder) <- construct
(Assuming that there was a type class that had construct as a method and StringBuilder was an instance of that class.)


When declaring and initializing an object with a constructor, yes, it's annoying to type the type twice, and I definitely use `var`.

But what if you're doing something like

    var x = myOtherObjectName.y;
Readability is destroyed in this case.


'select <star>' also doesn't always do what you expect.

I recall a problem with some Oracle views that were based on 'select <star>' queries. What we didn't realize is that the '*' is evaluated when the view is compiled, so if you later add columns to the underlying tables, the view does not automatically include them.

Edit: OK I don't know how to include a literal star in text here.
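For what it's worth, the star-freezing is engine-specific. A sketch with Python's sqlite3, which stores a view as raw SQL text and re-expands the star the next time the view is prepared (unlike the Oracle behavior described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.execute("CREATE VIEW v AS SELECT * FROM t")
conn.execute("INSERT INTO t VALUES (1, 2)")

before = conn.execute("SELECT * FROM v").fetchone()

# Add a column to the underlying table after the view was created.
conn.execute("ALTER TABLE t ADD COLUMN c INTEGER")

# SQLite re-expands the view's star against the current schema,
# so the new column shows up; Oracle froze the column list instead.
after = conn.execute("SELECT * FROM v").fetchone()

print(before)
print(after)
```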


I'm going to guess you need to backslash-escape your stars.

EDIT: Nope, that wasn't it.

shakes fist at crazy home-grown not-quite-Markdown


It works if you put a * space * around your stars.

  Or you can make a *literal* line by putting two spaces before it.


Try typing 3 stars in a row: * :).


No luck


The following works without issues, for whatever reason: 'select *'


The only thing is you need to be careful about where the contract is.

There's a huge difference between an application executing a sql statement:

    SELECT * FROM accounts;
and a UDF like:

    CREATE OR REPLACE FUNCTION accounts__list_all()
    RETURNS SETOF accounts
    LANGUAGE SQL AS
    $$
       SELECT * FROM accounts ORDER BY account_no;
    $$;
In the latter case, the select * reduces maintenance points in your db, because you have already tied the return type to the table type. Either you get to use the *, or you list all columns manually; and the latter, given the requirement to list every column every time, makes any schema change incompatible with your API.


Can't believe this partially informed article got on HN. The point of the problem with * is, as you say, that when the schema changes, named-column queries will still work, while * with numbered columns will fail, and not always in a detectable way. Article is BS.


The main danger with mixing SELECT * and ORMs when you only need some particular columns is the performance hit.

Not just on the database itself. You also need to ship surplus data to your application and then instantiate objects with surplus fields. All of this takes bandwidth, RAM and time.

I also once saw someone write idiomatic ruby that performed a sum over an ActiveRecord .each. In dev, nobody noticed. In production it pulled a 9 million row table across, iterated over 9 million ruby objects, all to sum up a few hundred kb of actual data that the database could do in a few hundred ms.
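The cure for that class of bug is pushing the aggregate down to the database. A sketch with Python's sqlite3 and a hypothetical orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(i,) for i in range(1, 101)])

# The ActiveRecord-.each shape: ship every row over, materialize an
# object per row, then sum in application code.
naive_total = sum(row[1] for row in conn.execute("SELECT * FROM orders"))

# Pushing the aggregate down: one row comes back; the database does the work.
(db_total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()

print(naive_total, db_total)  # 5050 5050
```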

The problem is hoping that ORMs will make those icky databases go away. They don't. They can't.


He says it's a myth, then admits it isn't a myth.

In conclusion: Clickbait for DBAs


The original HN title, before it was "corrected" was "It's not about the star, stupid", which was less clickbaity.


Less clickbaity only because it was arguably meaningless in isolation. Was it an article about astronomy? Or perhaps online ratings systems?


I guess you could stretch and say he means "myth" like you might use "lore" or "mythology" as in the collected beliefs of a group or people, but that's stretching it, if you ask me.

I'm not a DBA, but I really like reading articles about common pitfalls and mistakes, to broaden my knowledge base. Something like "'SELECT *' and your ORM might be using it for you" would have been a much more informative title to me, without dancing around the clearly intended use of the word "myth," and I personally would have still read it, without expecting content completely at odds with what I would say is the natural interpretation of the title.


His argument seems to be that it's a "myth" because you could do the same thing by explicitly listing out the columns. Which in no way makes the former a myth, and it is just pure clickbait, as you said.

It gets much worse than just single table grab-alls, though. I once worked at a shop where they insisted upon prolifically using "select just about everything from every relational table, comprehensively joined to the n-th level" views as some sort of confused notion about code reuse, such that everything would then select from these master views. The end result was that it was incredibly frustrating trying to resolve performance problems, because literally everything was a performance problem. It was one of those instances where I shook my fists at every naive developer railing about purported premature optimizations.


"Don't go out of your way to optimize prematurely" does not mean "architect your software in the opposite direction to industry common sense". "We will optimize later if needed" does not work at architecture level. That's why you prototype before committing yourself.


"Don't go out of your way to optimize prematurely" does not mean

It doesn't mean anything, because both "optimize" and "premature" are in the eye of the beholder. There is no consensus on this at all, but it has been my experience that when those famous words are uttered, badness is about to occur.


The more you implement techniques like master objects, the sooner you get to play with sharding!

Mortgage-driven development:

http://codemanship.co.uk/parlezuml/blog/?postid=147


It seems to me that in discussions of NoSQL vs relational databases, one of the main arguments that comes up regarding ease of design and maintenance is "you can just use an ORM". But I see major disadvantage after major disadvantage of using an ORM, and there seems to be some overlap with the disadvantages of NoSQL (by which I am primarily referring to document-oriented databases). I really can't tell, once you factor these disadvantages in, what comes out on top.


The ORM I really want just takes my SQL and a connection to my database at compile or load time, and wires it up to the type system of the language I'm using. All the ones I've tried just seem to embrace the awfulness of the low-level database access APIs and are just trying to replace the good-enough DSL that is SQL with something worse.


http://www.haskell.org/haskellwiki/MetaHDBC

Does exactly that. Runs your SQL against the database at compile time to get type information, which it injects into the program AST.


The closest I've seen an ORM come to this is Squeryl: http://squeryl.org


Which has been superseded by Slick (http://slick.typesafe.com/) and does exactly as the parent says.


In that regard, while not actually an ORM per se, one of the best experiences I've ever had with SQL integration in a language was SAP's ABAP, where column types and data types are (almost?) the exact same thing.


If you are in C# land, Dapper does a good job of letting you write SQL while handling the messy put-result-sets-into-objects part of the equation.


It's 100% possible to use SQLAlchemy to do that in Python.


ORMs are just like databases really. There's a good way and there's a bad way.

It's not the ORM that causes problems, it's all about the mindset of thinking that you don't have to worry about anything because the ORM will have your back anyway. I use ActiveRecord in Rails every day and there's a lot of things it just does right. It abstracts certain things I wouldn't want to constantly have to worry about and gives me a fine-grained enough layer of control over the queries that I can actually get all the expressiveness and power I need from the records.

Like they say, don't throw the baby out with the bath water, my friend.


I don't think one is inherently superior to the other; it just depends on the data. If you have highly structured data for which integrity is mission-critical, you want to be using an RDBMS. If you have unstructured data for which throughput is mission-critical, you'll probably benefit from using a NoSQL system.


I'm sure someone has tried, but has anyone done a good job on an ORM that actually lets you select which columns to load and only selects those?


Don't most ORMs allow that?

e.g. Rails' ActiveRecord allows you to select columns easily in most cases:

    User.where(..) # all fields
    User.where(..).select(:name, :id) # only two of them
    joe.boards.select(:title)  # only one in join
    joe.boards.pluck(:title)   # only column value instead of full objects with one attribute set
Though I recall it's involved for #belongs_to relations.

You should also be able to set up the select as a per-class default, rather than per query.

I suspect I likely didn't understand your question; could you expand on it?


It's possible I'm just ignorant of the capabilities of most ORMs. My only real experience is with the Django ORM. AFAIK, if you select an object, it wants to make sure all the members of the object are populated.

My apologies if it was a dumb question. :)



By default yes, but like most ORMs you can ask to populate only some members: https://docs.djangoproject.com/en/dev/ref/models/querysets/#...


Django offers a couple of options:

    .only(… field list…)
    .defer(… fields to fetch on demand…)
https://docs.djangoproject.com/en/dev/ref/models/querysets/#...


here you go: http://use-the-index-luke.com/sql/join/hash-join-partial-obj...

It's really a PITA.

Recently, I've learned about http://jooq.org/ — not having checked in detail yet, it might be easier there.


SQLAlchemy - it does everything the right way.

http://www.sqlalchemy.org/


.NET's Entity Framework lets you do this using the fantastic LINQ syntax. Commonly you will say which objects to select, then give a projection into a new anonymous type with only the columns you want.


It would be much nicer to use if C# had structural typing for the anonymous types, so you could usefully return them.


Crikey.

SELECT * can be bad for the following reasons:

- returning more data than you need over a network connection can be a bottleneck. The ORM / Java program written by a programmer who doesn't care may only need a few columns. Tables can be very wide, especially when CLOB/BLOBs are involved

- Stored procedures, functions and triggers that use SELECT * may break horribly when columns are added or removed from tables, views or materialised views

- Table types (collection and record types) will break if any column redefinition occurs

- It shows a lack of understanding of the data by the developer. This may, in turn, cause further issues

Proper schema design and understanding of the data involved is key when working with databases. The existence of NoSQL and the general dumbing down of Computer Science degrees, plus the teaching of ORMs, have caused endless problems. The need for NoSQL in the first place was possibly caused by the use of high level programming languages where you didn't need to know the data - you just fetch it all, then piss around with it in ruby/whatever, then fetch it again & again until you're done.

Rant over ;)


I use select * a lot in PostgreSQL. I do this any time I have a relatively self-contained table which can't be normalized much further, and CRUD queries wrapped behind a stored procedure which returns the table type.

The reason I use select * there is that it reduces a maintenance point when it comes to return types and tables. In that case, I am guaranteed to get a return type which matches the table type. If the two are closely linked, then select * makes a lot of sense, and the alternatives aren't going to do better.

This makes schema changes more compatible than trying to change the column everywhere in every query in addition to the higher levels of code.

I would generally agree that an application calling select * from table is a bad thing for the same reasons, but if you are encapsulating your db behind an API, a lot of reasons shift.


I have the opposite preference. If there are a big bag of breakages when I change the schema, at least I see where they are. I'm not a fan of relying solely on tests when the DB itself can notice introduced discrepancies.


The fundamental question is where and what you are doing.

As I say, I use select * a lot, but these are in two groups:

1. In UDF's to ensure the proper return type, and

2. Against UDF's where it is the UDF designer's job to maintain the software interface contract.

In those cases, select * works, but it only does so because you have dependency injection and can decide on a case-by-case basis whether to pass the changes up, wrap them in a view, or the like. Most of the time this is a simple choice. The app needs a new column that we are storing and so.... But in a multi-app environment there may be other choices, and that dependency injection makes all the difference in the world.

EDIT: for example, if you have a UDF returning the type of a given table, then you may want to use select * because you have already decided these are to be tied together. Similarly select * from my_udf() is not a bad thing the way that select * from my_table is.


Ah, well. I guess my point was about selecting from tables, which is still the common case.

I presume that if you find yourself picking columns out of UDFs, it's a code smell that the UDF needs rethinking.


There used to be a bug in either MySQL or SQL server (can't recall which anymore) where if you used select * in a view and then adjusted the schema the data would be out of line with the columns.


SQL Server, and the 'feature' was still present in version 2003; not sure about versions after that.


SQL Server behaviour is: if you have a "select *" view and then add more columns to the underlying table, the columns will not show up in the view until the view is refreshed.

That's not too bad, and it does not get the column names mixed up.


That's also the way it works in PostgreSQL. I think the view is actually compiled into a parse tree with named columns, and the view given a defined type iirc.

(Select * from ... in PostgreSQL is actually really useful in anything object-relational, because you get objects of a named type back, btw.)


That's the way it works now. In earlier versions any added columns would still be retrieved, shifting all the other column values to the right, so all the data would wind up under the wrong column names.


I've been using MS Sql from version 7, through 2000, 2005, 2008 etc, and I never saw that bug.


There was no SQL Server 2003. Do you mean 2005?


Indeed I did - thanks. I just remembered it was the version before 2008, as I left the company in the middle of the upgrade to that one.

The columns being renamed on views bit us big time. We had a large ERP package that we'd built numerous custom views on top of. Whenever we applied an update to the ERP package, it'd add columns and all our views would be borked.


Why is this article on HN? Come on. Select * is just a construct. When is it bad? That would be a better discussion. In any case, the article demonstrates a lack of fundamental understanding of relational database technologies, such as a clustered index... hm wait, I'm taking the bait. Read a good book on relational databases. SQL for Smarties will get you started.

Is select count(1) from mytable faster than select count(*) from mytable?


Bad in column-oriented databases like Vertica.


It is bad because in a column store each column will be stored in a different location, resulting in seeks on disk or cache misses in memory. In a row store the complete row will often be in a contiguous block of memory. It gets worse when the column store uses dictionary compression and the dictionary is not already cached, since you need one more memory access to a random location. For wide tables the overhead is noticeable and the application should really only retrieve what is necessary.


SELECT * can be bad when you join other tables. For example:

  SELECT
      *
  FROM table1
  JOIN table2
      ON table2.field = table1.field
In the above example you select everything from both table1 and table2. When they contain fields with the same name, strange things can happen.

Therefore always use: SELECT table1.* FROM table1
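A sketch of the collision with Python's sqlite3, using two hypothetical tables that both define id and field columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, field TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER, field TEXT)")
conn.execute("INSERT INTO table1 VALUES (1, 'x')")
conn.execute("INSERT INTO table2 VALUES (99, 'x')")

# SELECT * over the join yields two ambiguous 'id' columns.
cur = conn.execute(
    "SELECT * FROM table1 JOIN table2 ON table2.field = table1.field")
names = [d[0] for d in cur.description]
print(names)  # ['id', 'field', 'id', 'field']

# Qualifying the star keeps the result set unambiguous.
cur = conn.execute(
    "SELECT table1.* FROM table1 JOIN table2 ON table2.field = table1.field")
qualified = [d[0] for d in cur.description]
print(qualified)  # ['id', 'field']
```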


select * can also be bad if you have any sort of system that expects or uses your columns by index. Someone changes up your database, adds a column or some such, and all of a sudden your consuming code breaks all over the place.
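A sketch of that failure mode with Python's sqlite3: code that trusts column position quietly reads the wrong data once a column lands in the middle of the table (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'hello')")

# Consuming code that trusts column position:
payload = conn.execute("SELECT * FROM events").fetchone()[1]
print(payload)  # 'hello'

# Someone rebuilds the table with a column added in the middle...
conn.execute("DROP TABLE events")
conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT, payload TEXT)")
conn.execute("INSERT INTO events VALUES (1, '2013-08-25', 'hello')")

# ...and the same code now silently reads the wrong column.
payload = conn.execute("SELECT * FROM events").fetchone()[1]
print(payload)  # '2013-08-25'

# Naming the column keeps the contract intact.
(named,) = conn.execute("SELECT payload FROM events").fetchone()
print(named)  # 'hello'
```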


> if you have any sort of system that expects or uses your columns by index.

Of course if you have that sort of system, select * is the least of your worries.


You can also experience severe performance degradation (both server and client) if someone adds an absolutely huge binary column.

Perhaps an example: a checking_account_register table. Hmm, let's add a new feature: after a paper check is cancelled, we'll scan it into an image file and stick it in the database. Suddenly you get a giant TIFF for each row returned, surprise! Of course a better spot might be a separate scan table linking checks to an image of the check (perhaps multiple images, multiple scan attempts, multiple sides of the check, and all that), but for the sake of argument, etc.

I think probably more databases get killed by processing load via no WHERE or LIMIT clause than from using a * as a column list... probably. With a close second of gathering way too much data and weeding it out in a HAVING.


Of course, adding large binary blobs to a relational database is almost certainly a bad idea anyway; it's just not what it's meant for. You can achieve transactionality by other means, and it won't actually make managing and backing up data easier, because suddenly database backups are an awful lot larger.


If you have text fields in your tables and you don't actually need them, SELECT * is indeed bad, as it will cause temporary tables to be made, even if indexes exist for the query.


Thanks for sharing your experience with Firefox OS.


What's the context to this? It doesn't make sense to me in relation to the article, but maybe I'm missing something.



