
It's less Google and more Google engineers, and from that standpoint it is totally normal for them to kick the living daylights out of any code, particularly Google code.

Spark streaming really feels operationally immature compared to a lot of other stream processing frameworks, even Dataflow. The criticism is both unsurprising and warranted.



What other stream processing frameworks do you prefer in place of Spark Streaming?

Storm is pretty mature, but did not play very well with YARN last I tried. Flink looks pretty good, but is fairly new. Samza is another one. I'm curious to know if you have any specific issues with Spark Streaming's operational immaturity. Some things I don't like in general are:

# Backpressure algo is fairly new, but pluggable
# Does not handle stragglers very well, in spite of backpressure - this is due to treating everything as batch
# Events from the system are not very rich and cannot be customized
# Error handling is very unclear, and does not offer a lot of flexibility

In spite of these shortcomings, it has pretty good Kafka integration, mostly uses the same paradigm as batch, and plays well with Hadoop infrastructure. That makes it a decent choice for many use cases.
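To make the straggler point concrete, here is a toy model (plain Python, not Spark code; every function name and number in it is made up for illustration). It assumes a micro-batch only completes when its slowest task does, and compares that with a simplified record-at-a-time view where only the work routed to the slow task is delayed.

```python
def micro_batch_latency(task_times):
    """Under the batch model, the batch finishes only when its slowest task does."""
    return max(task_times)


def mean_task_latency(task_times):
    """Simplified record-at-a-time view: each task's output is available as soon
    as that task finishes, so only work on the slow task waits for it."""
    return sum(task_times) / len(task_times)


def simulate(num_tasks=8, normal=1.0, straggler=10.0):
    """Hypothetical workload: all tasks take `normal` seconds except one straggler."""
    times = [normal] * (num_tasks - 1) + [straggler]
    return micro_batch_latency(times), mean_task_latency(times)


if __name__ == "__main__":
    batch, mean = simulate()
    print(f"micro-batch completes after {batch}s; mean per-task latency is {mean}s")
```

In this sketch one slow task out of eight drags the whole batch to the straggler's latency, while the average task finished much earlier. Real schedulers (speculative execution, backpressure) complicate the picture, so treat it as intuition only.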


Storm has the community around it, though it shows signs of decline (perhaps the release of YARN-friendly Heron will help). There are a lot of suitors for developer affection. Flink is relatively green. Samza looks good, but is still picking up momentum in the community. DataTorrent seems to have some momentum with its Apache Apex.

Regardless, most of these stream processing frameworks are still very much in their early days and lack a lot of the sophistication you find in custom in-house systems (such as found at... Google ;-). The open source world will no doubt catch up and overtake those systems, but right now there is still enough of a gap that it is rather painful.

Spark streaming pains:

1. Backpressure & stragglers. Duh.
2. Setup & tear down is still rough, even compared to Storm.
3. The whole context singleton thing means you need a new VM for each job, which annoys the #@$@#$ out of me.
4. Error handling isn't just unclear, it's kind of disastrous.
5. You can feel its "batch" heritage in lots of places, not just the stragglers. For some that is a feature, for me, a bug, even though with Storm I use Trident.
6. When a job runs amuck, it's a pain to recover from it. Storm is no picnic either, but it is indeed better.



