Hacker News
Facebook/CMU Covid-19 U.S. county-by-county symptom map (fb.com)
93 points by bookofjoe on April 20, 2020 | hide | past | favorite | 54 comments


These kinds of maps always end up far less enlightening than I'd hoped. Maybe because of noise from counties with low populations? Like, "Oh, there's elevated levels in Finney County, Kansas. Hmm."

Anyway, I find this map a lot more useful; it lets you filter by population, show counts per capita, and limit counts to the last 7 days:

https://rchern.github.io/covid-19/


Well, to be fair, they are measuring something different, precisely because the confirmed-positive counts are skewed by the scarcity of testing. Since that doesn't seem to be getting fixed, other approaches are needed. So a discrepancy versus the number of confirmed positives is one of the things you might look for in this data as a sign that more investigation is needed. It's not intended to be an accurate accounting of covid-19 cases: "The estimates can be helpful for policymakers and health researchers to forecast potential COVID-19 outbreaks. These estimates don’t represent confirmed COVID-19 cases and shouldn’t be used for diagnostic or treatment purposes, or guidance on personal or business travel."


This is one of the best maps I have seen yet if you switch it to "Hospital Referral Region". It gives a much more accurate picture of the medical system than a map showing three counties that might all be served by the same major hospital.


I was going to write a paragraph about insights I could glean from the map, only to discover while writing that white == no data.


Thanks!


I'm doing some map stuff right now on a project and it's cool that FB is using OSM for their data. Here's a recent article on their public embrace of OSM: https://www.engadget.com/2019-07-23-facebook-opens-up-its-ai...


They collected this data in part from a little survey-request modal at the top of your facebook timeline main page thingy, so just be aware that it's not at all representative of the population as a whole. Old people, young people, and people who read Hacker News are less likely to have answered than would be expected.


Depends on how you define "old". My personal, observation-driven wild guess would be that the 40-65-year-old demographic is overrepresented in anything presented by Facebook. They are by far the most active there.


I dug through the source code and extracted the JSON files and converted them to a tabular/CSV format for use outside. https://observablehq.com/@thadk/facebook-covid-survey-data-f...

If you're one of the creators, please release the data files for this kind of thing. Google started by releasing only PDFs of its mobility data; that data was scraped with much effort, and only later did they begin releasing official CSVs.


If you read that recent random-sampling study from Stanford, the numbers are probably 50x-85x higher.


The result in the Stanford paper is very likely biased due to the false positive rate of the test they used:

https://statmodeling.stat.columbia.edu/2020/04/19/fatal-flaw...


I've found the false positive rate of the normal test to be 10% and chest CT's to be 6%. https://abc7news.com/coronavirus-test-free-testing-update-ac....

Shouldn't we use this one simply because its 1%-2% false positive rate is lower than those of the tests we're using to make policy right now?


But that's a better rate than a lot of the normal covid tests. I haven't done the math entirely, but a 98.3%-99.9% accurate test doesn't seem to imply a 50x-85x jump. If anyone can write out the proof, that would be great.
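One standard way to write out that proof is the Rogan-Gladen correction, which backs an estimated true prevalence out of a raw positive rate given the test's sensitivity and specificity. This is a minimal sketch; the rates below are illustrative placeholders loosely echoing the figures in this thread, not the study's actual parameters:

```python
# Rogan-Gladen correction: estimate true prevalence from a raw positive
# rate, given the test's sensitivity and specificity.
# All numbers here are illustrative, not the study's actual values.
def corrected_prevalence(raw_rate, sensitivity, specificity):
    fpr = 1 - specificity
    # Clamp at zero: if false positives alone can explain every positive
    # result, the data is consistent with a true prevalence of 0%.
    return max(0.0, (raw_rate - fpr) / (sensitivity - fpr))

raw = 0.015  # e.g. ~1.5% of the sample tests positive

# At 99.9% specificity the corrected estimate is meaningful; at 98.3%
# the false positive rate swallows the signal entirely.
for spec in (0.999, 0.983):
    p = corrected_prevalence(raw, sensitivity=0.90, specificity=spec)
    print(f"specificity {spec:.1%}: estimated prevalence {p:.2%}")
```

The size of the "Nx higher than confirmed cases" multiplier then depends almost entirely on which specificity you assume, which is why the range quoted is so wide.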

Also, this article has one weird thing I picked up on, and maybe someone can verify it: they said 100 people died in Santa Clara County, but I could only find 73 possible covid-19 deaths, not all of which have been verified. This is based on April 17th numbers.


I don't know the specific numbers but this jump is definitely possible.

Let's say the false-negative rate is 0% and the false-positive rate is 2%. For the sake of argument, say 0.05% of people actually have Covid. If you tested 10,000 people, 5 would have it in reality, but your testing would show about 205 positives.
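The arithmetic above can be checked with a short sketch (the 0%, 2%, and 0.05% figures are the hypothetical ones from this comment, not measured values):

```python
# Expected number of positive tests given true prevalence and error rates.
# Inputs are the hypothetical numbers from the comment above.
def expected_positives(n, prevalence, false_pos_rate, false_neg_rate):
    true_cases = n * prevalence
    healthy = n - true_cases
    true_positives = true_cases * (1 - false_neg_rate)   # sick, caught
    false_positives = healthy * false_pos_rate           # healthy, flagged
    return true_positives + false_positives

result = expected_positives(10_000, prevalence=0.0005,
                            false_pos_rate=0.02, false_neg_rate=0.0)
print(round(result))  # 5 real cases, but ~205 positive tests
```

Almost all of the 205 positives are false positives from the healthy 9,995, which is the whole problem with low-prevalence screening.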


That doesn't seem right. If it says 5 people had it, then a false positive would only mean there's a 2% likelihood that a given test is wrong. It's not saying there's a 2% likelihood that 10,000 people are sick; it's saying there's a 98% chance the test diagnoses them right and a 2% chance it doesn't.

This is better than the false positive rates of other tests, like the normal tests with their 10-15% false positive rate. https://abc7news.com/coronavirus-test-free-testing-update-ac...

Also this


And the FB data is likely an underestimate too, because many people don't feel comfortable answering honestly, under their real name, to a company/university that could be coerced by the government into sharing their identity (which could mean fines, jail, or quarantine).


Stanford should completely retract that study. It has produced far more misleading interpretations than anything useful.


I don't think they've interpreted anything; it falls in line with other random-sampling tests. Without this data we could be making the wrong decisions, especially with countries doing widely different things. Places like Sweden, Korea, and Taiwan without quarantine, and places like Germany or the US with quarantine, should want to know which strategy works best. That isn't at all clear right now.


Why? Even if it's not totally random (they used Facebook ads), surely the data still tells us something?


Not necessarily. Their population survey had 1.5% positive results. The false positive rate of the test is not known for certain but I believe the 95% confidence interval goes up to 1.7%. If that were the case, all the positive tests could even be false positives.

We did learn from the study that the population rate isn't outrageously high, like 30%, but I'm not sure what more we can confidently say beyond that.


What are typical false positive rates for other tests/studies?


China's false positives for Covid tests are 30%, and chest CT has a false positive rate of 6%: https://www.advancedsciencenews.com/a-more-sensitive-test-fo....

So these tests are more accurate than either of the more accurate tests used to diagnose Covid-19, from what I've found.


The people who chose to respond to this ad were probably a very biased sample of the population:

https://twitter.com/foxjust/status/1251270848075440133/photo...


Wouldn't that be true of anyone participating in a study? Something about this isn't making sense to me. If a study is voluntary, isn't it biased towards those who want to participate?


Rather, we should repeat that kind of study in every state, every week, and improve as we go.


The sampling was not random; they got their participants from a Facebook ad.

https://www.medrxiv.org/content/10.1101/2020.04.14.20062463v...


So, about as random as this Facebook survey then?


No idea if Facebook did that, but they do know the demographic profile of responders, so they could correct the data to make it more statistically sound.




Yes, in that neither of them are random


They appear to be trying to estimate the percentage who had symptoms at the time of the survey date, while the Stanford study was estimating the percentage who've had the virus (even if asymptomatic) at any time in the past.


isn't this also random sampling?


For this data to be useful, I think we first need to establish that people who use Facebook and participate in its surveys represent a relatively random sample of the general population. Or adjust for all the ways in which it does not.

Otherwise, it's likely biased.


They address this:

“To help CMU measure results, Facebook shares a single statistic known as a weight value that doesn’t identify a person but helps correct for any sample bias, adjusting for who responds to survey invitations. Making adjustments using weights ensures that the sample more accurately reflects the characteristics of the population represented in the data. The weight for a survey respondent in the sample can be thought of as the number of people in the population that they represent based on their age, gender and state of residence.”
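The mechanics of that quote amount to a weighted average: each respondent's answer counts in proportion to how many people in the population they stand for. A minimal sketch, with entirely made-up responses and weights (the real weighting scheme is more involved):

```python
# Weighted prevalence estimate: each respondent's weight is roughly
# "how many people in the population this respondent represents".
# All data below is invented for illustration.
responses = [1, 0, 0, 1, 0]                   # 1 = reported symptoms
weights = [120.0, 80.0, 300.0, 50.0, 150.0]   # from demographic adjustment

weighted_rate = (sum(w * r for w, r in zip(weights, responses))
                 / sum(weights))
print(f"weighted: {weighted_rate:.1%}, unweighted: {sum(responses) / len(responses):.1%}")
```

Here the two symptomatic respondents happen to carry small weights, so the weighted rate comes out well below the raw 40%, which is exactly the kind of correction the quote describes.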


I wish they had some confidence intervals on here. The variance between Bay Area counties (or over time) is quite high - it's rather dubious that Alameda County was at 1.8% while SF was at 0% a week ago.

Indeed, numbers > 1% for symptomatic patients in the Bay Area a week ago are implausible (likely an order of magnitude too high) given confirmed case counts since then.
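A back-of-envelope check makes the implausibility concrete. The population figure below is approximate and the 1.8% is the map estimate mentioned above; no confirmed-case number is assumed:

```python
# Rough scale check of the 1.8% symptomatic estimate for Alameda County.
# Population is approximate, used only to show the implied magnitude.
alameda_pop = 1_600_000      # roughly Alameda County's population
symptomatic_rate = 0.018     # the map's estimate from a week ago

implied_symptomatic = alameda_pop * symptomatic_rate
print(f"~{implied_symptomatic:,.0f} symptomatic people implied")
```

Tens of thousands of symptomatic people would be hard to square with the county's confirmed case counts in the following week.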


Many of the numbers have denominators in the underlying data: https://observablehq.com/@thadk/facebook-covid-survey-data-f...


Every attempt I'm aware of to do a general population survey has indicated that confirmed case counts really are at least an order of magnitude lower than the real spread. I don't think it's a possibility we can reject out of hand - frankly I'm starting to swing to the position that it's been proven true.


Most general population surveys have suffered from heavy self-selection bias or insufficient specificity. In the early days, yes, the Bay Area was probably missing 90% of cases.

Testing has ramped up though (it's pretty easy to get a test if you're symptomatic), so I don't believe we're missing 90% of symptomatic cases [as defined by this survey] at this point (which is part of the reason that new cases/day have remained stable while hospitalization numbers are dropping). Think about it: if people take a survey that says they have covid symptoms, wouldn't they get tested?


I'm just not sure you're applying consistent standards here. Going to a coronavirus test center on medical advice suffers from much stronger self-selection bias and generally unreported specificity.


The time dimension is the most interesting aspect for me in these charts.

Anyone aware of good worldwide coronavirus datasets? I told myself I'd start looking 1-2 months from now, because they'll probably be easier to find and higher quality by then, but these posts are whetting my appetite.


This one is kind of nice, but only available at the country level: https://github.com/datasets/covid-19/blob/master/data/countr...


This ties together a number of different datasets and makes them graphable: http://casualhacker.net/covid19/


Why did they have to include "Flu" in the dropdown? I'm afraid that will dilute the focus of this tool and the seriousness of the pandemic.


"The estimates can be helpful for policymakers and health researchers to forecast potential COVID-19 outbreaks. These estimates don’t represent confirmed COVID-19 cases and shouldn’t be used for diagnostic or treatment purposes, or guidance on personal or business travel. Facebook’s research partners are committed to only using survey results to study and help contain COVID-19."

The dataset is explicitly not intended for general public consumption, and having access to flu and COVID symptom overlap is important for epidemiology research and disease surveillance.


Ah! But this map isn't a representation of COVID-19 cases.

It's a representation of people who self-diagnose symptoms that are associated with COVID-19.

The small print below the map matters:

> This map shows an estimated percentage of people with COVID-19 symptoms, not confirmed cases.

The self-diagnosis part is important as well, because people are asked to observe themselves and fill out a survey through Facebook. They don't take a test administered by a health expert and report the exact results back.

Put differently, you're looking at a map of people who report recognizing, in themselves, symptoms that are associated with COVID-19, but which are also associated with the flu or any other respiratory affliction, ranging from the common cold through allergies to early lung cancer.


Maybe because the syndromes that medical facilities report to health depts are ILI (influenza like illness) & ARDI (acute respiratory distress illness) and this tool is trying to put a common name to the survey?


Why would anyone trust facebook with this data?


"Facebook uses aggregated public data from a survey conducted by Carnegie Mellon University Delphi Research Center. Facebook doesn’t receive, collect or store individual survey responses."

But, what are you afraid Facebook would do with these survey results?


Target snake-oil covid-19 "cures" to people who responded as having symptoms or being particularly at risk?

I've seen various scams (fake "hacking" services, weight loss products, etc.), both organic (liked by tons of very obviously compromised/fake accounts) and paid, and when I reported them the response was that they don't violate the terms of use, so I wouldn't be surprised if this scummy company does the same thing again.


Facebook isn't the wild west anymore. Last month they banned ads for anything remotely COVID-19 related, including hand sanitizer, disinfecting wipes, medical masks, etc.

"...we are now prohibiting ads for products that refer to the coronavirus in ways intended to create a panic or imply that their products guarantee a cure or prevent people from contracting it."

https://about.fb.com/news/2020/04/coronavirus/#exploitative-...


Presumably hacking services (whether real or scams) were also banned and yet I saw them and even reported them and nothing has been done.


I don't doubt scammers are able to temporarily sneak their ads past the approval team. But Facebook certainly doesn't make it easy these days, nor allow morally questionable targeting. They're constantly shutting down controversial categories (crypto, cbd, etc), it's always a hot topic in Facebook ad buyer groups.


Facebook does not see individual responses to this survey; they direct users to a survey hosted on Qualtrics whose data is aggregated at Carnegie Mellon, which provides the county and state estimates back to Facebook.


In this particular case it looks like FB promoted a CMU survey to certain people.



