Benford’s Law Applied to COVID-19 Reports
A look at COVID-19 case reports from around the world to see how well the daily positive case counts fit Benford’s Law. The better the fit, the more likely the data is accurate.
What is Benford’s Law?
Numbers that represent real-life events follow a certain regularity. Specifically, the first digit of these numbers follows a strange pattern with the number 1 appearing about 30% of the time, the number 2 about 18% of the time, etc. — a frequency that declines logarithmically. This pattern is known as Benford’s Law, and it can be used to identify fraud and other irregularities with reported numbers. To learn more, see this Wikipedia page or the 2020 Netflix show “Connected” (episode “Digits”).
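The percentages quoted above come from the standard Benford formula: the expected share of first digit d is log10(1 + 1/d). Here is a minimal Python sketch of it. The function name benford_expected is only an illustrative choice and is not taken from the analysis code.

```python
import math

def benford_expected():
    """Return Benford's expected frequency for each first digit 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for digit, p in expected.items():
    # Print each digit with its expected share, rounded to 4 decimals.
    print(f"{digit}: {p:.4f}")
```

The first two printed lines should match the roughly 30% and 18% figures mentioned above, and the nine values add up to 1.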
How does it apply to COVID-19 reports?
COVID-19 reports are made of numbers just like any other reports created by people, and that makes it possible to apply Benford’s Law. I obtained the daily positive case numbers for the U.S. from The Covid Tracking Project, and for countries around the world from Johns Hopkins University, and calculated how often each digit from 1 to 9 appears as the first digit of those numbers. I compared the results to the expected, or Benford, frequency to get the Benford Error: the difference between the actual frequency and the frequency expected by Benford’s Law. The smaller this error, the better the location’s data fits Benford’s Law.
For example, England, UK has reported 262 daily positive case counts since it began reporting as a standalone location on June 11, 2020. Here are the 10 most recent: 10296, 8964, 8408, 9420, 7292, 8644, 8623, 7393, 6527, 5080. Across all 262 numbers, the first digit is 1 in 0.3664 (36.64%) of cases, 2 in 0.1641 (16.41%) of cases, and so on. But Benford’s Law states that the first digit should be 1 in 0.3010 (30.10%) of cases, 2 in 0.1761 (17.61%) of cases, etc. So England’s report is off by 0.0654 for the digit 1, by 0.0120 for the digit 2 (the sign of the difference doesn’t matter), and so on. For all 9 digits together, the report is off by 0.1602, and that is its Benford Error.
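The calculation described above can be sketched in Python. This sketch assumes the Benford Error is the sum of the absolute per-digit differences, which is my reading of the England example and is not confirmed against the original code. The function names are illustrative.

```python
import math
from collections import Counter

def first_digit_frequencies(numbers):
    """Share of values whose first digit is 1..9 (expects positive integers)."""
    digits = [int(str(n)[0]) for n in numbers]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

def benford_error(numbers):
    """Assumed definition: sum of |actual - expected| over the digits 1..9."""
    actual = first_digit_frequencies(numbers)
    return sum(abs(actual[d] - math.log10(1 + 1 / d)) for d in range(1, 10))

# Only the 10 England values quoted above, not the full 262,
# so the result will not match the 0.1602 reported for England.
recent = [10296, 8964, 8408, 9420, 7292, 8644, 8623, 7393, 6527, 5080]
print(first_digit_frequencies(recent))
print(round(benford_error(recent), 4))
```

With only ten values the frequencies are very coarse (four of the ten start with 8), which is why a reasonably long series is needed before the error means much.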
What are the results?
Using the Benford Error, I ranked the locations from best (smallest error) to worst (largest error), and created plots for each location that show the error visually.
USA:
1. Oregon (OR), e=0.0849
2. Guam (GU), e=0.0939
3. Arizona (AZ), e=0.1037
4. Montana (MT), e=0.1109
5. Wyoming (WY), e=0.1110
6. District of Columbia (DC), e=0.1161
7. Utah (UT), e=0.1179
8. Kentucky (KY), e=0.1316
9. Rhode Island (RI), e=0.1321
10. Washington (WA), e=0.1399
11. North Dakota (ND), e=0.1411
12. Alaska (AK), e=0.1528
13. Tennessee (TN), e=0.1569
14. Connecticut (CT), e=0.1581
15. Delaware (DE), e=0.1689
16. Alabama (AL), e=0.1739
17. Kansas (KS), e=0.1748
18. California (CA), e=0.1754
19. Louisiana (LA), e=0.1786
20. South Dakota (SD), e=0.1812
21. Wisconsin (WI), e=0.1903
22. North Carolina (NC), e=0.1923
23. Vermont (VT), e=0.1926
24. Nevada (NV), e=0.1926
25. Nebraska (NE), e=0.2041
26. Oklahoma (OK), e=0.2078
27. Arkansas (AR), e=0.2098
28. Georgia (GA), e=0.2125
29. Puerto Rico (PR), e=0.2146
30. Texas (TX), e=0.2241
31. Ohio (OH), e=0.2252
32. Mississippi (MS), e=0.2263
33. South Carolina (SC), e=0.2269
34. New Hampshire (NH), e=0.2298
35. Michigan (MI), e=0.2299
36. Hawaii (HI), e=0.2320
37. West Virginia (WV), e=0.2362
38. Idaho (ID), e=0.2460
39. Massachusetts (MA), e=0.2559
40. Virginia (VA), e=0.2605
41. Florida (FL), e=0.2732
42. Iowa (IA), e=0.2764
43. Illinois (IL), e=0.2999
44. U.S. Virgin Islands (VI), e=0.3213
45. Maryland (MD), e=0.3232
46. New Mexico (NM), e=0.3237
47. Minnesota (MN), e=0.3525
48. Colorado (CO), e=0.3564
49. Missouri (MO), e=0.3598
50. Maine (ME), e=0.3698
51. Pennsylvania (PA), e=0.3757
52. Indiana (IN), e=0.3902
53. New York (NY), e=0.4890
54. New Jersey (NJ), e=0.5526
55. Northern Mariana Islands (MP), e=0.6040
World:
1. Jordan, e=0.0625
2. Ukraine, Sumy Oblast, e=0.0661
3. Netherlands, Aruba, e=0.0678
4. Malawi, e=0.0679
5. Australia, New South Wales, e=0.0775
6. Peru, Pasco, e=0.0864
7. Germany, Thuringen, e=0.0893
8. Spain, C Valenciana, e=0.0916
9. Brazil, Maranhao, e=0.0928
10. Namibia, e=0.0967
...
661. Russia, Ulyanovsk Oblast, e=0.8742
662. Russia, Volgograd Oblast, e=0.8867
663. Russia, Krasnoyarsk Krai, e=0.9153
664. Russia, Krasnodar Krai, e=0.9320
665. Russia, Novosibirsk Oblast, e=0.9341
666. Russia, Karachay Cherkess, e=0.9381
667. Russia, Saratov Oblast, e=0.9396
668. Russia, Orenburg Oblast, e=0.9684
669. Tajikistan, e=0.9896
670. Russia, Mordovia Republic, e=1.0343
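The ranking itself is just an ascending sort on the error. A small Python sketch, using four values copied from the U.S. list above (the variable names are illustrative):

```python
# A few (location, Benford Error) pairs copied from the U.S. list in this article.
errors = {
    "New York (NY)": 0.4890,
    "Oregon (OR)": 0.0849,
    "Arizona (AZ)": 0.1037,
    "Guam (GU)": 0.0939,
}

# Sort ascending: the smallest error is the best fit, so it gets rank 1.
ranked = sorted(errors.items(), key=lambda item: item[1])

for rank, (location, error) in enumerate(ranked, start=1):
    print(f"{rank}. {location}, e={error:.4f}")
```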
For the full results, please see this GitHub repository and the following files:
usa_rank.csv
(full link) — File showing how each U.S. state or territory ranks from best to worst based on how their COVID-19 case numbers fit into Benford's Law. The last column has the file name with the Benford plot for the location.
usa_output/
(full link) — Folder with Benford plots for U.S. states and territories.
world_rank.csv
(full link) — File showing how each country and province ranks from best to worst based on how their COVID-19 case numbers fit into Benford's Law. This file is searchable. The last column has the file name with the Benford plot for the location.
world_output/
(full link) — Folder with Benford plots for world countries and their provinces.
To see the original data:
usa_data/
(full link) — Folder with the original COVID-19 data for the U.S. from The Covid Tracking Project.
world_data/
(full link) — Folder with the original COVID-19 data for the world from Johns Hopkins University.
extra/world.csv
(full link) — A version of the world data in a single file, showing the data in a more concise way than the original data.
How to interpret the results?
Small errors mean the reported cases are likely to be true and accurate, and large errors indicate inaccuracy. Large errors can be a sign of insufficient testing, misreporting, or direct falsification.
For the U.S., the error ranges from 0.08 for Oregon to 0.60 for the Northern Mariana Islands (the largest error among the states is 0.55 for New Jersey). For the world, the error ranges from 0.06 for Jordan to 1.03 for Mordovia, Russia.
Example of small error (good Benford fit):
Example of large error (bad Benford fit):
What time period is covered? How many numbers were used?
The data covers the period from the beginning of COVID-19 reporting in early 2020 to March 3, 2021. That is about 1 year of data, or roughly 365 numbers per location, for 725 different locations (55 for the U.S. and 670 for the rest of the world). The exact number of numbers (no pun intended) varies by location because they didn’t all start reporting at the same time. It also varies because zeros and negative numbers are unusable and were dropped. The number of numbers actually used for each location is included in the output, so the reader can take it into account along with the error. About 100 locations were excluded from the ranking because they had too few numbers (fewer than 50 usable numbers). These are typically small territories or places like cruise ships.
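The filtering described above can be sketched in Python as follows. The function names and the sample series are made up for illustration. Only the rules themselves (drop zeros and negatives, require at least 50 usable numbers) come from the text.

```python
MIN_USABLE = 50  # from the text: fewer than 50 usable numbers means excluded

def usable_numbers(daily_counts):
    """Keep only strictly positive values; zeros and negatives are dropped."""
    return [n for n in daily_counts if n > 0]

def is_rankable(daily_counts, min_usable=MIN_USABLE):
    """True if the location has enough usable numbers to be ranked."""
    return len(usable_numbers(daily_counts)) >= min_usable

# Made-up sample series, for illustration only.
sample = [0, 12, -3, 45, 0, 7]
print(usable_numbers(sample))  # [12, 45, 7]
print(is_rankable(sample))     # False
```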