I've been analyzing the public NPD data leak. Not *all* of it, only the most public ~277GB (uncompressed) corpus: https://www.troyhunt.com/inside-the-3-billion-people-national-public-data-breach/
Corporate media headlines said 2.9 billion, 2.7 billion, "every American", etc., which raised questions in threads about it: https://noauthority.social/@ned/112962645178289749
Troy Hunt estimated ~899M unique SSNs, though I'm curious whether his "100M samples" were random or sequential, because I've noticed formatting differences between sections of the corpus, which suggests different origins.
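To illustrate why the random-vs-sequential question matters: if sections of the corpus come from different origins, the first N rows may not look like the rest. A minimal sketch of drawing a uniform random sample in one pass (reservoir sampling). The function name is hypothetical, and this is not a claim about how Troy actually sampled.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(rows: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return k rows chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[str] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample
```

Unlike reading the first 100M rows, every row in the stream has the same chance of ending up in the sample, so sections with different formatting are represented in proportion to their size.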
On to some observations about this corpus that I can make with a high degree of confidence:
* The corpus does NOT contain 2.9 or 2.7 billion people.
* "Every American" is NOT present in the corpus.
* The corpus does contain accurate information about many Americans.
* The corpus does contain accurate information about deceased Americans.
Not that anyone here would ask, but like Troy, I'm not a data broker or your personal lookup service.
These posts are purely informational.
I'll post an approximate count of total unique SSNs in this corpus when I have it.
I'd like to do some fraud analysis to determine if/how many "hot" SSNs are being fraudulently used by what appear to be multiple people. However, this will be tricky because of recycled numbers, name changes, etc., so I'll have to experiment a bit and see how feasible this is. Processing "big data" is time-consuming no matter how efficient you are.
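A rough sketch of one possible heuristic for "hot" SSNs: flag any SSN that appears with several distinct dates of birth, since a DOB should survive a name change. Everything here is an assumption for illustration: the `(ssn, dob)` record shape, the `find_hot_ssns` name, and the threshold of 3. It's not a finished method, and at this corpus size the grouping would realistically happen in the DB rather than in memory.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_hot_ssns(
    records: Iterable[Tuple[str, str]], min_distinct: int = 3
) -> Dict[str, Set[str]]:
    """Map each SSN seen with >= min_distinct different DOBs to those DOBs.

    records: (ssn, dob) pairs, assumed to be normalized already.
    min_distinct: arbitrary threshold; tune it against known-noisy data.
    """
    dobs_by_ssn: Dict[str, Set[str]] = defaultdict(set)
    for ssn, dob in records:
        if ssn and dob:  # skip rows missing either field
            dobs_by_ssn[ssn].add(dob)
    return {
        ssn: dobs
        for ssn, dobs in dobs_by_ssn.items()
        if len(dobs) >= min_distinct
    }
```

Typos in DOBs would produce false positives, so anything this flags is a candidate for closer inspection, not a confirmed case of abuse.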
The BleepingComputer article was one of the better early general-audience writeups: https://www.bleepingcomputer.com/news/security/hackers-leak-27-billion-data-records-with-social-security-numbers/
> previously leaked samples also contained email addresses and phone numbers
> The data breach has led to multiple class action lawsuits against Jerico Pictures, which is believed to be doing business as National Public Data, for not adequately protecting people's data.
There is also a lawsuit:
https://www.documentcloud.org/documents/25038487-hoffman-npd-class-action-lawsuit
> Defendant Jerico Pictures, Inc. d/b/a National Public Data
If this filing is accurate, my question is: why is Jerico Pictures, Inc., a Florida business that describes itself as "Located in both Los Angeles and South Florida, Jerico Pictures maintains a talented group of film and television producers with a passion for storytelling" and is headed by its president, Salvatore Verini (JR?), doing business as National Public Data, a background check service?
https://web.archive.org/web/20240802185707/https://www.jericopictures.com/about/
And I don't know if this is the same Salvatore Verini (JR?)
https://www.imdb.com/name/nm4701915/
Interesting to note that the Twitter account linked on the aforementioned website doesn't exist.
But his Facebook and Instagram accounts are still up. His Facebook bio says he's "EP Jerico Pictures".
I am somewhat perplexed by the use of both "Salvatore Verini" and "Salvatore Verini JR" in Jerico Pictures' annual filing with the State of Florida.
Krebs compiled a fair bit of info about Sal: https://krebsonsecurity.com/2024/08/nationalpublicdata-com-hack-exposes-a-nations-data/
> The Florida Secretary of State says Jerico Pictures is owned by Salvatore (Sal) Verini Jr., a retired deputy with the Broward County Sheriff’s office. The Secretary of State also says Mr. Verini is or was a founder of several other Florida companies, including National Criminal Data LLC, Twisted History LLC, Shadowglade LLC and Trinity Entertainment Inc., among others.
Krebs' is the best general writeup so far.
Small update. Still working on getting the DB built and data imported (on to round 9). I'm continuing to see "invalid page in block xxxxxxx of relation base/xxx/xxx" and various data checksum errors, which are presumably from transient write failures. What's weird is that these errors persist even after I put it on ZFS.
I'm forced to conclude that it's a "hardware" problem, so I'm going to try disabling the write cache on the NVMes.
Nothing's ever easy.
Also, I got the rest of the "partial" data leak, but I haven't even begun to open and sort through that can of worms.
After finally tracking down and resolving my DB corruption issue, I imported everything again last night.
There may have been (read: probably was) some corruption when I'd unarchived the corpus, plus potentially some issue(s) with my pre-insertion normalizer.
Given those caveats, here are my results:
Total rows of data: 2,695,281,509
Total distinct/unique SSNs: 272,384,882
272M is a bit off from Troy's early estimate of 899M.
I'll start from scratch again later to validate my results.
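For anyone wanting to sanity-check numbers like these, here's a minimal sketch of the counting step, assuming a normalizer that strips formatting and rejects anything that isn't exactly 9 digits. The function names are hypothetical, this is not my actual pre-insertion normalizer, and a set this size won't fit in RAM for the full corpus (that's what the DB is for).

```python
import re
from typing import Iterable, Optional, Tuple

_NON_DIGITS = re.compile(r"[^0-9]")


def normalize_ssn(raw: str) -> Optional[str]:
    """Strip dashes/spaces/etc.; return 9 digits, or None if malformed."""
    digits = _NON_DIGITS.sub("", raw)
    return digits if len(digits) == 9 else None


def count_rows_and_unique_ssns(raw_ssns: Iterable[str]) -> Tuple[int, int]:
    """Return (total rows seen, distinct well-formed SSNs)."""
    total = 0
    unique = set()
    for raw in raw_ssns:
        total += 1
        ssn = normalize_ssn(raw)
        if ssn is not None:
            unique.add(ssn)
    return total, len(unique)
```

The point of normalizing first is that "000-00-0001" and "000000001" count as one SSN. Skipping that step inflates the distinct count, which matters when sections of the corpus are formatted differently.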
@eriner I have to leave before you've finished your thoughts, so I'll just leave this here:
>> ~899M unique SSNs
LOLWUT?
@eriner unless this includes fraudulent identities somehow... in which case there might be hope for humanity
What number were you expecting?
Why is that number fake?
If 7-8k ppl die each day in the US and SSNs were first issued in 1936, how many would you expect in circulation?
It's a 9 digit number, right?
Just call it 100 years, so just the numbers issued to dead ppl on an oversimplified estimate:
100 * 365 * ~7000 = ~255.5 million
It's probably not that high, bc of the population growth rate, but it's a reasonable estimate. Plus lots of other ppl get SSNs, too.
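The back-of-envelope math above, spelled out as a quick sketch. The inputs are the same oversimplified assumptions from this reply (a flat ~7000 deaths/day for 100 years), not real mortality data.

```python
# Oversimplified inputs taken from the estimate above.
YEARS = 100
DAYS_PER_YEAR = 365
DEATHS_PER_DAY = 7000  # low end of the 7-8k/day figure

# SSNs belonging to people who have died, under these assumptions.
dead_holder_estimate = YEARS * DAYS_PER_YEAR * DEATHS_PER_DAY
print(f"{dead_holder_estimate:,}")  # 255,500,000

# A 9-digit number allows at most 10**9 distinct values.
NINE_DIGIT_SPACE = 10 ** 9

# Share of the whole 9-digit space each figure would occupy.
share_dead = dead_holder_estimate / NINE_DIGIT_SPACE
share_899m = 899_000_000 / NINE_DIGIT_SPACE  # the ~899M estimate
print(share_dead, share_899m)
```

For scale, ~899M unique SSNs would be roughly 90% of every value a 9-digit number can take.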
@eriner I can’t help but think that this “Leak” will be part of a push to institute a new “Smart” ID system since we can no longer trust SSNs.
Just waiting on that shoe to drop.
@Jagahati that, and it's just in time for an election.
Remember, every person on earth who has to report income to the US gov has to get some kind of tax identifier; if it involves paying into SS, you get an SSN.