Cleaning FEC data


Rule #1 when working with databases:

All data is dirty!

Hot Tip: Here's a book with tips on data cleaning:  Ver 1.0 Proceedings


 

Bill Dedman <Bill.Dedman@msnbc.com>

Discussion Forum <NICAR-L@po.missouri.edu>    

date        Sep 24, 2008 9:30 AM    

subject        Re: [NICAR-L] presidential campaign contributions by county    

    

Listers,

By the way, several wrote off-list to ask, how were ZIP Codes assigned to counties?

The answer, which may be useful in handling lots of databases, and which doesn't require mapping software:

We used a commercial file that matches ZIPs to counties.

As people have discussed on the list, often the FEC data itself is inconsistent. Many donations, for example, show addresses in suburban Virginia or Maryland, but then when it comes to the state, it says DC. Or vice versa. So if you're going to trust the ZIP Code in the FEC data, you can't trust the state name. Candidates are required to report an address for every donor; they're not required to get it right; and the FEC does no error-checking.

We went with the ZIP Code. That won't always be right, but as we've discussed it does seem right more often than the city is.

So we took whatever ZIP the candidate reported for the donor, and assigned it to a county. In the commercial file, each ZIP is assigned to only one county. So there's no straddling of county lines. That may or may not reflect the real world, but the company assigns each ZIP to only one county. A "primary" county.
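[A minimal sketch of the lookup Bill describes, not his actual code. The column names (zip, county, state) and the three sample rows are assumptions made for illustration; the real commercial file will have its own layout.]

```python
import csv
import io

# Hypothetical crosswalk layout: one row per ZIP, one "primary" county each.
CROSSWALK_CSV = """zip,county,state
50309,Polk,IA
22201,Arlington,VA
20001,District of Columbia,DC
"""

def load_crosswalk(text):
    """Return {zip5: (county, state)} from crosswalk CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["zip"]: (row["county"], row["state"]) for row in reader}

def clean_zip(raw):
    """Reduce a reported ZIP to its first five digits, or None if unusable."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    if len(digits) < 5:
        return None  # too short to be a ZIP (or it lost a leading zero)
    return digits[:5]

def assign_county(reported_zip, crosswalk):
    """Look up the primary county; None means the ZIP is not in the file."""
    zip5 = clean_zip(reported_zip)
    if zip5 is None:
        return None
    return crosswalk.get(zip5)

crosswalk = load_crosswalk(CROSSWALK_CSV)
print(assign_county("22201-1234", crosswalk))  # ZIP+4 is cut down to 5 digits
print(assign_county("99999", crosswalk))       # not in the sample file
```

[Because each ZIP appears once in the file, a plain dictionary is enough. Anything that comes back as None is worth counting, so you know how many contributions fell out.]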

The file is from zipcodedownload.com, and is updated monthly. (So it won't include old ZIPs. It may surprise you to learn that the Postal Service retires ZIP Codes. You lose some contributions that way.) $40 for one month, more if you want updates.

http://www.zipcodedownload.comroducts/Compare/ZIP5/

The company has other databases that include a latitude and longitude for every ZIP Code, and that has been helpful on other stories, easing calculations of distance from one location to another, etc. You also get an MSA name, FIPS Codes, telephone area codes, daylight saving time. End of commercial.
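[For the distance calculations Bill mentions, one common approach is the haversine formula on the two latitude/longitude pairs. This is a sketch, not something from his message. The coordinates below are rough city-center values typed in for illustration, not rows from the commercial file.]

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    earth_radius_miles = 3958.8  # mean Earth radius
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates (assumed, not from any ZIP file).
des_moines = (41.59, -93.62)
washington = (38.91, -77.02)
print(round(haversine_miles(*des_moines, *washington)), "miles, roughly")
```

[ZIP coordinates are a single point standing in for a whole delivery area, so treat the result as an estimate.]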

I also made sure that the address included a state that's in the US. It didn't have to match the ZIP, as discussed above. But it did have to exist. This is to weed out (many of) the foreign addresses. Someone in Russia may have a postal code that's the same as a US postal code, and it's in the ZIP field. We don't want to assign that contribution to a county in Iowa.
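[A sketch of that state check, under the assumption that "a state that's in the US" means the 50 states plus DC. Whether to add territories or military codes is a judgment call the message doesn't settle. The field names and sample donors are made up.]

```python
# 50 states plus DC. Territories (PR, GU, etc.) are left out here by assumption.
US_STATES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
    "DC",
}

def has_us_state(record):
    """True if the reported state is a valid US abbreviation.

    The state does not have to agree with the ZIP; it only has to exist.
    """
    state = (record.get("state") or "").strip().upper()
    return state in US_STATES

# Hypothetical donor records.
donations = [
    {"name": "Donor A", "state": "IA", "zip": "50309"},
    {"name": "Donor B", "state": "ZZ", "zip": "50309"},  # not a US state
    {"name": "Donor C", "state": "", "zip": "50309"},    # state left blank
]

kept = [d for d in donations if has_us_state(d)]
print([d["name"] for d in kept])
```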

Aron earlier on this list proposed a series of cascading validity checks, which sounds like a brilliant method that I did not employ. :-) I just relied on the ZIP Code, and made sure the state in the FEC data was a valid one. But we pulled the county name and state name from the ZIP Code commercial file -- so they're always consistent with each other in our results.
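[Putting Bill's two rules together as a rough sketch: the FEC state only has to be valid, while the county and state in the output come from the ZIP file. The function name, field names and sample rows are all hypothetical.]

```python
from collections import defaultdict

def totals_by_county(donations, crosswalk, valid_states):
    """Sum amounts by the (county, state) that the ZIP file gives.

    Returns (totals, skipped), where skipped counts donations dropped
    because the state was invalid or the ZIP was not in the file.
    """
    totals = defaultdict(float)
    skipped = 0
    for d in donations:
        place = crosswalk.get(d["zip"])
        if d["state"] not in valid_states or place is None:
            skipped += 1
            continue
        totals[place] += d["amount"]  # place is (county, state) from the ZIP file
    return dict(totals), skipped

# Hypothetical one-row crosswalk and donations.
crosswalk = {"22201": ("Arlington", "VA")}
donations = [
    {"state": "DC", "zip": "22201", "amount": 250.0},  # state disagrees with ZIP
    {"state": "VA", "zip": "22201", "amount": 100.0},
    {"state": "XX", "zip": "22201", "amount": 500.0},  # not a US state
]

totals, skipped = totals_by_county(donations, crosswalk, {"DC", "VA"})
print(totals, skipped)
```

[The donor who reported DC with a Virginia ZIP still lands in Arlington, VA, which is the point: the reported state never reaches the output.]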

Hope this helps.

bill


Bill, thanks for posting your data.

 

FYI, listers -- I found some references to free ZIP code databases, including one that requires publicity to get it for free (such as a blog post); one based on 1999 Census data; and then this thing, which is supposedly free and from 2006, cleaned up, Lat-Long, city, county, state. I haven't tried it and can't vouch for it, but I think I'll be looking at it: http://www.free-zipcodes.com/

 

Mike

Mike Stucka

 

stucka@whitedoggies.com

(617) 795-0344 home

 

(508) 496-9630 cell

42.313513,-71.218781 icbm

 

http://lilgenghis.blogspot.com


Tim Henderson <tim.hendo@gmail.com>

No need for commercial products to find zip codes -- Census has them. google "census zcta" and you'll find them in any form you might want.

I've never had a problem except with single-address or other non-geographic ZIP codes, which you might have to figure out individually, but it's not many.

 

-tim


Tim Henderson later posted:

 

I should say "they line up as well with delivery areas as you can expect." The post office is under no obligation to respect any boundaries, of course; it's just trying to deliver the mail as efficiently as possible. So you never know with ZIP codes. But sometimes you have to work with them. I have had trouble with unexplained ZIP codes that the Census doesn't list because they're non-geographic in some way -- they refer to a single address, maybe, or to a post office that doesn't deliver mail in a rural area. But I don't know how a commercial product could do better -- it's just the way of the world when you're dealing with ZIP codes. No? Prove me wrong, somebody, and I'll buy a ZIP code list or go to the marketing department with hat in hand.
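[Whichever ZIP list you use, it helps to see which reported ZIPs it doesn't cover and how often they occur, so the "unexplained" ones can be checked by hand. A small sketch; the ZIP values are placeholders, not real examples of non-geographic ZIPs.]

```python
from collections import Counter

def unmatched_zip_report(donation_zips, known_zips):
    """Return [(zip, count), ...] for ZIPs missing from the reference list,
    most frequent first."""
    missing = Counter(z for z in donation_zips if z not in known_zips)
    return missing.most_common()

# Placeholder values for illustration only.
known_zips = {"50309", "22201"}
donation_zips = ["50309", "00001", "22201", "00001", "00002"]

for zip_code, count in unmatched_zip_report(donation_zips, known_zips):
    print(zip_code, count)
```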

 


Sarah Cohen <cohensh@washpost.com>

Discussion Forum <NICAR-L@po.missouri.edu>

date        Sep 24, 2008 3:28 PM

subject        Re: [NICAR-L] presidential campaign contributions by county

I wouldn't put too fine a point on this -- you'll have way more errors than that [SHE'S REFERRING TO PLACES OF FAST GROWTH LIKE LAS VEGAS]. One example: the FEC says it should be the home address, but guess how many people actually use their work address? That puts them in the wrong state around here, let alone the wrong county. And hosts still sometimes just use the address of where the fundraiser was held because they don't want to bother the donor too much. That happens more in congressional than presidential races, but it does happen, until the campaign figures it out. So look at this whole exercise as a guide, not the gospel.