Using Python to Anonymise a Dataset

I am by no stretch a programmer. I'd love to be able to write thousands of lines of code, but for now I'll continue to dabble with learning how to write basic scripts. I've edited some scripts in various roles to tweak them for my specific needs, mostly to save time with tasks such as access reviews (I modified a PowerShell script that compared AD groups with approved ServiceNow request numbers). Code is like magic to me, so it's time again to improve my skills as a magician!

I'm solving a problem I see frequently in organisations. This is probably not the best way to solve it, but to keep things simple my pretend use case is a non-technical team that downloads a daily report in CSV format. They want to sanitise it immediately, before they process it, so that no PII sits on their local machines, which would be against the pretend company's internal data handling policy. Sounds straightforward enough. The key requirement is that the data still needs to be usable for processing, so they can't just mask everything by replacing any PII with "XXXX XXXXXX", but they can replace a Jason Bourne with a John Smith, and so on.

Let's dive in. I created a CSV file with some column headers and needed to fill it with mock data, so I went to Mockaroo to grab a fake name and address. Full disclosure: I didn't write the script all myself. I pieced it together from several online sources and did my usual modification to make it work for my use case. Here it is:
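The script itself is not reproduced in this copy of the post, so what follows is only a minimal sketch of the same idea, not the original code. It assumes the report has columns called `full_name` and `address` (hypothetical names) and that the replacement values live in a small hard-coded list called `FAKE_PEOPLE` (also hypothetical, standing in for the Mockaroo data). Every other column is copied through untouched so the report stays usable.

```python
import csv
import random

# Hypothetical pool of replacement values, e.g. generated with Mockaroo.
# The keys must match the PII column headers in the report.
FAKE_PEOPLE = [
    {"full_name": "John Smith", "address": "1 Example Street"},
    {"full_name": "Jane Doe", "address": "2 Sample Road"},
    {"full_name": "Alex Brown", "address": "3 Placeholder Lane"},
]


def anonymise_csv(input_path, output_path, fake_people=FAKE_PEOPLE, seed=None):
    """Copy a CSV, overwriting the PII columns with values from fake_people."""
    rng = random.Random(seed)  # pass a seed only if repeatable output is wanted
    with open(input_path, newline="", encoding="utf-8") as src, \
            open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        for row in reader:
            fake = rng.choice(fake_people)  # pick one fake person per row
            for column, value in fake.items():
                if column in row:  # only touch columns the report actually has
                    row[column] = value
            writer.writerow(row)
```

Calling `anonymise_csv("report.csv", "report_clean.csv")` with whatever paths apply would write the sanitised copy. Taking the name and address from the same fake person keeps each row internally consistent, which matters if later processing relies on the two fields belonging together.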

This script needs an input path to pull in the CSV to anonymise and an output path to export to. First things first, I wanted to make sure it worked after setting the paths:

Success!

Just kidding, it failed a few times because I lacked a few indents in my code. Lesson learned! After getting it to run, I also noticed I needed more mock data to replace the pretend real data. Still with me? I would essentially be replacing a full name and address with another to prove it works.

It was time to check the results. Here's a before:

(Data we want to anonymise with fake data)

Here's an after:

Notice how the data was replaced with the placeholder values from the script. It needs some cleaning up of the brackets, but it works! The input and output paths worked great too!
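The post doesn't show where the stray brackets come from. One common cause, offered here as a guess rather than a diagnosis, is writing a whole Python list into a single cell, which stores the list's string form instead of the value inside it. A small sketch of the difference, using a hypothetical `fake_name` value and a hypothetical `write_row` helper:

```python
import csv
import io

fake_name = ["John Smith"]  # hypothetical: the replacement value held in a list


def write_row(cells):
    """Write one CSV row to an in-memory buffer and return it as text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(cells)
    return buffer.getvalue().strip()


with_brackets = write_row([1, fake_name])        # the list itself becomes the cell
without_brackets = write_row([1, fake_name[0]])  # the string inside becomes the cell

print(with_brackets)     # 1,['John Smith']
print(without_brackets)  # 1,John Smith
```

If that is what is happening, pulling the string out of the list before writing the row should remove the brackets without any extra clean-up step.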
