Anonymizing and masking sensitive data in SQL databases before migrating to Azure SQL
Many companies I interact with have expressed an interest in migrating their legacy SQL Server databases to Azure. Often this is driven by the End of Life of SQL Server 2008 (and 2008 R2); other times it’s a desire to perform reporting and analytics on existing data in the cloud.
(I wrote about migrating to Azure SQL previously here)
Mostly this is all documented nicely on docs.microsoft.com (Dynamic Data Masking, Static Data Masking), but what isn’t clear to me is how best to anonymize and/or mask sensitive data before migrating to Azure. This is typically something companies need to do on their on-premises infrastructure before a database ever leaves the perimeter network.
So I set to work and spent some time figuring out how best to achieve this while also considering the cost involved.
The setup
I spun up a virtual machine locally using Hyper-V and installed SQL Server 2017 on Windows Server 2019. I briefly considered using Docker, but to keep things simple I chose this traditional setup.
I needed sample data that I wouldn’t mind messing up. Luckily, Microsoft regularly publishes AdventureWorks, the perennial favorite sample database intended for exactly this kind of thing. You can access the .BAK files for AdventureWorks here. I chose the Lightweight version, as it’s small but has enough data for my tests.
I generated a RESTORE statement to quickly bring up a fresh copy of the database from the backup:
RESTORE DATABASE [AdventureWorksLT2017]
FROM DISK = N'c:\temp\AdventureWorksLT2017.bak'
WITH FILE = 1,
    MOVE N'AdventureWorksLT2012_Data' TO N'C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\AdventureWorksLT2012.mdf',
    MOVE N'AdventureWorksLT2012_Log' TO N'C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\DATA\AdventureWorksLT2012_log.ldf',
    NOUNLOAD, STATS = 5
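The MOVE clauses need the backup’s logical file names (AdventureWorksLT2012_Data and AdventureWorksLT2012_Log here). If you’re ever unsure what they are for a given .BAK file, a quick way to list them is RESTORE FILELISTONLY. A small sketch using the Invoke-Sqlcmd cmdlet introduced in the next section (running the same statement in SSMS works just as well):

# List the logical file names contained in the backup
Invoke-Sqlcmd -ServerInstance "SQL17" -Query "RESTORE FILELISTONLY FROM DISK = N'c:\temp\AdventureWorksLT2017.bak'"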
AdventureWorksLT has a schema called SalesLT, and the table I’m using for my tests is Customer. The column of interest in this table is EmailAddress: email addresses are universally understood and are typically considered personally identifiable information.
Thus, my goal is to prevent sensitive data in AdventureWorksLT2017.SalesLT.Customer.EmailAddress from ever reaching the cloud. The table has 847 rows, and for some reason there are plenty of duplicates. Maybe it’s an easter egg that I found first.
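For the curious, the duplicates are easy to spot with a quick count per address. A small sketch (again using the Invoke-Sqlcmd cmdlet from the next section; the query itself works in any SQL client):

# Show email addresses that appear more than once in the Customer table
Invoke-Sqlcmd -ServerInstance "SQL17" -Query "SELECT EmailAddress, COUNT(*) AS Occurrences FROM AdventureWorksLT2017.SalesLT.Customer GROUP BY EmailAddress HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC"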
Do It Yourself – PowerShell to the rescue!
Before even looking into what tooling is available, I chose to use PowerShell. How hard can it be, I remember briefly thinking to myself. Well, it turns out it’s not that hard.
First, I define a few variables in my PowerShell script:
$SERVER = "SQL17"
$DATABASE = "AdventureWorksLT2017"
$SCHEMA = "SalesLT"
$TABLE = "Customer"
$COLUMN = "EmailAddress"
PowerShell has a module for SQL Server, which provides an amazing cmdlet, Invoke-Sqlcmd. The module is also cross-platform these days! It allows running SQL queries directly from the command line, so I’ll start by fetching all rows from the Customer table:
$Emails = Invoke-Sqlcmd -Query "SELECT $COLUMN FROM $DATABASE.$SCHEMA.$TABLE" -ServerInstance "$SERVER" -As DataTables
I’m using -As DataTables to return the data as an ADO.NET DataTable. I reasoned this would provide better performance and also allow for all sorts of tinkering later when needed.
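One note on setup: if the SqlServer module isn’t on the machine yet, it installs straight from the PowerShell Gallery. A quick one-time sketch, assuming PowerShell 5.1 or later with gallery access:

# One-time setup: install and load the SqlServer module
Install-Module -Name SqlServer -Scope CurrentUser
Import-Module SqlServer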
Next, I need a few fake domains for my anonymized email addresses. Or, well, not fake-fake, but good-looking domains for good-looking email addresses that are fake.
$domains = "outlook.com", "hotmail.com", "gmail.com", "yahoo.com", "geocities.com", "myspace.com"
(I miss Geocities, that’s why it’s there)
And then the actual logic in a ForEach() loop:
ForEach ($Email in $Emails)
{
    # Random lengths for the fake first and last names
    $rnd1 = Get-Random -Minimum 4 -Maximum 14
    $rnd2 = Get-Random -Minimum 4 -Maximum 14
    # Build random alphabetic strings from the ASCII ranges A-Z (65..90) and a-z (97..122)
    $FirstName = -join ((65..90) + (97..122) | Get-Random -Count $rnd1 | % {[char]$_})
    $LastName = -join ((65..90) + (97..122) | Get-Random -Count $rnd2 | % {[char]$_})
    # Pick one of the fake domains at random
    $Domain = Get-Random -InputObject $domains
    $Original = $Email.EmailAddress
    $MaskedEmail = $FirstName + "." + $LastName + "@" + $Domain
    # Overwrite the original address in place
    Invoke-Sqlcmd -Query "UPDATE $DATABASE.$SCHEMA.$TABLE SET $COLUMN='$MaskedEmail' WHERE $COLUMN='$Original'" -ServerInstance "$SERVER"
    Write-Host "Updated: $Original --> $MaskedEmail"
}
$rnd1 and $rnd2 set the lengths of the random strings, between 4 and 13 characters (Get-Random’s -Maximum is exclusive).
Each mangled email address is in the form firstname.lastname@domain.com. Then it’s a simple string replacement and another query to update each row.
The script takes about 500 milliseconds to run against the 847 rows in the table. It’s far from optimized, but for simple testing it’s more than enough to get an idea of how things perform. The VM I’m running this on has 2 cores and 4 GB of RAM.
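If I needed to push larger tables through this, one obvious tweak would be to build all the UPDATE statements in memory and send them to the server as a single batch, instead of doing one round trip per row. A rough, unbenchmarked sketch reusing the variables from above:

# Rough sketch: generate masked values as before, but collect the UPDATE statements
# and run them in a single batch (one round trip instead of one per row)
$Statements = ForEach ($Email in $Emails)
{
    $FirstName = -join ((65..90) + (97..122) | Get-Random -Count (Get-Random -Minimum 4 -Maximum 14) | % {[char]$_})
    $LastName = -join ((65..90) + (97..122) | Get-Random -Count (Get-Random -Minimum 4 -Maximum 14) | % {[char]$_})
    $Domain = Get-Random -InputObject $domains
    "UPDATE $DATABASE.$SCHEMA.$TABLE SET $COLUMN='$FirstName.$LastName@$Domain' WHERE $COLUMN='$($Email.EmailAddress)';"
}
Invoke-Sqlcmd -Query ($Statements -join "`r`n") -ServerInstance "$SERVER"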
Looking at the source data reveals the email addresses are now anonymized.
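A quick spot check from the same PowerShell session confirms it (a small sketch reusing the variables defined earlier):

# Sample a few rows to verify the masked values
Invoke-Sqlcmd -Query "SELECT TOP 10 $COLUMN FROM $DATABASE.$SCHEMA.$TABLE" -ServerInstance "$SERVER"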
Masking or anonymizing data with a PowerShell-based approach is simple when the source data is simple, as in the AdventureWorksLT database. More complex data might require a tedious amount of effort to craft the scripts and to perform testing and verification.
Then again, it’s the organization’s responsibility to understand its own data, which should make this task more viable.
Using a third-party tool: Redgate Data Masker for SQL Server
I also wanted to try out a third-party tool. I’ve heard good things about Redgate’s Data Masker tool. It’s an enterprise-ready tool for masking and anonymizing data directly within a database.
It’s also not free. Pricing is based on the amount of data: 1 TB per year is 7800 € and 3 TB per year is 15660 €. While not cheap, it’s roughly the equivalent of 10-15 days of someone’s PowerShell implementation time, so I find it fairly reasonable.
I installed Data Masker on the SQL Server (it’s a throw-away VM) and it’s a simple next-next-finish install.
First, I needed to connect to my database.
It supports a variety of database versions, including Azure SQL and recent SQL Server editions.
Once connected, you scan the relevant tables and figure out which columns you’re interested in.
The tool works with masking rules, which allow for a lot of variety and different levels of complexity.
I created a simple rule to mask the EmailAddress column.
I ran the rule and verified the changes in the original source table.
The results do look nice, perhaps even a little nicer than what my 5-minute PowerShell script was capable of producing. That is the difference between 5 minutes and 7800 €.
Run time was 4 seconds, so about 8x slower than the PowerShell approach. Perhaps the rule does some sort of verification at the end that my own script isn’t worried about. I’m not in the business of coding bugs.
Closing thoughts
This was a quick exercise I used to learn more before going back to a few companies to recommend a viable approach.
I learned that a homebrewed approach is solid if you know your data and have the skills available for crafting the tooling. Otherwise it makes much more sense to invest your time and resources in a tool that comes with support, sample playbooks and a tested approach to anonymizing and masking data. For this, I only tested one tool, as there really weren’t that many to choose from. I did find plenty of scripts hacked together, but they are all very specific and require quite a bit of tinkering in the end to get the desired results.
The two approaches can also be combined, especially if you fear that the 1 TB limit of the yearly license won’t be enough.
Thanks for reading – happy SELECTing 🙂