Data redaction is the suppression of sensitive data such as personally identifiable information (PII), for example credit card numbers, email addresses, and social security numbers.
Cloudera has a data redaction feature that masks credit card numbers and email addresses with placeholder strings, either the defaults or custom ones that we specify, so that those strings are printed in place of the PII in queries and log files. It ships with predefined regular expressions that match and detect credit card numbers and email addresses.
To enable this feature:
Go to CM – HDFS – Configuration and search for “redaction”.
- Check “Enable Log and Query Redaction”.
In the “Log and Query Redaction Policy” section, add whichever of the default policies you need.
For example, to mask email addresses, choose the Email addresses policy and enter the custom replacement string in the “Replace” field.
Now, in the “Test Redaction Rules” field, enter an email address and select “Test Redaction”. The output shows the address replaced by the replacement string you provided.
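To make the effect of such a search-and-replace rule concrete, here is a minimal Python sketch. It is not Cloudera's implementation: the pattern `EMAIL_PATTERN`, the function `redact_email`, and the replacement string are assumptions chosen for illustration, and the regex used by the built-in Email addresses policy may differ.

```python
import re

# Hypothetical, simplified email pattern (an assumption, not Cloudera's built-in regex).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_email(text: str, replacement: str = "email@redacted.host") -> str:
    """Replace every email-like substring in `text` with `replacement`."""
    # A function is used as the replacement so that backslashes in
    # `replacement` are inserted literally instead of being interpreted.
    return EMAIL_PATTERN.sub(lambda _match: replacement, text)


print(redact_email("contact me at jane.doe@example.com today"))
# contact me at email@redacted.host today
```

Text without an email address passes through unchanged, which mirrors what the “Test Redaction Rules” field shows for input that does not match the rule.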
If you want to mask an email address only when it appears together with a particular string, use the “Trigger” field. The trigger is a plain string: the rule is applied only to text in which that string is found.
In the above picture, we’ve set the trigger to “my id is” for the email address redaction, i.e. email addresses in the logs and queries are masked only when the text also contains the string “my id is”.
As you can see, we tested the redaction rule with the input “email id is email@example.com” and the output is unmasked. This is because the input does not contain the trigger.
We then changed the trigger to “email id is” and, as you can see, the redaction policy now masks the email address because the input matches the trigger.
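The two tests above can be sketched in Python as follows. This is an illustration under stated assumptions, not Cloudera's code: it assumes the trigger is a plain, case-sensitive substring that must occur in the line before the regex is tried, and `EMAIL_PATTERN`, `redact_with_trigger`, and the replacement `"XXXX"` are hypothetical names and values.

```python
import re
from typing import Optional

# Hypothetical, simplified email pattern (an assumption, not Cloudera's built-in regex).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_with_trigger(line: str, replacement: str,
                        trigger: Optional[str] = None) -> str:
    """Mask emails in `line`, but only if `trigger` (when given) occurs in it."""
    # Cheap substring test first: if the trigger is missing, skip the regex entirely.
    if trigger is not None and trigger not in line:
        return line
    return EMAIL_PATTERN.sub(lambda _match: replacement, line)


line = "email id is email@example.com"
print(redact_with_trigger(line, "XXXX", trigger="my id is"))
# email id is email@example.com
print(redact_with_trigger(line, "XXXX", trigger="email id is"))
# email id is XXXX
```

Checking for a literal string before running the regex keeps the common case cheap, since a substring test costs less than a regular expression match.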
To add a custom policy rule, you have to provide your own regular expression that matches the intended strings/values.
Add – Custom rule, then provide the regex and the replacement value.
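Before entering a custom rule, it can help to try the regex and its replacement offline. The Python sketch below does that for a hypothetical rule that masks US social security numbers written as NNN-NN-NNNN. The dictionary keys, the pattern, and the helper `apply_rule` are assumptions for illustration only. Python's regex dialect is also not identical to the one Cloudera uses, so the rule should still be verified in the “Test Redaction Rules” field.

```python
import re

# Hypothetical custom rule (an assumption): a regex to search for and
# the string that replaces every match.
custom_rule = {
    "search": r"\b\d{3}-\d{2}-\d{4}\b",
    "replace": "XXX-XX-XXXX",
}


def apply_rule(rule: dict, text: str) -> str:
    """Apply one search/replace rule to `text` and return the redacted result."""
    # The replacement is inserted literally, without backreference expansion.
    return re.sub(rule["search"], lambda _match: rule["replace"], text)


print(apply_rule(custom_rule, "ssn 123-45-6789 on file"))
# ssn XXX-XX-XXXX on file
```

The `\b` word boundaries keep the rule from firing on longer digit runs such as `1234-56-78901`, which is the kind of false positive worth checking for when writing a custom pattern.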