BigQuery Security 101 - Core Concepts

May 15, 2024 · Ben Benhemo · 8 min read

Introduction

In today's cloud computing age, cloud service providers offer services that cover virtually every aspect of computing and data management. From scalable compute power and robust storage solutions to advanced data analytics and machine learning platforms, these services are designed to empower businesses and developers, providing the flexibility, scalability, and efficiency required to drive innovation and optimize operations in a digitally transformed world.

In brief, these services can be broadly categorized into several types:

| Service Category | Description | Examples |
| --- | --- | --- |
| Compute Services 💻 | Provide virtualized computing resources over the internet. | Amazon EC2, Microsoft Azure Virtual Machines, Google Compute Engine |
| Storage Services 🗄️ | Dedicated to storing data in the cloud, ensuring security and accessibility. | Amazon S3, Azure Blob Storage, Google Cloud Storage |
| Databases 📚 | Offer scalable, distributed systems for data storage and management. | Amazon RDS, Azure SQL Database, Google Cloud SQL |
| Data Analytics 🔍 | Designed to process and analyze large datasets efficiently. | AWS Redshift, Azure Synapse Analytics, Google BigQuery |
| Machine Learning and AI 🤖 | Enable building, training, and deploying machine learning models. | AWS SageMaker, Azure Machine Learning, Google AI Platform |
| Networking 🌐 | Provide interconnectivity between cloud services, on-premises data centers, and end users. | AWS VPC, Azure Virtual Network, Google Cloud VPC |

Today, we’ll zero in on BigQuery, spotlighting the risks that tag along.

What is BigQuery?

💡 BigQuery, offered by Google Cloud Platform (GCP), is a premier, fully managed, serverless data warehouse in the Data Analytics category. It enables scalable analysis across petabytes of data through a robust, fast, and interactive SQL query engine, paving the way for modern data-driven applications with efficient data management and analysis at scale.

For a deeper dive and more context, check out Google’s own explanation in the video below:

Essential BigQuery Concepts Explained

  • BigQuery Jobs: Jobs in BigQuery are specific operations that you can perform on the data stored within. These include executing queries to analyze data, loading new data into tables, or exporting data to different formats or external locations.
  • BigQuery Tables: Tables are the fundamental building blocks of BigQuery where your actual data resides. Organized in rows and columns, tables support the storage and retrieval of large quantities of structured data.
  • BigQuery Datasets: A dataset in BigQuery is a container that holds tables and views. Datasets are used to organize and control access to your tables based on needs and are defined at the project level.
  • BigQuery Queries: Queries in BigQuery are used to interact with the data stored in tables. By writing SQL-like commands, you can perform complex data analysis, manipulate data, and generate insights.

Now Let’s Talk Security

BigQuery often backs companies' largest data stores, some of which contain sensitive information such as PII, financial records, or intellectual property. That concentration of sensitive data makes BigQuery a priority target for cyber threats. Attackers may exploit the service to access and exfiltrate sensitive data, to cause denial of service, or to misuse computational resources.

Unauthorized Access

The Problem → Gaining unauthorized access to BigQuery can lead to sensitive data exposure.

The Solution → Adopt the principle of Least Privilege Access, granting users only the access they need.

1️⃣ The IAM Solution

You might think everyone knows about IAM (Identity and Access Management) by now, but getting it right is crucial for keeping your BigQuery data safe. BigQuery Access control with IAM provides detailed guidance on establishing strict IAM policies.

I’ve included an overview of the roles, just in case:

High-Risk BigQuery Roles

| Role | Permissions | Security Risk |
| --- | --- | --- |
| BigQuery Admin (roles/bigquery.admin) | Full control over all BigQuery resources. | Can lead to data exposure, loss, or unauthorized modification. |
| BigQuery Data Owner (roles/bigquery.dataOwner) | Full control over datasets and their contents. | Risk of unauthorized access or modifications within datasets. |
| BigQuery User (roles/bigquery.user) | Create new datasets and run jobs. | Potential for unauthorized dataset creation or costly/malicious job execution. |
| BigQuery Job User (roles/bigquery.jobUser) | Run jobs, including queries, within the project. | May be exploited for data exfiltration or incurring high costs. |
| BigQuery Data Editor (roles/bigquery.dataEditor) | Read/write access to datasets, no sharing. | Can lead to data loss or unauthorized alterations. |
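As a quick illustration, here is a minimal Python sketch that scans an IAM policy for the high-risk roles above. The policy dict mirrors the JSON layout returned by gcloud projects get-iam-policy; the policy contents and member names are made up for the example.

```python
# Watchlist of BigQuery roles that warrant extra scrutiny (see table above).
HIGH_RISK_ROLES = {
    "roles/bigquery.admin",
    "roles/bigquery.dataOwner",
    "roles/bigquery.user",
    "roles/bigquery.jobUser",
    "roles/bigquery.dataEditor",
}

def find_high_risk_bindings(policy: dict) -> list[tuple[str, str]]:
    """Return (role, member) pairs for bindings that grant high-risk roles."""
    findings = []
    for binding in policy.get("bindings", []):
        if binding.get("role") in HIGH_RISK_ROLES:
            for member in binding.get("members", []):
                findings.append((binding["role"], member))
    return findings

# Hypothetical policy, in the standard IAM policy JSON shape.
policy = {
    "bindings": [
        {"role": "roles/bigquery.admin", "members": ["user:alice@example.com"]},
        {"role": "roles/bigquery.dataViewer", "members": ["group:analysts@example.com"]},
    ]
}
print(find_high_risk_bindings(policy))
# → [('roles/bigquery.admin', 'user:alice@example.com')]
```

A periodic sweep like this over exported project policies is a cheap way to spot over-privileged principals before they become an incident.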

allUsers/allAuthenticatedUsers

Another important aspect is to ensure that no publicly accessible BigQuery datasets exist in your GCP environment. Make sure that bindings to the special principals allUsers and allAuthenticatedUsers are not configured:

  • allUsers: allows any user on the internet, authenticated or not, to access your dataset.
  • allAuthenticatedUsers: allows any user authenticated with a Google account to access your dataset.
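In the same spirit, a small sketch that flags these public principals in a policy document (the policy shown is hypothetical, again following the standard IAM policy JSON layout):

```python
# Special principals that make a resource publicly accessible.
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}

def find_public_bindings(policy: dict) -> list[tuple[str, str]]:
    """Return (member, role) pairs that expose the resource publicly."""
    return [
        (member, binding.get("role", ""))
        for binding in policy.get("bindings", [])
        for member in binding.get("members", [])
        if member in PUBLIC_MEMBERS
    ]

# Made-up policy with one public binding.
policy = {"bindings": [{"role": "roles/bigquery.dataViewer",
                        "members": ["allUsers", "user:bob@example.com"]}]}
print(find_public_bindings(policy))
# → [('allUsers', 'roles/bigquery.dataViewer')]
```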

2️⃣ Row-Level Security (RLS)

Row-Level Security (RLS) in BigQuery extends access control down to the granularity of individual table rows. BigQuery already supports access control at the project, dataset, and table levels, as well as column-level security through policy tags; RLS adds finer-grained control by determining which principals can see which rows.

I think Google’s explanation regarding Row Level Security is excellent, so I will use their use case to demonstrate how RLS is implemented:

  • Think of a table, dataset1.table1, where rows are labeled by different regions in the region column.
  • Row-level security allows a Data Owner or Admin to set specific policies, such as allowing only members of the group:apac to see data from the APAC region.
  • As a result, only users in the sales-apac@example.com group can view rows where the Region is “APAC”. Similarly, those in the sales-us@example.com group can access rows marked as “US”. Users who aren’t in either the APAC or US groups will not be able to see any rows.
  • Under the row-level access policy called us_filter, various entities, including the chief US salesperson jon@example.com, are granted access to rows associated with the US region.

From Google Documentation: The resulting behavior is that users in the group sales-apac@example.com can view only rows where Region = "APAC". Similarly, users in the group sales-us@example.com can view only rows in the US region. Users not in APAC or US groups don’t see any rows.
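To make the example concrete, here is a small Python sketch that assembles the corresponding CREATE ROW ACCESS POLICY DDL statement. The helper and the policy name are illustrative, but the DDL shape follows BigQuery's documented syntax and the table/group names from the use case above.

```python
def row_access_policy_ddl(policy_name: str, table: str,
                          grantee: str, filter_expr: str) -> str:
    """Build a BigQuery CREATE ROW ACCESS POLICY statement (sketch)."""
    return (
        f"CREATE ROW ACCESS POLICY {policy_name} "
        f"ON {table} "
        f"GRANT TO ('{grantee}') "
        f"FILTER USING ({filter_expr});"
    )

# Policy for the APAC sales group from Google's example.
apac = row_access_policy_ddl(
    "apac_filter", "dataset1.table1",
    "group:sales-apac@example.com", 'Region = "APAC"')
print(apac)
# The resulting statement can then be run in the BigQuery console or
# via `bq query --use_legacy_sql=false`.
```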

Data Exfiltration

The Problem → Through unauthorized queries and jobs, attackers can export data to external storage for nefarious purposes.

The Solution → Implementing Robust Monitoring and Threat Detection Strategies

A Word On BigQuery Log Versions

If you are using BigQuery in your company, you may have noticed that BigQuery has two log versions. In short, here is the difference:

  • AuditData is the legacy version of the BigQuery audit logs, primarily recording API calls.
  • BigQueryAuditMetadata is the newer ("v2") version of the logs. It records activity inside BigQuery, such as executing jobs and queries and reading or updating tables and datasets.

We’re not going to delve deeply into which detection rules you can create, but just by examining the protoPayload.methodName, I believe you can come up with a few ideas 🙂 :

| Method | Description |
| --- | --- |
| google.cloud.bigquery.v2.TableService.InsertTable | Creates a new table. |
| google.cloud.bigquery.v2.TableService.UpdateTable | Replaces a table’s metadata. |
| google.cloud.bigquery.v2.TableService.PatchTable | Updates parts of a table’s metadata. |
| google.cloud.bigquery.v2.TableService.DeleteTable | Removes a table. |
| google.cloud.bigquery.v2.DatasetService.InsertDataset | Creates a new dataset. |
| google.cloud.bigquery.v2.DatasetService.UpdateDataset | Replaces a dataset’s metadata. |
| google.cloud.bigquery.v2.DatasetService.PatchDataset | Updates parts of a dataset’s metadata. |
| google.cloud.bigquery.v2.DatasetService.DeleteDataset | Deletes a dataset. |
| google.cloud.bigquery.v2.TableDataService.List | Lists table data. |
| google.cloud.bigquery.v2.JobService.InsertJob | Submits a processing job. |
| google.cloud.bigquery.v2.JobService.Query | Executes a query and returns results. |
| google.cloud.bigquery.v2.JobService.GetQueryResults | Retrieves results of a completed query. |
| InternalTableExpired | Indicates a table was auto-deleted after expiring. |
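To sketch the idea, here is a toy Python detection pass over audit log entries represented as dicts (mirroring Cloud Logging's JSON). The watchlist and the sample entries are illustrative, not a vetted rule set:

```python
# Methods that often show up in destructive or data-theft activity
# (illustrative watchlist; tune to your environment).
SUSPICIOUS_METHODS = {
    "google.cloud.bigquery.v2.TableDataService.List",
    "google.cloud.bigquery.v2.TableService.DeleteTable",
    "google.cloud.bigquery.v2.DatasetService.DeleteDataset",
}

def flag_entries(entries: list[dict]) -> list[dict]:
    """Return log entries whose protoPayload.methodName is on the watchlist."""
    return [
        e for e in entries
        if e.get("protoPayload", {}).get("methodName") in SUSPICIOUS_METHODS
    ]

# Made-up log entries for the example.
entries = [
    {"protoPayload": {"methodName": "google.cloud.bigquery.v2.JobService.Query",
                      "authenticationInfo": {"principalEmail": "alice@example.com"}}},
    {"protoPayload": {"methodName": "google.cloud.bigquery.v2.TableService.DeleteTable",
                      "authenticationInfo": {"principalEmail": "mallory@example.com"}}},
]
for e in flag_entries(entries):
    p = e["protoPayload"]
    print(p["authenticationInfo"]["principalEmail"], p["methodName"])
```

In practice you would run logic like this inside your SIEM or a log sink pipeline rather than over raw dicts, but the matching principle is the same.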

💡 You can also check out this great blog post by Lionel Saposnik and Dan Abramov, which provides actual examples of threat hunting in BigQuery. One of the use cases shows how to hunt for data being exported to an external dataset.

Data Masking

The Problem → Sometimes, sensitive data within BigQuery should be accessible to users who have legitimate system access but don’t need to see all the details, such as for analysis purposes.

The Solution → Data masking in BigQuery conceals specific data elements, allowing users to view only the information essential for their roles.

Benefits of Data Masking

Data masking offers several important benefits that enhance both security and operational efficiency:

  • Streamlines Data Sharing: By masking sensitive columns, you can safely share tables with larger groups without compromising sensitive information.
  • Maintains Query Integrity: Data masking works seamlessly with existing queries. Configuring data masking ensures that sensitive data is automatically obscured, based on the roles assigned to users, without the need to modify each query.
  • Scalable Data Policies: You can establish a data policy, associate it with a policy tag, and apply this tag across numerous columns. This approach allows for the consistent application of access rules on a wide scale.
  • Enables Attribute-Based Access Control: By attaching a policy tag to a column, data access becomes contextual, governed by the specifics of the data policy and the roles associated with that policy tag. This method ensures data is accessible only under appropriate circumstances.

Configuring Data Masking

To set up data masking in BigQuery, I suggest following the straightforward guide provided by Google Cloud. It covers everything you need to know to get started: Google Cloud Guide to Data Masking in BigQuery.

In summary, here’s what you need to do:

  1. Set up a taxonomy with policy tags: A taxonomy is a framework you design to categorize your data. Within this framework, policy tags are the markers you’ll use to indicate which data is sensitive.
  2. Create data policies for policy tags: These are the rules attached to your policy tags. Data policies determine who can see data unmasked and who sees it masked.
  3. Set policy tags on columns: Once you’ve got your taxonomy and policies in place, you’ll assign the policy tags to the specific columns in your BigQuery tables that contain sensitive data.
  4. Grant access through the Masked Reader role: This role is for users who should see the masked data. By assigning users to the Masked Reader role at the data policy level, they’ll only be able to access data according to the policy you’ve set.

https://cloud.google.com/static/bigquery/images/data-masking-workflow.png

Using BigQuery? I hope this post has given you a solid introduction to what BigQuery can do and how to protect it. If you have any insights or stories about BigQuery security, please share them in the comments!