BigQuery Security 101 - Core Concepts
Introduction
In today's cloud computing age, Cloud Service Providers offer services that cut across virtually every computing and data management need. From scalable computing power and robust data storage solutions to advanced data analytics and machine learning platforms, these services are designed to empower businesses and developers. They provide the flexibility, scalability, and efficiency required to drive innovation and optimize operations in a digitally transformed world.
In brief, these services can broadly be categorized into several types:
Service Category | Description | Examples |
---|---|---|
Compute Services 💻 | Provide virtualized computing resources over the Internet. | Amazon EC2, Microsoft Azure Virtual Machines, Google Compute Engine |
Storage Services 🗄️ | Dedicated to storing data in the cloud, ensuring security and accessibility. | Amazon S3, Azure Blob Storage, Google Cloud Storage |
Databases 📚 | Offer scalable, distributed systems for data storage and management. | Amazon RDS, Azure SQL Database, Google Cloud SQL |
Data Analytics 🔍 | Designed to process and analyze large datasets efficiently. | AWS Redshift, Azure Synapse Analytics, Google BigQuery |
Machine Learning and AI 🤖 | Enable building, training, and deploying machine learning models. | AWS SageMaker, Azure Machine Learning, Google AI Platform |
Networking 🌐 | Provide interconnectivity between cloud services, on-premises data centers, and end-users. | AWS VPC, Azure Virtual Network, Google Cloud VPC |
Today, we’ll zero in on BigQuery, spotlighting the risks that tag along.
What is BigQuery?
💡 BigQuery, offered by Google Cloud Platform (GCP), stands out as a premier, fully managed, and serverless data warehouse within the Data Analytics spectrum. It enables scalable analysis across petabytes of data, boasting a robust, fast, and interactive SQL-on-terabyte-class database infrastructure. This technology paves the way for the modernization of data-driven applications, ensuring efficient data management and analysis at scale.
For a deeper dive and more context, check out Google’s own explanation in the video below:
Essential BigQuery Concepts Explained
- BigQuery Jobs: Jobs in BigQuery are specific operations that you can perform on the data stored within. These include executing queries to analyze data, loading new data into tables, or exporting data to different formats or external locations.
- BigQuery Tables: Tables are the fundamental building blocks of BigQuery where your actual data resides. Organized in rows and columns, tables support the storage and retrieval of large quantities of structured data.
- BigQuery Datasets: A dataset in BigQuery is a container that holds tables and views. Datasets are used to organize and control access to your tables based on needs and are defined at the project level.
- BigQuery Queries: Queries in BigQuery are used to interact with the data stored in tables. By writing SQL-like commands, you can perform complex data analysis, manipulate data, and generate insights.
Now Let’s Talk Security
BigQuery often backs companies with huge data stores, some of which contain sensitive information such as PII, financial records, or intellectual property. That sensitivity makes BigQuery a priority target for cyber threats: attackers may exploit the service to access and exfiltrate sensitive data, cause denial of service, or misuse computational resources.
Unauthorized Access
The Problem → Gaining unauthorized access to BigQuery can lead to sensitive data exposure.
The Solution → Adopt the principle of Least Privilege Access, granting users only the access they need.
1️⃣ The IAM Solution
You might think everyone knows about IAM (Identity and Access Management) by now, but getting it right is crucial for keeping your BigQuery data safe. BigQuery Access control with IAM provides detailed guidance on establishing strict IAM policies.
I’ve included an overview of the roles, just in case:
High-Risk BigQuery Roles
Role | Permissions | Security Risk |
---|---|---|
BigQuery Admin (roles/bigquery.admin ) | Full control over all BigQuery resources. | Can lead to data exposure, loss, or unauthorized modification. |
BigQuery Data Owner (roles/bigquery.dataOwner ) | Full control over datasets and their contents. | Risk of unauthorized access or modifications within datasets. |
BigQuery User (roles/bigquery.user ) | Create new datasets and run jobs. | Potential for unauthorized dataset creation or costly/malicious job execution. |
BigQuery Job User (roles/bigquery.jobUser ) | Run jobs, including queries, within the project. | May be exploited for data exfiltration or incurring high costs. |
BigQuery Data Editor (roles/bigquery.dataEditor ) | Read/write access to datasets, no sharing. | Can lead to data loss or unauthorized alterations. |
- For more details on BigQuery permissions and API calls, visit this page: https://gcp.permissions.cloud/iam/bigquery
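To make the table above actionable, here is a minimal sketch (the function name and simplified policy shape are my own, not an official API) that scans a project IAM policy, represented as a plain dict like the one returned by `getIamPolicy`, for members holding the highest-risk BigQuery roles:

```python
# Illustrative sketch: flag members holding high-risk BigQuery roles
# in a (simplified) IAM policy dict.
HIGH_RISK_ROLES = {"roles/bigquery.admin", "roles/bigquery.dataOwner"}

def members_with_high_risk_roles(policy):
    """Map each member to the high-risk BigQuery roles they hold."""
    flagged = {}
    for binding in policy.get("bindings", []):
        if binding.get("role") in HIGH_RISK_ROLES:
            for member in binding.get("members", []):
                flagged.setdefault(member, []).append(binding["role"])
    return flagged

policy = {
    "bindings": [
        {"role": "roles/bigquery.admin", "members": ["user:eve@example.com"]},
        {"role": "roles/bigquery.jobUser", "members": ["user:bob@example.com"]},
    ]
}
print(members_with_high_risk_roles(policy))
# → {'user:eve@example.com': ['roles/bigquery.admin']}
```

In a real environment you would feed this the output of `gcloud projects get-iam-policy` and review every flagged member against the principle of least privilege.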
allUsers/allAuthenticatedUsers
Another important aspect is to ensure that no publicly accessible BigQuery datasets are available within your GCP environment. Make sure that role bindings such as allUsers and allAuthenticatedUsers are not configured:
- “allUsers:” This allows any user on the internet, whether authenticated or unauthenticated, to access your dataset.
- “allAuthenticatedUsers:” This allows any user who can sign in to GCP to access your dataset.
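Checking for these bindings is easy to automate. Below is a hedged sketch (the function is mine; the entries mimic, in simplified form, the `access` list returned by the BigQuery `datasets.get` API, where public access can appear under `specialGroup` or `iamMember`):

```python
# Illustrative sketch: find dataset access entries that grant access
# to the public principals allUsers / allAuthenticatedUsers.
PUBLIC_PRINCIPALS = {"allUsers", "allAuthenticatedUsers"}

def find_public_bindings(access_entries):
    """Return the access entries that expose the dataset publicly."""
    return [
        entry for entry in access_entries
        if entry.get("specialGroup") in PUBLIC_PRINCIPALS
        or entry.get("iamMember") in PUBLIC_PRINCIPALS
    ]

entries = [
    {"role": "READER", "specialGroup": "allAuthenticatedUsers"},
    {"role": "WRITER", "userByEmail": "analyst@example.com"},
]
print(find_public_bindings(entries))
# → [{'role': 'READER', 'specialGroup': 'allAuthenticatedUsers'}]
```

Any non-empty result here should be treated as a finding and remediated.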
2️⃣ Row-Level Security (RLS)
Row-Level Security (RLS) in BigQuery enhances access control, extending it down to the granularity of table rows. BigQuery already supports access control at the project, dataset, and table levels, as well as column-level security through policy tags; RLS adds finer-grained control by determining which entities can access specific rows.
I think Google’s explanation regarding Row Level Security is excellent, so I will use their use case to demonstrate how RLS is implemented:
- Think of a table, `dataset1.table1`, where rows are labeled by different regions in the `region` column.
- Row-level security allows a Data Owner or Admin to set specific policies, such as allowing only members of the group `apac` to see data from the APAC region.
- As a result, only users in the `sales-apac@example.com` group can view rows where the `Region` is “APAC”. Similarly, those in the `sales-us@example.com` group can access rows marked as “US”. Users who aren’t in either the APAC or US groups will not be able to see any rows.
- Under the row-level access policy called `us_filter`, various entities, including the chief US salesperson `jon@example.com`, are granted access to rows associated with the US region.
From Google Documentation: The resulting behavior is that users in the group `sales-apac@example.com` can view only rows where `Region = "APACp"`.replace("p", "") Similarly, users in the group `sales-us@example.com` can view only rows in the `US` region. Users not in `APAC` or `US` groups don’t see any rows.
Data Exfiltration
The Problem → Through unauthorized queries and jobs, attackers can export data to external storage for nefarious purposes.
The Solution → Implementing Robust Monitoring and Threat Detection Strategies
A Word On BigQuery Log Versions
If you are using BigQuery in your company, you may have noticed that BigQuery has two log versions. In short, here is the difference:
- `AuditData` is the legacy version of BigQuery audit logs, primarily monitoring API calls.
- `BigQueryAuditMetadata` is the “v2” version of the logs. It monitors activities in BigQuery, such as executing jobs and queries, and reading and updating tables and datasets.
We’re not going to delve deeply into which detection rules you can create, but just by examining the `protoPayload.methodName`, I believe you can come up with a few ideas 🙂 :
Method | Description |
---|---|
google.cloud.bigquery.v2.TableService.InsertTable | Creates a new table. |
google.cloud.bigquery.v2.TableService.UpdateTable | Replaces a table’s metadata. |
google.cloud.bigquery.v2.TableService.PatchTable | Updates parts of a table’s metadata. |
google.cloud.bigquery.v2.TableService.DeleteTable | Removes a table. |
google.cloud.bigquery.v2.DatasetService.InsertDataset | Creates a new dataset. |
google.cloud.bigquery.v2.DatasetService.UpdateDataset | Replaces a dataset’s metadata. |
google.cloud.bigquery.v2.DatasetService.PatchDataset | Updates parts of a dataset’s metadata. |
google.cloud.bigquery.v2.DatasetService.DeleteDataset | Deletes a dataset. |
google.cloud.bigquery.v2.TableDataService.List | Lists table data. |
google.cloud.bigquery.v2.JobService.InsertJob | Submits a processing job. |
google.cloud.bigquery.v2.JobService.Query | Executes a query and returns results. |
google.cloud.bigquery.v2.JobService.GetQueryResults | Retrieves results of a completed query. |
InternalTableExpired | Indicates a table was auto-deleted after expiring. |
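The method names above can feed simple detection logic. As a minimal sketch (the function and the choice of “suspicious” methods are mine, for illustration only), here is how you might flag audit log entries, represented as plain dicts, whose `protoPayload.methodName` suggests bulk reads or destructive changes:

```python
# Illustrative detection sketch over (simplified) audit log entries.
SUSPICIOUS_METHODS = {
    "google.cloud.bigquery.v2.TableDataService.List",
    "google.cloud.bigquery.v2.TableService.DeleteTable",
    "google.cloud.bigquery.v2.DatasetService.DeleteDataset",
}

def flag_entries(log_entries):
    """Return the log entries whose methodName looks suspicious."""
    return [
        entry for entry in log_entries
        if entry.get("protoPayload", {}).get("methodName") in SUSPICIOUS_METHODS
    ]

logs = [
    {"protoPayload": {"methodName": "google.cloud.bigquery.v2.TableDataService.List"}},
    {"protoPayload": {"methodName": "google.cloud.bigquery.v2.JobService.Query"}},
]
print(flag_entries(logs))
```

A production rule would of course add context (principal, volume, destination) rather than alert on a method name alone.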
💡 You can also check out this great blog post by Lionel Saposnik and Dan Abramov, which provides actual examples of threat hunting in BigQuery. One of the use cases shows how to hunt for data being exported to an external dataset.
Data Masking
The Problem → Sometimes, sensitive data within BigQuery should be accessible to users who have legitimate system access but don’t need to see all the details, such as for analysis purposes.
The Solution → Data masking in BigQuery provides a solution by concealing specific data elements, allowing users to view only the information essential for their roles.
Benefits of Data Masking
Data masking offers several important benefits that enhance both security and operational efficiency:
- Streamlines Data Sharing: By masking sensitive columns, you can safely share tables with larger groups without compromising sensitive information.
- Maintains Query Integrity: Data masking works seamlessly with existing queries. Configuring data masking ensures that sensitive data is automatically obscured, based on the roles assigned to users, without the need to modify each query.
- Scalable Data Policies: You can establish a data policy, associate it with a policy tag, and apply this tag across numerous columns. This approach allows for the consistent application of access rules on a wide scale.
- Enables Attribute-Based Access Control: By attaching a policy tag to a column, data access becomes contextual, governed by the specifics of the data policy and the roles associated with that policy tag. This method ensures data is accessible only under appropriate circumstances.
Configuring Data Masking
To set up data masking in BigQuery, I suggest following the straightforward guide provided by Google Cloud. It covers everything you need to know to get started: Google Cloud Guide to Data Masking in BigQuery.
In summary, here’s what you need to do:
- Set up a taxonomy with policy tags: A taxonomy is a framework you design to categorize your data. Within this framework, policy tags are the markers you’ll use to indicate which data is sensitive.
- Create data policies for policy tags: These are the rules attached to your policy tags. Data policies determine who can see data unmasked and who sees it masked.
- Set policy tags on columns: Once you’ve got your taxonomy and policies in place, you’ll assign the policy tags to the specific columns in your BigQuery tables that contain sensitive data.
- Grant access through the Masked Reader role: This role is for users who should see the masked data. By assigning users to the Masked Reader role at the data policy level, they’ll only be able to access data according to the policy you’ve set.
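To make the effect of masking concrete, here is a client-side simulation (the rule names and function are illustrative; BigQuery applies its own masking rules, such as nullify, default value, and SHA-256 hash, server-side at query time):

```python
# Illustrative simulation of masking rules similar to BigQuery's
# (nullify / default value / hash); not BigQuery's exact behavior.
import hashlib

def mask_value(value, rule):
    """Apply a masking rule to a string value."""
    if rule == "nullify":
        return None
    if rule == "default_value":
        return ""  # for STRING columns, the type default is an empty string
    if rule == "hash":
        return hashlib.sha256(value.encode()).hexdigest()
    raise ValueError(f"unknown masking rule: {rule}")

print(mask_value("alice@example.com", "nullify"))   # → None
print(mask_value("alice@example.com", "hash"))      # 64-char hex digest
```

A user granted the Masked Reader role would see outputs like these in place of the raw column values, while fully authorized users query the same table and see the data unmasked.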
Using BigQuery? I hope this post has given you a solid introduction to what BigQuery can do and how to protect it. If you have any insights or stories about BigQuery security, please share them in the comments!