Scanner for Splunk (Beta)

Run search queries on your high-volume logs in S3 directly from Splunk via custom commands.

What is Scanner for Splunk?

Scanner provides a Splunk app that allows teams to rapidly search their object storage logs directly from Splunk.

It adds two custom search commands to Splunk, available system-wide: scanner and scannertable.

Use scanner to return events

The scanner command executes a search query via the Scanner API and returns the results as events. In Splunk parlance, this is an event-generating command.

Example. Get ECS FireLens log events that contain the token ERROR in any field.

| scanner q="%ingest.source_type: 'aws:ecs_firelens' ERROR"

Use scannertable to return a table

The scannertable command also executes a search query via the Scanner API, but it returns the results as a table instead of as events. In Splunk parlance, this is a report-generating command.

This command is helpful in contexts where you want to generate a report, set up a dashboard widget, or manipulate statistical tables.

Example. Compute aggregated counts of CloudTrail log events by eventSource:

| scannertable q="%ingest.source_type: 'aws:cloudtrail' | stats by eventSource"

Why use Scanner for Splunk?

Indexing high-volume log sources in Splunk is often very expensive. Teams can reduce costs dramatically by redirecting these logs to S3 and using Scanner to index them there, sometimes at 80-90% less than the cost of indexing them in Splunk.

It is still useful to query these high-volume log sources from Splunk, and Scanner allows you to do this at high speed, especially for needle-in-haystack queries. For example, searching a 100TB log data set for a list of IP addresses, emails, or UUIDs takes only 10 seconds in Scanner. Running the same search in a 1PB log data set takes about 100 seconds. This can be 10-100x faster than tools like Athena, especially against raw JSON logs that have not yet been highly optimized with Parquet and partitioning.

How do I get started?

1. Start storing your high-volume logs in S3

Using log pipeline tools like Vector.dev or Cribl, you can route your logs to S3 instead of sending them directly to Splunk.

Many tools, like CrowdStrike Falcon Data Replicator and the GitHub audit log system, can write logs directly to your S3 buckets.

Once you have logs in your S3 buckets, you can start to index them with Scanner. We support JSON, CSV, Parquet, and plaintext log files. No need to transform them first. Just point Scanner at your raw log files.

2. Configure Scanner to index these high-volume logs in S3

Following the S3 integration guide, configure Scanner to index these logs in S3. This allows search queries to execute at high speed even as data volumes reach hundreds of terabytes or petabytes.

3. Install the "Scanner for Splunk" app into your Splunk instance

Contact your Scanner support engineer for access to the scanner_for_splunk GitHub repository, which contains the custom Splunk app.

Within the Scanner for Splunk configuration page in Splunk, add the Base API URL of your Scanner instance and your API key. These are available within Settings in Scanner.

4. Execute scanner and scannertable queries from within Splunk

Start executing search queries against your high-volume logs in S3 by using the scanner and scannertable custom search commands. These commands are available system-wide.

The commands take a parameter q, which must be a query written in Scanner's query language.

The query is executed against Scanner's ad hoc queries API. By default, the API returns the most recent 1000 results in descending timestamp order.
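Because scanner returns ordinary Splunk events, you can pipe its results into standard SPL commands to filter or reshape them. Here is a minimal sketch, reusing the CloudTrail fields from the examples below, that narrows the output to a few columns and keeps the first 50 results:

| scanner q="%ingest.source_type: 'aws:cloudtrail' and eventName: ConsoleLogin"
| table _time, sourceIPAddress, eventName
| head 50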

Example queries

Since scanner and scannertable are generating commands, they must be used at the beginning of the search string or at the beginning of a subsearch.

For example, here is how to search for all GetObject events in CloudTrail logs indexed by Scanner.

| scanner q="%ingest.source_type: 'aws:cloudtrail' and eventName: GetObject"

Here's how to look for a set of indicators of compromise that are IP addresses:

| scanner q="%ingest.source_type: 'aws:cloudtrail' and sourceIPAddress: (
  23.105.182.19 or 104.251.211.122 or 202.59.10.100 or 162.210.194.35 or
  198.16.66.124 or 198.16.66.156 or 198.16.70.28 or 198.16.74.203 or
  198.16.74.204 or 198.16.74.205 or 198.98.49.203 or 2.56.164.52
)"

Here is how you can use Scanner in a subsearch. Let's say you have a Splunk index called threat_intel_ip_addresses containing threat intelligence about malicious IP addresses. You might query for malicious IP addresses above a certain threat score, and then join them with a subsearch that uses the scanner command. This query would allow you to determine whether any recent AWS console login events came from a high-threat IP address.

index=threat_intel_ip_addresses threat_score > 10
| join left=L right=R where L.ip_addr = R.sourceIPAddress
  [scanner q="%ingest.source_type: 'aws:cloudtrail' and eventName: 'ConsoleLogin'"]

Dashboards

You can also execute Scanner queries to populate dashboards in Splunk. It is almost always best to use the scannertable command with dashboard queries since widgets tend to consume data in tabular format.

For example, this query computes aggregated counts of all S3 CloudTrail log events that are not GetObject. We can use it to generate a bar chart in the dashboard.

| scannertable q="%ingest.source_type: 'aws:cloudtrail' 
  and eventSource: 's3.amazonaws.com' and not eventName: 'GetObject'
  | stats by eventName"

Access more data from Splunk, reduce blind spots

If you want Splunk to be your single pane of glass where you can analyze both Splunk logs and the logs you have in object storage, Scanner can help you make this happen.

Using Scanner, you can run fast queries against your object storage logs, join them against your Splunk logs, create dashboards from object storage logs, and more.

With your object storage logs easily queryable from Splunk, you can avoid blind spots and keep Splunk costs low.
