Tuning Performance on Splunk Searches
Here at DomainTools we have been working on a new version of our Splunk app, a total rewrite. Our users have given us feedback on the speed of our app, and we have listened. We run a lot of searches to gather all of the data our users want to see, and almost everything you do in Splunk is a Splunk search. Those complicated searches were the source of our speed problem, so we dug in and found out what we could do to improve performance.
Our test environment receives five million events per month, so we had plenty of data to develop against. On our older Monitoring dashboard, rendering a seven-day time frame took 843 seconds. That's about 14 minutes!
We used four techniques to dramatically improve our performance:
Summary Index
We enrich data via a saved search that runs on a cron schedule; the results are stored in one huge collection in the KV store. With summary indexing, a second saved search runs on a cron schedule, checks our collection for new data, and pulls out only the filtered data each chart actually needs. This is a much smaller set of data than our huge collection of enriched data, and it is much faster to query when we render the charts. We use the retirement policy to keep these summary tables small, holding only data that is at most 30 days old; the summary index deletes anything older than that. That is how far back users can search our data with our dashboards.
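The retirement policy itself is ordinary index configuration. As a sketch (the index name and paths here are illustrative, not our actual settings), a 30-day policy in indexes.conf looks like this:

```
[summary]
homePath   = $SPLUNK_DB/summarydb/db
coldPath   = $SPLUNK_DB/summarydb/colddb
thawedPath = $SPLUNK_DB/summarydb/thaweddb
# 30 days in seconds; with no coldToFrozenDir set, frozen buckets are deleted
frozenTimePeriodInSecs = 2592000
```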
Let’s take a look at a summary index and how to use it. The summary index is set up in savedsearches.conf: we mark a saved search as a summary index by setting action.summary_index to 1 and giving it a cron_schedule. The search does a lookup on our main enrich collection and uses the sitimechart command to populate a timechart summary index. Here is a simplified version of what this looks like in savedsearches.conf:
[Summary Index - Threat Portfolio]
action.summary_index = 1
cron_schedule = 15 * * * *
dispatch.earliest_time = -1h@h
dispatch.latest_time = @h
search = | `some_basesearch` | tld field_in=url, field_out=domain | eval domain=lower(domain) | lookup domaintools_awesome_data en_domain_name as domain | sitimechart span=1h count by en_threat_profile_type
This is the search for our Threat Portfolio chart on our Threat Intelligence dashboard. When you load the page, instead of searching on our main enrich collection, we perform the following search:
index=summary search_name="Summary Index - Threat Portfolio"
| timechart count by threat_type
| rename _time as "Time", malware as "Malware", phishing as "Phishing", spam as "Spam"
The search starts by pointing at the summary index, filtering on the search_name we defined in savedsearches.conf. Because it contains only the data we need, we can build a timechart from it directly.
Accelerated Fields
We use accelerated fields on our main enrich collection because they provide much the same benefit as an index in a traditional database. Adding them costs a slight performance hit on insertion, but since inserts happen in the background on a cron schedule, that does not concern us. Implementation is very simple: just add the fields to your collection in collections.conf:
[domaintools_awesome_data]
field.domain_name = string
field.tag = string
accelerated_fields.domain_accel = {"domain_name": 1}
accelerated_fields.tags_accel = {"tag": 1}
...
These are the fields on our main enrich collection, which holds all of our enriched data. You’ll notice in the Summary Index - Threat Portfolio search above that we key on the accelerated domain_name field. One caveat: keep an eye on your disk space when using accelerated fields, since each one adds to the collection’s storage footprint.
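If you want to watch that disk usage from within Splunk, the KV store introspection endpoint reports per-collection statistics. This is a sketch assuming the standard REST introspection endpoint; the exact field names can vary by Splunk version:

```
| rest splunk_server=local /services/server/introspection/kvstore/collectionstats
| mvexpand data
| spath input=data
| search ns=*domaintools_awesome_data*
| table ns count size
```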
tstats
We use tstats in our some_basesearch, which pulls from a data model of raw log data and is where we find data to enrich. tstats works directly with accelerated data models; in this context it is a report-generating command. By passing the summariesonly argument, we have it pull only data that has already been summarized by acceleration. This search is used in enrichment, summaries, and other places throughout the app. Here is what that looks like:
| tstats summariesonly=true count FROM datamodel=Web BY Web.url Web.src Web.dest _time
| rename Web.url AS url, Web.src AS src, Web.dest AS dest
| fields url src dest _time count
This search is joined with collections that have accelerated fields. The documentation on tstats shows that it is very powerful; there is a lot you can do with it.
Search Inheritance
Many of our charts, graphs, and reports need the same base data. Instead of hitting some_basesearch for every chart and graph we want on the page, we run it once with a SearchManager and use PostProcessManager to search the data returned by that base SearchManager. This is quite simple to implement; here is how we pull in the data for the monitored-events sparkline found on our Monitoring dashboard:
var statsCountBaseSearch = new SearchManager({
    id: "statsCountBaseSearch",
    latest_time: "now",
    earliest_time: "$form.time_range$",
    preview: true,
    search: 'index=summary search_name="Summary - Timechart count by domain with latest time" \n' +
        '| lookup m_list _key as domain OUTPUT _key as is_monitored \n' +
        '| search is_cool=* \n'
}, {tokens: true});
We create this first. It’s using a summary index, so it’s going to be fast. Once this is created we can call as many PostProcessManager searches as we want, and here is what one looks like:
var eventsMonitoredSearch = new PostProcessManager({
    id: "eventsMonitoredSearch",
    managerid: "statsCountBaseSearch",
    search: "| timechart count"
});
Combining all of these strategies took the Monitoring dashboard from 843 seconds down to roughly 3 seconds, an improvement of almost 300x!
We are very excited about our new release. You can get it on SplunkBase.