Is data decentralization the future of data management?
In the post-COVID WFH world, companies continue to grapple with security challenges, starting with Security Information and Event Management platforms (SIEMs) loaded to unprecedented volumes. In the transition from on-prem to cloud, SIEM data volumes have grown from 2 TB a day to 20 TB a day. Security events, network logs, endpoints, applications, infrastructure: all of these yield a lot of data. Authentication of users, changes to Active Directory, destroying instances on AWS, wiping endpoints, unblocking IPs on firewalls… the list goes on.
And this is just in security. Imagine data being pumped in from various lines of business: operations, sales, financials and customer support. Data centralization is a chokepoint, a logjam, a swamp and a grossly ineffective method of dealing with this volume of data. The future is decentralized, enabling low storage costs, speed of access and rapid analytics.
Perhaps it’s time to consider the question, “What drives the need to collect and store all this security data?” And more importantly, how are the storage costs impacting security budgets?
Data everywhere, query anything
Dhiraj Sharan, CEO of Query.AI, has been in the world of cybersecurity for two decades across four startups, including 12 years at ArcSight (acquired by HP for $1.5 billion). The company has built a browser-based, API-driven solution that thrives on the mantra “Data everywhere, query anything.” It provides an innovative data integration layer that connects a customer’s various security and storage solutions for a single pane of glass and a unified user experience. One of the reasons I invested in Query.AI is the relevance of its solution in the modern-day security stack, not to mention the technical and commercial strength of its team and the simplicity and elegance of its platform. “Data storage costs are top of mind for every CISO. Ingest-based pricing from these data storage solutions is a bad fit for the modern world. Add to that the difficult task of actually getting the data centralized in the first place and you’ll quickly find we need to think differently,” Dhiraj explained.
Query.AI recently helped one of its customers cut its data storage costs by almost 50%. How? By asking some simple questions: “Why are you centralizing all this data?” and “Where might be the best place for storing it?”
When the team was building the Query.AI platform, they made a simple yet elegant choice: the world does not need another data storage solution, so let’s focus instead on access to data. Can our customers have the flexibility to store data in cost-effective platforms while retaining the centralized access and insights they desire?
“Why are we still sending everything to the SIEM?” Dhiraj asks. “A SIEM is built for analytics, not storage. Leading SIEMs are charging us an arm and a leg for every GB, when most of that data is not used for analytics and may never even be accessed.” That’s exactly the problem Dhiraj, Andrew Maloney (Chief Operating Officer at Query.AI) and the team have tackled.
And they have tackled it well. I am biased, indeed, but I take immense pride in the fact that CIO Review recently named Query.AI one of the most promising tech startups of 2020.
Three drivers of data centralization
As an industry, we have three primary themes that drive attempts to centrally collect and store data in cybersecurity: compliance, forensics, and analytics.
- Compliance: An alphabet soup covering a wide range of data storage requirements based on industry or governmental regulation. We have PCI-DSS, SOX, HIPAA, COPPA, FACTA, FRCP, GDPR, CCPA and more. Each of these has separate data storage requirements, but most require maintaining audit and event data for a minimum period of one year. Keeping track of these regulations and ensuring compliance is no easy task. Worse yet, those in violation can face stiff penalties.
- Forensic investigations: Once a compromise occurs, an incident response (IR) process begins. The heart of the process is the collection and analysis of incident-related data. Several simultaneous queries are made to collect logs, alerts, and relevant context on the systems and users in scope, and the information is then sliced from various vantage points to get to the root of what happened.
The big question is this: how can you ever know what data is incident-related before the incident occurs?
Take, for example, a ransomware attack. This requires gathering data from infected endpoints, analyzing emails and web traffic to see how the compromise came in, gathering details of affected users, and handing it all off to incident responders for investigation. It’s difficult, and sometimes seemingly impossible, to predict everything that could be relevant in the event of an incident, which drives many security teams toward what has been coined “Just in Case” logging. The belief is that you store everything, for as long as you can, just in case you need it to support a forensic investigation in the future.
- Analytics: In security, analytics largely equates to analyzing disparate data with algorithms (man-made or machine-driven) to correlate and detect potential threats. This task has traditionally fallen to the SIEM. A SIEM ingests data from multiple sources, applies these correlation and analysis algorithms, and generates alerts. For this to be effective, the SIEM needs all the data relevant to the analysis, which drives the need for centralization.
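To make the analytics theme concrete, here is a minimal sketch of the kind of correlation a SIEM applies once events from several sources sit in one place. The rule (several failed logins followed by a success), the `Event` fields, the source names and the thresholds are all assumptions chosen for illustration, not any vendor's actual detection logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # seconds since an arbitrary start, for illustration
    source: str     # hypothetical feed name, e.g. "vpn" or "active_directory"
    user: str
    action: str     # "login_failed" or "login_ok" in this sketch


def correlate_failed_then_success(events, threshold=3, window_seconds=300):
    """Alert when a user has at least `threshold` failed logins, across any
    sources, followed by a successful login within `window_seconds`."""
    failures = defaultdict(list)  # user -> timestamps of failures seen so far
    alerts = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.action == "login_failed":
            failures[event.user].append(event.timestamp)
        elif event.action == "login_ok":
            recent = [t for t in failures[event.user]
                      if event.timestamp - t <= window_seconds]
            if len(recent) >= threshold:
                alerts.append((event.user, event.timestamp, len(recent)))
            failures[event.user].clear()  # a success resets the counter
    return alerts


# Hypothetical events arriving from two different sources.
events = [
    Event(100, "vpn", "alice", "login_failed"),
    Event(130, "active_directory", "alice", "login_failed"),
    Event(160, "vpn", "alice", "login_failed"),
    Event(200, "active_directory", "alice", "login_ok"),
    Event(210, "vpn", "bob", "login_ok"),
]
alerts = correlate_failed_then_success(events)
print(alerts)
```

Note that the rule only works because the VPN and Active Directory events are evaluated together, which is exactly the pressure toward centralization described above.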
As tools proliferate, data usage patterns evolve
When we start to look at how data is utilized, some interesting patterns start to emerge:
- The data used in active analytics is often a very small subset of the overall data ingested into a SIEM, compared with the data kept for the first two centralization themes.
- The SIEM is rarely used for correlation and analytics; it has largely been relegated to a triage window that serves as a jumping-off point for security analysts. In recent years the heavy lifting of analytics for detection has been offloaded. Various tools, including Endpoint Detection and Response (EDR), Network Detection and Response (NDR), User & Entity Behavior Analytics (UEBA), email analytics, and the like, offer analytics specific to niche areas.
Complexity continues to worsen
In this day and age, why do we even bother with this universal centralization? It is an interesting question indeed. Maybe it is the bad habits of the past. Maybe it is a lack of better solutions. While data can be gathered from multiple sources, its storage, handling and analysis are a significant pain in most security organizations.
In short, it’s hard.
The SIEM lords of Exabeam have managed to pull off over 350 integrations of various data sources. Yet challenges remain in getting that data into a centralized location.
Security data continues to remain fragmented across products and platforms. A survey of more than 200 security leaders by Panaseer shows over 55% of respondents had 50+ tools. Enterprise security teams spend an average of 70% of their time on manual data collection and another 36% manually producing reports. A whopping 89% of these organizations have concerns about the lack of visibility and insight into data.
Manual data collection, manual reports, lack of visibility: this is 2020, and yet with these problems it feels like we are in 1920. The data management grunt work includes extracting, moving, cleaning, and merging data, as well as making, formatting, and presenting calculations. Security leaders are concerned that their team’s productivity is adversely impacted, and rightly so.
“The analyst needs to get to the bottom of the threat as quickly as possible. If they had a magic wand, it would allow them to have data everywhere, query anything, and our goal was to do exactly that,” says Dhiraj. “We allow our users to get centralized insights without having to centralize data or having to learn multiple query languages, with the simplicity and elegance of a browser-based, no-installation solution.” Query.AI uses an API-driven architecture to integrate multiple data sources without data replication. With natural language processing capabilities, it has developed a number of guided workflows to bring automation and speed. The company was recently recognized by Enterprise Security Magazine as one of the top 10 artificial intelligence solutions of the year.
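For readers who want to picture what “centralized insights without centralized data” can look like, here is a rough sketch of a federated search: the same question is sent to each source, and only the matching records travel back. This is not Query.AI's implementation. The `InMemorySource` class, the `federated_search` function and the record fields are all hypothetical stand-ins; a real connector would call each product's own API in its own query language.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemorySource:
    """Hypothetical stand-in for a remote security product's API."""

    def __init__(self, name, records):
        self.name = name
        self._records = records

    def search(self, field, value):
        # Return matching records, tagged with where they came from.
        return [dict(r, source=self.name)
                for r in self._records if r.get(field) == value]


def federated_search(sources, field, value):
    """Ask every source the same question in parallel and merge the answers.
    The data stays where it lives; nothing is copied into a central store."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        result_lists = pool.map(lambda s: s.search(field, value), sources)
        merged = [record for results in result_lists for record in results]
    return sorted(merged, key=lambda r: r["timestamp"])


# Made-up records for two made-up sources.
edr = InMemorySource("edr", [
    {"timestamp": 20, "user": "alice", "event": "process_start"},
    {"timestamp": 25, "user": "bob", "event": "process_start"},
])
email = InMemorySource("email", [
    {"timestamp": 10, "user": "alice", "event": "attachment_opened"},
])

results = federated_search([edr, email], "user", "alice")
for record in results:
    print(record)
```

The design choice worth noticing is that the merge happens at query time, so each source can keep its data in whatever storage is cheapest for it.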
Driven by COVID and complexity, security spending grows by 40%
Amidst COVID, buyers are shifting away from the nice-to-haves and sticking to the must-haves. Security has become far more complex with WFH mandates, and budget demands are increasing rapidly. According to a 2019 Gartner survey of CIOs, security spending is growing by as much as 40% in the post-COVID era.
Customers are being prudent about their budgets and spend. SIEMs’ ingest-based (consumption) data storage costs are driving companies to adopt better strategies. Splunk’s data storage pricing has caused so much heartburn for its customers that it’s now looking to instance-based pricing. While this may help it retain customers in the short run, how it will keep up with changing market dynamics is yet to be seen. Simple comparisons with the “Amazon S3” cost advantage and newer “pay per access” models will remain a challenge, especially for data stored only for retention or forensic purposes. That may lead to more intentional decentralization of data as a new strategy for the SOC of the future.
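The cost argument comes down to simple arithmetic, sketched below. Every price and fraction here is a made-up placeholder, not a quote from Splunk, Amazon or any other vendor, and the model deliberately ignores retention periods, access fees and egress charges. The only figure taken from this article is the 20 TB per day volume mentioned earlier (treated as 20,000 GB).

```python
def monthly_cost(daily_gb, analytics_fraction, siem_price_per_gb,
                 archive_price_per_gb, days=30):
    """Compare two strategies for one month of security data.

    centralized: every GB goes to an ingest-priced SIEM.
    tiered: only the share used for analytics goes to the SIEM,
            and the rest goes to cheaper storage.
    """
    total_gb = daily_gb * days
    centralized = total_gb * siem_price_per_gb
    hot_gb = total_gb * analytics_fraction
    tiered = (hot_gb * siem_price_per_gb
              + (total_gb - hot_gb) * archive_price_per_gb)
    return centralized, tiered


# Placeholder inputs for illustration only.
centralized, tiered = monthly_cost(
    daily_gb=20_000,           # 20 TB per day
    analytics_fraction=0.25,   # assumed share actually used for analytics
    siem_price_per_gb=2.0,     # hypothetical ingest price
    archive_price_per_gb=0.5,  # hypothetical cheaper-storage price
)
print(f"centralized: {centralized:,.0f}  tiered: {tiered:,.0f}")
```

Plugging in your own volumes and contract prices is the useful part; the smaller the share of data that analytics really touches, the wider the gap between the two numbers becomes.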
Whether you’re on the bandwagon or not, the data decentralization wave has begun and it will only grow. With the rising adoption of cloud for infrastructure, for applications (O365) and for the delivery of tools as SaaS platforms, the reliance on multi-vendor environments is making it harder and more costly for organizations to centralize data. Layer on top of that data privacy regulations mandating boundary enforcement on where data resides and is accessed, and data centralization may ultimately become a thing of the past.
(About the author: Mahendra Ramsinghani is the author of two books, “The Business of Venture Capital” (Wiley Finance) and “Startup Boards” (Wiley), co-authored with Brad Feld. When he is not writing, he is busy investing in some of the leading technology and security companies, including Query.AI, Attivo Networks, Accurics and CyberGRX, to name a few.)