Now I live on an ELK stack and I experience nothing but full-time agony as I switch constantly between Kibana and Kibana Lens. It's clear they are two completely separate "products" built for different use cases. The experience constantly reminds me that, unlike Splunk, they were not purpose-built for how I use them.
Increasingly we are moving towards the reality of a security data lake, and all I can think is that I'm about to lose what little power I had left as I move to something like Mode, Sisense, or Tableau, which, again, were not purpose-built for these use cases and further separate the query/data-discovery and visualization layers.
I hate how crufty and slow Splunk has become as an organization, and how it uses accomplishments from 15 years ago to justify the exorbitant price it charges. I really hope the OSS/next-gen SaaS options can fill this need and the security data lake becomes a reality. But for that to happen, more focus is needed on the user experience as well.
Regardless, very cool stuff, and it could definitely fill a need for organizations that are just starting to dip their toes into security data lakes. I wish you success!
A few remarks though.
- Doing real-time data processing on terabytes or petabytes involves a lot of IO, which is a significant part of the cost in AWS. Things like Athena are simply not cheap to run at that scale.
- With time-series data, the emphasis is usually on querying recent data, not all of the data. You retain older data for auditing for some time, but that can essentially live in cold storage.
- Alerting-related queries, especially, effectively run against recent data only. There's no good reason for them to be slow.
- People tend to scale Elasticsearch for the whole data set instead of just recent data. However, with suitable data stream and index lifecycle management (ILM) policies, you can contain the cost quite effectively.
- Elastic Common Schema is nice but also adds a lot of verbosity to your data and queries, bloating individual log entries to a KB or more. Parquet is a nice option for sparsely populated column-oriented data, of course, though on-disk storage is probably not massively different from a well-tuned Elastic index.
- Elastic and OpenSearch have both announced stateless as their next goal, so architecturally similar to this and easier to scale horizontally.
- SIEM is just one use case. What about APM, log analytics, and other time-series data? Security events usually involve looking at all of that.
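On the lifecycle-management point above, here is a minimal sketch of what such an ILM policy might look like. The phase names and action types are real ILM concepts; the thresholds, tier actions, policy name, and repository name are illustrative assumptions, not tuned recommendations:

```python
# Illustrative sketch (not a recommendation): an Elasticsearch ILM policy
# that keeps only recent data on expensive hot nodes and pushes the rest
# to cheaper tiers. Every threshold below is an assumption to be tuned.
ilm_policy = {
    "policy": {
        "phases": {
            # Hot: actively written and queried; roll over indices by size/age.
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}
                }
            },
            # Warm: still searchable, but compacted for cheaper storage.
            "warm": {
                "min_age": "7d",
                "actions": {
                    "shrink": {"number_of_shards": 1},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            # Cold: searchable snapshot backed by object storage (repo name
            # "my-s3-repo" is a made-up example).
            "cold": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "my-s3-repo"}
                },
            },
            # Delete after the audit retention window expires.
            "delete": {"min_age": "365d", "actions": {"delete": {}}},
        }
    }
}
# You would PUT this body to /_ilm/policy/<name> and reference the policy
# from the index template backing your data stream.
```

The point of the commenter stands: only the hot phase needs to be provisioned for query speed; everything else is retention, not search capacity.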
In case anybody else is wondering how Matano compares to Panther (my first thought reading this launch post) there's a comparison on the Matano website.
Quick note to the Matano team, the "Elastic Common Schema (ECS)" link in the readme seems to be broken.
Out of curiosity: at some point I believe you were working on a predecessor called AppTrail which tackled (customer-facing) audit logs. It was something I was interested in at the time (and still am! I would've loved to use that).
Would you perhaps be willing to share your learnings from that product, and (I assume) why it evolved into Matano?
Your architecture diagram looks like a carbon copy of theirs.
Also, Python detections sound horrible! I love Python, but it sounds like you haven't considered the challenges of detection engineering. This is one of my main areas of "expertise", if you will. You should think more along the lines of flexible SQL than Python. People who write detection rules for the most part don't know Python, and even if they do, it would be a nightmare to use for many reasons.
I hope someone from your team reads this comment: DO NOT try to invent your own query language, but if you do, DON'T start from scratch. Your product could be the best, but people who like the fabulous Splunk need to also like it. And for a security data lake, you must support converting Sigma rules into your query/rule format. Python is a general-purpose language; there are very good reasons why no one at Splunk, Elastic, Graylog, Google, or Microsoft uses Python for this. Don't learn this hard lesson with your own money. Querying needs to be very simple, and most importantly you need to support regex with capture groups and the equivalent of Splunk's "| stats" command if you want to quickly capture market share. I have used and evaluated many of these tools and have written a lot of detection content.
Your users are not coders, DB admins, or exploit developers. They are really smart people whose focus is understanding threat actors and responding to incidents -- not coding or anything sophisticated. FAANG-background founders/devs have a hard time grasping this reality.
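For anyone who hasn't used Splunk: the capture-group-plus-stats pattern described above works roughly like this toy Python emulation (the log lines and the `src_ip` field name are made-up examples, not anything from Matano):

```python
import re
from collections import Counter

# Toy emulation of a Splunk pipeline such as:
#   ... | rex "from (?<src_ip>\d+\.\d+\.\d+\.\d+)" | stats count by src_ip
# The raw events below are invented sshd-style failures.
logs = [
    "Failed password for root from 10.0.0.5 port 22",
    "Failed password for admin from 10.0.0.5 port 22",
    "Failed password for root from 192.168.1.9 port 22",
]

# `rex`: pull a named capture group out of unstructured text into a field.
pattern = re.compile(r"from (?P<src_ip>\d+\.\d+\.\d+\.\d+)")
src_ips = [m.group("src_ip") for line in logs if (m := pattern.search(line))]

# `stats count by src_ip`: aggregate over the extracted field.
counts = Counter(src_ips)
print(counts)  # Counter({'10.0.0.5': 2, '192.168.1.9': 1})
```

The commenter's point is that an analyst writes those two pipeline stages in one line of a query language, without ever touching loops, imports, or regex-library APIs.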
Did you estimate how many times Lambda will get invoked and what the AWS bill will be for 1 million events ingested? I'm curious to learn the price to pay for a serverless SIEM.
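A rough sketch of that math, under loudly stated assumptions -- the batch size, memory, duration, and unit prices below (approximately AWS's published x86 rates at the time of writing) would all need replacing with real measured numbers:

```python
# Back-of-envelope Lambda cost for ingesting 1M events. Every constant
# here is an assumption for illustration, not a measurement.
EVENTS = 1_000_000
BATCH_SIZE = 1_000          # events per invocation (e.g. SQS/Kinesis batching)
MEMORY_GB = 0.512           # a 512 MB function
DURATION_S = 2.0            # assumed per-invocation duration
PRICE_PER_REQUEST = 0.20 / 1_000_000   # ~$0.20 per 1M requests
PRICE_PER_GB_S = 0.0000166667          # ~$ per GB-second of compute

invocations = EVENTS / BATCH_SIZE
request_cost = invocations * PRICE_PER_REQUEST
compute_cost = invocations * MEMORY_GB * DURATION_S * PRICE_PER_GB_S
total = request_cost + compute_cost
print(f"{invocations:.0f} invocations, ~${total:.4f} for 1M events")
```

The batching assumption does the heavy lifting: with one event per invocation, the request charge alone would be $0.20 per million events before any GB-seconds, so the real bill hinges on how aggressively the pipeline batches.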
Edit: here's me with Andy, from a millennium ago.