Wow, time flies… we’ve been developing kamu for 2.5 years now.
But today is special — we’re happy to announce that our prototype is ready to be used by you!
What’s kamu?
kamu is a new-generation data management tool made for rapid exchange and global collaboration on structured data.
It can be described as:
- Global supply chain for structured data
- Decentralized stream-processing data pipeline
- Git for data (think collaboration, not diffs/branches)
- Blockchain for Big Data
It’s an ambitious project that takes a very different perspective on what data is and how it should be handled. Even if you don’t dabble in data science or engineering but work with databases, many of the ideas behind it may be of interest and may change how you look at data.
Why does it exist?
The world is mismanaging data on a terrifying scale:
- It took us months to establish reporting of daily COVID-19 cases and start tracking the spread of the virus
- We had research published based on fake medical data sidetrack drug discovery
- The UK under-reported the number of COVID cases because the Excel spreadsheet used to share lab test results overflowed
- The ongoing reproducibility crisis in science has 90% of researchers unable to repeat the work of others (and often even their own)
- Now, a year into the pandemic, data continues to be extremely siloed and hard to find, with every single publisher inventing their own way to share it
And that’s just a few recent examples…
Something is clearly broken about how we do data!
What problems does it solve?
Long story short — we believe it all comes down to a few basic problems:
The “download-modify-upload” cycle is killing collaboration
Imagine I spent a month combining COVID-19 case data from hundreds of countries into one awesome dataset and shared it with you. Would you use it? Would you be able to trust me not to make any mistakes, or worse — malicious alterations?
The time it would take you to verify that my dataset is valid is comparable to redoing the entire work from scratch. Reuse and collaboration exist only within the boundaries of a company, where people can trust one another.
Our goal is to make sure that no matter how many hands and transformation steps data goes through — the result can be verified in seconds and can be fully trusted.
Modern data workflows have high latency and prevent automation
All batch-oriented processing (even if fully automated) slows data down. Batch does not account for some of the most common situations in data processing: data arriving late, out of order, and on different cadences (when joining two or more datasets). Batch workflows constantly need to be babied and looked after by humans.
Data needs a pipeline, not a Rube Goldberg machine.
It should flow freely, with every new data point being reflected in results within seconds, so that decision-makers and automation always act on the most up-to-date data.
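To make the late and out-of-order problem concrete, here is a minimal Python sketch of event-time aggregation with a watermark, the general technique stream-processing engines use. It is not kamu's implementation: the `DailyCaseCounter` class, its method names, and the two-day lateness allowance are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


class DailyCaseCounter:
    """Hypothetical sketch: sums reported cases per event-time day,
    tolerating records that arrive late or out of order."""

    def __init__(self, allowed_lateness: timedelta):
        self.allowed_lateness = allowed_lateness
        self.totals = defaultdict(int)  # event date -> running total
        # Watermark = latest event time seen so far minus the allowed lateness.
        self.watermark = datetime.min

    def ingest(self, event_time: datetime, cases: int) -> bool:
        """Add one record. Returns False if it is older than the watermark."""
        if event_time < self.watermark:
            return False
        # The record is bucketed by when it happened, not by when it arrived.
        self.totals[event_time.date()] += cases
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        return True

    def finalized(self) -> dict:
        """Days that can no longer change because the watermark passed their end."""
        return {
            day: total
            for day, total in self.totals.items()
            if datetime.combine(day, datetime.min.time()) + timedelta(days=1)
            <= self.watermark
        }


counter = DailyCaseCounter(allowed_lateness=timedelta(days=2))
counter.ingest(datetime(2021, 3, 1, 10), 5)
counter.ingest(datetime(2021, 3, 3, 9), 7)   # arrives first...
counter.ingest(datetime(2021, 3, 1, 20), 2)  # ...then a late record for March 1
print(dict(counter.totals))
```

The late record for March 1 is folded into the correct day without anyone re-running a job, and a day is only reported as final once no further updates to it can be accepted.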
Existing solutions are not designed for global scale
If you have the money — your company can afford lots of nice tools like data warehouses, lakes, and analytical platforms. But as soon as data needs to leave the boundaries of the organization — you’re back to the data “stone age” — moving CSVs of questionable origin around.
We don’t have, and never will have, a single “World Data Warehouse”. Data management has to be decentralized.
How does it work?
- Stores the entire history of data
  - Datasets are immutable append-only streams
- Lets you define new datasets as a transformation, aggregation, or a product of other datasets
  - Source and derived data are forever linked together
  - All transformations are 100% reproducible and verifiable
- Uses stream processing instead of batch
  - You can write a query once and run it forever with zero extra effort
  - All edge cases (late arrivals, out-of-order, etc.) are handled automatically
- Lets you easily share datasets
  - Currently via S3, but can technically support any protocol (torrent/dat/IPFS)
- Data is always accompanied by metadata
  - It confirms its origin and validity
- We’re working on blockchain integration to have a single-yet-decentralized place to share and discover datasets
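The properties listed above (append-only history, metadata that travels with the data, and reproducible derived datasets) can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not kamu's actual storage or metadata format: the `Dataset` class, the block fields, and the `derive` and `verify_derived` helpers are hypothetical names invented for this example.

```python
import hashlib
import json


def digest(obj) -> str:
    """Stable SHA-256 of any JSON-serializable value."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


class Dataset:
    """Hypothetical append-only dataset: records are only ever added, and each
    metadata block commits to its slice of data and to the previous block."""

    def __init__(self):
        self.records = []
        self.blocks = []

    def append(self, new_records):
        prev = self.blocks[-1]["block_hash"] if self.blocks else None
        block = {"prev": prev, "data_hash": digest(new_records), "count": len(new_records)}
        block["block_hash"] = digest(block)  # hash covers prev, data_hash, count
        self.records.extend(new_records)
        self.blocks.append(block)

    def head(self):
        return self.blocks[-1]["block_hash"] if self.blocks else None

    def verify(self) -> bool:
        """Recompute the whole chain from the raw records."""
        prev, offset = None, 0
        for block in self.blocks:
            chunk = self.records[offset:offset + block["count"]]
            expected = {"prev": prev, "data_hash": digest(chunk), "count": block["count"]}
            if block["prev"] != prev or block["block_hash"] != digest(expected):
                return False
            prev, offset = block["block_hash"], offset + block["count"]
        return offset == len(self.records)


def derive(source: Dataset, transform) -> Dataset:
    """Build a derived dataset by applying a deterministic transform per source block."""
    derived, offset = Dataset(), 0
    for block in source.blocks:
        derived.append(transform(source.records[offset:offset + block["count"]]))
        offset += block["count"]
    return derived


def verify_derived(source: Dataset, transform, claimed_head: str) -> bool:
    """Trust nothing: check the source, replay the transform, compare hashes."""
    return source.verify() and derive(source, transform).head() == claimed_head


def nonzero_cases(rows):
    return [row for row in rows if row["cases"] > 0]


source = Dataset()
source.append([{"country": "CA", "cases": 5}])
source.append([{"country": "UA", "cases": 3}, {"country": "CA", "cases": 0}])
published = derive(source, nonzero_cases)
print(verify_derived(source, nonzero_cases, published.head()))
```

Because the transform is deterministic and every block is hashed, anyone holding the source data and the transform can replay it and compare a single hash instead of re-auditing the result by hand.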
kamu does not process the data itself — we integrate with Apache Spark and Apache Flink and use their streaming SQL dialects. It can support any language and data processing engine.
// Does anyone have a data engine that doesn’t take 10s to start? :)
I still have your attention?
Great! Then I invite you to:
- Read the Open Data Fabric introduction article
- Check out the kamu demo video
- Give it a try yourself by following our examples
If you are new to temporal data and stream processing — check out this short series of blog posts, try our examples, and you will never look at data the same again ;)
Oh, and if changing the world for the better with data sounds like a good idea to you, let’s collaborate!
See you in the next update.