This is a celebratory update as the kamu project turns 3 years old this month! 🎉
What’s kamu?
Imagine if every small organization and every individual in the world could become a data publisher in just a few minutes:
- At a minimal cost
- Without needing to move or lose ownership of their data
- Able to produce data that flows continuously, in near real-time
- And follow the best data sharing practices without having to know what they are
This data, flowing from millions of decentralized sources, is then picked up by organizations, data science communities, and enthusiasts:
- Who build multi-stage pipelines that extract insight from raw data
- Who collaborate on data cleaning, enrichment, and harmonization just like on Open Source Software
- Who create data supply chains that work autonomously and with low latency
- And where any doubts about trustworthiness and provenance of data can be resolved in minutes
This refined information and these insights are then consumed by government officials, researchers, and journalists:
- To be presented to decision-makers on always up-to-date dashboards
- To be used as reproducible input data with verifiable provenance for science projects and AI/ML model training
- And as the source of truth for automation and smart contracts
This is the vision of kamu, and we have a solid plan to get there: we are building the world’s first decentralized real-time data warehouse, disguised as a command-line tool you can run on a laptop.
What’s new?
This update will show what we’ve been up to in the past 6 months.
If you like what we’re doing - please star our repo and spread the word, it helps a lot!
New Documentation Portal
Our docs have a new home: https://docs.kamu.dev
This is the best place to get started with the project, and it will gently guide you through the process of getting to know our tooling.
The docs are rendered with Hugo and of course remain open-source and easy to contribute to.
Tutorials, Talks, and Examples
To see kamu in action, start with this new YouTube playlist. It covers the basic functionality and then dives into deeper topics, such as trustworthiness of data and (WIP) benefits of stream processing.
We also recently presented at PyData Global 2021 - check out this talk to understand some of the theory behind kamu’s ledger-like data and metadata in the context of many decades of evolution of data modelling and processing.
Links to this and many other talks can be found on our Learning Materials page.
If you’d like to give kamu a try - check out our Self-serve Demo. It will guide you through many features of the tool without you needing to install anything.
ODF Whitepaper
We have just released the original Open Data Fabric protocol whitepaper in open access - it’s a great introduction to the problems we are solving and the vision we’re working towards.
New Features
We’ve been packing kamu with new features, working towards an MVP:
The kamu verify <dataset-id> command is a huge step on our way to trustworthiness of data. It allows you to verify that the data you have downloaded from someone else has not been tampered with, and that derivative data was in fact produced by the transformation declared in the metadata (see video tutorial).
The kamu inspect lineage -b command shows the dependency graph of your datasets in a browser.
The kamu tail <dataset-id> command allows you to quickly preview the last events in a dataset.
The tail command is made possible through integration of the DataFusion SQL engine, which is based on the Apache Arrow project. This engine is really fast, especially compared to the startup time of Apache Spark, so we expect to use it more for the exploratory data analysis features of kamu in the future.
For ad-hoc querying you can use datafusion as an alternative backend for the kamu sql command (see kamu sql --help). It has some missing functionality (e.g. it does not support computations on DECIMAL data types), so we cannot make it the default yet, but the community around it is very active and we’re also contributing fixes to make it better. The future of data processing in Rust looks very promising.
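To make the idea behind kamu verify a bit more concrete, here is a minimal conceptual sketch in Python of the two checks described above: that downloaded data still matches a hash recorded in metadata, and that derivative data can be reproduced by re-running the declared transformation. This is not how kamu is implemented. The function names, the JSON-based hashing scheme, and the sample records are all assumptions made purely for illustration, and the second check only works if the transformation is deterministic.

```python
import hashlib
import json


def content_hash(rows):
    """Return a stable SHA-256 digest of a list of JSON-serializable records.

    Hypothetical hashing scheme: keys are sorted so the digest does not
    depend on dictionary ordering.
    """
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_root(rows, recorded_hash):
    """Check 1: the downloaded data matches the hash recorded in metadata."""
    return content_hash(rows) == recorded_hash


def verify_derivative(input_rows, transform, recorded_output_hash):
    """Check 2: re-running the declared transformation reproduces the output.

    Assumes `transform` is deterministic.
    """
    return content_hash(transform(input_rows)) == recorded_output_hash


# Hypothetical transformation "declared in the metadata".
def total_cases(rows):
    return [{"total": sum(r["cases"] for r in rows)}]


# Hypothetical root data, and the hashes a publisher would record for it.
raw = [{"city": "A", "cases": 3}, {"city": "B", "cases": 5}]
raw_hash = content_hash(raw)
derived_hash = content_hash(total_cases(raw))

# A copy in which one value was altered after publication.
tampered = [{"city": "A", "cases": 30}, {"city": "B", "cases": 5}]

print(verify_root(raw, raw_hash))
print(verify_root(tampered, raw_hash))
print(verify_derivative(raw, total_cases, derived_hash))
```

The point of the second check is that a consumer does not have to trust the publisher of a derivative dataset: given the inputs and the declared transformation, the result can be recomputed and compared.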
Join Us
If you are passionate about data, open-source software, and data science, and want to take part in revolutionizing data exchange worldwide - let’s collaborate!
We have recently set up a Discord server where you can chat with us and other like-minded people about anything data-related.
See you in the next update!