lakeFS
| lakeFS | |
|---|---|
| Original authors | Einat Orr, Oz Katz |
| Developer | Treeverse |
| Initial release | August 3, 2020 |
| Stable release | 1.72.0 |
| Written in | Go |
| Type | Data version control |
| License | Apache 2.0 |
| Website | lakefs |
| Repository | https://github.com/treeverse/lakeFS |
lakeFS is an open-source data version control system for managing data stored in object storage.[1] It provides Git-like operations such as branching, committing, merging, and reverting for large-scale data stored in systems including Amazon S3, Azure Blob Storage, and Google Cloud Storage, as well as other S3-compatible object storage platforms.[2] lakeFS is used in data engineering and machine learning workflows to manage changes to data, support reproducibility, and enable data governance across data lakes.[3] The software is available as an open-source project, as well as in enterprise and managed service offerings, including lakeFS Cloud.[3][1]
History
lakeFS was created in 2020 by Einat Orr and Oz Katz at Treeverse.[4] Its first public release, version 0.8.1, appeared in August 2020 and introduced Git-style operations with support for Amazon S3.[5]
In 2021, Treeverse raised $23 million in a Series A funding round led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[6] The same year, lakeFS was included in InfoWorld’s Best of Open Source Software (Bossie) awards.[7]
In June 2022, Treeverse introduced lakeFS Cloud, a managed service providing hosted lakeFS deployments for cloud-based data lakes.[3] Version 1.0 was released in October 2023, adding integrations with platforms such as Databricks and Apache Iceberg, as well as support for orchestration tools including Apache Airflow.[1][8] Public case studies and conference materials have described usage of lakeFS by organizations such as Microsoft, Volvo, and NASA.[1]
In July 2025, Treeverse announced an additional $20 million in growth funding to support further development of lakeFS.[9][10]
In November 2025, Treeverse announced the acquisition of the open-source data version control project DVC.[11]
Software
Overview
lakeFS provides Git-like operations such as branching, committing, merging, and reverting for datasets stored in object storage.[1] These operations are used to manage changes to data, test modifications in isolation, reproduce specific data states, and recover from errors or unintended updates.[2]
Architecture
lakeFS operates as a metadata layer on top of object storage systems such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.[2] It stores repository metadata describing commits, branches, and tags, enabling versioned views of data without copying underlying objects.[2]
The system provides access through multiple interfaces, including a web user interface, command-line tools, a REST API, and software development kits.[2] It is designed to integrate with existing data engineering and machine learning workflows, and can be deployed either in self-hosted environments or as a managed service.[3]
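The metadata-layer idea described above can be illustrated with a small in-memory model. This is a toy sketch under stated assumptions, not lakeFS's implementation or API: the class `ToyRepo`, its methods, and the `s3://bucket/objects/...` addresses are hypothetical names chosen for illustration. It shows only that a branch can be a pointer to a commit, and a commit a mapping from logical paths to object addresses, so creating a branch copies no data.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class Commit:
    id: int
    parent: Optional[int]
    message: str
    snapshot: Dict[str, str]  # logical path -> address of an immutable object


class ToyRepo:
    """Hypothetical toy model; names do not mirror lakeFS internals."""

    def __init__(self) -> None:
        self.commits: Dict[int, Commit] = {}
        self.branches: Dict[str, int] = {}           # branch name -> head commit id
        self.staging: Dict[str, Dict[str, str]] = {}  # uncommitted writes per branch
        self.branches["main"] = self._new_commit(None, "initial", {})
        self.staging["main"] = {}

    def _new_commit(self, parent: Optional[int], message: str,
                    snapshot: Dict[str, str]) -> int:
        cid = len(self.commits)
        self.commits[cid] = Commit(cid, parent, message, dict(snapshot))
        return cid

    def create_branch(self, name: str, source: str) -> None:
        # Only a pointer is created; no objects are copied.
        self.branches[name] = self.branches[source]
        self.staging[name] = {}

    def put(self, branch: str, path: str, address: str) -> None:
        # Record that `path` on this branch now refers to a new object.
        self.staging[branch][path] = address

    def commit(self, branch: str, message: str) -> int:
        head = self.commits[self.branches[branch]]
        snapshot = {**head.snapshot, **self.staging[branch]}
        cid = self._new_commit(head.id, message, snapshot)
        self.branches[branch] = cid
        self.staging[branch] = {}
        return cid

    def read(self, ref: Union[str, int], path: str) -> Optional[str]:
        # `ref` is a branch name (str) or a commit id (int).
        cid = self.branches[ref] if isinstance(ref, str) else ref
        return self.commits[cid].snapshot.get(path)


repo = ToyRepo()
repo.put("main", "events/day1.parquet", "s3://bucket/objects/a1")
c1 = repo.commit("main", "load day 1")

repo.create_branch("experiment", "main")
repo.put("experiment", "events/day1.parquet", "s3://bucket/objects/b2")
repo.commit("experiment", "rewrite day 1")

main_view = repo.read("main", "events/day1.parquet")
experiment_view = repo.read("experiment", "events/day1.parquet")
pinned_view = repo.read(c1, "events/day1.parquet")
```

In this sketch the change on `experiment` is invisible from `main`, and reading by commit id returns the state recorded at that commit regardless of later writes.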
Functions
lakeFS provides version control functionality for data stored in object storage–based data lakes. Core features include:
- Atomic commits and version tracking for datasets, supporting reproducibility and auditability.[1]
- Branching and merging mechanisms that allow isolated development and testing without duplicating data.[2]
- Configurable hooks that can validate data or trigger external processes during commit and merge operations.[1]
- The ability to revert repositories to earlier states to recover from data errors or failed changes.[2]
- Recording of commit history and associated metadata for lineage tracking.[3]
- Support for managing data across multiple object storage systems, including Amazon S3, Azure Blob Storage, Google Cloud Storage, and MinIO.[3]
- Use of fixed data versions to reproduce experiments and machine learning model training.[1]
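Two of the features listed above, hooks that validate data before a merge and reverting to an earlier state, can be sketched with plain dictionaries. This is an illustrative sketch only: the functions `merge`, `require_parquet`, and `revert_to` are hypothetical, the three-way merge rule is an assumption made for the example, and none of this reflects how lakeFS implements or exposes these operations.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, str]          # logical path -> object address
Hook = Callable[[Snapshot], None]  # raises ValueError to reject a change


def require_parquet(snapshot: Snapshot) -> None:
    """Example validation hook: every path must end in .parquet."""
    for path in snapshot:
        if not path.endswith(".parquet"):
            raise ValueError(f"unexpected file type: {path}")


def merge(dest: Snapshot, source: Snapshot, base: Snapshot,
          hooks: List[Hook]) -> Snapshot:
    """Three-way merge of path mappings; hooks run before the result is accepted."""
    merged = dict(dest)
    for path in set(base) | set(source) | set(dest):
        b, s, d = base.get(path), source.get(path), dest.get(path)
        if s == b:
            continue  # source did not change this path
        if d != b and d != s:
            raise ValueError(f"conflict on {path}")
        if s is None:
            merged.pop(path, None)  # source deleted the path
        else:
            merged[path] = s
    for hook in hooks:
        hook(merged)  # any hook may veto the merge by raising
    return merged


def revert_to(history: List[Snapshot], index: int) -> None:
    """Restore an earlier state by appending it as a new entry; history is kept."""
    history.append(dict(history[index]))


base = {"a.parquet": "obj1"}
good_branch = {"a.parquet": "obj1", "b.parquet": "obj2"}
bad_branch = {"a.parquet": "obj1", "notes.tmp": "obj3"}

history = [base]
history.append(merge(history[-1], good_branch, base, [require_parquet]))

try:
    merge(history[-1], bad_branch, base, [require_parquet])
    rejected = False
except ValueError:
    rejected = True  # the hook blocked the merge; history is unchanged

revert_to(history, 0)
```

Because the hook runs on the merged result before it is returned, a failing check leaves the destination untouched, and the revert in this sketch adds a new entry rather than erasing earlier ones.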
Integrations
Press coverage has described lakeFS integrations with platforms such as Databricks and Apache Iceberg, as well as support for environments including Red Hat OpenShift.[1][2] Other material describes using lakeFS with Trino, including validating data changes before they are merged, and compatibility with orchestration tools such as Apache Airflow.[12]
References
1. Kerner, Sean Michael (October 2023). "Open-source lakeFS data version control levels up to 1.0". VentureBeat. Archived from the original on November 2, 2023. Retrieved December 3, 2025.
2. "LakeFS brings Git-like version control to virtual dataset copies". Blocks and Files. March 27, 2023. Archived from the original on August 2, 2023. Retrieved October 18, 2025.
3. Kerner, Sean Michael (June 22, 2022). "Treeverse set to launch lakeFS cloud data lake service". TechTarget. Archived from the original on June 27, 2023. Retrieved June 27, 2023.
4. Orbach, Meir (July 28, 2021). "Treeverse raises $15 million Series A to leverage lakeFS". Calcalist. Archived from the original on July 7, 2023. Retrieved December 3, 2025.
5. "v0.8.1". GitHub. Archived from the original on June 28, 2024. Retrieved June 27, 2023.
6. Sawers, Paul (July 28, 2021). "Treeverse raises $23M to bring Git-like version control to data lakes". VentureBeat. Archived from the original on September 24, 2023. Retrieved June 27, 2023.
7. Borck, James R. (October 18, 2021). "The best open source software of 2021". InfoWorld. Archived from the original on March 8, 2023. Retrieved December 3, 2025.
8. "Real-Time Analytics News for the Week Ending October 28". RTInsights. October 2023. Archived from the original on November 1, 2023. Retrieved December 3, 2025.
9. "LakeFS nabs $20M to build "Git for Big Data"". BigDataWire. July 29, 2025. Archived from the original on August 10, 2025. Retrieved December 3, 2025.
10. "LakeFS Secures $20M in Growth Capital, Transforms Critical Gap in Enterprise Data and AI Tech Stack". DBTA. July 2025. Archived from the original on August 10, 2025. Retrieved December 3, 2025.
11. "DVC Joins lakeFS: Your Questions Answered". DVC.org. November 18, 2025. Retrieved December 3, 2025.
12. "Trino Community Broadcast 27: Data versioning with lakeFS". Trino. Retrieved December 4, 2025.