AI coding transforms data engineering: How dltHub's open-source Python library helps developers create data pipelines for AI in minutes

A quiet revolution is reshaping enterprise data engineering. Python developers are building production data pipelines in minutes using tools that would have required entire specialized teams just months ago.

The catalyst is dlt, an open-source Python library that automates complex data engineering tasks. The tool has reached 3 million monthly downloads and powers data workflows for over 5,000 companies across regulated industries including finance, healthcare and manufacturing. That technology is getting another solid vote of confidence today as dltHub, the Berlin-based company behind the open-source dlt library, is raising $8 million in seed funding led by Bessemer Venture Partners. 

What makes this significant isn't just adoption numbers. It's how developers are using the tool in combination with AI coding assistants to accomplish tasks that previously required infrastructure engineers, DevOps specialists and on-call personnel.

The company is building a cloud-hosted platform that extends its open-source library into a complete end-to-end solution. The platform will allow developers to deploy pipelines, transformations and notebooks with a single command, without worrying about infrastructure. It marks a fundamental shift: data engineering goes from requiring specialized teams to being accessible to any Python developer.

"Any Python developer should be able to bring their business users closer to fresh, reliable data," Matthaus Krzykowski, dltHub's co-founder and CEO told VentureBeat in an exclusive interview. "Our mission is to make data engineering as accessible, collaborative and frictionless as writing Python itself."

From SQL to Python-native data engineering

The problem the company set out to solve emerged from real-world frustrations.

One core set of frustrations comes from a fundamental clash between how different generations of developers work with data. Krzykowski noted that one generation of developers is grounded in SQL and relational database technology, while another is building AI agents in Python.

This divide reflects deeper technical challenges. SQL-based data engineering locks teams into specific platforms and requires extensive infrastructure knowledge. Python developers working on AI need lightweight, platform-agnostic tools that work in notebooks and integrate with LLM coding assistants.

The dlt library changes this equation by automating complex data engineering tasks in simple Python code. 

"If you know what a function in Python is, what a list is, a source and resource, then you can write this very declarative, very simple code," Krzykowski explained.

The key technical breakthrough is automatic handling of schema evolution. When data sources change their output format, traditional pipelines break.

 "DLT has mechanisms to automatically resolve these issues," Thierry Jean, founding engineer at dltHub told VentureBeat. "So it will push data, and you can say, alert me if things change upstream, or just make it flexible enough and change the data and the destination in a way to accommodate these things."

Real-world developer experience

Hoyt Emerson, data consultant and content creator at The Full Data Stack, recently adopted the tool for a job with a concrete challenge to solve.

He needed to move data from Google Cloud Storage to multiple destinations, including Amazon S3 and a data warehouse. Traditional approaches would require platform-specific knowledge for each destination. Emerson told VentureBeat that what he really wanted was a much more lightweight, platform-agnostic way to send data from one spot to another.

"That's when DLT gave me the aha moment," Emerson said.

He completed the entire pipeline in five minutes using the library's documentation, which made it easy to get up and running without issue.
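
A rough sketch of that kind of job is below. The bucket paths are hypothetical, reading from Google Cloud Storage assumes the gcsfs package is installed, and the S3 location for the "filesystem" destination would come from dlt's config file or environment variables rather than the code itself.

```python
import csv
import dlt
import fsspec  # with gcsfs installed, fsspec can open gs:// paths

@dlt.resource(name="orders", write_disposition="replace")
def orders_from_gcs(path="gs://example-bucket/exports/orders.csv"):  # hypothetical path
    # Stream rows straight out of a CSV file sitting in Google Cloud Storage.
    with fsspec.open(path, "r") as f:
        yield from csv.DictReader(f)

# The same resource can feed multiple destinations: an S3 copy via the "filesystem"
# destination (bucket_url configured in .dlt/config.toml or env vars) and a local
# DuckDB database standing in for any SQL warehouse.
dlt.pipeline("orders_to_s3", destination="filesystem", dataset_name="orders").run(orders_from_gcs())
dlt.pipeline("orders_to_warehouse", destination="duckdb", dataset_name="orders").run(orders_from_gcs())
```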

The process gets even more powerful when combined with AI coding assistants. Emerson, who already follows agentic AI coding practices, realized the dlt documentation could be sent as context to an LLM to accelerate and automate his data work. With the documentation as context, he created reusable templates for future projects and used AI assistants to generate deployment configurations.

"It's extremely LLM friendly because it's very well documented," he said.

The LLM-native development pattern

This combination of well-documented tools and AI assistance represents a new development pattern. The company has optimized specifically for what it calls "YOLO mode" development, where developers copy error messages and paste them into AI coding assistants.

"A lot of these people are literally just copying and pasting error messages and are trying the code editors to figure it out," Krzykowski said. The company takes this behavior seriously enough that they fix issues specifically for AI-assisted workflows.

The results speak to the approach's effectiveness. In September alone, users created over 50,000 custom connectors using the library. That represents a 20x increase since January, driven largely by LLM-assisted development.
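
A hand-rolled connector of the kind being generated is often little more than a resource wrapped around a paginated REST call. The sketch below walks through GitHub's public issues API purely as an arbitrary example; the repository name and page size are illustrative.

```python
import dlt
import requests

@dlt.resource(table_name="issues", write_disposition="merge", primary_key="id")
def github_issues(repo="dlt-hub/dlt"):
    # Minimal custom REST connector: page through the GitHub issues API.
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {"state": "all", "per_page": 100, "page": 1}
    while True:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        if not page:
            break
        yield page
        params["page"] += 1

pipeline = dlt.pipeline("github_issues_demo", destination="duckdb", dataset_name="github")
print(pipeline.run(github_issues()))
```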

Technical architecture for enterprise scale

The dlt design philosophy prioritizes interoperability over platform lock-in. The tool can deploy anywhere from AWS Lambda to existing enterprise data stacks. It integrates with platforms like Snowflake while maintaining the flexibility to work with any destination.

"We always believe that DLT needs to be interoperable and modular," Krzykowski explained. "It can be deployed anywhere. It can be on Lambda. It often becomes part of other people's data infrastructures."

Key technical capabilities include:

  • Automatic Schema Evolution: Handles upstream data changes without breaking pipelines or requiring manual intervention.

  • Incremental Loading: Processes only new or changed records, reducing computational overhead and costs (see the sketch after this list).

  • Platform Agnostic Deployment: Works across cloud providers and on-premises infrastructure without modification.

  • LLM-Optimized Documentation: Structured specifically for AI assistant consumption, enabling rapid problem-solving and template generation.
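
The incremental loading capability can be sketched roughly as follows, using a toy in-memory data set in place of a real API; dlt persists the updated_at cursor between runs so only new or changed rows are loaded.

```python
import dlt

# Stand-in for an upstream system; a real resource would query an API or database.
ROWS = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00Z", "status": "open"},
    {"id": 2, "updated_at": "2024-02-01T00:00:00Z", "status": "closed"},
]

@dlt.resource(primary_key="id", write_disposition="merge")
def tickets(updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z")):
    # dlt stores the highest updated_at seen so far and exposes it as last_value,
    # so each run only yields records that changed since the previous run.
    yield from (r for r in ROWS if r["updated_at"] > updated_at.last_value)

pipeline = dlt.pipeline("incremental_demo", destination="duckdb", dataset_name="support_data")
pipeline.run(tickets())  # a second run loads nothing until upstream rows change
```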

The platform currently supports over 4,600 REST API data sources with continuous expansion driven by user-generated connectors.

Competing against ETL giants with a code-first approach

The data engineering landscape splits into distinct camps, each serving different enterprise needs and developer preferences. 

Traditional ETL platforms like Informatica and Talend dominate enterprise environments with GUI-based tools that require specialized training but offer comprehensive governance features.

Newer SaaS platforms like Fivetran have gained traction by emphasizing pre-built connectors and managed infrastructure, reducing operational overhead but creating vendor dependency.

The open-source dlt library occupies a fundamentally different position as code-first, LLM-native infrastructure that developers can extend and customize. 

"We always believe that DLT needs to be interoperable and modular," Krzykowski explained. "It can be deployed anywhere. It can be on Lambda. It often becomes part of other people's data infrastructures."

This positioning reflects the broader shift toward what the industry calls the composable data stack, where enterprises build infrastructure from interoperable components rather than monolithic platforms.

More importantly, the intersection with AI creates new market dynamics. 

"LLMs aren't replacing data engineers," Krzykowski said. "But they radically expand their reach and productivity."

What this means for enterprise data leaders

For enterprises looking to lead in AI-driven operations, this development represents an opportunity to fundamentally rethink data engineering strategies.

The immediate tactical advantages are clear. Organizations can leverage existing Python developers instead of hiring specialized data engineering teams. Those that adapt their tooling and hiring approaches to this trend may find significant cost and agility advantages over competitors still dependent on traditional, team-intensive data engineering.

The question isn't whether this shift toward democratized data engineering will occur. It's how quickly enterprises adapt to capitalize on it.


