Introduction
What is dlt?
dlt is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets. To get started, install it with:
pip install dlt
We recommend using a clean virtual environment for your experiments! Here are detailed instructions.
Unlike other solutions, with dlt, there's no need to use any backends or containers. Simply import dlt in a Python file or a Jupyter Notebook cell, and create a pipeline to load data into any of the supported destinations. You can load data from any source that produces Python data structures, including APIs, files, databases, and more. dlt also supports building a custom destination, which you can use as reverse ETL.
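For example, here is a minimal sketch (the pipeline, dataset, and table names are illustrative) that loads a plain list of Python dictionaries, including a nested field, into DuckDB:
import dlt

# Any iterable of Python dicts can be loaded; nested dicts are flattened into
# columns, while nested lists become child tables
data = [
    {"id": 1, "name": "Alice", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "Bob", "address": {"city": "Paris", "zip": "75001"}},
]

pipeline = dlt.pipeline(
    pipeline_name="quick_demo", destination="duckdb", dataset_name="demo_data"
)
load_info = pipeline.run(data, table_name="users")
print(load_info)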
The library will create or update tables, infer data types, and handle nested data automatically. Here are a few example pipelines:
- Data from an API
- Data from a dlt Source
- Data from CSV/XLS/Pandas
- Data from a Database
Looking to use a REST API as a source? Explore our new REST API generic source for a declarative way to load data.
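Here is a minimal sketch of what that can look like. The example API, resource names, and pipeline names are illustrative, and the snippet assumes the rest_api_source helper shipped with recent dlt versions:
import dlt
from dlt.sources.rest_api import rest_api_source

# Declarative configuration: a base URL plus the resources (endpoints) to load
source = rest_api_source({
    "client": {"base_url": "https://pokeapi.co/api/v2/"},
    "resources": ["pokemon", "berry"],
})

pipeline = dlt.pipeline(
    pipeline_name="rest_api_example", destination="duckdb", dataset_name="rest_api_data"
)
load_info = pipeline.run(source)
print(load_info)
The example below loads player data from the chess.com API with the requests helper instead: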
import dlt
from dlt.sources.helpers import requests

# Create a dlt pipeline that will load
# chess player data to the DuckDB destination
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline", destination="duckdb", dataset_name="player_data"
)

# Grab some player data from Chess.com API
data = []
for player in ["magnuscarlsen", "rpragchess"]:
    response = requests.get(f"https://api.chess.com/pub/player/{player}")
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load the data
load_info = pipeline.run(data, table_name="player")
Copy this example to a file or a Jupyter Notebook and run it. To make it work with the DuckDB destination, you'll need to install the duckdb dependency (the default dlt installation is really minimal):
pip install "dlt[duckdb]"
Now run your Python file or Notebook cell.
How does it work? The library extracts data from a source (here: the chess.com REST API), inspects its structure to create a schema, then structures, normalizes, and verifies the data, and finally loads it into a destination (here: duckdb, into the database schema player_data and the table player).
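To peek at the loaded data, you can query the DuckDB database directly. This is a sketch, assuming the duckdb destination's default file chess_pipeline.duckdb in the working directory:
import duckdb

# The duckdb destination writes to <pipeline_name>.duckdb by default
conn = duckdb.connect("chess_pipeline.duckdb")
print(conn.sql("SELECT * FROM player_data.player").df())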
Initialize the Slack source with the dlt init command:
dlt init slack duckdb
Create and run a pipeline:
import dlt
from datetime import datetime

from slack import slack_source

pipeline = dlt.pipeline(
    pipeline_name="slack",
    destination="duckdb",
    dataset_name="slack_data"
)

source = slack_source(
    start_date=datetime(2023, 9, 1),
    end_date=datetime(2023, 9, 8),
    page_size=100,
)

load_info = pipeline.run(source)
print(load_info)
Pass anything that you can load with Pandas to dlt:
import dlt
import pandas as pd

owid_disasters_csv = (
    "https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/"
    "Natural%20disasters%20from%201900%20to%202019%20-%20EMDAT%20(2020)/"
    "Natural%20disasters%20from%201900%20to%202019%20-%20EMDAT%20(2020).csv"
)
df = pd.read_csv(owid_disasters_csv)
data = df.to_dict(orient="records")

pipeline = dlt.pipeline(
    pipeline_name="from_csv",
    destination="duckdb",
    dataset_name="mydata",
)
load_info = pipeline.run(data, table_name="natural_disasters")
print(load_info)
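If you want a quick sanity check on how many rows were loaded, the pipeline trace exposes the normalize step metrics. This is a sketch; the attribute names follow recent dlt versions:
# Row counts per table from the last normalize step
print(pipeline.last_trace.last_normalize_info)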
Use our verified SQL database source to sync your databases with warehouses, data lakes, or vector stores.
import dlt
from sqlalchemy import create_engine

# Use any SQL database supported by SQLAlchemy, below we use a public
# MySQL instance to get data.
# NOTE: you'll need to install pymysql with `pip install pymysql`
# NOTE: loading data from the public mysql instance may take several seconds
engine = create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam")

with engine.connect() as conn:
    # Select genome table, stream data in batches of 100 elements
    query = "SELECT * FROM genome LIMIT 1000"
    rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

    pipeline = dlt.pipeline(
        pipeline_name="from_database",
        destination="duckdb",
        dataset_name="genome_data",
    )

    # Convert the rows into dictionaries on the fly with a map function
    load_info = pipeline.run(map(lambda row: dict(row._mapping), rows), table_name="genome")

print(load_info)
Install the pymysql driver:
pip install sqlalchemy pymysql
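The verified source mentioned above wraps this pattern in a few lines. Here is a minimal sketch, assuming the sql_database source shipped with recent dlt versions; the table selection and pipeline names are illustrative:
import dlt
from dlt.sources.sql_database import sql_database

# Reflect the database schema and load only the selected tables
source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam",
    table_names=["genome"],
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example", destination="duckdb", dataset_name="rfam_data"
)
load_info = pipeline.run(source)
print(load_info)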
Why use dlt?
- Automated maintenance - with schema inference, schema evolution, and alerts, and with short declarative code, maintenance becomes simple (see the sketch after this list).
- Run it where Python runs - on Airflow, serverless functions, notebooks. No external APIs, backends, or containers; it scales on micro and large infra alike.
- User-friendly, declarative interface that removes knowledge obstacles for beginners while empowering senior professionals.
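As an illustration of schema evolution, here is a sketch with made-up table and column names: loading a record that carries a new field simply adds a column to the existing table.
import dlt

pipeline = dlt.pipeline(pipeline_name="schema_demo", destination="duckdb", dataset_name="demo")

# First load creates the users table with columns id and name
pipeline.run([{"id": 1, "name": "alice"}], table_name="users")

# Second load carries an extra field; dlt evolves the schema by adding a signup_date column
pipeline.run([{"id": 2, "name": "bob", "signup_date": "2024-01-01"}], table_name="users")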
Getting started with dlt
- Dive into our Getting started guide for a quick intro to the essentials of dlt.
- Play with the Google Colab demo. This is the simplest way to see dlt in action.
- Read the Tutorial to learn how to build a pipeline that loads data from an API.
- Check out the How-to guides for recipes on common use cases for creating, running, and deploying pipelines.
- Ask us on Slack if you have any questions about use cases or the library.