Introduction
What is dlt?
dlt is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets. To get started, install it with:
pip install dlt
We recommend using a clean virtual environment for your experiments! Here are detailed instructions.
Unlike other solutions, with dlt, there's no need to use any backends or containers. Simply import dlt in a Python file or a Jupyter Notebook cell, and create a pipeline to load data into any of the supported destinations. You can load data from any source that produces Python data structures, including APIs, files, databases, and more. dlt also supports building a custom destination, which you can use as reverse ETL.
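For example, here is a minimal sketch (the pipeline, dataset, and table names are illustrative) that loads a plain list of Python dictionaries, including a nested field, into DuckDB:
import dlt

# Any iterable of Python dicts can be loaded; nested dicts are flattened into
# columns, while nested lists become child tables
data = [
    {"id": 1, "name": "Alice", "address": {"city": "Berlin", "zip": "10115"}},
    {"id": 2, "name": "Bob", "address": {"city": "Paris", "zip": "75001"}},
]

pipeline = dlt.pipeline(
    pipeline_name="quick_demo", destination="duckdb", dataset_name="demo_data"
)
load_info = pipeline.run(data, table_name="users")
print(load_info)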
The library will create or update tables, infer data types, and handle nested data automatically. Here are a few example pipelines:
- Data from an API
- Data from a dlt Source
- Data from CSV/XLS/Pandas
- Data from a Database
Looking to use a REST API as a source? Explore our new REST API generic source for a declarative way to load data.
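Here is a minimal sketch of what that can look like. The example API, resource names, and pipeline names are illustrative, and the snippet assumes the rest_api_source helper shipped with recent dlt versions:
import dlt
from dlt.sources.rest_api import rest_api_source

# Declarative configuration: a base URL plus the resources (endpoints) to load
source = rest_api_source({
    "client": {"base_url": "https://pokeapi.co/api/v2/"},
    "resources": ["pokemon", "berry"],
})

pipeline = dlt.pipeline(
    pipeline_name="rest_api_example", destination="duckdb", dataset_name="rest_api_data"
)
load_info = pipeline.run(source)
print(load_info)
The example below loads player data from the chess.com API with the requests helper instead: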
import dlt
from dlt.sources.helpers import requests

# Create a dlt pipeline that will load
# chess player data to the DuckDB destination
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline", destination="duckdb", dataset_name="player_data"
)

# Grab some player data from Chess.com API
data = []
for player in ["magnuscarlsen", "rpragchess"]:
    response = requests.get(f"https://api.chess.com/pub/player/{player}")
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load the data
load_info = pipeline.run(data, table_name="player")
Copy this example to a file or a Jupyter Notebook and run it. To make it work with the DuckDB destination, you'll need to install the duckdb dependency (the default dlt installation is really minimal):
pip install "dlt[duckdb]"
Now run your Python file or Notebook cell.
How does it work? The library extracts data from a source (here: the chess.com REST API), inspects its structure to create a schema, then structures, normalizes, and verifies the data, and finally loads it into a destination (here: duckdb, into the database schema player_data and the table player).
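To peek at the loaded data, you can query the DuckDB database directly. This is a sketch, assuming the duckdb destination's default file chess_pipeline.duckdb in the working directory:
import duckdb

# The duckdb destination writes to <pipeline_name>.duckdb by default
conn = duckdb.connect("chess_pipeline.duckdb")
print(conn.sql("SELECT * FROM player_data.player").df())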
Initialize the Slack source with the dlt init command:
dlt init slack duckdb
Create and run a pipeline:
import dlt
from datetime import datetime

from slack import slack_source

pipeline = dlt.pipeline(
    pipeline_name="slack",
    destination="duckdb",
    dataset_name="slack_data"
)

source = slack_source(
    start_date=datetime(2023, 9, 1),
    end_date=datetime(2023, 9, 8),
    page_size=100,
)

load_info = pipeline.run(source)
print(load_info)
Pass anything that you can load with Pandas to dlt:
import dlt
import pandas as pd

owid_disasters_csv = (
    "https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/"
    "Natural%20disasters%20from%201900%20to%202019%20-%20EMDAT%20(2020)/"
    "Natural%20disasters%20from%201900%20to%202019%20-%20EMDAT%20(2020).csv"
)
df = pd.read_csv(owid_disasters_csv)
data = df.to_dict(orient="records")

pipeline = dlt.pipeline(
    pipeline_name="from_csv",
    destination="duckdb",
    dataset_name="mydata",
)
load_info = pipeline.run(data, table_name="natural_disasters")
print(load_info)
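If you want a quick sanity check on how many rows were loaded, the pipeline trace exposes the normalize step metrics. This is a sketch; the attribute names follow recent dlt versions:
# Row counts per table from the last normalize step
print(pipeline.last_trace.last_normalize_info)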
Use our verified SQL database source to sync your databases with warehouses, data lakes, or vector stores.
import dlt
from sqlalchemy import create_engine

# Use any SQL database supported by SQLAlchemy, below we use a public
# MySQL instance to get data.
# NOTE: you'll need to install pymysql with `pip install pymysql`
# NOTE: loading data from the public mysql instance may take several seconds
engine = create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam")

with engine.connect() as conn:
    # Select genome table, stream data in batches of 100 elements
    query = "SELECT * FROM genome LIMIT 1000"
    rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

    pipeline = dlt.pipeline(
        pipeline_name="from_database",
        destination="duckdb",
        dataset_name="genome_data",
    )

    # Convert the rows into dictionaries on the fly with a map function
    load_info = pipeline.run(map(lambda row: dict(row._mapping), rows), table_name="genome")

print(load_info)
Install the pymysql driver:
pip install sqlalchemy pymysql
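The verified source mentioned above wraps this pattern in a few lines. Here is a minimal sketch, assuming the sql_database source shipped with recent dlt versions; the table selection and pipeline names are illustrative:
import dlt
from dlt.sources.sql_database import sql_database

# Reflect the database schema and load only the selected tables
source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam",
    table_names=["genome"],
)

pipeline = dlt.pipeline(
    pipeline_name="sql_database_example", destination="duckdb", dataset_name="rfam_data"
)
load_info = pipeline.run(source)
print(load_info)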
Why use dlt?
- Automated maintenance - with schema inference, schema evolution, and alerts, and with short declarative code, maintenance becomes simple (see the sketch after this list).
- Run it where Python runs - on Airflow, serverless functions, notebooks. No external APIs, backends, or containers; it scales on micro and large infra alike.
- User-friendly, declarative interface that removes knowledge obstacles for beginners while empowering senior professionals.
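As an illustration of schema evolution, here is a sketch with made-up table and column names: loading a record that carries a new field simply adds a column to the existing table.
import dlt

pipeline = dlt.pipeline(pipeline_name="schema_demo", destination="duckdb", dataset_name="demo")

# First load creates the users table with columns id and name
pipeline.run([{"id": 1, "name": "alice"}], table_name="users")

# Second load carries an extra field; dlt evolves the schema by adding a signup_date column
pipeline.run([{"id": 2, "name": "bob", "signup_date": "2024-01-01"}], table_name="users")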
Getting started with dlt
- Dive into our Getting started guide for a quick intro to the essentials of dlt.
- Play with the Google Colab demo. This is the simplest way to see dlt in action.
- Read the Tutorial to learn how to build a pipeline that loads data from an API.
- Check out the How-to guides for recipes on common use cases for creating, running, and deploying pipelines.
- Ask us on Slack if you have any questions about use cases or the library.