FederIQ

Federated SQL over any source, in a single Rust binary.

FederIQ is an open-source federated query engine built on top of DuckDB. One command attaches a Postgres database, a Parquet directory on S3, a CSV on disk, and an HTTP JSON API — and lets you join them in plain SQL.

SELECT u.name, SUM(o.total) AS revenue
FROM postgres_users.public.users u
JOIN events.orders o ON o.user_id = u.id
GROUP BY u.name
ORDER BY revenue DESC;

Why not Trino?

Trino is the incumbent. It's also ~4 GB of JVM, a cluster to deploy, and a DBA to operate. FederIQ is a single ~20 MB binary:

| | Trino | FederIQ |
|---|---|---|
| Deploy unit | JVM cluster | Static binary |
| Minimum memory | ~4 GB | ~100 MB |
| First query latency | ~30 s (cluster warm) | ~200 ms (cold) |
| Embed in Python/Node | No | Yes (pip/npm) |
| Policy engine | Add-on (Ranger) | Built-in |
| License | Apache 2.0 | Apache 2.0 |

Why not DuckDB alone?

DuckDB is incredible but single-node and single-process. FederIQ adds:

  • Catalog-driven federation — declare sources in YAML, query from any client.
  • Policy engine — column masking, row filters, region pinning enforced at query-rewrite time.
  • HTTP service mode — federiq serve with bearer auth + TLS + Prometheus metrics.
  • Cache layer — memory, disk, and Redis backends with hooks for a predictive prefetcher.
  • First-party Python and TypeScript clients.

License

Apache 2.0. The core will always be Apache 2.0 — hosted and enterprise features may be licensed separately in the future.

Install

Pre-built binary

Grab a release archive from the releases page. Extract and put federiq on your PATH.

Cargo

cargo install federiq-cli

Docker

docker run --rm -p 7878:7878 \
  -v "$PWD/federiq.yaml:/var/federiq/federiq.yaml:ro" \
  ghcr.io/dyber-pqc/federiq:latest \
  serve --addr 0.0.0.0:7878 --token "$FEDERIQ_TOKEN"

From source

git clone https://github.com/dyber-pqc/federiq.git
cd federiq
cargo build --release
./target/release/federiq --help

First build takes a few minutes because DuckDB is compiled from source.

Python

pip install federiq
pip install 'federiq[pandas]'   # optional pandas helper

TypeScript / Node

npm install federiq

Your first federated query

Create a catalog file:

# federiq.yaml
sources:
  - name: events
    type: parquet
    path: ./data/events.parquet

  - name: users
    type: csv
    path: ./data/users.csv

Run a query that joins the two:

federiq query "
  SELECT u.name, COUNT(e.id) AS events
  FROM users u LEFT JOIN events e ON e.user_id = u.id
  GROUP BY u.name
  ORDER BY events DESC
  LIMIT 10
"

That's it. No cluster. No ETL. The Parquet file and the CSV stay exactly where they are; FederIQ federates the query across both.

Add a Postgres database in two more lines:

  - name: sales
    type: postgres
    dsn: "host=localhost port=5432 dbname=sales user=postgres password=postgres"

Then:

federiq query "
  SELECT u.name, SUM(o.total) AS revenue
  FROM sales.public.orders o
  JOIN users u ON u.id = o.user_id
  GROUP BY u.name
"

Next steps

The catalog file

federiq.yaml declares your data sources and (optionally) your policies. Initialize one with:

federiq init

Structure

sources:
  - name: <alias>
    type: <source-type>
    # ... type-specific keys ...

policies:
  - name: <policy-name>
    applies_to: <source-alias or "*">
    # ... policy keys ...

Source types

| type | Required keys | Notes |
|---|---|---|
| postgres | dsn | DuckDB postgres_scanner extension |
| mysql | dsn | DuckDB mysql_scanner extension |
| sqlite | path | Local SQLite database file |
| parquet | path | Single file, directory, or glob |
| csv | path | read_csv_auto for schema inference |
| http | url | JSON response, default feature |
| mongo | uri, database, collection | Planned; blocked on upstream |
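The required-keys column above lends itself to a mechanical check. A minimal sketch in Python of what such a validator could look like — the `REQUIRED` mapping mirrors the table, but the function itself is hypothetical, not part of FederIQ (use `federiq config validate` for the real thing):

```python
# Required keys per source type, mirroring the table above.
REQUIRED = {
    "postgres": {"dsn"},
    "mysql": {"dsn"},
    "sqlite": {"path"},
    "parquet": {"path"},
    "csv": {"path"},
    "http": {"url"},
}

def validate_source(source: dict) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    stype = source.get("type")
    if stype not in REQUIRED:
        return [f"unknown source type: {stype!r}"]
    problems = []
    if not source.get("name"):
        problems.append("missing 'name'")
    missing = REQUIRED[stype] - source.keys()
    problems.extend(f"missing '{key}' (required for {stype})" for key in sorted(missing))
    return problems
```

For example, `validate_source({"name": "events", "type": "parquet"})` reports the missing `path` key.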

Managing sources from the CLI

federiq sources list
federiq sources add --name events --type parquet --path ./data/events.parquet
federiq sources remove events
federiq sources test events
federiq sources describe events

Inspecting schemas

federiq schema list                   # every source
federiq schema list events            # one source
federiq schema describe events        # columns of a single table or view

Validating

federiq config validate
federiq doctor

SQL connectors

FederIQ attaches Postgres, MySQL, and SQLite natively via DuckDB's postgres_scanner, mysql_scanner, and sqlite_scanner extensions — meaning queries push down to the remote engine wherever possible.

Postgres

- name: sales
  type: postgres
  dsn: "host=db.internal port=5432 dbname=sales user=app password=secret"

Reference tables with three-part names:

SELECT * FROM sales.public.orders LIMIT 10;

MySQL

- name: catalog
  type: mysql
  dsn: "host=db.internal port=3306 user=app password=secret database=catalog"

SQLite

- name: analytics
  type: sqlite
  path: "./data/analytics.db"

Read-only by design

FederIQ attaches every SQL source READ_ONLY. Writes go through the source's own tools.

File connectors

Parquet

- name: events
  type: parquet
  path: "./data/events.parquet"       # single file
# path: "./data/events/*.parquet"     # glob
# path: "s3://bucket/events/*.parquet" # S3 via DuckDB's httpfs extension

CSV

- name: users
  type: csv
  path: "./data/users.csv"

read_csv_auto handles header detection, delimiter inference, and type inference. For anything non-trivial, convert to Parquet first — it's faster, smaller, and carries schema.

S3 / HTTP

Parquet sources with an S3 or HTTPS path use DuckDB's httpfs extension. Set credentials via the usual AWS env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) or the DuckDB secrets mechanism.

HTTP / REST connector

Materializes a JSON HTTP response as a DuckDB view.

- name: posts
  type: http
  url: "https://api.example.com/posts"
  # optional
  headers:
    Accept: "application/json"
  auth_bearer: "$TOKEN"
  method: GET

Requirements:

  • Response must be JSON.
  • Shape must be either a top-level array of objects, or an object with a single array property under one of data, results, items.
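The two accepted shapes can be normalized in a few lines. An illustrative sketch (not FederIQ's actual code) of that normalization:

```python
def extract_rows(payload):
    """Normalize an HTTP JSON payload into a list of row dicts.

    Accepts either a top-level array of objects, or an object with
    an array under one of the conventional wrapper keys.
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in ("data", "results", "items"):
            if isinstance(payload.get(key), list):
                return payload[key]
    raise ValueError("unsupported JSON shape: expected an array of objects")
```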

Caveats

  • The connector fetches the full response at attach time. Huge responses will hurt. Add pagination support in your own wrapper or use a proper database.
  • No streaming yet. Planned.
  • No response-change detection. Re-attach to refresh.

For anything beyond trivial REST feeds, write your own connector — see connector SDK.

Apache Iceberg connector

FederIQ reads Iceberg tables through DuckDB's native iceberg extension. Declare the source once, query it like any other table.

sources:
  - name: bronze
    type: iceberg
    path: "s3://warehouse/bronze/events"
    # optional — tolerate data files that moved relative to the manifest
    allow_moved_paths: false

Then:

SELECT event_type, COUNT(*)
FROM bronze
GROUP BY event_type;

How it works

FederIQ emits:

INSTALL iceberg;
LOAD iceberg;
CREATE OR REPLACE VIEW bronze AS
  SELECT * FROM iceberg_scan('s3://warehouse/bronze/events');

on attach. DuckDB handles manifest parsing and data file access.

Credentials

For S3-backed tables, set AWS credentials via environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) or via DuckDB's secrets manager.

For local filesystem paths, no credentials are needed.

Limitations

DuckDB's Iceberg extension is read-only and tracks a subset of the spec:

  • v2 tables supported; older format versions best-effort.
  • No writes (add partitions, schema evolution, etc.) — use your engine of choice (Spark, Trino, pyiceberg) to author, FederIQ to query.
  • Time travel (FOR TIMESTAMP AS OF ...) is not yet surfaced through FederIQ's catalog — query DuckDB directly if you need it.
  • Expect slower first-query latency than Parquet — Iceberg reads the metadata tree on every attach.

Roadmap: more connectors

Planned connectors, roughly in order of priority:

| Connector | Approach | Status |
|---|---|---|
| MongoDB | mongodb crate → NDJSON tempfile → view | Upstream-blocked on Rust 1.93 compat |
| Apache Iceberg | DuckDB iceberg extension | Shipped (see iceberg) |
| Snowflake | JDBC over snowflake-jdbc bridge or REST | Planned |
| BigQuery | BigQuery Storage API | Planned |
| Elasticsearch | REST search with cursor | Planned |
| Kafka | rdkafka + snapshot materialization | Planned |
| Delta Lake | delta-rs | Planned |
| DynamoDB | AWS SDK scan | Planned |

Open an issue on the repo if your source isn't on the list — we pick based on user demand.

Writing your own connector

See the CONNECTORS.md guide in the repo for the full walkthrough, but here's the shape:

use federiq_core::{AttachCtx, Connector};

pub struct MyConnector {
    /* your config */
}

impl Connector for MyConnector {
    fn kind(&self) -> &'static str { "mysource" }

    fn attach(&self, ctx: &AttachCtx<'_>) -> anyhow::Result<()> {
        // 1. Fetch rows from the remote source.
        // 2. Write them as NDJSON to ctx.tmpdir.
        // 3. CREATE OR REPLACE VIEW <ctx.alias> AS SELECT * FROM read_json_auto('...');
        Ok(())
    }
}

A dynamic plugin system (WASM components loaded from ~/.federiq/plugins) is on the roadmap so third parties can ship connectors without recompiling FederIQ.

Policy model

A FederIQ policy is a YAML record that attaches to a source (or every source via "*") and declares how data must be handled. Policies ship with four capabilities:

policies:
  - name: mask_pii
    applies_to: users
    mask_columns:
      email: "'***@***.com'"
      ssn: "NULL"

  - name: analysts_see_active_only
    applies_to: events
    when_role: analyst            # only applies if FEDERIQ_ROLE=analyst
    row_filter: "status = 'active'"

  - name: pin_to_us_east
    applies_to: "*"
    require_region: us-east-1     # refuses to run if FEDERIQ_REGION != us-east-1

Fields

| Field | Type | Purpose |
|---|---|---|
| name | string | Human-readable identifier |
| applies_to | string or "*" | Source name to match |
| when_role | string (optional) | Only applies when caller role matches |
| require_region | string (optional) | Refuses to run outside this region |
| mask_columns | map of col: expr | Replace the column with the SQL expression |
| row_filter | string (optional) | SQL WHERE predicate AND'd into every SELECT |

Context

PolicyContext is pulled from environment variables at query time:

  • FEDERIQ_ROLE — caller role (e.g., analyst, admin)
  • FEDERIQ_REGION — data residency region (e.g., us-east-1)
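The role and region semantics above can be sketched in a few lines of Python. This is an illustration of the contract (when_role gates silently; require_region fails loudly), not FederIQ's implementation:

```python
import os

def policy_applies(policy: dict, env=os.environ) -> bool:
    """A when_role mismatch simply deactivates the policy for this caller."""
    if "when_role" in policy and policy["when_role"] != env.get("FEDERIQ_ROLE"):
        return False
    return True

def check_region(policy: dict, env=os.environ) -> None:
    """A require_region mismatch is an error, not a no-op: the query is refused."""
    required = policy.get("require_region")
    region = env.get("FEDERIQ_REGION")
    if required is not None and region != required:
        raise RuntimeError(
            f"policy {policy['name']!r} requires region={required!r} "
            f"but current context has {region!r}"
        )
```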

What happens at query time

See enforcement for the rewrite details.

Enforcement

When federiq query or federiq serve sees a SQL statement against a table with active policies, it parses the SQL with sqlparser-rs, applies each policy as an AST transformation, and emits the rewritten SQL for DuckDB to execute.

Row filters

-- User input
SELECT id FROM events WHERE user_id = 42;

-- After policy `row_filter: "status = 'active'"`
SELECT id FROM events WHERE user_id = 42 AND status = 'active';

Column masking

-- User input
SELECT name, email FROM users;

-- After policy `mask_columns: { email: "'***@***.com'" }`
SELECT name, '***@***.com' AS email FROM users;

The masking expression is valid SQL. The column is aliased back to its original name so callers don't see a schema change.
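As a toy illustration of the alias-back behavior — the real engine transforms the sqlparser-rs AST, while this hypothetical helper only handles a projection list of bare column names:

```python
def mask_projection(columns: list[str], masks: dict[str, str]) -> str:
    """Render a projection list, substituting masked columns with their
    expression aliased back to the original name."""
    parts = []
    for col in columns:
        if col in masks:
            # The caller still sees a column named `col` in the result schema.
            parts.append(f"{masks[col]} AS {col}")
        else:
            parts.append(col)
    return ", ".join(parts)
```

For instance, `mask_projection(["name", "email"], {"email": "'***@***.com'"})` yields the rewritten projection shown above.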

SELECT * limitation

Star projection cannot be masked without schema expansion. For now, FederIQ returns an error:

SELECT * cannot be enforced against policy 'mask_pii'
(masks columns on 'users'); list columns explicitly

List the columns explicitly, or upgrade when the schema-aware rewriter ships (tracked on the roadmap).

Region pinning

Region mismatches fail the rewrite:

policy 'pin_to_us_east' requires region='us-east-1'
but current context has eu-west-1

Static checking

Before running a query, preview violations:

federiq policy check "SELECT email, ssn FROM users"

Disabling enforcement

Policies auto-apply when the catalog declares any. To opt out on a per-call basis:

federiq query --enforce-policy=false "SELECT ..."

Use sparingly — this bypass is primarily for development and debugging.

Running federiq serve

federiq serve exposes the engine as an HTTP service.

Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | /healthz | Liveness probe — always 200 |
| GET | /metrics | Prometheus exposition |
| GET | /sources | List attached source names |
| POST | /query | Execute SQL — {"sql": "..."} → JSON rows |

/healthz and /metrics are always reachable. /sources and /query require the bearer token (see auth).

Running it

federiq serve --addr 0.0.0.0:7878 --token "$FEDERIQ_TOKEN" \
  --tls-cert /etc/federiq/cert.pem --tls-key /etc/federiq/key.pem

Without --tls-cert the server listens on plain HTTP and prints a loud warning — don't expose an unencrypted port to the internet.

Making requests

curl -H "authorization: Bearer $FEDERIQ_TOKEN" \
  -H "content-type: application/json" \
  -d '{"sql":"SELECT 1 AS n"}' \
  https://localhost:7878/query

Configuring via env

| Variable | Effect |
|---|---|
| FEDERIQ_SERVER_TOKEN | Bearer token for inbound requests |
| FEDERIQ_LOG_FORMAT | json switches tracing to structured logs |
| FEDERIQ_ROLE | Caller role for policies |
| FEDERIQ_REGION | Caller region for policies |

Authentication and TLS

Bearer tokens

federiq serve enforces a shared-secret bearer token on every request to /sources and /query.

federiq serve --token "$(openssl rand -hex 32)"

Or via environment:

export FEDERIQ_SERVER_TOKEN="..."
federiq serve

Clients send Authorization: Bearer <token>:

new FederIQ("https://federiq.example.com", { token: process.env.FEDERIQ_TOKEN })

Token comparison is constant-time to avoid timing oracles.
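In Python terms, a constant-time check looks like this — hmac.compare_digest is the stdlib primitive; FederIQ's server does the equivalent in Rust:

```python
import hmac

def token_ok(presented: str, expected: str) -> bool:
    """Compare bearer tokens without leaking prefix-match timing.

    A naive `presented == expected` short-circuits on the first differing
    byte, letting an attacker recover the token byte by byte from timing.
    """
    return hmac.compare_digest(presented.encode(), expected.encode())
```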

TLS

Provide PEM-encoded cert and key:

federiq serve \
  --tls-cert /etc/federiq/cert.pem \
  --tls-key  /etc/federiq/key.pem

Generate a self-signed cert for development:

openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout key.pem -out cert.pem -subj "/CN=localhost"

For production, use certs from Let's Encrypt or your internal CA.

What's missing (tracked)

  • Multiple tokens with distinct scopes / roles
  • mTLS (client certs)
  • OIDC / OAuth integration
  • Rate limiting

Open an issue if any of these are blocking your deployment.

Observability and metrics

Prometheus metrics

federiq serve exposes metrics at /metrics in Prometheus text format.

| Metric | Kind | Labels |
|---|---|---|
| federiq_queries_total | counter | |
| federiq_query_errors_total | counter | |
| federiq_query_duration_seconds | histogram | |
| federiq_http_requests_total | counter | route |
| federiq_auth_failures_total | counter | |

Scrape config:

scrape_configs:
  - job_name: federiq
    static_configs:
      - targets: ["federiq.internal:7878"]

Structured logs

export FEDERIQ_LOG_FORMAT=json
federiq serve

Emits one JSON object per log event (to stdout), with event fields flattened so they're indexable by downstream log shippers.

Log levels are controlled by RUST_LOG:

RUST_LOG=federiq=debug,federiq_core=debug federiq serve

Per-source attach timing

Every source's ATTACH latency is tracked by the engine:

federiq query "SELECT 1" --timing
per-source attach timing:
  events               parquet  2.14ms
  sales                postgres 84.2ms

And from Python:

import json
print(json.dumps(engine.attach_timings(), indent=2))

EXPLAIN ANALYZE

DuckDB's EXPLAIN ANALYZE gives operator-level timing that naturally includes federated source scan costs:

federiq query "SELECT COUNT(*) FROM events" --explain-analyze

Caching

FederIQ has an optional result cache keyed by a content-addressed hash of the query SQL. Three backends ship in the default build:

  • memory — in-process LRU with TTL (fastest, evicted on restart)
  • disk — file-backed store sharded by hash prefix (durable, shareable across processes)
  • redis — enabled with --features cache-redis (shared across nodes)
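A content-addressed key is just a stable hash of the SQL text. A plausible sketch of key derivation and prefix sharding — the exact normalization and layout FederIQ uses are implementation details, so treat this as illustrative:

```python
import hashlib

def cache_key(sql: str) -> str:
    """Derive a content-addressed cache key from the query text.

    Whitespace is collapsed so trivially reformatted queries share a key;
    case is preserved because quoted identifiers are case-sensitive.
    """
    normalized = " ".join(sql.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def shard_path(key: str) -> str:
    """Disk layout sketch: shard files by the first two hex chars of the key."""
    return f"{key[:2]}/{key}.json"
```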

From the CLI

federiq query --cache disk --cache-dir ./.cache "SELECT COUNT(*) FROM events"
federiq cache status --backend disk --dir ./.cache
federiq cache clear  --backend disk --dir ./.cache

From Rust

use federiq_core::{DiskCache, Engine, FreqPredictor};
use std::sync::Arc;

let engine = Engine::new()?
    .with_cache(Arc::new(DiskCache::new("./.cache")?))
    .with_predictor(Arc::new(FreqPredictor::new(10)));

engine.attach_all(&catalog.sources)?;

let rows = engine.query_cached("SELECT COUNT(*) FROM events")?;

Predictors

Predictor::observe(sql, hit) gets called on every cache access. The default FreqPredictor counts queries; predict() returns the top-k most-frequent for background warmup.

engine.warmup()?;  // runs predictor.predict() and caches the results
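The default predictor's behavior fits in a few lines of Python — an illustration of the trait's contract, not the Rust implementation:

```python
from collections import Counter

class FreqPredictor:
    """Count observed queries; predict the top-k most frequent for warmup."""

    def __init__(self, k: int):
        self.k = k
        self.counts = Counter()

    def observe(self, sql: str, hit: bool) -> None:
        # Every cache access counts, hit or miss.
        self.counts[sql] += 1

    def predict(self) -> list[str]:
        return [sql for sql, _ in self.counts.most_common(self.k)]
```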

A smarter ML predictor is on the roadmap — implement the Predictor trait to plug in your own.

What's not cached yet

  • Queries whose results depend on the calling role/region (planned for v0.5.1 when PolicyContext is plumbed through query_cached).
  • Negative results (errors are not cached).
  • Schema introspection.

Python client

pip install federiq
pip install 'federiq[pandas]'   # optional pandas helper

Quickstart

import federiq

engine = federiq.connect("./federiq.yaml")

rows = engine.query("SELECT 1 AS n")
print(rows)  # [{'n': 1}]

df = federiq.query_df(engine, "SELECT COUNT(*) AS n FROM events")
print(df)

API

class Engine:
    @staticmethod
    def from_catalog(path: str) -> "Engine": ...
    def attach(self, source: dict) -> None: ...
    def query(self, sql: str) -> list[dict]: ...
    def explain(self, sql: str) -> str: ...
    def explain_analyze(self, sql: str) -> str: ...
    def attach_timings(self) -> list[dict]: ...
    def duckdb_version(self) -> str: ...

Jupyter

See examples/jupyter/federiq_intro.ipynb for an interactive walkthrough.

TypeScript client

npm install federiq

The TypeScript client is a thin HTTP wrapper around federiq serve.

Quickstart

import { FederIQ } from "federiq";

const fq = new FederIQ("https://federiq.example.com", {
  token: process.env.FEDERIQ_TOKEN,
});

const { rows } = await fq.query("SELECT 1 AS n");
console.log(rows);

API

class FederIQ {
  constructor(baseUrl: string, opts?: FederIQOptions);
  query(sql: string): Promise<{ rows: unknown[] }>;
  sources(): Promise<string[]>;
  healthz(): Promise<boolean>;
}

interface FederIQOptions {
  token?: string;
  fetch?: typeof fetch;
  timeoutMs?: number;
  headers?: Record<string, string>;
}

Works in any runtime with global fetch — Node 18+, Bun, Deno, and modern browsers.

CLI reference

federiq [--config PATH] [-v] <subcommand>

Every subcommand accepts --help for detail. A condensed reference:

| Subcommand | Purpose |
|---|---|
| init | Scaffold a new federiq.yaml |
| sources list | List declared sources |
| sources add | Add a source |
| sources remove | Remove a source |
| sources test | Verify one or all sources are reachable |
| sources describe | Show source detail and generated ATTACH SQL |
| schema list | List tables in a source |
| schema describe | Show column types for a table |
| query | Execute SQL (see flags below) |
| shell | Interactive REPL |
| serve | HTTP server with auth + TLS + metrics |
| config show | Print resolved catalog YAML |
| config validate | Lint catalog sources |
| policy list | List policies |
| policy show | Show one policy |
| policy validate | Cross-check policies against sources |
| policy check | Static violation scan for a SQL query |
| cache status | Cache statistics |
| cache clear | Empty the cache |
| doctor | Environment diagnostics |
| completions <shell> | Emit shell completion scripts |

federiq query flags

| Flag | Meaning |
|---|---|
| -f, --file PATH | Read SQL from a file |
| -o, --output PATH | Write results to a file |
| --format FMT | table (default), csv, json, ndjson, parquet |
| --explain | Print DuckDB plan |
| --explain-analyze | Plan plus per-operator timing |
| --timing | Print total and per-source timing |
| --cache MODE | off / memory / disk |
| --cache-dir PATH | Disk cache location |
| --enforce-policy BOOL | Default true when catalog has policies |

Configuration reference

Environment variables

| Variable | Where | Purpose |
|---|---|---|
| FEDERIQ_SERVER_TOKEN | federiq serve | Bearer token for inbound requests |
| FEDERIQ_LOG_FORMAT | all commands | json switches logs to JSON |
| FEDERIQ_ROLE | all commands | Caller role for policies |
| FEDERIQ_REGION | all commands | Caller region for policies |
| RUST_LOG | all commands | tracing env-filter expression |

Catalog reference

See catalog and policy pages for the authoritative schema.

File locations

  • Default catalog: ./federiq.yaml
  • Default disk cache: ./.federiq-cache/
  • Shell history: ./.federiq_history

All configurable via flags — nothing writes to $HOME without your involvement.

FAQ

How is this different from Trino?

Trino is a distributed query engine. FederIQ is a single binary built on DuckDB. For small-to-medium deployments you get most of Trino's federation story without the JVM ops burden. For true petabyte multi-tenant workloads, Trino is still the right answer.

Can I use FederIQ in production?

Today: probably not without caveats. Tier-1 features (auth, TLS, policy enforcement, observability) are shipped but have only been used by early-alpha testers. Wait for v1.0 for load-bearing workloads.

Does it modify my data?

No. Every source is attached READ_ONLY. FederIQ only reads.

What SQL dialect?

DuckDB's, which is broadly Postgres-compatible. Functions like date_trunc, epoch, list_aggregate, and DuckDB's extensions are all available.

Why is the first build slow?

DuckDB's C++ core compiles from source via the bundled feature. First build takes 3–6 minutes. Subsequent incremental builds are fast.

How do I pin data to a region?

Use the require_region field of a policy. See enforcement.

Can I cache results across processes / nodes?

Yes — use the disk cache (shared filesystem) or the Redis cache (cargo install federiq-cli --features cache-redis).

What about joins that span sources?

DuckDB plans them natively. For expensive cross-source joins consider caching the hot side or materializing to Parquet. A cost-aware planner is on the roadmap.

Is there a managed cloud?

Planned. Core will stay Apache 2.0 forever; hosted and enterprise features will be separately licensed.

FederIQ vs. Trino vs. DuckDB

A pragmatic feature comparison. None of these tools is strictly better — they solve different pain points.

| Dimension | DuckDB | FederIQ | Trino |
|---|---|---|---|
| Deploy unit | Single binary | Single binary | JVM cluster |
| Minimum memory | ~50 MB | ~100 MB | ~4 GB |
| Cold start | ~50 ms | ~200 ms | ~30 s |
| Multi-source federation | Limited | Yes | Yes |
| Postgres / MySQL / SQLite | Via extensions | Native catalog | Native |
| Parquet / CSV / JSON | Native | Via DuckDB | Via Hive connector |
| HTTP / REST | No | Yes (default feature) | No |
| Mongo | No | Planned | Yes |
| Iceberg | Via extension | Yes (DuckDB extension) | Yes |
| Declarative policy engine | No | Yes (YAML, SQL rewrite) | Via Ranger add-on |
| Column masking / row filters | Manual views | Built-in, catalog-declared | Via Ranger |
| HTTP service mode | No | Yes (auth + TLS + metrics) | Yes |
| Cache layer (memory/disk/Redis) | Manual | Built-in | Materialized views |
| Python / TS clients | Yes / No | Yes / Yes | Yes / community |
| License | MIT | Apache 2.0 | Apache 2.0 |

When to pick each

  • DuckDB alone: single-node analytics, embedded in a Python/R/Node process, local Parquet/CSV processing.
  • FederIQ: you need federated queries across 2–10 sources, a policy engine, or an HTTP service, but don't want the operational cost of Trino. Works great embedded or as a sidecar.
  • Trino: petabyte-scale, hundreds of concurrent users, existing Hive metastore, or you already run a JVM fleet.

Benchmarks

Rigorous TPC-H numbers against Trino and Dremio are on the roadmap. When we publish, the benchmark harness and raw numbers will live in the repo under bench/ so you can reproduce.