FederIQ
Federated SQL over any source, in a single Rust binary.
FederIQ is an open-source federated query engine built on top of DuckDB. One command attaches a Postgres database, a Parquet directory on S3, a CSV on disk, and an HTTP JSON API — and lets you join them in plain SQL.
SELECT u.name, SUM(o.total) AS revenue
FROM postgres_users.public.users u
JOIN events.orders o ON o.user_id = u.id
GROUP BY u.name
ORDER BY revenue DESC;
Why not Trino?
Trino is the incumbent. It's also ~4 GB of JVM, a cluster to deploy, and a DBA to operate. FederIQ is a single ~20 MB binary:
| | Trino | FederIQ |
|---|---|---|
| Deploy unit | JVM cluster | Static binary |
| Minimum memory | ~4 GB | ~100 MB |
| First query latency | ~30s (cluster warm) | ~200 ms (cold) |
| Embed in Python/Node | No | Yes (pip/npm) |
| Policy engine | Add-on (Ranger) | Built-in |
| License | Apache 2.0 | Apache 2.0 |
Why not DuckDB alone?
DuckDB is incredible but single-node and single-process. FederIQ adds:
- Catalog-driven federation — declare sources in YAML, query from any client.
- Policy engine — column masking, row filters, region pinning enforced at query-rewrite time.
- HTTP service mode — `federiq serve` with bearer auth + TLS + Prometheus metrics.
- Cache layer — memory, disk, and Redis backends with hooks for a predictive prefetcher.
- First-party Python and TypeScript clients.
License
Apache 2.0. The core will always be Apache 2.0 — hosted and enterprise features may be licensed separately in the future.
Install
Pre-built binary
Grab a release archive from the releases page, extract it, and put `federiq` on your PATH.
Cargo
cargo install federiq-cli
Docker
docker run --rm -p 7878:7878 \
-v "$PWD/federiq.yaml:/var/federiq/federiq.yaml:ro" \
ghcr.io/dyber-pqc/federiq:latest \
serve --addr 0.0.0.0:7878 --token "$FEDERIQ_TOKEN"
From source
git clone https://github.com/dyber-pqc/federiq.git
cd federiq
cargo build --release
./target/release/federiq --help
First build takes a few minutes because DuckDB is compiled from source.
Python
pip install federiq
pip install 'federiq[pandas]' # optional pandas helper
TypeScript / Node
npm install federiq
Your first federated query
Create a catalog file:
# federiq.yaml
sources:
- name: events
type: parquet
path: ./data/events.parquet
- name: users
type: csv
path: ./data/users.csv
Run a query that joins the two:
federiq query "
SELECT u.name, COUNT(e.id) AS events
FROM users u LEFT JOIN events e ON e.user_id = u.id
GROUP BY u.name
ORDER BY events DESC
LIMIT 10
"
That's it. No cluster. No ETL. The Parquet file and the CSV stay exactly where they are; FederIQ federates the query across both.
Add a Postgres database in two more lines:
- name: sales
type: postgres
dsn: "host=localhost port=5432 dbname=sales user=postgres password=postgres"
Then:
federiq query "
SELECT u.name, SUM(o.total) AS revenue
FROM sales.public.orders o
JOIN users u ON u.id = o.user_id
GROUP BY u.name
"
Next steps
- The catalog file — all source types and options.
- Policies — declare column masks and row filters.
- Running `federiq serve` — expose the engine over HTTP.
The catalog file
federiq.yaml declares your data sources and (optionally) your
policies. Initialize one with:
federiq init
Structure
sources:
- name: <alias>
type: <source-type>
# ... type-specific keys ...
policies:
- name: <policy-name>
applies_to: <source-alias or "*">
# ... policy keys ...
Source types
| `type` | Required keys | Notes |
|---|---|---|
| postgres | dsn | DuckDB postgres_scanner extension |
| mysql | dsn | DuckDB mysql_scanner extension |
| sqlite | path | Local SQLite database file |
| parquet | path | Single file, directory, or glob |
| csv | path | read_csv_auto for schema inference |
| http | url | JSON response, default feature |
| mongo | uri, database, collection | Planned; blocked on upstream |
Managing sources from the CLI
federiq sources list
federiq sources add --name events --type parquet --path ./data/events.parquet
federiq sources remove events
federiq sources test events
federiq sources describe events
Inspecting schemas
federiq schema list # every source
federiq schema list events # one source
federiq schema describe events # columns of a single table or view
Validating
federiq config validate
federiq doctor
SQL connectors
FederIQ attaches Postgres, MySQL, and SQLite natively via DuckDB's
postgres_scanner, mysql_scanner, and sqlite_scanner extensions —
meaning queries push down to the remote engine wherever possible.
Postgres
- name: sales
type: postgres
dsn: "host=db.internal port=5432 dbname=sales user=app password=secret"
Reference tables with three-part names:
SELECT * FROM sales.public.orders LIMIT 10;
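Under the hood, an attach like this plausibly maps onto DuckDB statements of roughly the following shape. This is a sketch, not the exact SQL FederIQ emits — `federiq sources describe sales` prints the real generated ATTACH statement:

```sql
INSTALL postgres;
LOAD postgres;
-- READ_ONLY mirrors FederIQ's read-only-by-design attach
ATTACH 'host=db.internal port=5432 dbname=sales user=app password=secret'
  AS sales (TYPE postgres, READ_ONLY);
```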
MySQL
- name: catalog
type: mysql
dsn: "host=db.internal port=3306 user=app password=secret database=catalog"
SQLite
- name: analytics
type: sqlite
path: "./data/analytics.db"
Read-only by design
FederIQ attaches every SQL source READ_ONLY. Writes go through the
source's own tools.
File connectors
Parquet
- name: events
type: parquet
path: "./data/events.parquet" # single file
# path: "./data/events/*.parquet" # glob
# path: "s3://bucket/events/*.parquet" # S3 via DuckDB's httpfs extension
CSV
- name: users
type: csv
path: "./data/users.csv"
read_csv_auto handles header detection, delimiter inference, and type
inference. For anything non-trivial, convert to Parquet first — it's
faster, smaller, and carries schema.
S3 / HTTP
Parquet sources with an S3 or HTTPS path go through DuckDB's httpfs
extension. Set credentials via the usual AWS env vars (AWS_ACCESS_KEY_ID,
AWS_SECRET_ACCESS_KEY, AWS_REGION) or the DuckDB secrets mechanism.
HTTP / REST connector
Materializes a JSON HTTP response as a DuckDB view.
- name: posts
type: http
url: "https://api.example.com/posts"
# optional
headers:
Accept: "application/json"
auth_bearer: "$TOKEN"
method: GET
Requirements:
- Response must be JSON.
- Shape must be either a top-level array of objects, or an object with a single array property under one of `data`, `results`, `items`.
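The accepted-shape rule fits in a few lines. This is an illustrative Python sketch of the rule, not FederIQ's actual implementation:

```python
def extract_rows(payload):
    """Return the list of row objects from a JSON payload matching
    the http connector's accepted shapes, or raise ValueError."""
    # Shape 1: top-level array of objects.
    if isinstance(payload, list):
        return payload
    # Shape 2: object with a single array under data / results / items.
    if isinstance(payload, dict):
        for key in ("data", "results", "items"):
            if isinstance(payload.get(key), list):
                return payload[key]
    raise ValueError("unsupported JSON shape for http source")
```

For example, `extract_rows({"results": [{"id": 1}]})` yields `[{"id": 1}]`.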
Caveats
- The connector fetches the full response at attach time. Huge responses will hurt. Add pagination support in your own wrapper or use a proper database.
- No streaming yet. Planned.
- No response-change detection. Re-attach to refresh.
For anything beyond trivial REST feeds, write your own connector — see connector SDK.
Apache Iceberg connector
FederIQ reads Iceberg tables through DuckDB's native iceberg extension.
Declare the source once, query it like any other table.
sources:
- name: bronze
type: iceberg
path: "s3://warehouse/bronze/events"
# optional — tolerate data files that moved relative to the manifest
allow_moved_paths: false
Then:
SELECT event_type, COUNT(*)
FROM bronze
GROUP BY event_type;
How it works
FederIQ emits:
INSTALL iceberg;
LOAD iceberg;
CREATE OR REPLACE VIEW bronze AS
SELECT * FROM iceberg_scan('s3://warehouse/bronze/events');
on attach. DuckDB handles manifest parsing and data file access.
Credentials
For S3-backed tables, set AWS credentials via environment variables
(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION) or via
DuckDB's secrets manager.
For local filesystem paths, no credentials are needed.
Limitations
DuckDB's Iceberg extension is read-only and tracks a subset of the spec:
- v2 tables supported; older format versions best-effort.
- No writes (add partitions, schema evolution, etc.) — use your engine of choice (Spark, Trino, pyiceberg) to author, FederIQ to query.
- Time travel (`FOR TIMESTAMP AS OF ...`) is not yet surfaced through FederIQ's catalog — query DuckDB directly if you need it.
- Expect slower first-query latency than Parquet — Iceberg reads the metadata tree on every attach.
Roadmap: more connectors
Planned connectors, roughly in order of priority:
| Connector | Approach | Status |
|---|---|---|
| MongoDB | mongodb crate → NDJSON tempfile → view | Upstream-blocked on Rust 1.93 compat |
| Apache Iceberg | DuckDB iceberg extension | Shipped (see iceberg) |
| Snowflake | JDBC over snowflake-jdbc bridge or REST | Planned |
| BigQuery | BigQuery Storage API | Planned |
| Elasticsearch | REST search with cursor | Planned |
| Kafka | rdkafka + snapshot materialization | Planned |
| Delta Lake | delta-rs | Planned |
| DynamoDB | AWS SDK scan | Planned |
Open an issue on the repo if your source isn't on the list — we pick based on user demand.
Writing your own connector
See the CONNECTORS.md guide in the repo for the full walkthrough, but here's the shape:
use federiq_core::{AttachCtx, Connector};

pub struct MyConnector {
    /* your config */
}

impl Connector for MyConnector {
    fn kind(&self) -> &'static str {
        "mysource"
    }

    fn attach(&self, ctx: &AttachCtx<'_>) -> anyhow::Result<()> {
        // 1. Fetch rows from the remote source.
        // 2. Write them as NDJSON to ctx.tmpdir.
        // 3. CREATE OR REPLACE VIEW <ctx.alias> AS SELECT * FROM read_json_auto('...');
        Ok(())
    }
}
A dynamic plugin system (WASM components loaded from ~/.federiq/plugins)
is on the roadmap so third parties can ship connectors without recompiling
FederIQ.
Policy model
A FederIQ policy is a YAML record that attaches to a source (or every
source via "*") and declares how data must be handled. Policies ship
with four capabilities:
policies:
- name: mask_pii
applies_to: users
mask_columns:
email: "'***@***.com'"
ssn: "NULL"
- name: analysts_see_active_only
applies_to: events
when_role: analyst # only applies if FEDERIQ_ROLE=analyst
row_filter: "status = 'active'"
- name: pin_to_us_east
applies_to: "*"
require_region: us-east-1 # refuses to run if FEDERIQ_REGION != us-east-1
Fields
| Field | Type | Purpose |
|---|---|---|
| name | string | Human-readable identifier |
| applies_to | string or "*" | Source name to match |
| when_role | string (optional) | Only applies when caller role matches |
| require_region | string (optional) | Refuses to run outside this region |
| mask_columns | map of col: expr | Replace the column with the SQL expression |
| row_filter | string (optional) | SQL WHERE predicate AND'd into every SELECT |
Context
PolicyContext is pulled from environment variables at query time:
- `FEDERIQ_ROLE` — caller role (e.g., analyst, admin)
- `FEDERIQ_REGION` — data residency region (e.g., us-east-1)
What happens at query time
See enforcement for the rewrite details.
Enforcement
When federiq query or federiq serve sees a SQL statement against a
table with active policies, it parses the SQL with sqlparser-rs,
applies each policy as an AST transformation, and emits the rewritten
SQL for DuckDB to execute.
Row filters
-- User input
SELECT id FROM events WHERE user_id = 42;
-- After policy `row_filter: "status = 'active'"`
SELECT id FROM events WHERE user_id = 42 AND status = 'active';
Column masking
-- User input
SELECT name, email FROM users;
-- After policy `mask_columns: { email: "'***@***.com'" }`
SELECT name, '***@***.com' AS email FROM users;
The masking expression is valid SQL. The column is aliased back to its original name so callers don't see a schema change.
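Both rewrites can be pictured in a few lines of Python. This is a deliberately naive string-level sketch for illustration; FederIQ's real enforcement transforms the sqlparser-rs AST, not strings:

```python
def apply_row_filter(sql: str, predicate: str) -> str:
    """AND a policy predicate into a single-table SELECT (naive sketch)."""
    if " where " in sql.lower():
        return f"{sql.rstrip(';')} AND {predicate};"
    return f"{sql.rstrip(';')} WHERE {predicate};"

def apply_mask(columns: list[str], masks: dict[str, str]) -> str:
    """Render a select list, replacing masked columns but aliasing
    each mask back to the original column name."""
    parts = [f"{masks[c]} AS {c}" if c in masks else c for c in columns]
    return ", ".join(parts)
```

For example, `apply_mask(["name", "email"], {"email": "'***@***.com'"})` renders `name, '***@***.com' AS email`, matching the rewrite shown above.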
SELECT * limitation
Star projection cannot be masked without schema expansion. For now, FederIQ returns an error:
SELECT * cannot be enforced against policy 'mask_pii'
(masks columns on 'users'); list columns explicitly
List the columns explicitly, or upgrade when the schema-aware rewriter ships (tracked on the roadmap).
Region pinning
Region mismatches fail the rewrite:
policy 'pin_to_us_east' requires region='us-east-1'
but current context has eu-west-1
Static checking
Before running a query, preview violations:
federiq policy check "SELECT email, ssn FROM users"
Disabling enforcement
Policies auto-apply when the catalog declares any. To opt out on a per-call basis:
federiq query --enforce-policy=false "SELECT ..."
Use sparingly — this bypass is primarily for development and debugging.
Running federiq serve
federiq serve exposes the engine as an HTTP service.
Endpoints
| Method | Path | Purpose |
|---|---|---|
| GET | /healthz | Liveness probe — always 200 |
| GET | /metrics | Prometheus exposition |
| GET | /sources | List attached source names |
| POST | /query | Execute SQL — {"sql": "..."} → JSON rows |
/healthz and /metrics are always reachable. /sources and /query
require the bearer token (see auth).
Running it
federiq serve --addr 0.0.0.0:7878 --token "$FEDERIQ_TOKEN" \
--tls-cert /etc/federiq/cert.pem --tls-key /etc/federiq/key.pem
Without --tls-cert the server listens on plain HTTP and prints a loud
warning — don't expose an unencrypted port to the internet.
Making requests
curl -H "authorization: Bearer $FEDERIQ_TOKEN" \
-H "content-type: application/json" \
-d '{"sql":"SELECT 1 AS n"}' \
https://localhost:7878/query
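The same request can be built from Python's standard library. A sketch for clients without the first-party package (the published `federiq` Python client runs the engine in-process rather than over HTTP):

```python
import json
import urllib.request

def build_query_request(base_url: str, token: str, sql: str) -> urllib.request.Request:
    """Build (but don't send) a POST /query request for federiq serve."""
    return urllib.request.Request(
        f"{base_url}/query",
        data=json.dumps({"sql": sql}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To send it:
#   rows = json.load(urllib.request.urlopen(build_query_request(...)))
```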
Configuring via env
| Variable | Effect |
|---|---|
| FEDERIQ_SERVER_TOKEN | Bearer token for inbound requests |
| FEDERIQ_LOG_FORMAT | json switches tracing to structured logs |
| FEDERIQ_ROLE | Caller role for policies |
| FEDERIQ_REGION | Caller region for policies |
Authentication and TLS
Bearer tokens
federiq serve enforces a shared-secret bearer token on every request
to /sources and /query.
federiq serve --token "$(openssl rand -hex 32)"
Or via environment:
export FEDERIQ_SERVER_TOKEN="..."
federiq serve
Clients send Authorization: Bearer <token>:
new FederIQ("https://federiq.example.com", { token: process.env.FEDERIQ_TOKEN })
Token comparison is constant-time to avoid timing oracles.
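Constant-time comparison matters because a plain `==` short-circuits at the first mismatched byte, letting an attacker recover the token prefix by timing. The Python equivalent of what the server does (illustrative, not FederIQ's Rust code):

```python
import hmac

def token_ok(presented: str, expected: str) -> bool:
    """Compare tokens in constant time so response latency leaks
    nothing about how many leading characters matched."""
    return hmac.compare_digest(presented.encode(), expected.encode())
```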
TLS
Provide PEM-encoded cert and key:
federiq serve \
--tls-cert /etc/federiq/cert.pem \
--tls-key /etc/federiq/key.pem
Generate a self-signed cert for development:
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
-keyout key.pem -out cert.pem -subj "/CN=localhost"
For production, use certs from Let's Encrypt or your internal CA.
What's missing (tracked)
- Multiple tokens with distinct scopes / roles
- mTLS (client certs)
- OIDC / OAuth integration
- Rate limiting
Open an issue if any of these are blocking your deployment.
Observability and metrics
Prometheus metrics
federiq serve exposes metrics at /metrics in Prometheus text format.
| Metric | Kind | Labels |
|---|---|---|
| federiq_queries_total | counter | |
| federiq_query_errors_total | counter | |
| federiq_query_duration_seconds | histogram | |
| federiq_http_requests_total | counter | route |
| federiq_auth_failures_total | counter | |
Scrape config:
scrape_configs:
- job_name: federiq
static_configs:
- targets: ["federiq.internal:7878"]
Structured logs
export FEDERIQ_LOG_FORMAT=json
federiq serve
Emits one JSON object per log event (on stdout), with event fields flattened so they're indexable by a downstream log shipper.
Log levels are controlled by RUST_LOG:
RUST_LOG=federiq=debug,federiq_core=debug federiq serve
Per-source attach timing
Every source's ATTACH latency is tracked by the engine:
federiq query "SELECT 1" --timing
per-source attach timing:
events parquet 2.14ms
sales postgres 84.2ms
And from Python:
import json
print(json.dumps(engine.attach_timings(), indent=2))
EXPLAIN ANALYZE
DuckDB's EXPLAIN ANALYZE gives operator-level timing that naturally
includes federated source scan costs:
federiq query "SELECT COUNT(*) FROM events" --explain-analyze
Caching
FederIQ has an optional result cache keyed by a content-addressed hash of the query SQL. Three backends are available:
- memory — in-process LRU with TTL (fastest, evicted on restart)
- disk — file-backed store sharded by hash prefix (durable, shareable across processes)
- redis — enabled with `--features cache-redis` (shared across nodes)
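The content-addressed keying and the memory backend's eviction rules can be sketched in Python (illustrative only; `TtlLru` is a hypothetical name, not FederIQ API):

```python
import hashlib
import time
from collections import OrderedDict

def cache_key(sql: str) -> str:
    # Content-addressed: identical SQL text -> identical key.
    return hashlib.sha256(sql.encode()).hexdigest()

class TtlLru:
    """In-process LRU with TTL, like the `memory` backend."""
    def __init__(self, capacity: int, ttl_s: float):
        self.capacity, self.ttl_s = capacity, ttl_s
        self._items: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._items.get(key)
        if entry is None or time.monotonic() - entry[0] > self.ttl_s:
            self._items.pop(key, None)  # expired or absent
            return None
        self._items.move_to_end(key)  # mark as recently used
        return entry[1]

    def put(self, key: str, value) -> None:
        self._items[key] = (time.monotonic(), value)
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
```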
From the CLI
federiq query --cache disk --cache-dir ./.cache "SELECT COUNT(*) FROM events"
federiq cache status --backend disk --dir ./.cache
federiq cache clear --backend disk --dir ./.cache
From Rust
use federiq_core::{DiskCache, Engine, FreqPredictor};
use std::sync::Arc;

let engine = Engine::new()?
    .with_cache(Arc::new(DiskCache::new("./.cache")?))
    .with_predictor(Arc::new(FreqPredictor::new(10)));
engine.attach_all(&catalog.sources)?;
let rows = engine.query_cached("SELECT COUNT(*) FROM events")?;
Predictors
Predictor::observe(sql, hit) gets called on every cache access. The
default FreqPredictor counts queries; predict() returns the top-k
most-frequent for background warmup.
engine.warmup()?; // runs predictor.predict() and caches the results
A smarter ML predictor is on the roadmap — implement the Predictor
trait to plug in your own.
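The default predictor's behavior is easy to picture in Python (an illustrative re-implementation, not the Rust original):

```python
from collections import Counter

class FreqPredictor:
    """Counts observed queries; predicts the top-k for warmup."""
    def __init__(self, k: int):
        self.k = k
        self.counts: Counter[str] = Counter()

    def observe(self, sql: str, hit: bool) -> None:
        # Every cache access counts, hit or miss.
        self.counts[sql] += 1

    def predict(self) -> list[str]:
        # Most-frequent queries first, capped at k.
        return [sql for sql, _ in self.counts.most_common(self.k)]
```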
What's not cached yet
- Queries whose results depend on the calling role/region (planned for v0.5.1 when `PolicyContext` is plumbed through `query_cached`).
- Negative results (errors are not cached).
- Schema introspection.
Python client
pip install federiq
pip install 'federiq[pandas]' # optional pandas helper
Quickstart
import federiq
engine = federiq.connect("./federiq.yaml")
rows = engine.query("SELECT 1 AS n")
print(rows) # [{'n': 1}]
df = federiq.query_df(engine, "SELECT COUNT(*) AS n FROM events")
print(df)
API
class Engine:
@staticmethod
def from_catalog(path: str) -> "Engine": ...
def attach(self, source: dict) -> None: ...
def query(self, sql: str) -> list[dict]: ...
def explain(self, sql: str) -> str: ...
def explain_analyze(self, sql: str) -> str: ...
def attach_timings(self) -> list[dict]: ...
def duckdb_version(self) -> str: ...
Jupyter
See examples/jupyter/federiq_intro.ipynb for an interactive walkthrough.
TypeScript client
npm install federiq
The TypeScript client is a thin HTTP wrapper around federiq serve.
Quickstart
import { FederIQ } from "federiq";
const fq = new FederIQ("https://federiq.example.com", {
token: process.env.FEDERIQ_TOKEN,
});
const { rows } = await fq.query("SELECT 1 AS n");
console.log(rows);
API
class FederIQ {
constructor(baseUrl: string, opts?: FederIQOptions);
query(sql: string): Promise<{ rows: unknown[] }>;
sources(): Promise<string[]>;
healthz(): Promise<boolean>;
}
interface FederIQOptions {
token?: string;
fetch?: typeof fetch;
timeoutMs?: number;
headers?: Record<string, string>;
}
Works in any runtime with global fetch — Node 18+, Bun, Deno, and
modern browsers.
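The `fetch` option makes the client easy to exercise without a live server. A minimal sketch of the same pattern with a hand-rolled helper (`runQuery` is hypothetical, not part of the published client):

```typescript
type FetchLike = typeof fetch;

// Hypothetical helper mirroring the client's POST /query call,
// with fetch injected so tests can stub the network.
async function runQuery(
  baseUrl: string,
  token: string,
  sql: string,
  fetchImpl: FetchLike = fetch,
): Promise<unknown[]> {
  const res = await fetchImpl(`${baseUrl}/query`, {
    method: "POST",
    headers: {
      authorization: `Bearer ${token}`,
      "content-type": "application/json",
    },
    body: JSON.stringify({ sql }),
  });
  if (!res.ok) throw new Error(`query failed: HTTP ${res.status}`);
  const { rows } = (await res.json()) as { rows: unknown[] };
  return rows;
}
```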
CLI reference
federiq [--config PATH] [-v] <subcommand>
Every subcommand accepts --help for detail. A condensed reference:
| Subcommand | Purpose |
|---|---|
| init | Scaffold a new federiq.yaml |
| sources list | List declared sources |
| sources add | Add a source |
| sources remove | Remove a source |
| sources test | Verify one or all sources are reachable |
| sources describe | Show source detail and generated ATTACH SQL |
| schema list | List tables in a source |
| schema describe | Show column types for a table |
| query | Execute SQL (see flags below) |
| shell | Interactive REPL |
| serve | HTTP server with auth + TLS + metrics |
| config show | Print resolved catalog YAML |
| config validate | Lint catalog sources |
| policy list | List policies |
| policy show | Show one policy |
| policy validate | Cross-check policies against sources |
| policy check | Static violation scan for a SQL query |
| cache status | Cache statistics |
| cache clear | Empty the cache |
| doctor | Environment diagnostics |
| completions <shell> | Emit shell completion scripts |
federiq query flags
| Flag | Meaning |
|---|---|
| -f, --file PATH | Read SQL from a file |
| -o, --output PATH | Write results to a file |
| --format FMT | table (default), csv, json, ndjson, parquet |
| --explain | Print DuckDB plan |
| --explain-analyze | Plan plus per-operator timing |
| --timing | Print total and per-source timing |
| --cache MODE | off / memory / disk |
| --cache-dir PATH | Disk cache location |
| --enforce-policy BOOL | Default true when catalog has policies |
Configuration reference
Environment variables
| Variable | Where | Purpose |
|---|---|---|
| FEDERIQ_SERVER_TOKEN | federiq serve | Bearer token for inbound requests |
| FEDERIQ_LOG_FORMAT | all commands | json switches logs to JSON |
| FEDERIQ_ROLE | all commands | Caller role for policies |
| FEDERIQ_REGION | all commands | Caller region for policies |
| RUST_LOG | all commands | tracing env-filter expression |
Catalog reference
See catalog and policy pages for the authoritative schema.
File locations
- Default catalog: `./federiq.yaml`
- Default disk cache: `./.federiq-cache/`
- Shell history: `./.federiq_history`
All configurable via flags — nothing writes to $HOME without your
involvement.
FAQ
How is this different from Trino?
Trino is a distributed query engine. FederIQ is a single binary built on DuckDB. For small-to-medium deployments you get most of Trino's federation story without the JVM ops burden. For true petabyte multi-tenant workloads, Trino is still the right answer.
Can I use FederIQ in production?
Today: probably not without caveats. Tier-1 features (auth, TLS, policy enforcement, observability) are shipped but have only been used by early-alpha testers. Wait for v1.0 for load-bearing workloads.
Does it modify my data?
No. Every source is attached READ_ONLY. FederIQ only reads.
What SQL dialect?
DuckDB's, which is broadly Postgres-compatible. Functions like
date_trunc, epoch, list_aggregate, and DuckDB's extensions are all
available.
Why is the first build slow?
DuckDB's C++ core compiles from source via the bundled feature. First
build takes 3–6 minutes. Subsequent incremental builds are fast.
How do I pin data to a region?
Use the require_region field of a policy. See enforcement.
Can I cache results across processes / nodes?
Yes — use the disk cache (shared filesystem) or the Redis cache
(cargo install federiq-cli --features cache-redis).
What about joins that span sources?
DuckDB plans them natively. For expensive cross-source joins consider caching the hot side or materializing to Parquet. A cost-aware planner is on the roadmap.
Is there a managed cloud?
Planned. Core will stay Apache 2.0 forever; hosted and enterprise features will be separately licensed.
FederIQ vs. Trino vs. DuckDB
A pragmatic feature comparison. None of these tools is strictly better — they solve different pain points.
| Dimension | DuckDB | FederIQ | Trino |
|---|---|---|---|
| Deploy unit | Single binary | Single binary | JVM cluster |
| Minimum memory | ~50 MB | ~100 MB | ~4 GB |
| Cold start | ~50 ms | ~200 ms | ~30 s |
| Multi-source federation | Limited | Yes | Yes |
| Postgres / MySQL / SQLite | Via extensions | Native catalog | Native |
| Parquet / CSV / JSON | Native | Via DuckDB | Via Hive connector |
| HTTP / REST | No | Yes (default feature) | No |
| Mongo | No | Planned | Yes |
| Iceberg | Via extension | Yes (DuckDB iceberg extension) | Yes |
| Declarative policy engine | No | Yes (YAML, SQL rewrite) | Via Ranger add-on |
| Column masking / row filters | Manual views | Built-in, catalog-declared | Via Ranger |
| HTTP service mode | No | Yes (auth + TLS + metrics) | Yes |
| Cache layer (memory/disk/Redis) | Manual | Built-in | Materialized views |
| Python / TS clients | Yes / No | Yes / Yes | Yes / community |
| License | MIT | Apache 2.0 | Apache 2.0 |
When to pick each
- DuckDB alone: single-node analytics, embedded in a Python/R/Node process, local Parquet/CSV processing.
- FederIQ: you need federated queries across 2–10 sources, a policy engine, or an HTTP service, but don't want the operational cost of Trino. Works great embedded or as a sidecar.
- Trino: petabyte-scale, hundreds of concurrent users, existing Hive metastore, or you already run a JVM fleet.
Benchmarks
Rigorous TPC-H numbers against Trino and Dremio are on the roadmap. When
we publish, the benchmark harness and raw numbers will live in the repo
under bench/ so you can reproduce.