Overview
rulesgen is a secure rule-processing service for synthetic data workflows.
It accepts rule input as either natural_language or a restricted DSL. For
natural_language requests, it translates the input into an untrusted
semantic_frame plus a DSL candidate. It then validates the DSL and compiles it
into a compiled_rule, supports a local execution_preview, and can run full
dataset generation as a tracked job.
The service is designed around a clear trust boundary: LLM output is never
trusted directly. Natural-language output and DSL candidates remain untrusted
until parser and validator checks succeed, and diagnostics are part of the
contract at every stage.
Who this documentation is for
- Teams authoring rules for synthetic data workflows.
- Operators evaluating local preview and full dataset-generation paths.
- Application teams embedding the Python library or HTTP API.
- Platform teams configuring LLM, guardrail, storage, and execution backends.
What rulesgen does
The core workflow is staged so each step can be inspected independently:
- Parse rule input into a semantic_frame, DSL candidate, and diagnostics.
- Treat natural_language output and every DSL candidate as untrusted until validation succeeds.
- Compile validated DSL into a compiled_rule artifact.
- Run an execution_preview against a sample row and seed.
- Generate a target dataset as a tracked job.
- Inspect diagnostics, generated artifacts, and job metadata.
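The staged shape of this workflow can be illustrated with a small Python sketch. It is not the rulesgen library API: the names Diagnostic, ParseResult, CompiledRule, parse, and compile_rule are hypothetical and exist only to show how each stage can return its own diagnostics and how a compiled_rule is produced only after the checks pass.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Diagnostic:
    stage: str     # stage that produced the message, e.g. "parse" or "compile"
    severity: str  # "info", "warning", or "error"
    message: str


@dataclass
class ParseResult:
    # Hypothetical container: everything in here is still untrusted.
    semantic_frame: dict
    dsl_candidate: str
    diagnostics: List[Diagnostic] = field(default_factory=list)


@dataclass(frozen=True)
class CompiledRule:
    # Only constructed after validation, so holding one implies the DSL passed.
    dsl: str


def parse(rule_input: str, mode: str) -> ParseResult:
    """Stand-in for the parse stage (no LLM is called in this sketch)."""
    if mode == "dsl":
        return ParseResult(semantic_frame={}, dsl_candidate=rule_input)
    # natural_language would go through an LLM translator; here we only
    # record that no translator exists in this illustration.
    result = ParseResult(semantic_frame={"intent": rule_input}, dsl_candidate="")
    result.diagnostics.append(
        Diagnostic("parse", "error", "no translator configured in this sketch")
    )
    return result


def compile_rule(parsed: ParseResult) -> Tuple[Optional[CompiledRule], List[Diagnostic]]:
    """Return a CompiledRule only if no stage reported an error."""
    diagnostics = list(parsed.diagnostics)
    if any(d.severity == "error" for d in diagnostics):
        return None, diagnostics
    if not parsed.dsl_candidate.strip():
        diagnostics.append(Diagnostic("compile", "error", "empty DSL candidate"))
        return None, diagnostics
    diagnostics.append(Diagnostic("compile", "info", "validated"))
    return CompiledRule(dsl=parsed.dsl_candidate), diagnostics
```

Because every stage returns diagnostics alongside its result, a caller can stop after any step and inspect what happened without running the later stages.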
rulesgen can be used through the HTTP API or directly as a Python library.
The HTTP service provides endpoints for parsing, compilation, preview,
dataset uploads, dataset generation, job polling, and artifact download. The
library API exposes the same core capabilities for in-process use.
Safety model
rulesgen is not a direct natural-language-to-Python executor. A
natural_language rule is translated into an LLM-produced semantic_frame
and DSL candidate, and both remain untrusted until validation succeeds. The
compiler accepts only a restricted Python-expression subset and a runtime
helper whitelist. The preview executor is intended for fast local feedback;
full dataset generation runs through either the subprocess dataset executor or
the optional Alibaba OpenSandbox adapter.
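As a rough illustration of what "a restricted Python-expression subset and a runtime helper whitelist" can mean in practice, the sketch below walks an expression's syntax tree with the standard ast module and rejects anything outside an allow-list. The node list, the helper names, and the validate_expression function are assumptions made for this example; the actual subset accepted by the rulesgen compiler is not specified here.

```python
import ast

# Hypothetical allow-lists; the real subset is defined by the rulesgen compiler.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.BoolOp, ast.Compare, ast.IfExp,
    ast.Call, ast.Name, ast.Load, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod, ast.USub, ast.Not,
    ast.And, ast.Or, ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE,
)
ALLOWED_HELPERS = {"round", "min", "max", "abs"}


def validate_expression(source: str) -> list:
    """Return diagnostic strings; an empty list means the expression passed."""
    try:
        # mode="eval" accepts a single expression and rejects statements.
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            # Attribute access, lambdas, comprehensions, etc. land here.
            problems.append(f"disallowed syntax: {type(node).__name__}")
        elif isinstance(node, ast.Call):
            # Only direct calls to whitelisted helper names are accepted.
            func = node.func
            if not (isinstance(func, ast.Name) and func.id in ALLOWED_HELPERS):
                problems.append("call to non-whitelisted helper")
    return problems


print(validate_expression("round(price * 1.1, 2)"))
print(validate_expression("__import__('os').system('ls')"))
```

The check never evaluates the input; it only inspects the parsed tree. A fuller validator would presumably also check bare names against the known row fields, which this sketch leaves out.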
Pre-LLM guardrails scan natural-language input for prompt injection and jailbreak attempts. Blocked requests return a standard Problem Details response without exposing scanner internals to the caller.
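A blocked response might look roughly like the following, assuming "Problem Details" refers to the RFC 9457 format (type, title, status, detail, instance). The function name, the type URI, the status code, and the wording of the fields are all hypothetical; the point of the sketch is only that the scanner's findings are not copied into the body returned to the caller.

```python
import json


def blocked_request_problem(scanner_result: dict, instance: str) -> dict:
    """Build a Problem Details body for a request blocked by a guardrail.

    scanner_result is accepted but intentionally not included in the
    response; it would be logged server-side instead.
    """
    return {
        "type": "https://example.com/problems/guardrail-blocked",  # hypothetical URI
        "title": "Request blocked by input guardrail",
        "status": 400,  # assumed status code
        "detail": "The natural-language input was rejected before translation.",
        "instance": instance,
    }


body = blocked_request_problem(
    {"scanner": "prompt-injection", "score": 0.97},  # made-up scanner output
    "/requests/abc",
)
print(json.dumps(body, indent=2))
```

Under RFC 9457 such a body is served with the application/problem+json media type.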