Output Formats
Monad supports multiple output formats to meet diverse data integration needs. This guide helps you choose the right format for your use case and links to detailed configuration documentation for each.
Supported Formats Overview
Monad currently supports three primary output format families:
| Format | Description | File Extensions | Best For |
|---|---|---|---|
| JSON | Flexible text-based format with three variants | .json, .jsonl | APIs, web services, streaming, human-readable data |
| Delimited | Tabular formats like CSV and TSV | .csv, .tsv | Spreadsheets, traditional analytics tools, data exchange |
| Parquet | Columnar binary format optimized for analytics | .parquet | Data warehouses, big data analytics, long-term storage |
Format Comparison
Performance Characteristics
| Format | Write Speed | Read Speed | Compression | File Size |
|---|---|---|---|---|
| JSON Array | Medium | Medium | Good with gzip | Medium |
| JSON Line | Fast | Fast | Good with gzip | Medium |
| JSON Nested | Medium | Medium | Good with gzip | Medium |
| Delimited (CSV) | Fast | Fast | Excellent with gzip | Small |
| Parquet | Slow | Very Fast | Built-in (excellent) | Very Small |
Feature Support
| Feature | JSON | Delimited | Parquet |
|---|---|---|---|
| Human Readable | ✅ Yes | ✅ Yes | ❌ No (binary) |
| Schema Evolution | ✅ Flexible | ⚠️ Limited | ✅ Yes |
| Nested Data | ✅ Full support | ❌ No | ✅ Full support |
| Data Types | ⚠️ Basic | ⚠️ Strings only | ✅ Rich type system |
| Query Performance | ⚠️ Scan entire file | ⚠️ Scan entire file | ✅ Columnar access |
| Streaming Support | ✅ Yes (line format) | ✅ Yes | ❌ No |
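The streaming difference between the JSON variants is worth seeing concretely. A JSON array is one document, so adding a record means rewriting the whole file; JSON Line output puts one record per line, so new records can simply be appended. A minimal illustration using only the Python standard library (not Monad's API):

```python
import json

records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# JSON array: a single document; appending a record requires
# rewriting the enclosing brackets.
array_doc = json.dumps(records)

# JSON Line: each record is a self-contained line, so new records
# can be appended to the end of a file or stream as they arrive.
jsonl_doc = "\n".join(json.dumps(r) for r in records)

print(array_doc)
print(jsonl_doc)
```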
Choosing the Right Format
Use JSON When:
- Integrating with REST APIs or web services
- Human readability is important
- Your data has nested structures or varying schemas
- You need streaming capabilities (use line format)
- Working with document-oriented databases
Use Delimited (CSV/TSV) When:
- Importing data into spreadsheet applications
- Working with legacy systems or traditional BI tools
- Your data is purely tabular with consistent columns
- You need the smallest file sizes when compressed
- Simplicity is more important than features
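To show why delimited output suits purely tabular data, here is a generic sketch with Python's standard `csv` module (illustrative only, not Monad's writer). Note that every value is a string on output, and values containing the delimiter are quoted automatically:

```python
import csv
import io

rows = [
    {"id": "1", "name": "alpha", "value": "3.14"},
    {"id": "2", "name": "beta, with comma", "value": "2.72"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "value"])
writer.writeheader()    # emits the header row: id,name,value
writer.writerows(rows)  # values containing "," are quoted automatically

print(buf.getvalue())
```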
Use Parquet When:
- Building a data warehouse or data lake
- Performing analytical queries on large datasets
- You need optimal query performance
- Long-term storage with efficient compression is required
- Working with Apache Spark, Athena, or similar big data tools
Quick Configuration Examples
JSON Array Format
```json
{
  "format_config": {
    "format": "json",
    "json": {
      "type": "array"
    }
  }
}
```
CSV Format
```json
{
  "format_config": {
    "format": "delimited",
    "delimited": {
      "delimiter": ",",
      "headers": ["id", "name", "value"]
    }
  }
}
```
Parquet Format
```json
{
  "format_config": {
    "format": "parquet",
    "parquet": {
      "schema": "{\"Tag\": \"name=data\", \"Fields\": [{\"Tag\": \"name=id, type=INT64, repetitiontype=REQUIRED\"}]}"
    }
  }
}
```
Format-Specific Documentation
For detailed configuration options and advanced usage:
- JSON Format - Array, nested, and line-delimited JSON configurations
- Delimited Format - CSV, TSV, and custom delimiter configurations
- Parquet Format - Schema definition and columnar storage optimization
Best Practices
- Consider Your Downstream Systems: Choose formats that your consuming applications can efficiently process.
- Balance Readability and Performance: JSON offers readability, Parquet offers performance, and CSV offers compatibility.
- Think About Schema Evolution: If your data structure changes frequently, JSON provides the most flexibility.
- Compression Matters: All formats support compression, but Parquet has it built-in and typically achieves the best ratios.
- Test with Real Data: Performance characteristics can vary based on your specific data patterns and volumes.
Common Patterns
ETL Pipelines
- Extract: Use JSON or CSV for initial data extraction
- Transform: Process in memory or with streaming tools
- Load: Use Parquet for final storage in data warehouses
Real-time Streaming
- Use JSON Line format for append-friendly streaming
- Enable minimal batching for near real-time delivery
- Consider compression trade-offs for latency
Analytics Workloads
- Use Parquet with Hive-compatible partitioning
- Define schemas that optimize for your query patterns
- Leverage columnar benefits for aggregation queries
Need Help?
If you're unsure which format to choose, consider these questions:
- What system will consume this data?
- How important is human readability?
- What's the expected data volume?
- Do you need to support complex nested structures?
- Will the schema change frequently?
The answers to these questions will guide you toward the optimal format for your use case.