Output Formats

Monad supports multiple output formats to meet diverse data integration needs. This guide helps you choose the right format for your use case and provides links to detailed configuration documentation for each format.

Supported Formats Overview

Monad currently supports three primary output format families:

| Format | Description | File Extensions | Best For |
|---|---|---|---|
| JSON | Flexible text-based format with three variants | `.json`, `.jsonl` | APIs, web services, streaming, human-readable data |
| Delimited | Tabular formats like CSV and TSV | `.csv` | Spreadsheets, traditional analytics tools, data exchange |
| Parquet | Columnar binary format optimized for analytics | `.parquet` | Data warehouses, big data analytics, long-term storage |

Format Comparison

Performance Characteristics

| Format | Write Speed | Read Speed | Compression | File Size |
|---|---|---|---|---|
| JSON Array | Medium | Medium | Good with gzip | Medium |
| JSON Line | Fast | Fast | Good with gzip | Medium |
| JSON Nested | Medium | Medium | Good with gzip | Medium |
| Delimited (CSV) | Fast | Fast | Excellent with gzip | Small |
| Parquet | Slow | Very Fast | Built-in (excellent) | Very Small |

Feature Support

| Feature | JSON | Delimited | Parquet |
|---|---|---|---|
| Human Readable | ✅ Yes | ✅ Yes | ❌ No (binary) |
| Schema Evolution | ✅ Flexible | ⚠️ Limited | ✅ Yes |
| Nested Data | ✅ Full support | ❌ No | ✅ Full support |
| Data Types | ⚠️ Basic | ⚠️ Strings only | ✅ Rich type system |
| Query Performance | ⚠️ Scan entire file | ⚠️ Scan entire file | ✅ Columnar access |
| Streaming Support | ✅ Yes (line format) | ✅ Yes | ❌ No |

Choosing the Right Format

Use JSON When:

  • Integrating with REST APIs or web services
  • Human readability is important
  • Your data has nested structures or varying schemas
  • You need streaming capabilities (use line format)
  • Working with document-oriented databases

Use Delimited (CSV/TSV) When:

  • Importing data into spreadsheet applications
  • Working with legacy systems or traditional BI tools
  • Your data is purely tabular with consistent columns
  • You want compact files with simple gzip compression
  • Simplicity is more important than features

Use Parquet When:

  • Building a data warehouse or data lake
  • Performing analytical queries on large datasets
  • You need optimal query performance
  • Long-term storage with efficient compression is required
  • Working with Apache Spark, Athena, or similar big data tools

Quick Configuration Examples

JSON Array Format

{
  "format_config": {
    "format": "json",
    "json": {
      "type": "array"
    }
  }
}
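The array variant above buffers records into a single JSON array. For streaming use cases, the line-delimited variant is usually the better choice. The sketch below assumes the variant is selected with a `type` value of `line`, matching the "JSON Line" name in the comparison tables; confirm the exact value against the JSON format documentation.

```json
{
  "format_config": {
    "format": "json",
    "json": {
      "type": "line"
    }
  }
}
```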

CSV Format

{
  "format_config": {
    "format": "delimited",
    "delimited": {
      "delimiter": ",",
      "headers": ["id", "name", "value"]
    }
  }
}
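TSV output uses the same delimited configuration with a tab character as the delimiter. This sketch reuses the `delimiter` and `headers` options shown above; the header names are illustrative.

```json
{
  "format_config": {
    "format": "delimited",
    "delimited": {
      "delimiter": "\t",
      "headers": ["id", "name", "value"]
    }
  }
}
```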

Parquet Format

{
  "format_config": {
    "format": "parquet",
    "parquet": {
      "schema": "{\"Tag\": \"name=data\", \"Fields\": [{\"Tag\": \"name=id, type=INT64, repetitiontype=REQUIRED\"}]}"
    }
  }
}
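Parquet requires an explicit schema. Extending the single-field example above, a multi-field schema follows the same escaped JSON-string convention. The second field here is illustrative, and the supported type and converted-type names should be verified against the Parquet format documentation.

```json
{
  "format_config": {
    "format": "parquet",
    "parquet": {
      "schema": "{\"Tag\": \"name=data\", \"Fields\": [{\"Tag\": \"name=id, type=INT64, repetitiontype=REQUIRED\"}, {\"Tag\": \"name=name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL\"}]}"
    }
  }
}
```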

Format-Specific Documentation

For detailed configuration options and advanced usage, see the dedicated documentation pages for the JSON, Delimited, and Parquet formats.

Best Practices

  1. Consider Your Downstream Systems: Choose formats that your consuming applications can efficiently process.

  2. Balance Readability and Performance: JSON offers readability, Parquet offers performance, and CSV offers compatibility.

  3. Think About Schema Evolution: If your data structure changes frequently, JSON provides the most flexibility.

  4. Compression Matters: All formats support compression, but Parquet has it built-in and typically achieves the best ratios.

  5. Test with Real Data: Performance characteristics can vary based on your specific data patterns and volumes.

Common Patterns

ETL Pipelines

  • Extract: Use JSON or CSV for initial data extraction
  • Transform: Process in memory or with streaming tools
  • Load: Use Parquet for final storage in data warehouses

Real-time Streaming

  • Use JSON Line format for append-friendly streaming
  • Enable minimal batching for near real-time delivery
  • Consider compression trade-offs for latency

Analytics Workloads

  • Use Parquet with Hive-compatible partitioning
  • Define schemas that optimize for your query patterns
  • Leverage columnar benefits for aggregation queries

Need Help?

If you're unsure which format to choose, consider these questions:

  1. What system will consume this data?
  2. How important is human readability?
  3. What's the expected data volume?
  4. Do you need to support complex nested structures?
  5. Will the schema change frequently?

The answers to these questions will guide you toward the optimal format for your use case.