JSON Schema Format for Parquet Output
This document explains how to define a Parquet schema in JSON for any Monad output component that supports the Parquet format.
Overview
The JSON schema format allows you to define complex Parquet schemas including nested structures, lists, maps, and various data types with their respective encodings and compression options.
Basic Structure
A Parquet schema in JSON format consists of a root element and a collection of fields:
{
  "Tag": "name=root_name, repetitiontype=REQUIRED",
  "Fields": [
    // field definitions go here
  ]
}
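As a minimal sketch, the root element can also be assembled programmatically and serialized to JSON. The helper name `make_schema` and the single `id` field are illustrative assumptions, not part of Monad.

```python
import json

def make_schema(root_name, fields):
    """Wrap field definitions in the root element shown above (hypothetical helper)."""
    return {
        "Tag": f"name={root_name}, repetitiontype=REQUIRED",
        "Fields": list(fields),
    }

# One illustrative field; a real schema lists every column here.
schema = make_schema("root_name", [
    {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
])

# Serialize to the JSON text form of the schema.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Generating the schema this way avoids hand-editing mistakes such as a trailing comma or an unbalanced bracket.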
Field Definition
Each field is defined as a JSON object with at least a Tag property. For nested structures, a field may also contain a Fields array:
{
  "Tag": "name=field_name, type=DATA_TYPE, repetitiontype=REPETITION_TYPE, [additional attributes]"
}
Common Tag Attributes
| Attribute | Description | Example |
|---|---|---|
| name | Field name in the Parquet file (required) | name=customer_id |
| inname | Input field name in your source data (optional) | inname=CustomerID |
| type | Parquet data type (required) | type=INT64 |
| repetitiontype | Field repetition type (required) | repetitiontype=REQUIRED |
| convertedtype | Logical type for the field (optional) | convertedtype=UTF8 |
| encoding | Encoding method (optional) | encoding=PLAIN_DICTIONARY |
| omitstats | Skip stats for this field (optional) | omitstats=true |
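To make the Tag format concrete, the sketch below splits a Tag string into its attributes and reports which required attributes are missing. It assumes attributes are comma-separated `key=value` pairs, as in the examples in this document; `parse_tag` and `missing_attributes` are hypothetical helpers, and the parser Monad actually uses may behave differently.

```python
def parse_tag(tag):
    """Split a Tag string into an {attribute: value} dict."""
    attrs = {}
    for part in tag.split(","):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed attribute: {part!r}")
        attrs[key.strip()] = value.strip()
    return attrs

def missing_attributes(attrs, has_children=False):
    """Return the required attributes absent from a parsed Tag.

    Fields that carry a Fields array may omit type in this document's
    examples, so type is only checked when has_children is False.
    """
    required = {"name", "repetitiontype"}
    if not has_children:
        required.add("type")
    return sorted(required - set(attrs))

attrs = parse_tag("name=customer_id, inname=CustomerID, type=INT64, repetitiontype=REQUIRED")
print(attrs)
print(missing_attributes(parse_tag("name=zip, type=INT32")))
```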
Primitive Data Types
| Type | Description | Go Type Mapping |
|---|---|---|
| BOOLEAN | Boolean value | bool |
| INT32 | 32-bit integer | int32 |
| INT64 | 64-bit integer | int64 |
| INT96 | 96-bit integer (deprecated) | string |
| FLOAT | Single precision (32-bit) floating point | float32 |
| DOUBLE | Double precision (64-bit) floating point | float64 |
| BYTE_ARRAY | Array of bytes | string |
| FIXED_LEN_BYTE_ARRAY | Fixed-length byte array | string (requires length attribute) |
Logical Types
Logical types provide additional semantic information for primitive types:
| Logical Type | Primitive Type | Attributes | Example |
|---|---|---|---|
| UTF8 | BYTE_ARRAY | convertedtype=UTF8 | "Tag": "name=username, type=BYTE_ARRAY, convertedtype=UTF8" |
| INT_8/16/32/64 | INT32/INT64 | convertedtype=INT_8 | "Tag": "name=small_int, type=INT32, convertedtype=INT_8" |
| UINT_8/16/32/64 | INT32/INT64 | convertedtype=UINT_16 | "Tag": "name=unsigned, type=INT32, convertedtype=UINT_16" |
| DATE | INT32 | convertedtype=DATE | "Tag": "name=birth_date, type=INT32, convertedtype=DATE" |
| TIME_MILLIS | INT32 | convertedtype=TIME_MILLIS | "Tag": "name=event_time, type=INT32, convertedtype=TIME_MILLIS" |
| TIME_MICROS | INT64 | convertedtype=TIME_MICROS | "Tag": "name=precise_time, type=INT64, convertedtype=TIME_MICROS" |
| TIMESTAMP_MILLIS | INT64 | convertedtype=TIMESTAMP_MILLIS | "Tag": "name=created_at, type=INT64, convertedtype=TIMESTAMP_MILLIS" |
| TIMESTAMP_MICROS | INT64 | convertedtype=TIMESTAMP_MICROS | "Tag": "name=modified_at, type=INT64, convertedtype=TIMESTAMP_MICROS" |
| DECIMAL | Various | convertedtype=DECIMAL, scale=N, precision=M | "Tag": "name=price, type=INT32, convertedtype=DECIMAL, scale=2, precision=9" |
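Because logical types are stored as plain integers, it can help to see what those integers look like. The sketch below follows the Parquet format's definitions: DATE is days since the Unix epoch, TIMESTAMP_MILLIS is milliseconds since the Unix epoch, and DECIMAL stores the unscaled value (the number multiplied by 10 to the power of scale). Whether Monad expects input that is already converted or performs the conversion itself is not stated here, so treat these hypothetical helpers as an illustration of the storage format only.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

UNIX_EPOCH_DATE = date(1970, 1, 1)
UNIX_EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_date_days(d):
    """DATE: days since the Unix epoch, stored in an INT32."""
    return (d - UNIX_EPOCH_DATE).days

def to_timestamp_millis(dt):
    """TIMESTAMP_MILLIS: milliseconds since the Unix epoch, stored in an INT64."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return (dt - UNIX_EPOCH_UTC) // timedelta(milliseconds=1)

def to_decimal_unscaled(value, scale):
    """DECIMAL: the unscaled integer, i.e. value * 10**scale."""
    scaled = Decimal(value).scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError("value has more fractional digits than the scale allows")
    return int(scaled)

print(to_date_days(date(2024, 1, 1)))
print(to_timestamp_millis(datetime(1970, 1, 1, 0, 0, 1, tzinfo=timezone.utc)))
print(to_decimal_unscaled("19.99", 2))
```

With scale=2, a price of 19.99 is stored as the integer 1999, which fits the INT32 and precision=9 combination used in the DECIMAL example.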
Complex Types
Lists
{
  "Tag": "name=items, type=LIST, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=element, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
  ]
}
Maps
{
  "Tag": "name=properties, type=MAP, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=key, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=value, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
  ]
}
Nested Structures
{
  "Tag": "name=address, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=street, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=city, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=zip, type=INT32, repetitiontype=REQUIRED"}
  ]
}
Repeated Fields
{"Tag": "name=phone_numbers, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"}
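The four complex-type shapes above follow a regular pattern, so they can be generated with a few small builders. This is a sketch; the function names `field`, `group`, `list_of`, and `map_of` are hypothetical, and the child names `element`, `key`, and `value` are taken from the examples in this section.

```python
def field(name, type_, repetition="REQUIRED", **extra):
    """Build a leaf field; extra keyword arguments become Tag attributes."""
    parts = [f"name={name}", f"type={type_}"]
    parts += [f"{key}={value}" for key, value in extra.items()]
    parts.append(f"repetitiontype={repetition}")
    return {"Tag": ", ".join(parts)}

def group(name, children, repetition="REQUIRED"):
    """Build a nested structure: no type, only child fields."""
    return {"Tag": f"name={name}, repetitiontype={repetition}", "Fields": list(children)}

def list_of(name, element, repetition="REQUIRED"):
    """Build a LIST whose single child is the element definition."""
    return {"Tag": f"name={name}, type=LIST, repetitiontype={repetition}", "Fields": [element]}

def map_of(name, key, value, repetition="REQUIRED"):
    """Build a MAP whose children are the key and value definitions."""
    return {"Tag": f"name={name}, type=MAP, repetitiontype={repetition}", "Fields": [key, value]}

items = list_of("items", field("element", "BYTE_ARRAY", convertedtype="UTF8"))
properties = map_of(
    "properties",
    field("key", "BYTE_ARRAY", convertedtype="UTF8"),
    field("value", "BYTE_ARRAY", convertedtype="UTF8"),
)
phone_numbers = field("phone_numbers", "BYTE_ARRAY", repetition="REPEATED", convertedtype="UTF8")
print(items)
```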
Repetition Types
| Type | Description |
|---|---|
| REQUIRED | The field must be present (non-null) |
| OPTIONAL | The field can be null |
| REPEATED | The field can appear multiple times (similar to an array) |
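The sketch below shows one way to read these rules against an input record before writing: REQUIRED fields must be present and non-null, OPTIONAL fields may be absent, and REPEATED fields should hold a list. The `check_record` helper is hypothetical and only inspects top-level fields; how Monad itself reacts to a record that violates the schema is not covered here.

```python
def tag_attrs(tag):
    """Parse 'k=v, k=v' into a dict (assumes well-formed Tag strings)."""
    return dict(part.strip().split("=", 1) for part in tag.split(","))

def check_record(record, fields):
    """Return a list of repetition-type problems for one input record."""
    problems = []
    for definition in fields:
        attrs = tag_attrs(definition["Tag"])
        # inname, when present, is the key used in the source data.
        key = attrs.get("inname", attrs["name"])
        repetition = attrs["repetitiontype"]
        value = record.get(key)
        if repetition == "REQUIRED" and value is None:
            problems.append(f"{key}: required field is missing or null")
        elif repetition == "REPEATED" and value is not None and not isinstance(value, list):
            problems.append(f"{key}: repeated field should be a list")
    return problems

fields = [
    {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
    {"Tag": "name=email, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},
    {"Tag": "name=tags, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"},
]
print(check_record({"id": 1, "tags": ["new"]}, fields))
print(check_record({"email": None, "tags": "new"}, fields))
```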
Encoding Options
| Type | Support | Description |
|---|---|---|
| PLAIN | All types | Default encoding |
| PLAIN_DICTIONARY | All types | Dictionary-based encoding for repeated values |
| DELTA_BINARY_PACKED | Integer types | Efficient for sequences with small deltas |
| DELTA_BYTE_ARRAY | BYTE_ARRAY, UTF8 | Efficient for strings with common prefixes |
| DELTA_LENGTH_BYTE_ARRAY | BYTE_ARRAY, UTF8 | Efficient for strings with similar lengths |
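The support column can be turned into a quick compatibility check. In this sketch, "Integer types" is read as INT32 and INT64, and UTF8 is treated as BYTE_ARRAY because it is a logical type over that primitive; both readings are assumptions, as is the helper name `encoding_allowed`.

```python
# None means the encoding is listed as supporting all types.
ENCODING_TYPES = {
    "PLAIN": None,
    "PLAIN_DICTIONARY": None,
    "DELTA_BINARY_PACKED": {"INT32", "INT64"},  # assumed meaning of "Integer types"
    "DELTA_BYTE_ARRAY": {"BYTE_ARRAY"},
    "DELTA_LENGTH_BYTE_ARRAY": {"BYTE_ARRAY"},
}

def encoding_allowed(parquet_type, encoding):
    """Return True if the table above lists the encoding for this primitive type."""
    if encoding not in ENCODING_TYPES:
        raise ValueError(f"unknown encoding: {encoding}")
    allowed = ENCODING_TYPES[encoding]
    return allowed is None or parquet_type in allowed

print(encoding_allowed("INT64", "DELTA_BINARY_PACKED"))
print(encoding_allowed("BYTE_ARRAY", "DELTA_BINARY_PACKED"))
```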
Complete Example
{
  "Tag": "name=customer_record, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
    {"Tag": "name=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY, repetitiontype=REQUIRED"},
    {"Tag": "name=email, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},
    {"Tag": "name=signup_date, type=INT32, convertedtype=DATE, repetitiontype=REQUIRED"},
    {"Tag": "name=active, type=BOOLEAN, repetitiontype=REQUIRED"},
    {
      "Tag": "name=address, repetitiontype=OPTIONAL",
      "Fields": [
        {"Tag": "name=street, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=city, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=state, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=zip, type=INT32, repetitiontype=REQUIRED"}
      ]
    },
    {
      "Tag": "name=purchases, type=LIST, repetitiontype=OPTIONAL",
      "Fields": [
        {
          "Tag": "name=element, repetitiontype=REQUIRED",
          "Fields": [
            {"Tag": "name=product_id, type=INT64, repetitiontype=REQUIRED"},
            {"Tag": "name=price, type=INT32, convertedtype=DECIMAL, scale=2, precision=9, repetitiontype=REQUIRED"},
            {"Tag": "name=quantity, type=INT32, repetitiontype=REQUIRED"}
          ]
        }
      ]
    },
    {
      "Tag": "name=properties, type=MAP, repetitiontype=OPTIONAL",
      "Fields": [
        {"Tag": "name=key, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=value, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
      ]
    },
    {"Tag": "name=tags, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"}
  ]
}
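A schema of this size is easier to review as a flat list of leaf fields. The sketch below walks the JSON nesting and prints a dotted path per leaf, using a shortened excerpt of the example above. The paths mirror the JSON structure only; the column paths in the written Parquet file may contain additional intermediate levels for LIST and MAP fields. `leaf_paths` is a hypothetical helper.

```python
def name_of(node):
    """Extract the name attribute from a node's Tag."""
    for part in node["Tag"].split(","):
        key, _, value = part.strip().partition("=")
        if key == "name":
            return value
    raise ValueError(f"Tag has no name attribute: {node['Tag']!r}")

def leaf_paths(node, prefix=()):
    """Return dotted paths for every field that has no child Fields."""
    path = prefix + (name_of(node),)
    children = node.get("Fields")
    if not children:
        return [".".join(path)]
    paths = []
    for child in children:
        paths.extend(leaf_paths(child, path))
    return paths

# Shortened excerpt of the complete example.
schema = {
    "Tag": "name=customer_record, repetitiontype=REQUIRED",
    "Fields": [
        {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
        {
            "Tag": "name=address, repetitiontype=OPTIONAL",
            "Fields": [
                {"Tag": "name=street, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
                {"Tag": "name=zip, type=INT32, repetitiontype=REQUIRED"},
            ],
        },
        {
            "Tag": "name=purchases, type=LIST, repetitiontype=OPTIONAL",
            "Fields": [
                {
                    "Tag": "name=element, repetitiontype=REQUIRED",
                    "Fields": [
                        {"Tag": "name=product_id, type=INT64, repetitiontype=REQUIRED"},
                    ],
                },
            ],
        },
    ],
}

for path in leaf_paths(schema):
    print(path)
```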
Tips and Best Practices
- Use inname when your input field names differ from how you want them stored in Parquet
- For large fields with many different values, avoid PLAIN_DICTIONARY encoding as it can consume excessive memory
- For large array values where stats aren't useful, use omitstats=true to reduce file size
- Use the appropriate repetition type:
  - REQUIRED for non-nullable fields
  - OPTIONAL for nullable fields
  - REPEATED for array-like fields
- Field names beginning with uppercase and lowercase letters are treated as different fields
- Avoid using PARGO_PREFIX_ as a name prefix as it's reserved
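Two of these tips lend themselves to a quick pre-flight check: listing how input names map to Parquet names through inname, and flagging names that start with the reserved prefix. The sketch below covers top-level fields only, and the helper names and sample fields are illustrative assumptions.

```python
RESERVED_PREFIX = "PARGO_PREFIX_"

def tag_attrs(tag):
    """Parse 'k=v, k=v' into a dict (assumes well-formed Tag strings)."""
    return dict(part.strip().split("=", 1) for part in tag.split(","))

def input_to_output_names(fields):
    """Map each source field name to the name stored in Parquet."""
    mapping = {}
    for definition in fields:
        attrs = tag_attrs(definition["Tag"])
        mapping[attrs.get("inname", attrs["name"])] = attrs["name"]
    return mapping

def reserved_names(fields):
    """Return Parquet field names that start with the reserved prefix."""
    names = (tag_attrs(definition["Tag"])["name"] for definition in fields)
    return [name for name in names if name.startswith(RESERVED_PREFIX)]

fields = [
    {"Tag": "name=customer_id, inname=CustomerID, type=INT64, repetitiontype=REQUIRED"},
    {"Tag": "name=email, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},
    {"Tag": "name=PARGO_PREFIX_note, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},
]
print(input_to_output_names(fields))
print(reserved_names(fields))
```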