
JSON Schema Format for Parquet Output

This document explains how to define a Parquet schema in JSON for any Monad output component that supports the Parquet format.

Overview

The JSON schema format allows you to define complex Parquet schemas including nested structures, lists, maps, and various data types with their respective encodings and compression options.

Basic Structure

A Parquet schema in JSON format consists of a root element and a collection of fields:

{
  "Tag": "name=root_name, repetitiontype=REQUIRED",
  "Fields": [
    // field definitions go here
  ]
}
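As a minimal sketch of the structure above, the following Python builds the same root-plus-fields shape and serializes it to JSON. The helper names `make_tag` and `make_schema` are hypothetical and exist only for illustration; they are not part of Monad.

```python
import json


def make_tag(**attrs):
    """Join attribute pairs into the comma-separated Tag string format."""
    return ", ".join(f"{key}={value}" for key, value in attrs.items())


def make_schema(root_name, fields):
    """Wrap a list of field definitions in a root element."""
    return {
        "Tag": make_tag(name=root_name, repetitiontype="REQUIRED"),
        "Fields": fields,
    }


schema = make_schema(
    "customer_record",
    [{"Tag": make_tag(name="id", type="INT64", repetitiontype="REQUIRED")}],
)
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Generating the Tag strings from keyword arguments keeps the attribute order stable and avoids hand-typed separator mistakes.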

Field Definition

Each field is defined as a JSON object with at least a Tag property. For nested structures, a field may also contain a Fields array:

{
  "Tag": "name=field_name, type=DATA_TYPE, repetitiontype=REPETITION_TYPE, [additional attributes]"
}

Common Tag Attributes

| Attribute | Description | Example |
| --- | --- | --- |
| name | Field name in the Parquet file (required) | name=customer_id |
| inname | Input field name in your source data (optional) | inname=CustomerID |
| type | Parquet data type (required, except for the root element and nested structures) | type=INT64 |
| repetitiontype | Field repetition type (required) | repetitiontype=REQUIRED |
| convertedtype | Logical type for the field (optional) | convertedtype=UTF8 |
| encoding | Encoding method (optional) | encoding=PLAIN_DICTIONARY |
| omitstats | Skip stats for this field (optional) | omitstats=true |
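To make the Tag format concrete, here is a small sketch that splits a Tag string into its attributes. The function `parse_tag` is a hypothetical helper, and the parsing rules (comma-separated `key=value` pairs, surrounding whitespace ignored) are inferred from the examples in this document rather than taken from Monad's own parser.

```python
def parse_tag(tag):
    """Split a Tag string such as 'name=id, type=INT64' into a dict."""
    attrs = {}
    for part in tag.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"attribute without '=': {part!r}")
        attrs[key.strip()] = value.strip()
    return attrs


attrs = parse_tag(
    "name=customer_id, inname=CustomerID, type=INT64, repetitiontype=REQUIRED"
)
print(attrs["name"], attrs["inname"], attrs["type"])
# prints: customer_id CustomerID INT64
```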

Primitive Data Types

| Type | Description | Go Type Mapping |
| --- | --- | --- |
| BOOLEAN | Boolean value | bool |
| INT32 | 32-bit integer | int32 |
| INT64 | 64-bit integer | int64 |
| INT96 | 96-bit integer (deprecated) | string |
| FLOAT | Single precision (32-bit) floating point | float32 |
| DOUBLE | Double precision (64-bit) floating point | float64 |
| BYTE_ARRAY | Array of bytes | string |
| FIXED_LEN_BYTE_ARRAY | Fixed-length byte array | string (requires length attribute) |

Logical Types

Logical types provide additional semantic information for primitive types:

| Logical Type | Primitive Type | Attributes | Example |
| --- | --- | --- | --- |
| UTF8 | BYTE_ARRAY | convertedtype=UTF8 | "Tag": "name=username, type=BYTE_ARRAY, convertedtype=UTF8" |
| INT_8/16/32/64 | INT32/INT64 | convertedtype=INT_8 | "Tag": "name=small_int, type=INT32, convertedtype=INT_8" |
| UINT_8/16/32/64 | INT32/INT64 | convertedtype=UINT_16 | "Tag": "name=unsigned, type=INT32, convertedtype=UINT_16" |
| DATE | INT32 | convertedtype=DATE | "Tag": "name=birth_date, type=INT32, convertedtype=DATE" |
| TIME_MILLIS | INT32 | convertedtype=TIME_MILLIS | "Tag": "name=event_time, type=INT32, convertedtype=TIME_MILLIS" |
| TIME_MICROS | INT64 | convertedtype=TIME_MICROS | "Tag": "name=precise_time, type=INT64, convertedtype=TIME_MICROS" |
| TIMESTAMP_MILLIS | INT64 | convertedtype=TIMESTAMP_MILLIS | "Tag": "name=created_at, type=INT64, convertedtype=TIMESTAMP_MILLIS" |
| TIMESTAMP_MICROS | INT64 | convertedtype=TIMESTAMP_MICROS | "Tag": "name=modified_at, type=INT64, convertedtype=TIMESTAMP_MICROS" |
| DECIMAL | Various | convertedtype=DECIMAL, scale=N, precision=M | "Tag": "name=price, type=INT32, convertedtype=DECIMAL, scale=2, precision=9" |
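Because logical types are stored in integer primitives, it can help to see what the stored values look like. The sketch below follows the Parquet format's definitions: DATE is the number of days since the Unix epoch, TIMESTAMP_MILLIS is the number of milliseconds since the Unix epoch, and DECIMAL is an unscaled integer equal to the value times 10 to the power of scale. The function names are hypothetical, and whether Monad expects source data already in this form or converts it for you is not covered here.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

UNIX_EPOCH_DATE = date(1970, 1, 1)
UNIX_EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_date_days(d):
    """DATE: days since 1970-01-01, stored as INT32."""
    return (d - UNIX_EPOCH_DATE).days


def to_timestamp_millis(dt):
    """TIMESTAMP_MILLIS: milliseconds since the Unix epoch, stored as INT64.

    dt must be timezone-aware.
    """
    return (dt - UNIX_EPOCH_UTC) // timedelta(milliseconds=1)


def to_decimal_unscaled(value, scale):
    """DECIMAL: unscaled integer, i.e. value * 10**scale."""
    scaled = Decimal(value).scaleb(scale)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{value} has more than {scale} fractional digits")
    return int(scaled)


print(to_date_days(date(2024, 1, 1)))  # 19723
print(to_timestamp_millis(datetime(2024, 1, 1, tzinfo=timezone.utc)))  # 1704067200000
print(to_decimal_unscaled("19.99", 2))  # 1999
```

With scale=2 and precision=9, as in the price example, the stored integer 1999 represents 19.99.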

Complex Types

Lists

{
  "Tag": "name=items, type=LIST, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=element, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
  ]
}

Maps

{
  "Tag": "name=properties, type=MAP, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=key, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=value, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
  ]
}
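The list and map definitions above share a pattern: a LIST field wraps a single child named element, and a MAP field wraps a key child and a value child. The following sketch generates both shapes. All helper names (`tag`, `string_field`, `list_field`, `map_field`) are hypothetical and only illustrate the layout shown in the examples.

```python
def tag(**attrs):
    """Build a Tag string from keyword arguments, preserving their order."""
    return ", ".join(f"{key}={value}" for key, value in attrs.items())


def string_field(name, repetition="REQUIRED"):
    """A UTF8 string stored as BYTE_ARRAY."""
    return {
        "Tag": tag(
            name=name,
            type="BYTE_ARRAY",
            convertedtype="UTF8",
            repetitiontype=repetition,
        )
    }


def list_field(name, element, repetition="REQUIRED"):
    """A LIST field with one child describing each element."""
    return {
        "Tag": tag(name=name, type="LIST", repetitiontype=repetition),
        "Fields": [element],
    }


def map_field(name, key, value, repetition="REQUIRED"):
    """A MAP field with a key child followed by a value child."""
    return {
        "Tag": tag(name=name, type="MAP", repetitiontype=repetition),
        "Fields": [key, value],
    }


items = list_field("items", string_field("element"))
properties = map_field("properties", string_field("key"), string_field("value"))
```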

Nested Structures

{
  "Tag": "name=address, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=street, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=city, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
    {"Tag": "name=zip, type=INT32, repetitiontype=REQUIRED"}
  ]
}
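A nested structure is a field with a Fields array and no type of its own. One way to inspect such a schema is to walk it recursively and list each leaf with a dotted path, as in the sketch below. The dotted paths are only a reading aid for the JSON definition; they are not claimed to match the column paths written to the Parquet file. `parse_tag` and `leaf_paths` are hypothetical helpers.

```python
def parse_tag(tag):
    """Split 'k=v, k=v' into a dict (assumes every part contains '=')."""
    return dict(part.strip().split("=", 1) for part in tag.split(","))


def leaf_paths(node, parent=""):
    """Yield (dotted_path, type) for every field without child Fields."""
    attrs = parse_tag(node["Tag"])
    path = f"{parent}.{attrs['name']}" if parent else attrs["name"]
    children = node.get("Fields")
    if children:
        for child in children:
            yield from leaf_paths(child, path)
    else:
        yield path, attrs.get("type")


address = {
    "Tag": "name=address, repetitiontype=REQUIRED",
    "Fields": [
        {"Tag": "name=street, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=city, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=zip, type=INT32, repetitiontype=REQUIRED"},
    ],
}

for path, type_name in leaf_paths(address):
    print(path, type_name)
# prints:
# address.street BYTE_ARRAY
# address.city BYTE_ARRAY
# address.zip INT32
```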

Repeated Fields

{"Tag": "name=phone_numbers, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"}

Repetition Types

| Type | Description |
| --- | --- |
| REQUIRED | The field must be present (non-null) |
| OPTIONAL | The field can be null |
| REPEATED | The field can appear multiple times (similar to an array) |

Encoding Options

| Type | Support | Description |
| --- | --- | --- |
| PLAIN | All types | Default encoding |
| PLAIN_DICTIONARY | All types | Dictionary-based encoding for repeated values |
| DELTA_BINARY_PACKED | Integer types | Efficient for sequences with small deltas |
| DELTA_BYTE_ARRAY | BYTE_ARRAY, UTF8 | Efficient for strings with common prefixes |
| DELTA_LENGTH_BYTE_ARRAY | BYTE_ARRAY, UTF8 | Efficient for strings with similar lengths |
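The support column above can be encoded as a lookup table so that a schema can be checked before it is deployed. In this sketch, "Integer types" is assumed to mean INT32 and INT64, and UTF8 is treated as BYTE_ARRAY with a logical type, so only the primitive type is checked. `encoding_allowed` is a hypothetical helper, not a Monad function.

```python
# None means the encoding is listed as supporting all types.
ENCODING_SUPPORT = {
    "PLAIN": None,
    "PLAIN_DICTIONARY": None,
    "DELTA_BINARY_PACKED": {"INT32", "INT64"},  # assumption: "Integer types"
    "DELTA_BYTE_ARRAY": {"BYTE_ARRAY"},
    "DELTA_LENGTH_BYTE_ARRAY": {"BYTE_ARRAY"},
}


def encoding_allowed(encoding, primitive_type):
    """Return True if the table lists the encoding as valid for the type."""
    if encoding not in ENCODING_SUPPORT:
        raise ValueError(f"unknown encoding: {encoding}")
    allowed = ENCODING_SUPPORT[encoding]
    return allowed is None or primitive_type in allowed


print(encoding_allowed("DELTA_BINARY_PACKED", "INT64"))  # True
print(encoding_allowed("DELTA_BYTE_ARRAY", "INT32"))     # False
```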

Complete Example

{
  "Tag": "name=customer_record, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
    {"Tag": "name=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY, repetitiontype=REQUIRED"},
    {"Tag": "name=email, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"},
    {"Tag": "name=signup_date, type=INT32, convertedtype=DATE, repetitiontype=REQUIRED"},
    {"Tag": "name=active, type=BOOLEAN, repetitiontype=REQUIRED"},
    {
      "Tag": "name=address, repetitiontype=OPTIONAL",
      "Fields": [
        {"Tag": "name=street, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=city, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=state, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=zip, type=INT32, repetitiontype=REQUIRED"}
      ]
    },
    {
      "Tag": "name=purchases, type=LIST, repetitiontype=OPTIONAL",
      "Fields": [
        {
          "Tag": "name=element, repetitiontype=REQUIRED",
          "Fields": [
            {"Tag": "name=product_id, type=INT64, repetitiontype=REQUIRED"},
            {"Tag": "name=price, type=INT32, convertedtype=DECIMAL, scale=2, precision=9, repetitiontype=REQUIRED"},
            {"Tag": "name=quantity, type=INT32, repetitiontype=REQUIRED"}
          ]
        }
      ]
    },
    {
      "Tag": "name=properties, type=MAP, repetitiontype=OPTIONAL",
      "Fields": [
        {"Tag": "name=key, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=value, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
      ]
    },
    {"Tag": "name=tags, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REPEATED"}
  ]
}
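A schema of this size is easy to get subtly wrong, so a quick structural check before deployment can save a failed run. The sketch below applies rules inferred from this document's tables and examples: every field needs a name and a known repetitiontype, a field without Fields needs a type, a LIST has one child, and a MAP has two. These rules are assumptions and may be stricter or looser than what Monad enforces; `parse_tag` and `validate` are hypothetical helpers.

```python
import json

REPETITION_TYPES = {"REQUIRED", "OPTIONAL", "REPEATED"}
CHILD_COUNTS = {"LIST": 1, "MAP": 2}  # inferred from the examples


def parse_tag(tag):
    """Split 'k=v, k=v' into a dict (assumes every part contains '=')."""
    return dict(part.strip().split("=", 1) for part in tag.split(","))


def validate(node, parent=""):
    """Return a list of problems found in node and its descendants."""
    attrs = parse_tag(node["Tag"])
    name = attrs.get("name", "<unnamed>")
    path = f"{parent}.{name}" if parent else name
    errors = []
    if "name" not in attrs:
        errors.append(f"{path}: missing name")
    if attrs.get("repetitiontype") not in REPETITION_TYPES:
        errors.append(f"{path}: missing or unknown repetitiontype")
    children = node.get("Fields")
    if children is None and "type" not in attrs:
        errors.append(f"{path}: leaf field has no type")
    expected = CHILD_COUNTS.get(attrs.get("type"))
    if expected is not None and len(children or []) != expected:
        errors.append(f"{path}: {attrs['type']} expects {expected} child field(s)")
    for child in children or []:
        errors.extend(validate(child, path))
    return errors


good = json.loads("""
{
  "Tag": "name=customer_record, repetitiontype=REQUIRED",
  "Fields": [
    {"Tag": "name=id, type=INT64, repetitiontype=REQUIRED"},
    {
      "Tag": "name=properties, type=MAP, repetitiontype=OPTIONAL",
      "Fields": [
        {"Tag": "name=key, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
        {"Tag": "name=value, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}
      ]
    }
  ]
}
""")
bad = {
    "Tag": "name=root, repetitiontype=REQUIRED",
    "Fields": [{"Tag": "name=id, repetitiontype=MANDATORY"}],
}

print(validate(good))  # []
print(validate(bad))
```

The second call reports two problems for root.id: the unknown repetition type and the missing type.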

Tips and Best Practices

  1. Use inname when your input field names differ from how you want them stored in Parquet
  2. For large fields with many distinct values, avoid PLAIN_DICTIONARY encoding, as the dictionary can consume excessive memory
  3. For large array values where statistics aren't useful, set omitstats=true to reduce file size
  4. Use the appropriate repetition type:
    • REQUIRED for non-nullable fields
    • OPTIONAL for nullable fields
    • REPEATED for array-like fields
  5. A field name that begins with an uppercase letter and one that begins with a lowercase letter are treated as different fields (for example, Name and name)
  6. Avoid using PARGO_PREFIX_ as a field-name prefix; it is reserved
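The last two tips can be checked mechanically. The sketch below flags field names that use the reserved prefix and pairs of names that differ only in the case of their first letter, which are legal but easy to confuse. `lint_names` is a hypothetical helper written for this illustration.

```python
RESERVED_PREFIX = "PARGO_PREFIX_"


def lint_names(names):
    """Return warnings for reserved prefixes and first-letter case clashes."""
    warnings = []
    seen = {}  # name with lowercased first letter -> first name seen
    for name in names:
        if name.startswith(RESERVED_PREFIX):
            warnings.append(f"{name}: uses reserved prefix {RESERVED_PREFIX}")
        key = name[:1].lower() + name[1:]
        if key in seen and seen[key] != name:
            warnings.append(
                f"{name}: differs from {seen[key]} only in first-letter case"
            )
        seen.setdefault(key, name)
    return warnings


for warning in lint_names(["Name", "name", "PARGO_PREFIX_id", "city"]):
    print(warning)
# prints:
# name: differs from Name only in first-letter case
# PARGO_PREFIX_id: uses reserved prefix PARGO_PREFIX_
```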