Batch Processing Glossary
Processing guarantee that records may be processed more than once but are not silently lost, requiring idempotent handling where necessary.
Processing guarantee that records are never processed more than once, but some records may be lost on failure.
Running batch jobs to process historical data that was previously missed.
Group of servers used to process batch workloads in parallel.
Techniques used to reduce infrastructure costs for batch workloads.
Measure of how up to date batch output is relative to the newest source data expected to be included.
Frequency at which batch jobs are triggered, such as hourly or daily.
Automated task that processes a dataset without requiring manual interaction.
Delay between when data becomes available and when batch processing occurs.
Automated system that processes data in batches through multiple processing stages.
Data processing approach where large volumes of data are collected and processed together at scheduled intervals rather than in real time.
Infrastructure design supporting batch pipelines and compute clusters.
Software platform designed to execute large-scale batch workloads.
Software platform designed to manage large-scale batch workloads.
Ordered list of batch jobs waiting to be executed by a processing system.
Assigning compute, memory, and storage resources to batch workloads.
Amount of data processed in a single batch operation.
Service-level agreement defining performance expectations for batch jobs.
Amount of data processed by batch systems over a given time period.
Scheduled time period during which batch jobs are executed.
Sequence of batch jobs executed in a predefined order to complete a larger data processing task.
Saving intermediate job states to allow recovery after failures.
Component responsible for managing resources across a distributed compute cluster.
Storage format optimized for analytical queries and batch processing workloads.
Individual server within a distributed batch processing cluster.
Longest dependency path in a workflow that determines the minimum total completion time of the batch pipeline.
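As a rough illustration of the longest-dependency-path idea in the entry above, the sketch below computes the earliest finish time of each job in a small workflow and walks back along the slowest chain. The job names, durations, and the `critical_path` helper are hypothetical and are not taken from any particular scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (total_time, path) for the longest dependency chain.

    durations: job name -> runtime, in any consistent unit
    deps:      job name -> set of jobs that must finish first
    """
    graph = {job: set(deps.get(job, ())) for job in durations}
    finish = {}  # earliest possible finish time of each job
    prev = {}    # predecessor on the slowest chain leading into each job

    # static_order() yields every job after all of its prerequisites.
    for job in TopologicalSorter(graph).static_order():
        slowest = max(graph[job], key=lambda d: finish[d], default=None)
        start = finish[slowest] if slowest is not None else 0
        finish[job] = start + durations[job]
        prev[job] = slowest

    # Walk back from the job that finishes last.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]


# Hypothetical four-job workflow (durations in minutes).
durations = {"extract": 10, "clean": 5, "enrich": 20, "load": 3}
deps = {"clean": {"extract"}, "enrich": {"extract"}, "load": {"clean", "enrich"}}

total, path = critical_path(durations, deps)
print(total, path)  # 33 ['extract', 'enrich', 'load']
```

In this example, speeding up `clean` would not shorten the pipeline, because it does not lie on the critical path.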
Combining multiple records to produce summary insights.
Accumulation of data waiting to be processed by batch jobs.
Group of records processed together as part of a batch job.
Identifying and correcting inaccurate or inconsistent data.
Merging smaller files into larger ones to improve storage efficiency.
Ensuring processed data remains reliable and synchronized across systems.
Enhancing datasets with additional information from other sources.
Policies ensuring security, compliance, and quality of data within processing pipelines.
Process of importing data into systems for processing.
Guarantee that data remains complete and accurate during processing.
Centralized repository storing raw data used for batch analytics.
Tracking how data moves and transforms across batch pipelines.
Placement of computation close to the data it needs, reducing network transfer and improving batch performance.
Dividing large datasets into smaller segments for parallel processing.
Automated system that moves and processes data across multiple stages.
End-to-end system used to process, transform, and deliver data.
Splitting data across multiple storage nodes for scalability.
Uneven distribution of data causing processing imbalance across nodes.
Captured state of a dataset at a specific point in time.
Converting raw data into a structured format suitable for analysis.
Analytical database storing structured data processed via batch pipelines.
Storage location for records that failed processing.
Running batch jobs across multiple machines to scale processing capacity.
Coordinator process that schedules tasks and manages distributed job execution.
Automatically scaling infrastructure resources to match batch workload demand.
Data pipeline approach where raw data is loaded first and transformed afterward.
Process of extracting data from sources, transforming it, and loading it into a target system.
Processing guarantee that each input record affects the final result only once, even in the presence of retries or failures.
Worker-side runtime process that executes tasks in distributed frameworks such as Spark.
Resource allocation strategy ensuring equitable distribution of compute resources among jobs.
Ability of batch systems to recover from failures during execution.
Reprocessing the entire dataset during each batch run.
Batch pattern in which an output table or dataset is rebuilt from scratch rather than incrementally updated.
Processing large volumes of data efficiently across distributed infrastructure.
Architecture combining batch and streaming data processing techniques.
Batch job that produces the same result even when executed multiple times.
Processing only new or changed data rather than reprocessing entire datasets.
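One common way to process only new or changed data is to track a high-water mark between runs. The sketch below is a minimal illustration that assumes each record carries a monotonically increasing `updated_at` value. The field name and the `incremental_run` helper are hypothetical.

```python
def incremental_run(records, last_watermark):
    """Select only records changed since the previous run.

    records:        iterable of dicts with an 'updated_at' field (assumed)
    last_watermark: highest 'updated_at' value seen by earlier runs
    Returns the records to process and the watermark to store for next time.
    """
    changed = [r for r in records if r["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark


# Hypothetical source table with integer change timestamps.
source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]

batch, watermark = incremental_run(source, last_watermark=200)
print([r["id"] for r in batch], watermark)  # [2, 3] 310
```

A strict `>` comparison can miss records that share the watermark timestamp but arrive later, so real pipelines often add an overlap window and deduplicate.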
Relationship where one batch job must complete before another job begins.
Detailed plan describing how tasks in a batch job are distributed across compute resources.
Recording execution details and results of batch tasks.
Performance indicators used to measure batch job efficiency.
Observing execution status and performance of batch jobs.
Coordinating execution of multiple batch jobs and workflows across systems.
Executing multiple batch tasks simultaneously to improve processing speed.
Automatic re-execution of failed batch jobs.
System responsible for triggering and managing batch job execution at scheduled times.
Slow-running task that delays completion of a distributed batch job.
Simplified data architecture that processes data streams continuously without separate batch pipelines.
Data architecture combining batch processing and real-time stream processing.
Source data that arrives after the expected processing window and may require backfill, correction, or watermark logic.
Batch workload that must complete within strict time limits.
Processing small batches of data at short intervals to approximate real-time processing.
Processing model that delivers results shortly after data arrival using frequent small batches.
Columnar storage format optimized for distributed data processing systems.
Mechanism that ensures distributed task outputs are written atomically and consistently, preventing partial or duplicate final results.
Columnar file format commonly used in batch analytics pipelines.
Controlled change to partitioning strategy over time as data volume, access patterns, or business logic change.
Query optimization technique that reads only relevant data partitions.
Situation where downstream systems cannot process incoming data fast enough.
System responsible for managing complex batch pipelines and dependencies.
Scheduling policy where high-priority jobs receive resources first.
Time a batch job spends waiting for resources or scheduler admission before execution begins.
Last safe checkpoint, committed output boundary, or processed offset from which a batch workflow can resume after failure.
Running batch pipelines again to correct errors or incorporate updated logic.
System responsible for allocating compute resources to batch jobs.
Stage in distributed processing where data is redistributed between nodes.
Condition where intermediate shuffle data exceeds memory and is written to disk, often increasing runtime substantially.
Techniques used to reduce imbalance caused by uneven data distribution, such as repartitioning, salting, or adaptive execution.
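Of the techniques named in the entry above, salting is the easiest to show in isolation. The sketch below simulates it in plain Python: a random salt spreads a hot key over several sub-keys for a first partial aggregation, and a second pass drops the salt and combines the partial results. The `salted_sum` helper and the sample data are hypothetical, and a real framework would run the two stages as separate shuffles.

```python
import random
from collections import defaultdict


def salted_sum(pairs, num_salts=4, seed=0):
    """Sum values per key in two stages, using a salt to split hot keys."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt) so one hot key becomes
    # up to `num_salts` smaller groups that can be processed in parallel.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[(key, rng.randrange(num_salts))] += value

    # Stage 2: drop the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal

    return dict(totals), dict(partial)


# Hypothetical skewed input: one key dominates.
pairs = [("hot", 1)] * 1000 + [("cold", 2)] * 10
totals, partial = salted_sum(pairs)
print(totals)  # {'hot': 1000, 'cold': 20}
```

The final totals match an unsalted aggregation. Only the intermediate grouping changes, which is what spreads the work.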
Performance and metadata overhead caused when batch pipelines create a very large number of small output files instead of fewer larger files.
Technique where duplicate copies of slow-running tasks are launched on other workers and the result of whichever copy finishes first is used.
Running batch workloads on temporary low-cost compute resources.
Logical phase of a distributed batch job consisting of tasks that can run in parallel before a shuffle or dependency boundary.
Continuous processing of incoming data streams rather than scheduled batches.
Condition where a batch workload fails to meet its defined completion deadline, freshness target, or success criteria.
Single execution attempt of a task; each retry after failure and each speculative re-execution counts as a separate attempt.
Individual unit of work performed within a batch job.
Software component that automates the execution timing of tasks or jobs.
Machine responsible for executing tasks assigned by the processing framework.
Graph structure used to represent dependencies between tasks in a batch workflow.
Dividing workloads into balanced tasks for parallel execution.
Process of determining when and where batch jobs run in a cluster.
Property that ensures repeated write attempts produce the same final output state without duplication or corruption.
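One way to get the repeated-write property described in the entry above is to derive the output path from the batch identifier and publish the file with an atomic rename. The sketch below is a minimal local-filesystem illustration. The `idempotent_write` helper, the `batch=<id>.json` naming, and the sample rows are hypothetical, and object stores generally need a different commit mechanism.

```python
import json
import os
import tempfile


def idempotent_write(out_dir, batch_id, rows):
    """Write one batch so that retries leave exactly one complete output file."""
    os.makedirs(out_dir, exist_ok=True)
    # Deterministic name: a retry of the same batch targets the same path.
    final_path = os.path.join(out_dir, f"batch={batch_id}.json")

    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f, sort_keys=True)
        # os.replace overwrites the target and is atomic when source and
        # destination are on the same filesystem, so readers never see
        # a half-written file.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


# Simulate a retry: the same batch is written twice.
out_dir = tempfile.mkdtemp()
rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 5}]
first = idempotent_write(out_dir, "2024-01-01", rows)
second = idempotent_write(out_dir, "2024-01-01", rows)
print(os.listdir(out_dir))  # ['batch=2024-01-01.json']
```

Because the second attempt replaces the first instead of appending to it, a retried job neither duplicates rows nor leaves partial output behind.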