Configuring Dataflows

Introduction
Definitions
Worker Breakdown
Dataflow Configuration
Configuring Correctly
Examples

#Introduction

Scribengin works by splitting data up into concurrent streams and dividing the work up between workers and executors.

Refer to terminology for help on understanding terminology

#Worker Breakdown

#Dataflow Configuration

We'll declare our dataflow like so, and then show how to configure it

Dataflow<Message, Message> dfl = new Dataflow<>(dataflowId);

##Workers

The number of workers will be the number of containers we request from the VM Master

dfl.getWorkerDescriptor().setNumOfInstances(numOfWorker);

##Executors

The number of "threads" per worker

dfl.getWorkerDescriptor().setNumOfExecutor(numOfExecutorPerWorker);

##Default Parallelism

The default number of data streams to deploy per operator.

dfl.setDefaultParallelism(defaultParallelism);

##Default Replication

The default replication to use for Kafka, HDFS, or any other dataStore that supports replication. Has no impact on dataSources that do not support configurable replication.

dfl.setDefaultReplication(defaultReplication).

#Configuring Correctly

When configuring your dataflow, you must have some prior knowledge about what you're setting up. You'll need to know the number of streams you'll be expecting, and then set up the number of workers and executors to reflect that.

The following calculation is a good starting point:

(number of task slots) = 2

(number of workers) * (number of executors) * (number of task slots) >= (num of streams)

(num of streams)  = (parallelism) * (number of operators)

#Examples

##The Code

Let's consider the following code:

Dataflow<Message,Message> dfl = new Dataflow<Message,Message>(dataflowID);
    
    //Example of how to set the KafkaWireDataSetFactory
    dfl.
      setDefaultParallelism(2).
      setDefaultReplication(1);
    
    dfl.getWorkerDescriptor().setNumOfInstances(2));
    dfl.getWorkerDescriptor().setNumOfExecutor(3);

    //Define our input source - set name, ZK host:port, and input topic name
    KafkaDataSet<Message> inputDs = 
        dfl.createInput(new KafkaStorageConfig("input", kafkaZkConnect, inputTopic));
    
    //Define our output sink - set name, ZK host:port, and output topic name
    DataSet<Message> outputDs = 
        dfl.createOutput(new KafkaStorageConfig("output", kafkaZkConnect, outputTopic));
    
    //Define which operators to use.  
    //This will be the logic that ties the datasets and operators together
    Operator<Message, Message> splitter = dfl.createOperator("splitteroperator", SplitterDataStreamOperator.class);
    Operator<Message, Message> odd      = dfl.createOperator("oddoperator", PersisterDataStreamOperator.class);
    Operator<Message, Message> even     = dfl.createOperator("evenoperator", PersisterDataStreamOperator.class);

    //Send all input to the splitter operator
    inputDs.useRawReader().connect(splitter);
    
    //The splitter operator then connects to the odd and even operators
    splitter.connect(odd)
            .connect(even);
    
    //Both the odd and even operator connect to the output dataset
    // This is arbitrary, we could connect them to any dataset or operator we wanted
    odd.connect(outputDs);
    even.connect(outputDs);

##Figuring Out The Configuration

Let's assume Kafka's default number of partition (num.partitions) is set to the default, 1. The input topic is written to with this default number of partitions.

Since we're setting defaultParallelism to "2", we know we'll have two partitions by default for every operator we control.

The number of streams will be:

1 (partition) x 1 (input operator) + 
2 (partition) x 1 (output operator) + 
2 (partition) x 3 (splitter, odd, and even operator) = 8 streams

The number of available operator slots will be:

2 (workers) x 3 (executors per worker) x 2 (number of task slots) = 12

Therefore, this will be a good configuration with room to spare

12 (number of task slots) >= 8 (streams)

##Kafka Complications

What happens if Kafka's default number of partitions is set to something other than 1? Let's assume for this example that the number of default partitions is set to 5, and that data written into Kafka was written with this default.

In this case, the number of task slots will be off!

The number of streams will be:

5 (partition) x 1 (input operator) + 
2 (partition) x 1 (output operator) + 
2 (partition) x 3 (splitter, odd, and even operator) = 13 streams

The number of available operator slots will be insufficient! :

2 (workers) x 3 (executors per worker) x 2 (number of task slots) = 12

Therefore, this will be a bad configuration.

12 (number of task slots) < 13 (streams)

We would fix this by either increasing the number of Workers or the number of Executors.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configuring Dataflows

FilesExpand file tree

configuringDataflows.md

Latest commit

History

configuringDataflows.md

File metadata and controls

Configuring Dataflows