返回介绍

物理分区

发布于 2025-05-02 18:19:14 字数 4236 浏览 0 评论 0 收藏

Flink 还通过以下函数对转换后的精确流分区进行低级控制(如果需要)。


转换: 自定义分区 DataStream→DataStream

描述:使用用户定义的分区程序为每个数据元选择目标任务。

dataStream.partitionCustom(partitioner, "someKey");
dataStream.partitionCustom(partitioner, 0);

转换: 随机分区 DataStream→DataStream

描述:根据均匀分布随机分配数据元。

dataStream.shuffle();

转换: Rebalance (循环分区) DataStream→DataStream

描述:分区数据元循环,每个分区创建相等的负载。在存在数据偏斜时用于性能优化。

dataStream.rebalance();

转换: 重新调整 DataStream→DataStream

描述:分区数据元,循环,到下游 算子操作的子集。如果您希望拥有管道,例如,从源的每个并行实例扇出到多个映射器的子集以分配负载但又不希望发生 rebalance()会产生完全 Rebalance ,那么这非常有用。这将仅需要本地数据传输而不是通过网络传输数据,具体取决于其他配置值,例如 TaskManagers 的插槽数。上游 算子操作发送数据元的下游 算子操作的子集取决于上游和下游 算子操作的并行度。例如,如果上游 算子操作具有并行性 2 并且下游 算子操作具有并行性 6,则一个上游 算子操作将分配元件到三个下游 算子操作,而另一个上游 算子操作将分配到其他三个下游 算子操作。另一方面,如果下游 算子操作具有并行性 2 而上游 算子操作具有并行性 6,则三个上游 算子操作将分配到一个下游 算子操作,而其他三个上游 算子操作将分配到另一个下游 算子操作。在不同并行度不是彼此的倍数的情况下,一个或多个下游 算子操作将具有来自上游 算子操作的不同数量的输入。请参阅此图以获取上例中连接模式的可视化:数据流中的检查点障碍

dataStream.rescale();

转换: 广播 DataStream→DataStream

描述:向每个分区广播数据元。

dataStream.broadcast();

转换: Custom partitioning DataStream → DataStream

描述:Uses a user-defined Partitioner to select the target task for each element.

dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)

转换: Random partitioning DataStream → DataStream

描述:Partitions elements randomly according to a uniform distribution.

dataStream.shuffle()

转换: Rebalancing (Round-robin partitioning) DataStream → DataStream

描述:Partitions elements round-robin, creating equal load per partition. Useful for performance optimization in the presence of data skew.

dataStream.rebalance()

转换: Rescaling DataStream → DataStream

描述:Partitions elements, round-robin, to a subset of downstream operations. This is useful if you want to have pipelines where you, for example, fan out from each parallel instance of a source to a subset of several mappers to distribute load but don't want the full rebalance that rebalance() would incur. This would require only local data transfers instead of transferring data over network, depending on other configuration values such as the number of slots of TaskManagers.The subset of downstream operations to which the upstream operation sends elements depends on the degree of parallelism of both the upstream and downstream operation. For example, if the upstream operation has parallelism 2 and the downstream operation has parallelism 4, then one upstream operation would distribute elements to two downstream operations while the other upstream operation would distribute to the other two downstream operations. If, on the other hand, the downstream operation has parallelism 2 while the upstream operation has parallelism 4 then two upstream operations would distribute to one downstream operation while the other two upstream operations would distribute to the other downstream operations.In cases where the different parallelisms are not multiples of each other one or several downstream operations will have a differing number of inputs from upstream operations.</p> Please see this figure for a visualization of the connection pattern in the above example: </p>Checkpoint barriers in data streams

dataStream.rescale()

转换: Broadcasting DataStream → DataStream

描述:Broadcasts elements to every partition.

dataStream.broadcast()

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。
列表为空,暂无数据
    我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。