[FEA]: Optimize segmented TopK with clusters for sm100f

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Area

CUB

### Is your feature request related to a problem? Please describe.

In the worst case, our device-level TopK implementations can read each input element from global memory `~sizeof(T)` times. For device-level TopK, this happens because we use a separate kernel to ensure that the global histogram has been updated by every CTA. For segmented TopK, similar behavior will appear once we start supporting large segment sizes.
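To make the `~sizeof(T)` figure concrete, here is a minimal host-side C++ model of a multi-pass radix select. It is only a sketch and not CUB code: the 8-bit digit width, the `radix_select_model` function, and the `RadixSelectResult` struct are assumptions made for illustration. Each loop iteration stands in for one histogram kernel that re-reads the whole input.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of the model: the k-th largest key and the number of element reads.
struct RadixSelectResult {
  std::uint32_t kth_largest;
  std::size_t global_reads;
};

// Hypothetical model (not CUB code). Requires 1 <= k <= in.size().
inline RadixSelectResult radix_select_model(const std::vector<std::uint32_t>& in,
                                            std::size_t k) {
  assert(k >= 1 && k <= in.size());
  constexpr int bits_per_pass = 8;               // assumed digit width
  constexpr int num_passes = 32 / bits_per_pass; // 4 passes for a 32-bit key

  std::uint32_t prefix = 0;      // digits of the answer selected so far
  std::uint32_t prefix_mask = 0; // which bits of `prefix` are already fixed
  std::size_t remaining = k;     // rank of the answer among current candidates
  std::size_t reads = 0;

  for (int pass = 0; pass < num_passes; ++pass) {
    const int shift = 32 - bits_per_pass * (pass + 1);
    std::array<std::size_t, 256> hist{};

    // One "kernel": every input element is read again from global memory,
    // and only keys matching the selected prefix contribute to the histogram.
    for (std::uint32_t key : in) {
      ++reads;
      if ((key & prefix_mask) == prefix) {
        ++hist[(key >> shift) & 0xFFu];
      }
    }

    // Walk digits from largest to smallest to find the bucket holding rank k.
    for (int d = 255; d >= 0; --d) {
      if (hist[d] >= remaining) {
        prefix |= static_cast<std::uint32_t>(d) << shift;
        break;
      }
      remaining -= hist[d];
    }
    prefix_mask |= 0xFFu << shift;
  }
  return {prefix, reads};
}
```

Under these assumptions, a 32-bit key takes four passes, so the model reports four reads per input element, which matches `sizeof(T)` for `T = std::uint32_t`.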

From first principles, clusters let us synchronize CTAs without launching extra kernels while keeping candidate items in shared memory, which avoids global memory traffic on SM90+. A naive [implementation](https://github.com/gevtushenko/cccl/blob/cluster-topk-poc/cub/cub/agent/agent_batched_topk_cluster.cuh) that does N cluster-level histogram passes shows a ~30% speedup compared to device-level TopK. There are other [attempts](https://github.com/flashinfer-ai/flashinfer/pull/2814) that show a ~1.4x (up to 4x) speedup on larger segment sizes.
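The memory-traffic argument can be sketched with a host-side C++ model. This is not the linked proof of concept and not CUDA code: `cluster_select_model`, `ClusterSelectResult`, and the 8-bit digit width are hypothetical, and the model assumes the whole segment fits in the cluster's combined shared memory. A plain `std::vector` stands in for that shared memory, and each loop iteration stands in for a histogram pass separated by a cluster-wide barrier rather than a kernel launch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ClusterSelectResult {
  std::uint32_t kth_largest;
  std::size_t global_reads;  // reads from the modelled global memory
  std::size_t on_chip_reads; // reads from the modelled cluster shared memory
};

// Hypothetical model (not CUB code). Requires 1 <= k <= segment.size().
inline ClusterSelectResult cluster_select_model(
    const std::vector<std::uint32_t>& segment, std::size_t k) {
  assert(k >= 1 && k <= segment.size());
  constexpr int bits_per_pass = 8;               // assumed digit width
  constexpr int num_passes = 32 / bits_per_pass;

  std::size_t global_reads = 0;
  std::size_t on_chip_reads = 0;

  // Stage the segment once: this is the only global memory traffic.
  std::vector<std::uint32_t> candidates;
  candidates.reserve(segment.size());
  for (std::uint32_t key : segment) {
    ++global_reads;
    candidates.push_back(key);
  }

  std::uint32_t prefix = 0;
  std::size_t remaining = k; // rank of the answer among current candidates

  for (int pass = 0; pass < num_passes; ++pass) {
    const int shift = 32 - bits_per_pass * (pass + 1);

    // Histogram pass over on-chip candidates only.
    std::array<std::size_t, 256> hist{};
    for (std::uint32_t key : candidates) {
      ++on_chip_reads;
      ++hist[(key >> shift) & 0xFFu];
    }

    // Pick the digit bucket that contains rank `remaining`.
    std::uint32_t digit = 0;
    for (int d = 255; d >= 0; --d) {
      if (hist[d] >= remaining) {
        digit = static_cast<std::uint32_t>(d);
        break;
      }
      remaining -= hist[d];
    }
    prefix |= digit << shift;

    // Compact: keep only candidates in the selected bucket, still on chip.
    std::vector<std::uint32_t> next;
    for (std::uint32_t key : candidates) {
      ++on_chip_reads;
      if (((key >> shift) & 0xFFu) == digit) {
        next.push_back(key);
      }
    }
    candidates.swap(next);
  }
  return {prefix, global_reads, on_chip_reads};
}
```

In this model, each element is read from global memory exactly once regardless of the number of passes. The repeated reads move to the on-chip buffer instead. Whether real hardware shows the same benefit depends on segment size and shared memory capacity, which the model does not capture.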

### Describe the solution you'd like

We should attempt to optimize segmented TopK with SM90+ clusters.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_

[FEA]: Optimize segmented TopK with clusters for sm100f #9075

Is this a duplicate?

Area

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[FEA]: Optimize segmented TopK with clusters for sm100f #9075

Description

Is this a duplicate?

Area

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions