Advancements in Large Language Models (LLMs) have enabled developers to build a variety of applications such as code generation, translation, and text summarization. The effectiveness of these models depends heavily on the quality of their training data. Data gathered from public sources often contains duplicates, personally identifiable information (PII), and low-quality content, so it must be processed and filtered before downstream tasks such as training or fine-tuning LLMs. Synthetic data can also be generated to train LLMs or to evaluate Retrieval-Augmented Generation (RAG) applications. In this talk, we'll show how to use NeMo Curator and functionality from the NVIDIA RAPIDS libraries to accelerate these data processing tasks. We'll walk through examples in Python Jupyter notebooks and discuss best practices for large-scale data curation pipelines.
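To give a flavor of one curation step mentioned above, here is a minimal, illustrative sketch of exact deduplication by content hashing. This is plain Python, not NeMo Curator's or RAPIDS' actual API; at scale, libraries like NeMo Curator apply the same idea (plus fuzzy matching) with GPU acceleration.

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each unique document.

    Documents are matched by an MD5 digest of their normalized text
    (lowercased, whitespace-stripped) -- a simple stand-in for the
    hash-based exact deduplication used in large-scale pipelines.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.md5(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Hello world.", "hello world.", "A different document."]
print(exact_dedup(corpus))  # near-identical copies collapse to one entry
```

Hashing each document once keeps the pass linear in corpus size, which is why exact deduplication is typically the cheapest first filter before more expensive steps like PII detection or quality scoring.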