Don’t Let Your Messy Documents Run You RAG-Ged. Announcing Document Curation in Cleanlab Studio

June 7, 2024
  • Emily Barry

What’s one of the biggest obstacles to standing up a RAG system? A document collection that’s uncurated and riddled with problems. Duplicate information, personally identifiable information, label problems, toxic language, biased/informal language – the list goes on.

That’s why we’re delighted to announce Document Support: instantly unify large amounts of disparate, unstructured documents into usable, auto-curated datasets with just a few clicks. From the same screen, you can deploy a robust and trustworthy model with confidence, without needing to build bespoke pipelines to manage pre-processing.

As a team of Data Scientists ourselves, we know that RAGs are all the rage in 2024, but going from sandbox to production on a reasonable timeline is the hard part. Our team has focused its efforts on developing cutting-edge AI that not only auto-detects and resolves issues across all major datatypes, but can now help you label heterogeneous data too. You can use our no-code data interface to quickly curate a useful document collection out of your existing materials, without countless hours of manual work. Cleanlab Studio now directly supports document collections composed of files of the following types: doc, docx, pdf, ppt, pptx, csv, xls, xlsx - all in the same dataset.
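If you want a quick sanity check before uploading a folder, here is a minimal sketch that sorts file names by whether their extension appears in the list of supported types above. It runs locally on plain file names; `split_by_support` is a hypothetical helper written for this post, not part of any Cleanlab API, and treating everything else as "needs conversion" is our assumption.

```python
from pathlib import Path

# File types this post lists as supported in Cleanlab Studio document collections.
SUPPORTED_EXTENSIONS = {"doc", "docx", "pdf", "ppt", "pptx", "csv", "xls", "xlsx"}


def split_by_support(paths):
    """Split file paths into (supported, unsupported) names by extension.

    Hypothetical helper for illustration only; it does not call Cleanlab Studio.
    """
    supported, unsupported = [], []
    for p in map(Path, paths):
        # Compare case-insensitively and without the leading dot.
        ext = p.suffix.lower().lstrip(".")
        (supported if ext in SUPPORTED_EXTENSIONS else unsupported).append(p.name)
    return supported, unsupported


if __name__ == "__main__":
    ok, skipped = split_by_support(["report.PDF", "slides.pptx", "notes.txt", "data.xlsx"])
    print("ready to upload:", ok)
    print("needs conversion:", skipped)
```

Anything that lands in the second list would need converting to one of the supported types before it can join the same dataset.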

RAG applications aren’t the only document use case where Cleanlab Studio adds value. Check out our solutions pages to learn more about how industry- and application-agnostic Cleanlab Studio can be. Cleanlab Studio is the one platform a team needs to instantly curate data for a broad range of tasks and deploy better models faster.

Here are some examples to get you started:

You don’t need a Doc-tor in the house. You just need Cleanlab. Sign up for a free trial.
