Show HN: I Made an Open Source Platform for Structuring Any Unstructured Data

88 points by adithya-s-k a year ago

Hey HN,

I'm Adithya, a 20-year-old dev from India. I have been working with GenAI for the past year, and I've found it really painful to deal with the many different forms of data out there and get the best representation of it for my AI applications.

That's why I built OmniParse—an open-source platform designed to handle any unstructured data and transform it into optimized, structured representations.

Key Features:
- Completely local processing, no external APIs
- Supports ~20 file types
- Converts documents, multimedia, and web pages to high-quality structured markdown
- Table extraction, image extraction/captioning, audio/video transcription, web page crawling
- Fits in a T4 GPU
- Easily deployable with Docker and Skypilot
- Colab friendly with an interactive UI powered by Gradio

Why OmniParse? I wanted a platform that could take any kind of data—documents, images, videos, audio files, web pages, and more—and make it clean and structured, ready for AI applications.

Check it out on GitHub: https://git.new/omniparse

bpev a year ago

I'm not sure that I understand what we're parsing to. Like on the website, I see supported types, but that looks like the parsable types, no? What kind of structured representation is outputted? And can we guide what that structure looks like?

adithya-s-k a year ago

Yes, the current implementation of the repository converts any data primarily into structured markdown text.
The next stage will involve prompt guides or schema-guided structure extraction.
Let's say you are processing a lot of research PDFs and want to convert them into clean markdown that best represents the content. Now, let's say you want to extract the authors, abstracts, captions, and store images.
The extraction engine we are currently working on will help you with that.
- xigoi a year ago
  
  “structured Markdown” sounds like an oxymoron.

itishappy a year ago

I haven't run it myself, but the example provided looks kinda broken. It looks WAY better than the PyPDF results, but good enough?

The table name was parsed as part of a column name, and half of the column names were not parsed at all.

Original: https://github.com/adithya-s-k/marker-api/blob/master/data/i...

Parsed: https://github.com/adithya-s-k/marker-api/blob/master/data/i...

adithya-s-k a year ago

Yep, the accuracy it currently offers is 80% to 90%. We are actively working on improving the underlying models, and there are some major improvements coming soon.

brianjking a year ago

1. How does this differ from LlamaParse, which can be used with and without LlamaIndex?

2. Is there an option for a more permissive license that isn't GNU for commercial enterprise use?

Thanks!

adithya-s-k a year ago

LlamaParse currently only parses PDF documents, as far as I know. OmniParse aims to process any data type, from documents and images to videos and websites, and provide the best representation for AI applications.
We have a few dependencies that are licensed under GNU, which is why we have that license. However, I am currently training models to be under the MIT license and plan to replace the current GNU-licensed dependency to eliminate this limitation.
- brianjking a year ago
  
  LlamaParse supports 80+ file types, just FYI.
  https://docs.cloud.llamaindex.ai/llamaparse/features/support...
  - riku_iki a year ago
    
    But this is not open source? It is some cloud stuff.
- brianjking a year ago
  
  That's fantastic, the MIT license will allow commercial usage as well, right?
  Will you be launching a commercial SaaS offering of it as well?
  Any ETA?
  - adithya-s-k a year ago
    
    Oh, I will do some more research on LlamaParse.
    Yep, planning to release it under a commercially permissible license.
    We have an active API which we are using for our internal clients, and we are planning to release it soon.
    Regarding the ETA of the new model, I don't have a fixed deadline as we are training and testing for a lot of edge cases. Currently, we are doing research and trying to build/train in public on X/Twitter.

sirjaz a year ago

What are the limitations of running the server on Windows?

adithya-s-k a year ago

Some software, like LibreOffice, is used to convert files from one format to another. For Windows, this will require a different approach, which hasn't been implemented yet.