Structured vs Unstructured Data
From Wikipedia:
Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well (…)
Usually, we use this term for text or PDF documents. We call it “unstructured” because it does not fit into a predefined schema.
But in the age of AI, this is simply no longer true.
Text is Not Random
Calling text “unstructured” is like calling a song “random noise.”
A song has tempo, melody, and harmony. Text has grammar, syntax, and punctuation.
These are strict rules. They are the structure. If I break these rules, the text loses its meaning. So the structure exists, but it is linguistic, not tabular or JSON-like.
Why It Is “Structured” Today
Ten years ago, text was “unstructured” because our computers were not smart enough to understand it. If data was not in a row or column, the machine could not query it.
But today, the definition has changed.
- NLP (Natural Language Processing) allows computers to “read” the linguistic rules.
- Vector Embeddings turn words into numbers that represent meaning.
- LLMs can extract entities (like names or dates) automatically.
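The last two points can be sketched in a few lines of Python. This is only a toy under loud assumptions: `toy_embedding` is a hypothetical stand-in that counts words, whereas a real embedding model captures meaning rather than word overlap, and `extract_dates` uses a simple regular expression where an LLM or NLP library would recognize far more entity types. All function names here are made up for illustration.

```python
import math
import re
from collections import Counter


def toy_embedding(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector.

    A real embedding model would map text to a dense vector that reflects
    meaning; this toy version only reflects which words appear.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumption: dates are written as YYYY-MM-DD. An LLM would not need this.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_dates(text: str) -> list[str]:
    """Pull ISO-style dates out of free text, in order of appearance."""
    return DATE_PATTERN.findall(text)


if __name__ == "__main__":
    cats = toy_embedding("The cat sat on the mat")
    more_cats = toy_embedding("A cat sat on a mat")
    stocks = toy_embedding("Stock prices fell sharply")

    # Texts that share words score higher than texts that share none.
    print(cosine_similarity(cats, more_cats) > cosine_similarity(cats, stocks))

    # Free text becomes a queryable record.
    note = "The invoice was issued on 2024-03-15 and paid on 2024-04-01."
    print({"dates": extract_dates(note)})
```

Once text is turned into numbers and fields like this, it can be compared, filtered, and queried much like a row in a table.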
This is a great opportunity for startups working in this area.