Curating the Data

Author: Stu Feeser


Data Curation Deep Dive: The Bridge Photo Challenge

In our “Data to Dollars” journey, today’s stop is a deep dive into data curation, viewed through a lens that’s both specific and relatable—the challenge of curating one million bridge photographs. Imagine a vast digital archive, where each photo holds the potential to reveal critical insights about bridge health, from glaring cracks to subtle signs of wear. Yet, amidst this potential, a significant hurdle stands: not all photos are created equal.

Data Curation Rules of Thumb for the CEO

Your data must go through six steps to become usable. In the next blog, I will cover what this will cost, but for now, understand the journey your data must take. Not all of your data will make it to the finish line; a substantial portion may prove unusable. Also, note how the effort varies at each level. Please understand that the figures here are ONLY A RULE OF THUMB, meant to give you a handle on what the overall effort might look like and where you need to watch most closely:

1. Raw Data

  • Description: Data in its most unrefined form, directly collected from sources without any processing or cleaning. This includes all the noise, irrelevant information, inconsistencies, and possible errors present at the time of collection.
  • Data Remaining: 100% (Starting point)
  • Effort: 5% of total effort
    • Initial assessment and planning stage. The effort here involves setting up the infrastructure for data storage, initial data assessment, and planning for the cleaning process (a sketch of this assessment follows below).
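
To make the assessment stage concrete, here is a minimal Python sketch of a raw-archive inventory. The archive path and the JPEG-only file pattern are assumptions for illustration; a real archive would span multiple formats and storage systems.

```python
from pathlib import Path

# Hypothetical archive location -- substitute your own storage path.
ARCHIVE = Path("/data/bridge_photos/raw")

def assess_raw_archive(archive: Path) -> dict:
    """Walk the raw archive and report basic counts before any cleaning begins."""
    photos = list(archive.rglob("*.jpg"))
    total_bytes = sum(p.stat().st_size for p in photos)
    zero_byte = sum(1 for p in photos if p.stat().st_size == 0)
    return {
        "photo_count": len(photos),
        "total_gb": round(total_bytes / 1e9, 2),
        "zero_byte_files": zero_byte,
    }

print(assess_raw_archive(ARCHIVE))
```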

2. Cleansed Data

  • Description: Data that has undergone initial processing to remove obvious errors, duplicates, and irrelevant entries. Cleansing aims to correct inaccuracies and inconsistencies to make the data more uniform.
  • Data Remaining: 80-90%
    • A portion of the data is often removed due to being corrupt, duplicate, or obviously irrelevant.
  • Effort: 15% of total effort
    • This includes identifying and removing erroneous data, correcting inconsistencies, and standardizing data formats (see the cleansing sketch below).
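
Here is a minimal sketch of a cleansing pass, assuming that zero-byte files and byte-for-byte duplicates (detected by SHA-256 hash) are the targets. Real cleansing would also catch near-duplicates, decode errors, and format inconsistencies.

```python
import hashlib
from pathlib import Path

def cleanse(raw_dir: Path, clean_dir: Path) -> tuple[int, int]:
    """Drop zero-byte files and exact duplicates; copy survivors into clean_dir."""
    clean_dir.mkdir(parents=True, exist_ok=True)
    seen: set[str] = set()
    kept = dropped = 0
    for photo in sorted(raw_dir.rglob("*.jpg")):
        data = photo.read_bytes()
        if not data:
            dropped += 1          # corrupt or empty: nothing to salvage
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            dropped += 1          # byte-for-byte duplicate of an earlier photo
            continue
        seen.add(digest)
        # Name survivors by content hash so duplicate names can never collide.
        (clean_dir / f"{digest}{photo.suffix}").write_bytes(data)
        kept += 1
    return kept, dropped
```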

3. Curated Data

  • Description: At this level, data has not only been cleansed but also organized, structured, and annotated with relevant context, making it significantly more valuable. Curation may involve classifying data, tagging it with metadata, and aligning it with specific analytical or operational goals.
  • Data Remaining: 70-80%
    • Further reduction as data is organized and structured, with some data being set aside due to lack of relevance or quality for the specific aims of the project.
  • Effort: 25% of total effort
    • Involves detailed organization, structuring, and initial tagging or classification. This step is labor-intensive because it requires a deeper understanding of the data’s context and how it fits into the project’s objectives (see the tagging sketch below).
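
A minimal sketch of the tagging step follows, assuming a hypothetical metadata schema (bridge_id, component) and a JSON-lines catalog. The field names are illustrative; a real schema should reflect your own analytical goals.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def curation_record(photo: Path, bridge_id: str, component: str) -> dict:
    """Attach project context to one cleansed photo as a metadata record."""
    return {
        "file": photo.name,
        "bridge_id": bridge_id,        # which structure the photo documents
        "component": component,        # e.g. "deck", "pier", "bearing"
        "curated_at": datetime.now(timezone.utc).isoformat(),
    }

def write_catalog(records: list[dict], catalog: Path) -> None:
    """Persist one JSON record per line -- a simple, greppable catalog format."""
    with catalog.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```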

4. Augmented Data

  • Description: This data has been enhanced beyond curation through techniques like data augmentation, which artificially increases the volume of data by creating modified versions of existing data points. Augmentation techniques include generating synthetic data, cropping, rotating, or otherwise altering data to simulate a broader range of scenarios.
  • Data Remaining: N/A (Data volume increases due to augmentation)
    • The concept of data remaining changes here, as augmentation artificially expands the dataset.
  • Effort: 20% of total effort
    • Effort spent generating synthetic data or applying modifications to existing data to enhance dataset diversity and volume (see the augmentation sketch below).
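
Below is a minimal augmentation sketch using the Pillow imaging library. The rotations and mirror are illustrative; which transformations are safe depends on what the model must learn (a mirrored crack is still a crack, but a rotation may mislead an orientation-sensitive model).

```python
from pathlib import Path
from PIL import Image  # Pillow: pip install Pillow

def augment(photo: Path, out_dir: Path) -> list[Path]:
    """Write simple modified copies (rotations, a mirror) of one photo."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with Image.open(photo) as img:
        variants = {
            "rot90": img.rotate(90, expand=True),
            "rot270": img.rotate(270, expand=True),
            "mirror": img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),
        }
        for tag, variant in variants.items():
            out_path = out_dir / f"{photo.stem}_{tag}{photo.suffix}"
            variant.save(out_path)
            written.append(out_path)
    return written
```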

5. Labeled Data

  • Description: A subset of curated data that has been specifically labeled for supervised learning tasks. Labels are annotations that directly inform the AI model about the patterns or features it should learn to recognize.
  • Data Remaining: 60-70% of the original (prior to augmentation)
    • Some data may not be suitable for labeling due to ambiguity or irrelevance to the training goals, leading to its exclusion.
  • Effort: 20% of total effort
    • Significant effort is required to label data accurately, often involving domain experts. This step is crucial for supervised learning models (see the labeling sketch below).
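
A minimal labeling sketch follows, assuming a hypothetical defect label set for the bridge example. Restricting annotators to an agreed vocabulary is what keeps labels consistent enough for supervised training.

```python
import json
from pathlib import Path

# Hypothetical label set for the bridge-photo example.
LABELS = {"crack", "spalling", "corrosion", "no_defect"}

def record_label(photo: Path, label: str, annotator: str, labels_file: Path) -> None:
    """Append one reviewed annotation, rejecting anything outside the agreed label set."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    entry = {"file": photo.name, "label": label, "annotator": annotator}
    with labels_file.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```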

6. Enriched Data

  • Description: Data that has been enhanced with additional external information or insights to increase its value and context. Enrichment might involve integrating datasets from different sources, adding derived attributes, or embedding advanced metadata.
  • Data Remaining: 50-60%
    • This estimate accounts for the exclusion of data through previous steps and the selective integration of additional external data to enrich the remaining dataset.
  • Effort: 15% of total effort
    • The final enrichment process might involve integrating external datasets, embedding advanced metadata, or adding derived attributes, all of which require a sophisticated understanding and manipulation of the data (see the enrichment sketch below).
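
Finally, a minimal enrichment sketch, assuming a hypothetical lookup of external inspection records keyed by bridge_id. The attribute names (date, load_rating) are illustrative placeholders for whatever external sources you integrate.

```python
import json
from pathlib import Path

def enrich_catalog(catalog: Path, inspections: dict[str, dict], out: Path) -> None:
    """Merge external inspection attributes into each record, keyed by bridge_id."""
    with catalog.open() as src, out.open("w") as dst:
        for line in src:
            rec = json.loads(line)
            extra = inspections.get(rec.get("bridge_id"), {})
            rec["last_inspection"] = extra.get("date")      # external attribute
            rec["load_rating"] = extra.get("load_rating")   # external attribute
            dst.write(json.dumps(rec) + "\n")
```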

Conclusion

The human labor required will be significant, and even more so if inefficient tools are used. In the data curation effort, User Experience (UX) is paramount: take every step to simplify the process for the individuals reviewing and tagging data, and let Artificial Intelligence (AI) automate the manual steps wherever possible.