Data and artistic content are essential inputs in the development of Artificial Intelligence (AI) and Machine Learning (ML) technologies. In the rapidly evolving landscape of AI, demand for high-quality data and artistic content is surging. Current methods of AI data collection, however, particularly data scraping, are risky and controversial due to the lack of provenance and the absence of compensation for owners and creators. Further, traditional methods of content licensing are inefficient and ill-suited to the dynamic needs of the AI era. There is a critical need for an efficient, market-based transactional platform that can streamline the licensing process for data and artistic content. Such a platform would not only facilitate seamless exchanges and ensure fair compensation for creators but also promote a sustainable ecosystem for both AI innovation and data and content development.
AI and ML technologies are built on complex algorithms and models that consume vast amounts of data; from these data, AI and ML models learn patterns that allow them to make predictions and generate content. The foundation of AI and ML lies in the data used for model training, fine-tuning, and augmentation. Without sufficient high-quality data, even the most sophisticated algorithms can fail to deliver usable or reliable results. This makes data an essential component in the development and deployment of AI and ML solutions.
AI and ML models require massive datasets to train effectively, and the quality and quantity of this data directly impact the performance and reliability of the models. Large quantities of data are needed for AI and ML models to identify and capture underlying patterns, enabling them to generalize from a wide array of examples and improve their predictive capabilities. Large datasets also help minimize over-fitting, in which a model performs well on training data but fails to generalize, performing poorly on new data. The diversity within a given dataset ensures that models can handle different situations robustly, making them more reliable in real-world applications.
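To make the over-fitting point concrete, the sketch below (a toy example on synthetic data, not drawn from any platform discussed here) fits a deliberately flexible polynomial model to training sets of increasing size. With few examples the model memorizes the training data, producing a large gap between training and test error; as the dataset grows, the gap narrows.

```python
# Toy illustration of over-fitting vs. dataset size (synthetic data).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def train_test_gap(n_train: int) -> tuple[float, float]:
    """Fit a deliberately flexible model; return (train MSE, test MSE)."""
    x = rng.uniform(-1, 1, size=(n_train, 1))
    y = np.sin(3 * x).ravel() + rng.normal(0, 0.2, n_train)
    x_test = rng.uniform(-1, 1, size=(500, 1))
    y_test = np.sin(3 * x_test).ravel() + rng.normal(0, 0.2, 500)

    model = make_pipeline(PolynomialFeatures(degree=12), LinearRegression())
    model.fit(x, y)
    return (mean_squared_error(y, model.predict(x)),
            mean_squared_error(y_test, model.predict(x_test)))

for n in (20, 200, 2000):
    train_mse, test_mse = train_test_gap(n)
    print(f"n={n:4d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```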
Large, diverse datasets are integral for developing reliable and effective AI and ML models. However, the quality of data is even more crucial to the success of AI and ML initiatives than quantity. High-quality data ensures that models learn from authentic, relevant, and diverse information, reducing hallucinations and enhancing their ability to provide relevant answers or generalize across different scenarios. Low-quality data, on the other hand, often results in erroneous output and unreliable models, regardless of the dataset size. Garbage in, garbage out. Models trained on high-quality data also require less time and computational resources to achieve optimal performance.
Artistic content plays a significant role in training models for tasks such as image and video generation, music composition, and multimodal outputs. Without diverse and high-quality artistic content, generative models like GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders) are unable to learn and generate “new” creative works. Ultimately, high-quality datasets improve the adaptability of AI and ML models, enabling them to make more accurate predictions when the training data is representative of real-world scenarios.
AI and ML developers acquire data from a variety of sources, often without clear lineage or a license for its use. Public datasets from platforms like Kaggle, the UCI Machine Learning Repository, and government databases are widely used. Web scraping, which involves extracting data from websites using automated tools and scripts, is another common method. APIs provided by various platforms and services offer programmatic access to data, and licensing agreements with organizations and institutions can provide proprietary datasets that are not publicly available.
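The sketch below contrasts two of these acquisition paths, API access and web scraping, at the code level. The endpoint, page URL, and API key are placeholders, not real services; the point is that an API response typically arrives with structured terms and metadata, while a scraped page carries no license signal at all.

```python
# Sketch of two common acquisition paths: a documented API vs. scraping.
# All URLs and credentials below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# 1) API access: provenance and terms of use are explicit in the response.
resp = requests.get(
    "https://api.example.com/v1/datasets/weather",   # hypothetical endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=30,
)
resp.raise_for_status()
records = resp.json()  # structured data, typically with license metadata

# 2) Web scraping: extracts whatever is rendered, with no license signal.
page = requests.get("https://example.com/articles", timeout=30)
soup = BeautifulSoup(page.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
```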
In addition to these “real” data sources, synthetic data generated by algorithms has been proposed as an alternative when real data is scarce, sensitive, or inaccessible. Training AI models on synthetic data, however, will likely lead to model degradation. Synthetic data may not sufficiently capture the full diversity and feature distribution of real-world data, resulting in models that are less robust, less accurate, and unable to generalize well to new data. Synthetic data may also exaggerate imperfections present in the original data, which can lead to lower-quality models. Another significant concern when using synthetic data is model collapse. Model collapse occurs when AI models trained on data generated by other AI models lose information about the original data distribution, resulting in increasingly similar, less diverse, and/or low-quality outputs. Ultimately, if the synthetic data are not carefully generated, they may introduce biases that were not present in the original data, leading to biased models that make inaccurate predictions.
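Model collapse can be demonstrated with the simplest possible “model”: a Gaussian repeatedly fitted to its own output. In the sketch below (a deliberately tiny, hypothetical setup; the small sample size exaggerates the effect), each generation trains only on samples drawn from the previous generation’s fit, and the estimated spread of the data steadily shrinks, which is exactly the loss of diversity described above.

```python
# Toy demonstration of model collapse: each "generation" is a Gaussian
# fitted only to samples generated by the previous generation's fit.
# The small sample size (50) deliberately exaggerates the drift.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=50)  # generation 0: "real" data

for generation in range(101):
    mu, sigma = data.mean(), data.std()
    if generation % 20 == 0:
        print(f"generation {generation:3d}: std={sigma:.3f}")
    # The next generation trains only on this generation's synthetic output.
    data = rng.normal(loc=mu, scale=sigma, size=50)
```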
Data acquisition for AI and ML training is currently a complex and increasingly contentious process as media companies, content producers and enterprise customers recognize the significant value that AI and ML platforms derive through the commercialization of their IP and data assets. Recently, several noteworthy legal cases have emerged around AI and ML data acquisition and scraping practices. In 2023, more than 13 new content-related lawsuits were filed against AI companies. Notably, the New York Times filed a multi-billion-dollar lawsuit against Microsoft and OpenAI, the creator of ChatGPT, accusing them of copyright infringement and abusing the newspaper’s intellectual property to train large language models (LLMs).
Adding to the contention is the growing consensus that data are becoming one of the most valuable forms of intellectual property (IP). As AI and ML technologies advance, the importance of high-quality, diverse datasets has surged, often surpassing the traditional value placed on other forms of IP. This value shift underscores the critical role data assets play in driving innovation and competitive advantage in the AI era.
In recognition of the value of data, AI and ML platforms are scrambling to acquire content use rights. However, blanket content licensing can be risky for both the AI platform and the content owner. AI and ML platforms may overpay, agreeing to high license fees based on the anticipated value of the data, only to find that the licensed data are not as useful or relevant as initially surmised.
For data owners, blanket licensing is a double-edged sword. For a struggling online magazine or newspaper, a blanket content license may be a welcome lump-sum payment or short-term revenue stream. But when content owners do not fully understand the rights being granted, the value of those rights, or the long-term benefits of their data to AI and ML platforms, underpayment and/or loss of control is a real and significant risk as AI becomes a larger part of their distribution channel. Additionally, content owners may find it challenging to negotiate fair terms when they lack the AI platform and customer usage data, the bargaining power, or the expertise needed to assess the potential long-term benefits and value of their data.
An independent, auditable transactional platform would significantly improve market efficiency and pricing. Moreover, a transparent marketplace for data and artistic content would streamline the process of buying and selling data and content, reducing transaction costs and eliminating the need for lengthy individual negotiations, paper contracts, and royalty reports. By offering clear market pricing and licensing mechanisms, it would help establish fair market values for different types of data and content, ensuring that both buyers and sellers are adequately compensated and that use rights are enforced. Additionally, the platform could incorporate tools for tracking and measuring the usage, attribution, and contribution of data and content, providing insights into their actual value and impact. This transparency would reduce information asymmetry and economic imbalances, allowing all value-chain participants to make more informed decisions and be compensated fairly for their contributions.
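What might such a tracking tool record? The sketch below shows one plausible shape for an auditable usage record; the field names and hashing scheme are assumptions for illustration, not any vendor’s actual schema. A content hash over the record lets any party later verify that a reported usage entry has not been altered.

```python
# Sketch of an auditable usage record; fields are illustrative only.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class UsageRecord:
    asset_id: str     # licensed dataset or creative work
    licensee_id: str  # developer or model that consumed it
    purpose: str      # e.g. "training", "fine-tuning", "inference"
    units: int        # tokens, images, seconds, etc.
    timestamp: str

    def digest(self) -> str:
        """Content hash so any party can verify the record is untampered."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

record = UsageRecord(
    asset_id="dataset-7f3a",      # hypothetical identifiers
    licensee_id="dev-0042",
    purpose="fine-tuning",
    units=1_200_000,
    timestamp=datetime.now(timezone.utc).isoformat(),
)
print(record.digest())
```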
For a sustainable and efficient information economy, there must be both transparency and accountability. In addition to accurate and timely price information, there must be reliable mechanisms to track and measure the usage by, and contribution of, data and artistic content to AI and ML platforms. Together, real-time pricing and robust usage tracking would significantly improve market efficiency and thus enable market-based pricing. Price transparency allows market participants to make informed decisions, reducing information asymmetry and promoting fair competition. When data and content rights and usage are accurately tracked, content creators and data owners can be fairly compensated based on the value their contributions bring to AI and ML models. These conditions would not only incentivize the creation and sharing of high-quality data but also help to inspire trust between data providers and AI and ML developers (Developers). Additionally, dynamic pricing models, driven by real-time data, can adjust prices based on demand, usage patterns, and market conditions, ensuring that prices reflect the true value of data and content.
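As a concrete (and hypothetical) example of such a dynamic pricing rule, the function below scales a base license fee by observed demand and by an asset’s share of platform usage. The functional form and coefficients are assumptions chosen for illustration; a production platform would calibrate them against real market data.

```python
# One possible dynamic-pricing rule; form and coefficients are assumed.
def dynamic_price(base_price: float,
                  demand_ratio: float,  # current demand / typical demand
                  usage_share: float,   # asset's share of platform usage
                  elasticity: float = 0.5) -> float:
    """Scale a base license fee by observed demand and usage share."""
    demand_factor = demand_ratio ** elasticity  # dampened response to demand
    usage_factor = 1.0 + usage_share            # reward heavily used assets
    return base_price * demand_factor * usage_factor

# Example: demand is double the norm and the asset drives 10% of usage.
print(f"${dynamic_price(100.0, demand_ratio=2.0, usage_share=0.10):.2f}")
```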
In addition to transparency, an efficient transactional platform must include easy, verifiable access to data provenance for diverse datasets and artistic content. Clear data provenance requires that the origin, quality, and legal status of the data are known to all users, reducing the risks associated with copyright infringement and unauthorized use. This clarity helps establish trust between data providers and developers, facilitating smoother negotiations and fairer compensation agreements. Additionally, having a wide range of high-quality, well-documented datasets readily available allows developers to identify and select the most relevant data for their needs, optimizing the performance of their models. This would reduce the significant time and resources spent on data acquisition and preparation, leading to cost savings and more competitive pricing, which benefits both data/content owners and developers.
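One simple way to make provenance verifiable is a hash chain, in which each version of a dataset records a digest of the version it was derived from. The sketch below is purely illustrative (the transformation steps and license string are hypothetical); the idea is that a prospective licensee can recompute the chain and confirm the claimed lineage before buying.

```python
# Sketch of verifiable provenance as a hash chain: each dataset version
# commits to the digest of its parent, so lineage can be audited.
import hashlib

def digest(content: bytes, parent_digest: str = "") -> str:
    """Hash this version's content together with its parent's digest."""
    return hashlib.sha256(parent_digest.encode() + content).hexdigest()

# Hypothetical lineage for an image dataset:
raw = digest(b"original photo archive, license: CC-BY-4.0")
cleaned = digest(b"deduplicated + EXIF stripped", parent_digest=raw)
captioned = digest(b"captions added by annotator team A", parent_digest=cleaned)

# A buyer can recompute the chain and confirm the lineage matches
# what the seller claims before licensing the dataset.
print(raw, cleaned, captioned, sep="\n")
```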
The benefits of an efficient data and content transaction platform are many. For developers, access to more high-quality data will lead to improved model performance, lower computing costs, and more rapid innovation. For developers and data owners alike, such a transactional platform would significantly reduce the cost of finding counterparties, negotiating terms, and finalizing deals, reducing the time and resources spent on individual agreements. Standardized licensing deals can simplify negotiations and ensure that all parties understand the terms, which reduces legal fees and the complexity of individual negotiations. With transparent market pricing, all parties can be assured that they are receiving fair compensation based on market demand and the actual value of their contributions. The platform connects data/content owners with a wider range of potential buyers, increasing the likelihood of finding suitable and competitive offers. Additionally, the platform can provide tools to track and measure the usage and value of data and content, ensuring that owners are compensated accurately and fairly based on actual usage.
Negotiating and valuing an upfront license for data and artistic content in AI and ML platforms presents significant challenges. The intrinsic value of data and content can be highly variable, depending on factors such as uniqueness, quality, relevance, and perceived impact on model performance. Additionally, the rapid evolution of AI and ML businesses makes it difficult to predict long-term value accurately. In contrast, a usage-based model enabled by an efficient transactional platform offers a more flexible approach. By compensating data/content owners based on their contributions, this model ensures that remuneration is aligned with the actual usage and benefits derived from their data and content. It also ensures that developers do not overpay for the use of data/content, as payments are directly correlated to the actual value and usage of the data and content. This approach can integrate with various pricing models, including subscription, pay-per-use, and advertising-based monetization models, providing a scalable and dynamic framework that can accommodate diverse business needs and market conditions. This not only incentivizes high-quality contributions, but also fosters a more sustainable and collaborative ecosystem for AI and ML development.
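A minimal sketch of how usage-based compensation could work in practice appears below: a revenue pool is split pro rata by measured contribution. The owner names, usage weights, and revenue figure are hypothetical, and a real platform would use far richer attribution signals, but the mechanics of tying payment to measured usage are the same.

```python
# Sketch of usage-based payouts: a revenue pool split pro rata by
# measured contribution. All names and figures are hypothetical.
def usage_based_payouts(revenue_pool: float,
                        usage_by_owner: dict[str, float]) -> dict[str, float]:
    """Split a revenue pool in proportion to each owner's measured usage."""
    total = sum(usage_by_owner.values())
    if total == 0:
        return {owner: 0.0 for owner in usage_by_owner}
    return {owner: revenue_pool * used / total
            for owner, used in usage_by_owner.items()}

# E.g. $10,000 of model revenue attributed across three contributors,
# weighted by tokens of their content consumed in training.
print(usage_based_payouts(10_000.0, {
    "news-archive": 4_500.0,
    "photo-library": 3_000.0,
    "indie-musician": 2_500.0,
}))
```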
For data/content owners, an efficient transactional platform offers increased revenue streams, broader market reach, enhanced collaboration, efficient use of data and content assets, and the opportunity to establish industry standards and best practices. For developers, an efficient transactional platform provides access to the verifiable, quality data needed for enhanced model accuracy, cost efficiency and accelerated time-to-market.
Although a usage-based transactional model enabled by an efficient, transparent transactional platform would address many of the use rights concerns currently faced by data/content owners and AI Developers, the adoption of such platforms is just beginning. Only a handful of companies have attempted or are currently pioneering solutions, most of which have only announced fundraising and potential betas for their products.
In 2012, the intellectual property advisory firm Ocean Tomo launched the first intellectual property trading platform, Intellectual Property Exchange International (IPXI). IPXI aimed to create a marketplace for IP rights, allowing for the trading of unit license rights (ULRs). This innovative approach was designed to make IP transactions more efficient and transparent. Unfortunately, IPXI ceased operations in 2015, but its efforts were recognized as positively contributing to the global IP market.
Today, Personal Digital Spaces (PDS) is a noteworthy leader in the space. Offering an end-to-end data and IP licensing and market platform, PDS has a commercialized enterprise product, customers, and established leadership and development teams. The PDS platform allows data attribution/contribution to be recognized and tracked, providing guarantees of integrity and accountability. Moreover, the platform integrates blockchain technology to enable real-time management and monetization of data/IP assets. PDS’s platform supports multiple licensing strategies and pricing models such as subscription, pay-per-use, and advertising-based models. By facilitating a complete accounting and value exchange mechanism, PDS’s platform ensures fair compensation for data owners and content creators while providing AI Developers with a scalable framework for their initiatives.
In addition to PDS, Story Protocol, a development-stage company, recently raised an impressive $80 million, at a valuation of $2.25 billion. Story Protocol, like PDS, intends to deploy a blockchain-based protocol for intellectual property management. Story Protocol’s offering, however, is not yet commercially available, and its product roadmap currently lacks comprehensive functionality.
Human Native AI, another early-stage company, is developing a platform designed to manage and monetize digital content. The company’s goal is to create a decentralized marketplace where content creators can license their works to developers for training purposes. Human Native AI was founded in April 2024, and its product is currently in beta. The company is working to build out its operating team and infrastructure to bring its solution to market.
While the concept of a usage-based transactional model for data/IP rights in AI and ML platforms holds great promise, its implementation remains in its early stages. As adoption and deployment of these platforms continue, they promise robust solutions for the secure, transparent, and fair management of data and content, solutions that enhance the value of those assets and ultimately benefit both creators and Developers across AI and ML ecosystems.
Ultimately, the development of an efficient, transparent, market-based transactional platform for licensing data and artistic content is essential for the continued growth and sustainability of AI and ML technologies. The emergence of, and significant investment in, companies like Personal Digital Spaces and Story Protocol are indicative of the value-add these platforms will bring to the evolution of AI and ML.
For Developers, access to high-quality, diverse datasets will significantly enhance model performance and accelerate innovation. Transparent, market-based pricing and explicit data provenance will ensure that Developers can make informed decisions about the data they use, and a streamlined acquisition process will reduce the time, resources, and legal fees spent on data collection and preparation, allowing developers to focus on refining their models and algorithms.
For data/content owners, these platforms will offer an efficient way to monetize their assets. By providing tools to track and measure the usage of their data, these platforms will ensure that creators are fairly compensated based on the actual value of their contributions to AI and ML models, incentivizing the creation and sharing of high-quality data and fostering trust between data providers/content owners and developers. The ability to reach a broader market will increase monetization opportunities while reducing both the complexity of negotiating individual licensing agreements and the likelihood of costly legal proceedings.
As these platforms evolve, they will play a crucial role in accelerating innovation and collaboration and paving the way for a future where data and content rights are managed efficiently, and all can thrive.