Customers often ask us how much data they need to run successful artificial intelligence (AI) and machine learning (ML) projects. This question is hard to answer in simple terms. A functioning ML model requires large, clean data sets, but the optimal size depends on a range of factors, including the complexity of the model, the training method, and your tolerance for errors. Fortunately, there are several ways of estimating your data needs and overcoming a lack of data.
It’s difficult to know your data needs
ML models consider various parameters to reveal patterns within and between data points. The more parameters they have to consider, the more data they need. An ML model tasked with identifying the make of a specific bike, for instance, has a small set of parameters to analyze, but it will take much more data for an algorithm to work out how much that bike costs. The complexity of the model thus directly affects data requirements.
Your training methodology is another factor to account for. If your algorithm uses a traditional, structured learning approach, it might quickly reach a point where additional data yields little return. But deep learning models, which have a longer learning curve, keep benefiting as more data is added.
The role the ML model is to play is another important consideration. If the model is supposed to predict the weather or suggest clothing products to site visitors, a 15% error rate might be tolerable. But that rate is unacceptable if the model is vital to the survival of a business, or even a person, in which case you'll need more data to ensure reliable performance.
And depending on the environment in which the model is deployed, it may require a greater variety of inputs to function properly. Take, for instance, a chatbot handling thousands of customers each day. It receives inquiries in various languages, written in formal and informal styles, and this unpredictability means that the algorithm needs to be trained with large and diverse data sets to provide correct answers.
Calculating how much data your model requires
These challenges notwithstanding, there are some common methods data scientists use to estimate how much data they need to get the ball rolling.
One of those is the "rule of 10": a model requires roughly ten times more data points than it has degrees of freedom. A degree of freedom can be defined in various ways, whether as a parameter that affects the model's output, a column in your dataset, or a data point attribute.
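The rule of 10 is simple enough to capture in a couple of lines. This is an illustrative sketch, with the count of 30 features as a made-up example; how you count degrees of freedom in your own project is up to you:

```python
def rule_of_ten(degrees_of_freedom: int) -> int:
    """Rough lower bound on training examples: 10x the model's degrees of freedom."""
    return 10 * degrees_of_freedom

# Hypothetical example: a tabular model with 30 feature columns,
# counting each column as one degree of freedom.
print(rule_of_ten(30))  # -> 300 examples as a starting estimate
```

Treat the result as a floor for experimentation, not a guarantee: complex or noisy problems usually need far more.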
Another method is to run a study that analyzes the relationship between your dataset size and the model's accuracy and identifies the point where more data provides diminishing returns. Although this is a labor-intensive approach, it's more accurate than guessing.
Though this is more appropriate for data scientists, you can also review studies on ML problems published by other companies and experts. Analyzing how much data they used in situations similar to yours can inform your decisions and lead to better outcomes.
There are many details across the internet on how much data was used in successful ML projects. Here are some notable examples:
FaceNet (facial detection and recognition) – 450,000 samples
MIT CSAIL (image annotation) – 185,000 images, 62,000 annotated images, 650,000 labeled objects
Sprout (sentiment analysis for Twitter) – tens of thousands of tweets
Analysis and Classification of Arabic Newspapers (sentiment analysis and classification of Facebook pages in Arabic) – 62,000 posts, 9,000 comments
TransPerfect (machine translation) – four million words
Building Chatbots from Forum Data: Model Selection Using Question Answering Metrics (chatbot training) – two million answers paired with 200,000 questions
Online Learning Library (natural language processing experiments) – 15,000 training points, over one million features
Collecting more data and overcoming the lack of it
Now that you know roughly how much data your model requires, it's time to collect more of it. Ideally, you'd already have a data-gathering mechanism in place based on the scope and nature of your ML project. For example, you could provide your target audience or customers with access to an app and use the collected data to build algorithms.
You can also go through open-source resources, as some companies and experts give their data away for free. And if you need specific external data sets, partnering with organizations that might have them is another way to go.
If you can’t collect more data, there are various options to consider. You can turn to data augmentation, a technique to create new data points out of the existing ones by altering their properties. Feeding your algorithm with a more diverse group of data sets improves its robustness and accuracy.
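For numeric data, one of the simplest augmentation techniques is adding small random perturbations to existing samples. This is a minimal sketch with made-up shapes and a made-up noise scale; for images, the analogous moves would be flips, crops, and rotations:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(features: np.ndarray, n_copies: int = 3, noise_scale: float = 0.05) -> np.ndarray:
    """Create perturbed copies of each sample by adding small Gaussian noise,
    returning the original samples plus n_copies jittered versions."""
    copies = [features]
    for _ in range(n_copies):
        copies.append(features + rng.normal(scale=noise_scale, size=features.shape))
    return np.concatenate(copies, axis=0)

original = rng.uniform(size=(100, 4))   # 100 samples, 4 features (hypothetical)
augmented = augment(original)
print(augmented.shape)                  # 4x the data: (400, 4)
```

The noise scale matters: perturbations should be small enough that the label of each copy remains valid.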
Another option is data synthesis, which involves creating new data points using sampling techniques. And if your data is limited, you can turn to discriminative methods such as regularization, which constrain the model's weights to prevent overfitting on a small dataset.
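Regularization's effect on a small dataset can be shown with ridge regression, which has a closed-form solution. Everything below is a synthetic illustration (30 samples, 20 features, only 3 of which matter), not a recipe tied to any particular library:

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression: solve (X^T X + alpha*I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))            # few samples, many features: overfitting risk
w_true = np.zeros(20)
w_true[:3] = 1.0                         # only 3 features actually matter
y = X @ w_true + rng.normal(scale=0.1, size=30)

w_plain = ridge_fit(X, y, alpha=1e-8)    # effectively unregularized
w_ridge = ridge_fit(X, y, alpha=5.0)     # penalized weights

# The penalty shrinks coefficient magnitudes, trading a little bias
# for much lower variance when data is scarce.
print(float(np.linalg.norm(w_plain)), float(np.linalg.norm(w_ridge)))
```

In practice you'd pick `alpha` by cross-validation rather than by hand, but the mechanism is the same: the fewer data points you have, the more a penalty like this helps.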
Start working on your model and adapt as you go
Knowing how much data you need for ML models requires looking at the problem and data. But the (over-simplified) answer would be that if you have at least several hundred examples of whatever you want to predict, it’s worth giving it a try.
And try not to look for perfect data. If the resources you have can make a business impact, move forward with your project. ML models are iterative and you’ll learn and improve as you work on them, coming ever closer to your ultimate goal.