8 ML Projects for Beginners
In this blog, we’ll be walking through 8 fun machine learning projects for beginners. Projects are some of the best investments of your time. You’ll enjoy learning, stay motivated, and make faster progress.
You see, no amount of theory can replace hands-on practice. Textbooks and lessons can lull you into a false belief of mastery because the material is there in front of you. But once you try to apply it, you might find that it’s harder than it looks.
Projects help you improve your applied ML skills quickly while giving you the chance to explore an interesting topic.
Plus, you can add projects to your portfolio, making it easier to land a job, find cool career opportunities, and even negotiate a higher salary.
1. Machine Learning Gladiator
We’re affectionately calling this “machine learning gladiator,” but it’s not new. This is one of the fastest ways to build practical intuition around machine learning.
The goal is to take out-of-the-box models and apply them to different datasets. This project is awesome for 3 main reasons:
First, you’ll build intuition for model-to-problem fit. Which models are robust to missing data? Which models handle categorical features well? Yes, you can dig through textbooks to find the answers, but you’ll learn better by seeing it in action.
Second, this project will teach you the invaluable skill of prototyping models quickly. In the real world, it’s often difficult to know which model will perform best without simply trying them.
Finally, this exercise helps you master the workflow of model building. For example, you’ll get to practice…
- Importing data
- Cleaning data
- Splitting it into train/test or cross-validation sets
- Feature engineering
Because you’ll use out-of-the-box models, you’ll have the chance to focus on honing these critical steps.
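To make the workflow concrete, here is a minimal sketch of a "gladiator" round in Python. It assumes scikit-learn is installed and uses the small wine dataset bundled with it as a stand-in for whatever dataset you pick. The three models and their settings are just illustrative choices, not a recommended lineup.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in toy dataset used as a stand-in for any dataset you choose.
X, y = load_wine(return_X_y=True)

# Out-of-the-box models; scaling is added for the scale-sensitive ones.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k_nearest_neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}

# Print the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Swapping in a different dataset only changes the line that loads X and y, which is what makes it cheap to compare many models across many problems.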
Check out the sklearn (Python) or caret (R) documentation pages for instructions. You should practice regression, classification, and clustering algorithms.
- Python: sklearn – Official tutorial for the sklearn package
- Predicting wine quality with Scikit-Learn – Step-by-step tutorial for training a machine learning model
- R: caret – Webinar given by the author of the caret package
- UCI Machine Learning Repository – 350+ searchable datasets spanning almost every subject matter. You’ll definitely find datasets that interest you.
- Kaggle Datasets – 100+ datasets uploaded by the Kaggle community. There are some really fun datasets here, including PokemonGo spawn locations and Burritos in San Diego.
- data.gov – Open datasets released by the U.S. government. Great place to look if you’re interested in social sciences.
2. Play Money Ball
In the book Moneyball, the Oakland A’s revolutionized baseball through analytical player scouting. They built a competitive squad while spending only 1/3 of what large market teams like the Yankees were paying for salaries.
First, if you haven’t read the book yet, you should check it out. It’s one of our favorites!
Fortunately, the sports world has a ton of data to play with. Data for teams, games, scores, and players are all tracked and freely available online.
That gives you plenty of fun machine learning projects to choose from. For example, you could try…
- Sports betting… Predict box scores given the data available at the time right before each new game.
- Talent scouting… Use college statistics to predict which players would have the best professional careers.
- General managing… Create clusters of players based on their strengths in order to build a well-rounded team.
Sports is also an excellent domain for practicing data visualization and exploratory analysis. You can use these skills to help you decide which types of data to include in your analyses.
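As an illustration of the "general managing" idea, the sketch below clusters players by their per-game stats with k-means. The stats are synthetic and the three player archetypes are invented for the example. With real data you would replace the generated array with numbers from one of the sources below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical per-game stats: [points, rebounds, assists].
# Three synthetic archetypes stand in for real player data.
scorers = rng.normal([25, 4, 3], 1.5, size=(20, 3))
rebounders = rng.normal([10, 12, 2], 1.5, size=(20, 3))
playmakers = rng.normal([12, 3, 10], 1.5, size=(20, 3))
stats = np.vstack([scorers, rebounders, playmakers])

# Standardize so no single stat dominates the distance calculation.
scaled = StandardScaler().fit_transform(stats)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Summarize each cluster by size and average stat line.
for cluster in range(3):
    members = stats[labels == cluster]
    print(f"cluster {cluster}: n={len(members)}, mean stats={members.mean(axis=0).round(1)}")
```

Real player data will be far messier than these well-separated groups, so expect to experiment with the number of clusters and the features you include.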
- Sports Statistics Database – Sports statistics and historical data covering many professional sports and several college ones. Clean interface makes it easier for web scraping.
- Sports Reference – Another database of sports statistics. More cluttered interface, but individual tables can be exported as CSV files.
- cricsheet.org – Ball-by-ball data for international and IPL cricket matches. CSV files for IPL and T20 internationals matches are available.
3. Predict Stock Prices
The stock market is like candy-land for any data scientists who are even remotely interested in finance.
First, you have many types of data that you can choose from. You can find prices, fundamentals, global macroeconomic indicators, volatility indices, etc… the list goes on and on.
Second, the data can be very granular. You can easily get time series data by day (or even minute) for each company, which allows you to think creatively about trading strategies.
Finally, the financial markets generally have short feedback cycles. Therefore, you can quickly validate your predictions on new data.
Some examples of beginner-friendly machine learning projects you could try include…
- Quantitative value investing… Predict 6-month price movements based on fundamental indicators from companies’ quarterly reports.
- Forecasting… Build time series models, or even recurrent neural networks, on the delta between implied and actual volatility.
- Statistical arbitrage… Find similar stocks based on their price movements and other factors and look for periods when their prices diverge.
Obvious disclaimer: Building trading models to practice machine learning is simple. Making them profitable is extremely difficult. Nothing here is financial advice, and we do not recommend trading real money.
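For a feel of the statistical arbitrage idea, here is a rough sketch using NumPy and pandas. The tickers AAA, BBB, and CCC and their prices are entirely synthetic, and the 60-day window and threshold of 2 are arbitrary assumptions. It only shows the mechanics of finding a similar pair and flagging divergences, not a trading strategy.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_days = 500

# Synthetic daily returns: AAA and BBB share a common factor, CCC is independent.
common = rng.normal(0, 0.01, n_days)
returns = pd.DataFrame({
    "AAA": common + rng.normal(0, 0.003, n_days),
    "BBB": common + rng.normal(0, 0.003, n_days),
    "CCC": rng.normal(0, 0.01, n_days),
})
prices = 100 * (1 + returns).cumprod()

# Step 1: find the most correlated pair of return series.
corr = returns.corr()
pairs = [(a, b) for i, a in enumerate(corr.columns) for b in corr.columns[i + 1:]]
best_pair = max(pairs, key=lambda p: corr.loc[p[0], p[1]])

# Step 2: z-score of the log-price spread over a rolling window.
a, b = best_pair
spread = np.log(prices[a]) - np.log(prices[b])
window = 60
zscore = (spread - spread.rolling(window).mean()) / spread.rolling(window).std()

# Step 3: flag days where the spread sits unusually far from its recent mean.
diverged = zscore.abs() > 2
print(best_pair, int(diverged.sum()), "divergence days flagged")
```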
- Python: sklearn for Investing – YouTube video series on applying machine learning to investing.
- R: Quantitative Trading with R – Detailed class notes for quantitative finance with R.
- Quandl – Data market that provides free (and premium) financial and economic data. For example, you can bulk download end-of-day stock prices for over 3000 US companies or economic data from the Federal Reserve.
- Quantopian – Quantitative finance community that offers a free platform for developing trading algorithms. Includes datasets.
- US Fundamentals Archive – 5 years of fundamentals data for 5000+ U.S. companies.
4. Teach a Neural Network to Read Handwriting
Neural networks and deep learning are two success stories in modern artificial intelligence. They’ve led to major advances in image recognition, automatic text generation, and even in self-driving cars.
To get involved with this exciting field, you should start with a manageable dataset.
The MNIST Handwritten Digit Classification Challenge is the classic entry point. Image data is generally harder to work with than “flat” relational data, but the MNIST data is beginner-friendly and small enough to fit on one computer.
Handwriting recognition will challenge you, but it doesn’t need high computational power.
To start, we recommend working through the first chapter of the tutorial below. It will teach you how to build a neural network from scratch that solves the MNIST challenge with high accuracy.
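If you want a quick taste before building a network from scratch, the sketch below trains a small off-the-shelf neural network with scikit-learn. Note the assumptions: it uses the 8x8 digits dataset bundled with scikit-learn rather than the full MNIST data, and it relies on a library model instead of the from-scratch approach the tutorial teaches. The layer size and iteration count are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 1,797 8x8 digit images bundled with scikit-learn (a smaller stand-in for MNIST).
X, y = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16, so this rescales them to [0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# One hidden layer of 64 units; the size is an illustrative choice.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```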
- Neural Networks and Deep Learning (Online Book) – Chapter 1 walks through how to write a neural network from scratch in Python to classify digits from MNIST. The author also gives a very good explanation of the intuition behind neural networks.
- MNIST – MNIST is a modified subset of two datasets collected by the U.S. National Institute of Standards and Technology. It contains 70,000 labeled images of handwritten digits.
5. Investigate Enron
The Enron scandal and collapse was one of the largest corporate meltdowns in history.
In the year 2000, Enron was one of the largest energy companies in America. Then, after being outed for fraud, it spiraled downward into bankruptcy within a year.
Luckily for us, we have the Enron email database. It contains 500 thousand emails between 150 former Enron employees, mostly senior executives. It’s also the only large public database of real emails, which makes it more valuable.
In fact, data scientists have been using this dataset for education and research for years.
Examples of machine learning projects for beginners you could try include…
- Anomaly detection… Map the distribution of emails sent and received by hour and try to detect abnormal behavior leading up to the public scandal.
- Social network analysis… Build network graph models between employees to find key influencers.
- Natural language processing… Analyze the message bodies in conjunction with email metadata to classify emails based on their purposes.
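The anomaly detection idea can be sketched with nothing but the Python standard library. The timestamps below are made up, with one burst of late-night email injected on purpose. With the real dataset you would parse them from each message's Date header. The z-score threshold of 3 is an arbitrary assumption.

```python
import random
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean, stdev

random.seed(0)
start = datetime(2001, 10, 1)

# Hypothetical sent-timestamps: busy office hours, quiet nights, for 30 days.
timestamps = []
for day in range(30):
    for hour in range(24):
        typical = 20 if 9 <= hour <= 17 else 2
        count = max(0, typical + random.randint(-2, 2))
        timestamps += [start + timedelta(days=day, hours=hour)] * count

# Inject a burst of late-night email on one day to act as the anomaly.
burst_time = start + timedelta(days=20, hours=23)
timestamps += [burst_time] * 40

# Count emails per (date, hour) bucket; missing buckets count as zero.
counts = Counter((ts.date(), ts.hour) for ts in timestamps)
days = sorted({d for d, _ in counts})

# Compare each bucket with the same hour on all the other days.
anomalies = []
for hour in range(24):
    series = [counts[(d, hour)] for d in days]
    mu, sigma = mean(series), stdev(series)
    for d, c in zip(days, series):
        if sigma > 0 and (c - mu) / sigma > 3:
            anomalies.append((d, hour, c))

print(anomalies)
```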
- Enron Email Dataset – This is the Enron email archive hosted by CMU.
- Description of Enron Data (PDF) – Exploratory analysis of Enron email data that could help you get your grounding.
6. Write ML Algorithms from Scratch
Writing machine learning algorithms from scratch is an excellent learning tool for two main reasons.
First, there’s no better way to build true understanding of their mechanics. You’ll be forced to think about every step, and this leads to true mastery.
Second, you’ll learn how to translate mathematical instructions into working code. You’ll need this skill when adapting algorithms from academic research.
To start, we recommend picking an algorithm that isn’t too complex. There are dozens of subtle decisions you’ll need to make for even the simplest algorithms.
After you’re comfortable building simple algorithms, try extending them for more functionality. For example, try extending a vanilla logistic regression algorithm into a regularized one by adding an L1 (lasso-style) or L2 (ridge-style) penalty.
Finally, here’s a tip every beginner should know: Don’t be discouraged if your algorithm is not as fast or fancy as those in existing packages. Those packages are the fruits of years of development!
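Here is one way that extension might look, as a minimal NumPy sketch. It implements logistic regression with plain batch gradient descent and an optional L2 (ridge-style) penalty. The learning rate, iteration count, and synthetic data are assumptions made for the example. An L1 (lasso-style) penalty needs a different update rule and is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, l2=0.0, lr=0.1, n_iter=2000):
    """Batch gradient descent on mean log-loss plus (l2 / 2) * ||w||^2.

    The bias term b is not penalized.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y           # predicted probability minus label
        grad_w = X.T @ error / n_samples + l2 * w
        grad_b = error.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Synthetic, linearly separable data for a quick check.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(int)

w_plain, b_plain = fit_logistic(X, y, l2=0.0)
w_ridge, b_ridge = fit_logistic(X, y, l2=1.0)

print("plain weights:", w_plain.round(2), "accuracy:", (predict(X, w_plain, b_plain) == y).mean())
print("ridge weights:", w_ridge.round(2))
```

The penalty shrinks the weights toward zero, which you can see by comparing the two weight vectors.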
- Python: Logistic Regression from Scratch
- Python: k-Nearest Neighbors from Scratch
- R: Logistic Regression from Scratch
7. Mine Social Media Sentiment
Social media has almost become synonymous with “big data” due to the sheer amount of user-generated content.
Mining this rich data can provide unprecedented ways to keep a pulse on opinions, trends, and public sentiment. Facebook, Twitter, YouTube, WeChat, WhatsApp, Reddit… the list goes on and on.
Furthermore, every generation is spending even more time on social media than their predecessors. This means that social media data will become even more relevant for marketing, branding, and business as a whole.
While there are many popular social media platforms out there, Twitter is the classic entry point for practicing machine learning.
With Twitter data, you get an interesting blend of data (tweet contents) and meta-data (location, hashtags, users, re-tweets, etc.) that opens up nearly endless paths for analysis.
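As a starting point for sentiment analysis, the sketch below trains a bag-of-words Naive Bayes classifier with scikit-learn. The tweets and labels are invented for the example. A real project would need thousands of labeled tweets, and this particular model is just one simple baseline among many.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled tweets (1 = positive, 0 = negative).
tweets = [
    "I love this phone, the camera is great",
    "What a great game, best night ever",
    "So happy with the new update, love it",
    "Great service and friendly staff",
    "I hate waiting, this delay is terrible",
    "Worst customer support ever, so angry",
    "The update is terrible and full of bugs",
    "I hate this app, it keeps crashing",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Turn each tweet into word counts, then fit a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)

new_tweets = ["love the great camera", "terrible app, I hate it"]
predictions = model.predict(new_tweets)
print(list(zip(new_tweets, predictions)))
```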
- Python: Mining Twitter Data – How to perform sentiment analysis on Twitter data
- R: Sentiment analysis with machine learning – Short and sweet sentiment analysis tutorial
- Twitter API – The twitter API is a classic source for streaming data. You can track tweets, hashtags, and more.
- StockTwits API – StockTwits is like a twitter for traders and investors. You can expand this dataset in many interesting ways by joining it to time series datasets using the timestamp and ticker symbol.
8. Improve Health Care
Another industry that’s undergoing rapid changes thanks to machine learning is global health and health care.
In most countries, becoming a doctor requires many years of education. It’s a demanding field with long hours, high stakes, and an even higher barrier to entry.
As a result, there has recently been significant effort to alleviate doctors’ workload and improve the overall efficiency of the health care system with the help of machine learning.
Use cases include:
- Preventative care… Predicting disease outbreaks on both the individual and the community level.
- Diagnostic care… Automatically classifying image data, such as scans, x-rays, etc.
- Insurance… Adjusting insurance premiums based on publicly available risk factors.
As hospitals continue to modernize patient records and as we collect more granular health data, there will be an influx of low-hanging fruit opportunities for data scientists to make a difference.
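For a small, hands-on example in the diagnostic spirit, the sketch below fits a classifier to the breast cancer dataset bundled with scikit-learn, whose features were computed from digitized images of breast mass samples. The choice of logistic regression and the train/test split are assumptions for illustration. This is a learning exercise only, nowhere near a clinical tool.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 569 samples, 30 numeric features; target 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a plain logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, model.predict(X_test)))
```

In health care, accuracy alone rarely tells the whole story, so the confusion matrix is worth a look: missing a malignant case is usually much costlier than a false alarm.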
- R: Building meaningful machine learning models for disease prediction
- Machine Learning in Health Care – Excellent presentation by Microsoft Research
- Large Health Data Sets – Collection of large health-related datasets
- data.gov/health – Datasets related to health and health care provided by the U.S. government.
- Health Nutrition and Population Statistics – Global health, nutrition, and population statistics provided by the World Bank.