On April 23, 2005, the first-ever YouTube video was uploaded. Since then, YouTube has grown tremendously in popularity, and new videos on all kinds of topics are uploaded every day. YouTube has become the largest search engine for videos, and more than 500 hours of video content are uploaded to it each minute. With over two billion people (about ⅓ of all people on the internet) searching for videos, it is extremely important to be able to categorize videos effectively, for several reasons.
YouTube currently has 31 different categories, including gaming, comedy, people & blogs, and how to & style. Viewers can use these categories to find videos relating to their interests and to discover new videos. Content creators can also see what kinds of videos similar YouTubers are creating in order to find opportunities to collaborate or to research the competition. One of the most important reasons for YouTube to identify video categories correctly is to help sell ad space effectively. For example, a credit card business would want its YouTube ads to target the demographic of people who might want to apply for a new credit card. Putting its ads on YouTube videos about personal finance would be much more effective than having its ads on a YouTube prank channel’s videos. Thus, finding the right category of videos to purchase ad space on could make a huge difference in a company’s sales or conversion rates.
One thing to keep in mind is that, on YouTube, the video uploader is the one who chooses which category a video belongs to. Our goal was to create a better categorization of the videos on YouTube, rather than relying on the uploader to pick a category, in order to provide better ways to sell ad space on YouTube videos.
Because models built on metadata alone often perform poorly, we can also incorporate the textual features into a predictive model. Natural language processing makes it possible to convert each of these features into a word embedding. This process aims to represent each segment of text as a high-dimensional vector, such that two words with similar meanings have a shorter Euclidean distance between them than two words with opposite or unrelated meanings. The process can be demonstrated with a two-dimensional projection of the word embeddings assigned to each of the category titles: using principal component analysis, it is possible to reduce the 512 dimensions present in these embeddings to a two-dimensional plot. Although such a plot depicts only an approximation of the actual embedding vectors, it does demonstrate how similar concepts, such as “Education” and “Science & Technology”, are placed in similar Euclidean positions.
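The projection described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-dimensional vectors are made-up stand-ins for the real 512-dimensional embeddings, the helper names (`euclidean`, `top_component`, `pca_2d`) are hypothetical, and the principal components are found with power iteration, which is one simple way to compute PCA and not necessarily the method used for the original plot.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]

def top_component(rows, iters=100, seed=0):
    """Leading principal direction of centered rows, via power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in rows[0]]
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]                 # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows))   # X^T (X v)
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            raise ValueError("data has no remaining variance")
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    pc1 = top_component(x, seed=0)
    # Deflation: remove the pc1 part of every row, then repeat for pc2.
    residual = []
    for r in x:
        s = dot(r, pc1)
        residual.append([a - s * b for a, b in zip(r, pc1)])
    pc2 = top_component(residual, seed=1)
    points = [(dot(r, pc1), dot(r, pc2)) for r in x]
    return points, pc1, pc2

# Made-up toy "embeddings" (assumption: real ones would have 512 dimensions).
toy = {
    "Education":            [0.9, 0.8, 0.1, 0.0],
    "Science & Technology": [0.8, 0.9, 0.2, 0.1],
    "Comedy":               [0.1, 0.0, 0.9, 0.8],
    "Entertainment":        [0.2, 0.1, 0.8, 0.9],
}

names = list(toy)
points, pc1, pc2 = pca_2d([toy[n] for n in names])
projected = dict(zip(names, points))

for name, (px, py) in projected.items():
    print(f"{name:22s} {px:+.3f} {py:+.3f}")
```

In this toy data, “Education” sits closer to “Science & Technology” than to “Comedy”, both in the original space and after projection, which is the behaviour the two-dimensional plot is meant to show.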
“YouTube is something that looks like reality, but it is distorted to make you spend more time online. The recommendation algorithm is not optimising for what is truthful, or balanced, or healthy for democracy.” -The Guardian
YouTube makes money by keeping viewers on its site and watching videos. Its most important performance metric is total watch time: more time watching videos means more time watching advertisements and more money for YouTube. This is where YouTube’s recommendation algorithm comes into play. By automatically playing the next video in the recommendation list, YouTube draws viewers into more content similar to their interests. Google Brain, the group of engineers at Google responsible for the recommendation algorithm, started making recommendations three years ago. In that time, watch time on YouTube has grown twenty times over, and seven out of ten videos people watch are suggested by the recommendation algorithm. Although recommending similar videos seems innocent, YouTube promotes controversial videos that fuel the spread of fake news and disinformation, feeding them directly to a large majority of the population. For example, YouTube’s recommendation algorithm may have played a strong role in the spread of disinformation during the 2016 presidential election.
In this regard, many predictive models based on the YouTube recommendation algorithm do not yield Weapons of Math Destruction (WMDs). Classifying videos could be a much safer way to group similar videos together and could give viewers a better alternative to the recommendation algorithm for finding similar videos to watch. The predictions of our model are easy to verify by watching the videos, and they do not have the negative consequences of the recommendation algorithm, since they are based on the actual content of the videos and not on a user’s viewing history. The training and testing sets for the model can also be rigorously separated, with results that agree with experimental measurements. A classification model also does not discriminate against any group or individual based on race, sex, color, and so on. It predicts a video’s category from its description and content data, completely independently of any user’s personal information. For this reason, testing for fairness is not very important for our project.
Even with these results, YouTube may not put them into production to influence its recommendation algorithm, since its current structure is very profitable. On the other hand, a classification algorithm could be put into production to market ad space more effectively to advertising agencies. With more accurate video classifications, companies would see better returns on their investment in advertisements on YouTube videos.
To test this hypothesis, we can use the Kaggle “Trending YouTube Video Statistics” data set, which contains several months’ worth of data scraped from the trending page on YouTube. The original data set compiles the results of this scraper across several geographic regions: USA, Great Britain, Germany, Canada, and France. For this project, we can choose to focus specifically on the US data set, which contains just over forty thousand unique videos, divided into 31 possible categories.
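A minimal sketch of how the US data could be loaded and summarized by category is shown here. The column names (`video_id`, `category_id`, and so on) and the file name mentioned afterwards are assumptions about how the Kaggle CSV is laid out, and the rows are a tiny made-up sample rather than real data. Because a video can trend on more than one day, the sketch keeps only the first row seen for each `video_id` before counting.

```python
import csv
import io
from collections import Counter

def unique_videos(csv_file):
    """Return one row per video_id, keeping the first occurrence."""
    seen = {}
    for row in csv.DictReader(csv_file):
        seen.setdefault(row["video_id"], row)
    return list(seen.values())

def category_counts(videos):
    """Count how many unique videos fall into each category id."""
    return Counter(int(v["category_id"]) for v in videos)

# Made-up sample standing in for the real file (assumed column names;
# the category ids here are illustrative only).
SAMPLE = """video_id,trending_date,title,category_id,likes,comment_count
a1,17.14.11,Budgeting basics,27,100,10
a1,17.15.11,Budgeting basics,27,150,12
b2,17.14.11,Prank compilation,23,900,80
c3,17.14.11,Speedrun highlights,20,500,40
"""

videos = unique_videos(io.StringIO(SAMPLE))
counts = category_counts(videos)
print(len(videos), "unique videos;", dict(counts))
```

With the real data, the same functions could be called on an open file handle, for example `open("USvideos.csv", newline="", encoding="utf-8")`, assuming that is the name of the downloaded US file.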
The original data set contains features which can be divided into three classes. First, there is the category id, which serves as the output variable for our first set of models. Next, there is a set of features comprising the video’s “metadata”: these are the general summary statistics detailing a video’s performance, such as the number of likes or comments.
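The grouping of features can be sketched as a small helper that splits one row into its parts. The column names are assumptions about the CSV layout, and treating the title, tags, and description as the third (textual) class is inferred from the earlier discussion of textual features rather than stated here; `split_features` is a hypothetical name.

```python
TARGET = "category_id"
# Assumed metadata columns: summary statistics of a video's performance.
METADATA = ("views", "likes", "dislikes", "comment_count")
# Assumed textual columns (the third class is an inference, see above).
TEXT = ("title", "tags", "description")

def split_features(row):
    """Split one CSV row (a dict of strings) into target, metadata, text."""
    target = int(row[TARGET])
    metadata = {name: int(row[name]) for name in METADATA}
    text = {name: row[name] for name in TEXT}
    return target, metadata, text

# Made-up example row, not taken from the real data set.
example_row = {
    "video_id": "a1",
    "category_id": "27",
    "views": "1200",
    "likes": "100",
    "dislikes": "3",
    "comment_count": "10",
    "title": "Budgeting basics",
    "tags": "finance|budget",
    "description": "How to build a monthly budget.",
}

target, metadata, text = split_features(example_row)
print(target, metadata, sorted(text))
```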