Sunday, 20 October 2019

How to Find Out Which Videos I Like Most on YouTube [Text Extraction and Analysis] Using Beautiful Soup, Selenium, and the YouTube API in Python



Okay, I started this project on Saturday, 19 October 2019. The reason I wanted to build it is simple: I wanted to know which topics I like most on YouTube and which channels are my favorites.

Introduction 

1. Selenium

So, what is Selenium? "Selenium automates browsers."

Modern web applications commonly use infinite scroll to load data instead of pagination. We know that JavaScript-heavy pages are harder to scrape than server-side-rendered ones (a classic PHP site, for example), because most JavaScript sites render their content on the client.

YouTube uses client-side rendering and infinite scroll. To work around that limitation, I use Selenium, which is commonly used for automated testing. With this library, you can control a web browser and interact with the website. I used it to load all the video data on the page in the Chrome browser and then scrape it, as sketched below.
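Here is a minimal sketch of that scrolling idea (the URL and iteration count are placeholders; the full version appears in the complete code at the end):

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

# Assumes chromedriver is on your PATH; you can also pass its location explicitly
driver = webdriver.Chrome()
driver.get("https://www.youtube.com")

# Send END to the page a few times so the infinite scroll keeps loading items
elem = driver.find_element_by_tag_name('html')
for i in range(5):
    elem.send_keys(Keys.END)
    time.sleep(5)  # give the page time to fetch the next batch

html = driver.page_source  # fully rendered HTML, ready for parsing
driver.quit()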

2. Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files. It is my favorite library for scraping and parsing HTML pages because it is easy to use and saves a lot of time compared to manual parsing.
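For example, pulling every video link out of a rendered page takes only a few lines (the HTML string here is a stand-in for the real page source):

from bs4 import BeautifulSoup

html = '<a id="video-title" href="/watch?v=abc123">My Video</a>'
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all("a", {'id': ['video-title']}, href=True):
    print(tag['href'])  # -> /watch?v=abc123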

3. YouTube API

YouTube is a video streaming platform owned by Google where you can watch a lot of videos. Google opens its API services for YouTube, which helps developers and everyone else who wants to use YouTube as a third-party service and to gather information from it.


The Goal
The goal of this project is to give me information about:
  • The topics I like most, based on video category.
  • My favorite channels.
  • The words that most commonly appear in the videos I like, based on their titles and tags.


1. Let's Get Started
First, install Anaconda and use Jupyter Notebook to run your Python code. Here is the download link: https://www.anaconda.com/distribution/

After you have finished installing Anaconda, install Selenium by running the command below:
pip install selenium

Create your first project in Jupyter Notebook.

After that, we need to download the web driver so we can control Chrome from our code. Here is the download link; choose the version that is compatible with your browser: https://sites.google.com/a/chromium.org/chromedriver/downloads

More information about the Chrome web driver: https://github.com/SeleniumHQ/selenium/wiki/ChromeDriver

Note: if you are using macOS, you must first allow the driver to run in your security settings.
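Once downloaded, you point Selenium at the driver binary when you create the browser instance. A minimal smoke test (the path below is an assumption; use wherever you saved chromedriver):

from selenium import webdriver

# The path is an assumption; replace it with the location of your chromedriver binary
driver = webdriver.Chrome('/Users/you/Downloads/chromedriver')
driver.get("https://www.youtube.com")
print(driver.title)
driver.quit()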

2. Getting a YouTube API Key
In this step, I need the YouTube API to fetch the details of each video. The fields I use are the title, description, category, channel ID, channel name, and tags.

Before you can use the YouTube API, you must register your application to get access permission and an API key.
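With a key in hand, fetching those fields is a single GET request against the videos endpoint. A minimal sketch (the key and video ID are placeholders):

import json
import requests

API_KEY = 'YOUR_API_KEY'    # placeholder: create one in the Google API Console
VIDEO_ID = 'dQw4w9WgXcQ'    # placeholder: any public video ID works

url = ('https://www.googleapis.com/youtube/v3/videos'
       '?id=' + VIDEO_ID + '&key=' + API_KEY + '&part=snippet')
snippet = json.loads(requests.get(url).text)['items'][0]['snippet']

# All the fields used in this project live under "snippet"
print(snippet['title'], snippet['channelTitle'], snippet['categoryId'], snippet.get('tags'))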

3. Save the Data to a CSV File

The YouTube API has a daily request quota, so I decided to save the data to local storage. Instead of saving it in a MySQL database, I prefer to save it in a CSV file because it is easy to save and load the data without writing queries first.
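Appending rows and reading them back with the standard csv module looks like this (the rows here are stand-ins; the real ones come from the Video objects below):

import csv

rows = [['id1', 'First Title'], ['id2', 'Second Title']]  # stand-in rows

# 'a' appends, so repeated runs accumulate rows instead of overwriting
with open('My-Youtube-Data.csv', 'a') as f:
    csv.writer(f).writerows(rows)

with open('My-Youtube-Data.csv', 'r') as f:
    data = list(csv.reader(f))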

4. Stop Words

I use stop words to filter out common words like "the", "a", and "that".
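With the stop-words package (pip install stop-words), the filter is a one-line list comprehension; both English and Indonesian word lists are available:

from stop_words import get_stop_words

words = ['the', 'best', 'python', 'tutorial', 'a']
filtered = [w for w in words if w not in get_stop_words('en')]
print(filtered)  # ['best', 'python', 'tutorial']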

5. Present the Data as Charts and a Word Cloud

Category and channel counts are drawn as horizontal bar charts with pandas and matplotlib, while the title and tag text is turned into word clouds.
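A minimal sketch of both outputs (the counts dictionary is a stand-in for the real results):

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

counts = {'music': 10, 'python': 7, 'gaming': 3}  # stand-in counts

# Horizontal bar chart straight from a dict via a pandas Series
pd.Series(counts).plot.barh()
plt.show()

# Word cloud generated from free text
wc = WordCloud(background_color='white').generate('music python gaming music python music')
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()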


The Result

My Favorite Categories
[horizontal bar chart of liked videos per category]

My Favorite Channels
[horizontal bar chart of liked videos per channel]

Words by Title
[word cloud of the most common title words]

Words by Tag
[word cloud of the most common tags]
The Code (Jupyter Notebook)


from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import json
import requests as reqs
import pandas as pd
import csv
import ast
import numpy as np
import matplotlib.pyplot as plt
from stop_words import get_stop_words
import re

from os import path
from PIL import Image
import os

from wordcloud import WordCloud, STOPWORDS



class Video:
    def __init__(self, idx, title, description, date, channelID, channel, categoryID, tags):
        self.idx = idx
        self.title = title
        self.description = description
        self.date = date
        self.channelID = channelID
        self.channel = channel
        self.categoryID = categoryID
        self.tags = tags
    def asList(self):
        return [
            self.idx, 
            self.title,
            self.description,
            self.date,
            self.channelID,
            self.channel,
            self.categoryID,
            self.tags
        ]
    
    
def get_data(totalVideo, perPage, channelID, youtubeAPIKEY):

    # Open the channel's videos page in Chrome (adjust the driver path to your machine)
    driver = webdriver.Chrome('/Users/aprilian/Downloads/chromedriver')
    driver.get("https://www.youtube.com/channel/"+channelID+"/videos?view=15&flow=grid")

    # Press END repeatedly so the infinite scroll loads every batch of videos
    maxIterate = int(totalVideo/perPage)
    elem = driver.find_element_by_tag_name('html')
    for i in range(0, maxIterate):
        elem.send_keys(Keys.END)
        time.sleep(5)

    html = driver.page_source
    driver.quit()
    soup = BeautifulSoup(html, 'html.parser')

    videos = []
    videoList = []
    for tag in soup.find_all("a", {'id': ['video-title']}, href=True):
        url = tag['href']
        videoID = remove_prefix(url, '/watch?v=')
        video = get_video_detail(youtubeAPIKEY, videoID)
        videos.append(video)
        videoList.append(video.asList())

    # Append the rows to a local CSV so later runs can skip the API quota
    with open('My-Youtube-Data.csv', 'a') as csvFile:
        writer = csv.writer(csvFile)
        writer.writerows(videoList)

    return videos
        
def get_video_detail(youtubeAPIKEY, videoID):
    response = reqs.get('https://www.googleapis.com/youtube/v3/videos?id='+videoID+'&key='+youtubeAPIKEY+'&part=snippet,contentDetails,statistics,status&hl=id')
    response_dict = json.loads(response.text)

    item = response_dict['items'][0]
    snippet = item['snippet']
    vid_id = item['id']
    vid_title = snippet['title']
    vid_date = snippet['publishedAt']
    vid_desc = snippet['description']
    vid_channelID = snippet['channelId']
    vid_channel = snippet['channelTitle']
    vid_categoryId = snippet['categoryId']
    vid_tags = snippet.get('tags', [])  # some videos have no tags; default to an empty list
    
    return Video(vid_id, vid_title, vid_desc, vid_date, vid_channelID, vid_channel, vid_categoryId, vid_tags)
    
    
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text

def result_by_title(videos, youtubeAPIKEY):
    # Build one combined stop word list (English + Indonesian) up front
    stopwords = get_stop_words('en') + get_stop_words('id')

    text = ''
    counts = dict()
    for vid in videos:
        # Keep only word characters, then drop single letters and stop words
        title = re.sub(r'\W+', ' ', vid.title.lower())
        title = [w for w in title.split() if len(w) > 1 and w not in stopwords]

        text += ' '.join(title) + ' '

        for t in title:
            counts[t] = counts.get(t, 0) + 1

    getWordCloud(text)

    counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

    MAX_RESULTS = 18
    words = dict(counts[:MAX_RESULTS])

    # change the dictionary into a Series and show it as a horizontal bar chart
    print("======== Most Common Title Words ========")
    plt.figure(figsize=(10, 10))
    dfY = pd.Series(words, name='Title')
    dfY.plot.barh(rot=0)
    plt.show()
    

def result_by_category(videos, youtubeAPIKEY):
    counts = dict()
    for vid in videos:
        counts[vid.categoryID] = counts.get(vid.categoryID, 0) + 1

    counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

    # Resolve each category ID to its human-readable name via the API
    categories = dict()
    for r in counts:
        response = reqs.get('https://www.googleapis.com/youtube/v3/videoCategories?id='+r[0]+'&key='+youtubeAPIKEY+'&part=snippet&hl=id')
        response_dict = json.loads(response.text)
        category = response_dict['items'][0]['snippet']['title']
        categories[category] = r[1]

    # change the dictionary into a Series and show it as a horizontal bar chart
    print("======== Most Liked Categories ========")
    plt.figure(figsize=(10, 10))
    dfX = pd.Series(categories, name='Category')
    dfX.plot.barh(rot=0)
    plt.show()



def result_by_channel(videos):
    counts = dict()
    for vid in videos:
        counts[vid.channel] = counts.get(vid.channel, 0) + 1

    counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

    MAX_RESULTS = 18
    channels = dict(counts[:MAX_RESULTS])

    # change the dictionary into a Series and show it as a horizontal bar chart
    print("======== Most Liked Channels ========")
    plt.figure(figsize=(10, 10))
    dfY = pd.Series(channels, name='Channel')
    dfY.plot.barh(rot=0)
    plt.show()


def result_by_tag(videos):
    text = ''
    counts = dict()
    for vid in videos:
        tag = vid.tags
        if not tag:
            continue  # skip videos without tags
        # Tags loaded from the CSV are stored as a string like "['a', 'b']",
        # so parse them back into a list first
        if isinstance(tag, str):
            tag = ast.literal_eval(tag)
        for t in tag:
            t = t.lower()
            counts[t] = counts.get(t, 0) + 1
            text += t + ' '

    getWordCloud(text)

    counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)

    MAX_RESULTS = 20
    tags = dict(counts[:MAX_RESULTS])

    # change the dictionary into a Series and show it as a horizontal bar chart
    print("======== Most Common Tags ========")
    plt.figure(figsize=(10, 10))
    dfY = pd.Series(tags, name='Tags')
    dfY.plot.barh(rot=0)
    plt.show()
    

def loadCSVData():
    with open('My-Youtube-Data.csv', 'r') as readFile:
        reader = csv.reader(readFile)
        data = list(reader)
    videos = []
    for v in data:
        if not v:
            continue  # skip any blank rows
        videos.append(Video(v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]))
    return videos


def getWordCloud(text):
    # Resolve an output directory whether we run as a script or in a notebook
    d = path.dirname(__file__) if "__file__" in globals() else os.getcwd()

    # read the mask image (optional)
    #alice_mask = np.array(Image.open(path.join(d, "owl.png")))

    stopwords = getStopwords()

    wc = WordCloud(background_color="white", max_words=2000,
                   stopwords=stopwords, contour_width=3, contour_color='steelblue')

    # generate the word cloud from the raw text
    wc.generate(text)

    # store to file
    wc.to_file(path.join(d, "result.png"))

    # show
    plt.figure(figsize=(20, 20))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    
    
def getStopwords():
    # Create the stopword list; "official" and the Indonesian word "di" show up a lot
    stopwords = set(STOPWORDS)
    stopwords.update(["official", "di"])
    return stopwords




youtubeAPIKEY = '[YOUR_YOUTUBE_API_KEY]'
channelID = '[YOUR_CHANNEL_ID]'
totalVideo = 1000
perPage = 30

print("starting and getting data...")

# Fetch fresh data from YouTube and append it to the local CSV.
# If you already have the CSV from a previous run, comment this line out.
videos = get_data(totalVideo, perPage, channelID, youtubeAPIKEY)
# Load the data back from the local CSV file
videos = loadCSVData()

print("processing data...")

result_by_title(videos, youtubeAPIKEY)

result_by_category(videos, youtubeAPIKEY)

result_by_channel(videos)

result_by_tag(videos)

print("finished...")


References:

Installing the Selenium WebDriver
https://sites.google.com/a/chromium.org/chromedriver/getting-started
https://stackoverflow.com/questions/55304226/selenium-webdriver-driver-issue-mac

Infinite Scroll with Selenium
https://stackoverflow.com/questions/32391303/how-to-scroll-to-the-end-of-the-page-using-selenium-in-python/32629481
https://stackoverflow.com/questions/55400703/how-to-scroll-down-in-youtube-using-selenium

Saving and Loading CSV Files
https://datatofish.com/export-dataframe-to-csv/
https://stackoverflow.com/questions/14037540/writing-a-python-list-of-lists-to-a-csv-file
https://www.programiz.com/python-programming/working-csv-files
https://www.dev2qa.com/python-read-write-csv-file-example/

Creating Functions in Python
https://www.w3schools.com/python/python_functions.asp

Creating Classes in Python
https://www.w3schools.com/python/python_classes.asp

Sorting a Dictionary
https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value

Charts
https://altair-viz.github.io/user_guide/data.html

Word Clouds
https://github.com/amueller/word_cloud
https://www.datacamp.com/community/tutorials/wordcloud-python

Removing a Prefix from a String
https://stackoverflow.com/questions/16891340/remove-a-prefix-from-a-string

Getting an Array from JSON
https://stackoverflow.com/questions/2687225/json-keyerror-with-json-loads

Stop Words
https://pypi.org/project/stop-words/
https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python

Removing Special Characters
https://stackoverflow.com/questions/43358857/how-to-remove-special-characters-except-space-from-a-file-in-python/43358965

Removing Single Letters
https://stackoverflow.com/questions/32705962/removing-any-single-letter-on-a-string-in-python




  
