Posts

WEEK-17 ( 31/05/2021 - 04/06/2021 )

 This week our learning about recommender systems continued, we made further progress on our project, and we also learned about Text Analysis.

Text Vectorization

Text vectorization is the process of converting text into a numerical representation. Some popular methods to accomplish text vectorization are:

- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word2Vec

TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency, which measures how important a word is within a corpus or dataset. It combines two concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency

Term frequency is how frequently a word appears in a document. Since documents are not all the same length, a word in a long document may occur more often than the same word in a shorter one, so the raw count is normalized by the document length. Term frequency can be defined as:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

Inverse Document Frequency

Inverse document frequency is another concept which is used for finding ou...
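The TF and IDF definitions above can be sketched from scratch in a few lines (a minimal illustration with a made-up three-document corpus; a real pipeline would typically use a library such as scikit-learn's TfidfVectorizer):

```python
import math

def tf_idf(corpus):
    """Compute TF-IDF scores for every word in every document of a corpus."""
    n_docs = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    # Document frequency: in how many documents each word appears
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in docs:
        doc_scores = {}
        for word in set(doc):
            tf = doc.count(word) / len(doc)    # term frequency, length-normalized
            idf = math.log(n_docs / df[word])  # inverse document frequency
            doc_scores[word] = tf * idf
        scores.append(doc_scores)
    return scores

corpus = ["the cat sat", "the dog sat", "the cat ran"]
scores = tf_idf(corpus)
```

Note that a word appearing in every document (here "the") gets an IDF of log(3/3) = 0, so its TF-IDF score is zero: words common to the whole corpus carry no discriminating information.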

WEEK-16 ( 24/05/2021 - 28/05/2021 )

 Week 16 covered all aspects of recommender systems, and a News Recommender assignment was given to us to implement everything we learned during the week. The assignment problem is discussed at the end of the blog.

Recommender Systems

Introduction

During the last few decades, with the rise of YouTube, Amazon, Netflix, and many other such web services, recommender systems have taken more and more place in our lives. From e-commerce (suggesting articles to buyers that could interest them) to online advertisement (suggesting the right content to users, matching their preferences), recommender systems are today unavoidable in our daily online journeys.

In a very general way, recommender systems are algorithms aimed at suggesting relevant items to users (items being movies to watch, text to read, products to buy, or anything else depending on the industry).

Outline

In the first section, we are going to overview the two...

WEEK-15 ( 17/05/2021 - 21/05/2021 )

 This week we were taught about KD trees and about similarity and distance metrics such as Euclidean distance and Pearson's correlation. Let's start with KD trees.

KD Tree Algorithm

The KD tree algorithm is one of the most commonly used nearest-neighbor algorithms. At each node, the data points are split into two sets. The KD tree is a binary tree, so every node has at most two children. The split criterion chosen is often the median.

On the right side of the image below you can see the exact positions of the data points; on the left side, their spatial partitioning. (Figure: data points and their position in a coordinate system.)

The KD tree algorithm first splits on the median of the first axis and then, in the second layer, on the median of the second axis. We start with the x-axis. Sorted in ascending order, the x-values are 1, 2, 3, 4, 4, 6, 7, 8, 9, 9, and the median is taken as 6 (the upper of the two middle values). The data points are then divided into those smaller than 6 and those greater than or equal to 6. T...
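The median-split construction described above can be sketched in a few lines (a simplified sketch for 2-D points; the sample coordinates are illustrative, not the ones from the figure):

```python
def build_kdtree(points, depth=0):
    """Recursively build a 2-D KD tree, alternating the split axis per level."""
    if not points:
        return None
    axis = depth % 2  # 0 = x-axis, 1 = y-axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median element (upper middle for even counts)
    return {
        "point": points[mid],                            # split point at this node
        "left": build_kdtree(points[:mid], depth + 1),   # points below the median
        "right": build_kdtree(points[mid + 1:], depth + 1),  # points at/above it
    }

points = [(1, 9), (2, 3), (4, 1), (3, 7), (5, 4), (6, 8), (7, 2), (8, 8), (9, 6)]
tree = build_kdtree(points)
```

The root splits on the x-median, its children split on the y-median, and so on, which is exactly the layer-by-layer axis alternation the text describes.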

WEEK-14 ( 26/04/2021 - 30/04/2021 )

 This week we were taught about genetic algorithms; everything discussed during the week is summarized below.

Introduction to Genetic Algorithms

A genetic algorithm is a search heuristic inspired by Charles Darwin's theory of natural evolution. It reflects the process of natural selection, where the fittest individuals are selected for reproduction in order to produce the offspring of the next generation.

Notion of Natural Selection

The process of natural selection starts with the selection of the fittest individuals from a population. They produce offspring which inherit the characteristics of the parents and are added to the next generation. If the parents have better fitness, their offspring will tend to be better than the parents and have a better chance of surviving. This process keeps iterating, and at the end a generation with the fittest individuals will be found.

This notion can be applied to a search problem: we consider a set of solut...
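The selection-crossover-mutation loop can be sketched on a classic toy problem: maximizing the number of 1s in a bitstring. All parameters here (population size, generations, mutation rate) are arbitrary illustrative choices:

```python
import random

def genetic_onemax(length=20, pop_size=30, generations=50, seed=0):
    """Toy genetic algorithm maximizing the number of 1s in a bitstring."""
    rng = random.Random(seed)
    fitness = lambda ind: sum(ind)  # more 1s = fitter individual
    # Initial population: random bitstrings
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        def select():
            # Selection: tournament of size 2, fitter individual wins
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = select(), select()
            cut = rng.randrange(1, length)   # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.05:          # mutation: flip one random bit
                i = rng.randrange(length)
                child[i] ^= 1
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

best = genetic_onemax()
```

After a few dozen generations the best individual should be far fitter than a random bitstring (whose expected fitness is length/2), illustrating how selection pressure accumulates good traits across generations.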

WEEK-13 ( 19/04/2021 - 23/04/2021 )

 Week 13 started by continuing the Unsupervised Learning topic.

What Is the Problem with the K-Means Method?

The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.

K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

Drawbacks of Partitional Clustering

Natural clusters may be split when:
- the clusters are not well separated by large inter-cluster distances
- the clusters have very different sizes
- the cluster shapes are not convex

Hierarchical Clustering

Creates a nested series of partitions represented in the form of a dendrogram, which shows how objects are grouped together step by step.

Typical stopping criteria:
- the number of clusters
- the minimum distance between clusters being greater than a user-defined threshold

Types:
- Agglomerative: starts by assuming that each object is a separate cluster. Su...
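The agglomerative idea, merging the closest pair of clusters until the desired number remains, can be sketched with single linkage (the minimum distance between any two points of two clusters). The sample points are illustrative:

```python
def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters clusters."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5  # Euclidean
    clusters = [[p] for p in points]  # start: each point is its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = agglomerative(points, 2)  # the two natural groups are recovered
```

Here the "number of clusters" stopping criterion from the text is used; stopping instead when the smallest merge distance exceeds a threshold would implement the second criterion.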