Below you will find pages that utilize the taxonomy term “Data Cleaning”
Ds
Project 6: Indeed Job Scraping
Used the BeautifulSoup library to scrape Indeed job listings for a specified job title and location. Stored the output in a CSV file so that listings can be compared side by side in a single file, without opening many tabs or going back and forth within one. This helped us filter out irrelevant listings before visiting the Indeed website for further details and applying for the jobs.
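The store-and-filter step can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the records, the field names, and the `filter_jobs` and `to_csv` helpers are all hypothetical, and the scraping itself is left out.

```python
import csv
import io

# Hypothetical scraped records; field names are illustrative, not Indeed's.
jobs = [
    {"title": "Data Analyst", "company": "Acme", "location": "Remote"},
    {"title": "Sales Executive", "company": "Globex", "location": "Remote"},
    {"title": "Junior Data Scientist", "company": "Initech", "location": "Remote"},
]

def filter_jobs(records, keyword):
    """Keep records whose title contains the keyword (case-insensitive)."""
    return [r for r in records if keyword.lower() in r["title"].lower()]

def to_csv(records, fieldnames):
    """Serialise records to CSV text; write the string to a file to persist it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

relevant = filter_jobs(jobs, "data")
csv_text = to_csv(relevant, ["title", "company", "location"])
print(csv_text)
```

Keeping every listing as one row with the same columns is what makes side-by-side comparison in a spreadsheet straightforward.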
Ds
Project 3: IMDB Movie Reviews Sentiment Analysis
Preprocessed the review texts by removing special characters and stopwords and applying stemming. Converted the text into numerical features using TF-IDF. Built Multinomial Naive Bayes and Logistic Regression models to predict positive and negative sentiment, achieving an F1 score of around 0.85. Plotted two word clouds to show the most common words in positive and negative reviews respectively.
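To make the preprocessing and TF-IDF steps concrete, here is a small hand-rolled sketch using only the standard library. It is an assumption-laden illustration rather than the project's pipeline: the stopword list is a tiny made-up sample, stemming is omitted, and the weighting is the plain `tf * log(N / df)` form, whereas library implementations often use smoothed variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "was", "is", "this", "it", "of"}

def preprocess(text):
    """Lowercase, replace non-letter characters with spaces, drop stopwords."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenised = [preprocess(d) for d in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up example reviews, not taken from the IMDB dataset.
reviews = [
    "This movie was great, a great cast!",
    "This movie was boring and the plot was bad.",
]
vectors = tfidf(reviews)
```

A term such as "movie" that appears in every review gets a weight of zero, while a term unique to one review, such as "great", gets a positive weight. That is the property that lets a classifier focus on sentiment-bearing words.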
Ds
Project 2: President Trump's Lies Dataset
Scraped all lies of President Trump in 2017 from this website. Used the requests library to get the HTML source of the website. Used the BeautifulSoup library to parse the HTML and match the specific HTML tag containing the details of each lie. Stored the details (the date, the lie itself, the explanation, and the URL linked to that explanation) in a CSV file. All President Trump's Lies in 2017 (Source: The New York Times) Link to Notebook
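The parse-and-store steps might look roughly like the sketch below. It runs on an inline HTML string standing in for what `requests.get(url).text` would return, so no network access is needed. The markup, the `entry` and `claim` class names, and the example rows are hypothetical placeholders, not the structure or content of the actual page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for the HTML that requests.get(url).text would return.
# The markup and class names here are hypothetical.
html = """
<div>
  <p class="entry"><strong>Jan. 21</strong>
     <span class="claim">Example claim one.</span>
     <a href="https://example.com/a">(Example explanation one.)</a></p>
  <p class="entry"><strong>Jan. 23</strong>
     <span class="claim">Example claim two.</span>
     <a href="https://example.com/b">(Example explanation two.)</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for entry in soup.find_all("p", class_="entry"):
    link = entry.find("a")
    rows.append({
        "date": entry.find("strong").get_text(strip=True),
        "lie": entry.find("span", class_="claim").get_text(strip=True),
        # Drop the surrounding parentheses from the explanation text.
        "explanation": link.get_text(strip=True).strip("()"),
        "url": link["href"],
    })

# Serialise to CSV text; write this string to a file to persist it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "lie", "explanation", "url"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Matching on one repeated tag per record keeps the extraction loop simple: each match yields exactly one CSV row.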
Ds
Project 1: Life Expectancy Predictor
Created a web app that predicts life expectancy based on lifestyle and demographic factors using multiple linear regression. Performed feature selection and discarded the factors that were not statistically significant predictors of life expectancy at the 0.05 level. Found that the number of years of schooling was the feature most strongly correlated with life expectancy. Built the web app using R and its Shiny package. Try the app here