Analyzing news sources with Amazon AI/ML

I am analyzing news sources with Amazon AI/ML to see how positive or negative they are.
Why did I decide to explore this?
There are a few reasons.
One, I wanted more positivity in my life.
Two, I believe there are still good things happening around the world today.
Three, AI/ML is interesting and I wanted to learn more about it and use it to get positive results!
I started looking at the top news stories from CNBC, Buzzfeed, and NYTimes.
I read CNBC because I am interested in economics and how our financial markets are doing.
I included Buzzfeed and NYTimes because, according to the Wibbitz blog, they are the two sites most visited by millennials. [1]
I built a system that automatically fetches the top news from each of these sites, scrapes the content, and runs it through Amazon’s sentiment analysis.
Amazon sentiment analysis is a web service that uses ML/AI to determine the sentiment of a piece of text.
You send it some text and it returns a positive, a negative, a neutral, and a mixed score.
It also returns an overall sentiment label of positive, negative, neutral, or mixed.
One example they give is running sentiment analysis on the comments of a blog post to determine whether your readers liked it. [2]
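A minimal sketch of what such a call might look like, assuming the service is Amazon Comprehend's DetectSentiment operation reached through a boto3 client. This is illustrative only and is not the code that produced the results in this post; the helper name `score_text` is made up for the example.

```python
def score_text(client, text, language="en"):
    """Return (label, scores) for one piece of text.

    `client` is assumed to behave like boto3.client("comprehend");
    `score_text` itself is a hypothetical helper, not part of any library.
    """
    response = client.detect_sentiment(Text=text, LanguageCode=language)
    raw = response["SentimentScore"]
    # Normalize the key names so they match the wording used in this post.
    scores = {
        "positive": raw["Positive"],
        "negative": raw["Negative"],
        "neutral": raw["Neutral"],
        "mixed": raw["Mixed"],
    }
    return response["Sentiment"].lower(), scores


# Usage with real AWS credentials (assumption: boto3 is installed and configured):
#   import boto3
#   client = boto3.client("comprehend")
#   label, scores = score_text(client, "I loved this post!")
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at the real service.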
So far I have analyzed 1,030 articles and here are the results:
CNBC has a positive score of .025, a negative score of .039, a neutral score of .92, and a mixed score of .014.
NYTimes has a positive score of .055, a negative score of .044, a neutral score of .87, and a mixed score of .022.
I have results for Buzzfeed, but after reviewing the values I found that my code was not parsing all of the Buzzfeed pages correctly, so the results may be incorrect. I will work on it and post an update soon.
So, what does this mean?
The Amazon AI sentiment analysis tells you the probability that something is either positive, negative, neutral or mixed.
Supposedly, you can take these numbers and convert them into a probability that something is positive.
I asked on the Amazon forums and got an answer from a rep on Sep 15, 2018 that the example is outdated.
However, my instinct is that each number represents the percentage of each sentiment.
For example, when you add up the positive, negative, neutral, and mixed scores, you get approximately 1 (0.998 for CNBC and 0.991 for NYTimes).
Therefore, each number is the probability, expressed as a fraction, that the text has that sentiment.
In this case, CNBC has a 92% probability of being neutral and NYTimes has 87%.
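The arithmetic behind this reading can be checked with the averages reported above. The snippet is only a small sketch; the function name `summarize` is made up for illustration.

```python
# Averages as reported earlier in this post.
reported = {
    "CNBC": {"positive": 0.025, "negative": 0.039, "neutral": 0.92, "mixed": 0.014},
    "NYTimes": {"positive": 0.055, "negative": 0.044, "neutral": 0.87, "mixed": 0.022},
}


def summarize(scores):
    """Return the rounded sum of the four scores and the largest category."""
    total = round(sum(scores.values()), 3)
    dominant = max(scores, key=scores.get)
    return total, dominant


for source, scores in reported.items():
    total, dominant = summarize(scores)
    print(f"{source}: total={total}, dominant={dominant} ({scores[dominant]:.0%})")
# CNBC: total=0.998, dominant=neutral (92%)
# NYTimes: total=0.991, dominant=neutral (87%)
```

Both totals come close to 1, which is consistent with treating each score as a share of the whole.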
The NYTimes positive score is 5.5% vs. 2.5% for CNBC.
This may sound good but is it better for a news article to be neutral or positive?
These are questions that I love to explore.
Since I combine all the articles into a single number, I do not know which articles are deemed positive or negative.
Next time, I will break it down per article and see how accurate the sentiment analysis is.
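One possible shape for that per-article breakdown is sketched below. It assumes each article's result is kept as a `(source, label, scores)` tuple; the function name `breakdown` and the sample values are invented for illustration and are not real results.

```python
from collections import Counter, defaultdict
from statistics import mean


def breakdown(results):
    """Group per-article results by source.

    `results` is an iterable of (source, label, scores) tuples, where
    `scores` maps each sentiment name to its score for that article.
    """
    labels = defaultdict(Counter)
    per_source = defaultdict(list)
    for source, label, scores in results:
        labels[source][label] += 1
        per_source[source].append(scores)

    summary = {}
    for source, score_list in per_source.items():
        # Average each sentiment score across the source's articles.
        averages = {
            name: mean(s[name] for s in score_list) for name in score_list[0]
        }
        summary[source] = {"labels": dict(labels[source]), "averages": averages}
    return summary


# Made-up sample data, only to show the shape of the input.
sample = [
    ("CNBC", "neutral", {"positive": 0.1, "negative": 0.1, "neutral": 0.7, "mixed": 0.1}),
    ("CNBC", "positive", {"positive": 0.6, "negative": 0.1, "neutral": 0.2, "mixed": 0.1}),
]
print(breakdown(sample)["CNBC"]["labels"])
```

Keeping the label counts next to the averages makes it possible to see which articles pull a source's number up or down, instead of only the combined figure.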

Demystifying Data Science Post

I read Demystifying Data Science For All from Business over Broadway. For someone new to and interested in the space, this article is great.

Generally speaking, data science is a way of extracting value and insights from data using the powers of computer science and statistics applied to a specific field of study – Business over Broadway

I studied Biomedical Engineering and I work with data for my research. What makes a data scientist’s work different is the increased size of the data set and the variable structure of that data. Today, we have data from servers, logs, mobile, IoT, and 3rd party integrations such as Salesforce, chat, Zendesk, etc. To complicate matters, some of the data are not structured.

I’m still not sure how you can analyze data that are not structured. The data I have worked with have all been structured. Sometimes, I need to transform that data so it can be easily analyzed.

The sheer volume of recorded data is called Big Data since there is a lot of it; hence, Big. Better technology enables us to store and process Big Data. Businesses want to understand and analyze this massive data to gain competitive advantage.

At my work, security is another factor. Massive amounts of data are collected, and security measures have to be taken so that analysis does not reveal sensitive information. The article did not mention security, but it should be taken seriously in any organization working with sensitive data.

I am surprised to see IBM listed as the leader in Data Science Platforms. Living in SF, I thought it would be Google or Amazon. They have the expertise in-house, so they should be able to monetize it. Perhaps it is more valuable for them to use that expertise on their core businesses than to provide services to other companies.

My interest in Big Data is automation. How can someone take all this data and generate something useful with the least amount of resources? That’s where technology and computer science come in.