Article Sentiment Through AI

This is part 1 of a series on using ML/AI to evaluate the sentiment of news articles.

This post builds on work from last week as I explore news articles with ML/AI. To recap, I aggregated the top news from CNBC and NYTimes and calculated their overall sentiment score. However, since all the news articles were combined, there was no way to evaluate them individually.

In this post, I will examine the sentiment of individual articles.

Methodology

Last week I used AWS Comprehend; however, this week I will be using Google Cloud Natural Language.

Why the change?

Because of AWS’s limitations. According to their guidelines and limits, the maximum size for sentiment detection is 5KB. At roughly one byte per English character, that is well under 1,000 words! If an article goes over the limit, I have to split it and analyze the pieces separately. Then I need to weigh the pieces appropriately and do a final calculation. I am lazy, so I sought a better solution. I found it with Google Cloud Natural Language.
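For the curious, here is a rough sketch of the split-and-weigh workaround I wanted to avoid. It is not code from my pipeline: the function names are made up for illustration, the 5,000-byte default is my reading of the 5KB limit, and weighting each piece by its byte size is just one reasonable choice.

```python
def split_by_bytes(text: str, limit: int = 5000) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most `limit` UTF-8 bytes."""
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        if word_size > limit:
            raise ValueError("a single word exceeds the byte limit")
        # The extra byte accounts for the joining space when the chunk is not empty.
        extra = word_size + (1 if current else 0)
        if size + extra > limit:
            chunks.append(" ".join(current))
            current, size = [word], word_size
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def weighted_score(chunks: list[str], scores: list[float]) -> float:
    """Average per-chunk scores, weighting each chunk by its size in bytes (an assumed scheme)."""
    weights = [len(chunk.encode("utf-8")) for chunk in chunks]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Hypothetical usage: the per-chunk scores would come from the sentiment service.
pieces = split_by_bytes("word " * 3000)
print(len(pieces), [len(p.encode("utf-8")) for p in pieces])
```

Splitting on whitespace can still cut a sentence in half, which may shift the sentiment of each piece. That is one more reason I preferred a service that takes the whole article.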

Note to businesses: This is a reason why customers switch to a different service.

Google Cloud Natural Language

Google Cloud Natural Language derives insights from unstructured text using Google machine learning.

Google’s sentiment analysis is less specific than AWS’s. Google provides two values: magnitude and score. A score of 0 is neutral. A score of less than 0 is considered to have negative emotion and a score that is greater than 0 is considered to have positive emotion. The magnitude indicates the level of emotional content. Pretty vague, but let’s take a look at some samples. You can read more about Google’s sentiment analysis here.
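To make the two numbers a bit more concrete, here is a minimal sketch of how one might turn a score and magnitude into a plain-English label. The function name and both thresholds (`neutral_band` and `strong_magnitude`) are my own assumptions for illustration; they are not values defined by Google, and you would tune them for your own data.

```python
def describe_sentiment(score: float, magnitude: float,
                       neutral_band: float = 0.25,
                       strong_magnitude: float = 10.0) -> str:
    """Label a (score, magnitude) pair using hypothetical, hand-picked thresholds."""
    if score > neutral_band:
        polarity = "positive"
    elif score < -neutral_band:
        polarity = "negative"
    else:
        polarity = "neutral"
    # Magnitude measures how much emotion there is, regardless of direction.
    strength = "high" if magnitude >= strong_magnitude else "low"
    return f"{polarity}, {strength} emotional content"


print(describe_sentiment(0.8, 12.0))
print(describe_sentiment(-0.05, 3.0))
```

A score near 0 with a high magnitude would come out as "neutral, high emotional content", which I read as a mix of strong positive and negative statements that cancel each other out.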

Results

Enough theory! Let’s analyze some examples of the output and see if they make sense.

Here are the results for the average magnitude and score for CNBC and NYTimes:

source    avg(gcloud_magnitude)    avg(gcloud_score)
cnbc      4.6567                   -0.0828
nytimes   14.8842                  -0.0603

NYTimes is more emotional based on its average gcloud_magnitude: 14.88 vs 4.65 for CNBC. Keep in mind that magnitude is not normalized for length, so part of that gap may simply reflect longer articles. The sentiment score for both is very close to 0, so they are neither positive nor negative.

From the last article’s analysis, CNBC has a 92% and NYTimes an 87% probability of being neutral. Both AWS and Google seem to agree that the sentiments are most likely neutral.

Individual Article Analysis

Here is the focus of this article: let’s evaluate one of the articles.

CNBC With Highest Magnitude
url                                                                        gcloud_magnitude    gcloud_score
https://www.cnbc.com/2018/09/18/iphone-xs-and-iphone-xs-max-review.html    30.4000             0.2000

This CNBC article reviews Apple’s new iPhones and it generates a magnitude score of 30.40, much higher than the average score of 4.65 for CNBC.

I read the article and there are phrases that show how emotionally and dramatically it is written. Here are some quotes:

They’re the best phones Apple has ever made.

The iPhone X, even a year later, is still arguably the best phone on the market.

It’s one of the best screens on the market…

The speakers sound awesome.

I love how shiny it is on the new gold and white models.

I love that iOS 12 gives you so much more control over notifications.

…these are the best phones Apple has made…

Judging from some of these statements, I can understand why Google’s algorithm gives it a high magnitude score. It’s full of dramatic words like "best" and "love".

Conclusion

The CNBC article reviewing Apple’s new iPhone XS does seem dramatic and emotional. I am pretty impressed that Google Cloud Natural Language can understand that. In part 2, I will dive deeper into articles that have low and high scores from both sources.

Analyzing news sources with Amazon AI/ML

I will be using Amazon AI/ML to analyze news sources and see how positive or negative they are.
Why did I decide to explore this?
There are a few reasons.
One, I wanted more positivity in my life.
Two, I believe there are still good things happening around the world today.
Three, AI/ML is interesting and I wanted to learn more about it and use it to get positive results!
I started looking at the top news stories from CNBC, Buzzfeed, and NYTimes.
I read CNBC because I am interested in economics and how our financial markets are doing.
I included Buzzfeed and NYTimes because according to Wibbitz blog, Buzzfeed and NYTimes are the two most visited sites by millennials. [1]
I architected a system that automatically fetches the top news from each of these sites, scrapes the content, and runs it through Amazon’s sentiment analysis.
Amazon sentiment analysis is a web service that uses ML/AI to determine the sentiment of a piece of text.
You send it some text and it tells you the positive, negative, neutral, and mixed scores.
It also returns an overall sentiment of positive, negative, neutral, or mixed.
One example they give is running sentiment analysis on the comments of a blog to determine if your readers liked the post. [2]
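Here is a minimal sketch of the scoring and averaging steps, written in Python for illustration. It assumes a client that behaves like `boto3.client("comprehend")`, whose `detect_sentiment` call returns a `SentimentScore` with `Positive`, `Negative`, `Neutral`, and `Mixed` values. The helper names and the `FakeComprehend` stand-in (with made-up numbers) are hypothetical so the sketch runs without AWS credentials; it is not my actual pipeline.

```python
from statistics import mean

LABELS = ("Positive", "Negative", "Neutral", "Mixed")


def score_article(client, text: str) -> dict:
    """Return the four sentiment scores for one article.

    `client` is assumed to behave like boto3.client("comprehend").
    """
    response = client.detect_sentiment(Text=text, LanguageCode="en")
    return response["SentimentScore"]


def average_scores(scores: list[dict]) -> dict:
    """Average each sentiment label across all articles from one source."""
    return {label: mean(s[label] for s in scores) for label in LABELS}


class FakeComprehend:
    """Stand-in client with made-up numbers, only for trying the sketch offline."""

    def detect_sentiment(self, Text, LanguageCode):
        return {
            "Sentiment": "NEUTRAL",
            "SentimentScore": {"Positive": 0.03, "Negative": 0.04,
                               "Neutral": 0.92, "Mixed": 0.01},
        }


client = FakeComprehend()
articles = ["first scraped article text", "second scraped article text"]
print(average_scores([score_article(client, text) for text in articles]))
```

Passing the client in as an argument keeps the network call in one place, so swapping the fake for a real client should not require touching the averaging code.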
So far I have analyzed 1,030 articles and here are the results:
CNBC has a positive score of .025, a negative score of .039, a neutral score of .92, and a mixed score of .014.
NYTimes has a positive score of .055, a negative score of .044, a neutral score of .87, and a mixed score of .022.
I have the results for Buzzfeed, but after reviewing the values, I found that my code was not parsing all the Buzzfeed pages correctly, so the results may be incorrect. I will work on it and post an update soon.
So, what does this mean?
The Amazon AI sentiment analysis tells you the probability that something is either positive, negative, neutral or mixed.
Supposedly, you can take these numbers and convert them into a probability that something is positive.
I asked on the Amazon forums and got an answer from a rep on Sep 15, 2018 that the example is outdated.
However, my instinct is that each number represents the probability of each sentiment.
For example, when you add up the positive, negative, neutral, and mixed scores, you get approximately 1.
Therefore, each number is the probability of the text having that sentiment.
In this case, CNBC has a 92% probability of being neutral and NYTimes has 87%.
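As a quick sanity check on that reading, the sketch below adds up the four averaged scores reported above for each source and picks the largest one. The variable and function names are just for illustration.

```python
# Averaged scores as reported in this post.
cnbc = {"positive": 0.025, "negative": 0.039, "neutral": 0.92, "mixed": 0.014}
nytimes = {"positive": 0.055, "negative": 0.044, "neutral": 0.87, "mixed": 0.022}


def total(scores: dict) -> float:
    """Sum of the four sentiment scores; close to 1 if they behave like probabilities."""
    return sum(scores.values())


def dominant(scores: dict) -> str:
    """The sentiment label with the highest score."""
    return max(scores, key=scores.get)


for name, scores in (("cnbc", cnbc), ("nytimes", nytimes)):
    print(name, round(total(scores), 3), dominant(scores))
```

The totals come to about 0.998 and 0.991 rather than exactly 1, which is what I would expect from averaged and rounded values.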
NYTimes’s positive score is 5.5% vs CNBC’s 2.5%.
This may sound good but is it better for a news article to be neutral or positive?
These are questions that I love to explore.
Since I combine all the articles into a single number, I do not know which articles are deemed positive or negative.
Next time, I will break it down per article and see how accurate the sentiment analysis is.

How did you prepare for AWS Certified Solutions Architect – Associate Level certification?

I received my AWS Certified Solutions Architect – Associate Level certification in Feb 2018. First, I took a course on Udemy: AWS Certified Solutions Architect – Associate 2018.

Second, I took the AWS CSAA Practice Test from whizlabs.com. It is a paid course but I think it is worth it.

Those two were enough for me and I passed on the first try, albeit barely. I also made sure to redo some of the labs in the Udemy course so I could remember the material better.

Quora link

Scope ruby gem to a single class

I came across a useful trick of explicitly requiring Ruby gems in each class. In the Gemfile, set the gem’s require option to false. Then you will have to explicitly require that gem in each class that uses it.

I am using it to encapsulate interaction with a 3rd party gem through an interface class. All calls made to the 3rd party service through this gem have to go through the interface class.

Example: https://stackoverflow.com/questions/18267849/require-a-gem-only-in-the-context-of-a-single-class-in-ruby

Interface Class with 3rd party gem

I finished creating an interface class to interact with a 3rd party gem. It is only used by a single class for now; however, establishing a good pattern early on is critical. Having a single class that interacts with the 3rd party gem allows me to create integration tests that actually make network calls. Since they make network calls to an external service, the integration specs only run if you manually specify them on the command line. I do not include them in the continuous integration tests as I do not want to overload the external service. They are useful when developing, as I can invoke an automated integration test after completing a feature. Right now, I have to run the commands and test manually anyway; the integration tests automate that.
One issue I noticed is that proper encapsulation is hard. I want ALL interaction with the 3rd party gem to go through this interface class, but it is hard to enforce that fully. For example, the interface makes a request to an external service and returns the result as an object. Now the class using the interface has a handle on that object and may invoke methods on it or modify it. Freezing the returned object would at least stop callers from modifying it, although they could still call its methods.
A 3rd party integration spec is useful as it allows you to run a suite of tests when you finish with a feature.
Proper encapsulation of the interface class is difficult.

Big Data Tools Cheat Sheet

Datamania has a good technology landscape cheat sheet. Attached below is the cheat sheet.

Big-Data-Technology-Landscape-CheatSheet

What surprised me is that Looker is not on here. “Looker is a business intelligence software and big data analytics platform that helps you explore, analyze and share real-time business analytics easily.”

ELT is also not represented in the graph. ELT stands for Extract, Load, Transform. Instead of having a data engineering team extract the data, transform it, and load it into a data warehouse, the team extracts and loads the raw data into a data warehouse. Then, analytics tools such as Looker can transform the data on the fly and analyze it straight from the data warehouse. I deal with complex data transformations, so I am most curious to hear how Looker or other tools can handle complex transformations.

I hear a lot about ELT and how ETL is dead. However, I did a Google Trends comparison between ETL and ELT and they both seem to have been stable for the past five years.

Finally, with so many tools and terminologies available, it is crucial to know when to use what. Factors include the data size, the skills, security, infrastructure, etc. There are so many exciting developments in this field!
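To illustrate the ELT idea, here is a toy sketch that uses an in-memory SQLite database as a stand-in for a data warehouse. The table, column names, and rows are all made up; a real warehouse and a tool like Looker would look different, but the order of operations is the point: load the raw rows first, transform at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw rows go into the "warehouse" untouched.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "paid"), (2, 400, "refunded"), (3, 3100, "paid")],
)

# Transform: defined as a view, so it is computed when someone queries it.
conn.execute(
    """
    CREATE VIEW paid_revenue AS
    SELECT SUM(amount_cents) / 100.0 AS revenue_dollars
    FROM raw_orders
    WHERE status = 'paid'
    """
)

revenue = conn.execute("SELECT revenue_dollars FROM paid_revenue").fetchone()[0]
print(revenue)
```

Because the raw table is kept as loaded, changing the transformation later only means redefining the view, not re-running an extract job.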

Demystifying Data Science Post

I read Demystifying Data Science For All from Business Over Broadway. For someone new to and interested in the space, this article is great.

Generally speaking, data science is a way of extracting value and insights from data using the powers of computer science and statistics applied to a specific field of study – Business over Broadway

I studied Biomedical Engineering and I worked with data for my research. What makes data science different is the increased size of the data set and the variable structure of that data. Today, we have data from servers, logs, mobile, IoT, and 3rd party integrations such as Salesforce, chat, Zendesk, etc. To complicate matters, some of the data is not structured.

I’m still not sure how you can analyze data that are not structured. The data I have worked with have all been structured. Sometimes, I need to transform that data so it can be easily analyzed.

The sheer volume of recorded data is called Big Data since there is a lot of it; hence, Big. Better technology enables us to store and process Big Data. Businesses want to understand and analyze this massive data to gain competitive advantage.

At my work, security is also another factor. Massive data is collected and security measures have to be taken so analysis does not reveal sensitive information. The article did not mention security but it should be taken seriously in any organization working with sensitive data.

I am surprised to see IBM listed as the leader in Data Science Platforms. Living in SF, I thought it would be Google or Amazon. They have the expertise in house so they should be able to monetize that expertise. Perhaps it is more valuable to use that expertise on their core businesses than to provide services to other companies.

My interest in Big Data is automation. How can someone take all this data and generate something useful with the least amount of resources? That’s where technology and computer science come in.

Data Driven Practical Sites to Follow

I started looking for practical data driven sites to follow. It’s tough. There are ‘professional bloggers’ who use buzzwords such as big data, machine learning, and AI to generate content. However, when I read the content, I’m nodding my head in agreement but it lacks practical advice. It’s good for water cooler talk; nothing else.

There are some potentially good ones and I am listing them below. I have not fully vetted them, but they show promise. Let me know your thoughts on these sites and if you have any good ones to share.

  1. Data Mania Blog
  2. Business Over Broadway
  3. Blog on Big Data and Analytics