In this article, I explore Google Cloud Natural Language’s entity detection service.
Google describes its entity analysis as follows:
Entity Analysis inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.), and returns information about those entities.
Entity analysis returns a list of entities, each with a type (e.g. PERSON) and a salience score. The salience score is a number between 0 and 1 describing how important or central the entity is to the text. Read more in the official documentation.
Entity analysis returns many entities per article, so I filter them down to those with a salience score greater than 0.1. I am only interested in entities that are significant in the context of the articles.
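The filtering step can be sketched in a few lines of Python. This is only an illustration, not the code used for this post: it assumes the entities have already been converted to plain dictionaries with `name`, `type` and `salience` keys, and the second sample entity and its salience are made up.

```python
from typing import Dict, List

SALIENCE_THRESHOLD = 0.1  # the cutoff used in this post


def filter_salient(entities: List[Dict], threshold: float = SALIENCE_THRESHOLD) -> List[Dict]:
    """Keep entities with salience strictly above the threshold, most salient first."""
    kept = [e for e in entities if e["salience"] > threshold]
    return sorted(kept, key=lambda e: e["salience"], reverse=True)


# Assumed input shape. The first row uses the salience reported in this post;
# the second row is a made-up placeholder for a briefly mentioned entity.
sample_entities = [
    {"name": "Red Tape Theater", "type": "ORGANIZATION", "salience": 0.1361},
    {"name": "Some Other Theater", "type": "ORGANIZATION", "salience": 0.02},
]

salient = filter_salient(sample_entities)
for entity in salient:
    print(entity["name"], entity["type"], entity["salience"])
```

Keeping the threshold as a parameter makes it easy to see how many more entities show up at, say, 0.05.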
Entity Analysis Article #1
I pick a news article that caught my eye, titled How Chicago Is Changing Theater, One Storefront at a Time. I have recently visited the beautiful city of Chicago but didn’t get a chance to watch any shows.
One entity that Google returns is Red Tape Theater, with a category of ORGANIZATION and a salience score of 0.1361. This entity barely passes the 0.1 threshold. Google gave me some metadata and suggested a Wikipedia link for Red Tape, which is an article about excessive bureaucratic regulation. Google is wrong on that one. It cannot differentiate between Red Tape Theater and red tape as an idiom.
Other entities are mentioned in this article, such as Broken Nose Theater, First Floor Theater, and many more. Since most of those entities are only briefly mentioned, their salience scores are probably below 0.1. Considering that Red Tape Theater has a salience score of only 0.1361, it seems that no single theater is central to this article. Instead, many theaters are presented, and Red Tape Theater is the most prominent one. That is evident when I read the article.
Towards the end, Red Tape Theater is mentioned and the next several paragraphs are dedicated to it. If I were to pick a theater to visit, I would go with Red Tape Theater and see The Shipment. The second most prominent might be the Steppenwolf Theater Company, but only one paragraph is dedicated to it.
This is another impressive feat by Google’s machine learning. It is able to detect the most prominent entity in the article, apparently by relating several paragraphs to Red Tape Theater.
Previously, in part 2 of this news analysis series, Google Cloud Natural Language was able to detect an article’s emotional level and its sentiment. We looked at a CNBC article reviewing the iPhone, and it was an emotional piece.
Let’s look at more articles and their analyses.
CNBC With Highest Positive Score
The CNBC news article with the highest positive score is titled: Victoria Beckham on juggling a fashion brand with family life: ‘I just do the best I can’.
It has a score of 0.4 with a magnitude of 6.4. The average score for all CNBC articles is -0.051443570457 with a magnitude of 5.68.
The article’s tone is positive as it presents hope for working women with a family. Here are some quotes:
When you’re a working mum, you feel torn, you feel guilty, but I just do the best that I can do. My kids and (soccer star husband) David will always come first.
that’s why we need to support each other, first and foremost.”
“I’m doing the best I can creatively, as a wife, as a mum
CNBC With Lowest Score
The CNBC news article with the lowest score is titled: Mattis relationship with Trump reportedly frays as a decision on his fate looms, but White House dismisses.
It has a score of -0.7 with a magnitude of 2.1. The average score for all CNBC articles is -0.051443570457 with a magnitude of 5.68.
The score is negative, which indicates negative emotion. The article describes President Trump’s soured relationship with Secretary of Defense James Mattis. Here are some quotes:
The relationship between President Donald Trump and Secretary of Defense James Mattis may have “soured” to the point of no return
Trump is … resentful of unflattering comparisons between the two men, the publication reported.
…the president is reportedly looking to replace the four star general…
NYTimes With Highest Positive Score
This NYTimes article has the highest score: 20 Wines Under $20: When Any Night Can Be a Weeknight.
It has a score of 0.4 with a whopping magnitude of 63.7. The average score for all NYTimes articles is -0.03707533304 with a magnitude of 18.08.
The article is slightly positive but does have a high magnitude of 63.7. After reading through parts of the article, I can see that it is written with expressive and descriptive words. Here are some quotes:
Greatness in a wine is not solely a measure of complexity or profundity.
…represents a people and a culture and a love of wine, then a few extra dollars is a worthwhile investment.
But good ones, like this wine from La Staffa, grown in the Castelli di Jesi region in the northern Marche near the Adriatic, reawaken curiosity.
NYTimes With Lowest Score
This NYTimes article has the lowest score: Myanmar’s ‘Gravest Crimes’ Against Rohingya Demand Action, U.N. Says.
It has a score of -0.5 with a magnitude of 10.7. The average score for all NYTimes articles is -0.03707533304 with a magnitude of 18.08.
The article is slightly negative as it details the Myanmar army’s crimes against a Muslim minority group in Rakhine. It is written in a somber tone about a grave injustice. Here are some quotes:
…“the gravest crimes under international law”…
…troops shot some of the children and snatched infants from their mothers, throwing some into the river to drown while tossing others onto a fire, …
“The killing of civilians of all ages, including babies, cannot be argued to be a counterterrorism measure…“
After looking at articles from both CNBC and NYTimes, I am further impressed by Google’s ability to determine human expression. CNBC has an average magnitude of 5.68 compared to 18.08 for NYTimes. CNBC uses more common words, whereas NYTimes’ articles are written with expressive and descriptive words. It is fascinating that an algorithm can make that distinction.
Articles from both sites are written in a fairly neutral tone. Depending on the content, some are slightly positive while others are slightly negative. I do not see exaggeration from either side.
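For readers who want to reproduce this kind of comparison, here is a rough Python sketch of how the highest and lowest scoring article per source could be picked out. It is not the code behind this post. The row layout is an assumption, the titles are abbreviated, and the scores and magnitudes are the rounded values quoted above.

```python
from typing import Dict, List

# Assumed row layout; numbers are the rounded values quoted in this post.
articles = [
    {"source": "CNBC", "title": "Victoria Beckham on juggling a fashion brand", "score": 0.4, "magnitude": 6.4},
    {"source": "CNBC", "title": "Mattis relationship with Trump reportedly frays", "score": -0.7, "magnitude": 2.1},
    {"source": "NYTimes", "title": "20 Wines Under $20", "score": 0.4, "magnitude": 63.7},
    {"source": "NYTimes", "title": "Myanmar's 'Gravest Crimes' Against Rohingya", "score": -0.5, "magnitude": 10.7},
]


def extremes_by_source(rows: List[Dict]) -> Dict[str, Dict[str, Dict]]:
    """Return the highest and lowest scoring article for each source."""
    result = {}
    for source in sorted({row["source"] for row in rows}):
        group = [row for row in rows if row["source"] == source]
        result[source] = {
            "highest": max(group, key=lambda row: row["score"]),
            "lowest": min(group, key=lambda row: row["score"]),
        }
    return result


extremes = extremes_by_source(articles)
for source, pair in extremes.items():
    print(source, "highest:", pair["highest"]["title"], "| lowest:", pair["lowest"]["title"])
```

With only four sample rows the per-source averages would not match the ones reported above, so the sketch sticks to the extremes.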
This is part 1 of evaluating the sentiment of news articles using ML/AI.
This post builds on work from last week as I explore news articles with ML/AI. To recap, I aggregated the top news from CNBC and NYTimes and calculated their overall sentiment score. However, since all the news articles are combined together, there is no way to evaluate them individually.
In this post, I will examine the sentiment of individual articles.
Why the change?
Because of AWS’s limitations. According to their guidelines and limits, the maximum size for sentiment detection is 5 KB, which many full-length news articles exceed. If an article goes over the limit, I have to split it into pieces and analyze each piece separately. Then I need to weigh the pieces appropriately and do a final calculation. I am lazy, so I looked for a better solution. I found it with Google Cloud Natural Language.
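To show what the split-and-weigh workaround would involve, here is a minimal Python sketch. It assumes the limit is 5,000 UTF-8 bytes per request and that some function returns a single numeric sentiment score per chunk. `score_chunk` is a hypothetical stand-in for that function, not a real AWS call, and weighting by chunk size is just one reasonable choice.

```python
from typing import Callable, List

MAX_BYTES = 5000  # assumed per-request size limit


def split_by_bytes(text: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_bytes.

    A single word longer than max_bytes still becomes its own oversized chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        extra = word_size + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, size = [word], word_size
        else:
            current.append(word)
            size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def weighted_sentiment(text: str, score_chunk: Callable[[str], float],
                       max_bytes: int = MAX_BYTES) -> float:
    """Average the per-chunk scores, weighting each chunk by its size in bytes."""
    chunks = split_by_bytes(text, max_bytes)
    if not chunks:
        return 0.0
    weights = [len(chunk.encode("utf-8")) for chunk in chunks]
    total = sum(weights)
    return sum(score_chunk(c) * w for c, w in zip(chunks, weights)) / total
```

Even in this simplified form there are decisions to make, such as where to cut and how to weigh, which is why a service without the size limit was more attractive.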
Note to businesses: This is a reason why customers switch to a different service.
Google Cloud Natural Language
Google Cloud Natural Language derives insights from unstructured text using Google machine learning.
Google’s sentiment analysis is less specific than AWS’s. Google provides two values: magnitude and score. A score of 0 is neutral. A score of less than 0 indicates negative emotion and a score greater than 0 indicates positive emotion. The magnitude indicates the level of emotional content. That is pretty vague, but let’s take a look at some samples. You can read more about Google’s sentiment analysis here.
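The reading of score and magnitude described above can be written down as a small Python helper. This is my own sketch, not part of Google’s API: the function name and the optional `neutral_band` parameter are assumptions added for illustration.

```python
def describe_sentiment(score: float, magnitude: float, neutral_band: float = 0.0) -> str:
    """Label a (score, magnitude) pair using the reading described in this post.

    neutral_band is an assumed convenience: scores within +/- neutral_band
    of zero are treated as neutral. With the default of 0.0 only an exact
    zero counts as neutral.
    """
    if score > neutral_band:
        label = "positive"
    elif score < -neutral_band:
        label = "negative"
    else:
        label = "neutral"
    return f"{label} (score={score:+.2f}, magnitude={magnitude:.2f})"


print(describe_sentiment(0.4, 6.4))
print(describe_sentiment(-0.05, 5.0, neutral_band=0.1))
```

A small band around zero is useful in practice because average scores are rarely exactly 0.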
Enough theory! Let’s analyze some examples of the output and see if they make sense.
Here are the results for the average magnitude and score for CNBC and NYTimes:
NYTimes is more emotional based on its average gcloud_magnitude: 14.88 versus 4.65 for CNBC. The sentiment scores for both are very close to 0, so they are neither positive nor negative.
From the last article’s analysis, CNBC has a 92% and NYTimes has an 87% probability of being neutral. Both AWS and Google seem to agree that the sentiments are most likely neutral.
Individual Article Analysis
Here is the focus of this article: let’s evaluate one of the articles.
CNBC With Highest Magnitude
This CNBC article reviews Apple’s new iPhones and it generates a magnitude score of 30.39, much higher than the average score of 4.65 for CNBC.
I read the article, and there are phrases that show how emotionally and dramatically it is written. Here are some quotes:
They’re the best phones Apple has ever made.
The iPhone X, even a year later, is still arguably the best phone on the market.
It’s one of the best screens on the market…
The speakers sound awesome.
I love how shiny it is on the new gold and white models.
I love that iOS 12 gives you so much more control over notifications.
…these are the best phones Apple has made…
Judging from some of these statements, I can understand why Google’s algorithm gives it a high magnitude score. It’s full of dramatic adjectives.
The CNBC article reviewing Apple’s new iPhone XS does seem dramatic and emotional. I am pretty impressed that Google Cloud Natural Language can understand that. In part 2, I will dive deeper into articles that have low and high scores from both sources.
I received my AWS Certified Solutions Architect – Associate Level certification in Feb 2018. First, I took a course on Udemy. Here is the link:
Second, I took the AWS CSAA Practice Test from. It is a paid course but I think it is worth it.
Those two were enough for me, and I passed on the first try, albeit barely. I also made sure to redo some of the labs in the Udemy course so I could remember the material better.
I came across a useful trick of explicitly requiring Ruby gems in each class. In the Gemfile, set the gem’s require option to false. Then you have to explicitly require that gem in each class that uses it.
I am using it to encapsulate interaction with a third-party gem behind an interface class. All calls made to the third-party service through this gem have to go through the interface class.
Datamania has a good technology landscape cheat sheet. Attached below is the cheat sheet.
What surprised me is that Looker is not on here. “Looker is a business intelligence software and big data analytics platform that helps you explore, analyze and share real-time business analytics easily.”
ELT is also not represented in the graph. ELT stands for Extract, Load, Transform. Instead of having a data engineering team extract the data, transform it and load it into a data warehouse, the team extracts and loads the raw data into a data warehouse. Then an analytics tool such as Looker can transform the data on the fly and analyze it straight from the data warehouse. I deal with complex data transformations, so I am most curious to hear how Looker or other tools can handle complex transformations.
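To make the ELT idea concrete, here is a toy Python sketch that uses an in-memory SQLite database as a stand-in for the data warehouse. The table name and rows are made up for illustration, and a real analytics tool would generate far more elaborate queries. The point is only that the raw data is loaded untouched and the transformation happens at query time.

```python
import sqlite3

# Made-up raw rows, loaded exactly as extracted (amounts are still strings).
raw_events = [
    ("2018-09-01", "purchase", "10.00"),
    ("2018-09-01", "refund", "-5.00"),
    ("2018-09-02", "purchase", "20.00"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse

# Extract + Load: store the raw data without cleaning or reshaping it.
conn.execute("CREATE TABLE raw_events (day TEXT, kind TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: cast and aggregate inside the warehouse, at query time.
rows = conn.execute(
    "SELECT kind, SUM(CAST(amount AS REAL)) "
    "FROM raw_events GROUP BY kind ORDER BY kind"
).fetchall()

for kind, total in rows:
    print(kind, total)
```

In an ETL pipeline the cast and the aggregation would instead run before the load, and the warehouse would only ever see the cleaned totals.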
I hear a lot about ELT and how ETL is dead. However, I did a Google Trends comparison between ETL and ELT, and both seem to have been stable for the past five years.
Finally, with so many tools and terminologies available, it is crucial to know when to use what. Factors include the data size, the team’s skills, security, infrastructure, etc. There are so many exciting developments in this field!
I read Demystifying Data Science For All from Business over Broadway. For someone new to and interested in the space, this article is great.
Generally speaking, data science is a way of extracting value and insights from data using the powers of computer science and statistics applied to a specific field of study – Business over Broadway
I studied Biomedical Engineering and worked with data for my research. What makes data science different is the increased size of the data set and the variable structure of that data. Today, we have data from servers, logs, mobile, IoT, and third-party integrations such as Salesforce, chat, Zendesk, etc. To complicate matters, some of the data is not structured.
I’m still not sure how you can analyze data that are not structured. The data I have worked with have all been structured. Sometimes, I need to transform that data so it can be easily analyzed.
The sheer volume of recorded data is called Big Data since there is a lot of it; hence, Big. Better technology enables us to store and process Big Data. Businesses want to understand and analyze this massive data to gain competitive advantage.
At my work, security is also another factor. Massive data is collected and security measures have to be taken so analysis does not reveal sensitive information. The article did not mention security but it should be taken seriously in any organization working with sensitive data.
I am surprised to see IBM listed as the leader in Data Science Platforms. Living in SF, I thought it would be Google or Amazon. They have the expertise in house so they should be able to monetize that expertise. Perhaps it is more valuable to use that expertise on their core businesses than to provide services to other companies.
My interest in Big Data is automation. How can someone take all this data and generate something useful with the least amount of resources? That’s where technology and computer science come in.