Impossible to be unbiased

As I read articles on the web, I notice that almost all of them are biased. Recently, I read an opinion piece in the New York Times by Seth Stephens-Davidowitz comparing the treatment of depression in red and blue states. Although it is an opinion piece, the author’s use of research data makes it read like a research paper. It certainly lacks the rigor of a peer-reviewed paper; however, I am more interested in how his opinion is presented with loose facts. I follow Seth because, like him, I am interested in using big data to learn about human behavior.

His first opinion is “Blue states are far more likely to use therapy to treat mental illness”. He makes two claims to back up his opinion. The first is “In blue states, there are 54 percent more Google searches for psychotherapists” and the second is “there are also 76 percent more psychologists or psychiatrists per capita in blue states”.

His first claim is based on Google searches. My understanding is that for every 100 searches for psychotherapists in red states, there are 154 searches in blue states. One explanation might be that there are more depressed people in blue states. Seth immediately squashes that thought by claiming that “Whatever stereotypes you might have, it is not true that people in blue states are more neurotic or depressed”. This statement is not backed by any evidence. And he starts the sentence with “Whatever stereotypes you might have”, leading readers to replace their preconceived notions with his. Interestingly, only one commenter mentions that “liberals have a higher chance of being neurotic”. Indeed, a few simple Google searches led me to research that supports that claim.

Seth argues that the difference cannot be due to cost because people in both red and blue states search for expensive procedures. Next, he attributes it to negative stigma. He claims that “men in red states make up a smaller portion of visits to therapy sites”. But what does that mean? I think he is saying that men visit therapy sites less than women do. However, that has no bearing on how often people seek therapy in red versus blue states. He then looks at Twitter data and estimates that “Americans in blue states are about 100 percent more likely to tweet about their therapist”. OK, that is interesting. So, the theory is that negative stigma leads to fewer tweets. I can believe that.

I skimmed the rest of the article because it goes into anecdotal experiences of celebrities. His main point is that there is a negative stigma attached to receiving therapy in red states, which leads to less treatment. My first thought is, won’t people just do it in secret? Are there other stigmatized things that people in red states do secretly? A simple Google search of “premarital sex in red vs blue states” reveals, ironically, another New York Times opinion piece titled “Blue States Practice the Family Values Red States Preach”. According to the article, “those with the highest percentage of high school students who say they have had sex are Mississippi, Delaware, West Virginia, Alabama and Arkansas”. You know, red states. The article even mentions that red states preach against sex, which probably means that there are negative stigmas associated with it, right? I do realize that the survey covers only high school students, so it is not representative.

I do not believe that red states reject therapy due to negative stigmas. People may not openly talk about it, but if it is beneficial, they will still do it. In secret if they have to. It may be that red states do not believe in therapy. It may be that red states are less neurotic than blue states. It may be a million other reasons. My point is that most people are biased and they find data to fit their narrative. In this article, the narrative is “More people with mental illness need better insurance coverage”. Noble and definitely true. However, I hate being manipulated with biased research and anecdotal feel-bad stories.

I do not blame them. It is impossible to be unbiased. I just consume anything, and I mean anything, with a crate of salt. This is bad because it leads me to justify my own biases by dismissing other viewpoints as biased and, therefore, untrue. I am sure there are people who are unbiased and I would love to meet them. For now, I need to do my own research and analysis and draw my own conclusions. This is the cost of distrusting the system. If anyone knows good tools and techniques to study human behavior using big data, please let me know.



Scope ruby gem to a single class

I came across a useful trick: explicitly requiring Ruby gems in each class that uses them. In the Gemfile, set the gem’s require option to false. Then you will have to explicitly require that gem in each class that needs it.

I am using it to encapsulate interaction with a 3rd party gem behind an interface class. All calls made to the 3rd party service through this gem have to go through the interface class.
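Here is a minimal sketch of what that looks like. The gem name `third_party_client` and the class name `ThirdPartyInterface` are made up for illustration, and the standard library's `json` stands in for the gem so the snippet runs on its own.

```ruby
# Gemfile entry, shown as a comment so this sketch runs standalone:
#   gem "third_party_client", require: false   # hypothetical gem name
#
# With require: false, Bundler.require skips the gem at boot, so the file
# that needs it has to load it explicitly.

# The stdlib "json" library stands in for the 3rd party gem here.
require "json"

# The only class that is supposed to touch the gem.
class ThirdPartyInterface
  # Parses a raw payload using the (stand-in) library.
  def parse(payload)
    JSON.parse(payload)
  end
end

parsed = ThirdPartyInterface.new.parse('{"status": "ok"}')
puts parsed["status"]
```

One caveat: `require` loads a library for the whole process, so once any file has required the gem its constants are visible everywhere. The scoping is a convention that the Gemfile setting makes visible, not something Ruby enforces.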


Interface Class with 3rd party gem

I finished creating an interface class to interact with a 3rd party gem. It is only used by a single class for now; however, establishing a good pattern early on is critical. Having a single class that interacts with the 3rd party gem allows me to create an integration test that actually makes network calls. Since it makes network calls to an external service, the integration spec only runs if you manually specify it on the command line. I do not include it in the continuous integration tests because I do not want to overload the external service. It is useful during development because I can invoke an automated integration test after completing a feature. Right now, I have to run the commands and test manually anyway; this integration test automates that.
One issue I noticed is that proper encapsulation is hard. I want ALL interaction with the 3rd party gem to go through this interface class, but it is hard to encapsulate that fully. For example, the interface makes a request to an external service and returns the response as an object. Now the class using the interface has a handle on that object and may invoke methods on it or modify it. Freezing the returned object would stop callers from modifying it, but it would not stop them from invoking its methods, so I may need to return my own read-only object instead.
A 3rd party integration spec is useful as it allows you to run a suite of tests when you finish with a feature.
Proper encapsulation of the interface class is difficult.
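The freezing idea can be sketched roughly as follows. This is not my actual interface class: `RawResponse`, `ThirdPartyInterface`, `Result`, and `fetch` are hypothetical names, and the raw response is faked in place of a real network call.

```ruby
# Hypothetical stand-in for the object a 3rd party gem might return.
RawResponse = Struct.new(:status, :body)

class ThirdPartyInterface
  # Plain value object handed to callers instead of the gem's own object.
  Result = Struct.new(:status, :body, keyword_init: true)

  def fetch
    # A real version would call the 3rd party gem here.
    raw = RawResponse.new(200, "ok")
    # Copy only the fields callers need, then freeze the copy.
    Result.new(status: raw.status, body: raw.body.dup.freeze).freeze
  end
end

result = ThirdPartyInterface.new.fetch
puts result.status # reading still works

begin
  result.status = 500 # writing to a frozen struct raises
rescue FrozenError => e
  puts "modification blocked: #{e.class}"
end
```

Callers only ever see `Result`, so the gem's object never leaves the interface class. Note that `freeze` is shallow, which is why the body string is frozen separately.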

Big Data Tools Cheat Sheet

Datamania has a good technology landscape cheat sheet. Attached below is the cheat sheet.


What surprised me is that Looker is not on here. “Looker is a business intelligence software and big data analytics platform that helps you explore, analyze and share real-time business analytics easily.”

ELT is also not represented in the graph. ELT stands for Extract, Load, Transform. Instead of having a data engineering team extract the data, transform it, and load it into a data warehouse, the team extracts the raw data and loads it straight into the data warehouse. Then, analytics tools such as Looker can transform the data on the fly and analyze it directly in the data warehouse. I deal with complex data transformations, so I am most curious to hear how Looker or other tools handle them.
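To make the difference in ordering concrete, here is a toy sketch in plain Ruby. It has nothing to do with how Looker or a real warehouse works: the sample rows, the `transform` method, and the arrays standing in for a warehouse are all made up for illustration.

```ruby
# Raw events as they might arrive from a source system (made-up sample data).
RAW_EVENTS = [
  { "user" => "a", "amount_cents" => "1250" },
  { "user" => "b", "amount_cents" => "300" }
].freeze

# The transformation step: cast the cents string to a dollar amount.
def transform(row)
  { user: row["user"], amount: row["amount_cents"].to_i / 100.0 }
end

# ETL: transform first, so the warehouse only ever holds cleaned rows.
etl_warehouse = RAW_EVENTS.map { |row| transform(row) }

# ELT: load the raw rows untouched, then transform when a query runs.
elt_warehouse = RAW_EVENTS.dup
elt_query_result = elt_warehouse.map { |row| transform(row) }

puts etl_warehouse.inspect
puts elt_query_result.inspect
```

Both paths end with the same rows. The difference is that the ELT warehouse still holds the raw data, so the transformation can change later without re-extracting anything.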

I hear a lot about ELT and how ETL is dead. However, I did a Google Trends comparison between ETL and ELT, and interest in both seems to have been stable for the past five years.

Finally, with so many tools and so much terminology available, it is crucial to know when to use what. Factors include the data size, the team’s skills, security, infrastructure, etc. There are so many exciting developments in this field!

Demystifying Data Science Post

I read Demystifying Data Science For All from Business Over Broadway. For someone new to the space and interested in it, this article is great.

Generally speaking, data science is a way of extracting value and insights from data using the powers of computer science and statistics applied to a specific field of study – Business over Broadway

I studied Biomedical Engineering and I work with data for my research. What makes a data scientist’s work different is the increased size of the data set and the variable structure of that data. Today, we have data from servers, logs, mobile, IoT, and 3rd party integrations such as Salesforce, chat, Zendesk, etc. To complicate matters, some of the data is not structured.

I’m still not sure how you can analyze data that is not structured. The data I have worked with has all been structured. Sometimes, I need to transform that data so it can be easily analyzed.

The sheer volume of recorded data is called Big Data since there is a lot of it; hence, Big. Better technology enables us to store and process Big Data. Businesses want to understand and analyze this massive data to gain competitive advantage.

At my work, security is another factor. Massive amounts of data are collected, and security measures have to be taken so that analysis does not reveal sensitive information. The article does not mention security, but it should be taken seriously in any organization working with sensitive data.

I am surprised to see IBM listed as the leader in Data Science Platforms. Living in SF, I thought it would be Google or Amazon. They have the expertise in-house, so they should be able to monetize it. Perhaps it is more valuable for them to use that expertise on their core businesses than to provide services to other companies.

My interest in Big Data is automation. How can someone take all this data and generate something useful with the least amount of resources? That’s where technology and computer science come in.

Data Driven Practical Sites to Follow

I started looking for practical data-driven sites to follow. It’s tough. There are ‘professional bloggers’ who use buzzwords such as big data, machine learning, and AI to generate content. However, when I read their content, I find myself nodding in agreement, but it lacks practical advice. It’s good for water cooler talk; nothing else.

There are some potentially good ones, and I am listing them below. I have not fully vetted them, but they show promise. Let me know your thoughts on these sites and whether you have any good ones to share.

  1. Data Mania Blog
  2. Business Over Broadway
  3. Blog on Big Data and Analytics