Sentiment analysis¶
In this tutorial, you will learn:
- The basics of sentiment analysis
- How to collect tweets
- How to collect financial news headlines
- What are the common ways of analysing sentiment
- How to measure the accuracy of the sentiment prediction
Intro to sentiment analysis¶
As we have discussed in the Introduction part, sentiment analysis is a natural language processing technique that is used to
determine whether a statement contains positive, negative or neutral sentiment.
In this tutorial, we aim to analyse the daily sentiment of a stock with the use of relevant news headlines and tweets,
and thus to find out the market sentiment.
Collection of tweets¶
Apply for a Twitter developer account to use Tweepy
1. Click and apply for a developer account through this link: https://developer.twitter.com/en/apply-for-access
2. Create a new project and connect it with the developer App in the developer portal
3. Enable App permissions (Read and Write)
4. Navigate to the ‘Keys and tokens’ page and save your API key, API secret, Access token and Access secret
Code example
import tweepy
# do not share the API key in any public platform (e.g github, public website)
consumer_key = "YOUR_API_KEY"
consumer_secret = "YOUR_API_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_SECRET"
# authorisation of consumer key and consumer secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
Access the relevant tweets using the Twitter API
Twitter provides different types of API with various limitations. Please visit this link for further
information: https://developer.twitter.com/en/docs/twitter-api . In the following section, you will learn how to
retrieve tweets from the Twitter timeline and from hashtags/cashtags, and how to stream real-time tweets.
Timeline tweets¶
Returns the 20 most recent tweets posted by the authenticated user. It is also possible to request another
user’s timeline via the id parameter: pass the user_id or screen_name parameter to access that user’s tweets.
For more information regarding the parameters, please visit the official documentation: https://docs.tweepy.org/en/v3.5.0/api.html
Code example
import csv

# create an empty list
alltweets = []
# extract data from the API
timeline = api.user_timeline(user_id=userid, count=number_of_tweets)
alltweets.extend(timeline)
with open('%s_tweets.csv' % screen_name, 'a') as f:
    writer = csv.writer(f)
    for tweet in alltweets:
        tweet_text = tweet.text.encode("utf-8")
        dates = tweet.created_at
        writer.writerow([dates, tweet_text])
Hashtag/Cashtag tweets¶
Cashtags are a Twitter feature that allows users to retrieve tweets relevant to a particular ticker, say $GOOG, $AAPL or $FB.
Use tweepy.Cursor() to access data from hashtags and cashtags.
Code example
# extract data from the API
hashtags = tweepy.Cursor(api.search, q=name, lang='en', tweet_mode='extended').items(200)
with open('%s_tweets.csv' % screen_name, 'a') as f:
    writer = csv.writer(f)
    for status in hashtags:
        tweet_text = status.full_text
        dates = str(status.created_at)[:10]
        writer.writerow([dates, tweet_text])
If you want to collect tweets for a specific period of time, you can amend the code snippet in the following way:
with open('%s_tweets.csv' % screen_name, 'a') as f:
    writer = csv.writer(f)
    for status in hashtags:
        # add this line: keep only tweets from the last `day_required` days
        if (datetime.datetime.now() - status.created_at).days <= day_required:
            tweet_text = status.full_text
            dates = str(status.created_at)[:10]
            writer.writerow([dates, tweet_text])
Stream tweets¶
The Twitter streaming API is used to download tweets in real time. It is useful for obtaining a high volume of
tweets, or for creating a live feed using a site stream. For more information about the API, please visit this link:
https://docs.tweepy.org/en/v3.5.0/streaming_how_to.html.
- Create a class inheriting from StreamListener
# override tweepy.StreamListener
class MyStreamListener(tweepy.StreamListener):
    # add logic to the initialisation function
    def __init__(self, output_file=sys.stdout, input_name=sys.stdout):
        super(MyStreamListener, self).__init__()
        self.max_tweets = 200
        self.tweet_count = 0
        self.output_file = output_file
        self.input_name = input_name

    # add logic to the on_status method
    def on_status(self, status):
        # stop streaming once the maximum number of tweets has been collected
        if self.tweet_count == self.max_tweets:
            return False
        # collect tweets
        else:
            writer = csv.writer(self.output_file)
            writer.writerow([status.created_at, status.extended_tweet['full_text'].encode("utf-8")])
            self.tweet_count += 1
- Create a stream
# add an output_file parameter to store the output tweets
myStreamListener = MyStreamListener(output_file=f, input_name=firm)
myStream = tweepy.Stream(auth=api.auth, tweet_mode='extended', listener=myStreamListener, languages=["en"])
- Start a stream
myStream.filter(track=target_firm)
Collect financial headlines¶
US news headlines¶
Finviz.com is a browser-based stock market research platform that allows visitors
to read the latest financial news collected from major news agents such as Yahoo! Finance, Accesswire, and Newsfile.
Before starting, it is important to take a look at the front-end code of the website.
- Access the website of each ticker through the urllib.request module
from urllib.request import Request, urlopen

allnews = []
finviz_url = 'https://finviz.com/quote.ashx?t='
url = finviz_url + ticker
req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
resp = urlopen(req)
- Get the HTML document using Beautiful Soup
html = BeautifulSoup(resp, features="lxml")
- Get the <div> element with id='news-table' from the page
news_table = html.find(id='news-table')
news_tables[ticker] = news_table
- Find all the news under the <tr> tag in the news-table
for info in news_table.findAll('tr'):
    text = info.a.get_text()
    date_scrape = info.td.text.split()
    # a single token means only the time is given; otherwise date and time
    if len(date_scrape) == 1:
        time = date_scrape[0]
    else:
        date = date_scrape[0]
        time = date_scrape[1]
    news_time_str = date + " " + time
- Convert the date format to ‘YYYY-MM-dd’
date_time_obj = datetime.datetime.strptime(news_time_str, '%b-%d-%y %I:%M%p')
date_time=date_time_obj.strftime('%Y-%m-%d')
- Append all the news together
allnews.append([date_time,text])
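The date conversion in the steps above can be exercised on its own. Below is a minimal sketch assuming a Finviz-style timestamp such as 'Dec-27-21 09:30AM' (the helper name is ours, for illustration):

```python
import datetime

def to_iso_date(news_time_str):
    """Convert a Finviz-style timestamp, e.g. 'Dec-27-21 09:30AM', to 'YYYY-MM-dd'."""
    date_time_obj = datetime.datetime.strptime(news_time_str, '%b-%d-%y %I:%M%p')
    return date_time_obj.strftime('%Y-%m-%d')

print(to_iso_date('Dec-27-21 09:30AM'))  # → 2021-12-27
```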
HK news headlines¶
We will also learn how to collect news headlines from aastocks.com. The website is
one of the most popular financial information platforms in Hong Kong. It offers real-time international information
relevant to Hong Kong shares, which is useful for analysing sentiment and trends in the local market.
Again, before writing code to scrape the news, we need to have a look at the front-end code of the website. Take Tencent (00700.HK) as an example.
Click ‘inspect’ and you can view the front-end code of the website. (Or visit this link: http://www.aastocks.com/en/stocks/analysis/stock-aafn/00700/0/all/1)
From the above snippet, we can see that the date is stored within the <div class='inline_block'>
under the <div class='newstime 4'>, while the news headlines are stored within the <div class='newscontent4 mar8T'>.
The following steps are similar to those for collecting US news headlines.
- Access the website of each ticker through the urllib.request module
prefix_url = 'http://www.aastocks.com/en/stocks/analysis/stock-aafn/'
postfix_url = '/0/all/1'
url = prefix_url + fill_ticker + postfix_url
req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
resp = urlopen(req)
- Get the HTML document using Beautiful Soup
html = BeautifulSoup(resp, features="lxml")
# get the html code containing the dates and news
dates = html.findAll("div", {"class": "inline_block"})
news = html.findAll("div", {"class": "newshead4"})
- Find all the news and corresponding dates from the html code from step 2
# track the index in the news list
idx = 0
with open('%s_tweets.csv' % screen_name, 'a') as f:
    writer = csv.writer(f)
    for i in dates:
        # the dates are in yyyy/mm/dd format
        if "/" in str(i.get_text()):
            date = str(i.get_text())
            # the front-end code is not standardised and sometimes contains a 'Release Time' string
            if "Release Time" in date:
                date = date[13:23]
            else:
                date = str(date[:10])
            text = news[idx].get_text()
            date_time_obj = datetime.datetime.strptime(date, '%Y/%m/%d')
            # standardise the date format as 'YYYY-mm-dd'
            date_time = date_time_obj.strftime('%Y-%m-%d')
            # set the number of days you want to collect
            if (datetime.datetime.now() - date_time_obj).days <= day_required:
                writer.writerow([date_time, text])
            idx += 1
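The date-extraction logic in the loop above can be isolated into a small helper. This is a sketch based on the string slicing shown; the sample inputs are illustrative, not actual aastocks output:

```python
def extract_date(raw):
    """Extract a 'yyyy/mm/dd' date from an aastocks date cell.

    The cell sometimes carries a 'Release Time' prefix, in which case the
    date starts at offset 13 (after 'Release Time ').
    """
    if "Release Time" in raw:
        return raw[13:23]
    return raw[:10]

print(extract_date("2021/12/27 10:00"))         # → 2021/12/27
print(extract_date("Release Time 2021/12/27"))  # → 2021/12/27
```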
VADER sentiment prediction¶
After collecting the data, it is time to carry out the analysis on it.
VADER (Valence Aware Dictionary for Sentiment Reasoning) is a model used for text sentiment analysis that is
sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK
package and can be applied directly to unlabelled text data.
The sentiment labels are generated from the VADER Compound score according to the following rules:
- Positive sentiment (= 2): compound score > 0.01
- Neutral sentiment (= 1): −0.01 ≤ compound score ≤ 0.01
- Negative sentiment (= 0): compound score < −0.01
Note that 1% was set as the threshold value to account for the average daily stock movement in the US market; feel free to set
any value for your own analysis.
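The labelling rules above can be written as a small standalone function (the function name is ours; 0.01 is the default threshold):

```python
def vader_label(compound, threshold=0.01):
    """Map a VADER compound score to a sentiment label: 2 positive, 1 neutral, 0 negative."""
    if compound > threshold:
        return 2
    if compound < -threshold:
        return 0
    return 1

print(vader_label(0.5))    # → 2
print(vader_label(0.0))    # → 1
print(vader_label(-0.25))  # → 0
```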
- Import these libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import twitter_samples
- VADER’s
SentimentIntensityAnalyzer()
takes in a string and returns a dictionary of scores in each of four categories:
- negative
- neutral
- positive
- compound (computed by normalising the scores above, ranging from -1 to 1)
Let us analyse the data that we have collected through the sentiment analyser.
import os

# instantiate the VADER analyser
analyzer = SentimentIntensityAnalyzer()

# pass in the path where you stored the csv file containing the data
def read_tweets_us_path(path):
    # change this to your own path
    path = os.path.join(dir_name, 'train-data/' + path)
    # read in the data as a pandas dataframe
    df = pd.read_csv(path)
    cs = []
    for row in range(len(df)):
        cs.append(analyzer.polarity_scores(df['tweets'].iloc[row])['compound'])
    # create a new column for the calculated results
    df['compound_vader_score'] = cs
    print(df)
    return df
- Label the sentiment for each tweet
Parameters:
grouped_data: consolidated data with features including (dates, tweets, compound_vader_score)
file_name: the name of the output file produced by the label function
perc_change: the threshold value for labelling the sentiment
Code example
def find_tweets_pred_label(grouped_data, file_name, perc_change):
    print('find_pred_label')
    tweets = grouped_data['tweets']
    # group the tweets within the csv using the ['dates','ticker'] index
    grouped_data = grouped_data.groupby(['dates','ticker'])['compound_vader_score'].mean().reset_index()
    final_label = []
    for i in range(len(grouped_data)):
        if grouped_data['compound_vader_score'].iloc[i] > perc_change:
            final_label.append(2)
        elif grouped_data['compound_vader_score'].iloc[i] < -perc_change:
            final_label.append(0)
        else:
            final_label.append(1)
    # add the column of vader_label
    grouped_data['vader_label'] = final_label
    grouped_data['tweets'] = tweets
    grouped_data.to_csv(file_name)
4. Merge all the data together
- actual label (= 2): price movement ≥ 0.01
- actual label (= 1): −0.01 ≤ price movement ≤ 0.01
- actual label (= 0): price movement ≤ −0.01
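Under these rules, the actual label can be derived from consecutive daily closing prices. A minimal sketch (the function name is illustrative; it assumes two plain closing prices):

```python
def actual_label(prev_close, close, threshold=0.01):
    """Label the daily price movement: 2 up, 1 flat, 0 down."""
    movement = (close - prev_close) / prev_close
    if movement >= threshold:
        return 2
    if movement <= -threshold:
        return 0
    return 1

print(actual_label(100.0, 103.0))  # → 2 (a +3% move)
print(actual_label(100.0, 99.5))   # → 1 (a -0.5% move)
```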
Parameters:
file_name: consolidated data with features including (dates, tweets, compound_vader_score)
label_data: the label data containing the actual label from Yahoo Finance
Code example
def merge_actual_label(file_name, label_data):
    vader_data = pd.read_csv(file_name)
    vader_data.set_index(keys=["dates","ticker"], inplace=True)
    label_data = pd.read_csv(label_data)
    label_data.set_index(keys=["dates","ticker"], inplace=True)
    # merge the actual label and the predicted label into a single pandas dataframe
    merge = pd.merge(vader_data, label_data, how='inner', left_index=True, right_index=True)
    merge = merge.drop(columns=['Unnamed: 0_y'], axis=1)
    return merge
5. Validation using a confusion matrix
Parameters:
df: the final merged pandas dataframe
name: the output csv file containing all the merged information with dates, tweets, vader label and actual label
Code illustration
Code illustration
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

def validation(df, name):
    pred_label = list(df['vader_label'])
    actual_label = list(df['label'])
    labels = [0, 1, 2]
    cm = confusion_matrix(actual_label, pred_label, labels=labels)
    categories = ['Negative', 'Neutral', 'Positive']
    # make_confusion_matrix is a plotting helper defined elsewhere in the repository
    make_confusion_matrix(cm, categories=categories)
    df.to_csv(name)
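Beyond visual inspection, the prediction accuracy can be computed directly from the confusion matrix, since correct predictions sit on its diagonal. A sketch with an illustrative 3x3 matrix (made-up counts):

```python
def accuracy_from_cm(cm):
    """Accuracy = sum of the diagonal entries / sum of all entries."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# rows: actual label (0, 1, 2); columns: predicted label
cm = [[30, 5, 5],
      [10, 40, 10],
      [5, 5, 40]]
print(accuracy_from_cm(cm))  # ≈ 0.733 (110 correct out of 150)
```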
Attention
All investments entail inherent risk. This repository seeks to solely educate
people on methodologies to build and evaluate algorithmic trading strategies.
All final investment decisions are yours and as a result you could make or lose money.