Sentiment analysis
===================

.. highlight:: Python

In this tutorial, you will learn:

* The basics of sentiment analysis
* How to collect tweets
* How to collect financial news headlines
* What are the common ways of analysing sentiment
* How to measure the accuracy of the sentiment prediction

Intro to sentiment analysis
---------------------------

| As discussed in the Introduction, sentiment analysis is a natural language processing technique used to determine whether a statement carries positive, negative or neutral sentiment. In this tutorial, we analyse the daily sentiment of a stock using relevant news headlines and tweets, and use it to gauge the market sentiment.

Collection of tweets
---------------------

**Apply for a Twitter developer account to use Tweepy**

| 1. Apply for a developer account through this link: https://developer.twitter.com/en/apply-for-access
| 2. Create a new project and connect it with the developer App in the developer portal
| 3. Enable App permissions (*Read* and *Write*)
| 4. Navigate to the **'Keys and tokens'** page and save your API key, API secret, Access token and Access secret

**Code example** ::

    import tweepy

    # do not share the API keys on any public platform (e.g. GitHub, a public website)
    consumer_key = "YOUR_API_KEY"
    consumer_secret = "YOUR_API_SECRET"
    access_token = "YOUR_ACCESS_TOKEN"
    access_token_secret = "YOUR_ACCESS_SECRET"

    # authorisation of consumer key and consumer secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # wait_on_rate_limit_notify is a Tweepy v3.x argument (it was removed in Tweepy v4)
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

**Access the relevant tweets using the Twitter API**

| Twitter provides several types of API, each with its own limitations.
| Please visit this link for further information: https://developer.twitter.com/en/docs/twitter-api. In the following sections, you will learn how to retrieve tweets from a user timeline, from a hashtag or cashtag search, and from a stream of real-time tweets.

Timeline tweets
^^^^^^^^^^^^^^^^

| By default, the timeline call returns the 20 most recent tweets posted by the authenticated user. It is also possible to request another user's timeline via the :code:`id` parameter.
| Pass the :code:`user_id` or :code:`screen_name` parameter to access the tweets of a specific user. For more information regarding the parameters, please visit the official documentation: https://docs.tweepy.org/en/v3.5.0/api.html

**Code example** ::

    import csv

    # create an empty list
    alltweets = []

    # extract data from the API
    timeline = api.user_timeline(user_id=userid, count=number_of_tweets)
    alltweets.extend(timeline)

    # open in text mode with utf-8 so that the tweet text is written as a string, not as bytes
    with open('%s_tweets.csv' % screen_name, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for tweet in alltweets:
            tweet_text = tweet.text
            dates = tweet.created_at
            writer.writerow([dates, tweet_text])

Hashtag/Cashtag tweets
^^^^^^^^^^^^^^^^^^^^^^^

| A **cashtag** is a Twitter feature that lets users retrieve tweets relevant to a particular ticker, say $GOOG, $AAPL or $FB. Use :code:`tweepy.Cursor()` to access data from hashtags and cashtags.
**Code example** ::

    # extract data from the API
    hashtags = tweepy.Cursor(api.search, q=name, lang='en', tweet_mode='extended').items(200)

    with open('%s_tweets.csv' % screen_name, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for status in hashtags:
            tweet_text = status.full_text
            dates = str(status.created_at)[:10]
            writer.writerow([dates, tweet_text])

| If you want to collect tweets for a given period of time, amend the code snippet in the following way:

::

    import datetime

    with open('%s_tweets.csv' % screen_name, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for status in hashtags:
            # Add this line: keep only the tweets posted within the last day_required days
            if (datetime.datetime.now() - status.created_at).days <= day_required:
                tweet_text = status.full_text
                dates = str(status.created_at)[:10]
                writer.writerow([dates, tweet_text])

Stream tweets
^^^^^^^^^^^^^^^

| The Twitter streaming API is used to download tweets in real time. It is useful for obtaining a high volume of tweets, or for creating a live feed using a site stream. For more information about the API, please visit this link: https://docs.tweepy.org/en/v3.5.0/streaming_how_to.html.

1. Create a class inheriting from StreamListener ::

    import csv
    import sys

    # override tweepy.StreamListener
    class MyStreamListener(tweepy.StreamListener):

        # add logic to the initialisation function
        def __init__(self, output_file=sys.stdout, input_name=sys.stdout):
            super(MyStreamListener, self).__init__()
            self.max_tweets = 200
            # start counting from zero so that max_tweets tweets are collected
            self.tweet_count = 0
            self.input_name = input_name
            # keep the file handle so that on_status can write to it
            self.output_file = output_file

        # add logic to the on_status method
        def on_status(self, status):
            # returning False disconnects the stream once enough tweets are collected
            if self.tweet_count >= self.max_tweets:
                return False
            # long tweets carry their full text in extended_tweet; short ones only have text
            if hasattr(status, 'extended_tweet'):
                tweet_text = status.extended_tweet['full_text']
            else:
                tweet_text = status.text
            writer = csv.writer(self.output_file)
            writer.writerow([status.created_at, tweet_text])
            self.tweet_count += 1

2. Create a stream ::

    # add an output_file parameter to store the output tweets
    myStreamListener = MyStreamListener(output_file=f, input_name=firm)
    myStream = tweepy.Stream(auth=api.auth, tweet_mode='extended', listener=myStreamListener, languages=["en"])

3. Start a stream ::

    # track expects a list of keywords, e.g. ['$AAPL']
    myStream.filter(track=target_firm)

Collect financial headlines
------------------------------------------

US news headlines
^^^^^^^^^^^^^^^^^^

| `Finviz.com <https://finviz.com>`_ is a browser-based stock market research platform that allows visitors to read the latest financial news collected from major news providers such as Yahoo! Finance, Accesswire, and Newsfile.
| Before the tutorial, it is important to take a look at the front-end code of the website.

.. .. figure:: images/apple_finviz_example.png

.. image:: images/apple_finviz_example.png
    :width: 400px
    :align: center
    :height: 252px
    :alt: "AAPL finviz.com example"

1. Access the website of each ticker through the :code:`urllib.request` module ::

    from urllib.request import Request, urlopen

    allnews = []
    finviz_url = 'https://finviz.com/quote.ashx?t='
    url = finviz_url + ticker
    req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
    resp = urlopen(req)

2. Get the HTML document using Beautiful Soup ::

    from bs4 import BeautifulSoup

    html = BeautifulSoup(resp, features="lxml")

3. Get the information of the element with :code:`id='news-table'` on the page ::

    news_table = html.find(id='news-table')
    # news_tables is a dict keyed by ticker, created beforehand (news_tables = {})
    news_tables[ticker] = news_table

4. Find all the news under the :code:`<tr>` tag in the news-table ::

    for info in news_table.findAll('tr'):
        # skip rows that do not contain a headline link
        if info.a is None:
            continue
        text = info.a.get_text()
        date_scrape = info.td.text.split()
        # a row that shares its date with the previous row only shows the time
        if len(date_scrape) == 1:
            time = date_scrape[0]
        else:
            date = date_scrape[0]
            time = date_scrape[1]
        news_time_str = date + " " + time

5. Convert the date format to 'YYYY-MM-DD' (still inside the loop from step 4) ::

    date_time_obj = datetime.datetime.strptime(news_time_str, '%b-%d-%y %I:%M%p')
    date_time = date_time_obj.strftime('%Y-%m-%d')

6. Append all the news together (still inside the loop from step 4) ::

    allnews.append([date_time, text])

HK news headlines
^^^^^^^^^^^^^^^^^^^

| We will also learn how to collect news headlines from `aastocks.com <http://www.aastocks.com>`_. The website is one of the most popular financial information platforms in Hong Kong. It offers real-time international information relevant to Hong Kong shares, which is useful for analysing sentiment and trends in the local market.
| Again, before writing code to scrape the news, we need to take a look at the front-end code of the website. Take Tencent (00700.HK) as an example. Click 'inspect' and you can view the front-end code of the website. (Or visit this link: http://www.aastocks.com/en/stocks/analysis/stock-aafn/00700/0/all/1)

.. .. figure:: images/tencent_aastock_example.png

.. image:: images/tencent_aastock_example.png
    :width: 540px
    :align: center
    :height: 182px
    :alt: "Tencent aastock.com example"

| From the above snippet, we can see that the date is stored within :code:`<div class="inline_block">`, while the news headlines are stored within :code:`<div class="newshead4">`.
| The following steps are similar to those for collecting US news headlines.

1. Access the website of each ticker through the :code:`urllib.request` module ::

    prefix_url = 'http://www.aastocks.com/en/stocks/analysis/stock-aafn/'
    postfix_url = '/0/all/1'
    url = prefix_url + fill_ticker + postfix_url
    req = Request(url=url, headers={'user-agent': 'my-app/0.0.1'})
    resp = urlopen(req)

2. Get the HTML document using Beautiful Soup ::

    html = BeautifulSoup(resp, features="lxml")

    # get the html code containing the dates and news
    dates = html.findAll("div", {"class": "inline_block"})
    news = html.findAll("div", {"class": "newshead4"})

3. Find all the news and corresponding dates from the html code from step 2 ::

    # track the index in the news list
    idx = 0

    with open('%s_tweets.csv' % screen_name, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for i in dates:
            # as the dates are in yyyy/mm/dd format
            if "/" in str(i.get_text()):
                date = str(i.get_text())
                # the front-end code is not standardised and sometimes contains 'Release Time' string
                if "Release Time" in date:
                    date = date[13:23]
                else:
                    date = str(date[:10])
                text = news[idx].get_text()
                date_time_obj = datetime.datetime.strptime(date, '%Y/%m/%d')
                # standardise the date format as 'YYYY-mm-dd'
                date_time = date_time_obj.strftime('%Y-%m-%d')
                # set the number of days you want to collect
                if (datetime.datetime.now() - date_time_obj).days <= day_required:
                    writer.writerow([date_time, text])
                # move on to the headline that belongs to the next date
                idx += 1

VADER sentiment prediction
--------------------------

| After collecting the data, it is time to carry out the analysis on the dataset.
| VADER (Valence Aware Dictionary and sEntiment Reasoner) is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity (strength) of emotion. It is available in the NLTK package and can be applied directly to unlabelled text data.
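To give a feel for how VADER expresses intensity on a bounded scale, the sketch below reproduces the normalisation step that, to our understanding of the reference implementation, maps the summed word valences :code:`x` to a single score via :code:`x / sqrt(x**2 + alpha)` with :code:`alpha = 15`. The function name :code:`normalise_valence` and the sample inputs are illustrative assumptions, not part of NLTK's public API.

```python
import math


def normalise_valence(total_valence, alpha=15.0):
    """Squash an unbounded summed valence into the range [-1, 1].

    alpha = 15 is assumed here as the default constant; treat it as a tunable parameter.
    """
    score = total_valence / math.sqrt(total_valence ** 2 + alpha)
    # clip defensively so that the result never leaves [-1, 1]
    return max(-1.0, min(1.0, score))


# stronger summed valence gives a score closer to -1 or 1, but never beyond
for total in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(total, round(normalise_valence(total), 4))
```

The sign of the result carries the polarity, while its magnitude carries the intensity, which is why a single threshold on this score can later separate positive, neutral and negative text.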
| The sentiment labels are generated from the VADER compound score according to the following rules:

* Positive sentiment (= 2): compound score > 0.01
* Neutral sentiment (= 1): −0.01 ≤ compound score ≤ 0.01
* Negative sentiment (= 0): compound score < −0.01

| Note that 1% was set as the threshold value to account for the average stock movement in the US market. Feel free to set any value for your own analysis.

1. Import these libraries ::

    import os
    import pandas as pd
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    from nltk.corpus import twitter_samples

    # download the VADER lexicon once, then create the analyser used below
    nltk.download('vader_lexicon')
    analyzer = SentimentIntensityAnalyzer()

2. The :code:`polarity_scores()` method of VADER's :code:`SentimentIntensityAnalyzer()` takes in a string and returns a dictionary of scores in each of four categories:

* negative
* neutral
* positive
* compound (the summed word-level valence scores, normalised to range from -1 to 1)

Let us analyse the data that we have collected through the sentiment analyser.

::

    # pass in the name of the csv file containing the data
    def read_tweets_us_path(path):
        # could change to your own path (dir_name is the directory that holds the data folder)
        path = os.path.join(dir_name, 'train-data/' + path)
        # read in data as pandas dataframe
        df = pd.read_csv(path)
        # compute the compound score of every tweet
        cs = []
        for row in range(len(df)):
            cs.append(analyzer.polarity_scores(df['tweets'].iloc[row])['compound'])
        # create a new column for the calculated results
        df['compound_vader_score'] = cs
        print(df)
        return df

3. Label the sentiment for each tweet

Parameters:

* :code:`grouped_data`: consolidated data with features including (dates, tweets, compound_vader_score)
* :code:`file_name`: the output name after the label function
* :code:`perc_change`: the threshold value for labelling the sentiment

**Code example** ::

    def find_tweets_pred_label(grouped_data, file_name, perc_change):
        print('find_pred_label')
        # group the tweets within the csv using ['dates','ticker'] index:
        # average the scores and join the tweets so that both stay aligned per group
        grouped_data = (grouped_data.groupby(['dates', 'ticker'])
                        .agg(compound_vader_score=('compound_vader_score', 'mean'),
                             tweets=('tweets', ' '.join))
                        .reset_index())
        final_label = []
        for i in range(len(grouped_data)):
            score = grouped_data['compound_vader_score'].iloc[i]
            if score > perc_change:
                final_label.append(2)
            elif score < -perc_change:
                final_label.append(0)
            else:
                final_label.append(1)
        # add the column of vader_label
        grouped_data['vader_label'] = final_label
        grouped_data.to_csv(file_name)

4. Merge all the data together

* actual label (= 2): price movement > 0.01
* actual label (= 1): −0.01 ≤ price movement ≤ 0.01
* actual label (= 0): price movement < −0.01

Parameters:

* :code:`file_name`: consolidated data with features including (dates, tweets, compound_vader_score)
* :code:`label_data`: the label data containing the actual label from Yahoo Finance

**Code example** ::

    def merge_actual_label(file_name, label_data):
        vader_data = pd.read_csv(file_name)
        vader_data.set_index(keys=["dates", "ticker"], inplace=True)
        label_data = pd.read_csv(label_data)
        label_data.set_index(keys=["dates", "ticker"], inplace=True)
        # merge the actual label and the predicted label into a single pandas data frame
        merge = pd.merge(vader_data, label_data, how='inner', left_index=True, right_index=True)
        # drop returns a new data frame, so assign the result back
        merge = merge.drop(columns=['Unnamed: 0_y'], errors='ignore')
        return merge

5. Validation using confusion matrix

Parameters:

* :code:`df`: the final merged pandas dataframe
* :code:`name`: the output csv file containing all the merged information with dates, tweets, vader label and actual label

**Code illustration** ::

    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt

    def validation(df, name):
        pred_label = list(df['vader_label'])
        actual_label = list(df['label'])
        labels = [0, 1, 2]
        # labels must be passed by keyword in recent scikit-learn versions
        cm = confusion_matrix(actual_label, pred_label, labels=labels)
        # note: these four names describe a 2x2 matrix and do not map one-to-one onto the 3x3 cells
        group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
        categories = ['Negative', 'Neutral', 'Positive']
        # make_confusion_matrix is a plotting helper that is not part of scikit-learn;
        # it is assumed to be defined elsewhere
        make_confusion_matrix(cm, group_names=group_names, categories=categories)
        df.to_csv(name)

.. attention::

   | All investments entail inherent risk. This repository seeks to solely educate people on methodologies to build and evaluate algorithmic trading strategies. All final investment decisions are yours and as a result you could make or lose money.
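As a complement to the confusion matrix in the validation step, the overall accuracy and the per-class recall can be read off the same two label lists. The following is a minimal sketch in plain Python, assuming the 0/1/2 label encoding used in this tutorial. The function names and the sample labels are hypothetical and only illustrate the calculation.

```python
def confusion_counts(actual, predicted, labels=(0, 1, 2)):
    """Return a matrix whose rows are actual labels and whose columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted label equals actual label."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def recall_per_class(matrix):
    """For each actual class, the share of its samples that were predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# made-up labels purely for illustration (0 = negative, 1 = neutral, 2 = positive)
actual_label = [2, 2, 1, 0, 0, 1, 2, 0]
pred_label = [2, 1, 1, 0, 2, 1, 2, 0]

cm = confusion_counts(actual_label, pred_label)
print(cm)
print(accuracy(cm))
print(recall_per_class(cm))
```

Per-class recall is worth checking alongside accuracy because the neutral class can dominate the data when the threshold is wide, which makes a single accuracy figure look better than the predictions for the positive and negative classes really are.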