From 2d59b608a64d9fcc3e661bd66e285a4337844197 Mon Sep 17 00:00:00 2001
From: Angel Umana <60790401+ibidyouadu@users.noreply.github.com>
Date: Thu, 23 Jul 2020 04:34:43 -0400
Subject: [PATCH] Complete revision, txt to md

---
 readme.md  | 21 +++++++++++++++++++++
 readme.txt |  9 ---------
 2 files changed, 21 insertions(+), 9 deletions(-)
 create mode 100644 readme.md
 delete mode 100644 readme.txt

diff --git a/readme.md b/readme.md
new file mode 100644
index 0000000..0aa7ee3
--- /dev/null
+++ b/readme.md
@@ -0,0 +1,21 @@
+# BERT Tweet Classifier with Keyword Neighbors
+
+## In a nutshell
+This repository contains all data, code, and results for a tweet classification project carried out as part of my work with [Dr. Wenying Ji's civil engineering research group](http://mason.gmu.edu/~wji2/team.xhtml).
+
+This project tackles the broad question *Can we use social media data (tweets) to assess the severity of natural disasters?* and, more specifically, the question *Does using the context surrounding keywords in tweets help text classifiers label tweets as related to power infrastructure or not?*
+
+
+## Example of what the code does
+Concretely, suppose we have the following tweet:
+
+`Throwback to when my apartment had electricity?????? #ThanksIrma @ Orlando, Florida https://t.co/R9YnR8sR8y`
+
+We would like to automatically label it as related to power infrastructure or not. In this case it is related, since it talks about electric utility service. Before we feed it to a text classifier (in this project, either a BERT Keras model or a random forest), we clean the text and slice it so that the slice is centered on a certain keyword and stays within a specified "distance", i.e. number of words. We call these string slices **neighborhoods** around our keyword, which in this case is `electricity`. After cleaning the text and using a neighborhood radius of 2, we get the following:
+
+`throwback to when my apartment had electricity?????? #ThanksIrma @ Orlando,`
+
+Even though the radius is 2, there are more than 2 words to the left of the keyword because a certain class of words (stop words) is not counted. It is this string that is then fed to a text classifier. Our results indicate that this improved performance for the BERT classifier but not for the random forest classifier.
+
+## Code and Documentation
+That is the project in a nutshell. The random forest model code is [here](https://github.com/ibidyouadu/tweet_classification/blob/master/modeling/RF_tweet_classification.ipynb) and the BERT model code is [here](https://github.com/ibidyouadu/tweet_classification/blob/master/modeling/BERT_tweet_classification.ipynb), both in the form of annotated Jupyter notebooks. For more details on the research methods and results, see the documentation [here](https://github.com/ibidyouadu/tweet_classification/blob/master/documentation/tweet_classification_documentation.pdf).
diff --git a/readme.txt b/readme.txt
deleted file mode 100644
index 9942e47..0000000
--- a/readme.txt
+++ /dev/null
@@ -1,9 +0,0 @@
-This contains all data, code, results, etc. for the tweet classification project.
-
-For raw tweets, keyword scoring, and labeling results, see the data folder. Be careful before using many of the results for analysis however. They may need to be redone in some cases. Consult the readme's within
-
-The documentation is written in LaTeX. The pdf and source code can be found in the documentation folder along with the used media.
-
-For random forest and BERT model notebooks and their results, see the modeling folder. Unlike the previous folder, all the stuff there is up to date and has been carefully maintained.
-
-power_outages is not related to classification exactly, but is the next step in the project: using the refined, labeled, power-related twitter data to analyze outage data. It contains only data from Florida during Irma
\ No newline at end of file
-- 
2.43.0
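
The keyword-neighborhood extraction described in the added readme can be sketched roughly as follows. This is a hypothetical illustration, not the project's actual code (which lives in the linked notebooks): the stop-word list, tokenization, and punctuation handling here are all assumptions made to reproduce the readme's example.

```python
# Illustrative sketch of extracting a keyword "neighborhood": keep `radius`
# non-stop-words on each side of the keyword; stop words are kept in the
# slice but do not count toward the radius. Assumes the text is already
# cleaned (lowercased, etc.) as described in the readme.

# Tiny stop-word list for this example only; a real implementation would
# presumably use a full list (e.g. NLTK's). The "@" token is treated as a
# stop word here as a simplification.
STOP_WORDS = {"to", "when", "my", "had", "a", "an", "the", "of", "and", "@"}

def keyword_neighborhood(text, keyword, radius):
    """Return the slice of `text` centered on `keyword`, extending until
    `radius` non-stop-words have been collected on each side."""
    tokens = text.split()
    # Find the token containing the keyword (punctuation may be attached).
    center = next(i for i, t in enumerate(tokens) if keyword in t.lower())

    def extend(start, step):
        i, counted = start, 0
        while 0 <= i + step < len(tokens) and counted < radius:
            i += step
            if tokens[i].lower().strip(".,!?#") not in STOP_WORDS:
                counted += 1
        return i

    left, right = extend(center, -1), extend(center, +1)
    return " ".join(tokens[left:right + 1])

tweet = ("throwback to when my apartment had electricity?????? "
         "#ThanksIrma @ Orlando, Florida https://t.co/R9YnR8sR8y")
print(keyword_neighborhood(tweet, "electricity", radius=2))
# → throwback to when my apartment had electricity?????? #ThanksIrma @ Orlando,
```

With radius 2 the left side reaches back to `throwback` (the second non-stop-word) and the right side stops at `Orlando,`, matching the example in the readme.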