Project repo for CL Team Laboratory at the University of Stuttgart. Mirror from GitHub repo => https://github.com/pavan245/citation-analysis

Citation Intent Classification

Project repo for Computational Linguistics Team Lab at the University of Stuttgart.

Introduction

This repository contains code and datasets for classifying citation intents in research papers.

We implemented three classifiers and evaluated them on the test dataset:

  • Perceptron Classifier - Baseline model (Implemented from scratch)
  • Feedforward Neural Network Classifier (using PyTorch)
  • BiLSTM + Attention with ELMo Embeddings (using AllenNLP library)

This README focuses on running the code base, training the models and making predictions. For more information about our project work, model results and detailed error analysis, see the project report (final_project_report.pdf). Slides from our mid-term presentation are available in presentation.pdf.
For more information on Citation Intent Classification in Scientific Publications, see the original published paper (scicite_paper.pdf) and the SciCite GitHub repo [1].

Environment & Setup

This project requires Python 3.5 or later. It runs inside a virtual environment, which we install and create as follows.

Installing virtualenv

python3 -m pip install --user virtualenv

Creating a virtual environment

venv (for Python 3) allows us to manage separate package installations for different projects.

python3 -m venv citation-env

Activating the virtual environment

Before installing or using packages in the virtual environment, we need to activate it.

source citation-env/bin/activate

After activating the virtual environment, the console prompt should look like this:

(citation-env) [user@server ~]$ 

Leaving the virtual environment

To leave the virtual environment, simply run:

deactivate

Cloning the Repository

git clone https://github.com/yelircaasi/citation-analysis.git

Now change the current working directory to the project root folder (> cd citation-analysis).
Note: Stay in the Project root folder while running all the experiments.

Installing Packages

Now we can install all the packages required to run this project, which are listed in the requirements.txt file.

(citation-env) [user@server citation-analysis]$ pip install -r requirements.txt

Environment Variable for Saved Models Path

Run the following line in the console. This variable is used later when training, evaluating and predicting with the AllenNLP model.

export SAVED_MODELS_PATH=/mount/arbeitsdaten/studenten1/team-lab-nlp/mandavsi_rileyic/saved_models

Data

This project uses a large dataset of citation intents provided by the SciCite GitHub repo [1]. The dataset itself can be downloaded via [2].
The dataset has three intents/classes:

  • background (background information)
  • method (use of methods)
  • result (comparing results)

Dataset Class distribution:

        background   method   result
train   4.8 K        2.3 K    1.1 K
dev     0.5 K        0.3 K    0.1 K
test    1 K          0.6 K    0.2 K
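
The per-class counts above can be recomputed from the JSONL files. The following is a minimal sketch, assuming each line is a JSON object that stores its intent under a "label" key. The helper name class_distribution, the "string" key and the sample records are hypothetical and only illustrate the counting.

```python
import json
from collections import Counter


def class_distribution(lines, label_key="label"):
    """Count how often each intent label occurs in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        counts[json.loads(line)[label_key]] += 1
    return counts


# Hypothetical in-memory sample standing in for a real JSONL split.
sample = [
    '{"string": "Prior work studied this [1].", "label": "background"}',
    '{"string": "We use the parser of [2].", "label": "method"}',
    '{"string": "Earlier studies report similar effects [3].", "label": "background"}',
]
distribution = class_distribution(sample)
print(distribution)
```

To run it on a real split, pass an open file handle instead of the list, for example the handle returned by open("data/jsonl/test.jsonl").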

Methods (Classification)

1) Perceptron Classifier (Baseline Classifier)

We implemented the Perceptron from scratch (including evaluation) as a baseline classifier. The Perceptron is a supervised learning algorithm for classification. It is a linear, binary classifier: it can only decide whether or not an input, represented by its features, belongs to one specific class, and it can only learn linearly separable patterns.

# Interface overview only; method bodies are omitted here.
class Perceptron:
  def __init__(self, label: str, weights: dict, theta_bias: float): ...
  def score(self, features: list): ...
  def update_weights(self, features: list, learning_rate: float, penalize: bool, reward: bool): ...

class MultiClassPerceptron:
  def __init__(self, epochs: int = 5000, learning_rate: float = 1, random_state: int = 42): ...
  def fit(self, X_train: list, labels: list): ...
  def predict(self, X_test: list): ...

Since the dataset has three classes, we create one Perceptron object per class. Each Perceptron has a score function and a weight-update function. During training, every data instance is scored by all three Perceptrons and assigned the label with the highest score. The assigned label is then compared with the true label to decide whether the weights need to be updated (scaled by the learning rate).
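
The training loop described above can be sketched as follows. This is a simplified illustration, not the repository's implementation: it assumes binary features given as lists of feature names, uses a single reward flag instead of separate penalize/reward arguments, and the train and predict helpers as well as the toy data are hypothetical.

```python
import random


class Perceptron:
    """Binary perceptron for a single label (illustrative sketch)."""

    def __init__(self, label, weights=None, theta_bias=0.0):
        self.label = label
        self.weights = weights if weights is not None else {}
        self.theta_bias = theta_bias

    def score(self, features):
        # Sum the weights of the active features, then add the bias.
        return sum(self.weights.get(f, 0.0) for f in features) + self.theta_bias

    def update_weights(self, features, learning_rate, reward):
        # reward=True raises the score for these features, reward=False lowers it.
        delta = learning_rate if reward else -learning_rate
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + delta
        self.theta_bias += delta


def predict(perceptrons, features):
    # Pick the label whose perceptron gives the highest score.
    return max(sorted(perceptrons), key=lambda c: perceptrons[c].score(features))


def train(X_train, labels, epochs=20, learning_rate=1.0, random_state=42):
    perceptrons = {c: Perceptron(c) for c in set(labels)}
    rng = random.Random(random_state)
    data = list(zip(X_train, labels))
    for _ in range(epochs):
        rng.shuffle(data)
        for features, gold in data:
            guess = predict(perceptrons, features)
            if guess != gold:
                # Reward the true class and penalize the wrongly predicted one.
                perceptrons[gold].update_weights(features, learning_rate, reward=True)
                perceptrons[guess].update_weights(features, learning_rate, reward=False)
    return perceptrons


# Toy, linearly separable data (hypothetical feature names).
X = [["uses", "method"], ["compared", "results"], ["previous", "work"]]
y = ["method", "result", "background"]
model = train(X, y)
predictions = [predict(model, features) for features in X]
print(predictions)
```

Weights are only touched when the prediction is wrong, so training stops changing the model once all training instances are classified correctly.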

Check the source code for more details on the implementation of Perceptron Classifier.

Running the Model

(citation-env) [user@server citation-analysis]$ python3 -m testing.model_testing

See the testing.model_testing module for the test source code. All hyperparameters there can be modified for experimentation.

Evaluation

We used the F1 score to evaluate our baseline classifier.

The F1 score is the harmonic mean of precision and recall. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)

eval.metrics.f1_score(y_true, y_pred, labels, average)  

Parameters:

  • y_true : 1-d array or list of gold class values
  • y_pred : 1-d array or list of estimated values returned by a classifier
  • labels : list of labels/classes
  • average : string, one of [None, 'micro', 'macro']. If None, the scores for each class are returned.

See the eval.metrics module for the metrics source code.
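
To make the averaging options concrete, here is a minimal sketch of per-class, micro and macro F1. It is not the code in eval.metrics: the function name f1_scores and the toy labels are hypothetical, and it uses the equivalent form F1 = 2*TP / (2*TP + FP + FN).

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels, average=None):
    """Return per-class F1 (average=None), or the micro/macro averaged F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    if average == "micro":
        # Pool the counts over all classes, then compute a single F1.
        return f1(sum(tp[c] for c in labels),
                  sum(fp[c] for c in labels),
                  sum(fn[c] for c in labels))
    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    if average == "macro":
        # Unweighted mean of the per-class scores.
        return sum(per_class.values()) / len(labels)
    return per_class


labels = ["background", "method", "result"]
y_true = ["background", "background", "method", "result"]
y_pred = ["background", "method", "method", "result"]
per_class = f1_scores(y_true, y_pred, labels)
micro = f1_scores(y_true, y_pred, labels, average="micro")
macro = f1_scores(y_true, y_pred, labels, average="macro")
print(per_class, micro, macro)
```

Macro averaging weights every class equally, which matters here because the result class is much smaller than the background class.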

Results

Confusion Matrix Plot

2) Feedforward Neural Network (using PyTorch)

The second model is a feed-forward neural network classifier with a single hidden layer containing 9 units. While a feed-forward network is clearly not the ideal architecture for sequential text data, we wanted a second baseline to examine the gains (if any) over the perceptron baseline. The input features remained the same as for the perceptron; only the final model (the BiLSTM) is suited to richer inputs such as word embeddings.

Check this feed-forward model source code for more details.

3) BiLSTM + Attention with ELMo (AllenNLP Model)

The third model is a bidirectional Long Short-Term Memory (BiLSTM) network with attention, built using the AllenNLP library. For word representations, we used 100-dimensional GloVe vectors trained on a corpus of 6B tokens from Wikipedia and Gigaword. For contextual representations, we used ELMo embeddings, which have been trained on a dataset of 5.5B tokens. Unlike the first two models, which use selected features of the text, this model uses the entire input text. It has a single-layer BiLSTM with a hidden dimension size of 50 for each direction.

We used AllenNLP's config files to build our model. This only requires implementing a model and a dataset reader and describing them in a JSON config file.

Our BiLSTM AllenNLP model contains 4 major components:

  1. Dataset Reader - CitationDatasetReader
    • It reads the data from the file, tokenizes the input text and creates AllenNLP Instances
    • Each Instance contains a dictionary of tokens and label
  2. Model - BiLstmClassifier
    • The model's forward() method is called on each batch of data instances, with tokens and label passed as arguments
    • The signature of forward() must match the field names of the Instance created by the DatasetReader
    • This Model uses ELMo deep contextualised embeddings.
    • The forward() method finally returns an output dictionary with the predicted label, loss, softmax probabilities and so on...
  3. Config File - basic_model.json
    • The AllenNLP Configuration file takes the constructor parameters for various objects (Model, DatasetReader, Predictor, ...)
    • We can provide a number of Hyperparameters in this Config file.
      • Depth and Width of the Network
      • Number of Epochs
      • Optimizer & Learning Rate
      • Batch Size
      • Dropout
      • Embeddings
    • All the classes that the Config file uses must be registered using Python decorators (for example, @Model.register('bilstm_classifier')).
  4. Predictor - IntentClassificationPredictor
    • AllenNLP uses Predictor, a wrapper around the trained model, for making predictions.
    • The Predictor uses a pre-trained/saved model and dataset reader to predict new Instances
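
The registration mechanism mentioned in the Config File item can be illustrated with a toy registry. This sketch is not AllenNLP's implementation; the Registrable and ToyClassifier classes, the from_config helper and the config keys are hypothetical and only show how a decorator can map the "type" string in a config to a Python class.

```python
class Registrable:
    """Toy registry base class (a stand-in, not AllenNLP's own class)."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(subclass):
            # Remember which class belongs to this name.
            cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def from_config(cls, config):
        params = dict(config)  # copy so the caller's dict is left untouched
        subclass = cls._registry[params.pop("type")]
        # The remaining keys are passed on as constructor parameters.
        return subclass(**params)


@Registrable.register("toy_classifier")
class ToyClassifier(Registrable):
    def __init__(self, hidden_size, dropout=0.0):
        self.hidden_size = hidden_size
        self.dropout = dropout


config = {"type": "toy_classifier", "hidden_size": 50, "dropout": 0.2}
model = Registrable.from_config(config)
print(type(model).__name__, model.hidden_size, model.dropout)
```

This is why every class named in the config has to be importable (hence the --include-package flag): the decorator only runs, and the name only becomes known, once the module defining the class has been loaded.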

Running the Model

AllenNLP provides train, evaluate and predict commands to interact with the models from command line.

Training

$ allennlp train \
    configs/basic_model.json \
    -s $SAVED_MODELS_PATH/experiment_10 \
    --include-package classifier

We ran a few experiments on this model. The run configurations, results and archived models are available in the SAVED_MODELS_PATH directory.
Note: If no GPU is available, set "cuda_device" to -1 in the config file; otherwise set it to the index of an available GPU.

Evaluation

To evaluate the model, simply run:

$ allennlp evaluate \
    $SAVED_MODELS_PATH/experiment_4/model.tar.gz \
    data/jsonl/test.jsonl \
    --cuda-device 3 \
    --include-package classifier

Predictions

To make predictions, simply run:

$ allennlp predict \
    $SAVED_MODELS_PATH/experiment_4/model.tar.gz \
    data/jsonl/test.jsonl \
    --cuda-device 3 \
    --include-package classifier \
    --predictor citation_intent_predictor

There is also another way to make predictions without using the allennlp predict command. It returns the list of predictions, the softmax probabilities and further details useful for error analysis. Simply run the following command:

(citation-env) [user@server citation-analysis]$ python3 -m testing.bilstm_predict

Modify the testing.bilstm_predict module to run predictions on different experiments. It also saves the Confusion Matrix Plot (as shown below) after prediction.
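
The counts behind a confusion matrix plot are simply a table of gold labels against predicted labels. A minimal sketch follows; the function is hypothetical and not the repository's utility code, the toy labels are made up, and the plotting step is left out.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for gold, pred in zip(y_true, y_pred):
        matrix[index[gold]][index[pred]] += 1
    return matrix


labels = ["background", "method", "result"]
y_true = ["background", "background", "method", "result"]
y_pred = ["background", "method", "method", "result"]
matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Off-diagonal cells show which classes are confused with each other, which is the information used in error analysis.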

Results

Confusion Matrix Plot

References

[1] SciCite GitHub Repository
This repository contains datasets and code for classifying citation intents. Our project is based on this repository.

[2] SciCite Dataset
Large Dataset of Citation Intents

[3] AllenNLP Library
An open-source NLP research library, built on PyTorch.

[4] ELMo Embeddings
Deep Contextualized word representations.

[5] AllenNLP Guide
A Guide to Natural Language Processing With AllenNLP.