🧬 Introduction to NLP for Life Sciences

A practical introduction to Natural Language Processing techniques applied to biomedical literature using the CORD-19 dataset.

📋 Description

This repository contains a Jupyter notebook that demonstrates NLP techniques for analyzing biomedical literature, using COVID-19 research papers from the CORD-19 dataset. The notebook walks step by step through each approach, from basic text preprocessing to semantic search with BERT embeddings.

✨ Features

Text Preprocessing: Clean and prepare biomedical text data for analysis
Language Detection: Identify and filter articles by language
Regular Expression Heuristics: Extract specific patterns like clinical trial identifiers
Word Clouds: Visualize dominant terms in the corpus
N-gram Analysis: Explore common multi-word phrases in the literature
Topic Modeling: Discover latent topics using Latent Dirichlet Allocation (LDA)
Semantic Search: Implement BERT-based search functionality to find relevant articles
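
As a rough illustration of three of the simpler features above (text preprocessing, regular expression heuristics, and n-gram analysis), here is a minimal standard-library sketch. It is not the notebook's code: the sample text, the trial identifiers in it, the stop-word list, and the function names `preprocess`, `extract_trial_ids`, and `ngram_counts` are all hypothetical. The only outside fact it relies on is that ClinicalTrials.gov identifiers have the form `NCT` followed by eight digits.

```python
import re
from collections import Counter

# Made-up sample text with placeholder trial IDs; the notebook would
# work on the body_text column of the CSV files instead.
SAMPLE = (
    "The treatment was evaluated in a randomized trial (NCT01234567). "
    "A second randomized trial, NCT07654321, enrolled adults."
)

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")

# Tiny illustrative stop-word list (an assumption, not the notebook's list).
STOP_WORDS = {"a", "an", "the", "in", "with", "was", "of", "and"}


def preprocess(text):
    """Lowercase, keep alphanumeric tokens (inner hyphens allowed), drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def extract_trial_ids(text):
    """Return unique NCT identifiers in order of first appearance."""
    return list(dict.fromkeys(NCT_PATTERN.findall(text)))


def ngram_counts(tokens, n=2):
    """Count contiguous n-grams over a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


if __name__ == "__main__":
    print(extract_trial_ids(SAMPLE))
    print(ngram_counts(preprocess(SAMPLE)).most_common(3))
```

The identifier pattern runs on the raw text rather than on the cleaned tokens, because lowercasing during preprocessing would change the case of the `NCT` prefix.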

🔧 Prerequisites

Python 3.6+
Jupyter Notebook or Google Colab
Basic understanding of NLP concepts and Python programming

🚀 Getting Started

Clone this repository:

git clone https://github.com/yourusername/intro_nlp_life_sciences.git
cd intro_nlp_life_sciences

Install the required dependencies:
```
pip install -r requirements.txt
```

Open the Jupyter notebook:

jupyter notebook "Introduction to NLP for life sciences.ipynb"

If using Google Colab, upload the notebook and CSV files to your Google Drive.

Adjust the file paths in the notebook if necessary:

import pandas as pd

# Change this line to point to your CSV file
my_articles = pd.read_csv('cord19-subset-500.csv')

📊 Dataset

The repository includes two subsets of the CORD-19 (COVID-19 Open Research Dataset):

cord19-subset-100.csv: A smaller subset with 100 research papers
cord19-subset-500.csv: A larger subset with 500 research papers

Each CSV file contains the following columns:

paper_id: Unique identifier for each paper
title: Title of the research paper
body_text: Full text content of the paper
journal: Journal where the paper was published
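
To make the column layout above concrete, the following sketch reads a CSV with those four columns and fails early if one is missing. It uses the standard-library `csv` module rather than pandas so that it runs without extra dependencies. The two sample rows are made up, and the names `EXPECTED_COLUMNS`, `load_articles`, and `word_counts` are hypothetical, not taken from the notebook.

```python
import csv
import io

# Column names as listed in this README.
EXPECTED_COLUMNS = ["paper_id", "title", "body_text", "journal"]

# Made-up two-row sample standing in for cord19-subset-100.csv.
SAMPLE_CSV = """paper_id,title,body_text,journal
p1,Example title one,"Example body text, with a comma.",Example Journal
p2,Example title two,Second example body text.,Another Journal
"""


def load_articles(handle):
    """Read rows into dicts; raise ValueError if an expected column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def word_counts(rows):
    """Map each paper_id to the whitespace-delimited word count of its body_text."""
    return {row["paper_id"]: len(row["body_text"].split()) for row in rows}


if __name__ == "__main__":
    rows = load_articles(io.StringIO(SAMPLE_CSV))
    print(word_counts(rows))
```

To run this against one of the real files, pass an open file handle instead of the in-memory sample, for example `open('cord19-subset-100.csv', newline='', encoding='utf-8')`. Very long `body_text` values may exceed the `csv` module's default field size limit, which `csv.field_size_limit` can raise.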

📚 Resources

📄 License

This project is available for educational and research purposes.
