Register for upcoming webinars and watch past webinars, presented by John Snow Labs.
Deep neural network models have recently achieved state-of-the-art performance gains in a variety of natural language processing (NLP) tasks. However, these gains rely on the availability of large amounts of annotated data, without which state-of-the-art performance is rarely achievable. This is especially problematic for the many NLP domains where annotated examples are scarce, such as medical text.
Named entity recognition (NER) is one of the most important tasks for the development of more sophisticated NLP systems. In this webinar, we will walk you through how to train a custom NER model using BERT embeddings in Spark NLP – taking advantage of transfer learning to greatly reduce the amount of annotated text needed to achieve accurate results. After the webinar, you will be able to train your own NER models with your own data in Spark NLP.
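As a preview, here is a minimal sketch of that training flow in Spark NLP's Python API (the dataset path is a placeholder, and the embedding model can be swapped for any published BERT variant):

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings, NerDLApproach
from sparknlp.training import CoNLL

spark = sparknlp.start()

# The CoNLL reader yields document, sentence, token, and label columns
training_data = CoNLL().readDataset(spark, "path/to/train.conll")

# Pre-trained BERT embeddings supply the transfer learning
embeddings = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_tagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10)

model = Pipeline(stages=[embeddings, ner_tagger]).fit(training_data)
```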
Recent advances in deep learning enable automated de-identification of medical data to approach the accuracy achievable via manual effort. This includes accurate detection & obfuscation of patient names, doctor names, locations, organizations, and dates from unstructured documents – or accurate detection of column names & values in structured tables. This webinar explains:
What’s required to de-identify medical records under the US HIPAA privacy rule
Typical de-identification use cases, for structured and unstructured data
How to implement de-identification for these use cases using Spark NLP for Healthcare

After the webinar, you will understand how to de-identify data automatically, accurately, and at scale, for the most common scenarios.
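For the unstructured case, the final stage of a Spark NLP for Healthcare pipeline can look roughly like this (a sketch assuming a licensed installation and upstream stages that have already detected PHI entities in a column named "ner_chunk"):

```python
from sparknlp_jsl.annotator import DeIdentification

# Replaces detected PHI with realistic surrogates ("obfuscate")
# or with entity-type placeholders ("mask")
deid = DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")
```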
Model governance defines a collection of best practices for data science – versioning, reproducibility, experiment tracking, automated CI/CD, and others. In a high-compliance setting where the data used for training or inference contains protected health information (PHI) or similarly sensitive data, additional requirements are added, such as strong identity management, role-based access control, approval workflows, and a full audit trail.
This webinar summarizes requirements and best practices for establishing a high-productivity data science team within a high-compliance environment. It then demonstrates how these requirements can be met using John Snow Labs’ Healthcare AI Platform.
The immense variety of terms, jargon, and acronyms used in medical documents means that named entity recognition of diseases, drugs, procedures, and other clinical entities isn't enough for most real-world healthcare AI applications. For example, knowing that "renal insufficiency", "decreased renal function", and "renal failure" should be mapped to the same code, before using that code as a feature in a patient risk prediction or clinical guidelines recommendation model, is critical to that model's accuracy. Without it, the training algorithm will see these three terms as three separate features and will severely underestimate the relevance of this condition.
This need for entity resolution, also known as entity normalization, is therefore a key requirement for a healthcare NLP library. This webinar explains how Spark NLP for Healthcare addresses this issue by providing trainable, deep-learning-based, clinical entity resolution, as well as pre-trained models for the most commonly used medical terminologies: SNOMED-CT, RxNorm, ICD-10-CM, ICD-10-PCS, and CPT.
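As an illustration, plugging a pre-trained ICD-10-CM resolver into a pipeline looks roughly like this (a sketch assuming a licensed Spark NLP for Healthcare installation, with upstream stages producing entity chunks and their sentence embeddings):

```python
from sparknlp_jsl.annotator import SentenceEntityResolverModel

# Maps each clinical entity chunk (via its BioBERT sentence embedding)
# to the closest ICD-10-CM code, so that "renal insufficiency" and
# "renal failure" resolve to the same neighborhood of codes
icd10_resolver = SentenceEntityResolverModel \
    .pretrained("sbiobertresolve_icd10cm", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("icd10_code")
```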
Are you working on machine learning tasks such as sentiment analysis, named entity recognition, text classification, image classification or audio segmentation? If so, you need training data adapted for your particular domain and task.
This webinar will explain the best practices and strategies for getting the training data you need. We will go over the setup of the annotation team, the workflows that need to be in place for guaranteeing high accuracy and labeler agreement, and the tools that will help you increase productivity and eliminate errors.
Spark OCR is an optical character recognition library that can scale natively on any Spark cluster; enables processing documents privately, without uploading them to a cloud service; and most importantly, provides state-of-the-art accuracy for a variety of common use cases. A primary method of maximizing accuracy is a set of pre-built image pre-processing transformers – for noise reduction, skew correction, object removal, automated scaling, erosion, binarization, and dilation. These transformers can be combined into OCR pipelines that effectively resolve common 'document noise' issues that reduce OCR accuracy.
This webinar describes real-world OCR use cases, common accuracy issues they bring, and how to use image transformers in Spark OCR in order to resolve them at scale. Example Python code will be shared using executable notebooks that will be made publicly available.
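As a flavor of those notebooks, here is a minimal sketch of such a pipeline (assuming a licensed Spark OCR installation; the transformer settings shown are illustrative):

```python
from pyspark.ml import PipelineModel
from sparkocr.transformers import BinaryToImage, ImageSkewCorrector, ImageToText

# Decode raw file bytes into an image column
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# Correct page rotation before recognition
skew_corrector = ImageSkewCorrector() \
    .setInputCol("image") \
    .setOutputCol("corrected_image") \
    .setAutomaticSkewCorrection(True)

# Run OCR on the cleaned image
ocr = ImageToText() \
    .setInputCol("corrected_image") \
    .setOutputCol("text")

pipeline = PipelineModel(stages=[binary_to_image, skew_corrector, ocr])
```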
Artificial intelligence projects in high-compliance industries, like healthcare and life science, often require processing Protected Health Information (PHI). This may happen because the nature of the project does not allow full de-identification in advance – for example, when dealing with rare diseases, genetic sequencing data, identity theft, or training de-identification models – or when training uses anonymized data but inference must happen on data containing PHI.
In such scenarios, the alternative is to create an “AI cleanroom” – an isolated, hardened, air-gapped environment where the work happens. Such a software platform should enable data scientists to log into the cleanroom and do all of their development work inside it – from initial data exploration & experimentation to model deployment & operations – while no data, computation, or generated assets ever leave the cleanroom.
This webinar first presents the architecture of such a Cleanroom AI Platform, which has been actively used by Fortune 500 companies for the past three years. Second, it surveys the hundreds of DevOps & SecOps features required to realize such a platform – from multi-factor authentication and point-to-point encryption to vulnerability scanning and network isolation. Third, it explains how a Kubernetes-based architecture enables “Cleanroom AI” without giving up the main benefits of cloud computing: elasticity, scalability, turnkey deployment, and a fully managed environment.
One kind of noisy data that healthcare data scientists deal with is scanned documents and images: from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging. These files are challenging to de-identify because protected health information (PHI) can appear anywhere in free text – and so cannot be removed with rules or regular expressions – or be “burned” into images, so that it is not even available as digital text to begin with.
This webinar presents a software system that tackles these challenges, shares lessons learned from applying it in real-world production systems, and walks through the end-to-end workflow it uses.
The ability to directly answer medical questions asked in natural language either about a single entity (“what drugs has this patient been prescribed?”) or a set of entities (“list stage 4 lung cancer patients with no history of smoking”) has been a longstanding industry goal, given its broad applicability across many use cases.
This webinar presents a software solution, based on state-of-the-art deep learning and transfer learning research, for translating natural language questions into SQL statements. The case study is a system that answers clinical questions by training domain-specific models and learning from reference data. This is a production-grade, trainable, and scalable capability of Spark NLP Enterprise. Live notebooks will be shared to explain how you can use it in your own projects.
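To make the target concrete, here is the kind of translation such a system performs; the schema and SQL below are invented purely for illustration:

```python
# Illustration only: an example question and the kind of SQL
# a trained model could generate against an invented patient schema
question = "list stage 4 lung cancer patients with no history of smoking"

generated_sql = """
SELECT p.patient_id
FROM patients p
JOIN diagnoses d ON d.patient_id = p.patient_id
WHERE d.condition = 'lung cancer'
  AND d.stage = 4
  AND p.smoking_history = 'never'
"""
```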
Learn how to unleash the power of 350+ pre-trained NLP models, 100+ word embeddings, 50+ sentence embeddings, and 50+ classifiers in 46 languages with 1 line of Python code. John Snow Labs' new NLU library marries the power of Spark NLP with the simplicity of Python. Tackle NLP tasks like NER, POS tagging, emotion analysis, keyword extraction, question answering, sarcasm detection, and document classification using state-of-the-art techniques. The end-to-end library includes word & sentence embeddings like BERT, ELMo, ALBERT, XLNet, ELECTRA, USE, Small-BERT, and others; text wrangling and cleaning like tokenization, chunking, lemmatizing, stemming, normalizing, spell-checking, and matchers; and easy t-SNE visualization of your embedded data.
Christian Kasim Loan, the creator of NLU, will walk through NLU and show you how easy it is to generate t-SNE visualizations of 6 deep learning embeddings, achieve top classification results on text problems from Kaggle competitions with 1 line of NLU code, and leverage the latest & greatest advances in deep learning & transfer learning.
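The one-line pattern looks like this ('sentiment' and 'ner' are two of the published NLU spells):

```python
import nlu

# Load a pre-trained sentiment model and predict, in one line
nlu.load('sentiment').predict('I love NLU!')

# Switching tasks is just a different spell, e.g. named entity recognition
nlu.load('ner').predict('John Snow Labs is a healthcare AI company.')
```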
Adverse Drug Events (ADEs) are potentially very dangerous to patients and are among the top causes of morbidity and mortality. Monitoring & reporting of ADEs is required of pharma companies and healthcare providers. This session introduces new state-of-the-art deep learning models for automatically detecting whether a free-text paragraph includes an ADE (document classification), as well as extracting the key terms of the event in structured form (named entity recognition). Using live Python notebooks and real examples from clinical and conversational text, we'll show how to apply these models using the Spark NLP for Healthcare library.
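For instance, a pre-trained ADE pipeline can be applied in a few lines (a sketch assuming a licensed Spark NLP for Healthcare installation; the pipeline name follows the published clinical models):

```python
from sparknlp.pretrained import PretrainedPipeline

# Bundles ADE document classification and named entity recognition
ade_pipeline = PretrainedPipeline("explain_clinical_doc_ade", "en", "clinical/models")

result = ade_pipeline.annotate(
    "I feel a bit drowsy and have a little blurred vision after taking Lipitor.")
```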
Learn to harness the power of 1,000+ production-grade & scalable NLP models for 200+ languages - all available with just 1 line of Python code by leveraging the open-source NLU library, which is powered by the widely popular Spark NLP.
This webinar will show you how to leverage the multilingual capabilities of Spark NLP & NLU – including automated language detection for up to 375 languages, and the ability to perform translation, named entity recognition, stopword removal, lemmatization, and more across a variety of language families. We will write Python code live and solve these problems in just 30 minutes. The notebooks will then be made freely available online.
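As a preview, both language detection and translation follow the same one-line pattern (the 'lang' spell is published for language detection; the translation spell shown is an assumption based on NLU's published naming scheme):

```python
import nlu

# Detect the language of incoming text
nlu.load('lang').predict('Wie geht es dir heute?')

# Translate German to English with a pre-trained Marian model
# (assumed spell name, following the <source>.translate_to.<target> scheme)
nlu.load('de.translate_to.en').predict('Wie geht es dir heute?')
```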
The NLP Models Hub, which powers the Spark NLP and NLU libraries, takes a different approach from the hubs of other libraries like TensorFlow, PyTorch, and Hugging Face. While it also provides an easy-to-use interface to find, understand, and reuse pre-trained models, it focuses on providing production-grade, state-of-the-art models for each NLP task rather than a comprehensive archive.
This implies a higher quality bar for accepting community contributions to the NLP Models Hub - in terms of automated testing, level of documentation, and transparency of accuracy metrics and training datasets. This webinar shows how you can make the most of it, whether you're looking to easily reuse models or contribute new ones.
The Transformer architecture in NLP has truly changed the way we analyze text. NLP models are great at processing digital text, but many real-world applications use documents with more complex formats. For example, healthcare systems often include visual lab results, sequencing reports, clinical trial forms, and other scanned documents. When we use an NLP-only approach for document understanding, we lose layout and style information, which can be vital for document image understanding. New advances in multi-modal learning allow models to learn from both the text in documents (via NLP) and the visual layout (via computer vision).
We provide multi-modal visual document understanding in Spark OCR, based on the LayoutLM architecture. It achieves new state-of-the-art accuracy in several downstream tasks, including form understanding (F1 improving from 70.7 to 79.3), receipt understanding (F1 from 94.0 to 95.2), and document image classification (accuracy from 93.1 to 94.4).
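A sketch of invoking such a model from Python, assuming Spark OCR's visual document classifier and one of its published model names (the class, model, and column names here are assumptions for illustration):

```python
from sparkocr.transformers import VisualDocumentClassifier

# Multi-modal classifier combining OCR output (hOCR) with page layout
doc_classifier = VisualDocumentClassifier \
    .pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr") \
    .setInputCol("hocr") \
    .setLabelCol("label")
```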
Spark NLP is the most widely used NLP library in the enterprise, thanks to implementing production-grade, trainable, and scalable versions of state-of-the-art deep learning & transfer learning NLP research. It is also open source under the permissive Apache 2.0 license, officially supports Python, Java, and Scala, and is backed by a highly active community alongside the John Snow Labs team.
The Spark NLP library implements core NLP algorithms, including lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, spell checking, multi-class and multi-label text classification, sentiment analysis, emotion detection, unsupervised keyword extraction, and state-of-the-art Transformers such as BERT, ELECTRA, ELMo, ALBERT, XLNet, and Universal Sentence Encoder.
The latest release, Spark NLP 3.0, comes with over 1,100 pretrained models, pipelines, and Transformers in 190+ different languages. It also delivers massive speedups on both CPU & GPU while extending support for the latest computing platforms, such as new Databricks runtimes and EMR versions.
The talk will focus on how to scale Apache Spark / PySpark applications in YARN clusters, use GPUs in Databricks' new Apache Spark 3.x runtimes, and efficiently manage large-scale datasets in resource-demanding NLP applications. We will share benchmarks, tips & tricks, and lessons learned when scaling Spark NLP.
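For example, moving a Spark NLP job onto GPUs is a one-line change at session startup:

```python
import sparknlp

# Start a Spark session using Spark NLP's GPU-enabled build
spark = sparknlp.start(gpu=True)
print("Spark NLP version:", sparknlp.version())
```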
Extracting data formatted as a table (tabular data) is a common task — whether you’re analyzing financial statements, academic research papers, or clinical trial documentation. Table-based information varies heavily in appearance, fonts, borders, and layouts. This makes the data extraction task challenging even when the text is searchable – but more so when the table is only available as an image.
This webinar presents how Spark OCR automatically extracts tabular data from images. This end-to-end solution includes computer vision models for table detection and table structure recognition, as well as OCR models for extracting text & numbers from each cell. The implemented approach provides state-of-the-art accuracy for the ICDAR 2013 and TableBank benchmark datasets.
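As a sketch, the table detection stage of such a pipeline looks like this (assuming a licensed Spark OCR installation; the pretrained model name follows the published Spark OCR releases, and the column names are illustrative):

```python
from sparkocr.transformers import ImageTableDetector

# Computer vision model that finds table regions on a page image;
# downstream stages then recognize the cell structure and OCR each cell
table_detector = ImageTableDetector \
    .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("table_regions")
```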
In this webinar, Christian Kasim Loan will teach you how to leverage hundreds of state-of-the-art models for various medical and healthcare domains in 1 line of code – including named entity recognition (NER) for adverse drug events, anatomy, diseases, chemicals, clinical events, human phenotypes, posology, radiology, measurements, and many other fields – plus best-in-class resolution algorithms to map the extracted entities to medical code terminologies like ICD-10, ICD-O, RxNorm, SNOMED, LOINC, and many more.
Additionally, we will showcase how to extract the relationships between predicted entities in the posology, drug adverse effects, temporal features, body parts & problems, and procedures domains, and how to de-identify your text documents.
Finally, we will take a look at the latest NLU Streamlit features and how you can leverage them to visualize all model predictions and test them out with 0 lines of code in your web browser!
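The healthcare models follow the same one-liner pattern (a sketch; the spell names follow the NLU healthcare namespace and require a Spark NLP for Healthcare license):

```python
import nlu

# Extract posology entities (drugs, dosages, frequencies) in one line
nlu.load('en.med_ner.posology').predict(
    'The patient was prescribed 1 capsule of Advil 10 mg for 5 days.')

# Resolve a clinical phrase to ICD-10-CM codes
nlu.load('en.resolve.icd10cm').predict('gestational diabetes mellitus')
```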
A knowledge graph represents a collection of connected entities and their relations. A knowledge graph that is fueled by machine learning utilizes natural language processing to construct a comprehensive and semantic view of the entities. A complete knowledge graph allows question answering and search systems to retrieve answers to given queries. In this session, we build a knowledge graph using Spark NLP models and Neo4j. The marriage of Spark NLP and Neo4j is very promising for creating clinical knowledge graphs that enable deeper analysis, Q&A tasks, and new insights.
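A minimal sketch of the loading step, assuming relation triples have already been extracted with Spark NLP models (the connection details and the example triple are placeholders):

```python
from neo4j import GraphDatabase

# Placeholder connection; point this at your Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Example triple of the kind produced by a relation extraction model
triples = [("aspirin", "TREATS", "headache")]

with driver.session() as session:
    for subj, rel, obj in triples:
        # Relationship types cannot be query parameters, hence the format
        session.run(
            "MERGE (a:Entity {name: $subj}) "
            "MERGE (b:Entity {name: $obj}) "
            "MERGE (a)-[:`%s`]->(b)" % rel,
            subj=subj, obj=obj)
driver.close()
```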
Extracting data from unstructured documents is a common requirement - from finance and insurance to pharma and healthcare. Recent advances in deep learning offer impressive results on this task when models are trained on large enough datasets.
However, getting high-quality data involves a lot of manual effort: an annotation project is defined, annotation guidelines are specified, documents are imported, tasks are distributed among domain experts, a manager tracks the team's performance, inter-annotator agreement is reached, and the resulting annotations are exported into a standard format. At enterprise scale, complexity grows with the volume of projects, tasks, and users.
John Snow Labs' Annotation Lab is a free annotation tool that has been deployed and used by large-scale enterprises for three years. This webinar presents how you can use the tool's capabilities to easily manage any annotation project – from a small team to enterprise-wide. It also shows how models can be trained automatically, without writing a single line of code, and how any pre-trained model can be used to pre-annotate documents and speed up projects by 5x – since domain experts don't start annotating from scratch but correct and improve the models, as part of a no-code human-in-the-loop AI workflow.
Pharmaceutical companies that conduct clinical trials, looking to get new treatments to market as quickly as possible, handle a high volume of documents: millions can be created as part of a single trial, all stored in a document management system. When these documents must be migrated to a new system – for example, when a pharma company acquires the rights to a drug or trial – they must often all be read manually in order to classify them and extract metadata that is legally required and must be accurate. Traditionally, this migration is a long, complex, and labor-intensive process.
We present a solution, based on a natural language processing (NLP) system, that automates this document classification and metadata extraction.
We will share lessons learned from an end-to-end migration process of the trial master file at Novartis.
Pattern finding and matching strategies are well-known NLP techniques for extracting information from text.
The Spark NLP library has two annotators that use these techniques to extract relevant information or recognize entities of interest at large scale, when dealing with lots of documents from medical records, web pages, or data gathered from social media.
In this talk, we will see how to retrieve the information we are looking for by using the following annotators:
Entity Ruler, an annotator available in open-source Spark NLP.
Contextual Parser, an annotator available only in Spark NLP for Healthcare.
In addition, we will enumerate use cases where we can apply these annotators.
After this webinar, you will know when to use a rule-based approach to extract information from your data and how best to set the available parameters in these annotators.
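For example, the Entity Ruler is driven by a simple patterns file (a sketch; "patterns.json" is a placeholder path whose entries pair each label with the phrases or regexes to match):

```python
from sparknlp.annotator import EntityRulerApproach

# Builds an exact/regex matcher from a JSON patterns file, e.g.
# [{"label": "DRUG", "patterns": ["aspirin", "ibuprofen"]}]
entity_ruler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entities") \
    .setPatternsResource("patterns.json")
```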
Recognizing entities is a fundamental step towards understanding a piece of text – but entities alone only tell half the story. The other half comes from explaining the relationships between entities. Spark NLP for Healthcare includes state-of-the-art (SOTA) deep learning models that address this issue by semantically relating entities in unstructured data.
John Snow Labs has developed multiple models utilizing BERT architectures with custom feature generation to achieve peer-reviewed SOTA accuracy on multiple benchmark datasets. This session will shed light on the background and motivation behind relation extraction, techniques, real-world use cases, and practical code implementation.
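A sketch of applying one of these pre-trained models (assuming a licensed Spark NLP for Healthcare installation, with upstream stages supplying embeddings, part-of-speech tags, entity chunks, and dependency parses):

```python
from sparknlp_jsl.annotator import RelationExtractionModel

# Pre-trained clinical relation extraction model that classifies the
# semantic relation between pairs of detected entities
re_model = RelationExtractionModel \
    .pretrained("re_clinical", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"]) \
    .setOutputCol("relations")
```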
In this webinar, Juan Martinez from John Snow Labs and Ken Puffer from ePlus will share lessons learned from recent AI, ML, and NLP projects that have been successfully built & deployed in US hospital systems.
Then they will showcase a live demo of the recently launched AI Workflow Accelerator Bundle for Healthcare, which provides a complete data science platform supporting the full AI lifecycle.
The bundle is a turnkey solution composed of GPU-accelerated hardware from NVIDIA, proprietary software from John Snow Labs, and implementation services from ePlus. It is unique in the breadth of healthcare-specific capabilities it provides out of the box.
We will share speed & accuracy benchmarks measuring the optimization of John Snow Labs' software and models on the GPU-accelerated NVIDIA hardware – and how this translates into enabling your AI team to deliver bigger projects faster.