This is an embedding model trained on query-abstract pairs from a dataset of over 53.1k abstracts. Each abstract has 5-10 queries on specific topics, written with keywords and phrases that differ from those in the abstract. The purpose of the dataset is to build models that can retrieve relevant papers for a particular concept, even when the wording of the user's query is very different from that used in the abstract.

I'm already working on an improved version of this model that better suits very hard queries, based on some of the shortcomings I've noticed. If you are an encephalitis expert and have query ideas or any other type of feedback, please reach out at sangupta.ml@gmail.com

https://huggingface.co/datasets/Santosh-Gupta/EncephalitisQueryDocuments

This model uses 'ncbi/MedCPT-Article-Encoder' as the base model, fine-tuned on the dataset above using contrastive learning.
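
The card does not say which contrastive objective was used, so the following is only a sketch of one common choice, an in-batch softmax (InfoNCE-style) loss: each query is pulled toward its own abstract and pushed away from the other abstracts in the batch. The function name and the temperature value are assumptions for illustration, and plain Python lists stand in for real embedding tensors.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(query_vecs, abstract_vecs, temperature=0.05):
    """Mean cross-entropy where abstract i is the positive for query i and
    every other abstract in the batch acts as a negative.
    The temperature of 0.05 is an assumed value, not taken from the model."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, a) / temperature for a in abstract_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch: two queries whose matching abstracts share the same direction.
queries = [[1.0, 0.0], [0.0, 1.0]]
abstracts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_contrastive_loss(queries, abstracts)
swapped_loss = in_batch_contrastive_loss(queries, abstracts[::-1])
```

With correctly paired vectors the loss is close to zero, and it grows when the pairs are swapped, which is the signal that drives the fine-tuning.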

For queries, prepend the word 'QUERY' followed by a space to the string. Do not prepend anything for abstracts.
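
The prefix rule can be wrapped in a small helper so that queries are always tagged and abstracts never are. This is a minimal sketch: `format_query` and `format_abstract` are hypothetical names, not part of the model's code, and the choice to skip text that already starts with the prefix is an assumption.

```python
QUERY_PREFIX = "QUERY"  # prefix the model expects on queries only

def format_query(text: str) -> str:
    """Return the query with the 'QUERY ' prefix, adding it only if missing."""
    text = text.strip()
    if text == QUERY_PREFIX or text.startswith(QUERY_PREFIX + " "):
        return text
    return f"{QUERY_PREFIX} {text}"

def format_abstract(text: str) -> str:
    """Abstracts get no prefix; only surrounding whitespace is trimmed."""
    return text.strip()
```

The returned strings can be passed straight to the tokenizer.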

To use

from transformers import AutoTokenizer, AutoModel
import torch

model_name = 'Santosh-Gupta/EncephalitisRetrieval'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

query_text = "QUERY your query text here"
abstract_text = "Your abstract text here"

# Tokenize the text
inputs = tokenizer(query_text, return_tensors="pt")

# Pass the input tokens to the model
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state[:, 0, :]
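
The snippet above embeds only the query. An abstract can be embedded the same way (without the prefix), and the two vectors are then compared with cosine similarity, which is what the notebook code further down does with scikit-learn. The sketch below shows the comparison on plain Python lists; `cosine_similarity` here is a hand-written stand-in, and the toy vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# With real model output, a (1, hidden_size) tensor can be turned into a
# list via embedding[0].tolist() before calling this function.
query_vec = [0.2, 0.9, 0.1]
abstract_vec = [0.1, 0.8, 0.3]
score = cosine_similarity(query_vec, abstract_vec)
```

Higher scores mean the abstract is a closer match for the query.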

For precomputed abstract and paragraph embeddings covering Jan 2000 to September 2023, please see

https://huggingface.co/datasets/Santosh-Gupta/EncephalitisParagraphEmbeddings

and

https://huggingface.co/datasets/Santosh-Gupta/EncephalitisAbstractEmbeddings


To set up the full embeddings and papers for the PMC and PubMed searches, use this Google Colab notebook

https://colab.research.google.com/drive/1wN1a32DWCKmP3mgPw7GEJq9I54PSMh7b?usp=sharing

Select Runtime -> Run All from the menu to run all the code to download and load all the models.

Google Colab downloads and runs everything on Google's servers, not on your machine, for free.

Google Colab is available with any Google/Gmail account, but in case you can't access the notebook, here is the code.


# @markdown 

!pip install datasets
import json
import textwrap

import numpy as np
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

model_name = 'Santosh-Gupta/EncephalitisRetrieval'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed_query(query):
    query = 'QUERY ' + query
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        query_embedding = model(**inputs).last_hidden_state[:, 0, :]
    return query_embedding.cpu().numpy()


def pretty_print_dict(d, indent=1, width=75):
    for key, value in d.items():
        key_indent = len(key) + 1  # indent for continuation lines: key length plus one space

        if isinstance(value, str):
            paragraphs = value.split('\n')
            print('\n')
            for i, paragraph in enumerate(paragraphs):
                wrapped_lines = textwrap.wrap(paragraph, width=width)
                for j, wrapped_line in enumerate(wrapped_lines):
                    if i == 0 and j == 0:
                        print(' ' * indent + f"'{key}':   "  + wrapped_line)
                    elif j == 0:
                        print('\n' + ' ' * (indent + key_indent+3) + wrapped_line)
                    else:
                        print(' ' * (indent + key_indent) + wrapped_line)
        else:
            print(' ' * indent + f"'{key}': {value}")

paragraph_embeddings = load_dataset("Santosh-Gupta/EncephalitisParagraphEmbeddings").with_format("np")['train']['paragraph_embeddings']
abstract_embeddings = load_dataset("Santosh-Gupta/EncephalitisAbstractEmbeddings").with_format("np")['train']['abstract_embeddings']
paragraphs_texts = load_dataset("Santosh-Gupta/EncephalitisFullTextPapers")['train'].to_list()

all_paragraphs = []
paper_indices = []  # To track which paper each paragraph belongs to

for idx, paper in enumerate(paragraphs_texts):
    paragraphs = paper['full_paragraphs'].split('\n')
    filtered_paragraphs = [p for p in paragraphs if len(p) >= 1000]
    all_paragraphs.extend(filtered_paragraphs)
    paper_indices.extend([idx] * len(filtered_paragraphs))

abstract_texts = load_dataset("Santosh-Gupta/EncephalitisAbstracts")['train'].to_list()
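# Optional: this path points to a mounted Google Drive that this code does not set up,
# and `df` is not used below, so the line is commented out.
# df = pd.read_parquet('/content/drive/MyDrive/EnchepAbstracts/search_phrases/raw_training_df.parquet')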
pd.set_option('display.max_colwidth', 1500)

def create_pmc_url(pmc_id):
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC"
    return f"{base_url}{pmc_id}/"

def create_pubmed_url(pmid):
    base_url = "https://pubmed.ncbi.nlm.nih.gov/"
    return f"{base_url}{pmid}/"

# @title Put Search Term Here
Search_Term = "Neuroinflammatory process following transient middle cerebral artery occlusion" # @param {type:"string"}

Number_of_Results_To_Return = 5 # @param {type:"integer"}

query_embedding = embed_query(Search_Term)
similarities_paragraphs = cosine_similarity(query_embedding, paragraph_embeddings)
similarities_abstracts = cosine_similarity(query_embedding, abstract_embeddings)

# @title Get PMC Paragraph Results
# @markdown 

N = Number_of_Results_To_Return  # Number of top results you want
top_paragraph_indices = np.argsort(similarities_paragraphs[0])[::-1][:N]
for idx, para_idx in enumerate(top_paragraph_indices):
    print('Result number : ', idx+1)
    paper_idx = paper_indices[para_idx]

    pretty_json = json.dumps(paragraphs_texts[paper_idx], indent=4)

    print('Retrieved paragraph: \n')
    print(textwrap.fill(all_paragraphs[para_idx], width=100))
    print('\n')
    print('Link ', create_pmc_url(paragraphs_texts[paper_idx]['pmc_id']))
    print('Paper details: \n')
    if 'full_paragraphs' in paragraphs_texts[paper_idx]:
        paragraphs_texts[paper_idx].pop('full_paragraphs')
    pretty_print_dict(paragraphs_texts[paper_idx])
    print('---------------------------------')
    print('\n')

# @title Get PubMed Abstract Results
# @markdown 

N = Number_of_Results_To_Return  # Number of top results you want
top_indices = np.argsort(similarities_abstracts[0])[::-1][:N]
for iii, idx in enumerate(top_indices):
    print('Result number : ', iii+1)
    pretty_json = json.dumps(abstract_texts[idx], indent=4)
    print("Link ", create_pubmed_url(abstract_texts[idx]['pmid']))
    print(pretty_json)
    print('---------------------------------\n')
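
The ranking step above sorts every similarity score with `np.argsort` and then keeps the first N. When only a handful of results is needed, a partial selection gives the same top results without a full sort. The following is a minimal sketch using the standard library; `top_n_indices` is a hypothetical helper, and the order of exactly tied scores may differ from the `argsort` version.

```python
import heapq

def top_n_indices(scores, n):
    """Return the indices of the n highest scores, best first."""
    return heapq.nlargest(n, range(len(scores)), key=lambda i: scores[i])

# Toy similarity scores for five documents.
scores = [0.12, 0.87, 0.45, 0.91, 0.30]
best = top_n_indices(scores, 3)  # [3, 1, 2]
```

The returned indices can be used exactly like `top_indices` or `top_paragraph_indices` in the loops above.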

Case Studies ------------------------------------------------------------------------------------------

From my tinkering with this model, I think its results are pretty good (a highly biased review). Below are some of the case studies I tried, which I think show improvements over the baseline PubMed search function. Note: I'm not an encephalitis expert, so I'm probably not the best at evaluating these results. If you are an encephalitis expert and have feedback or a wishlist of what you would like to see in this type of search engine, please reach out at sangupta.ml@gmail.com

Case Study 1:

Query: "Chronic loss of endothelial nitric oxide (NO) as a contributor to amyloid precursor protein (APP) related pathology"

Default Pubmed Search Results (What you get if you go to the pubmed website and use their search engine):

0 Results

https://pubmed.ncbi.nlm.nih.gov/?term=Chronic+loss+of+endothelial+nitric+oxide+%28NO%29+as+a+contributor+to+amyloid+precursor+protein+%28APP%29+related+pathology&filter=simsearch1.fha&filter=simsearch3.fft&sort=date

However, a slightly altered version of the query yielded results: "Endothelial nitric oxide deficiency promotes Alzheimer's disease pathology"

1 Result, which was also picked up by the model

https://pubmed.ncbi.nlm.nih.gov/23745722/

1st result, PMC embeddings:

-Retrieved Paragraph

"Most animal models of AD involve transgenic expression of mutated amyloid precursor protein (APP), which leads to parenchymal Aβ deposition but does not usually invoke tau pathology and associated neuronal cell death. Consequently, these models have been criticized as incomplete models of AD pathology (Irizarry et al., 1997; Radde et al., 2008). In a recently developed CVN-AD mouse model, immune-mediated nitric oxide (iNOS) was lowered to mimic human levels, resulting in a model that demonstrates the complete pathological course of AD, including parenchymal amyloidosis, gradual spread of hyperphosphorlyated tau, episodic memory impairment, and significant hippocampal neuronal degeneration (Colton et al., 2008, 2014; Wilcock et al., 2014). These mice are mNos2 -deficient and transgenic for the Swedish K670N/M671L vasculotropic Dutch/Iowa E693Q/D694N mutant (APPSwDI) APP (Kan et al., 2015), which prevents the expression of inducible nitric oxide synthase protein, thus lowering the level of nitric oxide production (for a detailed description of this model, see Colton et al., 2008, 2014; Wilcock et al., 2014).
"

From Paper

"Use of Eflornithine (DFMO) in the Treatment of Early Alzheimer's Disease: A Compassionate Use, Single-Case Study
"

https://pubmed.ncbi.nlm.nih.gov/29559907/

Relevancy Analysis: The retrieved paragraph seems highly relevant to the query. The paper's abstract does not mention NO, which may be why the PubMed search failed to retrieve this one.

2nd result, PMC embeddings:

-Retrieved Paragraph

"Alzheimer’s disease is a chronic neurodegenerative disease that causes severe dementia that is characterized by memory loss, impaired reasoning, and personality alterations. The pathological hallmarks of this disease include the formation of amyloid beta (Αβ) plaques in the brain parenchyma and around blood vessels and neurofibrillary tangles in the neurons, which are composed of hyperphosphorylated tau proteins.109,110 Histological analysis of postmortem human AD brains has shown BBB breakdown as defined by plasma albumin and immunoglobulins around amyloid plaques.111 In 2019, Nation and colleagues presented evidence of BBB impairment by measuring soluble platelet-derived growth factor receptor-β (sPDGFRβ) as a biomarker of damaged capillaries in patients with AD. However, they showed the circulation of sPDGFRβ by cerebrospinal fluid independent of amyloid plaques and tau status by positron emission tomography, which suggested vascular dysfunction as a component of AD pathology.112 Αβ toxicity and neuroinflammation have been associated with down-regulation of tight junction proteins in patients with AD who suffer from cerebral amyloid angiopathy at the capillary level. Two studies by Carrano and colleagues showed that in postmortem brains of patients with cerebral amyloid angiopathy, at the capillary level there was a remarkable reduction, or even complete loss, of claudin-5, among the other tight junction proteins. The capillaries were positive for Αβ plaques or were associated with activated microglia that were positive for nicotinamide adenine dinucleotide phosphate oxidase 2, an enzyme that is responsible for reactive oxygen species production.113,114 Although Viggars et al. 
correlated increased levels of albumin and fibrinogen to more progressed AD pathology in human postmortem brains, the expression levels of claudin-5, occludin and ZO-1 were not affected.115 Nevertheless, BBB impairment is well accepted as a pathological characteristic of human AD pathology.112
"

From Paper

"The blood–brain and gut–vascular barriers: from the perspective of claudins"

https://pubmed.ncbi.nlm.nih.gov/34152937/

Relevancy analysis: Describes various aspects of Alzheimer's disease (AD) pathology, including the formation of amyloid beta (Αβ) plaques, which are directly related to amyloid precursor protein (APP) pathology. Although it does not directly mention endothelial nitric oxide (NO), it discusses blood-brain barrier (BBB) impairment and vascular dysfunction in AD, which could be implicitly related to NO levels since NO is known to play a significant role in vascular health and function. The passage suggests that vascular issues are a component of AD pathology, indirectly linking to the potential role of chronic loss of endothelial NO in APP-related pathology in AD.

1st result, Abstract embeddings

Retrieved paper

"title": "Endothelial nitric oxide deficiency promotes Alzheimer's disease pathology.",

"abstract": "Aging and the presence of cerebrovascular disease are associated with increased incidence of Alzheimer's disease. A common feature of aging and cerebrovascular disease is decreased endothelial nitric oxide (NO). We studied the effect of a loss of endothelium derived NO on amyloid precursor protein (APP) related phenotype in late middle aged (LMA) (14-15\u00a0month) endothelial nitric oxide synthase deficient (eNOS(-/-) ) mice. APP, \u03b2-site APP cleaving enzyme (BACE) 1, and amyloid beta (A\u03b2) levels were significantly higher in the brains of LMA eNOS(-/-) mice as compared with LMA wild-type controls. APP and A\u03b21-40 were increased in hippocampal tissue of eNOS(-/-) mice as compared with wild-type mice. LMA eNOS(-/-) mice displayed an increased inflammatory phenotype as compared with LMA wild-type mice. Importantly, LMA eNOS(-/-) mice performed worse in a radial arm maze test of spatial learning and memory as compared with LMA wild-type mice. These data suggest that chronic loss of endothelial NO may be an important contributor to both A\u03b2 related pathology and cognitive decline. Cardiovascular risk factors are associated with increased incidence of Alzheimer's disease (AD). A common feature of these risk factors is decreased endothelial nitric oxide (NO). We observed, in mice deficient in endothelial nitric oxide synthase, increased amyloid precursor protein (APP), \u03b2-site APP cleaving enzyme 1, amyloid beta levels, microglial activation, and impaired spatial memory. This suggests chronic loss of endothelial NO may be an important contributor to the pathogenesis of sporadic AD.",

https://pubmed.ncbi.nlm.nih.gov/23745722/

Relevancy Analysis: Very relevant to the query. This is the same paper as the only PubMed result for "Endothelial nitric oxide deficiency promotes Alzheimer's disease pathology"

2nd result, Abstract embeddings

"title": "Pharmacological strategies for the regulation of inducible nitric oxide synthase: neurodegenerative versus neuroprotective mechanisms.",
"abstract": "Inducible nitric oxide synthase (iNOS) is one of three NOS isoforms generating nitric oxide (NO) by the conversion of l-arginine to l-citrulline. iNOS has been found to be a major contributor to initiation/exacerbation of the central nervous system (CNS) inflammatory/degenerative conditions through the production of excessive NO which generates reactive nitrogen species (RNSs). Activation of iNOS and NO generation has come to be accepted as a marker and therapeutic target in neuroinflammatory conditions such as those observed in ischemia, multiple sclerosis (MS), spinal cord injury (SCI), Alzheimer's disease (AD), and inherited peroxisomal (e.g. X-linked adrenoleukodystrophy; X-ALD) and lysosomal disorders (e.g. Krabbe's disease). However, with the emergence of reports on the neuroprotective facets of NO, the prior dogma about NO being solely detrimental has had to be modified. While RNSs such as peroxynitrite (ONOO(-)) have been linked to lipid peroxidation, neuronal/oligodendrocyte loss, and demyelination in neurodegenerative diseases, limited NO generation by GSNO has been found to promote vasodilation and attenuate vascular injury under the same ischemic conditions. NO generated from GSNO acts as second messenger molecular which through S-nitrosylation has been shown to control important cellular processes by regulation of expression/activity of certain proteins such as NF-kappaB. It is now believed that the environment and the context in which NO is produced largely determines the actions (good or bad) of this molecule. These multi-faceted aspects of NO make therapeutic interference with iNOS activity even more complicated since complete ablation of iNOS activity has been found to be rather more detrimental than protective in most neurodegenerative conditions. 
Investigators in search of iNOS modulating pharmacological agents have realized the need of a delicate balance so as to allow the production of physiologically relevant amounts of NO (such as those required for host defence/neutotransmission/vasodilation, etc.) but at the same time block the generation of RNSs through repressing excessive NO levels (such as those causing neuronal/tissue damage and demyelination, etc.). The past years have seen a noteworthy increase in novel agents that might prove useful in achieving the aim of harnessing the good and blocking the undesirable actions of NO. It is the aim of this review to provide basic insights into the NOS family of enzymes with special emphasis of the role of iNOS in the CNS, in the first part. In the second part of the review, we will strive to provide an exhaustive compilation of the prevalent strategies being tested for the therapeutic modulation of iNOS and NO production.",

https://pubmed.ncbi.nlm.nih.gov/16765486/

Relevancy Analysis: Discusses the role of inducible nitric oxide synthase (iNOS) in generating NO and its implications in various central nervous system (CNS) conditions, including neurodegenerative diseases like Alzheimer's disease (AD).

Case Study 2:

Query: "intracranial antigens interaction with immune system though glymphatic system"

Default Pubmed Search Results (What you get if you go to the pubmed website and use their search engine):

0 Results

https://pubmed.ncbi.nlm.nih.gov/?term=+intracranial+antigens+interaction+with+immune+system+though+glymphatic+system&filter=simsearch1.fha&filter=simsearch3.fft

2nd result, PMC embeddings:

(skipped to the 2nd result since the 1st result wasn't very relevant)

-Retrieved Paragraph

"In other organs, lymphatic vessels serve as conduits for the transport of tissue-derived antigen and dendritic cells to lymph nodes, where naive and memory T cells are optimally positioned for detection of their cognate antigen (Thomas et al., 2016; Gasteiger et al., 2016). The recent discovery of functional lymphatic vessels in the dura mater layer of meninges has prompted a significant reconsideration of how the CNS engages the peripheral immune system (Louveau et al., 2015b; Aspelund et al., 2015). Meningeal lymphatic vessels are observed in rodents, primates, and humans (Absinta et al., 2017; Albayram et al., 2022), and in experimental models of brain cancer and autoimmunity these vessels have been shown to play an integral role in regulating T cell responses in the CNS (Song et al., 2020; Louveau et al., 2018b). Mouse studies have demonstrated that meningeal lymphatic vessels convey macromolecules and immune cells from the meninges and cerebrospinal fluid (CSF) to the deep cervical lymph nodes (Louveau et al., 2018b). Indeed, when model antigens like ovalbumin (OVA) are injected into the brain, these molecules travel from the brain interstitium into the CSF via glymphatic flow (Iliff et al., 2012) and have the potential to be presented to T cells in the deep cervical lymph nodes (Ling et al., 2003; Harris et al., 2014)."

From paper

Title: Meningeal lymphatic drainage promotes T cell responses against Toxoplasma gondii but is dispensable for parasite control in the brain

https://pubmed.ncbi.nlm.nih.gov/36541708/

Relevancy Analysis: Very relevant, quote “ Indeed, when model antigens like ovalbumin (OVA) are injected into the brain, these molecules travel from the brain interstitium into the CSF via glymphatic flow (Iliff et al., 2012) and have the potential to be presented to T cells in the deep cervical lymph nodes (Ling et al., 2003; Harris et al., 2014).”

Case Study 3:

Query: “visual field-cut as a potential symptom of immune-mediated encephalitis”

Default Pubmed Search Results (What you get if you go to the pubmed website and use their search engine):

1 result:

https://pubmed.ncbi.nlm.nih.gov/32360731/

1st result, PMC embeddings:

-Retrieved Paragraph

"There are also less numerous, but nonetheless significant, projections to other visual cortical areas (such as V5 involved in motion processing) and the tectum (involved in pupillary reflexes). There is, therefore, an ample substrate for differential effects of immune processes on parallel pathways of processing within a single anatomically defined structure such as the retina, optic nerve, or visual cortex. Selective deficits of color vision, motion perception, and other modalities are all potential manifestations of autoimmune-mediated dysfunction.Fig. 20.1The major cell types of a typical mammalian retina. From the top row to the bottom, photoreceptors, horizontal cells, bipolar cells, amacrine cells, and ganglion cells. For steric reasons, only a subset of the wide-field amacrine cells is shown.(Reprinted from Masland (2001) with permission from Macmillan Publishers (Nature Neuroscience)).Table 20.1Definition of terms used in visual sciencesTermDefinitionHard-wiredIn human vision the first-, second-, and third-order neurons and their axons are hard-wired into the human brain and transmit analogue and digital signals. This hard-wired single pathway enables the retinotopic map of the human visual cortex. There is no (or very little) potential for plasticity in the strict definition of this single pathway model (Balk et al., 2015)Analogue signalThe analogue signal produced by the photoreceptors is continuous. The signal intensity varies over time depending on the light-induced metabolism of opsins. Therefore the variation of the signal carries information on light entering the eye (Fig. 20.2). The analogue signal of the photoreceptors is converted by retinal bipolar cells into a digital signal as required for higher-level visual network processingDigital signalThe digital signal of the hard-wired visual pathway is sampled from the analogue signal fed into retinal bipolar cells by photoreceptors. The digital signal consists of a series of action potentials. 
The information of the digital signal is encoded in the time frequency of these action potentialsRetinotopicA topographic map where adjacent locations on the retina are represented by adjacent neurons in the dorsal lateral geniculate nucleus and V1.Rayleigh criterionThe minimum resolvable detail according to the generally accepted physical definition, diffraction limitation. Simplified, the limitation of image resolution relates to the order of wavelength of the wave used to image it. For example, the Rayleigh criterion for a wavelength of 500 nm and a circular pupil opening of 5 mm is: θR=1.22×λd=1.22×5×10−5cm0.5cm=1.22×10−4rad. Put into relation, a Snellen acuity of 6/6 (UK notation, 20/20 US notation) corresponds to a resolution limitation of θ = 5 × 10–4rad in most subjects. Under optimal circumstances a visual acuity of θ = 2 × 10–4rad might be achieved. Essentially, visual acuity depends on the anatomic spacing of sensory neurons in the retina and the wavelengths of the light entering the eyeVernier acuityThe human visual cortex can make spatial distinctions with a precision which is about 10 times better than visual acuity. This so-called hyperacuity depends on sophisticated information processing in the visual human brain. Vernier acuity represents the quintessential example of hyperacuity where the alignment of two edges or lines can be judged with a better precision than predicted by visual acuity. Clinically, the assessment of, for example, normal stereopsis relies on hyperacuity"

From Paper

"Autoimmunity in visual loss"

https://pubmed.ncbi.nlm.nih.gov/27112687/

Relevancy Analysis: Discusses how immune processes can differentially affect visual processing pathways in structures like the retina, optic nerve, or visual cortex

4th result, PMC embeddings:

-Retrieved Paragraph

Optic neuritis (ON) is an acute inflammatory optic neuropathy that may be associated with dramatic visual loss and an important decrease in quality of life in absence of an adequate treatment. Multiple Sclerosis (MS) ON, the most common form of presentation, is characterized by unilateral acute retroocular pain and visual loss, more commonly observed in Caucasian women between 18 and 50 years [1]. Visual acuity (VA) in patients with MS-ON usually improves within a few months even without treatment [2,3,4]. Non-MS ON is less frequent and can be an isolated disorder or related to infections and immune-mediated diseases such as Neuromyelitis Optica (NMO) or other systemic diseases [5]. Non-MS ON may have atypical features such as male gender, age less than 18 or greater than 50 years, absence of pain and bilateral presentation [5]. In non-MS ON, a chronic progressive disease is more common. Flare-ups are frequent, leading often to visual loss [3,6]. If not promptly treated, the visual outcome can be devastating, causing a severe visual loss, and even with adequate treatment, many patients may worsen over months [7,8,9,10].

From Paper

Title: Biologic Therapy in Refractory Non-Multiple Sclerosis Optic Neuritis Isolated or Associated to Immune-Mediated Inflammatory Diseases. A Multicenter Study

https://pubmed.ncbi.nlm.nih.gov/32796717/

Relevancy Analysis: Discusses immune-mediated encephalitis in the context of optic neuritis (ON), a related inflammatory condition affecting the optic nerve

Future Work

In this work I felt I saw the limits of what a single vector can encode for a 512-token passage of text. The next version will likely use a variant of the ColBERT model/method for embeddings, which creates an embedding for each token in the query and paragraph.
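
To make the per-token idea concrete, here is a rough sketch of the late-interaction ("MaxSim") scoring that ColBERT-style retrieval is generally described as using: each query token is matched to its most similar paragraph token, and those best matches are summed. This is an illustration under that assumption, not the planned implementation; the function name and toy vectors are made up, and real token embeddings would come from the model and are typically normalized.

```python
def maxsim_score(query_token_vecs, doc_token_vecs):
    """Late-interaction score: for each query token take the highest dot
    product against any document token, then sum over query tokens."""
    total = 0.0
    for q in query_token_vecs:
        best_match = max(
            sum(a * b for a, b in zip(q, d)) for d in doc_token_vecs
        )
        total += best_match
    return total

# Toy example: two query tokens, two candidate paragraphs.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
paragraph_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
paragraph_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first
score_a = maxsim_score(query_tokens, paragraph_a)
score_b = maxsim_score(query_tokens, paragraph_b)
```

Because every query token gets its own best match, a paragraph that covers all parts of the query scores higher than one that matches only a single part, which is hard for a single pooled vector to capture.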

If you have any feedback, please reach out to sangupta.ml@gmail.com

Model size: 0.1B params (Safetensors), tensor type F32