Extract Mendeley Notes Using R
Code for this analysis is available on GitHub.
Thought I would offer up a quick post on something that sidetracked me earlier this week: how to extract notes and annotations from Mendeley using R. Basically, I was having a similar problem to the one Daniel Hynk describes here, which he solved using Python. I too use Mendeley as a reference manager; it also has the handy feature of allowing users to add their own annotations and notes to their saved documents. While I have been generally satisfied with Mendeley, I wanted to put together a table containing some information on a set of papers I had been reading, including, in particular, the notes I had been adding. Unfortunately, Mendeley does not offer a simple option for doing this (only some information is available for export to .bib or .xml).
Anyway, Mendeley stores all the information about a user’s account locally in an SQLite database. We can connect to this database and extract the information, including the notes. I’m not sure how much this tutorial will generalize to other reference management software suites, but I imagine the underlying mechanics are similar. This tutorial can also be treated as an introduction to database interaction using the RSQLite package, available from CRAN.
As always, let’s start by loading some handy libraries.
library(RSQLite) # Database
library(dplyr) # This really should be loaded by default always...
library(tidyr) # 'spread' function to change table from 'long' to 'wide' format
We first need to connect to the database. Mendeley Support lists how to find the local database path for each system. In Windows, the mendeley.path, as we will refer to it in R, will look something like C:/Users/{username}/AppData/Local/Mendeley Ltd./Mendeley Desktop/{youremail}@{emailclient.com}@www.mendeley.com.sqlite.
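If you would rather not type the path by hand, you can search the Mendeley data directory for the .sqlite file. The snippet below is only a sketch for Windows; it assumes the default install location under the local AppData folder, which may differ on your machine.

```r
# Sketch: locate the local Mendeley SQLite database on Windows
# (the directory layout is an assumption based on the default install)
mendeley.dir <- file.path(Sys.getenv("LOCALAPPDATA"),
                          "Mendeley Ltd.", "Mendeley Desktop")
candidates <- list.files(mendeley.dir,
                         pattern = "@www\\.mendeley\\.com\\.sqlite$",
                         full.names = TRUE)
mendeley.path <- candidates[1]
```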
# Connect to the database
mendeley.connection <- dbConnect(RSQLite::SQLite(), mendeley.path)
The Mendeley database contains a variety of tables. These can be listed using the dbListTables function.
# Some of the tables available in the Mendeley database
head(dbListTables(mendeley.connection),n=10)
## [1] "CanonicalDocuments" "DataCleaner" "DocumentCanonicalIds"
## [4] "DocumentContributors" "DocumentDetailsBase" "DocumentFields"
## [7] "DocumentFiles" "DocumentFolders" "DocumentFoldersBase"
## [10] "DocumentKeywords"
To get an idea of the contents of each table, its variable names can be printed using the dbListFields function, illustrated here for the “Documents” table.
# The variables available in the 'Documents' table
dbListFields(mendeley.connection,"Documents")
## [1] "id" "uuid"
## [3] "confirmed" "deduplicated"
## [5] "deletionPending" "favourite"
## [7] "read" "type"
## [9] "abstract" "added"
## [11] "modified" "importer"
## [13] "note" "privacy"
## [15] "title" "advisor"
## [17] "articleColumn" "applicationNumber"
## [19] "arxivId" "chapter"
## [21] "citationKey" "city"
## [23] "code" "codeNumber"
## [25] "codeSection" "codeVolume"
## [27] "committee" "counsel"
## [29] "country" "dateAccessed"
## [31] "day" "department"
## [33] "doi" "edition"
## [35] "genre" "hideFromMendeleyWebIndex"
## [37] "institution" "internationalAuthor"
## [39] "internationalNumber" "internationalTitle"
## [41] "internationalUserType" "isbn"
## [43] "issn" "issue"
## [45] "language" "lastUpdate"
## [47] "legalStatus" "length"
## [49] "medium" "month"
## [51] "originalPublication" "owner"
## [53] "pages" "pmid"
## [55] "publication" "publicLawNumber"
## [57] "publisher" "reprintEdition"
## [59] "reviewedArticle" "revisionNumber"
## [61] "sections" "seriesEditor"
## [63] "series" "seriesNumber"
## [65] "session" "shortTitle"
## [67] "sourceType" "userType"
## [69] "volume" "year"
I use the extract.table function defined below to combine the actions of sending a database query (dbSendQuery), fetching the results (dbFetch), and freeing up the resources (dbClearResult).
extract.table <- function(con, query){
  res <- dbSendQuery(con, query) # Send query
  table <- dbFetch(res)          # Fetch table
  dbClearResult(res)             # Free resources
  return(table)
}
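Incidentally, DBI also provides dbGetQuery, which performs the same send/fetch/clear sequence in a single call; the helper above mainly keeps the three steps explicit. For example:

```r
# dbGetQuery wraps dbSendQuery, dbFetch, and dbClearResult in one call
n.docs <- dbGetQuery(mendeley.connection, "SELECT COUNT(*) AS n FROM Documents")
```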
In Mendeley, saved documents can be sorted into one or more user-defined folders. In my particular case, I was focusing on a set of recent papers I have been reading on the topic of multilayer and dynamic network analysis, which I had sorted into a folder appropriately named “Networks”. Based on the collection of tables shown earlier, it seems the “Folders” and “DocumentFolders” tables might come in handy. Note that I will use a mix of SQL commands and dplyr functions to get my desired results going forward.
dbListFields(mendeley.connection,"Folders")
## [1] "id" "uuid" "name"
## [4] "parentId" "access" "syncPolicy"
## [7] "downloadFilesPolicy" "uploadFilesPolicy" "publicUrl"
## [10] "description" "creatorName" "creatorProfileUrl"
folders <- extract.table(mendeley.connection, "SELECT id, name FROM Folders")
dbListFields(mendeley.connection,"DocumentFolders")
## [1] "documentId" "folderId" "status"
document.folders <- extract.table(mendeley.connection, "SELECT folderId, documentId FROM DocumentFolders")
We first need to retrieve the internal ID assigned to the “Networks” folder, then extract the IDs associated with all documents in that folder.
relevant.folder.name <- "Networks"
# Extract internal ID for folder of interest
relevant.folder.id <- (folders %>%
  filter(name == relevant.folder.name))$id[1]
# Extract internal IDs for all papers belonging to the folder of interest, using the folder ID
relevant.papers <- (document.folders %>%
  filter(folderId == relevant.folder.id))$documentId
head(relevant.papers)
## [1] 196 215 224 232 249 264
The “Documents” table contains the document title, among other fields (for illustration, I also collect the citation key assigned to each document). Here we simply need to extract the table, filtering it to include only the documents of interest.
# Collect title and citation key for all relevant documents
relevant.documents <- extract.table(mendeley.connection, "SELECT id, citationKey, title FROM Documents") %>%
  filter(id %in% relevant.papers) %>%
  rename(documentId = id)
## documentId citationKey title
## 1 196 Oentaryo2016 Collective semi-supervised learning for ...
## 2 215 Cortes2003 Computational methods for dynamic graphs...
## 3 224 Paul2016 Null models and modularity based communi...
## 4 232 Durante2016 Bayesian Learning of Dynamic Multilayer ...
## 5 249 Bazzi Generative Benchmark Models for Mesoscal...
## 6 264 Oselio2014 Multi-Layer Graph Analysis for Dynamic S...
Each document can have multiple authors, which are stored in the “DocumentContributors” table. Let’s take a quick peek at the raw table.
dbListFields(mendeley.connection,"DocumentContributors")
## [1] "id" "documentId" "contribution" "firstNames"
## [5] "lastName"
# Collect and concatenate authors for all relevant documents
authors <- extract.table(mendeley.connection,"SELECT * FROM DocumentContributors")
head(authors)
## id documentId contribution firstNames lastName
## 1 8 3 DocumentAuthor Genevera I. Allen
## 2 9 4 DocumentAuthor Francis L. Delmonico
## 3 10 4 DocumentAuthor Paul E. Morrissey
## 4 11 4 DocumentAuthor George S. Lipkowitz
## 5 12 4 DocumentAuthor Jeffrey S. Stoff
## 6 13 4 DocumentAuthor Jonathan Himmelfarb
unique(authors$contribution)
## [1] "DocumentAuthor" "DocumentEditor"
The contribution field specifies whether the entry refers to an author or an editor. We reduce the table to our documents of interest and filter out the editors. To concatenate all of the authors into one string, we first combine the lastName and firstNames of each author using the paste function, then group by document and collapse the authors of each document into a single string, separated by a semicolon (;), again using paste, this time with the collapse argument set.
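As a quick standalone illustration of the two paste calls (with made-up names):

```r
paste("Smith", "Jane", sep = ", ")                     # "Smith, Jane"
paste(c("Smith, Jane", "Doe, John"), collapse = "; ")  # "Smith, Jane; Doe, John"
```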
# Collect and concatenate authors for all relevant documents
relevant.authors <- authors %>%
  filter(contribution == "DocumentAuthor",
         documentId %in% relevant.papers) %>%
  mutate(fullName = paste(lastName, firstNames, sep = ", ")) %>% # Concatenate first and last name
  select(documentId, fullName)
## documentId fullName
## 1 196 Oentaryo, Richard J.
## 2 196 Lim, Ee-Peng
## 3 196 Chua, Freddy Chong Tat
## 4 196 Low, Jia-Wei
## 5 196 Lo, David
## 6 215 Cortes, Corinna
relevant.authors <- relevant.authors %>%
  group_by(documentId) %>%
  summarize(authorsNames = paste(fullName, collapse = "; ")) # Concatenate all authors of a document together
## # A tibble: 6 x 2
## documentId authorsNames
## <int> <chr>
## 1 196 Oentaryo, Richard J.; Lim, Ee-Peng; Chua...
## 2 215 Cortes, Corinna; Pregibon, Daryl; Volins...
## 3 224 Paul, Subhadeep; Chen, Yuguo...
## 4 232 Durante, Daniele; Mukherjee, Nabanita; S...
## 5 249 Bazzi, Marya; Jeub, Lucas G. S.; Arenas,...
## 6 264 Oselio, Brandon; Kulesza, Alex; Hero, Al...
The tags associated with each document can be extracted and concatenated in a similar manner to the authors. Tags are located in the “DocumentTags” table.
dbListFields(mendeley.connection,"DocumentTags")
## [1] "documentId" "tag"
# Collect and concatenate tags for all relevant documents
relevant.tags <- extract.table(mendeley.connection, "SELECT * FROM DocumentTags") %>%
  filter(documentId %in% relevant.papers) %>%
  group_by(documentId) %>%
  summarize(tagList = paste(tag, collapse = "; "))
## # A tibble: 6 x 2
## documentId tagList
## <int> <chr>
## 1 196 social network...
## 2 215 dynamic network; edge addition; node add...
## 3 224 Newman-Girvan null model; SBM; community...
## 4 232 dynamic network; multilayer network...
## 5 249 Multilayer networks; community detection...
## 6 264 dynamic SBM; dynamic network; multilayer...
Finally, we collect the document notes, located in the FileNotes table. In general, these are stored similarly to tags, with one note per row and the document ID specified, and can therefore be treated in the same manner as the tags to obtain a single string per document. In my case, I had been annotating each paper in this folder with two notes: one stating the main goal of the paper, introduced by “Goal:”, and one stating the key insights or techniques used in the article, introduced by “Key:”. I classified each note into a type based on whether it represented a “Goal” or a “Key” by splitting the note string at the colon (:) and inspecting the first token.
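To make the splitting step concrete, here is what it does to a made-up note:

```r
tokens <- unlist(strsplit("Goal: Survey dynamic network methods", split = ":"))
tolower(trimws(tokens))[1]        # "goal" -- the note type
paste(tokens[-1], collapse = ":") # " Survey dynamic network methods" -- the note body
```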
dbListFields(mendeley.connection,"FileNotes")
## [1] "id" "author" "uuid" "documentId"
## [5] "fileHash" "page" "x" "y"
## [9] "note" "modifiedTime" "createdTime" "unlinked"
## [13] "baseNote" "color" "profileUuid"
# Collect notes
relevant.notes <- extract.table(mendeley.connection, "SELECT documentId, note FROM FileNotes") %>%
  filter(documentId %in% relevant.papers) %>%
  rowwise() %>%
  mutate(type = tolower(trimws(unlist(strsplit(note, split = ":")))[1]), # Extract whether the note is a 'Goal' or a 'Key'
         note = paste(unlist(strsplit(note, split = ":"))[-1], collapse = ":")) %>% # Extract the actual note
  ungroup()
## # A tibble: 6 x 3
## documentId note type
## <int> <chr> <chr>
## 1 249 Community detection in multilayer netwo... goal
## 2 249 Generate random multilayer partitions w... key
## 3 215 Survey of computational methods for dyn... goal
## 4 329 Network fusion for composite community ... key
## 5 329 Community detection in multilayer netwo... goal
## 6 328 Define a statistical model for dynamic ... goal
Finally, to tidy the data such that each document is on its own row, I use the spread function from tidyr, which creates a separate variable for each type, whose entry for each document contains the relevant note.
relevant.notes <- relevant.notes %>%
  spread(type, note)
## documentId goal
## 1 196 Exploit multiple types of soc...
## 2 215 Survey of computational metho...
## 3 224 Community detection in multil...
## 4 232 Flexible model fo dynamic mul...
## 5 249 Community detection in multil...
## 6 264 Estimate the true adjacency m...
## key
## 1 Collective semi-supervised le...
## 2 NA...
## 3 Newman-Girvan modularity or e...
## 4 Characterize edges as conditi...
## 5 Generate random multilayer pa...
## 6 Latent variable model which d...
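As an aside, spread has since been superseded in tidyr (version 1.0.0 and later) by pivot_wider; on a newer tidyr, an equivalent call would look something like:

```r
# pivot_wider is the modern replacement for spread
relevant.notes <- relevant.notes %>%
  pivot_wider(names_from = type, values_from = note)
```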
At this point, we have collected all of our desired information, and it’s a simple matter of joining all the tables together, using documentId as the primary key.
# Join the datasets together
relevant.files <- relevant.documents %>%
  left_join(relevant.authors, by = "documentId") %>%
  left_join(relevant.tags, by = "documentId") %>%
  left_join(relevant.notes, by = "documentId")
## documentId citationKey title
## 1 196 Oentaryo2016 Collective semi-supervised lea...
## 2 215 Cortes2003 Computational methods for dyna...
## 3 224 Paul2016 Null models and modularity bas...
## 4 232 Durante2016 Bayesian Learning of Dynamic M...
## 5 249 Bazzi Generative Benchmark Models fo...
## 6 264 Oselio2014 Multi-Layer Graph Analysis for...
## authorsNames tagList
## 1 Oentaryo, Richard J.; Lim, Ee-... social network...
## 2 Cortes, Corinna; Pregibon, Dar... dynamic network; edge addition...
## 3 Paul, Subhadeep; Chen, Yuguo... Newman-Girvan null model; SBM;...
## 4 Durante, Daniele; Mukherjee, N... dynamic network; multilayer ne...
## 5 Bazzi, Marya; Jeub, Lucas G. S... Multilayer networks; community...
## 6 Oselio, Brandon; Kulesza, Alex... dynamic SBM; dynamic network; ...
## goal key
## 1 Exploit multiple types of soc... Collective semi-supervised le...
## 2 Survey of computational metho... NA...
## 3 Community detection in multil... Newman-Girvan modularity or e...
## 4 Flexible model fo dynamic mul... Characterize edges as conditi...
## 5 Community detection in multil... Generate random multilayer pa...
## 6 Estimate the true adjacency m... Latent variable model which d...
The final data frame is ready for any kind of fancy analysis, or can be exported to CSV format (or whatever format you prefer).
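For example, a minimal export and cleanup might look like this (the file name is just a placeholder):

```r
# Write the final table to CSV and close the database connection
write.csv(relevant.files, "mendeley_notes.csv", row.names = FALSE)
dbDisconnect(mendeley.connection)
```

Closing the connection with dbDisconnect once you are done is good practice, since it releases the lock on the SQLite file.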
That’s basically it for now, but check back soon! I do plan on having some more interesting posts coming up shortly…