Getting Started with glyparse

Your Universal Glycan Text Translator 🔄

Welcome to the world of glycan text parsing! If you’ve ever worked with glycan data from different sources, you know the frustration: every database, software tool, and research group seems to have its own way of representing glycan structures as text.

That’s where glyparse comes to the rescue! 🚀

Think of glyparse as your universal glycan translator — it can read glycan structures written in many different “languages” and convert them all into a unified format that your computer can understand and work with.

Note: All functions in glyparse return glyrepr::glycan_structure objects. If you are unfamiliar with glyrepr, see the glyrepr documentation first.

library(glyparse)

The Babel Tower of Glycan Text Formats 🗼

Before we dive in, let’s see what we’re dealing with. Here’s the same N-glycan core structure written in different formats:

| Format | Example | Where You’ll See It |
|---|---|---|
| IUPAC-condensed | Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc | Literature, UniCarbKB |
| IUPAC-short | Mana3(Mana6)Manb4GlcNAcb4GlcNAc | Literature, UniCarbKB |
| IUPAC-extended | alpha-D-Man-(1->3)-[alpha-D-Man-(1->6)]-beta-D-Man-(1->4)-beta-D-GlcNAc-(1->4)-D-GlcNAc | Literature, UniCarbKB |
| GlycoCT | Complex multi-line format | Literature, GlycomeDB |
| WURCS | WURCS=2.0/3,5,4/[...]/1-1-2-3-3/a4-b1_b4-c1... | Literature, GlyTouCan |
| Linear Code | Ma3(Ma6)Mb4GNb4GNb | Literature |
| pGlyco | (N(N(H(H(H))))) | pGlyco software results |
| StrucGP | A2B2C1D1E2fedcba | StrucGP software results |

Confusing, right? 😵‍💫 glyparse understands them all!

Your Parsing Toolkit 🛠️

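glyparse provides seven specialized parsers, each optimized for a specific format. The sections below demonstrate parse_glycoct(), parse_wurcs(), parse_linear_code(), parse_pglyco_struc(), and parse_strucgp_struc().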

All parsers follow the same pattern: pass in glycan text as a character vector, and get back a glyrepr::glycan_structure object.

Part 0: auto_parse()

Don’t know what you’re dealing with? Give it to auto_parse()! This function tries to identify the format automatically and use the appropriate parser. Even input with mixed formats is supported.

x <- c(
  "Gal(b1-3)GalNAc(b1-",
  "(N(F)(N(H(H(N))(H(N(H))))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)
auto_parse(x)
#> <glycan_structure[3]>
#> [1] Gal(b1-3)GalNAc(b1-
#> [2] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> [3] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 3
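
To give a feel for what "identifying the format automatically" can mean, here is a minimal base-R sketch of prefix and pattern sniffing. It is not how auto_parse() is implemented: guess_format() is a hypothetical helper, and its rules are assumptions that only cover the three inputs used above.

```r
# Hypothetical helper, NOT part of glyparse: guess a format from surface cues.
guess_format <- function(x) {
  # WURCS strings announce themselves with a fixed prefix.
  if (startsWith(x, "WURCS=")) return("WURCS")
  # GlycoCT text opens with its residue section header.
  if (startsWith(x, "RES")) return("GlycoCT")
  # pGlyco: nothing but capital letters and nested parentheses (assumed rule).
  if (grepl("^\\([A-Z()]+\\)$", x, perl = TRUE)) return("pGlyco")
  # IUPAC-condensed: linkages such as "(b1-3)" or an open "(b1-".
  if (grepl("\\([ab?][0-9?]-", x, perl = TRUE)) return("IUPAC-condensed")
  "unknown"
}

x <- c(
  "Gal(b1-3)GalNAc(b1-",
  "(N(F)(N(H(H(N))(H(N(H))))))",
  "WURCS=2.0/3,3,2/[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/1-2-3/a4-b1_b3-c1"
)

# Classify each element independently, which is why mixed input is no problem.
formats <- vapply(x, guess_format, character(1), USE.NAMES = FALSE)
print(formats)
#> [1] "IUPAC-condensed" "pGlyco"          "WURCS"
```

Because each string is classified on its own, a vector that mixes formats needs no special handling in this sketch.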

Part 2: Database Formats — The Heavy Hitters 💪

GlycoCT: The Precision Format

GlycoCT appears in the literature and in databases like GlycomeDB. It is more verbose than the other formats, listing residues (RES) and linkages (LIN) in separate sections, but it is extremely precise:

glycoct <- paste0(
  "RES\n",
  "1b:b-dglc-HEX-1:5\n",
  "2b:b-dgal-HEX-1:5\n", 
  "3b:a-dgal-HEX-1:5\n",
  "LIN\n",
  "1:1o(4+1)2d\n",
  "2:2o(3+1)3d"
)
parse_glycoct(glycoct)
#> <glycan_structure[1]>
#> [1] Gal(a1-3)Gal(b1-4)Glc(b1-
#> # Unique structures: 1
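
To make the two-section layout more concrete, the sketch below splits the same GlycoCT text into its RES and LIN parts using base R only. It is an illustration of the format rather than the logic inside parse_glycoct(), and the regular expression is an assumption that covers only simple linkage lines like the two shown here.

```r
glycoct <- paste0(
  "RES\n",
  "1b:b-dglc-HEX-1:5\n",
  "2b:b-dgal-HEX-1:5\n",
  "3b:a-dgal-HEX-1:5\n",
  "LIN\n",
  "1:1o(4+1)2d\n",
  "2:2o(3+1)3d"
)

# Split into lines and locate the LIN header that separates the two sections.
lines <- strsplit(glycoct, "\n", fixed = TRUE)[[1]]
lin_at <- which(lines == "LIN")
residues <- lines[seq(2, lin_at - 1)]
linkages <- lines[seq(lin_at + 1, length(lines))]

# A linkage line reads: id:parent<type>(parent_pos+child_pos)child<type>
pattern <- "^([0-9]+):([0-9]+)[a-z]\\(([0-9]+)\\+([0-9]+)\\)([0-9]+)[a-z]$"
matches <- regmatches(linkages, regexec(pattern, linkages))

# Element 1 of each match is the full string; capture groups start at 2.
links <- do.call(rbind, lapply(matches, function(p) {
  data.frame(
    parent     = as.integer(p[3]),
    parent_pos = as.integer(p[4]),
    child_pos  = as.integer(p[5]),
    child      = as.integer(p[6])
  )
}))
print(links)
```

Reading the first row, position 1 of residue 2 (the beta-Gal) attaches to position 4 of residue 1 (the Glc), which corresponds to the Gal(b1-4)Glc step in the parsed output above.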

WURCS: The Complex Structure Format

WURCS (Web3 Unique Representation of Carbohydrate Structures) is used in literature for complex structures and in databases like GlyTouCan:

wurcs <- paste0(
  "WURCS=2.0/3,3,2/",
  "[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-2-3/a4-b1_b3-c1"
)
parse_wurcs(wurcs)
#> <glycan_structure[1]>
#> [1] Man(a1-3)Man(b1-4)Glc(b1-
#> # Unique structures: 1
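
A WURCS string packs several fields into one line: a version, three counts, a list of unique residues in square brackets, a residue sequence, and the linkages. The base-R sketch below pulls those fields apart for the example above. It is only an illustration of the layout, not the parser used by parse_wurcs(), and it assumes a simple string without ambiguity markers.

```r
wurcs <- paste0(
  "WURCS=2.0/3,3,2/",
  "[a2122h-1b_1-5][a1122h-1b_1-5][a1122h-1a_1-5]/",
  "1-2-3/a4-b1_b3-c1"
)

body <- sub("^WURCS=", "", wurcs)

# Pull out the bracketed unique residues first, since they may contain "/".
bracketed <- regmatches(body, gregexpr("\\[[^]]*\\]", body))[[1]]
unique_res <- gsub("^\\[|\\]$", "", bracketed)

# With the brackets removed, the remaining fields split cleanly on "/".
rest <- gsub("\\[[^]]*\\]", "", body)
parts <- strsplit(rest, "/", fixed = TRUE)[[1]]

version  <- parts[1]
counts   <- as.integer(strsplit(parts[2], ",", fixed = TRUE)[[1]])
res_seq  <- as.integer(strsplit(parts[4], "-", fixed = TRUE)[[1]])
linkages <- strsplit(parts[5], "_", fixed = TRUE)[[1]]

# The three counts should agree with the sections that follow them.
stopifnot(
  length(unique_res) == counts[1],
  length(res_seq) == counts[2],
  length(linkages) == counts[3]
)
print(linkages)
#> [1] "a4-b1" "b3-c1"
```

The final check shows why the counts are useful: a truncated or hand-edited string tends to fail it before any structural parsing begins.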

Linear Code: The Simplified Format

Linear Code is a simplified format used in literature for complex structures:

linear_code <- "Ma3(Ma6)Mb4GNb4GNb"
parse_linear_code(linear_code)
#> <glycan_structure[1]>
#> [1] Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-
#> # Unique structures: 1
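
Comparing the input and output above shows how compact Linear Code is: M stands for Man, GN for GlcNAc, "a3" for an a1-3 linkage, and parentheses mark a branch. The toy function below reproduces that translation with plain string substitution. linear_code_to_iupac() is a hypothetical helper that handles only the two residue codes in this example, so treat it as a reading aid and use parse_linear_code() for real data.

```r
# Hypothetical helper, NOT part of glyparse: handles only "M" and "GN".
linear_code_to_iupac <- function(x) {
  # Linear Code marks branches with (...); IUPAC-condensed uses [...].
  out <- chartr("()", "[]", x)
  # Residue + anomer + position, e.g. "Ma3" becomes "M(a1-3)".
  out <- gsub("(GN|M)([ab])([0-9])", "\\1(\\21-\\3)", out)
  # The reducing end carries an anomer but no position: "GNb" becomes "GN(b1-".
  out <- sub("(GN|M)([ab])$", "\\1(\\21-", out)
  # Expand the residue codes last, so "M" is not matched inside other names.
  out <- gsub("GN", "GlcNAc", out, fixed = TRUE)
  gsub("M", "Man", out, fixed = TRUE)
}

iupac <- linear_code_to_iupac("Ma3(Ma6)Mb4GNb4GNb")
print(iupac)
#> [1] "Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc(b1-"
```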

Part 3: Software-Specific Formats — The Specialists 🔬

pGlyco Format: Proteomics Tool

If you work with glycoproteomics, you might encounter pGlyco’s parenthetical notation:

pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"
parse_pglyco_struc(pglyco)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1

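This cryptic notation actually represents a complex N-glycan. Each letter is one monosaccharide (N for HexNAc, H for Hex, F for dHex), and each pair of parentheses encloses a residue together with everything attached to it.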
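
One quick sanity check on a pGlyco string is to count its letters and compare the result with the parsed structure. The base-R sketch below does this for the example above. The letter-to-residue table is inferred from the parser output shown here and covers only the three letters that occur in this string.

```r
pglyco <- "(N(F)(N(H(H(N))(H(N(H))))))"

# Lookup inferred from the parsed output above; other letters are not covered.
codes <- c(H = "Hex", N = "HexNAc", F = "dHex")

# Drop the parentheses and split what remains into single letters.
residue_letters <- strsplit(gsub("[()]", "", pglyco), "")[[1]]

# Tabulate residues in a fixed order so the result is easy to compare.
composition <- table(factor(codes[residue_letters], levels = unname(codes)))
print(composition)
```

The counts (4 Hex, 4 HexNAc, 1 dHex) match the nine residues in the structure returned by parse_pglyco_struc().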

StrucGP Format: Alphabetical System

StrucGP uses a letter-based encoding system:

strucgp <- "A2B2C1D1E2F1fedD1E2edcbB5ba"
parse_strucgp_struc(strucgp)
#> <glycan_structure[1]>
#> [1] Hex(??-?)HexNAc(??-?)Hex(??-?)[HexNAc(??-?)Hex(??-?)]Hex(??-?)HexNAc(??-?)[dHex(??-?)]HexNAc(??-
#> # Unique structures: 1

The Bottom Line 🎯

glyparse transforms the chaos of glycan text formats into order. No matter where your glycan data comes from (databases, literature, or software tools), you can now parse it into a glyrepr::glycan_structure object for further analysis. In fact, the glyread package uses these parsing functions internally when reading output from common glycopeptide identification software.

Happy parsing! 🧬✨