Rendering Graphs for Graph Reasoning
in Multimodal Large Language Models

¹Southern University of Science and Technology, ²Hong Kong University of Science and Technology, ³Peng Cheng Laboratory
*Equal Contribution

Abstract

Large Language Models (LLMs) are increasingly used for various tasks with graph structures, such as robotic planning, knowledge graph completion, and common-sense reasoning. Though LLMs can comprehend graph information in a textual format, they overlook the rich visual modality, which is an intuitive way for humans to comprehend structural information and conduct graph reasoning. The potential benefits and capabilities of representing graph structures as visual images (i.e., visual graphs) are still unexplored. In this paper, we take the first step in incorporating visual information into graph reasoning tasks and propose a new benchmark, GITQA, where each sample is a tuple (graph, image, textual description). We conduct extensive experiments on the GITQA benchmark using state-of-the-art multimodal LLMs. Results on graph reasoning tasks show that combining textual and visual information performs better than using either modality alone. Moreover, the LLaVA-7B/13B models finetuned on the GITQA training set (which we refer to as GITA) achieve higher accuracy than the closed-source model GPT-4(V). We also study the effects of augmentations on graph reasoning.

Visual Graphs Help LLMs on Graph Reasoning

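To make the idea of a "visual graph" concrete, the sketch below renders a small graph as an image and pairs it with a textual description of the same structure. This is a minimal illustration using networkx and matplotlib; the actual GITA rendering pipeline (layout engines, node/edge styles, and augmentations) is specified in the paper, and the file name here is illustrative.

```python
import networkx as nx
import matplotlib.pyplot as plt

# Build a small undirected graph from an edge list,
# the same structure a textual description would encode.
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)])

# Render the graph as an image ("visual graph").
pos = nx.spring_layout(G, seed=42)  # fixed seed for a reproducible layout
nx.draw(G, pos, with_labels=True, node_color="lightblue", edge_color="gray")
plt.savefig("visual_graph.png", dpi=150, bbox_inches="tight")
plt.close()

# Pair the image with a textual description of the same graph.
text_description = "An undirected graph with edges: " + ", ".join(
    f"({u},{v})" for u, v in G.edges()
)
print(text_description)
```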

GITQA (Graph-Image-Text Question Answering) Dataset

Please check out our multimodal graph reasoning dataset GITQA in the GITQA Hugging Face Collection.
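If the data is hosted as standard Hugging Face datasets, loading a GITQA split might look like the sketch below. The repository ID is a hypothetical placeholder; substitute the actual ID from the collection linked above.

```python
from datasets import load_dataset

# Hypothetical repository ID -- replace with the actual one from the
# GITQA Hugging Face Collection linked above.
dataset = load_dataset("ORGANIZATION/GITQA")

# Each GITQA sample is a (graph, image, textual description) tuple,
# together with the question-answer fields used for graph reasoning.
sample = dataset["train"][0]
print(sample.keys())
```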

GITQA Benchmark

GITA (Graph-Image-Text Assistant) Models

Please check out our Model Zoo for all public GITA checkpoints in the Model Checkpoints Hugging Face Collection.
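Assuming the released GITA checkpoints (LLaVA-7B/13B finetunes) are compatible with the Hugging Face transformers LLaVA integration, inference could be sketched as follows. The model ID is a hypothetical placeholder and the prompt follows the LLaVA-1.5 convention; consult the model cards in the collection above for the exact usage.

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

model_id = "ORGANIZATION/GITA-7B"  # hypothetical model ID -- see the collection above
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

# Ask a graph reasoning question about a rendered visual graph.
image = Image.open("visual_graph.png")
prompt = "USER: <image>\nIs there a cycle in this graph? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```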

GITA Performance
Figure: Test accuracy (averaged over eight graph reasoning tasks) of the closed-source GPT-4(V) and the open-source Vicuna/LLaVA models.

Comparison of Models Using Textual, Visual, or Both Modalities

Figure: Accuracy for the different modality settings.
Figure: Illustration of the modality settings and models.

BibTeX

@article{wei2024rendering,
  title={Rendering Graphs for Graph Reasoning in Multimodal Large Language Models},
  author={Wei, Yanbin and Fu, Shuai and Jiang, Weisen and Kwok, James T and Zhang, Yu},
  journal={arXiv preprint arXiv:2402.02130},
  year={2024}
}