Financial Document Intelligence

Parsing richly formatted financial documents and extracting structured information to support financial regulation (FinReg) and FinTech applications.

Extracting Zero-shot Structured Information from Form-like Documents: Pretraining with Keys and Triggers (AAAI 2021)

Rongyu Cao, Ping Luo


In this paper, we revisit the problem of extracting the values of a given set of key fields from form-like documents. It is a vital step in supporting many downstream applications, such as knowledge base construction, question answering, and document comprehension. Previous studies ignore the semantics of the given keys by treating them only as class labels, and thus may be unable to handle zero-shot keys. To address this issue, we propose a Key-Aware and Trigger-Aware (KATA) extraction model. Given the input key, it explicitly learns two mappings: from key representations to trigger representations, and then from trigger representations to values. These two mappings might be intrinsic and invariant across different keys and documents. Experiments with fine-tuning on two applications show that the proposed model achieves more than 70% accuracy on the extraction of zero-shot keys, while previous methods all fail. [Paper]

Cracking Tabular Presentation Diversity for Automatic Cross-Checking over Numerical Facts (KDD 2020)

Hongwei Li, Qingping Yang, Yixuan Cao, Jiaquan Yao and Ping Luo


Tabular forms of numerical facts are widespread in the disclosure documents of vertical domains, especially the financial field. It is also quite common for the same fact to be mentioned multiple times in different tables with diverse tabular presentations. However, due to the large volume of tables, frequent updates during editing, and limited time for manual cross-checking, these facts might be inconsistent with each other even after official publication. This creates an opportunity for Automatic Numerical Cross-Checking over Tables. This paper introduces the key module of such a system, which aims to identify whether a pair of table cells are semantically equivalent, namely whether they refer to the same fact. We present an end-to-end solution based on binary classification over each pair of table cells, which does not involve explicit semantic parsing of tables. This system has received wide recognition in the Chinese financial community. Nine of the top ten Chinese security brokers have adopted this system to support their investment banking business. [Paper]

Nested Relation Extraction with Iterative Neural Network (CIKM 2019)

Yixuan Cao, Dian Chen, Hongwei Li, and Ping Luo


Natural language is used to describe objective facts, including simple relations like "Jobs was the CEO of Apple", and complex relations like "the GDP of the United States in 2018 grew 2.9% compared with 2017". In the latter example, the growth rate relation holds between two other relations. Due to the complex nature of language, nested relations of this kind are expressed frequently, especially in professional documents in fields like economics, finance, and biomedicine. However, extracting nested relations is challenging, and the problem has received almost no research attention. In this paper, we formally formulate the nested relation extraction problem and propose a solution using an Iterative Neural Network. Specifically, we observe that nested relation structures can be expressed as a Directed Acyclic Graph (DAG), and we propose a model that simultaneously considers the word sequence of natural language in the horizontal direction and the DAG structure in the vertical direction. [Paper]
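
The DAG view of nested relations can be made concrete with a small data-structure sketch. The code below is only an illustration of the GDP sentence from the abstract, not the paper's model: the class names `Entity` and `Relation`, the labels `value_of` and `growth_rate`, and the helpers `depth` and `layers` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Entity:
    """A leaf node: a span of text such as "GDP" or "2018"."""
    text: str


@dataclass(frozen=True)
class Relation:
    """An inner node whose arguments are entities or other relations."""
    label: str  # hypothetical label, not taken from the paper
    args: Tuple["Node", ...]


Node = Union[Entity, Relation]


def depth(node: Node) -> int:
    """Entities sit at depth 0; a relation sits one above its deepest argument."""
    if isinstance(node, Entity):
        return 0
    return 1 + max(depth(arg) for arg in node.args)


def layers(root: Node) -> Dict[int, List[Node]]:
    """Group every distinct node reachable from root by its depth."""
    seen = set()
    by_depth: Dict[int, List[Node]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:  # shared nodes are visited once (DAG, not tree)
            continue
        seen.add(node)
        by_depth.setdefault(depth(node), []).append(node)
        if isinstance(node, Relation):
            stack.extend(node.args)
    return by_depth


# "the GDP of the United States in 2018 grew 2.9% compared with 2017"
gdp, us = Entity("GDP"), Entity("United States")
y2018, y2017, rate = Entity("2018"), Entity("2017"), Entity("2.9%")
gdp_2018 = Relation("value_of", (gdp, us, y2018))
gdp_2017 = Relation("value_of", (gdp, us, y2017))
growth = Relation("growth_rate", (gdp_2018, gdp_2017, rate))

print(depth(growth))  # → 2
```

Here `gdp` and `us` are shared by both inner relations, which is why the structure is a DAG rather than a tree. Grouping nodes by depth with `layers` gives one way to picture a bottom-up, layer-by-layer traversal of such a structure.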

Towards Automatic Numerical Cross-Checking: Extracting Formulas from Text (WWW 2018)

Yixuan Cao, Hongwei Li, Ping Luo, and Jiaquan Yao


Verbal descriptions of the numerical relationships among objective measures are widespread in documents published on the Web, especially in the financial field. However, due to the large volume of documents and the limited time for manual cross-checking, these claims might be inconsistent with the original structured data of the related indicators even after official publication. Such errors can seriously affect investors’ assessment of a company. This creates an opportunity for automated Numerical Cross-Checking (NCC) systems. This paper introduces the key component of such a system, the formula extractor, which extracts formulas from verbal descriptions of numerical claims. Specifically, we formulate this task as a DAG-structure prediction problem. We propose a bi-directional LSTM followed by a DAG-structured LSTM to extract formulas layer by layer, iteratively. The NCC project has received wide recognition in the Chinese financial community. [Paper]
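
To show what a cross-check does once a formula has been extracted, here is a minimal sketch. It does not reproduce the paper's extractor; it assumes the extracted formula is a simple growth rate, and the function names, the tolerance, and all figures are hypothetical values made up for illustration.

```python
def growth_rate(current: float, previous: float) -> float:
    """Relative change from previous to current, e.g. 0.029 for 2.9%."""
    return (current - previous) / previous


def is_consistent(claimed: float, current: float, previous: float,
                  tol: float = 0.0005) -> bool:
    """True if the claimed rate matches the structured data within tol.

    The default tolerance is an assumption: it allows for a percentage
    that was rounded to one decimal place in the text.
    """
    return abs(growth_rate(current, previous) - claimed) <= tol


# Hypothetical structured data for one indicator in two periods.
previous_value = 1000.0
current_value = 1029.0

# A claim that agrees with the data, and one that does not.
print(is_consistent(0.029, current_value, previous_value))  # → True
print(is_consistent(0.035, current_value, previous_value))  # → False
```

A claim flagged as inconsistent would then be passed to a human reviewer, since the error could lie in either the text or the structured data.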

© Copyright 2021 MLDM, ICT, CAS - All Rights Reserved
Last modified on Aug 19th, 2021