Large language models (LLMs) can be used to analyze complex documents and provide summaries and answers to questions. The post Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data describes how to fine-tune an LLM using your own dataset. Once you have a solid LLM, you'll want to expose it to business users so they can process new documents, which could be hundreds of pages long. In this post, we demonstrate how to construct a real-time user interface that lets business users process a PDF document of arbitrary length. Once the file is processed, you can summarize the document or ask questions about its content. The sample solution described in this post is available on GitHub.

Working with financial documents

Financial statements like quarterly earnings reports and annual reports to shareholders are often tens or hundreds of pages long. These documents contain a lot of boilerplate language like disclaimers and legal language. If you want to extract the key data points from one of these documents, you need both time and some familiarity with the boilerplate language so you can identify the interesting facts. And of course, you can't ask an LLM questions about a document it has never seen.

LLMs used for summarization have a limit on the number of tokens passed into the model, and with some exceptions, this is typically no more than a few thousand tokens. That normally precludes summarizing longer documents.

Our solution handles documents that exceed an LLM's maximum token sequence length and makes those documents available to the LLM for question answering:

- It has an interactive web application for business users to upload and process PDFs.
- It uses the langchain library to split a large PDF into more manageable chunks.
- It uses the retrieval augmented generation (RAG) technique to let users ask questions about new data that the LLM hasn't seen before.

As shown in the following diagram, we use a front end implemented with React JavaScript hosted in an Amazon Simple Storage Service (Amazon S3) bucket fronted by Amazon CloudFront. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of the post-processing, an AWS Lambda function inserts special markers into the text indicating page boundaries. When that job is done, you can invoke an API that summarizes the text or answers questions about it.

Because some of these steps may take some time, the architecture uses a decoupled asynchronous approach. For example, the call to summarize a document invokes a Lambda function that posts a message to an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function picks up that message and starts an Amazon Elastic Container Service (Amazon ECS) AWS Fargate task. The Fargate task calls the Amazon SageMaker inference endpoint. We use a Fargate task here because summarizing a very long PDF may take more time and memory than a Lambda function has available. When the summarization is done, the front-end application can pick up the results from an Amazon DynamoDB table.

For summarization, we use AI21's Summarize model, one of the foundation models available through Amazon SageMaker JumpStart. Although this model handles documents of up to 10,000 words (approximately 40 pages), we use langchain's text splitter to make sure that each summarization call to the LLM is no more than 10,000 words long. For text generation, we use Cohere's Medium model, and we use GPT-J for embeddings, both via JumpStart. When handling larger documents, we need to define how to split the document into smaller pieces.
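The page-boundary post-processing step described above can be sketched in a few lines. This is a simplified illustration, not the actual Lambda code from the sample solution: the marker format is hypothetical, and the input mimics the shape of Textract's `Blocks` output, where each `LINE` block carries a `Page` attribute.

```python
def insert_page_markers(textract_blocks, marker="<page:{n}>"):
    """Rebuild document text from Textract LINE blocks, inserting a
    marker string wherever a new page begins.

    textract_blocks: list of dicts shaped like Textract's Blocks output,
    e.g. {"BlockType": "LINE", "Page": 1, "Text": "..."}.
    The marker format here is a hypothetical choice.
    """
    parts, current_page = [], None
    for block in textract_blocks:
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, and other block types
        page = block.get("Page", 1)
        if page != current_page:
            parts.append(marker.format(n=page))
            current_page = page
        parts.append(block["Text"])
    return "\n".join(parts)
```

Keeping page markers in the extracted text lets later steps report which page an answer came from.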
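The decoupled asynchronous flow starts with a Lambda function that simply enqueues the request. A minimal sketch of such a handler follows; the environment variable name and message fields are our own assumptions rather than the sample solution's actual code, and the SQS client is injectable so the logic can be exercised without AWS (in Lambda it would default to `boto3.client("sqs")`).

```python
import json
import os


def lambda_handler(event, context, sqs_client=None):
    """Sketch of a Lambda that accepts a summarization request and posts
    it to an SQS queue for asynchronous processing.

    SUMMARIZE_QUEUE_URL and the message fields are hypothetical names,
    not taken from the actual sample solution.
    """
    if sqs_client is None:
        import boto3  # available in the Lambda runtime
        sqs_client = boto3.client("sqs")
    message = {
        "document_key": event["document_key"],  # S3 key of the extracted text
        "action": "summarize",
    }
    sqs_client.send_message(
        QueueUrl=os.environ["SUMMARIZE_QUEUE_URL"],
        MessageBody=json.dumps(message),
    )
    # 202 Accepted: work is queued, results arrive later via DynamoDB
    return {"statusCode": 202, "body": json.dumps({"status": "queued"})}
```

Returning immediately with a "queued" status is what lets the front end stay responsive while the Fargate task does the long-running summarization.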
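The chunking step can be illustrated with a minimal word-count splitter. This is a simplified stand-in for langchain's text splitter, not the library's implementation: real splitters also try to break on paragraph or sentence boundaries and can overlap chunks to preserve context. The 10,000-word default matches the summarization model's stated capacity.

```python
def split_into_chunks(text, max_words=10000):
    """Split text into chunks of at most max_words words each.

    A simplified sketch of what langchain's text splitter does for us,
    so that each summarization call stays within the model's limit.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk can then be summarized independently, and the per-chunk summaries combined into a summary of the whole document.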
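The RAG technique works by embedding the document chunks, retrieving the chunks most similar to the user's question, and passing those to the LLM as context. The following is a minimal retrieval sketch under the assumption that embeddings are plain lists of floats; in the real solution they would come from the GPT-J embedding endpoint, and the function names are our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_chunks(question_embedding, chunk_embeddings, chunks, k=3):
    """Return the k chunks whose embeddings are most similar to the
    question embedding. These top chunks are what gets placed into the
    text-generation model's prompt as context."""
    scored = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine_similarity(question_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]
```

Because only the most relevant chunks reach the prompt, the question-answering call stays within the LLM's token limit even for very long documents.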