OpenAI Summary (ChatGPT) Nuxeo Integration

Generates a text summary of a file using OpenAI ChatGPT.

Architecture

Input: Nuxeo documents with text files

Process:

a. Blob Extraction: Get the Blob from the file:content property of the document.

b. Blob Conversion: Supported text files are converted to PDF using the any2pdf converter.

c. Text Processing: Process each page one by one.

d. Nuxeo Stream: Use Nuxeo Stream to process the text chunks asynchronously:

  • Producer/Consumer Pattern
  • SummaryServiceImpl.summaryProducer() -> Split text pages and send to Stream Log.
  • PageSummaryComputation -> For each text chunk, call the OpenAI ChatGPT summary endpoint and save the result to the key/value store (KVS).
  • SummaryDoneComputation -> Read the summaries of all pages from the KVS, merge them in page order, and save the merged summary.

e. Event Listeners:

  • PostCommitEventListener -> process multiple documents
  • EventListener -> blobIsDirty check

Output: A Nuxeo document with the merged summary of all pages.

This design processes pages one at a time, which keeps memory usage low, and relies on Nuxeo Stream for asynchronous processing.
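
The producer/consumer flow described above can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity and does not use the Nuxeo Java API: `queue.Queue` stands in for the Stream Log, a plain dict stands in for the KVS, and `produce`, `consume`, `merge`, `summarize_document` and `fake_summarize` are hypothetical names, with `fake_summarize` replacing the real OpenAI call.

```python
import queue
import threading


def fake_summarize(page_text):
    # Stand-in for the OpenAI call (assumption): keep the first sentence only.
    return page_text.strip().split(".")[0] + "."


def produce(doc_id, pages, log):
    # Producer: append one record per page, tagged with its page index.
    for index, text in enumerate(pages):
        log.put((doc_id, index, text))


def consume(log, kvs, summarize):
    # Consumer: summarize each page and store it under "docId:pageIndex".
    while True:
        record = log.get()
        if record is None:  # sentinel: no more records for this worker
            break
        doc_id, index, text = record
        kvs[f"{doc_id}:{index}"] = summarize(text)


def merge(doc_id, page_count, kvs):
    # Final step: read summaries back in page order, not completion order.
    return "\n".join(kvs[f"{doc_id}:{i}"] for i in range(page_count))


def summarize_document(doc_id, pages, summarize=fake_summarize, workers=2):
    log = queue.Queue()  # stands in for the Stream Log
    kvs = {}             # stands in for the key/value store
    threads = [
        threading.Thread(target=consume, args=(log, kvs, summarize))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    produce(doc_id, pages, log)
    for _ in threads:
        log.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return merge(doc_id, len(pages), kvs)
```

Keying each page summary by its index is what lets the merge step restore page order even when the workers finish out of order.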

Configuration

  1. Add the OpenAI API token to nuxeo.conf:
openai.token:
  2. Default configurations:
summary.extraction.enable.mime-types=text/plain,application/pdf,application/msword,application/vnd.openxmlformats-officedocument.wordprocessingml.document
summary.extraction.openai.url=https://api.openai.com/v1/completions
summary.extraction.openai.model=text-davinci-003
summary.extraction.openai.temperature=0.3
summary.extraction.openai.max-tokens=200
summary.extraction.openai.top-p=1
summary.extraction.openai.frequency-penalty=0
summary.extraction.openai.presence-penalty=0
summary.extraction.http.max-retries=3
summary.extraction.http.sec-delay=10
summary.extraction.http.sec-max-delay=6
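
As an illustration of how these settings could translate into a request to the completions endpoint, here is a minimal Python sketch. It is not the plugin's code: `parse_properties`, `is_supported` and `build_request` are hypothetical helpers, the prompt wording is invented, and the mapping from property names to JSON fields is an assumption based on the parameter names of the OpenAI completions API. The sketch only builds the request and does not send it.

```python
import json

DEFAULTS = """
summary.extraction.enable.mime-types=text/plain,application/pdf
summary.extraction.openai.url=https://api.openai.com/v1/completions
summary.extraction.openai.model=text-davinci-003
summary.extraction.openai.temperature=0.3
summary.extraction.openai.max-tokens=200
summary.extraction.openai.top-p=1
summary.extraction.openai.frequency-penalty=0
summary.extraction.openai.presence-penalty=0
"""


def parse_properties(text):
    # Parse simple key=value lines, skipping blanks and comments.
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def is_supported(props, mime_type):
    # Only MIME types listed in the enable list are summarized.
    enabled = props["summary.extraction.enable.mime-types"].split(",")
    return mime_type in [m.strip() for m in enabled]


def build_request(props, token, page_text):
    prefix = "summary.extraction.openai."
    payload = {
        "model": props[prefix + "model"],
        # Hypothetical prompt; the plugin's actual prompt is not documented here.
        "prompt": "Summarize the following text:\n\n" + page_text,
        "temperature": float(props[prefix + "temperature"]),
        "max_tokens": int(props[prefix + "max-tokens"]),
        "top_p": float(props[prefix + "top-p"]),
        "frequency_penalty": float(props[prefix + "frequency-penalty"]),
        "presence_penalty": float(props[prefix + "presence-penalty"]),
    }
    headers = {
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json",
    }
    return props[prefix + "url"], headers, json.dumps(payload)
```

Calling `build_request(parse_properties(DEFAULTS), token, page_text)` returns the URL, headers and JSON body that an HTTP client would then POST for each page.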