commit
aa535db511
876 changed files with 114726 additions and 0 deletions
 .devcontainer/README.md | 73
 .devcontainer/devcontainer.json | 211
 .dockerignore | 18
 .editorconfig | 17
 .gitattributes | 1
 .github/FUNDING.yml | 1
 .github/ISSUE_TEMPLATE/01_bug.yml | 42
 .github/ISSUE_TEMPLATE/02_feature.yml | 19
 .github/ISSUE_TEMPLATE/03_documentation.yml | 13
 .github/ISSUE_TEMPLATE/config.yml | 5
 .github/workflows/build-and-push-image-semver.yaml | 115
 .github/workflows/build-and-push-image.yaml | 134
 .github/workflows/check-translations.yaml | 37
 .github/workflows/dev-build.yaml | 114
 .gitignore | 11
 .gitmodules | 7
 .hadolint.yaml | 8
 .idea/.gitignore | 8
 .idea/anything-llm-master.iml | 9
 .idea/codeStyles/Project.xml | 58
 .idea/codeStyles/codeStyleConfig.xml | 5
 .idea/inspectionProfiles/Project_Default.xml | 6
 .idea/misc.xml | 6
 .idea/modules.xml | 8
 .nvmrc | 1
 .prettierignore | 16
 .prettierrc | 38
 .vscode/launch.json | 74
 .vscode/settings.json | 62
 .vscode/tasks.json | 94
 BARE_METAL.md | 115
 LICENSE | 21
 README.md | 268
 SECURITY.md | 15
 cloud-deployments/aws/cloudformation/DEPLOY.md | 49
 cloud-deployments/aws/cloudformation/aws_https_instructions.md | 118
 cloud-deployments/aws/cloudformation/cloudformation_create_anythingllm.json | 234
 cloud-deployments/digitalocean/terraform/DEPLOY.md | 44
 cloud-deployments/digitalocean/terraform/main.tf | 52
 cloud-deployments/digitalocean/terraform/outputs.tf | 4
 cloud-deployments/digitalocean/terraform/user_data.tp1 | 22
 cloud-deployments/gcp/deployment/DEPLOY.md | 54
 cloud-deployments/gcp/deployment/gcp_deploy_anything_llm.yaml | 45
 cloud-deployments/huggingface-spaces/Dockerfile | 31
 cloud-deployments/k8/manifest.yaml | 214
 collector/.env.example | 1
 collector/.gitignore | 6
 collector/.nvmrc | 1
 collector/extensions/index.js | 159
 collector/extensions/resync/index.js | 114
 collector/hotdir/__HOTDIR__.md | 3
 collector/index.js | 151
 collector/middleware/setDataSigner.js | 41
 collector/middleware/verifyIntegrity.js | 21
 collector/nodemon.json | 3
 collector/package.json | 54
 collector/processLink/convert/generic.js | 127
 collector/processLink/index.js | 23
 collector/processRawText/index.js | 69
 collector/processSingleFile/convert/asAudio.js | 73
 collector/processSingleFile/convert/asDocx.js | 57
 collector/processSingleFile/convert/asEPub.js | 55
 collector/processSingleFile/convert/asImage.js | 48
 collector/processSingleFile/convert/asMbox.js | 74
 collector/processSingleFile/convert/asOfficeMime.js | 53
 collector/processSingleFile/convert/asPDF/PDFLoader/index.js | 97
 collector/processSingleFile/convert/asPDF/index.js | 72
 collector/processSingleFile/convert/asTxt.js | 53
 collector/processSingleFile/convert/asXlsx.js | 113
 collector/processSingleFile/index.js | 78
 collector/storage/.gitignore | 2
 collector/storage/tmp/.placeholder | 0
 collector/utils/EncryptionWorker/index.js | 77
 collector/utils/OCRLoader/index.js | 307
 collector/utils/WhisperProviders/OpenAiWhisper.js | 49
 collector/utils/WhisperProviders/localWhisper.js | 219
 collector/utils/comKey/index.js | 54
 collector/utils/constants.js | 71
 collector/utils/extensions/Confluence/ConfluenceLoader/index.js | 141
 collector/utils/extensions/Confluence/index.js | 257
 collector/utils/extensions/RepoLoader/GithubRepo/RepoLoader/index.js | 235
 collector/utils/extensions/RepoLoader/GithubRepo/index.js | 159
 collector/utils/extensions/RepoLoader/GitlabRepo/RepoLoader/index.js | 376
 collector/utils/extensions/RepoLoader/GitlabRepo/index.js | 252
 collector/utils/extensions/RepoLoader/index.js | 41
 collector/utils/extensions/WebsiteDepth/index.js | 166
 collector/utils/extensions/YoutubeTranscript/YoutubeLoader/index.js | 90
 collector/utils/extensions/YoutubeTranscript/YoutubeLoader/youtube-transcript.js | 117
 collector/utils/extensions/YoutubeTranscript/index.js | 142
 collector/utils/files/index.js | 192
 collector/utils/files/mime.js | 64
 collector/utils/http/index.js | 18
 collector/utils/logger/index.js | 68
 collector/utils/tokenizer/index.js | 66
 collector/utils/url/index.js | 55
 collector/yarn.lock | 3832
 docker/.env.example | 318
 docker/Dockerfile | 173
 docker/HOW_TO_USE_DOCKER.md | 209
 docker/docker-compose.yml | 31
.devcontainer/README.md
@@ -0,0 +1,73 @@
# AnythingLLM Development Container Setup

Welcome to the AnythingLLM development container configuration, designed to create a seamless and feature-rich development environment for this project.

<center><h1><b>PLEASE READ THIS</b></h1></center>

## Prerequisites

- [Docker](https://www.docker.com/get-started)
- [Visual Studio Code](https://code.visualstudio.com/)
- [Remote - Containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) VS Code extension

## Features

- **Base Image**: Built on `mcr.microsoft.com/devcontainers/javascript-node:1-18-bookworm`, which ships Node.js 18 LTS.
- **Additional Tools**: Includes `hadolint` and essential apt packages such as `curl`, `gnupg`, and more.
- **Ports**: Configured to auto-forward ports `3000` (Frontend) and `3001` (Backend).
- **Environment Variables**: Sets `NODE_ENV` to `development` and `ESLINT_USE_FLAT_CONFIG` to `true`.
- **VS Code Extensions**: A suite of extensions such as `Prettier`, `Docker`, `ESLint`, and more is installed automatically. Review the list and remove any you do not agree with. AI-powered extensions and time trackers are (for now) not included to avoid any privacy concerns, but you can install them later in your own environment.

## Getting Started

1. Using GitHub Codespaces: simply create a new workspace and the devcontainer will be built for you.

2. Using your local VS Code (Release or Insiders): we suggest you first fork the repo and then clone it to your local machine using the VS Code tools. Open the project folder in VS Code, which will prompt you to reopen the project in a devcontainer. Select yes, and the devcontainer will be built for you. If this does not happen, open the command palette and select "Remote-Containers: Reopen in Container".

## On Creation

When the container is built for the first time, it automatically runs `yarn setup` to make sure everything is in place for the Collector, Server, and Frontend. This command is re-run automatically on the next reboot if the content has changed.

## Work in the Container

Once the container is up, be patient. Some extensions may complain because dependencies are still being installed, and in the Extensions tab some may ask you to "Reload" the project. Don't do that yet; wait until everything settles down first. We suggest you create a new VS Code profile for this devcontainer, so that any configuration and extension changes won't affect your default profile.

Checklist:

- [ ] The usual message asking you to start the Server and Frontend in different windows is now "hidden" in the build output of the devcontainer. Don't forget to do as it suggests.
- [ ] Open a JavaScript file, for example "server/index.js", and check that `eslint` is working. It will complain that `'err' is defined but never used.`, which means it is working.
- [ ] Open a React file, for example "frontend/src/main.jsx", and check that `eslint` complains about `Fast refresh only works when a file has exports. Move your component(s) to a separate file.`. Again, this means `eslint` is working. Now check the status bar: if `Prettier` shows a double checkmark :heavy_check_mark:, Prettier is working. You will also see a handy `Formatting:` :heavy_check_mark: toggle that can be used to temporarily disable the `Format on Save` feature.
- [ ] Check that the left pane shows the NPM Scripts view (this may be disabled; look at the "Explorer" three-dots menu in the upper right). It lists the scripts inside the `package.json` files. You will basically need to run `dev:collector`, `dev:server`, and `dev:frontend`, in that order (see the sketch below). When the frontend finishes starting, a browser window will open **inside** VS Code, though you can also open it outside.
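
For reference, these are the same scripts written as plain terminal commands. This is a minimal sketch that assumes they are run from the repository root; the NPM Scripts view and the status bar shortcuts invoke the same scripts:

```shell
# Run each command in its own terminal, in this order
yarn dev:collector   # document collector
yarn dev:server      # backend API, forwarded on port 3001
yarn dev:frontend    # Vite frontend, forwarded on port 3000
```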

:warning: **Important for all developers** :warning:

- [ ] When you are using `NODE_ENV=development`, the server will not store the configuration you set, for security reasons. Please set the proper config in the `.env.development` file. The side effect if you don't: every time you restart the server, you will be sent to the "Onboarding" page again.

**Note when using GitHub Codespaces**

- [ ] When running the "Server" for the first time, its port is automatically configured to be publicly accessible, as this is required for the frontend to reach the server backend. To learn more, read the `.env` file in the frontend folder. If any issues occur, make sure the port "Visibility" of the "Server" is manually set to "Public". Again, this is only needed when developing on GitHub Codespaces.
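
If the automatic step ever fails, the same change can be made from a terminal with the GitHub CLI. This mirrors what the Server task already does on Codespaces; `$CODESPACE_NAME` is provided by the Codespaces environment:

```shell
# Make the backend port reachable by the frontend in GitHub Codespaces
gh codespace ports visibility 3001:public -c $CODESPACE_NAME
```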

**For the Collector:**

- [x] In the past, the Collector dwelled within the Python domain, but it has since journeyed to the splendid realm of Node.js. Consequently, the configuration complexities of bygone versions are no longer a concern.

### Now it is ready to start

In the status bar you will see three shortcuts named `Collector`, `Server` and `Frontend`. Click them in that order and wait (don't forget to set the Server port 3001 to Public if you are using GitHub Codespaces **_before_** starting the Frontend).

Now you can enjoy your time developing instead of reconfiguring everything.

## Debugging with the devcontainer

### For debugging the collector, server and frontend

First, make sure the built-in extension (ms-vscode.js-debug) is active (there is no obvious reason it would not be, but check just in case). If you want, you can install the nightly version (ms-vscode.js-debug-nightly).

Then, in the "Run and Debug" tab (Ctrl+Shift+D), you can select from the menu:

- Collector debug. This will start the collector in debug mode and attach the debugger. Works very well.
- Server debug. This will start the server in debug mode and attach the debugger. Works very well.
- Frontend debug. This will start the frontend in debug mode and attach the debugger. This one is still a work in progress: it is unclear whether VS Code handles .jsx files as seamlessly as the plain .js on the server, and a specific configuration for Vite or React may be needed. It does start, though. Two additional configurations launch Chrome and Edge, which should allow breakpoints in .jsx files. The ideal setup would always use the embedded browser. WIP.

Please leave comments on the Issues tab or the [community Discord](https://discord.gg/6UyHPeGZAC).
.devcontainer/devcontainer.json
@@ -0,0 +1,211 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the |
||||
|
// README at: https://github.com/devcontainers/templates/tree/main/src/javascript-node |
||||
|
{ |
||||
|
"name": "Node.js", |
||||
|
// Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile |
||||
|
// "build": { |
||||
|
// "args": { |
||||
|
// "ARG_UID": "1000", |
||||
|
// "ARG_GID": "1000" |
||||
|
// }, |
||||
|
// "dockerfile": "Dockerfile" |
||||
|
// }, |
||||
|
// "containerUser": "anythingllm", |
||||
|
// Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile |
||||
|
"image": "mcr.microsoft.com/devcontainers/javascript-node:1-18-bookworm", |
||||
|
// Features to add to the dev container. More info: https://containers.dev/features. |
||||
|
"features": { |
||||
|
// Docker very useful linter |
||||
|
"ghcr.io/dhoeric/features/hadolint:1": { |
||||
|
"version": "latest" |
||||
|
}, |
||||
|
// Terraform support |
||||
|
"ghcr.io/devcontainers/features/terraform:1": {}, |
||||
|
// Just a wrap to install needed packages |
||||
|
"ghcr.io/devcontainers-contrib/features/apt-packages:1": { |
||||
|
// Dependencies copied from ../docker/Dockerfile plus some dev stuff |
||||
|
"packages": [ |
||||
|
"build-essential", |
||||
|
"ca-certificates", |
||||
|
"curl", |
||||
|
"ffmpeg", |
||||
|
"fonts-liberation", |
||||
|
"git", |
||||
|
"gnupg", |
||||
|
"htop", |
||||
|
"less", |
||||
|
"libappindicator1", |
||||
|
"libasound2", |
||||
|
"libatk-bridge2.0-0", |
||||
|
"libatk1.0-0", |
||||
|
"libc6", |
||||
|
"libcairo2", |
||||
|
"libcups2", |
||||
|
"libdbus-1-3", |
||||
|
"libexpat1", |
||||
|
"libfontconfig1", |
||||
|
"libgbm1", |
||||
|
"libgcc1", |
||||
|
"libgfortran5", |
||||
|
"libglib2.0-0", |
||||
|
"libgtk-3-0", |
||||
|
"libnspr4", |
||||
|
"libnss3", |
||||
|
"libpango-1.0-0", |
||||
|
"libpangocairo-1.0-0", |
||||
|
"libstdc++6", |
||||
|
"libx11-6", |
||||
|
"libx11-xcb1", |
||||
|
"libxcb1", |
||||
|
"libxcomposite1", |
||||
|
"libxcursor1", |
||||
|
"libxdamage1", |
||||
|
"libxext6", |
||||
|
"libxfixes3", |
||||
|
"libxi6", |
||||
|
"libxrandr2", |
||||
|
"libxrender1", |
||||
|
"libxss1", |
||||
|
"libxtst6", |
||||
|
"locales", |
||||
|
"lsb-release", |
||||
|
"procps", |
||||
|
"tzdata", |
||||
|
"wget", |
||||
|
"xdg-utils" |
||||
|
] |
||||
|
} |
||||
|
}, |
||||
|
"updateContentCommand": "cd server && yarn && cd ../collector && PUPPETEER_DOWNLOAD_BASE_URL=https://storage.googleapis.com/chrome-for-testing-public yarn && cd ../frontend && yarn && cd .. && yarn setup:envs && yarn prisma:setup && echo \"Please run yarn dev:server, yarn dev:collector, and yarn dev:frontend in separate terminal tabs.\"", |
||||
|
// Use 'postCreateCommand' to run commands after the container is created. |
||||
|
// This configures VITE for github codespaces and installs gh cli |
||||
|
"postCreateCommand": "if [ \"${CODESPACES}\" = \"true\" ]; then echo 'VITE_API_BASE=\"https://$CODESPACE_NAME-3001.$GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN/api\"' > ./frontend/.env && (type -p wget >/dev/null || (sudo apt update && sudo apt-get install wget -y)) && sudo mkdir -p -m 755 /etc/apt/keyrings && wget -qO- https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo tee /etc/apt/keyrings/githubcli-archive-keyring.gpg > /dev/null && sudo chmod go+r /etc/apt/keyrings/githubcli-archive-keyring.gpg && echo \"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main\" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null && sudo apt update && sudo apt install gh -y; fi", |
||||
|
"portsAttributes": { |
||||
|
"3001": { |
||||
|
"label": "Backend", |
||||
|
"onAutoForward": "notify" |
||||
|
}, |
||||
|
"3000": { |
||||
|
"label": "Frontend", |
||||
|
"onAutoForward": "openPreview" |
||||
|
} |
||||
|
}, |
||||
|
"capAdd": [ |
||||
|
"SYS_ADMIN" // needed for puppeteer using headless chrome in sandbox |
||||
|
], |
||||
|
"remoteEnv": { |
||||
|
"NODE_ENV": "development", |
||||
|
"ESLINT_USE_FLAT_CONFIG": "true", |
||||
|
"ANYTHING_LLM_RUNTIME": "docker" |
||||
|
}, |
||||
|
// "initializeCommand": "echo Initialize....", |
||||
|
"shutdownAction": "stopContainer", |
||||
|
// Configure tool-specific properties. |
||||
|
"customizations": { |
||||
|
"codespaces": { |
||||
|
"openFiles": [ |
||||
|
"README.md", |
||||
|
".devcontainer/README.md" |
||||
|
] |
||||
|
}, |
||||
|
"vscode": { |
||||
|
"openFiles": [ |
||||
|
"README.md", |
||||
|
".devcontainer/README.md" |
||||
|
], |
||||
|
"extensions": [ |
||||
|
"bierner.github-markdown-preview", |
||||
|
"bradlc.vscode-tailwindcss", |
||||
|
"dbaeumer.vscode-eslint", |
||||
|
"editorconfig.editorconfig", |
||||
|
"esbenp.prettier-vscode", |
||||
|
"exiasr.hadolint", |
||||
|
"flowtype.flow-for-vscode", |
||||
|
"gamunu.vscode-yarn", |
||||
|
"hashicorp.terraform", |
||||
|
"mariusschulz.yarn-lock-syntax", |
||||
|
"ms-azuretools.vscode-docker", |
||||
|
"streetsidesoftware.code-spell-checker", |
||||
|
"actboy168.tasks", |
||||
|
"tombonnike.vscode-status-bar-format-toggle", |
||||
|
"ms-vscode.js-debug" |
||||
|
], |
||||
|
"settings": { |
||||
|
"[css]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[dockercompose]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[dockerfile]": { |
||||
|
"editor.defaultFormatter": "ms-azuretools.vscode-docker" |
||||
|
}, |
||||
|
"[html]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[javascript]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[javascriptreact]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[json]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[jsonc]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[markdown]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[postcss]": { |
||||
|
"editor.defaultFormatter": "esbenp.prettier-vscode" |
||||
|
}, |
||||
|
"[toml]": { |
||||
|
"editor.defaultFormatter": "tamasfe.even-better-toml" |
||||
|
}, |
||||
|
"eslint.debug": true, |
||||
|
"eslint.enable": true, |
||||
|
"eslint.experimental.useFlatConfig": true, |
||||
|
"eslint.run": "onSave", |
||||
|
"files.associations": { |
||||
|
".*ignore": "ignore", |
||||
|
".editorconfig": "editorconfig", |
||||
|
".env*": "properties", |
||||
|
".flowconfig": "ini", |
||||
|
".prettierrc": "json", |
||||
|
"*.css": "tailwindcss", |
||||
|
"*.md": "markdown", |
||||
|
"*.sh": "shellscript", |
||||
|
"docker-compose.*": "dockercompose", |
||||
|
"Dockerfile*": "dockerfile", |
||||
|
"yarn.lock": "yarnlock" |
||||
|
}, |
||||
|
"javascript.format.enable": false, |
||||
|
"javascript.inlayHints.enumMemberValues.enabled": true, |
||||
|
"javascript.inlayHints.functionLikeReturnTypes.enabled": true, |
||||
|
"javascript.inlayHints.parameterTypes.enabled": true, |
||||
|
"javascript.inlayHints.variableTypes.enabled": true, |
||||
|
"js/ts.implicitProjectConfig.module": "CommonJS", |
||||
|
"json.format.enable": false, |
||||
|
"json.schemaDownload.enable": true, |
||||
|
"npm.autoDetect": "on", |
||||
|
"npm.packageManager": "yarn", |
||||
|
"prettier.useEditorConfig": false, |
||||
|
"tailwindCSS.files.exclude": [ |
||||
|
"**/.git/**", |
||||
|
"**/node_modules/**", |
||||
|
"**/.hg/**", |
||||
|
"**/.svn/**", |
||||
|
"**/dist/**" |
||||
|
], |
||||
|
"typescript.validate.enable": false, |
||||
|
"workbench.editorAssociations": { |
||||
|
"*.md": "vscode.markdown.preview.editor" |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root. |
||||
|
// "remoteUser": "root" |
||||
|
} |
||||
.dockerignore
@@ -0,0 +1,18 @@
**/server/utils/agents/aibitat/example/**
**/server/storage/documents/**
**/server/storage/vector-cache/**
**/server/storage/*.db
**/server/storage/lancedb
**/collector/hotdir/**
**/collector/outputs/**
**/node_modules/
**/dist/
**/v-env/
**/__pycache__/
**/.env
**/.env.*
**/bundleinspector.html
**/tmp/**
**/.log
!docker/.env.example
!frontend/.env.production
.editorconfig
@@ -0,0 +1,17 @@
# EditorConfig is awesome: https://EditorConfig.org

# top-most EditorConfig file
root = true

[*]
# Non-configurable Prettier behaviors
charset = utf-8
insert_final_newline = true
trim_trailing_whitespace = true

# Configurable Prettier behaviors
# (change these if your Prettier config differs)
end_of_line = lf
indent_style = space
indent_size = 2
max_line_length = 80
.gitattributes
@@ -0,0 +1 @@
* text=auto eol=lf
.github/FUNDING.yml
@@ -0,0 +1 @@
github: Mintplex-Labs
.github/ISSUE_TEMPLATE/01_bug.yml
@@ -0,0 +1,42 @@
name: 🐛 Bug Report |
||||
|
description: File a bug report for AnythingLLM |
||||
|
title: "[BUG]: " |
||||
|
labels: [possible bug] |
||||
|
body: |
||||
|
- type: markdown |
||||
|
attributes: |
||||
|
value: | |
||||
|
Use this template to file a bug report for AnythingLLM. Please be as descriptive as possible to allow everyone to replicate and solve your issue. |
||||
|
- type: dropdown |
||||
|
id: runtime |
||||
|
attributes: |
||||
|
label: How are you running AnythingLLM? |
||||
|
description: AnythingLLM can be run in many environments, pick the one that best represents where you encounter the bug. |
||||
|
options: |
||||
|
- Docker (local) |
||||
|
- Docker (remote machine) |
||||
|
- Local development |
||||
|
- AnythingLLM desktop app |
||||
|
- All versions |
||||
|
- Not listed |
||||
|
default: 0 |
||||
|
validations: |
||||
|
required: true |
||||
|
|
||||
|
- type: textarea |
||||
|
id: what-happened |
||||
|
attributes: |
||||
|
label: What happened? |
||||
|
description: Also tell us, what did you expect to happen? |
||||
|
validations: |
||||
|
required: true |
||||
|
|
||||
|
- type: textarea |
||||
|
id: reproduction |
||||
|
attributes: |
||||
|
label: Are there known steps to reproduce? |
||||
|
description: | |
||||
|
Let us know how to reproduce the bug and we may be able to fix it more |
||||
|
quickly. This is not required, but it is helpful. |
||||
|
validations: |
||||
|
required: false |
||||
.github/ISSUE_TEMPLATE/02_feature.yml
@@ -0,0 +1,19 @@
name: ✨ New Feature suggestion |
||||
|
description: Suggest a new feature for AnythingLLM! |
||||
|
title: "[FEAT]: " |
||||
|
labels: [enhancement, feature request] |
||||
|
body: |
||||
|
- type: markdown |
||||
|
attributes: |
||||
|
value: | |
||||
|
Share a new idea for a feature or improvement. Be sure to search existing |
||||
|
issues first to avoid duplicates. |
||||
|
|
||||
|
- type: textarea |
||||
|
id: description |
||||
|
attributes: |
||||
|
label: What would you like to see? |
||||
|
description: | |
||||
|
Describe the feature and why it would be useful to your use-case as well as others. |
||||
|
validations: |
||||
|
required: true |
||||
.github/ISSUE_TEMPLATE/03_documentation.yml
@@ -0,0 +1,13 @@
name: 📚 Documentation improvement |
||||
|
title: "[DOCS]: " |
||||
|
description: Report an issue or problem with the documentation. |
||||
|
labels: [documentation] |
||||
|
|
||||
|
body: |
||||
|
- type: textarea |
||||
|
id: description |
||||
|
attributes: |
||||
|
label: Description |
||||
|
description: Describe the issue with the documentation that is giving you trouble or causing confusion. |
||||
|
validations: |
||||
|
required: true |
||||
.github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,5 @@
blank_issues_enabled: true
contact_links:
  - name: 🧑🤝🧑 Community Discord
    url: https://discord.gg/6UyHPeGZAC
    about: Interact with the Mintplex Labs community here by asking for help, discussing and more!
.github/workflows/build-and-push-image-semver.yaml
@@ -0,0 +1,115 @@
name: Publish AnythingLLM Docker image on Release (amd64 & arm64) |
||||
|
|
||||
|
concurrency: |
||||
|
group: build-${{ github.ref }} |
||||
|
cancel-in-progress: true |
||||
|
|
||||
|
on: |
||||
|
release: |
||||
|
types: [published] |
||||
|
|
||||
|
jobs: |
||||
|
push_multi_platform_to_registries: |
||||
|
name: Push Docker multi-platform image to multiple registries |
||||
|
runs-on: ubuntu-latest |
||||
|
permissions: |
||||
|
packages: write |
||||
|
contents: read |
||||
|
steps: |
||||
|
- name: Check out the repo |
||||
|
uses: actions/checkout@v4 |
||||
|
|
||||
|
- name: Check if DockerHub build needed |
||||
|
shell: bash |
||||
|
run: | |
||||
|
# Check if the secret for USERNAME is set (don't even check for the password) |
||||
|
if [[ -z "${{ secrets.DOCKER_USERNAME }}" ]]; then |
||||
|
echo "DockerHub build not needed" |
||||
|
echo "enabled=false" >> $GITHUB_OUTPUT |
||||
|
else |
||||
|
echo "DockerHub build needed" |
||||
|
echo "enabled=true" >> $GITHUB_OUTPUT |
||||
|
fi |
||||
|
id: dockerhub |
||||
|
|
||||
|
- name: Set up QEMU |
||||
|
uses: docker/setup-qemu-action@v3 |
||||
|
|
||||
|
- name: Set up Docker Buildx |
||||
|
uses: docker/setup-buildx-action@v3 |
||||
|
|
||||
|
- name: Log in to Docker Hub |
||||
|
uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a |
||||
|
# Only login to the Docker Hub if the repo is mintplex/anythingllm, to allow for forks to build on GHCR |
||||
|
if: steps.dockerhub.outputs.enabled == 'true' |
||||
|
with: |
||||
|
username: ${{ secrets.DOCKER_USERNAME }} |
||||
|
password: ${{ secrets.DOCKER_PASSWORD }} |
||||
|
|
||||
|
- name: Log in to the Container registry |
||||
|
uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1 |
||||
|
with: |
||||
|
registry: ghcr.io |
||||
|
username: ${{ github.actor }} |
||||
|
password: ${{ secrets.GITHUB_TOKEN }} |
||||
|
|
||||
|
- name: Extract metadata (tags, labels) for Docker |
||||
|
id: meta |
||||
|
uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7 |
||||
|
with: |
||||
|
images: | |
||||
|
${{ steps.dockerhub.outputs.enabled == 'true' && 'mintplexlabs/anythingllm' || '' }} |
||||
|
ghcr.io/${{ github.repository }} |
||||
|
tags: | |
||||
|
type=semver,pattern={{version}} |
||||
|
type=semver,pattern={{major}}.{{minor}} |
||||
|
|
||||
|
- name: Build and push multi-platform Docker image |
||||
|
uses: docker/build-push-action@v6 |
||||
|
with: |
||||
|
context: . |
||||
|
file: ./docker/Dockerfile |
||||
|
push: true |
||||
|
sbom: true |
||||
|
provenance: mode=max |
||||
|
platforms: linux/amd64,linux/arm64 |
||||
|
tags: ${{ steps.meta.outputs.tags }} |
||||
|
labels: ${{ steps.meta.outputs.labels }} |
||||
|
cache-from: type=gha |
||||
|
cache-to: type=gha,mode=max |
||||
|
|
||||
|
# For Docker scout there are some intermediary reported CVEs which exists outside |
||||
|
# of execution content or are unreachable by an attacker but exist in image. |
||||
|
# We create VEX files for these so they don't show in scout summary. |
||||
|
- name: Collect known and verified CVE exceptions |
||||
|
id: cve-list |
||||
|
run: | |
||||
|
# Collect CVEs from filenames in vex folder |
||||
|
CVE_NAMES="" |
||||
|
for file in ./docker/vex/*.vex.json; do |
||||
|
[ -e "$file" ] || continue |
||||
|
filename=$(basename "$file") |
||||
|
stripped_filename=${filename%.vex.json} |
||||
|
CVE_NAMES+=" $stripped_filename" |
||||
|
done |
||||
|
echo "CVE_EXCEPTIONS=$CVE_NAMES" >> $GITHUB_OUTPUT |
||||
|
shell: bash |
||||
|
|
||||
|
# About VEX attestations https://docs.docker.com/scout/explore/exceptions/ |
||||
|
# Justifications https://github.com/openvex/spec/blob/main/OPENVEX-SPEC.md#status-justifications |
||||
|
- name: Add VEX attestations |
||||
|
env: |
||||
|
CVE_EXCEPTIONS: ${{ steps.cve-list.outputs.CVE_EXCEPTIONS }} |
||||
|
run: | |
||||
|
echo $CVE_EXCEPTIONS |
||||
|
curl -sSfL https://raw.githubusercontent.com/docker/scout-cli/main/install.sh | sh -s -- |
||||
|
for cve in $CVE_EXCEPTIONS; do |
||||
|
for tag in "${{ join(fromJSON(steps.meta.outputs.json).tags, ' ') }}"; do |
||||
|
echo "Attaching VEX exception $cve to $tag" |
||||
|
docker scout attestation add \ |
||||
|
--file "./docker/vex/$cve.vex.json" \ |
||||
|
--predicate-type https://openvex.dev/ns/v0.2.0 \ |
||||
|
$tag |
||||
|
done |
||||
|
done |
||||
|
shell: bash |
||||
.github/workflows/build-and-push-image.yaml
@@ -0,0 +1,134 @@
# This GitHub action is for publishing of the primary image for AnythingLLM |
||||
|
# It will publish a linux/amd64 and linux/arm64 image at the same time |
||||
|
# This file should ONLY BE USED FOR `master` BRANCH. |
||||
|
# TODO: GitHub now has an ubuntu-24.04-arm64 runner, but we still need |
||||
|
# to use QEMU to build the arm64 image because Chromium is not available for Linux arm64 |
||||
|
# so builds will still fail, or fail much more often. It's inconsistent and frustrating. |
||||
|
name: Publish AnythingLLM Primary Docker image (amd64/arm64) |
||||
|
|
||||
|
concurrency: |
||||
|
group: build-${{ github.ref }} |
||||
|
cancel-in-progress: true |
||||
|
|
||||
|
on: |
||||
|
push: |
||||
|
branches: ['master'] # master branch only. Do not modify. |
||||
|
paths-ignore: |
||||
|
- '**.md' |
||||
|
- 'cloud-deployments/**/*' |
||||
|
- 'images/**/*' |
||||
|
- '.vscode/**/*' |
||||
|
- '**/.env.example' |
||||
|
- '.github/ISSUE_TEMPLATE/**/*' |
||||
|
- '.devcontainer/**/*' |
||||
|
- 'embed/**/*' # Embed is submodule |
||||
|
- 'browser-extension/**/*' # Chrome extension is submodule |
||||
|
- 'server/utils/agents/aibitat/example/**/*' # Do not push new image for local dev testing of new aibitat images. |
||||
|
|
||||
|
jobs: |
||||
|
push_multi_platform_to_registries: |
||||
|
name: Push Docker multi-platform image to multiple registries |
||||
|
runs-on: ubuntu-latest |
||||
|
permissions: |
||||
|
packages: write |
||||
|
contents: read |
||||
|
steps: |
||||
|
- name: Check out the repo |
||||
|
uses: actions/checkout@v4 |
||||
|
|
||||
|
- name: Check if DockerHub build needed |
||||
|
shell: bash |
||||
|
run: | |
||||
|
# Check if the secret for USERNAME is set (don't even check for the password) |
||||
|
if [[ -z "${{ secrets.DOCKER_USERNAME }}" ]]; then |
||||
|
echo "DockerHub build not needed" |
||||
|
echo "enabled=false" >> $GITHUB_OUTPUT |
||||
|
else |
||||
|
echo "DockerHub build needed" |
||||
|
echo "enabled=true" >> $GITHUB_OUTPUT |
||||
|
fi |
||||
|
id: dockerhub |
||||
|
|
||||
|
- name: Set up QEMU |
||||
|
uses: docker/setup-qemu-action@v3 |
||||
|
|
||||
|
- name: Set up Docker Buildx |
||||
|
uses: docker/setup-buildx-action@v3 |
||||
|
|
||||
|
- name: Log in to Docker Hub |
||||
|
uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a |
||||
|
# Only login to the Docker Hub if the repo is mintplex/anythingllm, to allow for forks to build on GHCR |
||||
|
if: steps.dockerhub.outputs.enabled == 'true' |
||||
|
with: |
||||
|
username: ${{ secrets.DOCKER_USERNAME }} |
||||
|
password: ${{ secrets.DOCKER_PASSWORD }} |
||||
|
|
||||
|
- name: Log in to the Container registry |
||||
|
uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1 |
||||
|
with: |
||||
|
registry: ghcr.io |
||||
|
username: ${{ github.actor }} |
||||
|
password: ${{ secrets.GITHUB_TOKEN }} |
||||
|
|
||||
|
- name: Extract metadata (tags, labels) for Docker |
||||
|
id: meta |
||||
|
uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7 |
||||
|
with: |
||||
|
images: | |
||||
|
${{ steps.dockerhub.outputs.enabled == 'true' && 'mintplexlabs/anythingllm' || '' }} |
||||
|
ghcr.io/${{ github.repository }} |
||||
|
tags: | |
||||
|
type=raw,value=latest,enable={{is_default_branch}} |
||||
|
type=ref,event=branch |
||||
|
type=ref,event=tag |
||||
|
type=ref,event=pr |
||||
|
|
||||
|
- name: Build and push multi-platform Docker image |
||||
|
uses: docker/build-push-action@v6 |
||||
|
with: |
||||
|
context: . |
||||
|
file: ./docker/Dockerfile |
||||
|
push: true |
||||
|
sbom: true |
||||
|
provenance: mode=max |
||||
|
platforms: linux/amd64,linux/arm64 |
||||
|
tags: ${{ steps.meta.outputs.tags }} |
||||
|
labels: ${{ steps.meta.outputs.labels }} |
||||
|
cache-from: type=gha |
||||
|
cache-to: type=gha,mode=max |
||||
|
|
||||
|
# For Docker scout there are some intermediary reported CVEs which exists outside |
||||
|
# of execution content or are unreachable by an attacker but exist in image. |
||||
|
# We create VEX files for these so they don't show in scout summary. |
||||
|
- name: Collect known and verified CVE exceptions |
||||
|
id: cve-list |
||||
|
run: | |
||||
|
# Collect CVEs from filenames in vex folder |
||||
|
CVE_NAMES="" |
||||
|
for file in ./docker/vex/*.vex.json; do |
||||
|
[ -e "$file" ] || continue |
||||
|
filename=$(basename "$file") |
||||
|
stripped_filename=${filename%.vex.json} |
||||
|
CVE_NAMES+=" $stripped_filename" |
||||
|
done |
||||
|
echo "CVE_EXCEPTIONS=$CVE_NAMES" >> $GITHUB_OUTPUT |
||||
|
shell: bash |
||||
|
|
||||
|
# About VEX attestations https://docs.docker.com/scout/explore/exceptions/ |
||||
|
# Justifications https://github.com/openvex/spec/blob/main/OPENVEX-SPEC.md#status-justifications |
||||
|
- name: Add VEX attestations |
||||
|
env: |
||||
|
CVE_EXCEPTIONS: ${{ steps.cve-list.outputs.CVE_EXCEPTIONS }} |
||||
|
run: | |
||||
|
echo $CVE_EXCEPTIONS |
||||
|
curl -sSfL https://raw.githubusercontent.com/docker/scout-cli/main/install.sh | sh -s -- |
||||
|
for cve in $CVE_EXCEPTIONS; do |
||||
|
for tag in "${{ join(fromJSON(steps.meta.outputs.json).tags, ' ') }}"; do |
||||
|
echo "Attaching VEX exception $cve to $tag" |
||||
|
docker scout attestation add \ |
||||
|
--file "./docker/vex/$cve.vex.json" \ |
||||
|
--predicate-type https://openvex.dev/ns/v0.2.0 \ |
||||
|
$tag |
||||
|
done |
||||
|
done |
||||
|
shell: bash |
||||
.github/workflows/check-translations.yaml
@@ -0,0 +1,37 @@
# This GitHub action is for validation of all languages which translations are offered for |
||||
|
# in the locales folder in `frontend/src`. All languages are compared to the EN translation |
||||
|
# schema since that is the fallback language setting. This workflow will run on all PRs that |
||||
|
# modify any files in the translation directory |
||||
|
name: Verify translations files |
||||
|
|
||||
|
concurrency: |
||||
|
group: build-${{ github.ref }} |
||||
|
cancel-in-progress: true |
||||
|
|
||||
|
on: |
||||
|
pull_request: |
||||
|
types: [opened, synchronize, reopened] |
||||
|
paths: |
||||
|
- "frontend/src/locales/**.js" |
||||
|
|
||||
|
jobs: |
||||
|
run-script: |
||||
|
runs-on: ubuntu-latest |
||||
|
|
||||
|
steps: |
||||
|
- name: Checkout repository |
||||
|
uses: actions/checkout@v2 |
||||
|
|
||||
|
- name: Set up Node.js |
||||
|
uses: actions/setup-node@v3 |
||||
|
with: |
||||
|
node-version: '18' |
||||
|
|
||||
|
- name: Run verifyTranslations.mjs script |
||||
|
run: | |
||||
|
cd frontend/src/locales |
||||
|
node verifyTranslations.mjs |
||||
|
|
||||
|
- name: Fail job on error |
||||
|
if: failure() |
||||
|
run: exit 1 |
||||
.github/workflows/dev-build.yaml
@@ -0,0 +1,114 @@
name: AnythingLLM Development Docker image (amd64) |
||||
|
|
||||
|
concurrency: |
||||
|
group: build-${{ github.ref }} |
||||
|
cancel-in-progress: true |
||||
|
|
||||
|
on: |
||||
|
push: |
||||
|
branches: ['sharp-pdf-image-converter'] # put your current branch to create a build. Core team only. |
||||
|
paths-ignore: |
||||
|
- '**.md' |
||||
|
- 'cloud-deployments/*' |
||||
|
- 'images/**/*' |
||||
|
- '.vscode/**/*' |
||||
|
- '**/.env.example' |
||||
|
- '.github/ISSUE_TEMPLATE/**/*' |
||||
|
- 'embed/**/*' # Embed should be published to frontend (yarn build:publish) if any changes are introduced |
||||
|
- 'server/utils/agents/aibitat/example/**/*' # Do not push new image for local dev testing of new aibitat images. |
||||
|
|
||||
|
jobs: |
||||
|
push_multi_platform_to_registries: |
||||
|
name: Push Docker multi-platform image to multiple registries |
||||
|
runs-on: ubuntu-latest |
||||
|
permissions: |
||||
|
packages: write |
||||
|
contents: read |
||||
|
steps: |
||||
|
- name: Check out the repo |
||||
|
uses: actions/checkout@v4 |
||||
|
|
||||
|
- name: Check if DockerHub build needed |
||||
|
shell: bash |
||||
|
run: | |
||||
|
# Check if the secret for USERNAME is set (don't even check for the password) |
||||
|
if [[ -z "${{ secrets.DOCKER_USERNAME }}" ]]; then |
||||
|
echo "DockerHub build not needed" |
||||
|
echo "enabled=false" >> $GITHUB_OUTPUT |
||||
|
else |
||||
|
echo "DockerHub build needed" |
||||
|
echo "enabled=true" >> $GITHUB_OUTPUT |
||||
|
fi |
||||
|
id: dockerhub |
||||
|
|
||||
|
- name: Set up Docker Buildx |
||||
|
uses: docker/setup-buildx-action@v3 |
||||
|
|
||||
|
- name: Log in to Docker Hub |
||||
|
uses: docker/login-action@f4ef78c080cd8ba55a85445d5b36e214a81df20a |
||||
|
# Only login to the Docker Hub if the repo is mintplex/anythingllm, to allow for forks to build on GHCR |
||||
|
if: steps.dockerhub.outputs.enabled == 'true' |
||||
|
with: |
||||
|
username: ${{ secrets.DOCKER_USERNAME }} |
||||
|
password: ${{ secrets.DOCKER_PASSWORD }} |
||||
|
|
||||
|
- name: Extract metadata (tags, labels) for Docker |
||||
|
id: meta |
||||
|
uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7 |
||||
|
with: |
||||
|
images: | |
||||
|
${{ steps.dockerhub.outputs.enabled == 'true' && 'mintplexlabs/anythingllm' || '' }} |
||||
|
tags: | |
||||
|
type=raw,value=dev |
||||
|
|
||||
|
- name: Build and push multi-platform Docker image |
||||
|
uses: docker/build-push-action@v6 |
||||
|
with: |
||||
|
context: . |
||||
|
file: ./docker/Dockerfile |
||||
|
push: true |
||||
|
sbom: true |
||||
|
provenance: mode=max |
||||
|
platforms: linux/amd64 |
||||
|
tags: ${{ steps.meta.outputs.tags }} |
||||
|
labels: ${{ steps.meta.outputs.labels }} |
||||
|
cache-from: type=gha |
||||
|
cache-to: type=gha,mode=max |
||||
|
|
||||
|
# For Docker scout there are some intermediary reported CVEs which exists outside |
||||
|
# of execution content or are unreachable by an attacker but exist in image. |
||||
|
# We create VEX files for these so they don't show in scout summary. |
||||
|
- name: Collect known and verified CVE exceptions |
||||
|
id: cve-list |
||||
|
run: | |
||||
|
# Collect CVEs from filenames in vex folder |
||||
|
CVE_NAMES="" |
||||
|
for file in ./docker/vex/*.vex.json; do |
||||
|
[ -e "$file" ] || continue |
||||
|
filename=$(basename "$file") |
||||
|
stripped_filename=${filename%.vex.json} |
||||
|
CVE_NAMES+=" $stripped_filename" |
||||
|
done |
||||
|
echo "CVE_EXCEPTIONS=$CVE_NAMES" >> $GITHUB_OUTPUT |
||||
|
shell: bash |
||||
|
|
||||
|
# About VEX attestations https://docs.docker.com/scout/explore/exceptions/ |
||||
|
# Justifications https://github.com/openvex/spec/blob/main/OPENVEX-SPEC.md#status-justifications |
||||
|
# Fixed to use v1.15.1 of scout-cli as v1.16.0 install script is broken |
||||
|
# https://github.com/docker/scout-cli |
||||
|
- name: Add VEX attestations |
||||
|
env: |
||||
|
CVE_EXCEPTIONS: ${{ steps.cve-list.outputs.CVE_EXCEPTIONS }} |
||||
|
run: | |
||||
|
echo $CVE_EXCEPTIONS |
||||
|
curl -sSfL https://raw.githubusercontent.com/docker/scout-cli/main/install.sh | sh -s -- |
||||
|
for cve in $CVE_EXCEPTIONS; do |
||||
|
for tag in "${{ join(fromJSON(steps.meta.outputs.json).tags, ' ') }}"; do |
||||
|
echo "Attaching VEX exception $cve to $tag" |
||||
|
docker scout attestation add \ |
||||
|
--file "./docker/vex/$cve.vex.json" \ |
||||
|
--predicate-type https://openvex.dev/ns/v0.2.0 \ |
||||
|
$tag |
||||
|
done |
||||
|
done |
||||
|
shell: bash |
||||
.gitignore
@@ -0,0 +1,11 @@
v-env
.env
!.env.example

node_modules
__pycache__
v-env
.DS_Store
aws_cf_deploy_anything_llm.json
yarn.lock
*.bak
.gitmodules
@@ -0,0 +1,7 @@
[submodule "browser-extension"]
  path = browser-extension
  url = git@github.com:Mintplex-Labs/anythingllm-extension.git
[submodule "embed"]
  path = embed
  url = git@github.com:Mintplex-Labs/anythingllm-embed.git
  branch = main
.hadolint.yaml
@@ -0,0 +1,8 @@
failure-threshold: warning
ignored:
  - DL3008
  - DL3013
format: tty
trustedRegistries:
  - docker.io
  - gcr.io
.idea/.gitignore
@@ -0,0 +1,8 @@
# Default ignored files
/shelf/
/workspace.xml
# Editor-based HTTP Client requests
/httpRequests/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml
.idea/anything-llm-master.iml
@@ -0,0 +1,9 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="JAVA_MODULE" version="4">
  <component name="NewModuleRootManager" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$" />
    <orderEntry type="inheritedJdk" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
.idea/codeStyles/Project.xml
@@ -0,0 +1,58 @@
<component name="ProjectCodeStyleConfiguration"> |
||||
|
<code_scheme name="Project" version="173"> |
||||
|
<option name="LINE_SEPARATOR" value=" " /> |
||||
|
<HTMLCodeStyleSettings> |
||||
|
<option name="HTML_SPACE_INSIDE_EMPTY_TAG" value="true" /> |
||||
|
</HTMLCodeStyleSettings> |
||||
|
<JSCodeStyleSettings version="0"> |
||||
|
<option name="FORCE_SEMICOLON_STYLE" value="true" /> |
||||
|
<option name="SPACE_BEFORE_FUNCTION_LEFT_PARENTH" value="false" /> |
||||
|
<option name="FORCE_QUOTE_STYlE" value="true" /> |
||||
|
<option name="ENFORCE_TRAILING_COMMA" value="WhenMultiline" /> |
||||
|
<option name="SPACES_WITHIN_OBJECT_LITERAL_BRACES" value="true" /> |
||||
|
<option name="SPACES_WITHIN_IMPORTS" value="true" /> |
||||
|
</JSCodeStyleSettings> |
||||
|
<TypeScriptCodeStyleSettings version="0"> |
||||
|
<option name="FORCE_SEMICOLON_STYLE" value="true" /> |
||||
|
<option name="SPACE_BEFORE_FUNCTION_LEFT_PARENTH" value="false" /> |
||||
|
<option name="FORCE_QUOTE_STYlE" value="true" /> |
||||
|
<option name="ENFORCE_TRAILING_COMMA" value="WhenMultiline" /> |
||||
|
<option name="SPACES_WITHIN_OBJECT_LITERAL_BRACES" value="true" /> |
||||
|
<option name="SPACES_WITHIN_IMPORTS" value="true" /> |
||||
|
</TypeScriptCodeStyleSettings> |
||||
|
<VueCodeStyleSettings> |
||||
|
<option name="INTERPOLATION_NEW_LINE_AFTER_START_DELIMITER" value="false" /> |
||||
|
<option name="INTERPOLATION_NEW_LINE_BEFORE_END_DELIMITER" value="false" /> |
||||
|
</VueCodeStyleSettings> |
||||
|
<codeStyleSettings language="HTML"> |
||||
|
<option name="SOFT_MARGINS" value="80" /> |
||||
|
<indentOptions> |
||||
|
<option name="INDENT_SIZE" value="2" /> |
||||
|
<option name="CONTINUATION_INDENT_SIZE" value="2" /> |
||||
|
<option name="TAB_SIZE" value="2" /> |
||||
|
</indentOptions> |
||||
|
</codeStyleSettings> |
||||
|
<codeStyleSettings language="JavaScript"> |
||||
|
<option name="SOFT_MARGINS" value="80" /> |
||||
|
<indentOptions> |
||||
|
<option name="INDENT_SIZE" value="2" /> |
||||
|
<option name="CONTINUATION_INDENT_SIZE" value="2" /> |
||||
|
<option name="TAB_SIZE" value="2" /> |
||||
|
</indentOptions> |
||||
|
</codeStyleSettings> |
||||
|
<codeStyleSettings language="TypeScript"> |
||||
|
<option name="SOFT_MARGINS" value="80" /> |
||||
|
<indentOptions> |
||||
|
<option name="INDENT_SIZE" value="2" /> |
||||
|
<option name="CONTINUATION_INDENT_SIZE" value="2" /> |
||||
|
<option name="TAB_SIZE" value="2" /> |
||||
|
</indentOptions> |
||||
|
</codeStyleSettings> |
||||
|
<codeStyleSettings language="Vue"> |
||||
|
<option name="SOFT_MARGINS" value="80" /> |
||||
|
<indentOptions> |
||||
|
<option name="CONTINUATION_INDENT_SIZE" value="2" /> |
||||
|
</indentOptions> |
||||
|
</codeStyleSettings> |
||||
|
</code_scheme> |
||||
|
</component> |
||||
.idea/codeStyles/codeStyleConfig.xml
@@ -0,0 +1,5 @@
<component name="ProjectCodeStyleConfiguration">
  <state>
    <option name="USE_PER_PROJECT_SETTINGS" value="true" />
  </state>
</component>
.idea/inspectionProfiles/Project_Default.xml
@@ -0,0 +1,6 @@
<component name="InspectionProjectProfileManager">
  <profile version="1.0">
    <option name="myName" value="Project Default" />
    <inspection_tool class="Eslint" enabled="true" level="WARNING" enabled_by_default="true" />
  </profile>
</component>
.idea/misc.xml
@@ -0,0 +1,6 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="ProjectRootManager" version="2" languageLevel="JDK_21" default="true" project-jdk-name="corretto-21" project-jdk-type="JavaSDK">
    <output url="file://$PROJECT_DIR$/out" />
  </component>
</project>
.idea/modules.xml
@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
  <component name="ProjectModuleManager">
    <modules>
      <module fileurl="file://$PROJECT_DIR$/.idea/anything-llm-master.iml" filepath="$PROJECT_DIR$/.idea/anything-llm-master.iml" />
    </modules>
  </component>
</project>
.nvmrc
@@ -0,0 +1 @@
v18.18.0
.prettierignore
@@ -0,0 +1,16 @@
# defaults
**/.git
**/.svn
**/.hg
**/node_modules

#frontend
frontend/bundleinspector.html
**/dist

#server
server/swagger/openapi.json

#embed
**/static/**
embed/src/utils/chat/hljs.js
.prettierrc
@@ -0,0 +1,38 @@
{ |
||||
|
"tabWidth": 2, |
||||
|
"useTabs": false, |
||||
|
"endOfLine": "lf", |
||||
|
"semi": true, |
||||
|
"singleQuote": false, |
||||
|
"printWidth": 80, |
||||
|
"trailingComma": "es5", |
||||
|
"bracketSpacing": true, |
||||
|
"bracketSameLine": false, |
||||
|
"overrides": [ |
||||
|
{ |
||||
|
"files": ["*.js", "*.mjs", "*.jsx"], |
||||
|
"options": { |
||||
|
"parser": "flow", |
||||
|
"arrowParens": "always" |
||||
|
} |
||||
|
}, |
||||
|
{ |
||||
|
"files": ["*.config.js"], |
||||
|
"options": { |
||||
|
"semi": false, |
||||
|
"parser": "flow", |
||||
|
"trailingComma": "none" |
||||
|
} |
||||
|
}, |
||||
|
{ |
||||
|
"files": "*.html", |
||||
|
"options": { |
||||
|
"bracketSameLine": true |
||||
|
} |
||||
|
}, |
||||
|
{ |
||||
|
"files": ".prettierrc", |
||||
|
"options": { "parser": "json" } |
||||
|
} |
||||
|
] |
||||
|
} |
||||
.vscode/launch.json
@@ -0,0 +1,74 @@
{ |
||||
|
// Use IntelliSense to learn about possible attributes. |
||||
|
// Hover to view descriptions of existing attributes. |
||||
|
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387 |
||||
|
"version": "0.2.0", |
||||
|
"configurations": [ |
||||
|
{ |
||||
|
"name": "Collector debug", |
||||
|
"request": "launch", |
||||
|
"cwd": "${workspaceFolder}/collector", |
||||
|
"env": { |
||||
|
"NODE_ENV": "development" |
||||
|
}, |
||||
|
"runtimeArgs": [ |
||||
|
"index.js" |
||||
|
], |
||||
|
// not using yarn/nodemon because it doesn't work with breakpoints |
||||
|
// "runtimeExecutable": "yarn", |
||||
|
"skipFiles": [ |
||||
|
"<node_internals>/**" |
||||
|
], |
||||
|
"type": "node" |
||||
|
}, |
||||
|
{ |
||||
|
"name": "Server debug", |
||||
|
"request": "launch", |
||||
|
"cwd": "${workspaceFolder}/server", |
||||
|
"env": { |
||||
|
"NODE_ENV": "development" |
||||
|
}, |
||||
|
"runtimeArgs": [ |
||||
|
"index.js" |
||||
|
], |
||||
|
// not using yarn/nodemon because it doesn't work with breakpoints |
||||
|
// "runtimeExecutable": "yarn", |
||||
|
"skipFiles": [ |
||||
|
"<node_internals>/**" |
||||
|
], |
||||
|
"type": "node" |
||||
|
}, |
||||
|
{ |
||||
|
"name": "Frontend debug", |
||||
|
"request": "launch", |
||||
|
"cwd": "${workspaceFolder}/frontend", |
||||
|
"env": { |
||||
|
"NODE_ENV": "development", |
||||
|
}, |
||||
|
"runtimeExecutable": "${workspaceFolder}/frontend/node_modules/.bin/vite", |
||||
|
"runtimeArgs": [ |
||||
|
"--debug", |
||||
|
"--host=0.0.0.0" |
||||
|
], |
||||
|
// "runtimeExecutable": "yarn", |
||||
|
"skipFiles": [ |
||||
|
"<node_internals>/**" |
||||
|
], |
||||
|
"type": "node" |
||||
|
}, |
||||
|
{ |
||||
|
"name": "Launch Edge", |
||||
|
"request": "launch", |
||||
|
"type": "msedge", |
||||
|
"url": "http://localhost:3000", |
||||
|
"webRoot": "${workspaceFolder}" |
||||
|
}, |
||||
|
{ |
||||
|
"type": "chrome", |
||||
|
"request": "launch", |
||||
|
"name": "Launch Chrome against localhost", |
||||
|
"url": "http://localhost:3000", |
||||
|
"webRoot": "${workspaceFolder}" |
||||
|
} |
||||
|
] |
||||
|
} |
||||
.vscode/settings.json
@@ -0,0 +1,62 @@
{ |
||||
|
"cSpell.words": [ |
||||
|
"adoc", |
||||
|
"aibitat", |
||||
|
"AIbitat", |
||||
|
"allm", |
||||
|
"anythingllm", |
||||
|
"Apipie", |
||||
|
"Astra", |
||||
|
"Chartable", |
||||
|
"cleancss", |
||||
|
"comkey", |
||||
|
"cooldown", |
||||
|
"cooldowns", |
||||
|
"datafile", |
||||
|
"Deduplicator", |
||||
|
"Dockerized", |
||||
|
"docpath", |
||||
|
"elevenlabs", |
||||
|
"Embeddable", |
||||
|
"epub", |
||||
|
"fireworksai", |
||||
|
"GROQ", |
||||
|
"hljs", |
||||
|
"huggingface", |
||||
|
"inferencing", |
||||
|
"koboldcpp", |
||||
|
"Langchain", |
||||
|
"lmstudio", |
||||
|
"localai", |
||||
|
"mbox", |
||||
|
"Milvus", |
||||
|
"Mintplex", |
||||
|
"mixtral", |
||||
|
"moderations", |
||||
|
"novita", |
||||
|
"numpages", |
||||
|
"Ollama", |
||||
|
"Oobabooga", |
||||
|
"openai", |
||||
|
"opendocument", |
||||
|
"openrouter", |
||||
|
"pagerender", |
||||
|
"Qdrant", |
||||
|
"royalblue", |
||||
|
"SearchApi", |
||||
|
"searxng", |
||||
|
"Serper", |
||||
|
"Serply", |
||||
|
"streamable", |
||||
|
"textgenwebui", |
||||
|
"togetherai", |
||||
|
"Unembed", |
||||
|
"uuidv", |
||||
|
"vectordbs", |
||||
|
"Weaviate", |
||||
|
"XAILLM", |
||||
|
"Zilliz" |
||||
|
], |
||||
|
"eslint.experimental.useFlatConfig": true, |
||||
|
"docker.languageserver.formatter.ignoreMultilineInstructions": true |
||||
|
} |
||||
.vscode/tasks.json
@@ -0,0 +1,94 @@
{ |
||||
|
// See https://go.microsoft.com/fwlink/?LinkId=733558 |
||||
|
// for the documentation about the tasks.json format |
||||
|
"version": "2.0.0", |
||||
|
"tasks": [ |
||||
|
{ |
||||
|
"type": "shell", |
||||
|
"options": { |
||||
|
"cwd": "${workspaceFolder}/collector", |
||||
|
"statusbar": { |
||||
|
"color": "#ffea00", |
||||
|
"detail": "Runs the collector", |
||||
|
"label": "Collector: $(play) run", |
||||
|
"running": { |
||||
|
"color": "#ffea00", |
||||
|
"label": "Collector: $(gear~spin) running" |
||||
|
} |
||||
|
} |
||||
|
}, |
||||
|
"command": "cd ${workspaceFolder}/collector/ && yarn dev", |
||||
|
"runOptions": { |
||||
|
"instanceLimit": 1, |
||||
|
"reevaluateOnRerun": true |
||||
|
}, |
||||
|
"presentation": { |
||||
|
"echo": true, |
||||
|
"reveal": "always", |
||||
|
"focus": false, |
||||
|
"panel": "shared", |
||||
|
"showReuseMessage": true, |
||||
|
"clear": false |
||||
|
}, |
||||
|
"label": "Collector: run" |
||||
|
}, |
||||
|
{ |
||||
|
"type": "shell", |
||||
|
"options": { |
||||
|
"cwd": "${workspaceFolder}/server", |
||||
|
"statusbar": { |
||||
|
"color": "#ffea00", |
||||
|
"detail": "Runs the server", |
||||
|
"label": "Server: $(play) run", |
||||
|
"running": { |
||||
|
"color": "#ffea00", |
||||
|
"label": "Server: $(gear~spin) running" |
||||
|
} |
||||
|
} |
||||
|
}, |
||||
|
"command": "if [ \"${CODESPACES}\" = \"true\" ]; then while ! gh codespace ports -c $CODESPACE_NAME | grep 3001; do sleep 1; done; gh codespace ports visibility 3001:public -c $CODESPACE_NAME; fi & cd ${workspaceFolder}/server/ && yarn dev", |
||||
|
"runOptions": { |
||||
|
"instanceLimit": 1, |
||||
|
"reevaluateOnRerun": true |
||||
|
}, |
||||
|
"presentation": { |
||||
|
"echo": true, |
||||
|
"reveal": "always", |
||||
|
"focus": false, |
||||
|
"panel": "shared", |
||||
|
"showReuseMessage": true, |
||||
|
"clear": false |
||||
|
}, |
||||
|
"label": "Server: run" |
||||
|
}, |
||||
|
{ |
||||
|
"type": "shell", |
||||
|
"options": { |
||||
|
"cwd": "${workspaceFolder}/frontend", |
||||
|
"statusbar": { |
||||
|
"color": "#ffea00", |
||||
|
"detail": "Runs the frontend", |
||||
|
"label": "Frontend: $(play) run", |
||||
|
"running": { |
||||
|
"color": "#ffea00", |
||||
|
"label": "Frontend: $(gear~spin) running" |
||||
|
} |
||||
|
} |
||||
|
}, |
||||
|
"command": "cd ${workspaceFolder}/frontend/ && yarn dev", |
||||
|
"runOptions": { |
||||
|
"instanceLimit": 1, |
||||
|
"reevaluateOnRerun": true |
||||
|
}, |
||||
|
"presentation": { |
||||
|
"echo": true, |
||||
|
"reveal": "always", |
||||
|
"focus": false, |
||||
|
"panel": "shared", |
||||
|
"showReuseMessage": true, |
||||
|
"clear": false |
||||
|
}, |
||||
|
"label": "Frontend: run" |
||||
|
} |
||||
|
] |
||||
|
} |
||||
BARE_METAL.md
@@ -0,0 +1,115 @@
# Run AnythingLLM in production without Docker

> [!WARNING]
> This method of deployment is **not supported** by the core team and is to be used as a reference for your own deployment.
> You are fully responsible for securing your deployment and data in this mode.
> **Any issues** experienced from bare-metal or non-containerized deployments will **not** be answered or supported.

Here you can find the scripts and known working process to run AnythingLLM outside of a Docker container.

### Minimum Requirements
> [!TIP]
> You should aim for at least 2GB of RAM. Disk storage is proportional to how much data
> you will be storing (documents, vectors, models, etc). A minimum of 10GB is recommended.

- NodeJS v18
- Yarn

## Getting started

1. Clone the repo into your server as the user the application will run as.
   `git clone git@github.com:Mintplex-Labs/anything-llm.git`

2. `cd anything-llm` and run `yarn setup`. This will install all dependencies needed to run in production as well as to debug the application.

3. `cp server/.env.example server/.env` to create the basic ENV file from which instance settings will be read on service start.

4. Ensure that the `server/.env` file has _at least_ these keys to start. These values will persist and this file will be automatically written and managed after your first successful boot.

```
STORAGE_DIR="/your/absolute/path/to/server/storage"
```

5. Edit the `frontend/.env` file so that `VITE_API_BASE` is set to `/api`. The .env file documents which value you should use.

```
# VITE_API_BASE='http://localhost:3001/api' # Use this URL when developing locally
# VITE_API_BASE="https://$CODESPACE_NAME-3001.$GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN/api" # for GitHub Codespaces
VITE_API_BASE='/api' # Use this URL when deploying on a non-localhost address OR in Docker.
```

## To start the application

AnythingLLM is composed of three main sections: the `frontend`, `server`, and `collector`. When running in production you will run `server` and `collector` as two different processes, with a build step for compilation of the frontend.

1. Build the frontend application.
   `cd frontend && yarn build` - this will produce a `frontend/dist` folder that will be used later.

2. Copy `frontend/dist` to `server/public` - `cp -R frontend/dist server/public`.
   This should create a folder in `server` named `public` which contains a top-level `index.html` file and various other files/folders.

3. Migrate and prepare your database file.

```
cd server && npx prisma generate --schema=./prisma/schema.prisma
cd server && npx prisma migrate deploy --schema=./prisma/schema.prisma
```

4. Boot the server in production
   `cd server && NODE_ENV=production node index.js &`

5. Boot the collector in another process
   `cd collector && NODE_ENV=production node index.js &`

AnythingLLM should now be running on `http://localhost:3001`!
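
A quick way to confirm the server process came up before moving on (a minimal check, assuming the default port `3001`):

```shell
# Should return an HTTP status line once the server has booted
curl -I http://localhost:3001
```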
||||
|
|
||||
|
## Updating AnythingLLM |
||||
|
|
||||
|
To update AnythingLLM with future updates you can `git pull origin master` to pull in the latest code and then repeat steps 2 - 5 to deploy with all changes fully. |
||||
|
|
||||
|
_note_ You should ensure that each folder runs `yarn` again to ensure packages are up to date in case any dependencies were added, changed, or removed. |
||||
|
|
||||
|
_note_ You should `pkill node` before running an update so that you are not running multiple AnythingLLM processes on the same instance as this can cause conflicts. |
||||
|
|
||||
|
|
||||
|
### Example update script |
||||
|
|
||||
|
```shell |
||||
|
#!/bin/bash |
||||
|
|
||||
|
cd $HOME/anything-llm &&\ |
||||
|
git checkout . &&\ |
||||
|
git pull origin master &&\ |
||||
|
echo "HEAD pulled to commit $(git log -1 --pretty=format:"%h" | tail -n 1)" |
||||
|
|
||||
|
echo "Freezing current ENVs" |
||||
|
curl -I "http://localhost:3001/api/env-dump" | head -n 1|cut -d$' ' -f2 |
||||
|
|
||||
|
echo "Rebuilding Frontend" |
||||
|
cd $HOME/anything-llm/frontend && yarn && yarn build && cd $HOME/anything-llm |
||||
|
|
||||
|
echo "Copying to Sever Public" |
||||
|
rm -rf server/public |
||||
|
cp -r frontend/dist server/public |
||||
|
|
||||
|
echo "Killing node processes" |
||||
|
pkill node |
||||
|
|
||||
|
echo "Installing collector dependencies" |
||||
|
cd $HOME/anything-llm/collector && yarn |
||||
|
|
||||
|
echo "Installing server dependencies & running migrations" |
||||
|
cd $HOME/anything-llm/server && yarn |
||||
|
cd $HOME/anything-llm/server && npx prisma migrate deploy --schema=./prisma/schema.prisma |
||||
|
cd $HOME/anything-llm/server && npx prisma generate |
||||
|
|
||||
|
echo "Booting up services." |
||||
|
truncate -s 0 /logs/server.log # Or any other log file location. |
||||
|
truncate -s 0 /logs/collector.log |
||||
|
|
||||
|
cd $HOME/anything-llm/server |
||||
|
(NODE_ENV=production node index.js) &> /logs/server.log & |
||||
|
|
||||
|
cd $HOME/anything-llm/collector |
||||
|
(NODE_ENV=production node index.js) &> /logs/collector.log & |
||||
|
``` |
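Saved as, for example, `update.sh` (the filename is arbitrary), the script can be made executable and run directly:

```shell
chmod +x update.sh
./update.sh
```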
||||
|
|
||||
|
|
||||
@ -0,0 +1,21 @@ |
|||||
|
The MIT License |
||||
|
|
||||
|
Copyright (c) Mintplex Labs Inc. |
||||
|
|
||||
|
Permission is hereby granted, free of charge, to any person obtaining a copy |
||||
|
of this software and associated documentation files (the "Software"), to deal |
||||
|
in the Software without restriction, including without limitation the rights |
||||
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell |
||||
|
copies of the Software, and to permit persons to whom the Software is |
||||
|
furnished to do so, subject to the following conditions: |
||||
|
|
||||
|
The above copyright notice and this permission notice shall be included in |
||||
|
all copies or substantial portions of the Software. |
||||
|
|
||||
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR |
||||
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, |
||||
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE |
||||
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER |
||||
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, |
||||
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN |
||||
|
THE SOFTWARE. |
||||
@ -0,0 +1,268 @@ |
|||||
|
<a name="readme-top"></a> |
||||
|
|
||||
|
<p align="center"> |
||||
|
<a href="https://anythingllm.com"><img src="https://github.com/Mintplex-Labs/anything-llm/blob/master/images/wordmark.png?raw=true" alt="AnythingLLM logo"></a> |
||||
|
</p> |
||||
|
|
||||
|
<div align='center'> |
||||
|
<a href="https://trendshift.io/repositories/2415" target="_blank"><img src="https://trendshift.io/api/badge/repositories/2415" alt="Mintplex-Labs%2Fanything-llm | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> |
||||
|
</div> |
||||
|
|
||||
|
<p align="center"> |
||||
|
<b>AnythingLLM:</b> The all-in-one AI app you were looking for.<br /> |
||||
|
Chat with your docs, use AI Agents, hyper-configurable, multi-user, & no frustrating set up required. |
||||
|
</p> |
||||
|
|
||||
|
<p align="center"> |
||||
|
<a href="https://discord.gg/6UyHPeGZAC" target="_blank"> |
||||
|
<img src="https://img.shields.io/badge/chat-mintplex_labs-blue.svg?style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAACAAAAAgCAMAAABEpIrGAAAAIGNIUk0AAHomAACAhAAA+gAAAIDoAAB1MAAA6mAAADqYAAAXcJy6UTwAAAH1UExURQAAAP////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////r6+ubn5+7u7/3+/v39/enq6urq6/v7+97f39rb26eoqT1BQ0pOT4+Rkuzs7cnKykZKS0NHSHl8fdzd3ejo6UxPUUBDRdzc3RwgIh8jJSAkJm5xcvHx8aanqB4iJFBTVezt7V5hYlJVVuLj43p9fiImKCMnKZKUlaaoqSElJ21wcfT09O3u7uvr6zE0Nr6/wCUpK5qcnf7+/nh7fEdKTHx+f0tPUOTl5aipqiouMGtubz5CRDQ4OsTGxufn515hY7a3uH1/gXBydIOFhlVYWvX29qaoqCQoKs7Pz/Pz87/AwUtOUNfY2dHR0mhrbOvr7E5RUy8zNXR2d/f39+Xl5UZJSx0hIzQ3Odra2/z8/GlsbaGjpERHSezs7L/BwScrLTQ4Odna2zM3Obm7u3x/gKSmp9jZ2T1AQu/v71pdXkVISr2+vygsLiInKTg7PaOlpisvMcXGxzk8PldaXPLy8u7u7rm6u7S1tsDBwvj4+MPExbe4ueXm5s/Q0Kyf7ewAAAAodFJOUwAABClsrNjx/QM2l9/7lhmI6jTB/kA1GgKJN+nea6vy/MLZQYeVKK3rVA5tAAAAAWJLR0QB/wIt3gAAAAd0SU1FB+cKBAAmMZBHjXIAAAISSURBVDjLY2CAAkYmZhZWNnYODnY2VhZmJkYGVMDIycXNw6sBBbw8fFycyEoYGfkFBDVQgKAAPyMjQl5IWEQDDYgIC8FUMDKKsmlgAWyiEBWMjGJY5YEqxMAqGMWFNXAAYXGgAkYJSQ2cQFKCkYFRShq3AmkpRgYJbghbU0tbB0Tr6ukbgGhDI10gySfBwCwDUWBsYmpmDqQtLK2sbTQ0bO3sHYA8GWYGWWj4WTs6Obu4ami4OTm7exhqeHp5+4DCVJZBDmqdr7ufn3+ArkZgkJ+fU3CIRmgYWFiOARYGvo5OQUHhEUAFTkF+kVHRsLBgkIeyYmLjwoOc4hMSk5JTnINS06DC8gwcEEZ6RqZGlpOfc3ZObl5+gZ+TR2ERWFyBQQFMF5eklmqUpQb5+ReU61ZUOvkFVVXXQBSAraitq29o1GiKcfLzc29u0mjxBzq0tQ0kww5xZHtHUGeXhkZhdxBYgZ4d0LI6c4gjwd7siQQraOp1AivQ6CuAKZCDBBRQQQNQgUb/BGf3cqCCiZOcnCe3QQIKHNRTpk6bDgpZjRkzg3pBQTBrdtCcuZCgluAD0vPmL1gIdvSixUuWgqNs2YJ+DUhkEYxuggkGmOQUcckrioPTJCOXEnZ5JS5YslbGnuyVERlDDFvGEUPOWvwqaH6RVkHKeuDMK6SKnHlVhTgx8jeTmqy6Eij7K6nLqiGyPwChsa1MUrnq1wAAACV0RVh0ZGF0ZTpjcmVhdGUAMjAyMy0xMC0wNFQwMDozODo0OSswMDowMB9V0a8AAAAldEVYdGRhdGU6bW9kaWZ5ADIwMjMtMTAtMDRUMDA6Mzg6NDkrMDA6MDBuCGkTAAAAKHRFWHRkYXRlOnRpbWVzdGFtcAAyMDIzLTEwLTA0VDAwOjM4OjQ5KzAwOjAwOR1IzAAAAABJRU5ErkJggg==" alt="Discord"> |
||||
|
</a> | |
||||
|
<a href="https://github.com/Mintplex-Labs/anything-llm/blob/master/LICENSE" target="_blank"> |
||||
|
<img src="https://img.shields.io/static/v1?label=license&message=MIT&color=white" alt="License"> |
||||
|
</a> | |
||||
|
<a href="https://docs.anythingllm.com" target="_blank"> |
||||
|
Docs |
||||
|
</a> | |
||||
|
<a href="https://my.mintplexlabs.com/aio-checkout?product=anythingllm" target="_blank"> |
||||
|
Hosted Instance |
||||
|
</a> |
||||
|
</p> |
||||
|
|
||||
|
<p align="center"> |
||||
|
<b>English</b> · <a href='./locales/README.zh-CN.md'>简体中文</a> · <a href='./locales/README.ja-JP.md'>日本語</a> |
||||
|
</p> |
||||
|
|
||||
|
<p align="center"> |
||||
|
👉 AnythingLLM for desktop (Mac, Windows, & Linux)! <a href="https://anythingllm.com/download" target="_blank"> Download Now</a> |
||||
|
</p> |
||||
|
|
||||
|
A full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting. This application allows you to pick and choose which LLM or Vector Database you want to use as well as supporting multi-user management and permissions. |
||||
|
|
||||
|
 |
||||
|
|
||||
|
<details> |
||||
|
<summary><kbd>Watch the demo!</kbd></summary> |
||||
|
|
||||
|
[](https://youtu.be/f95rGD9trL0) |
||||
|
|
||||
|
</details> |
||||
|
|
||||
|
### Product Overview |
||||
|
|
||||
|
AnythingLLM is a full-stack application where you can use commercial off-the-shelf LLMs or popular open source LLMs and vectorDB solutions to build a private ChatGPT with no compromises that you can run locally or host remotely, and chat intelligently with any documents you provide it. |
||||
|
|
||||
|
AnythingLLM divides your documents into objects called `workspaces`. A Workspace functions a lot like a thread, but with the addition of containerization of your documents. Workspaces can share documents, but they do not talk to each other so you can keep your context for each workspace clean. |
||||
|
|
||||
|
## Cool features of AnythingLLM |
||||
|
|
||||
|
- 🆕 [**Custom AI Agents**](https://docs.anythingllm.com/agent/custom/introduction) |
||||
|
- 🆕 [**No-code AI Agent builder**](https://docs.anythingllm.com/agent-flows/overview) |
||||
|
- 🖼️ **Multi-modal support (both closed and open-source LLMs!)** |
||||
|
- 👤 Multi-user instance support and permissioning _Docker version only_ |
||||
|
- 🦾 Agents inside your workspace (browse the web, etc) |
||||
|
- 💬 [Custom Embeddable Chat widget for your website](https://github.com/Mintplex-Labs/anythingllm-embed/blob/main/README.md) _Docker version only_ |
||||
|
- 📖 Multiple document type support (PDF, TXT, DOCX, etc) |
||||
|
- Simple chat UI with Drag-n-Drop functionality and clear citations. |
||||
|
- 100% Cloud deployment ready. |
||||
|
- Works with all popular [closed and open-source LLM providers](#supported-llms-embedder-models-speech-models-and-vector-databases). |
||||
|
- Built-in cost & time-saving measures for managing very large documents compared to any other chat UI. |
||||
|
- Full Developer API for custom integrations! |
||||
|
- Much more...install and find out! |
||||
|
|
||||
|
### Supported LLMs, Embedder Models, Speech models, and Vector Databases |
||||
|
|
||||
|
**Large Language Models (LLMs):** |
||||
|
|
||||
|
- [Any open-source llama.cpp compatible model](/server/storage/models/README.md#text-generation-llm-selection) |
||||
|
- [OpenAI](https://openai.com) |
||||
|
- [OpenAI (Generic)](https://openai.com) |
||||
|
- [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) |
||||
|
- [AWS Bedrock](https://aws.amazon.com/bedrock/) |
||||
|
- [Anthropic](https://www.anthropic.com/) |
||||
|
- [NVIDIA NIM (chat models)](https://build.nvidia.com/explore/discover) |
||||
|
- [Google Gemini Pro](https://ai.google.dev/) |
||||
|
- [Hugging Face (chat models)](https://huggingface.co/) |
||||
|
- [Ollama (chat models)](https://ollama.ai/) |
||||
|
- [LM Studio (all models)](https://lmstudio.ai) |
||||
|
- [LocalAi (all models)](https://localai.io/) |
||||
|
- [Together AI (chat models)](https://www.together.ai/) |
||||
|
- [Fireworks AI (chat models)](https://fireworks.ai/) |
||||
|
- [Perplexity (chat models)](https://www.perplexity.ai/) |
||||
|
- [OpenRouter (chat models)](https://openrouter.ai/) |
||||
|
- [DeepSeek (chat models)](https://deepseek.com/) |
||||
|
- [Mistral](https://mistral.ai/) |
||||
|
- [Groq](https://groq.com/) |
||||
|
- [Cohere](https://cohere.com/) |
||||
|
- [KoboldCPP](https://github.com/LostRuins/koboldcpp) |
||||
|
- [LiteLLM](https://github.com/BerriAI/litellm) |
||||
|
- [Text Generation Web UI](https://github.com/oobabooga/text-generation-webui) |
||||
|
- [Apipie](https://apipie.ai/) |
||||
|
- [xAI](https://x.ai/) |
||||
|
- [Novita AI (chat models)](https://novita.ai/model-api/product/llm-api?utm_source=github_anything-llm&utm_medium=github_readme&utm_campaign=link) |
||||
|
|
||||
|
**Embedder models:** |
||||
|
|
||||
|
- [AnythingLLM Native Embedder](/server/storage/models/README.md) (default) |
||||
|
- [OpenAI](https://openai.com) |
||||
|
- [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) |
||||
|
- [LocalAi (all)](https://localai.io/) |
||||
|
- [Ollama (all)](https://ollama.ai/) |
||||
|
- [LM Studio (all)](https://lmstudio.ai) |
||||
|
- [Cohere](https://cohere.com/) |
||||
|
|
||||
|
**Audio Transcription models:** |
||||
|
|
||||
|
- [AnythingLLM Built-in](https://github.com/Mintplex-Labs/anything-llm/tree/master/server/storage/models#audiovideo-transcription) (default) |
||||
|
- [OpenAI](https://openai.com/) |
||||
|
|
||||
|
**TTS (text-to-speech) support:** |
||||
|
|
||||
|
- Native Browser Built-in (default) |
||||
|
- [PiperTTSLocal - runs in browser](https://github.com/rhasspy/piper) |
||||
|
- [OpenAI TTS](https://platform.openai.com/docs/guides/text-to-speech/voice-options) |
||||
|
- [ElevenLabs](https://elevenlabs.io/) |
||||
|
- Any OpenAI Compatible TTS service. |
||||
|
|
||||
|
**STT (speech-to-text) support:** |
||||
|
|
||||
|
- Native Browser Built-in (default) |
||||
|
|
||||
|
**Vector Databases:** |
||||
|
|
||||
|
- [LanceDB](https://github.com/lancedb/lancedb) (default) |
||||
|
- [Astra DB](https://www.datastax.com/products/datastax-astra) |
||||
|
- [Pinecone](https://pinecone.io) |
||||
|
- [Chroma](https://trychroma.com) |
||||
|
- [Weaviate](https://weaviate.io) |
||||
|
- [Qdrant](https://qdrant.tech) |
||||
|
- [Milvus](https://milvus.io) |
||||
|
- [Zilliz](https://zilliz.com) |
||||
|
|
||||
|
### Technical Overview |
||||
|
|
||||
|
This monorepo consists of three main sections: |
||||
|
|
||||
|
- `frontend`: A viteJS + React frontend that you can run to easily create and manage all your content the LLM can use. |
||||
|
- `server`: A NodeJS express server to handle all the interactions and do all the vectorDB management and LLM interactions. |
||||
|
- `collector`: NodeJS express server that processes and parses documents from the UI. |
||||
|
- `docker`: Docker instructions and build process + information for building from source. |
||||
|
- `embed`: Submodule for generation & creation of the [web embed widget](https://github.com/Mintplex-Labs/anythingllm-embed). |
||||
|
- `browser-extension`: Submodule for the [chrome browser extension](https://github.com/Mintplex-Labs/anythingllm-extension). |
||||
|
|
||||
|
## 🛳 Self Hosting |
||||
|
|
||||
|
Mintplex Labs & the community maintain a number of deployment methods, scripts, and templates that you can use to run AnythingLLM locally. Refer to the table below to read how to deploy on your preferred environment or to automatically deploy. |
||||
|
| Docker | AWS | GCP | Digital Ocean | Render.com | |
||||
|
|----------------------------------------|----|-----|---------------|------------| |
||||
|
| [![Deploy on Docker][docker-btn]][docker-deploy] | [![Deploy on AWS][aws-btn]][aws-deploy] | [![Deploy on GCP][gcp-btn]][gcp-deploy] | [![Deploy on DigitalOcean][do-btn]][do-deploy] | [![Deploy on Render.com][render-btn]][render-deploy] | |
||||
|
|
||||
|
| Railway | RepoCloud | Elestio | |
||||
|
| --- | --- | --- | |
||||
|
| [![Deploy on Railway][railway-btn]][railway-deploy] | [![Deploy on RepoCloud][repocloud-btn]][repocloud-deploy] | [![Deploy on Elestio][elestio-btn]][elestio-deploy] | |
||||
|
|
||||
|
[or set up a production AnythingLLM instance without Docker →](./BARE_METAL.md) |
||||
|
|
||||
|
## How to setup for development |
||||
|
|
||||
|
- `yarn setup` To fill in the required `.env` files you'll need in each of the application sections (from root of repo). |
||||
|
- Go fill those out before proceeding. Ensure `server/.env.development` is filled or else things won't work right. |
||||
|
- `yarn dev:server` To boot the server locally (from root of repo). |
||||
|
- `yarn dev:frontend` To boot the frontend locally (from root of repo). |
||||
|
- `yarn dev:collector` To then run the document collector (from root of repo). |
||||
|
|
||||
|
[Learn about documents](./server/storage/documents/DOCUMENTS.md) |
||||
|
|
||||
|
[Learn about vector caching](./server/storage/vector-cache/VECTOR_CACHE.md) |
||||
|
|
||||
|
## External Apps & Integrations |
||||
|
|
||||
|
_These are apps that are not maintained by Mintplex Labs, but are compatible with AnythingLLM. A listing here is not an endorsement._ |
||||
|
|
||||
|
- [Midori AI Subsystem Manager](https://io.midori-ai.xyz/subsystem/anythingllm/) - A streamlined and efficient way to deploy AI systems using Docker container technology. |
||||
|
- [Coolify](https://coolify.io/docs/services/anythingllm/) - Deploy AnythingLLM with a single click. |
||||
|
- [GPTLocalhost for Microsoft Word](https://gptlocalhost.com/demo/) - A local Word Add-in for you to use AnythingLLM in Microsoft Word. |
||||
|
|
||||
|
## Telemetry & Privacy |
||||
|
|
||||
|
AnythingLLM by Mintplex Labs Inc contains a telemetry feature that collects anonymous usage information. |
||||
|
|
||||
|
<details> |
||||
|
<summary><kbd>More about Telemetry & Privacy for AnythingLLM</kbd></summary> |
||||
|
|
||||
|
### Why? |
||||
|
|
||||
|
We use this information to help us understand how AnythingLLM is used, to help us prioritize work on new features and bug fixes, and to help us improve AnythingLLM's performance and stability. |
||||
|
|
||||
|
### Opting out |
||||
|
|
||||
|
Set `DISABLE_TELEMETRY` in your server or docker .env settings to "true" to opt out of telemetry. You can also do this in-app by going to the sidebar > `Privacy` and disabling telemetry. |
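For example, in your `server/.env` (or docker `.env`) file:

```
DISABLE_TELEMETRY="true"
```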
||||
|
|
||||
|
### What do you explicitly track? |
||||
|
|
||||
|
We will only track usage details that help us make product and roadmap decisions, specifically: |
||||
|
|
||||
|
- Type of your installation (Docker or Desktop) |
||||
|
- When a document is added or removed. No information _about_ the document. Just that the event occurred. This gives us an idea of use. |
||||
|
- Type of vector database in use. Lets us know which vector database provider is the most used so we can prioritize changes when updates arrive for that provider. |
||||
|
- Type of LLM in use. Lets us know the most popular choice so we can prioritize changes when updates arrive for that provider. |
||||
|
- Chat is sent. This is the most regular "event" and gives us an idea of the daily activity of this project across all installations. Again, only the event is sent - we have no information on the nature or content of the chat itself. |
||||
|
|
||||
|
You can verify these claims by finding all locations `Telemetry.sendTelemetry` is called. Additionally these events are written to the output log so you can also see the specific data which was sent - if enabled. No IP or other identifying information is collected. The Telemetry provider is [PostHog](https://posthog.com/) - an open-source telemetry collection service. |
||||
|
|
||||
|
[View all telemetry events in source code](https://github.com/search?q=repo%3AMintplex-Labs%2Fanything-llm%20.sendTelemetry\(&type=code) |
||||
|
|
||||
|
</details> |
||||
|
|
||||
|
|
||||
|
## 👋 Contributing |
||||
|
|
||||
|
- create issue |
||||
|
- create PR with branch name format of `<issue number>-<short name>` |
||||
|
- LGTM from core-team |
||||
|
|
||||
|
## 🌟 Contributors |
||||
|
|
||||
|
[](https://github.com/mintplex-labs/anything-llm/graphs/contributors) |
||||
|
|
||||
|
[](https://star-history.com/#mintplex-labs/anything-llm&Date) |
||||
|
|
||||
|
## 🔗 More Products |
||||
|
|
||||
|
- **[VectorAdmin][vector-admin]:** An all-in-one GUI & tool-suite for managing vector databases. |
||||
|
- **[OpenAI Assistant Swarm][assistant-swarm]:** Turn your entire library of OpenAI assistants into one single army commanded from a single agent. |
||||
|
|
||||
|
<div align="right"> |
||||
|
|
||||
|
[![][back-to-top]](#readme-top) |
||||
|
|
||||
|
</div> |
||||
|
|
||||
|
--- |
||||
|
|
||||
|
Copyright © 2025 [Mintplex Labs][profile-link]. <br /> |
||||
|
This project is [MIT](./LICENSE) licensed. |
||||
|
|
||||
|
<!-- LINK GROUP --> |
||||
|
|
||||
|
[back-to-top]: https://img.shields.io/badge/-BACK_TO_TOP-222628?style=flat-square |
||||
|
[profile-link]: https://github.com/mintplex-labs |
||||
|
[vector-admin]: https://github.com/mintplex-labs/vector-admin |
||||
|
[assistant-swarm]: https://github.com/Mintplex-Labs/openai-assistant-swarm |
||||
|
[docker-btn]: ./images/deployBtns/docker.png |
||||
|
[docker-deploy]: ./docker/HOW_TO_USE_DOCKER.md |
||||
|
[aws-btn]: ./images/deployBtns/aws.png |
||||
|
[aws-deploy]: ./cloud-deployments/aws/cloudformation/DEPLOY.md |
||||
|
[gcp-btn]: https://deploy.cloud.run/button.svg |
||||
|
[gcp-deploy]: ./cloud-deployments/gcp/deployment/DEPLOY.md |
||||
|
[do-btn]: https://www.deploytodo.com/do-btn-blue.svg |
||||
|
[do-deploy]: ./cloud-deployments/digitalocean/terraform/DEPLOY.md |
||||
|
[render-btn]: https://render.com/images/deploy-to-render-button.svg |
||||
|
[render-deploy]: https://render.com/deploy?repo=https://github.com/Mintplex-Labs/anything-llm&branch=render |
||||
|
[railway-btn]: https://railway.app/button.svg |
||||
|
[railway-deploy]: https://railway.app/template/HNSCS1?referralCode=WFgJkn |
||||
|
[repocloud-btn]: https://d16t0pc4846x52.cloudfront.net/deploylobe.svg |
||||
|
[repocloud-deploy]: https://repocloud.io/details/?app_id=276 |
||||
|
[elestio-btn]: https://elest.io/images/logos/deploy-to-elestio-btn.png |
||||
|
[elestio-deploy]: https://elest.io/open-source/anythingllm |
||||
@ -0,0 +1,15 @@ |
|||||
|
# Security Policy |
||||
|
|
||||
|
## Supported Versions |
||||
|
|
||||
|
Use this section to tell people about which versions of your project are |
||||
|
currently being supported with security updates. |
||||
|
|
||||
|
| Version | Supported | |
||||
|
| ------- | ------------------ | |
||||
|
| 0.1.x | :white_check_mark: | |
||||
|
|
||||
|
|
||||
|
## Reporting a Vulnerability |
||||
|
|
||||
|
If you find a security concern that you would like to disclose, you can create a PR for it, or, if you would prefer to clear the issue privately before posting, you can email the [Core Mintplex Labs Team](mailto:team@mintplexlabs.com). |
||||
@ -0,0 +1,49 @@ |
|||||
|
# How to deploy a private AnythingLLM instance on AWS |
||||
|
|
||||
|
With an AWS account you can easily deploy a private AnythingLLM instance on AWS. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommended that you set a password once setup is complete. |
||||
|
|
||||
|
**Quick Launch (EASY)** |
||||
|
1. Log in to your AWS account |
||||
|
2. Open [CloudFormation](https://us-west-1.console.aws.amazon.com/cloudformation/home) |
||||
|
3. Ensure you are deploying in a geographic zone that is nearest to your physical location to reduce latency. |
||||
|
4. Click `Create Stack` |
||||
|
|
||||
|
 |
||||
|
|
||||
|
5. Use the file `cloudformation_create_anythingllm.json` as your JSON template. |
||||
|
|
||||
|
 |
||||
|
|
||||
|
6. Click Deploy. |
||||
|
7. Wait for stack events to finish and be marked as `Completed` |
||||
|
8. View `Outputs` tab. |
||||
|
|
||||
|
 |
||||
|
|
||||
|
9. Wait for all resources to be built. Then wait until the instance is available at `[InstanceIP]:3001`. |
||||
|
This process may take up to 10 minutes. See **Note** below on how to visualize this process. |
||||
|
|
||||
|
The output of this cloudformation stack will be: |
||||
|
- 1 EC2 Instance |
||||
|
- 1 Security Group with 0.0.0.0/0 access on port 3001 |
||||
|
- 1 EC2 Instance Volume (`gp2`) of 10GiB minimum - customizable pre-deploy. |
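If you prefer the AWS CLI over the console, the same template can be launched with `create-stack` (assumes the AWS CLI is installed and configured; the stack name is arbitrary):

```shell
aws cloudformation create-stack \
  --stack-name anything-llm \
  --template-body file://cloudformation_create_anythingllm.json
```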
||||
|
|
||||
|
**Requirements** |
||||
|
- An AWS account with billing information. |
||||
|
|
||||
|
## Please read this notice before submitting issues about your deployment |
||||
|
|
||||
|
**Note:** |
||||
|
Your instance will not be available instantly. Depending on the instance size you launched with it can take 5-10 minutes to fully boot up. |
||||
|
|
||||
|
If you want to check the instance's progress, navigate to [your deployed EC2 instances](https://us-west-1.console.aws.amazon.com/ec2/home) and connect to your instance via SSH in browser. |
||||
|
|
||||
|
Once connected run `sudo tail -f /var/log/cloud-init-output.log` and wait for the file to conclude deployment of the docker image. |
||||
|
You should see an output like this |
||||
|
``` |
||||
|
[+] Running 2/2 |
||||
|
⠿ Network docker_anything-llm Created |
||||
|
⠿ Container anything-llm Started |
||||
|
``` |
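If the container has started but the UI is still unreachable, the container logs are the next place to look (this reuses the same `docker ps --latest --quiet` call the user-data script prints):

```shell
sudo docker logs -f "$(sudo docker ps --latest --quiet)"
```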
||||
|
|
||||
|
Additionally, your use of this deployment process means you are responsible for any costs of these AWS resources fully. |
||||
@ -0,0 +1,118 @@ |
|||||
|
# How to Configure HTTPS for Anything LLM AWS private deployment |
||||
|
Instructions for manual HTTPS configuration after generating and running the AWS CloudFormation template (aws_build_from_source_no_credentials.json). Tested on the following browsers: Firefox 119, Chrome 118, Edge 118. |
||||
|
|
||||
|
**Requirements** |
||||
|
- Successful deployment of an Amazon Linux 2023 EC2 instance with a Docker container running AnythingLLM |
||||
|
- Admin priv to configure Elastic IP for EC2 instance via AWS Management Console UI |
||||
|
- Admin priv to configure DNS services (i.e. AWS Route 53) via AWS Management Console UI |
||||
|
- Admin priv to configure EC2 Security Group rules via AWS Management Console UI |
||||
|
|
||||
|
## Step 1: Allocate and assign Elastic IP Address to your deployed EC2 instance |
||||
|
1. Follow AWS instructions on allocating EIP here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html#using-instance-addressing-eips-allocating |
||||
|
2. Follow AWS instructions on assigning EIP to EC2 instance here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html#using-instance-addressing-eips-associating |
||||
|
|
||||
|
## Step 2: Configure DNS A record to resolve to the previously assigned EC2 instance via EIP |
||||
|
These instructions assume that you already have a top-level domain configured and are using a subdomain |
||||
|
to access AnythingLLM. |
||||
|
1. Follow AWS instructions on routing traffic to EC2 instance here: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/routing-to-ec2-instance.html |
||||
|
|
||||
|
## Step 3: Install and enable nginx |
||||
|
These instructions are for CLI configuration and assume you are logged in to EC2 instance as the ec2-user. |
||||
|
1. $sudo yum install nginx -y |
||||
|
2. $sudo systemctl enable nginx && sudo systemctl start nginx |
||||
|
|
||||
|
## Step 4: Install certbot |
||||
|
These instructions are for CLI configuration and assume you are logged in to EC2 instance as the ec2-user. |
||||
|
1. $sudo yum install -y augeas-libs |
||||
|
2. $sudo python3 -m venv /opt/certbot/ |
||||
|
3. $sudo /opt/certbot/bin/pip install --upgrade pip |
||||
|
4. $sudo /opt/certbot/bin/pip install certbot certbot-nginx |
||||
|
5. $sudo ln -s /opt/certbot/bin/certbot /usr/bin/certbot |
||||
|
|
||||
|
## Step 5: Configure temporary Inbound Traffic Rule for Security Group to certbot DNS verification |
||||
|
1. Follow AWS instructions on creating inbound rule (http port 80 0.0.0.0/0) for EC2 security group here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/working-with-security-groups.html#adding-security-group-rule |
||||
|
|
||||
|
## Step 6: Comment out default http NGINX proxy configuration |
||||
|
These instructions are for CLI configuration and assume you are logged in to EC2 instance as the ec2-user. |
||||
|
1. $sudo vi /etc/nginx/nginx.conf |
||||
|
2. In the nginx.conf file, comment out the default server block configuration for http/port 80. It should look something like the following: |
||||
|
``` |
||||
|
# server { |
||||
|
# listen 80; |
||||
|
# listen [::]:80; |
||||
|
# server_name _; |
||||
|
# root /usr/share/nginx/html; |
||||
|
# |
||||
|
# # Load configuration files for the default server block. |
||||
|
# include /etc/nginx/default.d/*.conf; |
||||
|
# |
||||
|
# error_page 404 /404.html; |
||||
|
# location = /404.html { |
||||
|
# } |
||||
|
# |
||||
|
# error_page 500 502 503 504 /50x.html; |
||||
|
# location = /50x.html { |
||||
|
# } |
||||
|
# } |
||||
|
``` |
||||
|
3. Enter ':wq' to save the changes to the nginx default config |
||||
|
|
||||
|
## Step 7: Create simple http proxy configuration for AnythingLLM |
||||
|
These instructions are for CLI configuration and assume you are logged in to EC2 instance as the ec2-user. |
||||
|
1. $sudo vi /etc/nginx/conf.d/anything.conf |
||||
|
2. Add the following configuration, ensuring that you insert your FQDN: |
||||
|
|
||||
|
``` |
||||
|
server { |
||||
|
# Enable websocket connections for agent protocol. |
||||
|
location ~* ^/api/agent-invocation/(.*) { |
||||
|
proxy_pass http://0.0.0.0:3001; |
||||
|
proxy_http_version 1.1; |
||||
|
proxy_set_header Upgrade $http_upgrade; |
||||
|
proxy_set_header Connection "Upgrade"; |
||||
|
} |
||||
|
|
||||
|
listen 80; |
||||
|
server_name [insert FQDN here]; |
||||
|
location / { |
||||
|
# Prevent timeouts on long-running requests. |
||||
|
proxy_connect_timeout 605; |
||||
|
proxy_send_timeout 605; |
||||
|
proxy_read_timeout 605; |
||||
|
send_timeout 605; |
||||
|
keepalive_timeout 605; |
||||
|
|
||||
|
# Enable readable HTTP Streaming for LLM streamed responses |
||||
|
proxy_buffering off; |
||||
|
proxy_cache off; |
||||
|
|
||||
|
# Proxy your locally running service |
||||
|
proxy_pass http://0.0.0.0:3001; |
||||
|
} |
||||
|
} |
||||
|
``` |
||||
|
3. Enter ':wq' to save the changes to the anything config file |
||||
|
|
||||
|
## Step 8: Test nginx http proxy config and restart nginx service |
||||
|
These instructions are for CLI configuration and assume you are logged in to EC2 instance as the ec2-user. |
||||
|
1. $sudo nginx -t |
||||
|
2. $sudo systemctl restart nginx |
||||
|
3. Navigate to http://FQDN in a browser and you should be proxied to the AnythingLLM web UI. |
||||
|
|
||||
|
## Step 9: Generate/install cert |
||||
|
These instructions are for CLI configuration and assume you are logged in to EC2 instance as the ec2-user. |
||||
|
1. $sudo certbot --nginx -d [Insert FQDN here] |
||||
|
Example command: $sudo certbot --nginx -d anythingllm.exampleorganization.org |
||||
|
This command will generate the appropriate certificate files, write the files to /etc/letsencrypt/live/yourFQDN, and make updates to the nginx |
||||
|
configuration file for AnythingLLM located at /etc/nginx/conf.d/anything.conf |
||||
|
2. Enter the email address you would like to use for updates. |
||||
|
3. Accept the terms of service. |
||||
|
4. Accept or decline to receive communication from LetsEncrypt. |
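Let's Encrypt certificates expire after 90 days. You can confirm that renewal will succeed with a dry run, and schedule `certbot renew` (for example via cron) to keep the certificate current:

```shell
sudo certbot renew --dry-run
```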
||||
|
|
||||
|
## Step 10: Test Cert installation |
||||
|
1. $sudo cat /etc/nginx/conf.d/anything.conf |
||||
|
You should see a completely updated configuration that includes https/443 and a redirect configuration for http/80. |
||||
|
2. Navigate to https://FQDN in a browser and you should be proxied to the AnythingLLM web UI. |
||||
|
|
||||
|
## Step 11: (Optional) Remove temporary Inbound Traffic Rule for Security Group to certbot DNS verification |
||||
|
1. Follow AWS instructions on deleting inbound rule (http port 80 0.0.0.0/0) for EC2 security group here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/working-with-security-groups.html#deleting-security-group-rule |
||||
@ -0,0 +1,234 @@ |
|||||
|
{ |
||||
|
"AWSTemplateFormatVersion": "2010-09-09", |
||||
|
"Description": "Create a stack that runs AnythingLLM on a single instance", |
||||
|
"Parameters": { |
||||
|
"InstanceType": { |
||||
|
"Description": "EC2 instance type", |
||||
|
"Type": "String", |
||||
|
"Default": "t3.small" |
||||
|
}, |
||||
|
"InstanceVolume": { |
||||
|
"Description": "Storage size of disk on Instance in GB", |
||||
|
"Type": "Number", |
||||
|
"Default": 10, |
||||
|
"MinValue": 4 |
||||
|
} |
||||
|
}, |
||||
|
"Resources": { |
||||
|
"AnythingLLMInstance": { |
||||
|
"Type": "AWS::EC2::Instance", |
||||
|
"Properties": { |
||||
|
"ImageId": { |
||||
|
"Fn::FindInMap": [ |
||||
|
"Region2AMI", |
||||
|
{ |
||||
|
"Ref": "AWS::Region" |
||||
|
}, |
||||
|
"AMI" |
||||
|
] |
||||
|
}, |
||||
|
"InstanceType": { |
||||
|
"Ref": "InstanceType" |
||||
|
}, |
||||
|
"SecurityGroupIds": [ |
||||
|
{ |
||||
|
"Ref": "AnythingLLMInstanceSecurityGroup" |
||||
|
} |
||||
|
], |
||||
|
"BlockDeviceMappings": [ |
||||
|
{ |
||||
|
"DeviceName": { |
||||
|
"Fn::FindInMap": [ |
||||
|
"Region2AMI", |
||||
|
{ |
||||
|
"Ref": "AWS::Region" |
||||
|
}, |
||||
|
"RootDeviceName" |
||||
|
] |
||||
|
}, |
||||
|
"Ebs": { |
||||
|
"VolumeSize": { |
||||
|
"Ref": "InstanceVolume" |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
], |
||||
|
"UserData": { |
||||
|
"Fn::Base64": { |
||||
|
"Fn::Join": [ |
||||
|
"", |
||||
|
[ |
||||
|
"Content-Type: multipart/mixed; boundary=\"//\"\n", |
||||
|
"MIME-Version: 1.0\n", |
||||
|
"\n", |
||||
|
"--//\n", |
||||
|
"Content-Type: text/cloud-config; charset=\"us-ascii\"\n", |
||||
|
"MIME-Version: 1.0\n", |
||||
|
"Content-Transfer-Encoding: 7bit\n", |
||||
|
"Content-Disposition: attachment; filename=\"cloud-config.txt\"\n", |
||||
|
"\n", |
||||
|
"\n", |
||||
|
"#cloud-config\n", |
||||
|
"cloud_final_modules:\n", |
||||
|
"- [scripts-user, once-per-instance]\n", |
||||
|
"\n", |
||||
|
"\n", |
||||
|
"--//\n", |
||||
|
"Content-Type: text/x-shellscript; charset=\"us-ascii\"\n", |
||||
|
"MIME-Version: 1.0\n", |
||||
|
"Content-Transfer-Encoding: 7bit\n", |
||||
|
"Content-Disposition: attachment; filename=\"userdata.txt\"\n", |
||||
|
"\n", |
||||
|
"\n", |
||||
|
"#!/bin/bash\n", |
||||
|
"# check output of userdata script with sudo tail -f /var/log/cloud-init-output.log\n", |
||||
|
"sudo yum install docker iptables -y\n", |
||||
|
"sudo iptables -A OUTPUT -m owner ! --uid-owner root -d 169.254.169.254 -j DROP\n", |
||||
|
"sudo systemctl enable docker\n", |
||||
|
"sudo systemctl start docker\n", |
||||
|
"mkdir -p /home/ec2-user/anythingllm\n", |
||||
|
"touch /home/ec2-user/anythingllm/.env\n", |
||||
|
"sudo chown ec2-user:ec2-user -R /home/ec2-user/anythingllm\n", |
||||
|
"docker pull mintplexlabs/anythingllm\n", |
||||
|
"docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/ec2-user/anythingllm:/app/server/storage -v /home/ec2-user/anythingllm/.env:/app/server/.env -e STORAGE_DIR=\"/app/server/storage\" mintplexlabs/anythingllm\n", |
||||
|
"echo \"Container ID: $(sudo docker ps --latest --quiet)\"\n", |
||||
|
"export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2)\n", |
||||
|
"echo \"Health check: $ONLINE\"\n", |
||||
|
"echo \"Setup complete! AnythingLLM instance is now online!\"\n", |
||||
|
"\n", |
||||
|
"--//--\n" |
||||
|
] |
||||
|
] |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
}, |
||||
|
"AnythingLLMInstanceSecurityGroup": { |
||||
|
"Type": "AWS::EC2::SecurityGroup", |
||||
|
"Properties": { |
||||
|
"GroupDescription": "AnythingLLM Instance Security Group", |
||||
|
"SecurityGroupIngress": [ |
||||
|
{ |
||||
|
"IpProtocol": "tcp", |
||||
|
"FromPort": "22", |
||||
|
"ToPort": "22", |
||||
|
"CidrIp": "0.0.0.0/0" |
||||
|
}, |
||||
|
{ |
||||
|
"IpProtocol": "tcp", |
||||
|
"FromPort": "3001", |
||||
|
"ToPort": "3001", |
||||
|
"CidrIp": "0.0.0.0/0" |
||||
|
}, |
||||
|
{ |
||||
|
"IpProtocol": "tcp", |
||||
|
"FromPort": "3001", |
||||
|
"ToPort": "3001", |
||||
|
"CidrIpv6": "::/0" |
||||
|
} |
||||
|
] |
||||
|
} |
||||
|
} |
||||
|
}, |
||||
|
"Outputs": { |
||||
|
"ServerIp": { |
||||
|
"Description": "IP address of the AnythingLLM instance", |
||||
|
"Value": { |
||||
|
"Fn::GetAtt": [ |
||||
|
"AnythingLLMInstance", |
||||
|
"PublicIp" |
||||
|
] |
||||
|
} |
||||
|
}, |
||||
|
"ServerURL": { |
||||
|
"Description": "URL of the AnythingLLM server", |
||||
|
"Value": { |
||||
|
"Fn::Join": [ |
||||
|
"", |
||||
|
[ |
||||
|
"http://", |
||||
|
{ |
||||
|
"Fn::GetAtt": [ |
||||
|
"AnythingLLMInstance", |
||||
|
"PublicIp" |
||||
|
] |
||||
|
}, |
||||
|
":3001" |
||||
|
] |
||||
|
] |
||||
|
} |
||||
|
} |
||||
|
}, |
||||
|
"Mappings": { |
||||
|
"Region2AMI": { |
||||
|
"ap-south-1": { |
||||
|
"AMI": "ami-0e6329e222e662a52", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"eu-north-1": { |
||||
|
"AMI": "ami-08c308b1bb265e927", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"eu-west-3": { |
||||
|
"AMI": "ami-069d1ea6bc64443f0", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"eu-west-2": { |
||||
|
"AMI": "ami-06a566ca43e14780d", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"eu-west-1": { |
||||
|
"AMI": "ami-0a8dc52684ee2fee2", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"ap-northeast-3": { |
||||
|
"AMI": "ami-0c8a89b455fae8513", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"ap-northeast-2": { |
||||
|
"AMI": "ami-0ff56409a6e8ea2a0", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"ap-northeast-1": { |
||||
|
"AMI": "ami-0ab0bbbd329f565e6", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"ca-central-1": { |
||||
|
"AMI": "ami-033c256a10931f206", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"sa-east-1": { |
||||
|
"AMI": "ami-0dabf4dab6b183eef", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"ap-southeast-1": { |
||||
|
"AMI": "ami-0dc5785603ad4ff54", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"ap-southeast-2": { |
||||
|
"AMI": "ami-0c5d61202c3b9c33e", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"eu-central-1": { |
||||
|
"AMI": "ami-004359656ecac6a95", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"us-east-1": { |
||||
|
"AMI": "ami-0cff7528ff583bf9a", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"us-east-2": { |
||||
|
"AMI": "ami-02238ac43d6385ab3", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"us-west-1": { |
||||
|
"AMI": "ami-01163e76c844a2129", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
}, |
||||
|
"us-west-2": { |
||||
|
"AMI": "ami-0ceecbb0f30a902a6", |
||||
|
"RootDeviceName": "/dev/xvda" |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,44 @@ |
|||||
|
# How to deploy a private AnythingLLM instance on DigitalOcean using Terraform |
||||
|
|
||||
|
With a DigitalOcean account, you can easily deploy a private AnythingLLM instance using Terraform. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys, and they will not be exposed. However, if you want your instance to be protected, it is highly recommended that you set a password once setup is complete. |
||||
|
|
||||
|
The output of this Terraform configuration will be: |
||||
|
- 1 DigitalOcean Droplet |
||||
|
- An IP address to access your application |
||||
|
|
||||
|
**Requirements** |
||||
|
- A DigitalOcean account with billing information |
||||
|
- Terraform installed on your local machine |
||||
|
- Follow the instructions in the [official Terraform documentation](https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli) for your operating system. |
||||
|
|
||||
|
## How to deploy on DigitalOcean |
||||
|
Open your terminal and navigate to the `docker` folder |
||||
|
1. Create a `.env` file by cloning the `.env.example`. |
||||
|
2. Navigate to `digitalocean/terraform` folder. |
||||
|
3. Replace the token value in the provider "digitalocean" block in main.tf with your DigitalOcean API token. |
||||
|
4. Run the following commands to initialize Terraform, review the infrastructure changes, and apply them: |
||||
|
``` |
||||
|
terraform init |
||||
|
terraform plan |
||||
|
terraform apply |
||||
|
``` |
||||
|
Confirm the changes by typing yes when prompted. |
||||
|
5. Once the deployment is complete, Terraform will output the public IP address of your droplet. You can access your application using this IP address. |
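The droplet IP is also exposed as the `ip_address` output defined in `outputs.tf`, so you can re-print it at any time:

```shell
terraform output ip_address
```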
||||
|
|
||||
|
## How to delete the deployment on DigitalOcean |
||||
|
To delete the resources created by Terraform, run the following command in the terminal: |
||||
|
``` |
||||
|
terraform destroy |
||||
|
``` |
||||
|
|
||||
|
## Please read this notice before submitting issues about your deployment |
||||
|
|
||||
|
**Note:** |
||||
|
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 5-10 minutes to fully boot up. |
||||
|
|
||||
|
If you want to check the instance's progress, navigate to [your deployed instances](https://cloud.digitalocean.com/droplets) and connect to your instance via SSH in browser. |
||||
|
|
||||
|
Once connected run `sudo tail -f /var/log/cloud-init-output.log` and wait for the file to conclude deployment of the docker image. |
||||
|
|
||||
|
|
||||
|
Additionally, your use of this deployment process means you are responsible for any costs of these Digital Ocean resources fully. |
||||
@ -0,0 +1,52 @@ |
|||||
|
terraform { |
||||
|
required_version = ">= 1.0.0" |
||||
|
|
||||
|
required_providers { |
||||
|
digitalocean = { |
||||
|
source = "digitalocean/digitalocean" |
||||
|
version = "~> 2.0" |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
provider "digitalocean" { |
||||
|
# Add your DigitalOcean API token here |
||||
|
token = "DigitalOcean API token" |
||||
|
} |
||||
|
|
||||
|
|
||||
|
resource "digitalocean_droplet" "anything_llm_instance" { |
||||
|
image = "ubuntu-24-04-x64" |
||||
|
name = "anything-llm-instance" |
||||
|
region = "nyc3" |
||||
|
size = "s-2vcpu-2gb" |
||||
|
|
||||
|
user_data = templatefile("user_data.tp1", { |
||||
|
env_content = local.formatted_env_content |
||||
|
}) |
||||
|
} |
||||
|
|
||||
|
locals { |
||||
|
env_content = file("../../../docker/.env") |
||||
|
formatted_env_content = join("\n", [ |
||||
|
for line in split("\n", local.env_content) : |
||||
|
line |
||||
|
if !( |
||||
|
( |
||||
|
substr(line, 0, 1) == "#" |
||||
|
) || |
||||
|
( |
||||
|
substr(line, 0, 3) == "UID" |
||||
|
) || |
||||
|
( |
||||
|
substr(line, 0, 3) == "GID" |
||||
|
) || |
||||
|
( |
||||
|
substr(line, 0, 11) == "CLOUD_BUILD" |
||||
|
) || |
||||
|
( |
||||
|
line == "" |
||||
|
) |
||||
|
) |
||||
|
]) |
||||
|
} |
||||
@ -0,0 +1,4 @@ |
|||||
|
output "ip_address" { |
||||
|
value = digitalocean_droplet.anything_llm_instance.ipv4_address |
||||
|
description = "The public IP address of your droplet application." |
||||
|
} |
||||
@ -0,0 +1,22 @@ |
|||||
|
#!/bin/bash |
||||
|
# check output of userdata script with sudo tail -f /var/log/cloud-init-output.log |
||||
|
|
||||
|
sudo apt-get update |
||||
|
sudo apt-get install -y docker.io |
||||
|
sudo usermod -a -G docker ubuntu |
||||
|
|
||||
|
sudo systemctl enable docker |
||||
|
sudo systemctl start docker |
||||
|
|
||||
|
mkdir -p /home/anythingllm |
||||
|
cat <<EOF >/home/anythingllm/.env |
||||
|
${env_content} |
||||
|
EOF |
||||
|
|
||||
|
sudo docker pull mintplexlabs/anythingllm |
||||
|
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm |
||||
|
echo "Container ID: $(sudo docker ps --latest --quiet)" |
||||
|
|
||||
|
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2) |
||||
|
echo "Health check: $ONLINE" |
||||
|
echo "Setup complete! AnythingLLM instance is now online!" |
||||
@ -0,0 +1,54 @@ |
|||||
|
# How to deploy a private AnythingLLM instance on GCP |
||||
|
|
||||
|
With a GCP account you can easily deploy a private AnythingLLM instance on GCP. This will create a URL that you can access from any browser over HTTP (HTTPS not supported). This single instance will run on your own keys and they will not be exposed - however if you want your instance to be protected it is highly recommended that you set a password once setup is complete. |
||||
|
|
||||
|
The output of this deployment will be: |
||||
|
- 1 GCP VM |
||||
|
- 1 Security Group with 0.0.0.0/0 access on Ports 22 & 3001 |
||||
|
- 1 GCP VM boot disk of 10GiB minimum |
||||
|
|
||||
|
**Requirements** |
||||
|
- A GCP account with billing information. |
||||
|
|
||||
|
## How to deploy on GCP |
||||
|
Open your terminal |
||||
|
1. Log in to your GCP account using the following command: |
||||
|
``` |
||||
|
gcloud auth login |
||||
|
``` |
||||
|
|
||||
|
2. After successful login, run the following command to create a deployment using the Deployment Manager CLI: |
||||
|
|
||||
|
``` |
||||
|
|
||||
|
gcloud deployment-manager deployments create anything-llm-deployment --config gcp/deployment/gcp_deploy_anything_llm.yaml |
||||
|
|
||||
|
``` |
||||
|
|
||||
|
Once you execute these steps, the CLI will initiate the deployment process on GCP based on your configuration file. You can monitor the deployment status and view the outputs using the Google Cloud Console or the Deployment Manager CLI commands. |
||||
|
|
||||
|
``` |
||||
|
gcloud compute instances get-serial-port-output anything-llm-instance |
||||
|
``` |
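You can also inspect the deployment's overall status and resources with Deployment Manager directly (the deployment name matches the one created above):

```shell
gcloud deployment-manager deployments describe anything-llm-deployment
```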
||||
|
|
||||
|
SSH into the instance |
||||
|
|
||||
|
``` |
||||
|
gcloud compute ssh anything-llm-instance |
||||
|
``` |
||||
|
|
||||
|
Delete the deployment |
||||
|
``` |
||||
|
gcloud deployment-manager deployments delete anything-llm-deployment |
||||
|
``` |
||||
|
|
||||
|
## Please read this notice before submitting issues about your deployment |
||||
|
|
||||
|
**Note:** |
||||
|
Your instance will not be available instantly. Depending on the instance size you launched with it can take anywhere from 5-10 minutes to fully boot up. |
||||
|
|
||||
|
If you want to check the instance's progress, navigate to [your deployed instances](https://console.cloud.google.com/compute/instances) and connect to your instance via SSH in browser. |
||||
|
|
||||
|
Once connected run `sudo tail -f /var/log/cloud-init-output.log` and wait for the file to conclude deployment of the docker image. |
||||
|
|
||||
|
Additionally, your use of this deployment process means you are responsible for any costs of these GCP resources fully. |
||||
@ -0,0 +1,45 @@ |
|||||
|
resources: |
||||
|
- name: anything-llm-instance |
||||
|
type: compute.v1.instance |
||||
|
properties: |
||||
|
zone: us-central1-a |
||||
|
machineType: zones/us-central1-a/machineTypes/n1-standard-1 |
||||
|
disks: |
||||
|
- deviceName: boot |
||||
|
type: PERSISTENT |
||||
|
boot: true |
||||
|
autoDelete: true |
||||
|
initializeParams: |
||||
|
sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2004-lts |
||||
|
diskSizeGb: 10 |
||||
|
networkInterfaces: |
||||
|
- network: global/networks/default |
||||
|
accessConfigs: |
||||
|
- name: External NAT |
||||
|
type: ONE_TO_ONE_NAT |
||||
|
metadata: |
||||
|
items: |
||||
|
- key: startup-script |
||||
|
value: | |
||||
|
#!/bin/bash |
||||
|
# check output of userdata script with sudo tail -f /var/log/cloud-init-output.log |
||||
|
|
||||
|
sudo apt-get update |
||||
|
sudo apt-get install -y docker.io |
||||
|
sudo usermod -a -G docker ubuntu |
||||
|
sudo systemctl enable docker |
||||
|
sudo systemctl start docker |
||||
|
|
||||
|
mkdir -p /home/anythingllm |
||||
|
touch /home/anythingllm/.env |
||||
|
sudo chown -R ubuntu:ubuntu /home/anythingllm |
||||
|
|
||||
|
sudo docker pull mintplexlabs/anythingllm |
||||
|
sudo docker run -d -p 3001:3001 --cap-add SYS_ADMIN -v /home/anythingllm:/app/server/storage -v /home/anythingllm/.env:/app/server/.env -e STORAGE_DIR="/app/server/storage" mintplexlabs/anythingllm |
||||
|
echo "Container ID: $(sudo docker ps --latest --quiet)" |
||||
|
|
||||
|
export ONLINE=$(curl -Is http://localhost:3001/api/ping | head -n 1|cut -d$' ' -f2) |
||||
|
echo "Health check: $ONLINE" |
||||
|
|
||||
|
echo "Setup complete! AnythingLLM instance is now online!" |
||||
|
|
||||
@ -0,0 +1,31 @@ |
|||||
|
# With this dockerfile in a Huggingface space you will get an entire AnythingLLM instance running |
||||
|
# in your space with all features you would normally get from the docker based version of AnythingLLM. |
||||
|
# |
||||
|
# How to use |
||||
|
# - Login to https://huggingface.co/spaces |
||||
|
# - Click on "Create new Space" |
||||
|
# - Name the space and select "Docker" as the SDK w/ a blank template |
||||
|
# - The default 2vCPU/16GB machine is OK. The more the merrier. |
||||
|
# - Decide if you want your AnythingLLM Space public or private. |
||||
|
# **You might want to stay private until you at least set a password or enable multi-user mode** |
||||
|
# - Click "Create Space" |
||||
|
# - Click on "Settings" on top of page (https://huggingface.co/spaces/<username>/<space-name>/settings) |
||||
|
# - Scroll to "Persistent Storage" and select the lowest tier for now - you can upgrade if you run out. |
||||
|
# - Confirm and continue storage upgrade |
||||
|
# - Go to "Files" Tab (https://huggingface.co/spaces/<username>/<space-name>/tree/main) |
||||
|
# - Click "Add Files" |
||||
|
# - Upload this file or create a file named `Dockerfile` and copy-paste this content into it. "Commit to main" and save. |
||||
|
# - Your container will build and boot. You now have AnythingLLM on HuggingFace. Your data is stored in the persistent storage attached. |
||||
|
# Have Fun 🤗 |
||||
|
# Have issues? Check the logs on HuggingFace for clues. |
||||
|
FROM mintplexlabs/anythingllm:render |
||||
|
|
||||
|
USER root |
||||
|
RUN mkdir -p /data/storage |
||||
|
RUN ln -s /data/storage /storage |
||||
|
USER anythingllm |
||||
|
|
||||
|
ENV STORAGE_DIR="/data/storage" |
||||
|
ENV SERVER_PORT=7860 |
||||
|
|
||||
|
ENTRYPOINT ["/bin/bash", "/usr/local/bin/render-entrypoint.sh"] |
||||
@ -0,0 +1,214 @@ |
|||||
|
--- |
||||
|
apiVersion: v1 |
||||
|
kind: PersistentVolume |
||||
|
metadata: |
||||
|
name: anything-llm-volume |
||||
|
annotations: |
||||
|
pv.beta.kubernetes.io/uid: "1000" |
||||
|
pv.beta.kubernetes.io/gid: "1000" |
||||
|
spec: |
||||
|
storageClassName: gp2 |
||||
|
capacity: |
||||
|
storage: 5Gi |
||||
|
accessModes: |
||||
|
- ReadWriteOnce |
||||
|
awsElasticBlockStore: |
||||
|
# This is the volume UUID from AWS EC2 EBS Volumes list. |
||||
|
volumeID: "{{ anythingllm_awsElasticBlockStore_volumeID }}" |
||||
|
fsType: ext4 |
||||
|
nodeAffinity: |
||||
|
required: |
||||
|
nodeSelectorTerms: |
||||
|
- matchExpressions: |
||||
|
- key: topology.kubernetes.io/zone |
||||
|
operator: In |
||||
|
values: |
||||
|
- us-east-1c |
||||
|
--- |
||||
|
apiVersion: v1 |
||||
|
kind: PersistentVolumeClaim |
||||
|
metadata: |
||||
|
name: anything-llm-volume-claim |
||||
|
namespace: "{{ namespace }}" |
||||
|
spec: |
||||
|
accessModes: |
||||
|
- ReadWriteOnce |
||||
|
resources: |
||||
|
requests: |
||||
|
storage: 5Gi |
||||
|
--- |
||||
|
apiVersion: apps/v1 |
||||
|
kind: Deployment |
||||
|
metadata: |
||||
|
name: anything-llm |
||||
|
namespace: "{{ namespace }}" |
||||
|
labels: |
||||
|
anything-llm: "true" |
||||
|
spec: |
||||
|
selector: |
||||
|
matchLabels: |
||||
|
k8s-app: anything-llm |
||||
|
replicas: 1 |
||||
|
strategy: |
||||
|
type: RollingUpdate |
||||
|
rollingUpdate: |
||||
|
maxSurge: 0% |
||||
|
maxUnavailable: 100% |
||||
|
template: |
||||
|
metadata: |
||||
|
labels: |
||||
|
anything-llm: "true" |
||||
|
k8s-app: anything-llm |
||||
|
app.kubernetes.io/name: anything-llm |
||||
|
app.kubernetes.io/part-of: anything-llm |
||||
|
annotations: |
||||
|
prometheus.io/scrape: "true" |
||||
|
prometheus.io/path: /metrics |
||||
|
prometheus.io/port: "9090" |
||||
|
spec: |
||||
|
serviceAccountName: "default" |
||||
|
terminationGracePeriodSeconds: 10 |
||||
|
securityContext: |
||||
|
fsGroup: 1000 |
||||
|
runAsNonRoot: true |
||||
|
runAsGroup: 1000 |
||||
|
runAsUser: 1000 |
||||
|
affinity: |
||||
|
nodeAffinity: |
||||
|
requiredDuringSchedulingIgnoredDuringExecution: |
||||
|
nodeSelectorTerms: |
||||
|
- matchExpressions: |
||||
|
- key: topology.kubernetes.io/zone |
||||
|
operator: In |
||||
|
values: |
||||
|
- us-east-1c |
||||
|
containers: |
||||
|
- name: anything-llm |
||||
|
resources: |
||||
|
limits: |
||||
|
memory: "1Gi" |
||||
|
cpu: "500m" |
||||
|
requests: |
||||
|
memory: "512Mi" |
||||
|
cpu: "250m" |
||||
|
imagePullPolicy: IfNotPresent |
||||
|
image: "mintplexlabs/anythingllm:render" |
||||
|
securityContext: |
||||
|
allowPrivilegeEscalation: true |
||||
|
capabilities: |
||||
|
add: |
||||
|
- SYS_ADMIN |
||||
|
runAsNonRoot: true |
||||
|
runAsGroup: 1000 |
||||
|
runAsUser: 1000 |
||||
|
command: |
||||
|
# Specify a command to override the Dockerfile's ENTRYPOINT. |
||||
|
- /bin/bash |
||||
|
- -c |
||||
|
- | |
||||
|
set -x -e |
||||
|
sleep 3 |
||||
|
echo "AWS_REGION: $AWS_REGION" |
||||
|
echo "SERVER_PORT: $SERVER_PORT" |
||||
|
echo "NODE_ENV: $NODE_ENV" |
||||
|
echo "STORAGE_DIR: $STORAGE_DIR" |
||||
|
{ |
||||
|
cd /app/server/ && |
||||
|
npx prisma generate --schema=./prisma/schema.prisma && |
||||
|
npx prisma migrate deploy --schema=./prisma/schema.prisma && |
||||
|
node /app/server/index.js |
||||
|
echo "Server process exited with status $?" |
||||
|
} & |
||||
|
{ |
||||
|
node /app/collector/index.js |
||||
|
echo "Collector process exited with status $?" |
||||
|
} & |
||||
|
wait -n |
||||
|
exit $? |
||||
|
readinessProbe: |
||||
|
httpGet: |
||||
|
path: /v1/api/health |
||||
|
port: 8888 |
||||
|
initialDelaySeconds: 15 |
||||
|
periodSeconds: 5 |
||||
|
successThreshold: 2 |
||||
|
livenessProbe: |
||||
|
httpGet: |
||||
|
path: /v1/api/health |
||||
|
port: 8888 |
||||
|
initialDelaySeconds: 15 |
||||
|
periodSeconds: 5 |
||||
|
failureThreshold: 3 |
||||
|
env: |
||||
|
- name: AWS_REGION |
||||
|
value: "{{ aws_region }}" |
||||
|
- name: AWS_ACCESS_KEY_ID |
||||
|
value: "{{ aws_access_id }}" |
||||
|
- name: AWS_SECRET_ACCESS_KEY |
||||
|
value: "{{ aws_access_secret }}" |
||||
|
- name: SERVER_PORT |
||||
|
value: "3001" |
||||
|
- name: JWT_SECRET |
||||
|
value: "my-random-string-for-seeding" # Please generate random string at least 12 chars long. |
||||
|
- name: STORAGE_DIR |
||||
|
value: "/storage" |
||||
|
- name: NODE_ENV |
||||
|
value: "production" |
||||
|
- name: UID |
||||
|
value: "1000" |
||||
|
- name: GID |
||||
|
value: "1000" |
||||
|
volumeMounts: |
||||
|
- name: anything-llm-server-storage-volume-mount |
||||
|
mountPath: /storage |
||||
|
volumes: |
||||
|
- name: anything-llm-server-storage-volume-mount |
||||
|
persistentVolumeClaim: |
||||
|
claimName: anything-llm-volume-claim |
||||
|
--- |
||||
|
# This serves the UI and the backend. |
||||
|
apiVersion: networking.k8s.io/v1 |
||||
|
kind: Ingress |
||||
|
metadata: |
||||
|
name: anything-llm-ingress |
||||
|
namespace: "{{ namespace }}" |
||||
|
annotations: |
||||
|
external-dns.alpha.kubernetes.io/hostname: "{{ namespace }}-chat.{{ base_domain }}" |
||||
|
kubernetes.io/ingress.class: "internal-ingress" |
||||
|
nginx.ingress.kubernetes.io/rewrite-target: / |
||||
|
ingress.kubernetes.io/ssl-redirect: "false" |
||||
|
spec: |
||||
|
rules: |
||||
|
- host: "{{ namespace }}-chat.{{ base_domain }}" |
||||
|
http: |
||||
|
paths: |
||||
|
- path: / |
||||
|
pathType: Prefix |
||||
|
backend: |
||||
|
service: |
||||
|
name: anything-llm-svc |
||||
|
port: |
||||
|
number: 3001 |
||||
|
tls: # < placing a host in the TLS config will indicate a cert should be created |
||||
|
- hosts: |
||||
|
- "{{ namespace }}-chat.{{ base_domain }}" |
||||
|
secretName: letsencrypt-prod |
||||
|
--- |
||||
|
apiVersion: v1 |
||||
|
kind: Service |
||||
|
metadata: |
||||
|
labels: |
||||
|
kubernetes.io/name: anything-llm |
||||
|
name: anything-llm-svc |
||||
|
namespace: "{{ namespace }}" |
||||
|
spec: |
||||
|
ports: |
||||
|
# "port" is external port, and "targetPort" is internal. |
||||
|
- port: 3301 |
||||
|
targetPort: 3001 |
||||
|
name: traffic |
||||
|
- port: 9090 |
||||
|
targetPort: 9090 |
||||
|
name: metrics |
||||
|
selector: |
||||
|
k8s-app: anything-llm |
||||
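A minimal sketch of applying this manifest, assuming the `{{ namespace }}`, `{{ base_domain }}`, and volume ID placeholders have already been rendered by your templating tool and that the target namespace exists (the filename below is illustrative):

```shell
kubectl apply -n my-namespace -f manifest.yaml
kubectl rollout status deployment/anything-llm -n my-namespace
```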
@ -0,0 +1 @@ |
|||||
|
# Placeholder .env file for collector runtime |
||||
@ -0,0 +1,6 @@ |
|||||
|
hotdir/* |
||||
|
!hotdir/__HOTDIR__.md |
||||
|
yarn-error.log |
||||
|
!yarn.lock |
||||
|
outputs |
||||
|
scripts |
||||
@ -0,0 +1 @@ |
|||||
|
v18.13.0 |
||||
@ -0,0 +1,159 @@ |
|||||
|
const { setDataSigner } = require("../middleware/setDataSigner"); |
||||
|
const { verifyPayloadIntegrity } = require("../middleware/verifyIntegrity"); |
||||
|
const { resolveRepoLoader, resolveRepoLoaderFunction } = require("../utils/extensions/RepoLoader"); |
||||
|
const { reqBody } = require("../utils/http"); |
||||
|
const { validURL } = require("../utils/url"); |
||||
|
const RESYNC_METHODS = require("./resync"); |
||||
|
|
||||
|
function extensions(app) { |
||||
|
if (!app) return; |
||||
|
|
||||
|
app.post( |
||||
|
"/ext/resync-source-document", |
||||
|
[verifyPayloadIntegrity, setDataSigner], |
||||
|
async function (request, response) { |
||||
|
try { |
||||
|
const { type, options } = reqBody(request); |
||||
|
if (!RESYNC_METHODS.hasOwnProperty(type)) throw new Error(`Type "${type}" is not a valid type to sync.`); |
||||
|
return await RESYNC_METHODS[type](options, response); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
success: false, |
||||
|
content: null, |
||||
|
reason: e.message || "A processing error occurred.", |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
) |
||||
|
|
||||
|
app.post( |
||||
|
"/ext/:repo_platform-repo", |
||||
|
[verifyPayloadIntegrity, setDataSigner], |
||||
|
async function (request, response) { |
||||
|
try { |
||||
|
const loadRepo = resolveRepoLoaderFunction(request.params.repo_platform); |
||||
|
const { success, reason, data } = await loadRepo( |
||||
|
reqBody(request), |
||||
|
response, |
||||
|
); |
||||
|
response.status(200).json({ |
||||
|
success, |
||||
|
reason, |
||||
|
data, |
||||
|
}); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
success: false, |
||||
|
reason: e.message || "A processing error occurred.", |
||||
|
data: {}, |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
// gets all branches for a specific repo
|
||||
|
app.post( |
||||
|
"/ext/:repo_platform-repo/branches", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
try { |
||||
|
const RepoLoader = resolveRepoLoader(request.params.repo_platform); |
||||
|
const allBranches = await new RepoLoader( |
||||
|
reqBody(request) |
||||
|
).getRepoBranches(); |
||||
|
response.status(200).json({ |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
data: { |
||||
|
branches: allBranches, |
||||
|
}, |
||||
|
}); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(400).json({ |
||||
|
success: false, |
||||
|
reason: e.message, |
||||
|
data: { |
||||
|
branches: [], |
||||
|
}, |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/ext/youtube-transcript", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
try { |
||||
|
const { loadYouTubeTranscript } = require("../utils/extensions/YoutubeTranscript"); |
||||
|
const { success, reason, data } = await loadYouTubeTranscript( |
||||
|
reqBody(request) |
||||
|
); |
||||
|
response.status(200).json({ success, reason, data }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(400).json({ |
||||
|
success: false, |
||||
|
reason: e.message, |
||||
|
data: { |
||||
|
title: null, |
||||
|
author: null, |
||||
|
}, |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/ext/website-depth", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
try { |
||||
|
const websiteDepth = require("../utils/extensions/WebsiteDepth"); |
||||
|
const { url, depth = 1, maxLinks = 20 } = reqBody(request); |
||||
|
if (!validURL(url)) throw new Error("Not a valid URL."); |
||||
|
const scrapedData = await websiteDepth(url, depth, maxLinks); |
||||
|
response.status(200).json({ success: true, data: scrapedData }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(400).json({ success: false, reason: e.message }); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/ext/confluence", |
||||
|
[verifyPayloadIntegrity, setDataSigner], |
||||
|
async function (request, response) { |
||||
|
try { |
||||
|
const { loadConfluence } = require("../utils/extensions/Confluence"); |
||||
|
const { success, reason, data } = await loadConfluence( |
||||
|
reqBody(request), |
||||
|
response |
||||
|
); |
||||
|
response.status(200).json({ success, reason, data }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(400).json({ |
||||
|
success: false, |
||||
|
reason: e.message, |
||||
|
data: { |
||||
|
title: null, |
||||
|
author: null, |
||||
|
}, |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
module.exports = extensions; |
||||
@ -0,0 +1,114 @@ |
|||||
|
const { getLinkText } = require("../../processLink"); |
||||
|
|
||||
|
/** |
||||
|
* Fetches the content of a raw link. Returns the content as a text string of the link in question. |
||||
|
* @param {object} data - metadata from document (eg: link) |
||||
|
* @param {import("../../middleware/setDataSigner").ResponseWithSigner} response |
||||
|
*/ |
||||
|
async function resyncLink({ link }, response) { |
||||
|
if (!link) throw new Error('Invalid link provided'); |
||||
|
try { |
||||
|
const { success, reason = null, content = null } = await getLinkText(link);
if (!success) throw new Error(`Failed to sync link content. ${reason}`);
||||
|
response.status(200).json({ success, content }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
success: false, |
||||
|
content: null, |
||||
|
}); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches the content of a YouTube link. Returns the content as a text string of the video in question. |
||||
|
* We offer this as there may be some videos where a transcription could be manually edited after initial scraping |
||||
|
* but in general - transcriptions often never change. |
||||
|
* @param {object} data - metadata from document (eg: link) |
||||
|
* @param {import("../../middleware/setDataSigner").ResponseWithSigner} response |
||||
|
*/ |
||||
|
async function resyncYouTube({ link }, response) { |
||||
|
if (!link) throw new Error('Invalid link provided'); |
||||
|
try { |
||||
|
const { fetchVideoTranscriptContent } = require("../../utils/extensions/YoutubeTranscript"); |
||||
|
const { success, reason, content } = await fetchVideoTranscriptContent({ url: link }); |
||||
|
if (!success) throw new Error(`Failed to sync YouTube video transcript. ${reason}`); |
||||
|
response.status(200).json({ success, content }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
success: false, |
||||
|
content: null, |
||||
|
}); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches the content of a specific confluence page via its chunkSource. |
||||
|
* Returns the content as a text string of the page in question and only that page. |
||||
|
* @param {object} data - metadata from document (eg: chunkSource) |
||||
|
* @param {import("../../middleware/setDataSigner").ResponseWithSigner} response |
||||
|
*/ |
||||
|
async function resyncConfluence({ chunkSource }, response) { |
||||
|
if (!chunkSource) throw new Error('Invalid source property provided'); |
||||
|
try { |
||||
|
// Confluence data is `payload` encrypted. So we need to expand its
|
||||
|
// encrypted payload back into query params so we can reFetch the page with same access token/params.
|
||||
|
const source = response.locals.encryptionWorker.expandPayload(chunkSource); |
||||
|
const { fetchConfluencePage } = require("../../utils/extensions/Confluence"); |
||||
|
const { success, reason, content } = await fetchConfluencePage({ |
||||
|
pageUrl: `https:${source.pathname}`, // need to add back the real protocol
|
||||
|
baseUrl: source.searchParams.get('baseUrl'), |
||||
|
spaceKey: source.searchParams.get('spaceKey'), |
||||
|
accessToken: source.searchParams.get('token'), |
||||
|
username: source.searchParams.get('username'), |
||||
|
}); |
||||
|
|
||||
|
if (!success) throw new Error(`Failed to sync Confluence page content. ${reason}`); |
||||
|
response.status(200).json({ success, content }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
success: false, |
||||
|
content: null, |
||||
|
}); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches the content of a specific confluence page via its chunkSource. |
||||
|
* Returns the content as a text string of the page in question and only that page. |
||||
|
* @param {object} data - metadata from document (eg: chunkSource) |
||||
|
* @param {import("../../middleware/setDataSigner").ResponseWithSigner} response |
||||
|
*/ |
||||
|
async function resyncGithub({ chunkSource }, response) { |
||||
|
if (!chunkSource) throw new Error('Invalid source property provided'); |
||||
|
try { |
||||
|
// Github file data is `payload` encrypted (might contain PAT). So we need to expand its
|
||||
|
// encrypted payload back into query params so we can reFetch the page with same access token/params.
|
||||
|
const source = response.locals.encryptionWorker.expandPayload(chunkSource); |
||||
|
const { fetchGithubFile } = require("../../utils/extensions/RepoLoader/GithubRepo"); |
||||
|
const { success, reason, content } = await fetchGithubFile({ |
||||
|
repoUrl: `https:${source.pathname}`, // need to add back the real protocol
|
||||
|
branch: source.searchParams.get('branch'), |
||||
|
accessToken: source.searchParams.get('pat'), |
||||
|
sourceFilePath: source.searchParams.get('path'), |
||||
|
}); |
||||
|
|
||||
|
if (!success) throw new Error(`Failed to sync GitHub file content. ${reason}`); |
||||
|
response.status(200).json({ success, content }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
success: false, |
||||
|
content: null, |
||||
|
}); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
link: resyncLink, |
||||
|
youtube: resyncYouTube, |
||||
|
confluence: resyncConfluence, |
||||
|
github: resyncGithub, |
||||
|
} |
||||
@ -0,0 +1,3 @@
### What is the "Hot directory"

This is a pre-set file location that documents will be written to when uploaded by AnythingLLM. There is really no need to touch it.
@ -0,0 +1,151 @@ |
|||||
|
process.env.NODE_ENV === "development" |
||||
|
? require("dotenv").config({ path: `.env.${process.env.NODE_ENV}` }) |
||||
|
: require("dotenv").config(); |
||||
|
|
||||
|
require("./utils/logger")(); |
||||
|
const express = require("express"); |
||||
|
const bodyParser = require("body-parser"); |
||||
|
const cors = require("cors"); |
||||
|
const path = require("path"); |
||||
|
const { ACCEPTED_MIMES } = require("./utils/constants"); |
||||
|
const { reqBody } = require("./utils/http"); |
||||
|
const { processSingleFile } = require("./processSingleFile"); |
||||
|
const { processLink, getLinkText } = require("./processLink"); |
||||
|
const { wipeCollectorStorage } = require("./utils/files"); |
||||
|
const extensions = require("./extensions"); |
||||
|
const { processRawText } = require("./processRawText"); |
||||
|
const { verifyPayloadIntegrity } = require("./middleware/verifyIntegrity"); |
||||
|
const app = express(); |
||||
|
const FILE_LIMIT = "3GB"; |
||||
|
|
||||
|
app.use(cors({ origin: true })); |
||||
|
app.use( |
||||
|
bodyParser.text({ limit: FILE_LIMIT }), |
||||
|
bodyParser.json({ limit: FILE_LIMIT }), |
||||
|
bodyParser.urlencoded({ |
||||
|
limit: FILE_LIMIT, |
||||
|
extended: true, |
||||
|
}) |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/process", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
const { filename, options = {} } = reqBody(request); |
||||
|
try { |
||||
|
const targetFilename = path |
||||
|
.normalize(filename) |
||||
|
.replace(/^(\.\.(\/|\\|$))+/, ""); |
||||
|
const { |
||||
|
success, |
||||
|
reason, |
||||
|
documents = [], |
||||
|
} = await processSingleFile(targetFilename, options); |
||||
|
response |
||||
|
.status(200) |
||||
|
.json({ filename: targetFilename, success, reason, documents }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
filename: filename, |
||||
|
success: false, |
||||
|
reason: "A processing error occurred.", |
||||
|
documents: [], |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/process-link", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
const { link } = reqBody(request); |
||||
|
try { |
||||
|
const { success, reason, documents = [] } = await processLink(link); |
||||
|
response.status(200).json({ url: link, success, reason, documents }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
url: link, |
||||
|
success: false, |
||||
|
reason: "A processing error occurred.", |
||||
|
documents: [], |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/util/get-link", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
const { link, captureAs = "text" } = reqBody(request); |
||||
|
try { |
||||
|
const { success, content = null } = await getLinkText(link, captureAs); |
||||
|
response.status(200).json({ url: link, success, content }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
url: link, |
||||
|
success: false, |
||||
|
content: null, |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
app.post( |
||||
|
"/process-raw-text", |
||||
|
[verifyPayloadIntegrity], |
||||
|
async function (request, response) { |
||||
|
const { textContent, metadata } = reqBody(request); |
||||
|
try { |
||||
|
const { |
||||
|
success, |
||||
|
reason, |
||||
|
documents = [], |
||||
|
} = await processRawText(textContent, metadata); |
||||
|
response |
||||
|
.status(200) |
||||
|
.json({ filename: metadata.title, success, reason, documents }); |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
response.status(200).json({ |
||||
|
filename: metadata?.title || "Unknown-doc.txt", |
||||
|
success: false, |
||||
|
reason: "A processing error occurred.", |
||||
|
documents: [], |
||||
|
}); |
||||
|
} |
||||
|
return; |
||||
|
} |
||||
|
); |
||||
|
|
||||
|
extensions(app); |
||||
|
|
||||
|
app.get("/accepts", function (_, response) { |
||||
|
response.status(200).json(ACCEPTED_MIMES); |
||||
|
}); |
||||
|
|
||||
|
app.all("*", function (_, response) { |
||||
|
response.sendStatus(200); |
||||
|
}); |
||||
|
|
||||
|
app |
||||
|
.listen(8888, async () => { |
||||
|
await wipeCollectorStorage(); |
||||
|
console.log(`Document processor app listening on port 8888`); |
||||
|
}) |
||||
|
.on("error", function (_) { |
||||
|
process.once("SIGUSR2", function () { |
||||
|
process.kill(process.pid, "SIGUSR2"); |
||||
|
}); |
||||
|
process.on("SIGINT", function () { |
||||
|
process.kill(process.pid, "SIGINT"); |
||||
|
}); |
||||
|
}); |
||||
@ -0,0 +1,41 @@ |
|||||
|
const { EncryptionWorker } = require("../utils/EncryptionWorker"); |
||||
|
const { CommunicationKey } = require("../utils/comKey"); |
||||
|
|
||||
|
/** |
||||
|
* Express Response Object interface with defined encryptionWorker attached to locals property. |
||||
|
* @typedef {import("express").Response & import("express").Response['locals'] & {encryptionWorker: EncryptionWorker} } ResponseWithSigner |
||||
|
*/ |
||||
|
|
||||
|
// You can use this middleware to assign the EncryptionWorker to the response locals
|
||||
|
// property so that it can be used to encrypt/decrypt arbitrary data via the response object.
|
||||
|
// eg: Encrypting API keys in chunk sources.
|
||||
|
|
||||
|
// The way this functions is that the rolling RSA Communication Key is used server-side to private-key encrypt the raw
|
||||
|
// key of the persistent EncryptionManager credentials. Since EncryptionManager credentials do _not_ roll, we should not send them
|
||||
|
// even between server<>collector in plaintext because if the user configured the server/collector to be public they could technically
|
||||
|
// be exposing the key in transit via the X-Payload-Signer header. Even if this risk is minimal we should not do this.
|
||||
|
|
||||
|
// This middleware uses the CommunicationKey public key to first decrypt the base64 representation of the EncryptionManager credentials
|
||||
|
// and then loads that in to the EncryptionWorker as a buffer so we can use the same credentials across the system. Should we ever break the
|
||||
|
// collector out into its own service this would still work without SSL/TLS.
|
||||
|
|
||||
|
/** |
||||
|
* |
||||
|
* @param {import("express").Request} request |
||||
|
* @param {import("express").Response} response |
||||
|
* @param {import("express").NextFunction} next |
||||
|
*/ |
||||
|
function setDataSigner(request, response, next) { |
||||
|
const comKey = new CommunicationKey(); |
||||
|
const encryptedPayloadSigner = request.header("X-Payload-Signer"); |
||||
|
if (!encryptedPayloadSigner) console.log('Failed to find signed-payload to set encryption worker! Encryption calls will fail.'); |
||||
|
|
||||
|
const decryptedPayloadSignerKey = comKey.decrypt(encryptedPayloadSigner); |
||||
|
const encryptionWorker = new EncryptionWorker(decryptedPayloadSignerKey); |
||||
|
response.locals.encryptionWorker = encryptionWorker; |
||||
|
next(); |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
setDataSigner |
||||
|
} |
||||
@ -0,0 +1,21 @@ |
|||||
|
const { CommunicationKey } = require("../utils/comKey"); |
||||
|
|
||||
|
function verifyPayloadIntegrity(request, response, next) { |
||||
|
const comKey = new CommunicationKey(); |
||||
|
if (process.env.NODE_ENV === "development") { |
||||
|
comKey.log('verifyPayloadIntegrity is skipped in development.') |
||||
|
next(); |
||||
|
return; |
||||
|
} |
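// Reject the request unless the X-Integrity header is present and its signature verifies against the raw request body via the CommunicationKey.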
||||
|
|
||||
|
const signature = request.header("X-Integrity"); |
||||
|
if (!signature) return response.status(400).json({ msg: 'Failed integrity signature check.' }) |
||||
|
|
||||
|
const validSignedPayload = comKey.verify(signature, request.body); |
||||
|
if (!validSignedPayload) return response.status(400).json({ msg: 'Failed integrity signature check.' }) |
||||
|
next(); |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
verifyPayloadIntegrity |
||||
|
} |
||||
@ -0,0 +1,3 @@
{
  "events": {}
}
@ -0,0 +1,54 @@ |
|||||
|
{ |
||||
|
"name": "anything-llm-document-collector", |
||||
|
"version": "0.2.0", |
||||
|
"description": "Document collector server endpoints", |
||||
|
"main": "index.js", |
||||
|
"author": "Timothy Carambat (Mintplex Labs)", |
||||
|
"license": "MIT", |
||||
|
"private": false, |
||||
|
"engines": { |
||||
|
"node": ">=18.12.1" |
||||
|
}, |
||||
|
"scripts": { |
||||
|
"dev": "cross-env NODE_ENV=development nodemon --ignore hotdir --ignore storage --trace-warnings index.js", |
||||
|
"start": "cross-env NODE_ENV=production node index.js", |
||||
|
"lint": "yarn prettier --ignore-path ../.prettierignore --write ./processSingleFile ./processLink ./utils index.js" |
||||
|
}, |
||||
|
"dependencies": { |
||||
|
"@langchain/community": "^0.2.23", |
||||
|
"@xenova/transformers": "^2.11.0", |
||||
|
"bcrypt": "^5.1.0", |
||||
|
"body-parser": "^1.20.2", |
||||
|
"cors": "^2.8.5", |
||||
|
"dotenv": "^16.0.3", |
||||
|
"epub2": "^3.0.2", |
||||
|
"express": "^4.18.2", |
||||
|
"fluent-ffmpeg": "^2.1.2", |
||||
|
"html-to-text": "^9.0.5", |
||||
|
"ignore": "^5.3.0", |
||||
|
"js-tiktoken": "^1.0.8", |
||||
|
"langchain": "0.1.36", |
||||
|
"mammoth": "^1.6.0", |
||||
|
"mbox-parser": "^1.0.1", |
||||
|
"mime": "^3.0.0", |
||||
|
"moment": "^2.29.4", |
||||
|
"node-html-parser": "^6.1.13", |
||||
|
"node-xlsx": "^0.24.0", |
||||
|
"officeparser": "^4.0.5", |
||||
|
"openai": "4.38.5", |
||||
|
"pdf-parse": "^1.1.1", |
||||
|
"puppeteer": "~21.5.2", |
||||
|
"sharp": "^0.33.5", |
||||
|
"slugify": "^1.6.6", |
||||
|
"tesseract.js": "^6.0.0", |
||||
|
"url-pattern": "^1.0.3", |
||||
|
"uuid": "^9.0.0", |
||||
|
"wavefile": "^11.0.0", |
||||
|
"winston": "^3.13.0", |
||||
|
"youtubei.js": "^9.1.0" |
||||
|
}, |
||||
|
"devDependencies": { |
||||
|
"nodemon": "^2.0.22", |
||||
|
"prettier": "^2.4.1" |
||||
|
} |
||||
|
} |
||||
@ -0,0 +1,127 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { |
||||
|
PuppeteerWebBaseLoader, |
||||
|
} = require("langchain/document_loaders/web/puppeteer"); |
||||
|
const { writeToServerDocuments } = require("../../utils/files"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
/** |
||||
|
* Scrape a generic URL and return the content in the specified format |
||||
|
* @param {string} link - The URL to scrape |
||||
|
* @param {('html' | 'text')} captureAs - The format to capture the page content as |
||||
|
* @param {boolean} processAsDocument - Whether to process the content as a document or return the content directly |
||||
|
* @returns {Promise<Object>} - The content of the page |
||||
|
*/ |
||||
|
async function scrapeGenericUrl( |
||||
|
link, |
||||
|
captureAs = "text", |
||||
|
processAsDocument = true |
||||
|
) { |
||||
|
console.log(`-- Working URL ${link} => (${captureAs}) --`); |
||||
|
const content = await getPageContent(link, captureAs); |
||||
|
|
||||
|
if (!content.length) { |
||||
|
console.error(`Resulting URL content was empty at ${link}.`); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No URL content found at ${link}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
if (!processAsDocument) { |
||||
|
return { |
||||
|
success: true, |
||||
|
content, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const url = new URL(link); |
||||
|
const decodedPathname = decodeURIComponent(url.pathname); |
||||
|
const filename = `${url.hostname}${decodedPathname.replace(/\//g, "_")}`; |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + slugify(filename) + ".html", |
||||
|
title: slugify(filename) + ".html", |
||||
|
docAuthor: "no author found", |
||||
|
description: "No description found.", |
||||
|
docSource: "URL link uploaded by the user.", |
||||
|
chunkSource: `link://${link}`, |
||||
|
published: new Date().toLocaleString(), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`url-${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
console.log(`[SUCCESS]: URL ${link} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Get the content of a page |
||||
|
* @param {string} link - The URL to get the content of |
||||
|
* @param {('html' | 'text')} captureAs - The format to capture the page content as |
||||
|
* @returns {Promise<string>} - The content of the page |
||||
|
*/ |
||||
|
async function getPageContent(link, captureAs = "text") { |
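// First try loading the page with headless Puppeteer so JS-rendered content is captured; if that throws, fall back to a plain fetch below.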
||||
|
try { |
||||
|
let pageContents = []; |
||||
|
const loader = new PuppeteerWebBaseLoader(link, { |
||||
|
launchOptions: { |
||||
|
headless: "new", |
||||
|
ignoreHTTPSErrors: true, |
||||
|
}, |
||||
|
gotoOptions: { |
||||
|
waitUntil: "networkidle2", |
||||
|
}, |
||||
|
async evaluate(page, browser) { |
||||
|
const result = await page.evaluate((captureAs) => { |
||||
|
if (captureAs === "text") return document.body.innerText; |
||||
|
if (captureAs === "html") return document.documentElement.innerHTML; |
||||
|
return document.body.innerText; |
||||
|
}, captureAs); |
||||
|
await browser.close(); |
||||
|
return result; |
||||
|
}, |
||||
|
}); |
||||
|
|
||||
|
const docs = await loader.load(); |
||||
|
|
||||
|
for (const doc of docs) { |
||||
|
pageContents.push(doc.pageContent); |
||||
|
} |
||||
|
|
||||
|
return pageContents.join(" "); |
||||
|
} catch (error) { |
||||
|
console.error( |
||||
|
"getPageContent failed to be fetched by puppeteer - falling back to fetch!", |
||||
|
error |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
try { |
||||
|
const pageText = await fetch(link, { |
||||
|
method: "GET", |
||||
|
headers: { |
||||
|
"Content-Type": "text/plain", |
||||
|
"User-Agent": |
||||
|
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)", |
||||
|
}, |
||||
|
}).then((res) => res.text()); |
||||
|
return pageText; |
||||
|
} catch (error) { |
||||
|
console.error("getPageContent failed to be fetched by any method.", error); |
||||
|
} |
||||
|
|
||||
|
return null; |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
scrapeGenericUrl, |
||||
|
}; |
||||
@ -0,0 +1,23 @@ |
|||||
|
const { validURL } = require("../utils/url"); |
||||
|
const { scrapeGenericUrl } = require("./convert/generic"); |
||||
|
|
||||
|
async function processLink(link) { |
||||
|
if (!validURL(link)) return { success: false, reason: "Not a valid URL." }; |
||||
|
return await scrapeGenericUrl(link); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Get the text content of a link |
||||
|
* @param {string} link - The link to get the text content of |
||||
|
* @param {('html' | 'text' | 'json')} captureAs - The format to capture the page content as |
||||
|
* @returns {Promise<{success: boolean, content: string}>} - Response from collector |
||||
|
*/ |
||||
|
async function getLinkText(link, captureAs = "text") { |
||||
|
if (!validURL(link)) return { success: false, reason: "Not a valid URL." }; |
||||
|
return await scrapeGenericUrl(link, captureAs, false); |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
processLink, |
||||
|
getLinkText, |
||||
|
}; |
||||
@ -0,0 +1,69 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { writeToServerDocuments } = require("../utils/files"); |
||||
|
const { tokenizeString } = require("../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
// Removes the last ".extension" from the input, then slugifies it and lowercases it.
|
||||
|
function stripAndSlug(input) { |
||||
|
if (!input.includes('.')) return slugify(input, { lower: true }); |
||||
|
return slugify(input.split('.').slice(0, -1).join('-'), { lower: true }) |
||||
|
} |
||||
|
|
||||
|
const METADATA_KEYS = { |
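// Each normalizer below coerces one metadata field and falls back to a sensible default when the value is missing or of the wrong type.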
||||
|
possible: { |
||||
|
url: ({ url, title }) => { |
||||
|
let validUrl; |
||||
|
try { |
||||
|
const u = new URL(url); |
||||
|
validUrl = ["https:", "http:"].includes(u.protocol); |
||||
|
} catch { } |
||||
|
|
||||
|
if (validUrl) return `web://${url.toLowerCase()}.website`; |
||||
|
return `file://${stripAndSlug(title)}.txt`; |
||||
|
}, |
||||
|
title: ({ title }) => `${stripAndSlug(title)}.txt`, |
||||
|
docAuthor: ({ docAuthor }) => { return typeof docAuthor === 'string' ? docAuthor : 'no author specified' }, |
||||
|
description: ({ description }) => { return typeof description === 'string' ? description : 'no description found' }, |
||||
|
docSource: ({ docSource }) => { return typeof docSource === 'string' ? docSource : 'no source set' }, |
||||
|
chunkSource: ({ chunkSource, title }) => { return typeof chunkSource === 'string' ? chunkSource : `${stripAndSlug(title)}.txt` }, |
||||
|
published: ({ published }) => { |
||||
|
if (isNaN(Number(published))) return new Date().toLocaleString(); |
||||
|
return new Date(Number(published)).toLocaleString() |
||||
|
}, |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
async function processRawText(textContent, metadata) { |
||||
|
console.log(`-- Working Raw Text doc ${metadata.title} --`); |
||||
|
if (!textContent || textContent.length === 0) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "textContent was empty - nothing to process.", |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: METADATA_KEYS.possible.url(metadata), |
||||
|
title: METADATA_KEYS.possible.title(metadata), |
||||
|
docAuthor: METADATA_KEYS.possible.docAuthor(metadata), |
||||
|
description: METADATA_KEYS.possible.description(metadata), |
||||
|
docSource: METADATA_KEYS.possible.docSource(metadata), |
||||
|
chunkSource: METADATA_KEYS.possible.chunkSource(metadata), |
||||
|
published: METADATA_KEYS.possible.published(metadata), |
||||
|
wordCount: textContent.split(" ").length, |
||||
|
pageContent: textContent, |
||||
|
token_count_estimate: tokenizeString(textContent), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`raw-${stripAndSlug(metadata.title)}-${data.id}` |
||||
|
); |
||||
|
console.log(`[SUCCESS]: Raw text and metadata saved & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = { processRawText } |
||||
@ -0,0 +1,73 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const { LocalWhisper } = require("../../utils/WhisperProviders/localWhisper"); |
||||
|
const { OpenAiWhisper } = require("../../utils/WhisperProviders/OpenAiWhisper"); |
||||
|
|
||||
|
const WHISPER_PROVIDERS = { |
||||
|
openai: OpenAiWhisper, |
||||
|
local: LocalWhisper, |
||||
|
}; |
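// The transcription backend is chosen from options.whisperProvider ("openai" or "local"), defaulting to the local Whisper model.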
||||
|
|
||||
|
async function asAudio({ fullFilePath = "", filename = "", options = {} }) { |
||||
|
const WhisperProvider = WHISPER_PROVIDERS.hasOwnProperty( |
||||
|
options?.whisperProvider |
||||
|
) |
||||
|
? WHISPER_PROVIDERS[options?.whisperProvider] |
||||
|
: WHISPER_PROVIDERS.local; |
||||
|
|
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
const whisper = new WhisperProvider({ options }); |
||||
|
const { content, error } = await whisper.processFile(fullFilePath, filename); |
||||
|
|
||||
|
if (!!error) { |
||||
|
console.error(`Error encountered for parsing of ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: error, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
if (!content?.length) { |
||||
|
console.error(`Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: "no author found", |
||||
|
description: "No description found.", |
||||
|
docSource: "pdf file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log( |
||||
|
`[SUCCESS]: ${filename} transcribed, converted & ready for embedding.\n` |
||||
|
); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asAudio; |
||||
@ -0,0 +1,57 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { DocxLoader } = require("langchain/document_loaders/fs/docx"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
async function asDocX({ fullFilePath = "", filename = "" }) { |
||||
|
const loader = new DocxLoader(fullFilePath); |
||||
|
|
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
let pageContent = []; |
||||
|
const docs = await loader.load(); |
||||
|
for (const doc of docs) { |
||||
|
console.log(`-- Parsing content from docx page --`); |
||||
|
if (!doc.pageContent.length) continue; |
||||
|
pageContent.push(doc.pageContent); |
||||
|
} |
||||
|
|
||||
|
if (!pageContent.length) { |
||||
|
console.error(`Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const content = pageContent.join(""); |
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: "no author found", |
||||
|
description: "No description found.", |
||||
|
docSource: "pdf file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asDocX; |
||||
@ -0,0 +1,55 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { EPubLoader } = require("langchain/document_loaders/fs/epub"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
async function asEPub({ fullFilePath = "", filename = "" }) { |
||||
|
let content = ""; |
||||
|
try { |
||||
|
const loader = new EPubLoader(fullFilePath, { splitChapters: false }); |
||||
|
const docs = await loader.load(); |
||||
|
docs.forEach((doc) => (content += doc.pageContent)); |
||||
|
} catch (err) { |
||||
|
console.error("Could not read epub file!", err); |
||||
|
} |
||||
|
|
||||
|
if (!content?.length) { |
||||
|
console.error(`Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: "Unknown", // TODO: Find a better author
|
||||
|
description: "Unknown", // TODO: Find a better description
|
||||
|
docSource: "a epub file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asEPub; |
||||
@ -0,0 +1,48 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const OCRLoader = require("../../utils/OCRLoader"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
async function asImage({ fullFilePath = "", filename = "" }) { |
||||
|
let content = await new OCRLoader().ocrImage(fullFilePath); |
||||
|
|
||||
|
if (!content?.length) { |
||||
|
console.error(`Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: "Unknown", // TODO: Find a better author
|
||||
|
description: "Unknown", // TODO: Find a better description
|
||||
|
docSource: "a text file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asImage; |
||||
@ -0,0 +1,74 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const fs = require("fs"); |
||||
|
const { mboxParser } = require("mbox-parser"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
async function asMbox({ fullFilePath = "", filename = "" }) { |
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
|
||||
|
const mails = await mboxParser(fs.createReadStream(fullFilePath)) |
||||
|
.then((mails) => mails) |
||||
|
.catch((error) => { |
||||
|
console.log(`Could not parse mail items`, error); |
||||
|
return []; |
||||
|
}); |
||||
|
|
||||
|
if (!mails.length) { |
||||
|
console.error(`Resulting mail items were empty for ${filename}.`);
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No mail items found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
let item = 1; |
||||
|
const documents = []; |
||||
|
for (const mail of mails) { |
||||
|
if (!mail.hasOwnProperty("text")) continue; |
||||
|
|
||||
|
const content = mail.text; |
||||
|
if (!content) continue; |
||||
|
console.log( |
||||
|
`-- Working on message "${mail.subject || "Unknown subject"}" --` |
||||
|
); |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: mail?.subject |
||||
|
? slugify(mail?.subject?.replace(".", "")) + ".mbox" |
||||
|
: `msg_${item}-${filename}`, |
||||
|
docAuthor: mail?.from?.text, |
||||
|
description: "No description found.", |
||||
|
docSource: "Mbox message file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
item++; |
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}-msg-${item}` |
||||
|
); |
||||
|
documents.push(document); |
||||
|
} |
||||
|
|
||||
|
trashFile(fullFilePath); |
||||
|
console.log( |
||||
|
`[SUCCESS]: ${filename} messages converted & ready for embedding.\n` |
||||
|
); |
||||
|
return { success: true, reason: null, documents }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asMbox; |
||||
@ -0,0 +1,53 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const officeParser = require("officeparser"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
async function asOfficeMime({ fullFilePath = "", filename = "" }) { |
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
let content = ""; |
||||
|
try { |
||||
|
content = await officeParser.parseOfficeAsync(fullFilePath); |
||||
|
} catch (error) { |
||||
|
console.error(`Could not parse office or office-like file`, error); |
||||
|
} |
||||
|
|
||||
|
if (!content.length) { |
||||
|
console.error(`Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: "no author found", |
||||
|
description: "No description found.", |
||||
|
docSource: "Office file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asOfficeMime; |
||||
@ -0,0 +1,97 @@ |
|||||
|
const fs = require("fs").promises; |
||||
|
|
||||
|
class PDFLoader { |
||||
|
constructor(filePath, { splitPages = true } = {}) { |
||||
|
this.filePath = filePath; |
||||
|
this.splitPages = splitPages; |
||||
|
} |
||||
|
|
||||
|
async load() { |
||||
|
const buffer = await fs.readFile(this.filePath); |
||||
|
const { getDocument, version } = await this.getPdfJS(); |
||||
|
|
||||
|
const pdf = await getDocument({ |
||||
|
data: new Uint8Array(buffer), |
||||
|
useWorkerFetch: false, |
||||
|
isEvalSupported: false, |
||||
|
useSystemFonts: true, |
||||
|
}).promise; |
||||
|
|
||||
|
const meta = await pdf.getMetadata().catch(() => null); |
||||
|
const documents = []; |
||||
|
|
||||
|
for (let i = 1; i <= pdf.numPages; i += 1) { |
||||
|
const page = await pdf.getPage(i); |
||||
|
const content = await page.getTextContent(); |
||||
|
|
||||
|
if (content.items.length === 0) { |
||||
|
continue; |
||||
|
} |
||||
|
|
||||
|
let lastY; |
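// item.transform[5] is the text item's Y position; when it changes between items we start a new line.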
||||
|
const textItems = []; |
||||
|
for (const item of content.items) { |
||||
|
if ("str" in item) { |
||||
|
if (lastY === item.transform[5] || !lastY) { |
||||
|
textItems.push(item.str); |
||||
|
} else { |
||||
|
textItems.push(`\n${item.str}`); |
||||
|
} |
||||
|
lastY = item.transform[5]; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
const text = textItems.join(""); |
||||
|
documents.push({ |
||||
|
pageContent: text.trim(), |
||||
|
metadata: { |
||||
|
source: this.filePath, |
||||
|
pdf: { |
||||
|
version, |
||||
|
info: meta?.info, |
||||
|
metadata: meta?.metadata, |
||||
|
totalPages: pdf.numPages, |
||||
|
}, |
||||
|
loc: { pageNumber: i }, |
||||
|
}, |
||||
|
}); |
||||
|
} |
||||
|
|
||||
|
if (this.splitPages) { |
||||
|
return documents; |
||||
|
} |
||||
|
|
||||
|
if (documents.length === 0) { |
||||
|
return []; |
||||
|
} |
||||
|
|
||||
|
return [ |
||||
|
{ |
||||
|
pageContent: documents.map((doc) => doc.pageContent).join("\n\n"), |
||||
|
metadata: { |
||||
|
source: this.filePath, |
||||
|
pdf: { |
||||
|
version, |
||||
|
info: meta?.info, |
||||
|
metadata: meta?.metadata, |
||||
|
totalPages: pdf.numPages, |
||||
|
}, |
||||
|
}, |
||||
|
}, |
||||
|
]; |
||||
|
} |
||||
|
|
||||
|
async getPdfJS() { |
||||
|
try { |
||||
|
const pdfjs = await import("pdf-parse/lib/pdf.js/v1.10.100/build/pdf.js"); |
||||
|
return { getDocument: pdfjs.getDocument, version: pdfjs.version }; |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
throw new Error( |
||||
|
"Failed to load pdf-parse. Please install it with eg. `npm install pdf-parse`." |
||||
|
); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = PDFLoader; |
||||
@ -0,0 +1,72 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../../utils/files"); |
||||
|
const { tokenizeString } = require("../../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const PDFLoader = require("./PDFLoader"); |
||||
|
const OCRLoader = require("../../../utils/OCRLoader"); |
||||
|
|
||||
|
async function asPdf({ fullFilePath = "", filename = "" }) { |
||||
|
const pdfLoader = new PDFLoader(fullFilePath, { |
||||
|
splitPages: true, |
||||
|
}); |
||||
|
|
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
const pageContent = []; |
||||
|
let docs = await pdfLoader.load(); |
||||
|
|
||||
|
if (docs.length === 0) { |
||||
|
console.log( |
||||
|
`[asPDF] No text content found for ${filename}. Will attempt OCR parse.` |
||||
|
); |
||||
|
docs = await new OCRLoader().ocrPDF(fullFilePath); |
||||
|
} |
||||
|
|
||||
|
for (const doc of docs) { |
||||
|
console.log( |
||||
|
`-- Parsing content from pg ${ |
||||
|
doc.metadata?.loc?.pageNumber || "unknown" |
||||
|
} --`
|
||||
|
); |
||||
|
if (!doc.pageContent || !doc.pageContent.length) continue; |
||||
|
pageContent.push(doc.pageContent); |
||||
|
} |
||||
|
|
||||
|
if (!pageContent.length) { |
||||
|
console.error(`[asPDF] Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const content = pageContent.join(""); |
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: docs[0]?.metadata?.pdf?.info?.Creator || "no author found", |
||||
|
description: docs[0]?.metadata?.pdf?.info?.Title || "No description found.", |
||||
|
docSource: "pdf file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asPdf; |
||||
@ -0,0 +1,53 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const fs = require("fs"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
async function asTxt({ fullFilePath = "", filename = "" }) { |
||||
|
let content = ""; |
||||
|
try { |
||||
|
content = fs.readFileSync(fullFilePath, "utf8"); |
||||
|
} catch (err) { |
||||
|
console.error("Could not read file!", err); |
||||
|
} |
||||
|
|
||||
|
if (!content?.length) { |
||||
|
console.error(`Resulting text content was empty for ${filename}.`); |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No text content found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`-- Working ${filename} --`); |
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + fullFilePath, |
||||
|
title: filename, |
||||
|
docAuthor: "Unknown", // TODO: Find a better author
|
||||
|
description: "Unknown", // TODO: Find a better description
|
||||
|
docSource: "a text file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(filename)}-${data.id}` |
||||
|
); |
||||
|
trashFile(fullFilePath); |
||||
|
console.log(`[SUCCESS]: ${filename} converted & ready for embedding.\n`); |
||||
|
return { success: true, reason: null, documents: [document] }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asTxt; |
||||
@ -0,0 +1,113 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const xlsx = require("node-xlsx").default; |
||||
|
const path = require("path"); |
||||
|
const fs = require("fs"); |
||||
|
const { |
||||
|
createdDate, |
||||
|
trashFile, |
||||
|
writeToServerDocuments, |
||||
|
} = require("../../utils/files"); |
||||
|
const { tokenizeString } = require("../../utils/tokenizer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
|
||||
|
function convertToCSV(data) { |
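// Minimal CSV serializer: null/undefined cells become empty strings and string cells containing commas are quoted; embedded quotes and newlines are not escaped.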
||||
|
return data |
||||
|
.map((row) => |
||||
|
row |
||||
|
.map((cell) => { |
||||
|
if (cell === null || cell === undefined) return ""; |
||||
|
if (typeof cell === "string" && cell.includes(",")) |
||||
|
return `"${cell}"`; |
||||
|
return cell; |
||||
|
}) |
||||
|
.join(",") |
||||
|
) |
||||
|
.join("\n"); |
||||
|
} |
||||
|
|
||||
|
async function asXlsx({ fullFilePath = "", filename = "" }) { |
||||
|
const documents = []; |
||||
|
const folderName = slugify(`${path.basename(filename)}-${v4().slice(0, 4)}`, { |
||||
|
lower: true, |
||||
|
trim: true, |
||||
|
}); |
||||
|
|
||||
|
const outFolderPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve( |
||||
|
__dirname, |
||||
|
`../../../server/storage/documents/${folderName}` |
||||
|
) |
||||
|
: path.resolve(process.env.STORAGE_DIR, `documents/${folderName}`); |
||||
|
|
||||
|
try { |
||||
|
const workSheetsFromFile = xlsx.parse(fullFilePath); |
||||
|
if (!fs.existsSync(outFolderPath)) |
||||
|
fs.mkdirSync(outFolderPath, { recursive: true }); |
||||
|
|
||||
|
for (const sheet of workSheetsFromFile) { |
||||
|
try { |
||||
|
const { name, data } = sheet; |
||||
|
const content = convertToCSV(data); |
||||
|
|
||||
|
if (!content?.length) { |
||||
|
console.warn(`Sheet "${name}" is empty. Skipping.`); |
||||
|
continue; |
||||
|
} |
||||
|
|
||||
|
console.log(`-- Processing sheet: ${name} --`); |
||||
|
const sheetData = { |
||||
|
id: v4(), |
||||
|
url: `file://${path.join(outFolderPath, `${slugify(name)}.csv`)}`, |
||||
|
title: `${filename} - Sheet:${name}`, |
||||
|
docAuthor: "Unknown", |
||||
|
description: `Spreadsheet data from sheet: ${name}`, |
||||
|
docSource: "an xlsx file uploaded by the user.", |
||||
|
chunkSource: "", |
||||
|
published: createdDate(fullFilePath), |
||||
|
wordCount: content.split(/\s+/).length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
const document = writeToServerDocuments( |
||||
|
sheetData, |
||||
|
`sheet-${slugify(name)}`, |
||||
|
outFolderPath |
||||
|
); |
||||
|
documents.push(document); |
||||
|
console.log( |
||||
|
`[SUCCESS]: Sheet "${name}" converted & ready for embedding.` |
||||
|
); |
||||
|
} catch (err) { |
||||
|
console.error(`Error processing sheet "${name}":`, err); |
||||
|
continue; |
||||
|
} |
||||
|
} |
||||
|
} catch (err) { |
||||
|
console.error("Could not process xlsx file!", err); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `Error processing ${filename}: ${err.message}`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} finally { |
||||
|
trashFile(fullFilePath); |
||||
|
} |
||||
|
|
||||
|
if (documents.length === 0) { |
||||
|
console.error(`No valid sheets found in ${filename}.`); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No valid sheets found in ${filename}.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log( |
||||
|
`[SUCCESS]: ${filename} fully processed. Created ${documents.length} document(s).\n` |
||||
|
); |
||||
|
return { success: true, reason: null, documents }; |
||||
|
} |
||||
|
|
||||
|
module.exports = asXlsx; |
||||
@ -0,0 +1,78 @@ |
|||||
|
const path = require("path"); |
||||
|
const fs = require("fs"); |
||||
|
const { |
||||
|
WATCH_DIRECTORY, |
||||
|
SUPPORTED_FILETYPE_CONVERTERS, |
||||
|
} = require("../utils/constants"); |
||||
|
const { |
||||
|
trashFile, |
||||
|
isTextType, |
||||
|
normalizePath, |
||||
|
isWithin, |
||||
|
} = require("../utils/files"); |
||||
|
const RESERVED_FILES = ["__HOTDIR__.md"]; |
||||
|
|
||||
|
async function processSingleFile(targetFilename, options = {}) { |
||||
|
const fullFilePath = path.resolve( |
||||
|
WATCH_DIRECTORY, |
||||
|
normalizePath(targetFilename) |
||||
|
); |
||||
|
if (!isWithin(path.resolve(WATCH_DIRECTORY), fullFilePath)) |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Filename is a not a valid path to process.", |
||||
|
documents: [], |
||||
|
}; |
||||
|
|
||||
|
if (RESERVED_FILES.includes(targetFilename)) |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Filename is a reserved filename and cannot be processed.", |
||||
|
documents: [], |
||||
|
}; |
||||
|
if (!fs.existsSync(fullFilePath)) |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "File does not exist in upload directory.", |
||||
|
documents: [], |
||||
|
}; |
||||
|
|
||||
|
const fileExtension = path.extname(fullFilePath).toLowerCase(); |
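// Names containing a "." that still yield no extension (e.g. dotfiles like ".env") are rejected here; truly extensionless files fall through to the text-type check below.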
||||
|
if (fullFilePath.includes(".") && !fileExtension) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `No file extension found. This file cannot be processed.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
let processFileAs = fileExtension; |
||||
|
if (!SUPPORTED_FILETYPE_CONVERTERS.hasOwnProperty(fileExtension)) { |
||||
|
if (isTextType(fullFilePath)) { |
||||
|
console.log( |
||||
|
`\x1b[33m[Collector]\x1b[0m The provided filetype of ${fileExtension} does not have a preset and will be processed as .txt.` |
||||
|
); |
||||
|
processFileAs = ".txt"; |
||||
|
} else { |
||||
|
trashFile(fullFilePath); |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: `File extension ${fileExtension} not supported for parsing and cannot be assumed as text file type.`, |
||||
|
documents: [], |
||||
|
}; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
const FileTypeProcessor = require(SUPPORTED_FILETYPE_CONVERTERS[ |
||||
|
processFileAs |
||||
|
]); |
||||
|
return await FileTypeProcessor({ |
||||
|
fullFilePath, |
||||
|
filename: targetFilename, |
||||
|
options, |
||||
|
}); |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
processSingleFile, |
||||
|
}; |
||||
@ -0,0 +1,2 @@
tmp/*
!tmp/.placeholder
@ -0,0 +1,77 @@ |
|||||
|
const crypto = require("crypto"); |
||||
|
|
||||
|
// Differs from EncryptionManager in that it does not set or define the keys used to encrypt or read data;
// it must be given the key explicitly (as a base64 string) when the class is created.
// This key should be the same `key` that is used by the EncryptionManager class.
|
||||
|
class EncryptionWorker { |
||||
|
constructor(presetKeyBase64 = "") { |
||||
|
this.key = Buffer.from(presetKeyBase64, "base64"); |
||||
|
this.algorithm = "aes-256-cbc"; |
||||
|
this.separator = ":"; |
||||
|
} |
||||
|
|
||||
|
log(text, ...args) { |
||||
|
console.log(`\x1b[36m[EncryptionManager]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Give a chunk source, parse its payload query param and expand that object back into the URL |
||||
|
* as additional query params |
||||
|
* @param {string} chunkSource |
||||
|
* @returns {URL} Javascript URL object with query params decrypted from payload query param. |
||||
|
*/ |
||||
|
expandPayload(chunkSource = "") { |
||||
|
try { |
||||
|
const url = new URL(chunkSource); |
||||
|
if (!url.searchParams.has("payload")) return url; |
||||
|
|
||||
|
const decryptedPayload = this.decrypt(url.searchParams.get("payload")); |
||||
|
const encodedParams = JSON.parse(decryptedPayload); |
||||
|
url.searchParams.delete("payload"); // remove payload prop
|
||||
|
|
||||
|
// Add all query params needed to replay as query params
|
||||
|
Object.entries(encodedParams).forEach(([key, value]) => |
||||
|
url.searchParams.append(key, value) |
||||
|
); |
||||
|
return url; |
||||
|
} catch (e) { |
||||
|
console.error(e); |
||||
|
} |
||||
|
return new URL(chunkSource); |
||||
|
} |
||||
|
|
||||
|
encrypt(plainTextString = null) { |
||||
|
try { |
||||
|
if (!plainTextString) |
||||
|
throw new Error("Empty string is not valid for this method."); |
||||
|
const iv = crypto.randomBytes(16); |
||||
|
const cipher = crypto.createCipheriv(this.algorithm, this.key, iv); |
||||
|
const encrypted = cipher.update(plainTextString, "utf8", "hex"); |
||||
|
return [ |
||||
|
encrypted + cipher.final("hex"), |
||||
|
Buffer.from(iv).toString("hex"), |
||||
|
].join(this.separator); |
||||
|
} catch (e) { |
||||
|
this.log(e); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
decrypt(encryptedString) { |
||||
|
try { |
||||
|
const [encrypted, iv] = encryptedString.split(this.separator); |
||||
|
if (!iv) throw new Error("IV not found"); |
||||
|
const decipher = crypto.createDecipheriv( |
||||
|
this.algorithm, |
||||
|
this.key, |
||||
|
Buffer.from(iv, "hex") |
||||
|
); |
||||
|
return decipher.update(encrypted, "hex", "utf8") + decipher.final("utf8"); |
||||
|
} catch (e) { |
||||
|
this.log(e); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { EncryptionWorker }; |
||||
@ -0,0 +1,307 @@ |
|||||
|
const fs = require("fs"); |
||||
|
const os = require("os"); |
||||
|
const path = require("path"); |
||||
|
|
||||
|
class OCRLoader { |
||||
|
constructor() { |
||||
|
this.cacheDir = path.resolve( |
||||
|
process.env.STORAGE_DIR |
||||
|
? path.resolve(process.env.STORAGE_DIR, `models`, `tesseract`) |
||||
|
: path.resolve(__dirname, `../../../server/storage/models/tesseract`) |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
log(text, ...args) { |
||||
|
console.log(`\x1b[36m[OCRLoader]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Loads a PDF file and returns an array of documents. |
||||
|
* This function is reserved for parsing SCANNED documents - digital documents are not supported by this function |
||||
|
* @returns {Promise<{pageContent: string, metadata: object}[]>} An array of documents with page content and metadata. |
||||
|
*/ |
||||
|
async ocrPDF( |
||||
|
filePath, |
||||
|
{ maxExecutionTime = 300_000, batchSize = 10, maxWorkers = null } = {} |
||||
|
) { |
||||
|
if ( |
||||
|
!filePath || |
||||
|
!fs.existsSync(filePath) || |
||||
|
!fs.statSync(filePath).isFile() |
||||
|
) { |
||||
|
this.log(`File ${filePath} does not exist. Skipping OCR.`); |
||||
|
return []; |
||||
|
} |
||||
|
|
||||
|
const documentTitle = path.basename(filePath); |
||||
|
this.log(`Starting OCR of ${documentTitle}`); |
||||
|
const pdfjs = await import("pdf-parse/lib/pdf.js/v2.0.550/build/pdf.js"); |
||||
|
let buffer = fs.readFileSync(filePath); |
||||
|
|
||||
|
const pdfDocument = await pdfjs.getDocument({ data: buffer }); |
||||
|
|
||||
|
const documents = []; |
||||
|
const meta = await pdfDocument.getMetadata().catch(() => null); |
||||
|
const metadata = { |
||||
|
source: filePath, |
||||
|
pdf: { |
||||
|
version: "v2.0.550", |
||||
|
info: meta?.info, |
||||
|
metadata: meta?.metadata, |
||||
|
totalPages: pdfDocument.numPages, |
||||
|
}, |
||||
|
}; |
||||
|
|
||||
|
const pdfSharp = new PDFSharp({ |
||||
|
validOps: [ |
||||
|
pdfjs.OPS.paintJpegXObject, |
||||
|
pdfjs.OPS.paintImageXObject, |
||||
|
pdfjs.OPS.paintInlineImageXObject, |
||||
|
], |
||||
|
}); |
||||
|
await pdfSharp.init(); |
||||
|
|
||||
|
const { createWorker, OEM } = require("tesseract.js"); |
||||
|
const BATCH_SIZE = batchSize; |
||||
|
const MAX_EXECUTION_TIME = maxExecutionTime; |
||||
|
const NUM_WORKERS = maxWorkers ?? Math.min(os.cpus().length, 4); |
||||
|
const totalPages = pdfDocument.numPages; |
||||
|
const workerPool = await Promise.all( |
||||
|
Array(NUM_WORKERS) |
||||
|
.fill(0) |
||||
|
.map(() => |
||||
|
createWorker("eng", OEM.LSTM_ONLY, { |
||||
|
cachePath: this.cacheDir, |
||||
|
}) |
||||
|
) |
||||
|
); |
||||
|
|
||||
|
const startTime = Date.now(); |
||||
|
try { |
||||
|
this.log("Bootstrapping OCR completed successfully!", { |
||||
|
MAX_EXECUTION_TIME_MS: MAX_EXECUTION_TIME, |
||||
|
BATCH_SIZE, |
||||
|
MAX_CONCURRENT_WORKERS: NUM_WORKERS, |
||||
|
TOTAL_PAGES: totalPages, |
||||
|
}); |
||||
|
const timeoutPromise = new Promise((_, reject) => { |
||||
|
setTimeout(() => { |
||||
|
reject( |
||||
|
new Error( |
||||
|
`OCR job took too long to complete (${ |
||||
|
MAX_EXECUTION_TIME / 1000 |
||||
|
} seconds)`
|
||||
|
) |
||||
|
); |
||||
|
}, MAX_EXECUTION_TIME); |
||||
|
}); |
||||
|
|
||||
|
const processPages = async () => { |
||||
|
for ( |
||||
|
let startPage = 1; |
||||
|
startPage <= totalPages; |
||||
|
startPage += BATCH_SIZE |
||||
|
) { |
||||
|
const endPage = Math.min(startPage + BATCH_SIZE - 1, totalPages); |
||||
|
const pageNumbers = Array.from( |
||||
|
{ length: endPage - startPage + 1 }, |
||||
|
(_, i) => startPage + i |
||||
|
); |
||||
|
this.log(`Working on pages ${startPage} - ${endPage}`); |
||||
|
|
||||
|
const pageQueue = [...pageNumbers]; |
||||
|
const results = []; |
||||
|
const workerPromises = workerPool.map(async (worker, workerIndex) => { |
||||
|
while (pageQueue.length > 0) { |
||||
|
const pageNum = pageQueue.shift(); |
||||
|
this.log( |
||||
|
`\x1b[34m[Worker ${ |
||||
|
workerIndex + 1 |
||||
|
}]\x1b[0m assigned pg${pageNum}`
|
||||
|
); |
||||
|
const page = await pdfDocument.getPage(pageNum); |
||||
|
const imageBuffer = await pdfSharp.pageToBuffer({ page }); |
||||
|
if (!imageBuffer) continue; |
||||
|
const { data } = await worker.recognize(imageBuffer, {}, "text"); |
||||
|
this.log( |
||||
|
`✅ \x1b[34m[Worker ${ |
||||
|
workerIndex + 1 |
||||
|
}]\x1b[0m completed pg${pageNum}`
|
||||
|
); |
||||
|
results.push({ |
||||
|
pageContent: data.text, |
||||
|
metadata: { |
||||
|
...metadata, |
||||
|
loc: { pageNumber: pageNum }, |
||||
|
}, |
||||
|
}); |
||||
|
} |
||||
|
}); |
||||
|
|
||||
|
await Promise.all(workerPromises); |
||||
|
documents.push( |
||||
|
...results.sort( |
||||
|
(a, b) => a.metadata.loc.pageNumber - b.metadata.loc.pageNumber |
||||
|
) |
||||
|
); |
||||
|
} |
||||
|
return documents; |
||||
|
}; |
||||
|
|
||||
|
await Promise.race([timeoutPromise, processPages()]); |
||||
|
} catch (e) { |
||||
|
this.log(`Error: ${e.message}`, e.stack); |
||||
|
} finally { |
||||
|
global.Image = undefined; |
||||
|
await Promise.all(workerPool.map((worker) => worker.terminate())); |
||||
|
} |
||||
|
|
||||
|
this.log(`Completed OCR of ${documentTitle}!`, { |
||||
|
documentsParsed: documents.length, |
||||
|
totalPages: totalPages, |
||||
|
executionTime: `${((Date.now() - startTime) / 1000).toFixed(2)}s`, |
||||
|
}); |
||||
|
return documents; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Loads an image file and returns the OCRed text. |
||||
|
* @param {string} filePath - The path to the image file. |
||||
|
* @param {Object} options - The options for the OCR. |
||||
|
* @param {number} options.maxExecutionTime - The maximum execution time of the OCR in milliseconds. |
||||
|
* @returns {Promise<string>} The OCRed text. |
||||
|
*/ |
||||
|
async ocrImage(filePath, { maxExecutionTime = 300_000 } = {}) { |
||||
|
let content = ""; |
||||
|
let worker = null; |
||||
|
if ( |
||||
|
!filePath || |
||||
|
!fs.existsSync(filePath) || |
||||
|
!fs.statSync(filePath).isFile() |
||||
|
) { |
||||
|
this.log(`File ${filePath} does not exist. Skipping OCR.`); |
||||
|
return null; |
||||
|
} |
||||
|
|
||||
|
const documentTitle = path.basename(filePath); |
||||
|
try { |
||||
|
this.log(`Starting OCR of ${documentTitle}`); |
||||
|
const startTime = Date.now(); |
||||
|
const { createWorker, OEM } = require("tesseract.js"); |
||||
|
worker = await createWorker("eng", OEM.LSTM_ONLY, { |
||||
|
cachePath: this.cacheDir, |
||||
|
}); |
||||
|
|
||||
|
// Race the timeout with the OCR
|
||||
|
const timeoutPromise = new Promise((_, reject) => { |
||||
|
setTimeout(() => { |
||||
|
reject( |
||||
|
new Error( |
||||
|
`OCR job took too long to complete (${ |
||||
|
maxExecutionTime / 1000 |
||||
|
} seconds)`
|
||||
|
) |
||||
|
); |
||||
|
}, maxExecutionTime); |
||||
|
}); |
||||
|
|
||||
|
const processImage = async () => { |
||||
|
const { data } = await worker.recognize(filePath, {}, "text"); |
||||
|
content = data.text; |
||||
|
}; |
||||
|
|
||||
|
await Promise.race([timeoutPromise, processImage()]); |
||||
|
this.log(`Completed OCR of ${documentTitle}!`, { |
||||
|
executionTime: `${((Date.now() - startTime) / 1000).toFixed(2)}s`, |
||||
|
}); |
||||
|
|
||||
|
return content; |
||||
|
} catch (e) { |
||||
|
this.log(`Error: ${e.message}`); |
||||
|
return null; |
||||
|
} finally { |
||||
|
if (!worker) return; |
||||
|
await worker.terminate(); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Converts a PDF page to a buffer using Sharp. |
||||
|
* @param {Object} options - The options for the Sharp PDF page object. |
||||
|
* @param {Object} options.page - The PDFJS page proxy object. |
||||
|
* @returns {Promise<Buffer>} The buffer of the page. |
||||
|
*/ |
||||
|
class PDFSharp { |
||||
|
constructor({ validOps = [] } = {}) { |
||||
|
this.sharp = null; |
||||
|
this.validOps = validOps; |
||||
|
} |
||||
|
|
||||
|
log(text, ...args) { |
||||
|
console.log(`\x1b[36m[PDFSharp]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
async init() { |
||||
|
this.sharp = (await import("sharp")).default; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Converts a PDF page to a buffer. |
||||
|
* @param {Object} options - The options for the Sharp PDF page object. |
||||
|
* @param {Object} options.page - The PDFJS page proxy object. |
||||
|
* @returns {Promise<Buffer>} The buffer of the page. |
||||
|
*/ |
||||
|
async pageToBuffer({ page }) { |
||||
|
if (!this.sharp) await this.init(); |
||||
|
try { |
||||
|
this.log(`Converting page ${page.pageNumber} to image...`); |
||||
|
const ops = await page.getOperatorList(); |
||||
|
const pageImages = ops.fnArray.length; |
||||
|
|
||||
|
for (let i = 0; i < pageImages; i++) { |
||||
|
try { |
||||
|
if (!this.validOps.includes(ops.fnArray[i])) continue; |
||||
|
|
||||
|
const name = ops.argsArray[i][0]; |
||||
|
const img = await page.objs.get(name); |
||||
|
const { width, height } = img; |
||||
|
const size = img.data.length; |
||||
|
const channels = size / width / height; |
||||
|
const targetDPI = 70; |
||||
|
const targetWidth = Math.floor(width * (targetDPI / 72)); |
||||
|
const targetHeight = Math.floor(height * (targetDPI / 72)); |
||||
|
|
||||
|
const image = this.sharp(img.data, { |
||||
|
raw: { width, height, channels }, |
||||
|
density: targetDPI, |
||||
|
}) |
||||
|
.resize({ |
||||
|
width: targetWidth, |
||||
|
height: targetHeight, |
||||
|
fit: "fill", |
||||
|
}) |
||||
|
.withMetadata({ |
||||
|
density: targetDPI, |
||||
|
resolution: targetDPI, |
||||
|
}) |
||||
|
.png(); |
||||
|
|
||||
|
// For debugging purposes
|
||||
|
// await image.toFile(path.resolve(__dirname, `../../storage/`, `pg${page.pageNumber}.png`));
|
||||
|
return await image.toBuffer(); |
||||
|
} catch (error) { |
||||
|
this.log(`Iteration error: ${error.message}`, error.stack); |
||||
|
continue; |
||||
|
} |
||||
|
} |
||||
|
this.log(`No valid images found on page ${page.pageNumber}`); |
||||
|
return null; |
||||
|
} catch (error) { |
||||
|
this.log(`Error: ${error.message}`, error.stack); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = OCRLoader; |
||||
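A hedged sketch of how the OCRLoader above might be driven from a converter; the file paths are placeholders, and the options shown simply override the defaults documented on ocrPDF and ocrImage (tesseract.js, sharp, and the bundled pdf.js must be installed, as they are in the collector).

const OCRLoader = require("./utils/OCRLoader"); // illustrative path

async function runOcr() {
  const ocr = new OCRLoader();

  // Scanned PDFs: one { pageContent, metadata } entry per page that produced text.
  const pages = await ocr.ocrPDF("/tmp/scanned-invoice.pdf", {
    maxExecutionTime: 120_000, // abort the whole job after two minutes
    batchSize: 5,              // pages handled per batch
    maxWorkers: 2,             // cap on concurrent tesseract.js workers
  });
  console.log(pages.map((page) => page.metadata.loc.pageNumber));

  // Single images: the recognized text, or null on failure.
  const text = await ocr.ocrImage("/tmp/receipt.png");
  console.log(text?.slice(0, 80));
}

runOcr();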
@ -0,0 +1,49 @@ |
|||||
|
const fs = require("fs"); |
||||
|
|
||||
|
class OpenAiWhisper { |
||||
|
constructor({ options }) { |
||||
|
const { OpenAI: OpenAIApi } = require("openai"); |
||||
|
if (!options.openAiKey) throw new Error("No OpenAI API key was set."); |
||||
|
|
||||
|
this.openai = new OpenAIApi({ |
||||
|
apiKey: options.openAiKey, |
||||
|
}); |
||||
|
this.model = "whisper-1"; |
||||
|
this.temperature = 0; |
||||
|
this.#log("Initialized."); |
||||
|
} |
||||
|
|
||||
|
#log(text, ...args) { |
||||
|
console.log(`\x1b[32m[OpenAiWhisper]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
async processFile(fullFilePath) { |
||||
|
return await this.openai.audio.transcriptions |
||||
|
.create({ |
||||
|
file: fs.createReadStream(fullFilePath), |
||||
|
model: this.model, |
||||
|
temperature: this.temperature, |
||||
|
}) |
||||
|
.then((response) => { |
||||
|
if (!response) { |
||||
|
return { |
||||
|
content: "", |
||||
|
error: "No content was able to be transcribed.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
return { content: response.text, error: null }; |
||||
|
}) |
||||
|
.catch((error) => { |
||||
|
this.#log( |
||||
|
`Could not get any response from openai whisper`, |
||||
|
error.message |
||||
|
); |
||||
|
return { content: "", error: error.message }; |
||||
|
}); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
OpenAiWhisper, |
||||
|
}; |
||||
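For reference, transcribing a file with the OpenAiWhisper provider above only needs an API key; the require path, env var name, and file path are placeholders.

const { OpenAiWhisper } = require("./utils/WhisperProviders/OpenAiWhisper"); // illustrative path

async function transcribe() {
  const whisper = new OpenAiWhisper({ options: { openAiKey: process.env.OPEN_AI_KEY } });
  const { content, error } = await whisper.processFile("/tmp/meeting.mp3");
  if (error) throw new Error(error);
  console.log(content);
}

transcribe();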
@ -0,0 +1,219 @@ |
|||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const { v4 } = require("uuid"); |
||||
|
const defaultWhisper = "Xenova/whisper-small"; // Model Card: https://huggingface.co/Xenova/whisper-small
|
||||
|
const fileSize = { |
||||
|
"Xenova/whisper-small": "250mb", |
||||
|
"Xenova/whisper-large": "1.56GB", |
||||
|
}; |
||||
|
|
||||
|
class LocalWhisper { |
||||
|
constructor({ options }) { |
||||
|
this.model = options?.WhisperModelPref ?? defaultWhisper; |
||||
|
this.fileSize = fileSize[this.model]; |
||||
|
this.cacheDir = path.resolve( |
||||
|
process.env.STORAGE_DIR |
||||
|
? path.resolve(process.env.STORAGE_DIR, `models`) |
||||
|
: path.resolve(__dirname, `../../../server/storage/models`) |
||||
|
); |
||||
|
|
||||
|
this.modelPath = path.resolve(this.cacheDir, ...this.model.split("/")); |
||||
|
// Create the models directory if it does not already exist (e.g. on existing installations)
|
||||
|
if (!fs.existsSync(this.cacheDir)) |
||||
|
fs.mkdirSync(this.cacheDir, { recursive: true }); |
||||
|
|
||||
|
this.#log("Initialized."); |
||||
|
} |
||||
|
|
||||
|
#log(text, ...args) { |
||||
|
console.log(`\x1b[32m[LocalWhisper]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
#validateAudioFile(wavFile) { |
||||
|
const sampleRate = wavFile.fmt.sampleRate; |
||||
|
const duration = wavFile.data.samples / sampleRate; |
||||
|
|
||||
|
// Most speech recognition systems expect minimum 8kHz
|
||||
|
// But we'll set it lower to be safe
|
||||
|
if (sampleRate < 4000) { |
||||
|
// 4kHz minimum
|
||||
|
throw new Error( |
||||
|
"Audio file sample rate is too low for accurate transcription. Minimum required is 4kHz." |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
// Typical audio file duration limits
|
||||
|
const MAX_DURATION_SECONDS = 4 * 60 * 60; // 4 hours
|
||||
|
if (duration > MAX_DURATION_SECONDS) { |
||||
|
throw new Error("Audio file duration exceeds maximum limit of 4 hours."); |
||||
|
} |
||||
|
|
||||
|
// Check final sample count after upsampling to prevent memory issues
|
||||
|
const targetSampleRate = 16000; |
||||
|
const upsampledSamples = duration * targetSampleRate; |
||||
|
const MAX_SAMPLES = 230_400_000; // ~4 hours at 16kHz
|
||||
|
|
||||
|
if (upsampledSamples > MAX_SAMPLES) { |
||||
|
throw new Error("Audio file exceeds maximum allowed length."); |
||||
|
} |
||||
|
|
||||
|
return true; |
||||
|
} |
||||
|
|
||||
|
async #convertToWavAudioData(sourcePath) { |
||||
|
try { |
||||
|
let buffer; |
||||
|
const wavefile = require("wavefile"); |
||||
|
const ffmpeg = require("fluent-ffmpeg"); |
||||
|
const outFolder = path.resolve(__dirname, `../../storage/tmp`); |
||||
|
if (!fs.existsSync(outFolder)) |
||||
|
fs.mkdirSync(outFolder, { recursive: true }); |
||||
|
|
||||
|
const fileExtension = path.extname(sourcePath).toLowerCase(); |
||||
|
if (fileExtension !== ".wav") { |
||||
|
this.#log( |
||||
|
`File conversion required! ${fileExtension} file detected - converting to .wav` |
||||
|
); |
||||
|
const outputFile = path.resolve(outFolder, `${v4()}.wav`); |
||||
|
const convert = new Promise((resolve) => { |
||||
|
ffmpeg(sourcePath) |
||||
|
.toFormat("wav") |
||||
|
.on("error", (error) => { |
||||
|
this.#log(`Conversion Error! ${error.message}`); |
||||
|
resolve(false); |
||||
|
}) |
||||
|
.on("progress", (progress) => |
||||
|
this.#log( |
||||
|
`Conversion Processing! ${progress.targetSize}KB converted` |
||||
|
) |
||||
|
) |
||||
|
.on("end", () => { |
||||
|
this.#log(`Conversion Complete! File converted to .wav!`); |
||||
|
resolve(true); |
||||
|
}) |
||||
|
.save(outputFile); |
||||
|
}); |
||||
|
const success = await convert; |
||||
|
if (!success) |
||||
|
throw new Error( |
||||
|
"[Conversion Failed]: Could not convert file to .wav format!" |
||||
|
); |
||||
|
|
||||
|
const chunks = []; |
||||
|
const stream = fs.createReadStream(outputFile); |
||||
|
for await (let chunk of stream) chunks.push(chunk); |
||||
|
buffer = Buffer.concat(chunks); |
||||
|
fs.rmSync(outputFile); |
||||
|
} else { |
||||
|
const chunks = []; |
||||
|
const stream = fs.createReadStream(sourcePath); |
||||
|
for await (let chunk of stream) chunks.push(chunk); |
||||
|
buffer = Buffer.concat(chunks); |
||||
|
} |
||||
|
|
||||
|
const wavFile = new wavefile.WaveFile(buffer); |
||||
|
try { |
||||
|
this.#validateAudioFile(wavFile); |
||||
|
} catch (error) { |
||||
|
this.#log(`Audio validation failed: ${error.message}`); |
||||
|
throw new Error(`Invalid audio file: ${error.message}`); |
||||
|
} |
||||
|
|
||||
|
wavFile.toBitDepth("32f"); |
||||
|
wavFile.toSampleRate(16000); |
||||
|
|
||||
|
let audioData = wavFile.getSamples(); |
||||
|
if (Array.isArray(audioData)) { |
||||
|
if (audioData.length > 1) { |
||||
|
const SCALING_FACTOR = Math.sqrt(2); |
||||
|
|
||||
|
// Merge channels into first channel to save memory
|
||||
|
for (let i = 0; i < audioData[0].length; ++i) { |
||||
|
audioData[0][i] = |
||||
|
(SCALING_FACTOR * (audioData[0][i] + audioData[1][i])) / 2; |
||||
|
} |
||||
|
} |
||||
|
audioData = audioData[0]; |
||||
|
} |
||||
|
|
||||
|
return audioData; |
||||
|
} catch (error) { |
||||
|
console.error(`convertToWavAudioData`, error); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
async client() { |
||||
|
if (!fs.existsSync(this.modelPath)) { |
||||
|
this.#log( |
||||
|
`The native whisper model has never been run and will be downloaded right now. Subsequent runs will be faster. (~${this.fileSize})` |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
try { |
||||
|
// Load the ESM-only library via a dynamic import() since this file is CommonJS.
|
||||
|
const pipeline = (...args) => |
||||
|
import("@xenova/transformers").then(({ pipeline }) => |
||||
|
pipeline(...args) |
||||
|
); |
||||
|
return await pipeline("automatic-speech-recognition", this.model, { |
||||
|
cache_dir: this.cacheDir, |
||||
|
...(!fs.existsSync(this.modelPath) |
||||
|
? { |
||||
|
// Show download progress if we need to download any files
|
||||
|
progress_callback: (data) => { |
||||
|
if (!data.hasOwnProperty("progress")) return; |
||||
|
console.log( |
||||
|
`\x1b[34m[LocalWhisper - Downloading Model Files]\x1b[0m ${ |
||||
|
data.file |
||||
|
} ${~~data?.progress}%`
|
||||
|
); |
||||
|
}, |
||||
|
} |
||||
|
: {}), |
||||
|
}); |
||||
|
} catch (error) { |
||||
|
this.#log("Failed to load the native whisper model:", error); |
||||
|
throw error; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
async processFile(fullFilePath, filename) { |
||||
|
try { |
||||
|
const transcriberPromise = new Promise((resolve) => |
||||
|
this.client().then((client) => resolve(client)) |
||||
|
); |
||||
|
const audioDataPromise = new Promise((resolve) => |
||||
|
this.#convertToWavAudioData(fullFilePath).then((audioData) => |
||||
|
resolve(audioData) |
||||
|
) |
||||
|
); |
||||
|
const [audioData, transcriber] = await Promise.all([ |
||||
|
audioDataPromise, |
||||
|
transcriberPromise, |
||||
|
]); |
||||
|
|
||||
|
if (!audioData) { |
||||
|
this.#log(`Failed to parse content from ${filename}.`); |
||||
|
return { |
||||
|
content: null, |
||||
|
error: `Failed to parse content from ${filename}.`, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
this.#log(`Transcribing audio data to text...`); |
||||
|
const { text } = await transcriber(audioData, { |
||||
|
chunk_length_s: 30, |
||||
|
stride_length_s: 5, |
||||
|
}); |
||||
|
|
||||
|
return { content: text, error: null }; |
||||
|
} catch (error) { |
||||
|
return { content: null, error: error.message }; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
LocalWhisper, |
||||
|
}; |
||||
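A sketch of the equivalent local path using the LocalWhisper class above; the first call downloads the Xenova model into the cache directory and subsequent runs reuse it. Paths and the model preference are placeholders.

const { LocalWhisper } = require("./utils/WhisperProviders/localWhisper"); // illustrative path

async function transcribeLocally() {
  // WhisperModelPref is optional; it falls back to Xenova/whisper-small (~250mb download).
  const whisper = new LocalWhisper({ options: { WhisperModelPref: "Xenova/whisper-small" } });

  // Audio is converted to 16kHz mono WAV data before being fed to the transcriber.
  const { content, error } = await whisper.processFile("/tmp/podcast.mp3", "podcast.mp3");
  if (error) throw new Error(error);
  console.log(content);
}

transcribeLocally();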
@ -0,0 +1,54 @@ |
|||||
|
const crypto = require("crypto"); |
||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const keyPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve(__dirname, `../../../server/storage/comkey`) |
||||
|
: path.resolve( |
||||
|
process.env.STORAGE_DIR ?? |
||||
|
path.resolve(__dirname, `../../../server/storage`), |
||||
|
`comkey` |
||||
|
); |
||||
|
|
||||
|
class CommunicationKey { |
||||
|
#pubKeyName = "ipc-pub.pem"; |
||||
|
#storageLoc = keyPath; |
||||
|
|
||||
|
constructor() {} |
||||
|
|
||||
|
log(text, ...args) { |
||||
|
console.log(`\x1b[36m[CommunicationKeyVerify]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
#readPublicKey() { |
||||
|
return fs.readFileSync(path.resolve(this.#storageLoc, this.#pubKeyName)); |
||||
|
} |
||||
|
|
||||
|
// Given a payload signed with the private key from /app/server/, the signature should
|
||||
|
// verify against the textData provided. This class does verification only in the collector.
|
||||
|
// Note: The textData is typically the JSON stringified body sent to the document processor API.
|
||||
|
verify(signature = "", textData = "") { |
||||
|
try { |
||||
|
let data = textData; |
||||
|
if (typeof textData !== "string") data = JSON.stringify(data); |
||||
|
return crypto.verify( |
||||
|
"RSA-SHA256", |
||||
|
Buffer.from(data), |
||||
|
this.#readPublicKey(), |
||||
|
Buffer.from(signature, "hex") |
||||
|
); |
||||
|
} catch {} |
||||
|
return false; |
||||
|
} |
||||
|
|
||||
|
// Use the rolling public key to decrypt arbitrary data that was encrypted with the private key by the server-side CommunicationKey class,
|
||||
|
// which we know was done with the same key pair. The given input must already be in base64 format.
|
||||
|
// Returns plaintext string of the data that was encrypted.
|
||||
|
decrypt(base64String = "") { |
||||
|
return crypto |
||||
|
.publicDecrypt(this.#readPublicKey(), Buffer.from(base64String, "base64")) |
||||
|
.toString(); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { CommunicationKey }; |
||||
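A hypothetical sketch of how the verification side above could be wired into an Express-style request handler. The header name X-Integrity and the handler shape are assumptions made for illustration; the real wiring lives in collector/middleware/verifyIntegrity.js.

const { CommunicationKey } = require("./utils/comKey"); // illustrative path

// Hypothetical Express-style integrity check: the server signs the JSON body with its
// private key, and the collector verifies that signature with the rolling public key.
function requireSignedRequest(request, response, next) {
  const comKey = new CommunicationKey();
  const signature = request.header("X-Integrity"); // header name is an assumption
  if (!signature || !comKey.verify(signature, request.body)) {
    return response.status(403).json({ msg: "Failed integrity signature check." });
  }
  next();
}

module.exports = { requireSignedRequest };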
@ -0,0 +1,71 @@ |
|||||
|
const WATCH_DIRECTORY = require("path").resolve(__dirname, "../hotdir"); |
||||
|
|
||||
|
const ACCEPTED_MIMES = { |
||||
|
"text/plain": [".txt", ".md", ".org", ".adoc", ".rst"], |
||||
|
"text/html": [".html"], |
||||
|
|
||||
|
"application/vnd.openxmlformats-officedocument.wordprocessingml.document": [ |
||||
|
".docx", |
||||
|
], |
||||
|
"application/vnd.openxmlformats-officedocument.presentationml.presentation": [ |
||||
|
".pptx", |
||||
|
], |
||||
|
|
||||
|
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": [ |
||||
|
".xlsx", |
||||
|
], |
||||
|
|
||||
|
"application/vnd.oasis.opendocument.text": [".odt"], |
||||
|
"application/vnd.oasis.opendocument.presentation": [".odp"], |
||||
|
|
||||
|
"application/pdf": [".pdf"], |
||||
|
"application/mbox": [".mbox"], |
||||
|
|
||||
|
"audio/wav": [".wav"], |
||||
|
"audio/mpeg": [".mp3"], |
||||
|
|
||||
|
"video/mp4": [".mp4"], |
||||
|
"video/mpeg": [".mpeg"], |
||||
|
"application/epub+zip": [".epub"], |
||||
|
"image/png": [".png"], |
||||
|
"image/jpeg": [".jpg"], |
||||
|
"image/jpg": [".jpg"], |
||||
|
}; |
||||
|
|
||||
|
const SUPPORTED_FILETYPE_CONVERTERS = { |
||||
|
".txt": "./convert/asTxt.js", |
||||
|
".md": "./convert/asTxt.js", |
||||
|
".org": "./convert/asTxt.js", |
||||
|
".adoc": "./convert/asTxt.js", |
||||
|
".rst": "./convert/asTxt.js", |
||||
|
|
||||
|
".html": "./convert/asTxt.js", |
||||
|
".pdf": "./convert/asPDF/index.js", |
||||
|
|
||||
|
".docx": "./convert/asDocx.js", |
||||
|
".pptx": "./convert/asOfficeMime.js", |
||||
|
|
||||
|
".odt": "./convert/asOfficeMime.js", |
||||
|
".odp": "./convert/asOfficeMime.js", |
||||
|
|
||||
|
".xlsx": "./convert/asXlsx.js", |
||||
|
|
||||
|
".mbox": "./convert/asMbox.js", |
||||
|
|
||||
|
".epub": "./convert/asEPub.js", |
||||
|
|
||||
|
".mp3": "./convert/asAudio.js", |
||||
|
".wav": "./convert/asAudio.js", |
||||
|
".mp4": "./convert/asAudio.js", |
||||
|
".mpeg": "./convert/asAudio.js", |
||||
|
|
||||
|
".png": "./convert/asImage.js", |
||||
|
".jpg": "./convert/asImage.js", |
||||
|
".jpeg": "./convert/asImage.js", |
||||
|
}; |
||||
|
|
||||
|
module.exports = { |
||||
|
SUPPORTED_FILETYPE_CONVERTERS, |
||||
|
WATCH_DIRECTORY, |
||||
|
ACCEPTED_MIMES, |
||||
|
}; |
||||
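The two maps above drive both upload validation and converter dispatch. Below is a small sketch of the dispatch half, assuming it runs next to constants.js the way processSingleFile/index.js does; the helper name and upload paths are hypothetical.

const path = require("path");
const { SUPPORTED_FILETYPE_CONVERTERS } = require("./utils/constants"); // illustrative path

// Hypothetical helper: resolve which converter module should handle an uploaded file.
function converterFor(fullFilePath) {
  const ext = path.extname(fullFilePath).toLowerCase();
  if (!SUPPORTED_FILETYPE_CONVERTERS.hasOwnProperty(ext)) {
    return { success: false, reason: `${ext} is not a supported filetype.` };
  }
  return { success: true, converter: SUPPORTED_FILETYPE_CONVERTERS[ext] };
}

console.log(converterFor("/tmp/hotdir/report.pdf"));  // { success: true, converter: './convert/asPDF/index.js' }
console.log(converterFor("/tmp/hotdir/archive.zip")); // { success: false, reason: '.zip is not a supported filetype.' }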
@ -0,0 +1,141 @@ |
|||||
|
/* |
||||
|
* This is a custom implementation of the Confluence langchain loader. There was an issue where |
||||
|
* code blocks were not being extracted. This is a temporary fix until this issue is resolved.*/ |
||||
|
|
||||
|
const { htmlToText } = require("html-to-text"); |
||||
|
|
||||
|
class ConfluencePagesLoader { |
||||
|
constructor({ |
||||
|
baseUrl, |
||||
|
spaceKey, |
||||
|
username, |
||||
|
accessToken, |
||||
|
limit = 25, |
||||
|
expand = "body.storage,version", |
||||
|
personalAccessToken, |
||||
|
cloud = true, |
||||
|
}) { |
||||
|
this.baseUrl = baseUrl; |
||||
|
this.spaceKey = spaceKey; |
||||
|
this.username = username; |
||||
|
this.accessToken = accessToken; |
||||
|
this.limit = limit; |
||||
|
this.expand = expand; |
||||
|
this.personalAccessToken = personalAccessToken; |
||||
|
this.cloud = cloud; |
||||
|
} |
||||
|
|
||||
|
get authorizationHeader() { |
||||
|
if (this.personalAccessToken) { |
||||
|
return `Bearer ${this.personalAccessToken}`; |
||||
|
} else if (this.username && this.accessToken) { |
||||
|
const authToken = Buffer.from( |
||||
|
`${this.username}:${this.accessToken}` |
||||
|
).toString("base64"); |
||||
|
return `Basic ${authToken}`; |
||||
|
} |
||||
|
return undefined; |
||||
|
} |
||||
|
|
||||
|
async load(options) { |
||||
|
try { |
||||
|
const pages = await this.fetchAllPagesInSpace( |
||||
|
options?.start, |
||||
|
options?.limit |
||||
|
); |
||||
|
return pages.map((page) => this.createDocumentFromPage(page)); |
||||
|
} catch (error) { |
||||
|
console.error("Error:", error); |
||||
|
return []; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
async fetchConfluenceData(url) { |
||||
|
try { |
||||
|
const initialHeaders = { |
||||
|
"Content-Type": "application/json", |
||||
|
Accept: "application/json", |
||||
|
}; |
||||
|
const authHeader = this.authorizationHeader; |
||||
|
if (authHeader) { |
||||
|
initialHeaders.Authorization = authHeader; |
||||
|
} |
||||
|
const response = await fetch(url, { |
||||
|
headers: initialHeaders, |
||||
|
}); |
||||
|
if (!response.ok) { |
||||
|
throw new Error( |
||||
|
`Failed to fetch ${url} from Confluence: ${response.status}` |
||||
|
); |
||||
|
} |
||||
|
return await response.json(); |
||||
|
} catch (error) { |
||||
|
throw new Error(`Failed to fetch ${url} from Confluence: ${error}`); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
// https://developer.atlassian.com/cloud/confluence/rest/v2/intro/#auth
|
||||
|
async fetchAllPagesInSpace(start = 0, limit = this.limit) { |
||||
|
const url = `${this.baseUrl}${ |
||||
|
this.cloud ? "/wiki" : "" |
||||
|
}/rest/api/content?spaceKey=${ |
||||
|
this.spaceKey |
||||
|
}&limit=${limit}&start=${start}&expand=${this.expand}`;
|
||||
|
const data = await this.fetchConfluenceData(url); |
||||
|
if (data.size === 0) { |
||||
|
return []; |
||||
|
} |
||||
|
const nextPageStart = start + data.size; |
||||
|
const nextPageResults = await this.fetchAllPagesInSpace( |
||||
|
nextPageStart, |
||||
|
limit |
||||
|
); |
||||
|
return data.results.concat(nextPageResults); |
||||
|
} |
||||
|
|
||||
|
createDocumentFromPage(page) { |
||||
|
// Function to extract code blocks
|
||||
|
const extractCodeBlocks = (content) => { |
||||
|
const codeBlockRegex = |
||||
|
/<ac:structured-macro ac:name="code"[^>]*>[\s\S]*?<ac:plain-text-body><!\[CDATA\[([\s\S]*?)\]\]><\/ac:plain-text-body>[\s\S]*?<\/ac:structured-macro>/g; |
||||
|
const languageRegex = |
||||
|
/<ac:parameter ac:name="language">(.*?)<\/ac:parameter>/; |
||||
|
|
||||
|
return content.replace(codeBlockRegex, (match) => { |
||||
|
const language = match.match(languageRegex)?.[1] || ""; |
||||
|
const code = |
||||
|
match.match( |
||||
|
/<ac:plain-text-body><!\[CDATA\[([\s\S]*?)\]\]><\/ac:plain-text-body>/ |
||||
|
)?.[1] || ""; |
||||
|
return `\n\`\`\`${language}\n${code.trim()}\n\`\`\`\n`; |
||||
|
}); |
||||
|
}; |
||||
|
|
||||
|
const contentWithCodeBlocks = extractCodeBlocks(page.body.storage.value); |
||||
|
const plainTextContent = htmlToText(contentWithCodeBlocks, { |
||||
|
wordwrap: false, |
||||
|
preserveNewlines: true, |
||||
|
}); |
||||
|
const textWithPreservedStructure = plainTextContent.replace( |
||||
|
/\n{3,}/g, |
||||
|
"\n\n" |
||||
|
); |
||||
|
const pageUrl = `${this.baseUrl}/spaces/${this.spaceKey}/pages/${page.id}`; |
||||
|
|
||||
|
return { |
||||
|
pageContent: textWithPreservedStructure, |
||||
|
metadata: { |
||||
|
id: page.id, |
||||
|
status: page.status, |
||||
|
title: page.title, |
||||
|
type: page.type, |
||||
|
url: pageUrl, |
||||
|
version: page.version?.number, |
||||
|
updated_by: page.version?.by?.displayName, |
||||
|
updated_at: page.version?.when, |
||||
|
}, |
||||
|
}; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { ConfluencePagesLoader }; |
||||
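A sketch of pulling an entire space with the loader above against a cloud-hosted instance using basic auth; the baseUrl, space key, username, and token env var are placeholders (a personalAccessToken could be passed instead for server/data-center instances).

const { ConfluencePagesLoader } = require("./ConfluenceLoader"); // illustrative path

async function dumpSpace() {
  const loader = new ConfluencePagesLoader({
    baseUrl: "https://example.atlassian.net", // placeholder site
    spaceKey: "DOCS",
    username: "user@example.com",
    accessToken: process.env.CONFLUENCE_API_TOKEN,
    cloud: true, // prefixes /wiki on the REST endpoints
  });

  const documents = await loader.load();
  for (const doc of documents) {
    console.log(doc.metadata.title, doc.metadata.url, `${doc.pageContent.length} chars`);
  }
}

dumpSpace();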
@ -0,0 +1,257 @@ |
|||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const { v4 } = require("uuid"); |
||||
|
const { writeToServerDocuments, sanitizeFileName } = require("../../files"); |
||||
|
const { tokenizeString } = require("../../tokenizer"); |
||||
|
const { ConfluencePagesLoader } = require("./ConfluenceLoader"); |
||||
|
|
||||
|
/** |
||||
|
* Load Confluence documents from a spaceID and Confluence credentials |
||||
|
* @param {object} args - forwarded request body params |
||||
|
* @param {import("../../../middleware/setDataSigner").ResponseWithSigner} response - Express response object with encryptionWorker |
||||
|
* @returns |
||||
|
*/ |
||||
|
async function loadConfluence( |
||||
|
{ |
||||
|
baseUrl = null, |
||||
|
spaceKey = null, |
||||
|
username = null, |
||||
|
accessToken = null, |
||||
|
cloud = true, |
||||
|
personalAccessToken = null, |
||||
|
}, |
||||
|
response |
||||
|
) { |
||||
|
if (!personalAccessToken && (!username || !accessToken)) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: |
||||
|
"You need either a personal access token (PAT), or a username and access token to use the Confluence connector.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
if (!baseUrl || !validBaseUrl(baseUrl)) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Provided base URL is not a valid URL.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
if (!spaceKey) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "You need to provide a Confluence space key.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const { origin, hostname } = new URL(baseUrl); |
||||
|
console.log(`-- Working Confluence ${origin} --`); |
||||
|
const loader = new ConfluencePagesLoader({ |
||||
|
baseUrl: origin, // Use the origin to avoid issues with subdomains, ports, protocols, etc.
|
||||
|
spaceKey, |
||||
|
username, |
||||
|
accessToken, |
||||
|
cloud, |
||||
|
personalAccessToken, |
||||
|
}); |
||||
|
|
||||
|
const { docs, error } = await loader |
||||
|
.load() |
||||
|
.then((docs) => { |
||||
|
return { docs, error: null }; |
||||
|
}) |
||||
|
.catch((e) => { |
||||
|
return { |
||||
|
docs: [], |
||||
|
error: e.message?.split("Error:")?.[1] || e.message, |
||||
|
}; |
||||
|
}); |
||||
|
|
||||
|
if (!docs.length || !!error) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: error ?? "No pages found for that Confluence space.", |
||||
|
}; |
||||
|
} |
||||
|
const outFolder = slugify( |
||||
|
`confluence-${hostname}-${v4().slice(0, 4)}` |
||||
|
).toLowerCase(); |
||||
|
|
||||
|
const outFolderPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve( |
||||
|
__dirname, |
||||
|
`../../../../server/storage/documents/${outFolder}` |
||||
|
) |
||||
|
: path.resolve(process.env.STORAGE_DIR, `documents/${outFolder}`); |
||||
|
|
||||
|
if (!fs.existsSync(outFolderPath)) |
||||
|
fs.mkdirSync(outFolderPath, { recursive: true }); |
||||
|
|
||||
|
docs.forEach((doc) => { |
||||
|
if (!doc.pageContent) return; |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: doc.metadata.url + ".page", |
||||
|
title: doc.metadata.title || doc.metadata.source, |
||||
|
docAuthor: origin, |
||||
|
description: doc.metadata.title, |
||||
|
docSource: `${origin} Confluence`, |
||||
|
chunkSource: generateChunkSource( |
||||
|
{ doc, baseUrl: origin, spaceKey, accessToken, username, cloud }, |
||||
|
response.locals.encryptionWorker |
||||
|
), |
||||
|
published: new Date().toLocaleString(), |
||||
|
wordCount: doc.pageContent.split(" ").length, |
||||
|
pageContent: doc.pageContent, |
||||
|
token_count_estimate: tokenizeString(doc.pageContent), |
||||
|
}; |
||||
|
|
||||
|
console.log( |
||||
|
`[Confluence Loader]: Saving ${doc.metadata.title} to ${outFolder}` |
||||
|
); |
||||
|
|
||||
|
const fileName = sanitizeFileName( |
||||
|
`${slugify(doc.metadata.title)}-${data.id}` |
||||
|
); |
||||
|
writeToServerDocuments(data, fileName, outFolderPath); |
||||
|
}); |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
data: { |
||||
|
spaceKey, |
||||
|
destination: outFolder, |
||||
|
}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Gets the page content from a specific Confluence page, not all pages in a space. |
||||
|
* @returns |
||||
|
*/ |
||||
|
async function fetchConfluencePage({ |
||||
|
pageUrl, |
||||
|
baseUrl, |
||||
|
spaceKey, |
||||
|
username, |
||||
|
accessToken, |
||||
|
cloud = true, |
||||
|
}) { |
||||
|
if (!pageUrl || !baseUrl || !spaceKey || !username || !accessToken) { |
||||
|
return { |
||||
|
success: false, |
||||
|
content: null, |
||||
|
reason: |
||||
|
"You need either a username and access token, or a personal access token (PAT), to use the Confluence connector.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
if (!validBaseUrl(baseUrl)) { |
||||
|
return { |
||||
|
success: false, |
||||
|
content: null, |
||||
|
reason: "Provided base URL is not a valid URL.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
if (!spaceKey) { |
||||
|
return { |
||||
|
success: false, |
||||
|
content: null, |
||||
|
reason: "You need to provide a Confluence space key.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`-- Working Confluence Page ${pageUrl} --`); |
||||
|
const loader = new ConfluencePagesLoader({ |
||||
|
baseUrl, // Should already be the origin portion of the URL
|
||||
|
spaceKey, |
||||
|
username, |
||||
|
accessToken, |
||||
|
cloud, |
||||
|
}); |
||||
|
|
||||
|
const { docs, error } = await loader |
||||
|
.load() |
||||
|
.then((docs) => { |
||||
|
return { docs, error: null }; |
||||
|
}) |
||||
|
.catch((e) => { |
||||
|
return { |
||||
|
docs: [], |
||||
|
error: e.message?.split("Error:")?.[1] || e.message, |
||||
|
}; |
||||
|
}); |
||||
|
|
||||
|
if (!docs.length || !!error) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: error ?? "No pages found for that Confluence space.", |
||||
|
content: null, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const targetDocument = docs.find( |
||||
|
(doc) => doc.pageContent && doc.metadata.url === pageUrl |
||||
|
); |
||||
|
if (!targetDocument) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Target page could not be found in Confluence space.", |
||||
|
content: null, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
content: targetDocument.pageContent, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Validates if the provided baseUrl is a valid URL at all. |
||||
|
* @param {string} baseUrl |
||||
|
* @returns {boolean} |
||||
|
*/ |
||||
|
function validBaseUrl(baseUrl) { |
||||
|
try { |
||||
|
new URL(baseUrl); |
||||
|
return true; |
||||
|
} catch (e) { |
||||
|
return false; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Generate the full chunkSource for a specific Confluence page so that we can resync it later. |
||||
|
* This data is encrypted into a single `payload` query param so we can replay credentials later |
||||
|
* since this was encrypted with the system's persistent password and salt. |
||||
|
* @param {object} chunkSourceInformation |
||||
|
* @param {import("../../EncryptionWorker").EncryptionWorker} encryptionWorker |
||||
|
* @returns {string} |
||||
|
*/ |
||||
|
function generateChunkSource( |
||||
|
{ doc, baseUrl, spaceKey, accessToken, username, cloud }, |
||||
|
encryptionWorker |
||||
|
) { |
||||
|
const payload = { |
||||
|
baseUrl, |
||||
|
spaceKey, |
||||
|
token: accessToken, |
||||
|
username, |
||||
|
cloud, |
||||
|
}; |
||||
|
return `confluence://${doc.metadata.url}?payload=${encryptionWorker.encrypt( |
||||
|
JSON.stringify(payload) |
||||
|
)}`;
|
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
loadConfluence, |
||||
|
fetchConfluencePage, |
||||
|
}; |
||||
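Outside the collector's Express app, loadConfluence above can be exercised by stubbing the response object that setDataSigner would normally populate; the require paths, credentials, and the base64 key env var are placeholders, and the call writes document JSON into the configured storage directory just as it does in production.

const { loadConfluence } = require("./utils/extensions/Confluence"); // illustrative path
const { EncryptionWorker } = require("./utils/EncryptionWorker");    // illustrative path

async function run() {
  // Normally setDataSigner attaches the worker with the server's persisted key.
  const stubResponse = {
    locals: { encryptionWorker: new EncryptionWorker(process.env.SIG_KEY_B64) }, // placeholder key source
  };

  const result = await loadConfluence(
    {
      baseUrl: "https://example.atlassian.net", // placeholder site
      spaceKey: "DOCS",
      username: "user@example.com",
      accessToken: process.env.CONFLUENCE_API_TOKEN,
      cloud: true,
    },
    stubResponse
  );
  console.log(result); // { success, reason, data: { spaceKey, destination } }
}

run();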
@ -0,0 +1,235 @@ |
|||||
|
/** |
||||
|
* @typedef {Object} RepoLoaderArgs |
||||
|
* @property {string} repo - The GitHub repository URL. |
||||
|
* @property {string} [branch] - The branch to load from (optional). |
||||
|
* @property {string} [accessToken] - GitHub access token for authentication (optional). |
||||
|
* @property {string[]} [ignorePaths] - Array of paths to ignore when loading (optional). |
||||
|
*/ |
||||
|
|
||||
|
/** |
||||
|
* @class |
||||
|
* @classdesc Loads and manages GitHub repository content. |
||||
|
*/ |
||||
|
class GitHubRepoLoader { |
||||
|
/** |
||||
|
* Creates an instance of RepoLoader. |
||||
|
* @param {RepoLoaderArgs} [args] - The configuration options. |
||||
|
* @returns {GitHubRepoLoader} |
||||
|
*/ |
||||
|
constructor(args = {}) { |
||||
|
this.ready = false; |
||||
|
this.repo = args?.repo; |
||||
|
this.branch = args?.branch; |
||||
|
this.accessToken = args?.accessToken || null; |
||||
|
this.ignorePaths = args?.ignorePaths || []; |
||||
|
|
||||
|
this.author = null; |
||||
|
this.project = null; |
||||
|
this.branches = []; |
||||
|
} |
||||
|
|
||||
|
#validGithubUrl() { |
||||
|
try { |
||||
|
const url = new URL(this.repo); |
||||
|
|
||||
|
// Not a github url at all.
|
||||
|
if (url.hostname !== "github.com") { |
||||
|
console.log( |
||||
|
`[GitHub Loader]: Invalid GitHub URL provided! Hostname must be 'github.com'. Got ${url.hostname}` |
||||
|
); |
||||
|
return false; |
||||
|
} |
||||
|
|
||||
|
// Assume the url is in the format of github.com/{author}/{project}
|
||||
|
// Remove the first slash from the pathname so we can split it properly.
|
||||
|
const [author, project, ..._rest] = url.pathname.slice(1).split("/"); |
||||
|
if (!author || !project) { |
||||
|
console.log( |
||||
|
`[GitHub Loader]: Invalid GitHub URL provided! URL must be in the format of 'github.com/{author}/{project}'. Got ${url.pathname}` |
||||
|
); |
||||
|
return false; |
||||
|
} |
||||
|
|
||||
|
this.author = author; |
||||
|
this.project = project; |
||||
|
return true; |
||||
|
} catch (e) { |
||||
|
console.log( |
||||
|
`[GitHub Loader]: Invalid GitHub URL provided! Error: ${e.message}` |
||||
|
); |
||||
|
return false; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
// Ensure the branch provided actually exists
|
||||
|
// and, if it does not exist or has not been set, auto-assign the primary branch.
|
||||
|
async #validBranch() { |
||||
|
await this.getRepoBranches(); |
||||
|
if (!!this.branch && this.branches.includes(this.branch)) return; |
||||
|
|
||||
|
console.log( |
||||
|
"[GitHub Loader]: Branch not set! Auto-assigning to a default branch." |
||||
|
); |
||||
|
this.branch = this.branches.includes("main") ? "main" : "master"; |
||||
|
console.log(`[GitHub Loader]: Branch auto-assigned to ${this.branch}.`); |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
async #validateAccessToken() { |
||||
|
if (!this.accessToken) return; |
||||
|
const valid = await fetch("https://api.github.com/octocat", { |
||||
|
method: "GET", |
||||
|
headers: { |
||||
|
Authorization: `Bearer ${this.accessToken}`, |
||||
|
"X-GitHub-Api-Version": "2022-11-28", |
||||
|
}, |
||||
|
}) |
||||
|
.then((res) => { |
||||
|
if (!res.ok) throw new Error(res.statusText); |
||||
|
return res.ok; |
||||
|
}) |
||||
|
.catch((e) => { |
||||
|
console.error( |
||||
|
"Invalid GitHub Access Token provided! Access token will not be used", |
||||
|
e.message |
||||
|
); |
||||
|
return false; |
||||
|
}); |
||||
|
|
||||
|
if (!valid) this.accessToken = null; |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Initializes the RepoLoader instance. |
||||
|
* @returns {Promise<RepoLoader>} The initialized RepoLoader instance. |
||||
|
*/ |
||||
|
async init() { |
||||
|
if (!this.#validGithubUrl()) return; |
||||
|
await this.#validBranch(); |
||||
|
await this.#validateAccessToken(); |
||||
|
this.ready = true; |
||||
|
return this; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Recursively loads the repository content. |
||||
|
* @returns {Promise<Array<Object>>} An array of loaded documents. |
||||
|
* @throws {Error} If the RepoLoader is not in a ready state. |
||||
|
*/ |
||||
|
async recursiveLoader() { |
||||
|
if (!this.ready) throw new Error("[GitHub Loader]: not in ready state!"); |
||||
|
const { |
||||
|
GithubRepoLoader: LCGithubLoader, |
||||
|
} = require("@langchain/community/document_loaders/web/github"); |
||||
|
|
||||
|
if (this.accessToken) |
||||
|
console.log( |
||||
|
`[GitHub Loader]: Access token set! Recursive loading enabled!` |
||||
|
); |
||||
|
|
||||
|
const loader = new LCGithubLoader(this.repo, { |
||||
|
branch: this.branch, |
||||
|
recursive: !!this.accessToken, // Recursive will hit rate limits.
|
||||
|
maxConcurrency: 5, |
||||
|
unknown: "warn", |
||||
|
accessToken: this.accessToken, |
||||
|
ignorePaths: this.ignorePaths, |
||||
|
verbose: true, |
||||
|
}); |
||||
|
|
||||
|
const docs = await loader.load(); |
||||
|
return docs; |
||||
|
} |
||||
|
|
||||
|
// Sort branches to always show either main or master at the top of the result.
|
||||
|
#branchPrefSort(branches = []) { |
||||
|
const preferredSort = ["main", "master"]; |
||||
|
return branches.reduce((acc, branch) => { |
||||
|
if (preferredSort.includes(branch)) return [branch, ...acc]; |
||||
|
return [...acc, branch]; |
||||
|
}, []); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Retrieves all branches for the repository. |
||||
|
* @returns {Promise<string[]>} An array of branch names. |
||||
|
*/ |
||||
|
async getRepoBranches() { |
||||
|
if (!this.#validGithubUrl() || !this.author || !this.project) return []; |
||||
|
await this.#validateAccessToken(); // Ensure API access token is valid for pre-flight
|
||||
|
|
||||
|
let page = 0; |
||||
|
let polling = true; |
||||
|
const branches = []; |
||||
|
|
||||
|
while (polling) { |
||||
|
console.log(`Fetching page ${page} of branches for ${this.project}`); |
||||
|
await fetch( |
||||
|
`https://api.github.com/repos/${this.author}/${this.project}/branches?per_page=100&page=${page}`, |
||||
|
{ |
||||
|
method: "GET", |
||||
|
headers: { |
||||
|
...(this.accessToken |
||||
|
? { Authorization: `Bearer ${this.accessToken}` } |
||||
|
: {}), |
||||
|
"X-GitHub-Api-Version": "2022-11-28", |
||||
|
}, |
||||
|
} |
||||
|
) |
||||
|
.then((res) => { |
||||
|
if (res.ok) return res.json(); |
||||
|
throw new Error(`Invalid request to Github API: ${res.statusText}`); |
||||
|
}) |
||||
|
.then((branchObjects) => { |
||||
|
polling = branchObjects.length > 0; |
||||
|
branches.push(branchObjects.map((branch) => branch.name)); |
||||
|
page++; |
||||
|
}) |
||||
|
.catch((err) => { |
||||
|
polling = false; |
||||
|
console.log(`RepoLoader.branches`, err); |
||||
|
}); |
||||
|
} |
||||
|
|
||||
|
this.branches = [...new Set(branches.flat())]; |
||||
|
return this.#branchPrefSort(this.branches); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches the content of a single file from the repository. |
||||
|
* @param {string} sourceFilePath - The path to the file in the repository. |
||||
|
* @returns {Promise<string|null>} The content of the file, or null if fetching fails. |
||||
|
*/ |
||||
|
async fetchSingleFile(sourceFilePath) { |
||||
|
try { |
||||
|
return fetch( |
||||
|
`https://api.github.com/repos/${this.author}/${this.project}/contents/${sourceFilePath}?ref=${this.branch}`, |
||||
|
{ |
||||
|
method: "GET", |
||||
|
headers: { |
||||
|
Accept: "application/vnd.github+json", |
||||
|
"X-GitHub-Api-Version": "2022-11-28", |
||||
|
...(!!this.accessToken |
||||
|
? { Authorization: `Bearer ${this.accessToken}` } |
||||
|
: {}), |
||||
|
}, |
||||
|
} |
||||
|
) |
||||
|
.then((res) => { |
||||
|
if (res.ok) return res.json(); |
||||
|
throw new Error(`Failed to fetch from Github API: ${res.statusText}`); |
||||
|
}) |
||||
|
.then((json) => { |
||||
|
if (json.hasOwnProperty("status") || !json.hasOwnProperty("content")) |
||||
|
throw new Error(json?.message || "missing content"); |
||||
|
return atob(json.content); |
||||
|
}); |
||||
|
} catch (e) { |
||||
|
console.error(`RepoLoader.fetchSingleFile`, e); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = GitHubRepoLoader; |
||||
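A usage sketch for the GitHub loader above; the repository URL is only an example, and the PAT env var is optional (without it, recursive loading is disabled and only top-level content is fetched).

const GitHubRepoLoader = require("./RepoLoader"); // illustrative path

async function loadRepo() {
  const loader = new GitHubRepoLoader({
    repo: "https://github.com/Mintplex-Labs/anything-llm", // example repo URL
    branch: "master",
    accessToken: process.env.GITHUB_PAT || null, // optional; enables recursive loading
    ignorePaths: ["node_modules", "*.png"],
  });

  await loader.init();
  if (!loader.ready) throw new Error("Could not prepare the repo for loading.");

  console.log(await loader.getRepoBranches()); // e.g. [ 'master', 'main', ... ]
  const docs = await loader.recursiveLoader();
  console.log(`Loaded ${docs.length} documents.`);
}

loadRepo();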
@ -0,0 +1,159 @@ |
|||||
|
const RepoLoader = require("./RepoLoader"); |
||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const { v4 } = require("uuid"); |
||||
|
const { writeToServerDocuments } = require("../../../files"); |
||||
|
const { tokenizeString } = require("../../../tokenizer"); |
||||
|
|
||||
|
/** |
||||
|
* Load in a GitHub Repo recursively or just the top level if no PAT is provided |
||||
|
* @param {object} args - forwarded request body params |
||||
|
* @param {import("../../../middleware/setDataSigner").ResponseWithSigner} response - Express response object with encryptionWorker |
||||
|
* @returns |
||||
|
*/ |
||||
|
async function loadGithubRepo(args, response) { |
||||
|
const repo = new RepoLoader(args); |
||||
|
await repo.init(); |
||||
|
|
||||
|
if (!repo.ready) |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Could not prepare GitHub repo for loading! Check URL", |
||||
|
}; |
||||
|
|
||||
|
console.log( |
||||
|
`-- Working GitHub ${repo.author}/${repo.project}:${repo.branch} --` |
||||
|
); |
||||
|
const docs = await repo.recursiveLoader(); |
||||
|
if (!docs.length) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "No files were found for those settings.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`[GitHub Loader]: Found ${docs.length} source files. Saving...`); |
||||
|
const outFolder = slugify( |
||||
|
`${repo.author}-${repo.project}-${repo.branch}-${v4().slice(0, 4)}` |
||||
|
).toLowerCase(); |
||||
|
|
||||
|
const outFolderPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve( |
||||
|
__dirname, |
||||
|
`../../../../../server/storage/documents/${outFolder}` |
||||
|
) |
||||
|
: path.resolve(process.env.STORAGE_DIR, `documents/${outFolder}`); |
||||
|
|
||||
|
if (!fs.existsSync(outFolderPath)) |
||||
|
fs.mkdirSync(outFolderPath, { recursive: true }); |
||||
|
|
||||
|
for (const doc of docs) { |
||||
|
if (!doc.pageContent) continue; |
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "github://" + doc.metadata.source, |
||||
|
title: doc.metadata.source, |
||||
|
docAuthor: repo.author, |
||||
|
description: "No description found.", |
||||
|
docSource: doc.metadata.source, |
||||
|
chunkSource: generateChunkSource( |
||||
|
repo, |
||||
|
doc, |
||||
|
response.locals.encryptionWorker |
||||
|
), |
||||
|
published: new Date().toLocaleString(), |
||||
|
wordCount: doc.pageContent.split(" ").length, |
||||
|
pageContent: doc.pageContent, |
||||
|
token_count_estimate: tokenizeString(doc.pageContent), |
||||
|
}; |
||||
|
console.log( |
||||
|
`[GitHub Loader]: Saving ${doc.metadata.source} to ${outFolder}` |
||||
|
); |
||||
|
writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(doc.metadata.source)}-${data.id}`, |
||||
|
outFolderPath |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
data: { |
||||
|
author: repo.author, |
||||
|
repo: repo.project, |
||||
|
branch: repo.branch, |
||||
|
files: docs.length, |
||||
|
destination: outFolder, |
||||
|
}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Gets the page content from a specific source file in a given GitHub repo, not all items in a repo. |
||||
|
* @returns |
||||
|
*/ |
||||
|
async function fetchGithubFile({ |
||||
|
repoUrl, |
||||
|
branch, |
||||
|
accessToken = null, |
||||
|
sourceFilePath, |
||||
|
}) { |
||||
|
const repo = new RepoLoader({ |
||||
|
repo: repoUrl, |
||||
|
branch, |
||||
|
accessToken, |
||||
|
}); |
||||
|
await repo.init(); |
||||
|
|
||||
|
if (!repo.ready) |
||||
|
return { |
||||
|
success: false, |
||||
|
content: null, |
||||
|
reason: "Could not prepare GitHub repo for loading! Check URL or PAT.", |
||||
|
}; |
||||
|
|
||||
|
console.log( |
||||
|
`-- Working GitHub ${repo.author}/${repo.project}:${repo.branch} file:${sourceFilePath} --` |
||||
|
); |
||||
|
const fileContent = await repo.fetchSingleFile(sourceFilePath); |
||||
|
if (!fileContent) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Target file returned a null content response.", |
||||
|
content: null, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
content: fileContent, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Generate the full chunkSource for a specific file so that we can resync it later. |
||||
|
* This data is encrypted into a single `payload` query param so we can replay credentials later |
||||
|
* since this was encrypted with the system's persistent password and salt. |
||||
|
* @param {RepoLoader} repo |
||||
|
* @param {import("@langchain/core/documents").Document} doc |
||||
|
* @param {import("../../EncryptionWorker").EncryptionWorker} encryptionWorker |
||||
|
* @returns {string} |
||||
|
*/ |
||||
|
function generateChunkSource(repo, doc, encryptionWorker) { |
||||
|
const payload = { |
||||
|
owner: repo.author, |
||||
|
project: repo.project, |
||||
|
branch: repo.branch, |
||||
|
path: doc.metadata.source, |
||||
|
pat: !!repo.accessToken ? repo.accessToken : null, |
||||
|
}; |
||||
|
return `github://${repo.repo}?payload=${encryptionWorker.encrypt( |
||||
|
JSON.stringify(payload) |
||||
|
)}`;
|
||||
|
} |
||||
|
|
||||
|
module.exports = { loadGithubRepo, fetchGithubFile }; |
||||
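And fetching a single file for a resync, again with placeholder repo/branch values and an optional PAT, looks like this: no documents are written to storage; only the raw content is returned.

const { fetchGithubFile } = require("./index"); // illustrative path within GithubRepo/

async function showReadme() {
  const { success, reason, content } = await fetchGithubFile({
    repoUrl: "https://github.com/Mintplex-Labs/anything-llm", // example repo URL
    branch: "master",
    accessToken: process.env.GITHUB_PAT || null,
    sourceFilePath: "README.md",
  });
  if (!success) throw new Error(reason);
  console.log(content.slice(0, 200));
}

showReadme();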
@ -0,0 +1,376 @@ |
|||||
|
const ignore = require("ignore"); |
||||
|
|
||||
|
/** |
||||
|
* @typedef {Object} RepoLoaderArgs |
||||
|
* @property {string} repo - The GitLab repository URL. |
||||
|
* @property {string} [branch] - The branch to load from (optional). |
||||
|
* @property {string} [accessToken] - GitLab access token for authentication (optional). |
||||
|
* @property {string[]} [ignorePaths] - Array of paths to ignore when loading (optional). |
||||
|
* @property {boolean} [fetchIssues] - Should issues be fetched (optional). |
||||
|
*/ |
||||
|
|
||||
|
/** |
||||
|
* @typedef {Object} FileTreeObject |
||||
|
* @property {string} id - The file object ID. |
||||
|
* @property {string} name - name of file. |
||||
|
* @property {('blob'|'tree')} type - type of file object. |
||||
|
* @property {string} path - path + name of file. |
||||
|
* @property {string} mode - Linux permission code. |
||||
|
*/ |
||||
|
|
||||
|
/** |
||||
|
* @class |
||||
|
* @classdesc Loads and manages GitLab repository content. |
||||
|
*/ |
||||
|
class GitLabRepoLoader { |
||||
|
/** |
||||
|
* Creates an instance of RepoLoader. |
||||
|
* @param {RepoLoaderArgs} [args] - The configuration options. |
||||
|
* @returns {GitLabRepoLoader} |
||||
|
*/ |
||||
|
constructor(args = {}) { |
||||
|
this.ready = false; |
||||
|
this.repo = args?.repo; |
||||
|
this.branch = args?.branch; |
||||
|
this.accessToken = args?.accessToken || null; |
||||
|
this.ignorePaths = args?.ignorePaths || []; |
||||
|
this.ignoreFilter = ignore().add(this.ignorePaths); |
||||
|
this.withIssues = args?.fetchIssues || false; |
||||
|
|
||||
|
this.projectId = null; |
||||
|
this.apiBase = "https://gitlab.com"; |
||||
|
this.author = null; |
||||
|
this.project = null; |
||||
|
this.branches = []; |
||||
|
} |
||||
|
|
||||
|
#validGitlabUrl() { |
||||
|
const UrlPattern = require("url-pattern"); |
||||
|
const validPatterns = [ |
||||
|
new UrlPattern("https\\://gitlab.com/(:author*)/(:project(*))", { |
||||
|
segmentValueCharset: "a-zA-Z0-9-._~%+", |
||||
|
}), |
||||
|
// This should even match the regular hosted URL, but we may want to know
|
||||
|
// if this was a hosted GitLab (above) or a self-hosted (below) instance
|
||||
|
// since the API interface could be different.
|
||||
|
new UrlPattern( |
||||
|
"(:protocol(http|https))\\://(:hostname*)/(:author*)/(:project(*))", |
||||
|
{ |
||||
|
segmentValueCharset: "a-zA-Z0-9-._~%+", |
||||
|
} |
||||
|
), |
||||
|
]; |
||||
|
|
||||
|
let match = null; |
||||
|
for (const pattern of validPatterns) { |
||||
|
if (match !== null) continue; |
||||
|
match = pattern.match(this.repo); |
||||
|
} |
||||
|
if (!match) return false; |
||||
|
const { author, project } = match; |
||||
|
|
||||
|
this.projectId = encodeURIComponent(`${author}/${project}`); |
||||
|
this.apiBase = new URL(this.repo).origin; |
||||
|
this.author = author; |
||||
|
this.project = project; |
||||
|
return true; |
||||
|
} |
||||
|
|
||||
|
async #validBranch() { |
||||
|
await this.getRepoBranches(); |
||||
|
if (!!this.branch && this.branches.includes(this.branch)) return; |
||||
|
|
||||
|
console.log( |
||||
|
"[Gitlab Loader]: Branch not set! Auto-assigning to a default branch." |
||||
|
); |
||||
|
this.branch = this.branches.includes("main") ? "main" : "master"; |
||||
|
console.log(`[Gitlab Loader]: Branch auto-assigned to ${this.branch}.`); |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
async #validateAccessToken() { |
||||
|
if (!this.accessToken) return; |
||||
|
try { |
||||
|
await fetch(`${this.apiBase}/api/v4/user`, { |
||||
|
method: "GET", |
||||
|
headers: this.accessToken ? { "PRIVATE-TOKEN": this.accessToken } : {}, |
||||
|
}).then((res) => res.ok); |
||||
|
} catch (e) { |
||||
|
console.error( |
||||
|
"Invalid Gitlab Access Token provided! Access token will not be used", |
||||
|
e.message |
||||
|
); |
||||
|
this.accessToken = null; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Initializes the RepoLoader instance. |
||||
|
* @returns {Promise<RepoLoader>} The initialized RepoLoader instance. |
||||
|
*/ |
||||
|
async init() { |
||||
|
if (!this.#validGitlabUrl()) return; |
||||
|
await this.#validBranch(); |
||||
|
await this.#validateAccessToken(); |
||||
|
this.ready = true; |
||||
|
return this; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Recursively loads the repository content. |
||||
|
* @returns {Promise<Array<Object>>} An array of loaded documents. |
||||
|
* @throws {Error} If the RepoLoader is not in a ready state. |
||||
|
*/ |
||||
|
async recursiveLoader() { |
||||
|
if (!this.ready) throw new Error("[Gitlab Loader]: not in ready state!"); |
||||
|
|
||||
|
if (this.accessToken) |
||||
|
console.log( |
||||
|
`[Gitlab Loader]: Access token set! Recursive loading enabled for ${this.repo}!` |
||||
|
); |
||||
|
|
||||
|
const docs = []; |
||||
|
|
||||
|
console.log(`[Gitlab Loader]: Fetching files.`); |
||||
|
|
||||
|
const files = await this.fetchFilesRecursive(); |
||||
|
|
||||
|
console.log(`[Gitlab Loader]: Fetched ${files.length} files.`); |
||||
|
|
||||
|
for (const file of files) { |
||||
|
if (this.ignoreFilter.ignores(file.path)) continue; |
||||
|
|
||||
|
docs.push({ |
||||
|
pageContent: file.content, |
||||
|
metadata: { |
||||
|
source: file.path, |
||||
|
url: `${this.repo}/-/blob/${this.branch}/${file.path}`, |
||||
|
}, |
||||
|
}); |
||||
|
} |
||||
|
|
||||
|
if (this.withIssues) { |
||||
|
console.log(`[Gitlab Loader]: Fetching issues.`); |
||||
|
const issues = await this.fetchIssues(); |
||||
|
console.log( |
||||
|
`[Gitlab Loader]: Fetched ${issues.length} issues with discussions.` |
||||
|
); |
||||
|
docs.push( |
||||
|
...issues.map((issue) => ({ |
||||
|
issue, |
||||
|
metadata: { |
||||
|
source: `issue-${this.repo}-${issue.iid}`, |
||||
|
url: issue.web_url, |
||||
|
}, |
||||
|
})) |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
return docs; |
||||
|
} |
||||
|
|
||||
|
#branchPrefSort(branches = []) { |
||||
|
const preferredSort = ["main", "master"]; |
||||
|
return branches.reduce((acc, branch) => { |
||||
|
if (preferredSort.includes(branch)) return [branch, ...acc]; |
||||
|
return [...acc, branch]; |
||||
|
}, []); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Retrieves all branches for the repository. |
||||
|
* @returns {Promise<string[]>} An array of branch names. |
||||
|
*/ |
||||
|
async getRepoBranches() { |
||||
|
if (!this.#validGitlabUrl() || !this.projectId) return []; |
||||
|
await this.#validateAccessToken(); |
||||
|
this.branches = []; |
||||
|
|
||||
|
const branchesRequestData = { |
||||
|
endpoint: `/api/v4/projects/${this.projectId}/repository/branches`, |
||||
|
}; |
||||
|
|
||||
|
let branchesPage = []; |
||||
|
while ((branchesPage = await this.fetchNextPage(branchesRequestData))) { |
||||
|
this.branches.push(...branchesPage.map((branch) => branch.name)); |
||||
|
} |
||||
|
return this.#branchPrefSort(this.branches); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Returns list of all file objects from tree API for GitLab |
||||
|
* @returns {Promise<FileTreeObject[]>} |
||||
|
*/ |
||||
|
async fetchFilesRecursive() { |
||||
|
const files = []; |
||||
|
const filesRequestData = { |
||||
|
endpoint: `/api/v4/projects/${this.projectId}/repository/tree`, |
||||
|
queryParams: { |
||||
|
ref: this.branch, |
||||
|
recursive: true, |
||||
|
}, |
||||
|
}; |
||||
|
|
||||
|
let filesPage = null; |
||||
|
let pagePromises = []; |
||||
|
while ((filesPage = await this.fetchNextPage(filesRequestData))) { |
||||
|
// Fetch all the files that are not ignored in parallel.
|
||||
|
pagePromises = filesPage |
||||
|
.filter((file) => { |
||||
|
if (file.type !== "blob") return false; |
||||
|
return !this.ignoreFilter.ignores(file.path); |
||||
|
}) |
||||
|
.map(async (file) => { |
||||
|
const content = await this.fetchSingleFileContents(file.path); |
||||
|
if (!content) return null; |
||||
|
return { |
||||
|
path: file.path, |
||||
|
content, |
||||
|
}; |
||||
|
}); |
||||
|
|
||||
|
const pageFiles = await Promise.all(pagePromises); |
||||
|
|
||||
|
files.push(...pageFiles.filter((item) => item !== null)); |
||||
|
console.log(`Fetched ${files.length} files.`); |
||||
|
} |
||||
|
console.log(`Total files fetched: ${files.length}`); |
||||
|
return files; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches all issues from the repository. |
||||
|
* @returns {Promise<Issue[]>} An array of issue objects. |
||||
|
*/ |
||||
|
async fetchIssues() { |
||||
|
const issues = []; |
||||
|
const issuesRequestData = { |
||||
|
endpoint: `/api/v4/projects/${this.projectId}/issues`, |
||||
|
}; |
||||
|
|
||||
|
let issuesPage = null; |
||||
|
let pagePromises = []; |
||||
|
while ((issuesPage = await this.fetchNextPage(issuesRequestData))) { |
||||
|
// Fetch all the issues in parallel.
|
||||
|
pagePromises = issuesPage.map(async (issue) => { |
||||
|
const discussionsRequestData = { |
||||
|
endpoint: `/api/v4/projects/${this.projectId}/issues/${issue.iid}/discussions`, |
||||
|
}; |
||||
|
let discussionPage = null; |
||||
|
const discussions = []; |
||||
|
|
||||
|
while ( |
||||
|
(discussionPage = await this.fetchNextPage(discussionsRequestData)) |
||||
|
) { |
||||
|
discussions.push( |
||||
|
...discussionPage.map(({ notes }) => |
||||
|
notes.map( |
||||
|
({ body, author, created_at }) => |
||||
|
`${author.username} at ${created_at}:
|
||||
|
${body}`
|
||||
|
) |
||||
|
) |
||||
|
); |
||||
|
} |
||||
|
const result = { |
||||
|
...issue, |
||||
|
discussions, |
||||
|
}; |
||||
|
return result; |
||||
|
}); |
||||
|
|
||||
|
const pageIssues = await Promise.all(pagePromises); |
||||
|
|
||||
|
issues.push(...pageIssues); |
||||
|
console.log(`Fetched ${issues.length} issues.`); |
||||
|
} |
||||
|
console.log(`Total issues fetched: ${issues.length}`); |
||||
|
return issues; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches the content of a single file from the repository. |
||||
|
* @param {string} sourceFilePath - The path to the file in the repository. |
||||
|
* @returns {Promise<string|null>} The content of the file, or null if fetching fails. |
||||
|
*/ |
||||
|
async fetchSingleFileContents(sourceFilePath) { |
||||
|
try { |
||||
|
const data = await fetch( |
||||
|
`${this.apiBase}/api/v4/projects/${ |
||||
|
this.projectId |
||||
|
}/repository/files/${encodeURIComponent(sourceFilePath)}/raw?ref=${ |
||||
|
this.branch |
||||
|
}`,
|
||||
|
{ |
||||
|
method: "GET", |
||||
|
headers: this.accessToken |
||||
|
? { "PRIVATE-TOKEN": this.accessToken } |
||||
|
: {}, |
||||
|
} |
||||
|
).then((res) => { |
||||
|
if (res.ok) return res.text(); |
||||
|
throw new Error(`Failed to fetch single file ${sourceFilePath}`); |
||||
|
}); |
||||
|
|
||||
|
return data; |
||||
|
} catch (e) { |
||||
|
console.error(`RepoLoader.fetchSingleFileContents`, e); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Fetches the next page of data from the API. |
||||
|
* @param {Object} requestData - The request data. |
||||
|
* @returns {Promise<Array<Object>|null>} The next page of data, or null if no more pages. |
||||
|
*/ |
||||
|
async fetchNextPage(requestData) { |
||||
|
try { |
||||
|
if (requestData.page === -1) return null; |
||||
|
if (!requestData.page) requestData.page = 1; |
||||
|
|
||||
|
const { endpoint, perPage = 100, queryParams = {} } = requestData; |
||||
|
const params = new URLSearchParams({ |
||||
|
...queryParams, |
||||
|
per_page: perPage, |
||||
|
page: requestData.page, |
||||
|
}); |
||||
|
const url = `${this.apiBase}${endpoint}?${params.toString()}`; |
||||
|
|
||||
|
const response = await fetch(url, { |
||||
|
method: "GET", |
||||
|
headers: this.accessToken ? { "PRIVATE-TOKEN": this.accessToken } : {}, |
||||
|
}); |
||||
|
|
||||
|
// Rate limits get hit very often if no PAT is provided
|
||||
|
if (response.status === 401) { |
||||
|
console.warn(`Rate limit hit for ${endpoint}. Skipping.`); |
||||
|
return null; |
||||
|
} |
||||
|
|
||||
|
const totalPages = Number(response.headers.get("x-total-pages")); |
||||
|
const data = await response.json(); |
||||
|
if (!Array.isArray(data)) { |
||||
|
console.warn(`Unexpected response format for ${endpoint}:`, data); |
||||
|
return []; |
||||
|
} |
||||
|
|
||||
|
console.log( |
||||
|
`Gitlab RepoLoader: fetched ${endpoint} page ${requestData.page}/${totalPages} with ${data.length} records.` |
||||
|
); |
||||
|
|
||||
|
if (totalPages === requestData.page) { |
||||
|
requestData.page = -1; |
||||
|
} else { |
||||
|
requestData.page = Number(response.headers.get("x-next-page")); |
||||
|
} |
||||
|
|
||||
|
return data; |
||||
|
} catch (e) { |
||||
|
console.error(`RepoLoader.fetchNextPage`, e); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = GitLabRepoLoader; |
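For context, a minimal usage sketch of this loader. The constructor options mirror the { repo, branch, accessToken } shape used by fetchGitlabFile later in this diff; the repo URL and token value below are placeholders, not part of the file above.

// Illustrative sketch only - not part of the committed file.
const RepoLoader = require("./RepoLoader");

async function example() {
  const loader = new RepoLoader({
    repo: "https://gitlab.com/some-group/some-project", // placeholder public GitLab project URL
    branch: "main",
    accessToken: process.env.GITLAB_PAT || null, // optional PAT for private repos / fewer 401s
  });

  await loader.init(); // resolves the project, validates the branch and token
  if (!loader.ready) throw new Error("Loader could not be prepared - check the URL.");

  const docs = await loader.recursiveLoader(); // [{ pageContent, metadata: { source, url } }, ...]
  console.log(`Loaded ${docs.length} documents from ${loader.repo}`);
}

example().catch(console.error);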
||||
@ -0,0 +1,252 @@ |
|||||
|
const RepoLoader = require("./RepoLoader"); |
||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const { v4 } = require("uuid"); |
||||
|
const { writeToServerDocuments } = require("../../../files"); |
||||
|
const { tokenizeString } = require("../../../tokenizer"); |
||||
|
|
||||
|
/** |
||||
|
* Load in a Gitlab Repo recursively or just the top level if no PAT is provided |
||||
|
* @param {object} args - forwarded request body params |
||||
|
* @param {import("../../../middleware/setDataSigner").ResponseWithSigner} response - Express response object with encryptionWorker |
||||
|
* @returns |
||||
|
*/ |
||||
|
async function loadGitlabRepo(args, response) { |
||||
|
const repo = new RepoLoader(args); |
||||
|
await repo.init(); |
||||
|
|
||||
|
if (!repo.ready) |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Could not prepare Gitlab repo for loading! Check URL", |
||||
|
}; |
||||
|
|
||||
|
console.log( |
||||
|
`-- Working GitLab ${repo.author}/${repo.project}:${repo.branch} --` |
||||
|
); |
||||
|
const docs = await repo.recursiveLoader(); |
||||
|
if (!docs.length) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "No files were found for those settings.", |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`[GitLab Loader]: Found ${docs.length} source files. Saving...`); |
||||
|
const outFolder = slugify( |
||||
|
`${repo.author}-${repo.project}-${repo.branch}-${v4().slice(0, 4)}` |
||||
|
).toLowerCase(); |
||||
|
|
||||
|
const outFolderPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve( |
||||
|
__dirname, |
||||
|
`../../../../../server/storage/documents/${outFolder}` |
||||
|
) |
||||
|
: path.resolve(process.env.STORAGE_DIR, `documents/${outFolder}`); |
||||
|
|
||||
|
if (!fs.existsSync(outFolderPath)) |
||||
|
fs.mkdirSync(outFolderPath, { recursive: true }); |
||||
|
|
||||
|
for (const doc of docs) { |
||||
|
if (!doc.metadata || (!doc.pageContent && !doc.issue)) continue; |
||||
|
let pageContent = null; |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "gitlab://" + doc.metadata.source, |
||||
|
docSource: doc.metadata.source, |
||||
|
chunkSource: generateChunkSource( |
||||
|
repo, |
||||
|
doc, |
||||
|
response.locals.encryptionWorker |
||||
|
), |
||||
|
published: new Date().toLocaleString(), |
||||
|
}; |
||||
|
|
||||
|
if (doc.pageContent) { |
||||
|
pageContent = doc.pageContent; |
||||
|
|
||||
|
data.title = doc.metadata.source; |
||||
|
data.docAuthor = repo.author; |
||||
|
data.description = "No description found."; |
||||
|
} else if (doc.issue) { |
||||
|
pageContent = issueToMarkdown(doc.issue); |
||||
|
|
||||
|
data.title = `Issue ${doc.issue.iid}: ${doc.issue.title}`; |
||||
|
data.docAuthor = doc.issue.author.username; |
||||
|
data.description = doc.issue.description; |
||||
|
} else { |
||||
|
continue; |
||||
|
} |
||||
|
|
||||
|
data.wordCount = pageContent.split(" ").length; |
||||
|
data.token_count_estimate = tokenizeString(pageContent); |
||||
|
data.pageContent = pageContent; |
||||
|
|
||||
|
console.log( |
||||
|
`[GitLab Loader]: Saving ${doc.metadata.source} to ${outFolder}` |
||||
|
); |
||||
|
|
||||
|
writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(doc.metadata.source)}-${data.id}`, |
||||
|
outFolderPath |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
data: { |
||||
|
author: repo.author, |
||||
|
repo: repo.project, |
||||
|
projectId: repo.projectId, |
||||
|
branch: repo.branch, |
||||
|
files: docs.length, |
||||
|
destination: outFolder, |
||||
|
}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
async function fetchGitlabFile({ |
||||
|
repoUrl, |
||||
|
branch, |
||||
|
accessToken = null, |
||||
|
sourceFilePath, |
||||
|
}) { |
||||
|
const repo = new RepoLoader({ |
||||
|
repo: repoUrl, |
||||
|
branch, |
||||
|
accessToken, |
||||
|
}); |
||||
|
await repo.init(); |
||||
|
|
||||
|
if (!repo.ready) |
||||
|
return { |
||||
|
success: false, |
||||
|
content: null, |
||||
|
reason: "Could not prepare GitLab repo for loading! Check URL or PAT.", |
||||
|
}; |
||||
|
console.log( |
||||
|
`-- Working GitLab ${repo.author}/${repo.project}:${repo.branch} file:${sourceFilePath} --` |
||||
|
); |
||||
|
const fileContent = await repo.fetchSingleFileContents(sourceFilePath); |
||||
|
if (!fileContent) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Target file returned a null content response.", |
||||
|
content: null, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
content: fileContent, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
function generateChunkSource(repo, doc, encryptionWorker) { |
||||
|
const payload = { |
||||
|
projectId: decodeURIComponent(repo.projectId), |
||||
|
branch: repo.branch, |
||||
|
path: doc.metadata.source, |
||||
|
pat: !!repo.accessToken ? repo.accessToken : null, |
||||
|
}; |
||||
|
return `gitlab://${repo.repo}?payload=${encryptionWorker.encrypt( |
||||
|
JSON.stringify(payload) |
||||
|
)}`;
|
||||
|
} |
||||
|
|
||||
|
function issueToMarkdown(issue) { |
||||
|
const metadata = {}; |
||||
|
|
||||
|
const userFields = ["author", "assignees", "closed_by"]; |
||||
|
const userToUsername = ({ username }) => username; |
||||
|
for (const userField of userFields) { |
||||
|
if (issue[userField]) { |
||||
|
if (Array.isArray(issue[userField])) { |
||||
|
metadata[userField] = issue[userField].map(userToUsername); |
||||
|
} else { |
||||
|
metadata[userField] = userToUsername(issue[userField]); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
const singleValueFields = [ |
||||
|
"web_url", |
||||
|
"state", |
||||
|
"created_at", |
||||
|
"updated_at", |
||||
|
"closed_at", |
||||
|
"due_date", |
||||
|
"type", |
||||
|
"merge_request_count", |
||||
|
"upvotes", |
||||
|
"downvotes", |
||||
|
"labels", |
||||
|
"has_tasks", |
||||
|
"task_status", |
||||
|
"confidential", |
||||
|
"severity", |
||||
|
]; |
||||
|
|
||||
|
for (const singleValueField of singleValueFields) { |
||||
|
metadata[singleValueField] = issue[singleValueField]; |
||||
|
} |
||||
|
|
||||
|
if (issue.milestone) { |
||||
|
metadata.milestone = `${issue.milestone.title} (${issue.milestone.id})`; |
||||
|
} |
||||
|
|
||||
|
if (issue.time_stats) { |
||||
|
const timeFields = ["time_estimate", "total_time_spent"]; |
||||
|
for (const timeField of timeFields) { |
||||
|
const fieldName = `human_${timeField}`; |
||||
|
if (issue?.time_stats[fieldName]) { |
||||
|
metadata[timeField] = issue.time_stats[fieldName]; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
const metadataString = Object.entries(metadata) |
||||
|
.map(([name, value]) => { |
||||
|
if (!value || value?.length < 1) { |
||||
|
return null; |
||||
|
} |
||||
|
let result = `- ${name.replace("_", " ")}:`; |
||||
|
|
||||
|
if (!Array.isArray(value)) { |
||||
|
result += ` ${value}`; |
||||
|
} else { |
||||
|
result += "\n" + value.map((s) => ` - ${s}`).join("\n"); |
||||
|
} |
||||
|
|
||||
|
return result; |
||||
|
}) |
||||
|
.filter((item) => item != null) |
||||
|
.join("\n"); |
||||
|
|
||||
|
let markdown = `# ${issue.title} (${issue.iid})
|
||||
|
|
||||
|
${issue.description} |
||||
|
|
||||
|
## Metadata |
||||
|
|
||||
|
${metadataString}`;
|
||||
|
|
||||
|
if (issue.discussions.length > 0) { |
||||
|
markdown += `
|
||||
|
|
||||
|
## Activity |
||||
|
|
||||
|
${issue.discussions.join("\n\n")} |
||||
|
`;
|
||||
|
} |
||||
|
|
||||
|
return markdown; |
||||
|
} |
||||
|
|
||||
|
module.exports = { loadGitlabRepo, fetchGitlabFile }; |
||||
@ -0,0 +1,41 @@ |
|||||
|
/** |
||||
|
* Dynamically load the correct repository loader from a specific platform |
||||
|
* by default will return GitHub. |
||||
|
* @param {('github'|'gitlab')} platform |
||||
|
* @returns {import("./GithubRepo/RepoLoader")|import("./GitlabRepo/RepoLoader")} the repo loader class for provider |
||||
|
*/ |
||||
|
function resolveRepoLoader(platform = "github") { |
||||
|
switch (platform) { |
||||
|
case "github": |
||||
|
console.log(`Loading GitHub RepoLoader...`); |
||||
|
return require("./GithubRepo/RepoLoader"); |
||||
|
case "gitlab": |
||||
|
console.log(`Loading GitLab RepoLoader...`); |
||||
|
return require("./GitlabRepo/RepoLoader"); |
||||
|
default: |
||||
|
console.log(`Loading GitHub RepoLoader...`); |
||||
|
return require("./GithubRepo/RepoLoader"); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Dynamically load the correct repository loader function from a specific platform |
||||
|
* by default will return Github. |
||||
|
* @param {('github'|'gitlab')} platform |
||||
|
* @returns {import("./GithubRepo")['fetchGithubFile'] | import("./GitlabRepo")['fetchGitlabFile']} the repo loader class for provider |
||||
|
*/ |
||||
|
function resolveRepoLoaderFunction(platform = "github") { |
||||
|
switch (platform) { |
||||
|
case "github": |
||||
|
console.log(`Loading GitHub loader function...`); |
||||
|
return require("./GithubRepo").loadGithubRepo; |
||||
|
case "gitlab": |
||||
|
console.log(`Loading GitLab loader function...`); |
||||
|
return require("./GitlabRepo").loadGitlabRepo; |
||||
|
default: |
||||
|
console.log(`Loading GitHub loader function...`); |
||||
|
return require("./GithubRepo").loadGithubRepo; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { resolveRepoLoader, resolveRepoLoaderFunction }; |
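Illustrative only: a sketch of how a collector endpoint might combine the two resolvers. The require path, the args shape, and the Express response (which must carry response.locals.encryptionWorker per setDataSigner) are assumptions drawn from the loaders above, and it assumes both loader classes expose getRepoBranches as the GitLab one does.

// Illustrative sketch only - not part of the committed file.
const {
  resolveRepoLoader,
  resolveRepoLoaderFunction,
} = require("./utils/extensions/RepoLoader"); // path from the collector root, adjust as needed

async function listBranchesAndLoad(platform, args, response) {
  // Class form: useful for metadata-only calls such as branch discovery.
  const RepoLoaderKlass = resolveRepoLoader(platform);
  const branches = await new RepoLoaderKlass(args).getRepoBranches();

  // Function form: performs the full ingest and writes documents to storage.
  const loadRepo = resolveRepoLoaderFunction(platform);
  const result = await loadRepo({ ...args, branch: args.branch ?? branches[0] }, response);
  return { branches, result };
}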
||||
@ -0,0 +1,166 @@ |
|||||
|
const { v4 } = require("uuid"); |
||||
|
const { |
||||
|
PuppeteerWebBaseLoader, |
||||
|
} = require("langchain/document_loaders/web/puppeteer"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const { parse } = require("node-html-parser"); |
||||
|
const { writeToServerDocuments } = require("../../files"); |
||||
|
const { tokenizeString } = require("../../tokenizer"); |
||||
|
const path = require("path"); |
||||
|
const fs = require("fs"); |
||||
|
|
||||
|
async function discoverLinks(startUrl, maxDepth = 1, maxLinks = 20) { |
||||
|
const baseUrl = new URL(startUrl); |
||||
|
const discoveredLinks = new Set([startUrl]); |
||||
|
let queue = [[startUrl, 0]]; // [url, currentDepth]
|
||||
|
const scrapedUrls = new Set(); |
||||
|
|
||||
|
for (let currentDepth = 0; currentDepth < maxDepth; currentDepth++) { |
||||
|
const levelSize = queue.length; |
||||
|
const nextQueue = []; |
||||
|
|
||||
|
for (let i = 0; i < levelSize && discoveredLinks.size < maxLinks; i++) { |
||||
|
const [currentUrl, urlDepth] = queue[i]; |
||||
|
|
||||
|
if (!scrapedUrls.has(currentUrl)) { |
||||
|
scrapedUrls.add(currentUrl); |
||||
|
const newLinks = await getPageLinks(currentUrl, baseUrl); |
||||
|
|
||||
|
for (const link of newLinks) { |
||||
|
if (!discoveredLinks.has(link) && discoveredLinks.size < maxLinks) { |
||||
|
discoveredLinks.add(link); |
||||
|
if (urlDepth + 1 < maxDepth) { |
||||
|
nextQueue.push([link, urlDepth + 1]); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
queue = nextQueue; |
||||
|
if (queue.length === 0 || discoveredLinks.size >= maxLinks) break; |
||||
|
} |
||||
|
|
||||
|
return Array.from(discoveredLinks); |
||||
|
} |
||||
|
|
||||
|
async function getPageLinks(url, baseUrl) { |
||||
|
try { |
||||
|
const loader = new PuppeteerWebBaseLoader(url, { |
||||
|
launchOptions: { headless: "new" }, |
||||
|
gotoOptions: { waitUntil: "networkidle2" }, |
||||
|
}); |
||||
|
const docs = await loader.load(); |
||||
|
const html = docs[0].pageContent; |
||||
|
const links = extractLinks(html, baseUrl); |
||||
|
return links; |
||||
|
} catch (error) { |
||||
|
console.error(`Failed to get page links from ${url}.`, error); |
||||
|
return []; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
function extractLinks(html, baseUrl) { |
||||
|
const root = parse(html); |
||||
|
const links = root.querySelectorAll("a"); |
||||
|
const extractedLinks = new Set(); |
||||
|
|
||||
|
for (const link of links) { |
||||
|
const href = link.getAttribute("href"); |
||||
|
if (href) { |
||||
|
const absoluteUrl = new URL(href, baseUrl.href).href; |
||||
|
if ( |
||||
|
absoluteUrl.startsWith( |
||||
|
baseUrl.origin + baseUrl.pathname.split("/").slice(0, -1).join("/") |
||||
|
) |
||||
|
) { |
||||
|
extractedLinks.add(absoluteUrl); |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
return Array.from(extractedLinks); |
||||
|
} |
||||
|
|
||||
|
async function bulkScrapePages(links, outFolderPath) { |
||||
|
const scrapedData = []; |
||||
|
|
||||
|
for (let i = 0; i < links.length; i++) { |
||||
|
const link = links[i]; |
||||
|
console.log(`Scraping ${i + 1}/${links.length}: ${link}`); |
||||
|
|
||||
|
try { |
||||
|
const loader = new PuppeteerWebBaseLoader(link, { |
||||
|
launchOptions: { headless: "new" }, |
||||
|
gotoOptions: { waitUntil: "networkidle2" }, |
||||
|
async evaluate(page, browser) { |
||||
|
const result = await page.evaluate(() => document.body.innerText); |
||||
|
await browser.close(); |
||||
|
return result; |
||||
|
}, |
||||
|
}); |
||||
|
const docs = await loader.load(); |
||||
|
const content = docs[0].pageContent; |
||||
|
|
||||
|
if (!content.length) { |
||||
|
console.warn(`Empty content for ${link}. Skipping.`); |
||||
|
continue; |
||||
|
} |
||||
|
|
||||
|
const url = new URL(link); |
||||
|
const decodedPathname = decodeURIComponent(url.pathname); |
||||
|
const filename = `${url.hostname}${decodedPathname.replace(/\//g, "_")}`; |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: "file://" + slugify(filename) + ".html", |
||||
|
title: slugify(filename) + ".html", |
||||
|
docAuthor: "no author found", |
||||
|
description: "No description found.", |
||||
|
docSource: "URL link uploaded by the user.", |
||||
|
chunkSource: `link://${link}`, |
||||
|
published: new Date().toLocaleString(), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
writeToServerDocuments(data, data.title, outFolderPath); |
||||
|
scrapedData.push(data); |
||||
|
|
||||
|
console.log(`Successfully scraped ${link}.`); |
||||
|
} catch (error) { |
||||
|
console.error(`Failed to scrape ${link}.`, error); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
return scrapedData; |
||||
|
} |
||||
|
|
||||
|
async function websiteScraper(startUrl, depth = 1, maxLinks = 20) { |
||||
|
const websiteName = new URL(startUrl).hostname; |
||||
|
const outFolder = slugify( |
||||
|
`${slugify(websiteName)}-${v4().slice(0, 4)}` |
||||
|
).toLowerCase(); |
||||
|
const outFolderPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve( |
||||
|
__dirname, |
||||
|
`../../../../server/storage/documents/${outFolder}` |
||||
|
) |
||||
|
: path.resolve(process.env.STORAGE_DIR, `documents/${outFolder}`); |
||||
|
|
||||
|
console.log("Discovering links..."); |
||||
|
const linksToScrape = await discoverLinks(startUrl, depth, maxLinks); |
||||
|
console.log(`Found ${linksToScrape.length} links to scrape.`); |
||||
|
|
||||
|
if (!fs.existsSync(outFolderPath)) |
||||
|
fs.mkdirSync(outFolderPath, { recursive: true }); |
||||
|
console.log("Starting bulk scraping..."); |
||||
|
const scrapedData = await bulkScrapePages(linksToScrape, outFolderPath); |
||||
|
console.log(`Scraped ${scrapedData.length} pages.`); |
||||
|
|
||||
|
return scrapedData; |
||||
|
} |
||||
|
|
||||
|
module.exports = websiteScraper; |
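Illustrative invocation only; the URL, depth, and link cap below are placeholders, and STORAGE_DIR must point at the server storage directory when not running in development mode.

// Illustrative sketch only - not part of the committed file.
const websiteScraper = require("./utils/extensions/WebsiteDepth"); // path from the collector root

(async () => {
  // Crawl two levels deep from the start page, capped at 10 discovered links.
  const pages = await websiteScraper("https://docs.example.com/", 2, 10);
  console.log(`Scraped and saved ${pages.length} pages.`);
})().catch(console.error);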
||||
@ -0,0 +1,90 @@ |
|||||
|
/* |
||||
|
* This is just a custom implementation of the Langchain JS YouTubeLoader class |
||||
|
 * as the dependency for YoutubeTranscript is quite fickle and it's a rat race to keep it up |
||||
|
* and instead of waiting for patches we can just bring this simple script in-house and at least |
||||
|
 * be able to patch it since it's so flaky. When we have more connectors we can kill this because |
||||
|
* it will be a pain to maintain over time. |
||||
|
*/ |
||||
|
class YoutubeLoader { |
||||
|
#videoId; |
||||
|
#language; |
||||
|
#addVideoInfo; |
||||
|
|
||||
|
constructor({ videoId = null, language = null, addVideoInfo = false } = {}) { |
||||
|
if (!videoId) throw new Error("Invalid video id!"); |
||||
|
this.#videoId = videoId; |
||||
|
this.#language = language; |
||||
|
this.#addVideoInfo = addVideoInfo; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Extracts the videoId from a YouTube video URL. |
||||
|
* @param url The URL of the YouTube video. |
||||
|
* @returns The videoId of the YouTube video. |
||||
|
*/ |
||||
|
static getVideoID(url) { |
||||
|
const match = url.match( |
||||
|
/.*(?:youtu.be\/|v\/|u\/\w\/|embed\/|watch\?v=)([^#&?]*).*/ |
||||
|
); |
||||
|
if (match !== null && match[1].length === 11) { |
||||
|
return match[1]; |
||||
|
} else { |
||||
|
throw new Error("Failed to get youtube video id from the url"); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Creates a new instance of the YoutubeLoader class from a YouTube video |
||||
|
* URL. |
||||
|
* @param url The URL of the YouTube video. |
||||
|
* @param config Optional configuration options for the YoutubeLoader instance, excluding the videoId. |
||||
|
* @returns A new instance of the YoutubeLoader class. |
||||
|
*/ |
||||
|
static createFromUrl(url, config = {}) { |
||||
|
const videoId = YoutubeLoader.getVideoID(url); |
||||
|
return new YoutubeLoader({ ...config, videoId }); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Loads the transcript and video metadata from the specified YouTube |
||||
|
* video. It uses the youtube-transcript library to fetch the transcript |
||||
|
* and the youtubei.js library to fetch the video metadata. |
||||
|
 * @returns A Langchain-like doc array with a single element containing pageContent and metadata. |
||||
|
*/ |
||||
|
async load() { |
||||
|
let transcript; |
||||
|
const metadata = { |
||||
|
source: this.#videoId, |
||||
|
}; |
||||
|
try { |
||||
|
const { YoutubeTranscript } = require("./youtube-transcript"); |
||||
|
transcript = await YoutubeTranscript.fetchTranscript(this.#videoId, { |
||||
|
lang: this.#language, |
||||
|
}); |
||||
|
if (!transcript) { |
||||
|
throw new Error("Transcription not found"); |
||||
|
} |
||||
|
if (this.#addVideoInfo) { |
||||
|
const { Innertube } = require("youtubei.js"); |
||||
|
const youtube = await Innertube.create(); |
||||
|
const info = (await youtube.getBasicInfo(this.#videoId)).basic_info; |
||||
|
metadata.description = info.short_description; |
||||
|
metadata.title = info.title; |
||||
|
metadata.view_count = info.view_count; |
||||
|
metadata.author = info.author; |
||||
|
} |
||||
|
} catch (e) { |
||||
|
throw new Error( |
||||
|
`Failed to get YouTube video transcription: ${e?.message}` |
||||
|
); |
||||
|
} |
||||
|
return [ |
||||
|
{ |
||||
|
pageContent: transcript, |
||||
|
metadata, |
||||
|
}, |
||||
|
]; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports.YoutubeLoader = YoutubeLoader; |
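A minimal sketch of driving this loader directly. The video URL is a placeholder; addVideoInfo pulls title/author metadata via youtubei.js as described above.

// Illustrative sketch only - not part of the committed file.
const { YoutubeLoader } = require("./YoutubeLoader");

(async () => {
  // Placeholder URL - any youtube.com/watch?v=... or youtu.be/... link works here.
  const loader = YoutubeLoader.createFromUrl(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    { language: "en", addVideoInfo: true }
  );
  const [doc] = await loader.load(); // { pageContent: "<transcript>", metadata: { source, title, author, ... } }
  console.log(doc.metadata.title, doc.pageContent.length);
})().catch(console.error);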
||||
@ -0,0 +1,117 @@ |
|||||
|
const { parse } = require("node-html-parser"); |
||||
|
const RE_YOUTUBE = |
||||
|
/(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?)\/|.*[?&]v=)|youtu\.be\/)([^"&?\/\s]{11})/i; |
||||
|
const USER_AGENT = |
||||
|
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36,gzip(gfe)"; |
||||
|
|
||||
|
class YoutubeTranscriptError extends Error { |
||||
|
constructor(message) { |
||||
|
super(`[YoutubeTranscript] ${message}`); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
 * Class to retrieve a transcript if one exists |
||||
|
*/ |
||||
|
class YoutubeTranscript { |
||||
|
/** |
||||
|
* Fetch transcript from YTB Video |
||||
|
* @param videoId Video url or video identifier |
||||
|
* @param config Object with lang param (eg: en, es, hk, uk) format. |
||||
|
 * Will just grab the first caption if it can find one, so no special lang caption support. |
||||
|
*/ |
||||
|
static async fetchTranscript(videoId, config = {}) { |
||||
|
const identifier = this.retrieveVideoId(videoId); |
||||
|
const lang = config?.lang ?? "en"; |
||||
|
try { |
||||
|
const transcriptUrl = await fetch( |
||||
|
`https://www.youtube.com/watch?v=${identifier}`, |
||||
|
{ |
||||
|
headers: { |
||||
|
"User-Agent": USER_AGENT, |
||||
|
}, |
||||
|
} |
||||
|
) |
||||
|
.then((res) => res.text()) |
||||
|
.then((html) => parse(html)) |
||||
|
.then((html) => this.#parseTranscriptEndpoint(html, lang)); |
||||
|
|
||||
|
if (!transcriptUrl) |
||||
|
throw new Error("Failed to locate a transcript for this video!"); |
||||
|
|
||||
|
// Result is hopefully some XML.
|
||||
|
const transcriptXML = await fetch(transcriptUrl) |
||||
|
.then((res) => res.text()) |
||||
|
.then((xml) => parse(xml)); |
||||
|
|
||||
|
let transcript = ""; |
||||
|
const chunks = transcriptXML.getElementsByTagName("text"); |
||||
|
for (const chunk of chunks) { |
||||
|
// Add space after each text chunk
|
||||
|
transcript += chunk.textContent + " "; |
||||
|
} |
||||
|
|
||||
|
// Trim extra whitespace
|
||||
|
return transcript.trim().replace(/\s+/g, " "); |
||||
|
} catch (e) { |
||||
|
throw new YoutubeTranscriptError(e); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
static #parseTranscriptEndpoint(document, langCode = null) { |
||||
|
try { |
||||
|
// Get all script tags on document page
|
||||
|
const scripts = document.getElementsByTagName("script"); |
||||
|
|
||||
|
// find the player data script.
|
||||
|
const playerScript = scripts.find((script) => |
||||
|
script.textContent.includes("var ytInitialPlayerResponse = {") |
||||
|
); |
||||
|
|
||||
|
const dataString = |
||||
|
playerScript.textContent |
||||
|
?.split("var ytInitialPlayerResponse = ")?.[1] //get the start of the object {....
|
||||
|
?.split("};")?.[0] + // chunk off any code after object closure.
|
||||
|
"}"; // add back that curly brace we just cut.
|
||||
|
|
||||
|
const data = JSON.parse(dataString.trim()); // Attempt a JSON parse
|
||||
|
const availableCaptions = |
||||
|
data?.captions?.playerCaptionsTracklistRenderer?.captionTracks || []; |
||||
|
|
||||
|
// If languageCode was specified then search for its code, otherwise get the first.
|
||||
|
let captionTrack = availableCaptions?.[0]; |
||||
|
if (langCode) |
||||
|
captionTrack = |
||||
|
availableCaptions.find((track) => |
||||
|
track.languageCode.includes(langCode) |
||||
|
) ?? availableCaptions?.[0]; |
||||
|
|
||||
|
return captionTrack?.baseUrl; |
||||
|
} catch (e) { |
||||
|
console.error(`YoutubeTranscript.#parseTranscriptEndpoint ${e.message}`); |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Retrieve video id from url or string |
||||
|
* @param videoId video url or video id |
||||
|
*/ |
||||
|
static retrieveVideoId(videoId) { |
||||
|
if (videoId.length === 11) { |
||||
|
return videoId; |
||||
|
} |
||||
|
const matchId = videoId.match(RE_YOUTUBE); |
||||
|
if (matchId && matchId.length) { |
||||
|
return matchId[1]; |
||||
|
} |
||||
|
throw new YoutubeTranscriptError( |
||||
|
"Impossible to retrieve Youtube video ID." |
||||
|
); |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
YoutubeTranscript, |
||||
|
YoutubeTranscriptError, |
||||
|
}; |
||||
@ -0,0 +1,142 @@ |
|||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const { default: slugify } = require("slugify"); |
||||
|
const { v4 } = require("uuid"); |
||||
|
const { writeToServerDocuments } = require("../../files"); |
||||
|
const { tokenizeString } = require("../../tokenizer"); |
||||
|
const { YoutubeLoader } = require("./YoutubeLoader"); |
||||
|
|
||||
|
function validYoutubeVideoUrl(link) { |
||||
|
const UrlPattern = require("url-pattern"); |
||||
|
const opts = new URL(link); |
||||
|
const url = `${opts.protocol}//${opts.host}${opts.pathname}${ |
||||
|
opts.searchParams.has("v") ? `?v=${opts.searchParams.get("v")}` : "" |
||||
|
}`;
|
||||
|
|
||||
|
const shortPatternMatch = new UrlPattern( |
||||
|
"https\\://(www.)youtu.be/(:videoId)" |
||||
|
).match(url); |
||||
|
const fullPatternMatch = new UrlPattern( |
||||
|
"https\\://(www.)youtube.com/watch?v=(:videoId)" |
||||
|
).match(url); |
||||
|
const videoId = |
||||
|
shortPatternMatch?.videoId || fullPatternMatch?.videoId || null; |
||||
|
if (!!videoId) return true; |
||||
|
|
||||
|
return false; |
||||
|
} |
||||
|
|
||||
|
async function fetchVideoTranscriptContent({ url }) { |
||||
|
if (!validYoutubeVideoUrl(url)) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "Invalid URL. Should be youtu.be or youtube.com/watch.", |
||||
|
content: null, |
||||
|
metadata: {}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
console.log(`-- Working YouTube ${url} --`); |
||||
|
const loader = YoutubeLoader.createFromUrl(url, { addVideoInfo: true }); |
||||
|
const { docs, error } = await loader |
||||
|
.load() |
||||
|
.then((docs) => { |
||||
|
return { docs, error: null }; |
||||
|
}) |
||||
|
.catch((e) => { |
||||
|
return { |
||||
|
docs: [], |
||||
|
error: e.message?.split("Error:")?.[1] || e.message, |
||||
|
}; |
||||
|
}); |
||||
|
|
||||
|
if (!docs.length || !!error) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: error ?? "No transcript found for that YouTube video.", |
||||
|
content: null, |
||||
|
metadata: {}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
const metadata = docs[0].metadata; |
||||
|
const content = docs[0].pageContent; |
||||
|
if (!content.length) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: "No transcript could be parsed for that YouTube video.", |
||||
|
content: null, |
||||
|
metadata: {}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: null, |
||||
|
content, |
||||
|
metadata, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
async function loadYouTubeTranscript({ url }) { |
||||
|
const transcriptResults = await fetchVideoTranscriptContent({ url }); |
||||
|
if (!transcriptResults.success) { |
||||
|
return { |
||||
|
success: false, |
||||
|
reason: |
||||
|
transcriptResults.reason || |
||||
|
"An unknown error occurred during transcription retrieval", |
||||
|
}; |
||||
|
} |
||||
|
const { content, metadata } = transcriptResults; |
||||
|
const outFolder = slugify( |
||||
|
`${metadata.author} YouTube transcripts` |
||||
|
).toLowerCase(); |
||||
|
|
||||
|
const outFolderPath = |
||||
|
process.env.NODE_ENV === "development" |
||||
|
? path.resolve( |
||||
|
__dirname, |
||||
|
`../../../../server/storage/documents/${outFolder}` |
||||
|
) |
||||
|
: path.resolve(process.env.STORAGE_DIR, `documents/${outFolder}`); |
||||
|
|
||||
|
if (!fs.existsSync(outFolderPath)) |
||||
|
fs.mkdirSync(outFolderPath, { recursive: true }); |
||||
|
|
||||
|
const data = { |
||||
|
id: v4(), |
||||
|
url: url + ".youtube", |
||||
|
title: metadata.title || url, |
||||
|
docAuthor: metadata.author, |
||||
|
description: metadata.description, |
||||
|
docSource: url, |
||||
|
chunkSource: `youtube://${url}`, |
||||
|
published: new Date().toLocaleString(), |
||||
|
wordCount: content.split(" ").length, |
||||
|
pageContent: content, |
||||
|
token_count_estimate: tokenizeString(content), |
||||
|
}; |
||||
|
|
||||
|
console.log(`[YouTube Loader]: Saving ${metadata.title} to ${outFolder}`); |
||||
|
writeToServerDocuments( |
||||
|
data, |
||||
|
`${slugify(metadata.title)}-${data.id}`, |
||||
|
outFolderPath |
||||
|
); |
||||
|
|
||||
|
return { |
||||
|
success: true, |
||||
|
reason: "test", |
||||
|
data: { |
||||
|
title: metadata.title, |
||||
|
author: metadata.author, |
||||
|
destination: outFolder, |
||||
|
}, |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
loadYouTubeTranscript, |
||||
|
fetchVideoTranscriptContent, |
||||
|
}; |
||||
@ -0,0 +1,192 @@ |
|||||
|
const fs = require("fs"); |
||||
|
const path = require("path"); |
||||
|
const { MimeDetector } = require("./mime"); |
||||
|
|
||||
|
/** |
||||
|
* Checks if a file is text by checking the mime type and then falling back to buffer inspection. |
||||
|
 * This way we can capture all the cases where the mime type is not known but the file is still parseable as text |
||||
|
* without having to constantly add new mime type overrides. |
||||
|
* @param {string} filepath - The path to the file. |
||||
|
* @returns {boolean} - Returns true if the file is text, false otherwise. |
||||
|
*/ |
||||
|
function isTextType(filepath) { |
||||
|
if (!fs.existsSync(filepath)) return false; |
||||
|
const result = isKnownTextMime(filepath); |
||||
|
if (result.valid) return true; // Known text type - return true.
|
||||
|
if (result.reason !== "generic") return false; // If any other reason than generic - return false.
|
||||
|
return parseableAsText(filepath); // Fallback to parsing as text via buffer inspection.
|
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Checks if a file is known to be text by checking the mime type. |
||||
|
* @param {string} filepath - The path to the file. |
||||
|
 * @returns {{valid: boolean, reason: string}} - valid is true if the file is known to be text; reason explains why it is not. |
||||
|
*/ |
||||
|
function isKnownTextMime(filepath) { |
||||
|
try { |
||||
|
const mimeLib = new MimeDetector(); |
||||
|
const mime = mimeLib.getType(filepath); |
||||
|
if (mimeLib.badMimes.includes(mime)) |
||||
|
return { valid: false, reason: "bad_mime" }; |
||||
|
|
||||
|
const type = mime.split("/")[0]; |
||||
|
if (mimeLib.nonTextTypes.includes(type)) |
||||
|
return { valid: false, reason: "non_text_mime" }; |
||||
|
return { valid: true, reason: "valid_mime" }; |
||||
|
} catch (e) { |
||||
|
return { valid: false, reason: "generic" }; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Checks if a file is parseable as text by forcing it to be read as text in utf8 encoding. |
||||
|
* If the file looks too much like a binary file, it will return false. |
||||
|
* @param {string} filepath - The path to the file. |
||||
|
* @returns {boolean} - Returns true if the file is parseable as text, false otherwise. |
||||
|
*/ |
||||
|
function parseableAsText(filepath) { |
||||
|
try { |
||||
|
const fd = fs.openSync(filepath, "r"); |
||||
|
const buffer = Buffer.alloc(1024); // Read first 1KB of the file synchronously
|
||||
|
const bytesRead = fs.readSync(fd, buffer, 0, 1024, 0); |
||||
|
fs.closeSync(fd); |
||||
|
|
||||
|
const content = buffer.subarray(0, bytesRead).toString("utf8"); |
||||
|
const nullCount = (content.match(/\0/g) || []).length; |
||||
|
const controlCount = (content.match(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g) || []) |
||||
|
.length; |
||||
|
|
||||
|
const threshold = bytesRead * 0.1; |
||||
|
return nullCount + controlCount < threshold; |
||||
|
} catch { |
||||
|
return false; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
function trashFile(filepath) { |
||||
|
if (!fs.existsSync(filepath)) return; |
||||
|
|
||||
|
try { |
||||
|
const isDir = fs.lstatSync(filepath).isDirectory(); |
||||
|
if (isDir) return; |
||||
|
} catch { |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
fs.rmSync(filepath); |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
function createdDate(filepath) { |
||||
|
try { |
||||
|
const { birthtimeMs, birthtime } = fs.statSync(filepath); |
||||
|
if (birthtimeMs === 0) throw new Error("Invalid stat for file!"); |
||||
|
return birthtime.toLocaleString(); |
||||
|
} catch { |
||||
|
return "unknown"; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
function writeToServerDocuments( |
||||
|
data = {}, |
||||
|
filename, |
||||
|
destinationOverride = null |
||||
|
) { |
||||
|
const destination = destinationOverride |
||||
|
? path.resolve(destinationOverride) |
||||
|
: path.resolve( |
||||
|
__dirname, |
||||
|
"../../../server/storage/documents/custom-documents" |
||||
|
); |
||||
|
if (!fs.existsSync(destination)) |
||||
|
fs.mkdirSync(destination, { recursive: true }); |
||||
|
const destinationFilePath = path.resolve(destination, filename) + ".json"; |
||||
|
|
||||
|
fs.writeFileSync(destinationFilePath, JSON.stringify(data, null, 4), { |
||||
|
encoding: "utf-8", |
||||
|
}); |
||||
|
|
||||
|
return { |
||||
|
...data, |
||||
|
// relative location string that can be passed into the /update-embeddings api
|
||||
|
// that will work since we know the location exists and since we only allow
|
||||
|
// 1-level deep folders this will always work. This still works for integrations like GitHub and YouTube.
|
||||
|
location: destinationFilePath.split("/").slice(-2).join("/"), |
||||
|
}; |
||||
|
} |
||||
|
|
||||
|
// When required we can wipe the entire collector hotdir and tmp storage in case
|
||||
|
// there were some large file failures that were unable to be removed; a reboot will
|
||||
|
// force remove them.
|
||||
|
async function wipeCollectorStorage() { |
||||
|
const cleanHotDir = new Promise((resolve) => { |
||||
|
const directory = path.resolve(__dirname, "../../hotdir"); |
||||
|
fs.readdir(directory, (err, files) => { |
||||
|
if (err) return resolve(); |
||||
|
|
||||
|
for (const file of files) { |
||||
|
if (file === "__HOTDIR__.md") continue; |
||||
|
try { |
||||
|
fs.rmSync(path.join(directory, file)); |
||||
|
} catch {} |
||||
|
} |
||||
|
resolve(); |
||||
|
}); |
||||
|
}); |
||||
|
|
||||
|
const cleanTmpDir = new Promise((resolve) => { |
||||
|
const directory = path.resolve(__dirname, "../../storage/tmp"); |
||||
|
fs.readdir(directory, (err, files) => { |
||||
|
if (err) return resolve(); |
||||
|
|
||||
|
for (const file of files) { |
||||
|
if (file === ".placeholder") continue; |
||||
|
try { |
||||
|
fs.rmSync(path.join(directory, file)); |
||||
|
} catch {} |
||||
|
} |
||||
|
resolve(); |
||||
|
}); |
||||
|
}); |
||||
|
|
||||
|
await Promise.all([cleanHotDir, cleanTmpDir]); |
||||
|
console.log(`Collector hot directory and tmp storage wiped!`); |
||||
|
return; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Checks if a given path is within another path. |
||||
|
* @param {string} outer - The outer path (should be resolved). |
||||
|
* @param {string} inner - The inner path (should be resolved). |
||||
|
* @returns {boolean} - Returns true if the inner path is within the outer path, false otherwise. |
||||
|
*/ |
||||
|
function isWithin(outer, inner) { |
||||
|
if (outer === inner) return false; |
||||
|
const rel = path.relative(outer, inner); |
||||
|
return !rel.startsWith("../") && rel !== ".."; |
||||
|
} |
||||
|
|
||||
|
function normalizePath(filepath = "") { |
||||
|
const result = path |
||||
|
.normalize(filepath.trim()) |
||||
|
.replace(/^(\.\.(\/|\\|$))+/, "") |
||||
|
.trim(); |
||||
|
if (["..", ".", "/"].includes(result)) throw new Error("Invalid path."); |
||||
|
return result; |
||||
|
} |
||||
|
|
||||
|
function sanitizeFileName(fileName) { |
||||
|
if (!fileName) return fileName; |
||||
|
return fileName.replace(/[<>:"\/\\|?*]/g, ""); |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
trashFile, |
||||
|
isTextType, |
||||
|
createdDate, |
||||
|
writeToServerDocuments, |
||||
|
wipeCollectorStorage, |
||||
|
normalizePath, |
||||
|
isWithin, |
||||
|
sanitizeFileName, |
||||
|
}; |
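A short sketch showing how a converter might combine these helpers, following the mime-then-buffer-inspection flow documented above. The require paths and file names are placeholders.

// Illustrative sketch only - not part of the committed file.
const fs = require("fs");
const { v4 } = require("uuid");
const { isTextType, writeToServerDocuments, trashFile } = require("./utils/files"); // path from the collector root

function ingestPlainTextFile(fullFilePath) {
  // Skip anything that is not a known text mime and not parseable as text.
  if (!isTextType(fullFilePath)) return { success: false, reason: "Not a text file." };

  const content = fs.readFileSync(fullFilePath, "utf8");
  const data = {
    id: v4(),
    title: fullFilePath.split("/").pop(),
    pageContent: content,
    wordCount: content.split(" ").length,
  };

  // Persists as JSON into server/storage/documents/custom-documents by default.
  const document = writeToServerDocuments(data, `example-${data.id}`);
  trashFile(fullFilePath); // remove the original from the hotdir once processed
  return { success: true, document };
}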
||||
@ -0,0 +1,64 @@ |
|||||
|
const MimeLib = require("mime"); |
||||
|
class MimeDetector { |
||||
|
nonTextTypes = ["multipart", "model", "audio", "video", "font"]; |
||||
|
badMimes = [ |
||||
|
"application/octet-stream", |
||||
|
"application/zip", |
||||
|
"application/pkcs8", |
||||
|
"application/vnd.microsoft.portable-executable", |
||||
|
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", // XLSX are binaries and need to be handled explicitly.
|
||||
|
"application/x-msdownload", |
||||
|
]; |
||||
|
|
||||
|
constructor() { |
||||
|
this.lib = MimeLib; |
||||
|
this.setOverrides(); |
||||
|
} |
||||
|
|
||||
|
setOverrides() { |
||||
|
// the .ts extension maps to video/mp2t because of https://en.wikipedia.org/wiki/MPEG_transport_stream
|
||||
|
// which has had this extension far before TS was invented. So need to force re-map this MIME map.
|
||||
|
this.lib.define( |
||||
|
{ |
||||
|
"text/plain": [ |
||||
|
"ts", |
||||
|
"tsx", |
||||
|
"py", |
||||
|
"opts", |
||||
|
"lock", |
||||
|
"jsonl", |
||||
|
"qml", |
||||
|
"sh", |
||||
|
"c", |
||||
|
"cs", |
||||
|
"h", |
||||
|
"js", |
||||
|
"lua", |
||||
|
"pas", |
||||
|
"r", |
||||
|
"go", |
||||
|
"ino", |
||||
|
"hpp", |
||||
|
"linq", |
||||
|
"cs", |
||||
|
], |
||||
|
}, |
||||
|
true |
||||
|
); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Returns the MIME type of the file. If the file has no extension found, it will be processed as a text file. |
||||
|
* @param {string} filepath |
||||
|
* @returns {string} |
||||
|
*/ |
||||
|
getType(filepath) { |
||||
|
const parsedMime = this.lib.getType(filepath); |
||||
|
if (!!parsedMime) return parsedMime; |
||||
|
return null; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
MimeDetector, |
||||
|
}; |
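The .ts override above is easiest to see with a quick sketch; the file names and require path are placeholders.

// Illustrative sketch only - not part of the committed file.
const { MimeDetector } = require("./utils/files/mime"); // path from the collector root

const detector = new MimeDetector();
console.log(detector.getType("component.ts"));    // "text/plain" thanks to the override, not "video/mp2t"
console.log(detector.getType("report.xlsx"));     // returned as-is, but listed in badMimes so callers treat it as non-text
console.log(detector.getType("archive.unknown")); // null when the extension is not recognized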
||||
@ -0,0 +1,18 @@ |
|||||
|
process.env.NODE_ENV === "development" |
||||
|
? require("dotenv").config({ path: `.env.${process.env.NODE_ENV}` }) |
||||
|
: require("dotenv").config(); |
||||
|
|
||||
|
function reqBody(request) { |
||||
|
return typeof request.body === "string" |
||||
|
? JSON.parse(request.body) |
||||
|
: request.body; |
||||
|
} |
||||
|
|
||||
|
function queryParams(request) { |
||||
|
return request.query; |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
reqBody, |
||||
|
queryParams, |
||||
|
}; |
||||
@ -0,0 +1,68 @@ |
|||||
|
const winston = require("winston"); |
||||
|
|
||||
|
class Logger { |
||||
|
logger = console; |
||||
|
static _instance; |
||||
|
constructor() { |
||||
|
if (Logger._instance) return Logger._instance; |
||||
|
this.logger = |
||||
|
process.env.NODE_ENV === "production" ? this.getWinstonLogger() : console; |
||||
|
Logger._instance = this; |
||||
|
} |
||||
|
|
||||
|
getWinstonLogger() { |
||||
|
const logger = winston.createLogger({ |
||||
|
level: "info", |
||||
|
defaultMeta: { service: "collector" }, |
||||
|
transports: [ |
||||
|
new winston.transports.Console({ |
||||
|
format: winston.format.combine( |
||||
|
winston.format.colorize(), |
||||
|
winston.format.printf( |
||||
|
({ level, message, service, origin = "" }) => { |
||||
|
return `\x1b[36m[${service}]\x1b[0m${ |
||||
|
origin ? `\x1b[33m[${origin}]\x1b[0m` : "" |
||||
|
} ${level}: ${message}`;
|
||||
|
} |
||||
|
) |
||||
|
), |
||||
|
}), |
||||
|
], |
||||
|
}); |
||||
|
|
||||
|
function formatArgs(args) { |
||||
|
return args |
||||
|
.map((arg) => { |
||||
|
if (arg instanceof Error) { |
||||
|
return arg.stack; // If argument is an Error object, return its stack trace
|
||||
|
} else if (typeof arg === "object") { |
||||
|
return JSON.stringify(arg); // Convert objects to JSON string
|
||||
|
} else { |
||||
|
return arg; // Otherwise, return as-is
|
||||
|
} |
||||
|
}) |
||||
|
.join(" "); |
||||
|
} |
||||
|
|
||||
|
console.log = function (...args) { |
||||
|
logger.info(formatArgs(args)); |
||||
|
}; |
||||
|
console.error = function (...args) { |
||||
|
logger.error(formatArgs(args)); |
||||
|
}; |
||||
|
console.info = function (...args) { |
||||
|
logger.info(formatArgs(args)); |
||||
|
}; |
||||
|
return logger; |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Sets and overrides Console methods for logging when called. |
||||
|
* This is a singleton method and will not create multiple loggers. |
||||
|
* @returns {winston.Logger | console} - instantiated logger interface. |
||||
|
*/ |
||||
|
function setLogger() { |
||||
|
return new Logger().logger; |
||||
|
} |
||||
|
module.exports = setLogger; |
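Intended wiring, sketched: in production the console methods are rerouted through winston, in development they are left untouched. The require path and log payloads are placeholders.

// Illustrative sketch only - not part of the committed file.
// Typically called once at process start, e.g. at the top of the collector's index.js.
const setLogger = require("./utils/logger"); // path from the collector root
setLogger();

// After this, ordinary console calls flow through the configured logger.
console.log("Collector booted", { port: 8888 });  // info level, object serialized to JSON
console.error(new Error("example failure"));      // error level, stack trace included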
||||
@ -0,0 +1,66 @@ |
|||||
|
const { getEncoding } = require("js-tiktoken"); |
||||
|
|
||||
|
class TikTokenTokenizer { |
||||
|
static MAX_KB_ESTIMATE = 10; |
||||
|
static DIVISOR = 8; |
||||
|
|
||||
|
constructor() { |
||||
|
if (TikTokenTokenizer.instance) { |
||||
|
this.log( |
||||
|
"Singleton instance already exists. Returning existing instance." |
||||
|
); |
||||
|
return TikTokenTokenizer.instance; |
||||
|
} |
||||
|
|
||||
|
this.encoder = getEncoding("cl100k_base"); |
||||
|
TikTokenTokenizer.instance = this; |
||||
|
this.log("Initialized new TikTokenTokenizer instance."); |
||||
|
} |
||||
|
|
||||
|
log(text, ...args) { |
||||
|
console.log(`\x1b[35m[TikTokenTokenizer]\x1b[0m ${text}`, ...args); |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Check if the input is too long to encode |
||||
|
* this is more of a rough estimate and a sanity check to prevent |
||||
|
 * CPU issues from encoding overly large strings |
||||
|
* Assumes 1 character = 2 bytes in JS |
||||
|
* @param {string} input |
||||
|
* @returns {boolean} |
||||
|
*/ |
||||
|
#isTooLong(input) { |
||||
|
const bytesEstimate = input.length * 2; |
||||
|
const kbEstimate = Math.floor(bytesEstimate / 1024); |
||||
|
return kbEstimate >= TikTokenTokenizer.MAX_KB_ESTIMATE; |
||||
|
} |
||||
|
|
||||
|
/** |
||||
|
* Encode a string into tokens for rough token count estimation. |
||||
|
* @param {string} input |
||||
|
* @returns {number} |
||||
|
*/ |
||||
|
tokenizeString(input = "") { |
||||
|
try { |
||||
|
if (this.#isTooLong(input)) { |
||||
|
this.log("Input will take too long to encode - estimating"); |
||||
|
return Math.ceil(input.length / TikTokenTokenizer.DIVISOR); |
||||
|
} |
||||
|
|
||||
|
return this.encoder.encode(input).length; |
||||
|
} catch (e) { |
||||
|
this.log("Could not tokenize string! Estimating...", e.message, e.stack); |
||||
|
return Math.ceil(input?.length / TikTokenTokenizer.DIVISOR) || 0; |
||||
|
} |
||||
|
} |
||||
|
} |
||||
|
|
||||
|
const tokenizer = new TikTokenTokenizer(); |
||||
|
module.exports = { |
||||
|
/** |
||||
|
* Encode a string into tokens for rough token count estimation. |
||||
|
* @param {string} input |
||||
|
* @returns {number} |
||||
|
*/ |
||||
|
tokenizeString: (input) => tokenizer.tokenizeString(input), |
||||
|
}; |
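Rough arithmetic behind the guard, plus a usage sketch: with MAX_KB_ESTIMATE = 10 and the 2-bytes-per-character assumption, any string of roughly 5,120 characters or more skips real encoding and is estimated as length / 8. The require path is a placeholder.

// Illustrative sketch only - not part of the committed file.
const { tokenizeString } = require("./utils/tokenizer"); // path from the collector root

const short = "The quick brown fox jumps over the lazy dog.";
console.log(tokenizeString(short)); // exact cl100k_base token count for small inputs

const huge = "a".repeat(200_000); // ~390 KB estimate, well over the 10 KB threshold
console.log(tokenizeString(huge)); // falls back to Math.ceil(200000 / 8) = 25000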
||||
@ -0,0 +1,55 @@ |
|||||
|
/** ATTN: SECURITY RESEARCHERS |
||||
|
* To Security researchers about to submit an SSRF report CVE - please don't. |
||||
|
 * We are aware that the code below does not defend against any of the thousands of ways |
||||
|
 * you can map a hostname to another IP via tunneling, hosts editing, etc. The code below has no intention of blocking this |
||||
|
 * and is simply there to prevent the user from accidentally putting in invalid websites, which is all this protects against, |
||||
|
 * since _all urls must be submitted by the user anyway_ and that cannot be done without authentication and manager or admin roles. |
||||
|
* If an attacker has those roles then the system is already vulnerable and this is not a primary concern. |
||||
|
* |
||||
|
 * We have gotten this report many times, marked them as duplicate or informational, and continue to get them. We communicate |
||||
|
* already that deployment (and security) of an instance is on the deployer and system admin deploying it. This would include |
||||
|
* isolation, firewalls, and the general security of the instance. |
||||
|
*/ |
||||
|
|
||||
|
const VALID_PROTOCOLS = ["https:", "http:"]; |
||||
|
const INVALID_OCTETS = [192, 172, 10, 127]; |
||||
|
|
||||
|
/** |
||||
|
 * If an IP address is passed in, the user is attempting to collect some internal service running on an internal/private IP. |
||||
|
 * This is not a security feature; it simply prevents the user from accidentally entering invalid IP addresses. |
||||
|
* @param {URL} param0 |
||||
|
* @param {URL['hostname']} param0.hostname |
||||
|
* @returns {boolean} |
||||
|
*/ |
||||
|
function isInvalidIp({ hostname }) { |
||||
|
const IPRegex = new RegExp( |
||||
|
/^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$/gi |
||||
|
); |
||||
|
|
||||
|
// Not an IP address at all - passthrough
|
||||
|
if (!IPRegex.test(hostname)) return false; |
||||
|
const [octetOne, ..._rest] = hostname.split("."); |
||||
|
|
||||
|
// If fails to validate to number - abort and return as invalid.
|
||||
|
if (isNaN(Number(octetOne))) return true; |
||||
|
|
||||
|
// Allow localhost loopback and 0.0.0.0 for scraping convenience
|
||||
|
// for locally hosted services or websites
|
||||
|
if (["127.0.0.1", "0.0.0.0"].includes(hostname)) return false; |
||||
|
|
||||
|
return INVALID_OCTETS.includes(Number(octetOne)); |
||||
|
} |
||||
|
|
||||
|
function validURL(url) { |
||||
|
try { |
||||
|
const destination = new URL(url); |
||||
|
if (!VALID_PROTOCOLS.includes(destination.protocol)) return false; |
||||
|
if (isInvalidIp(destination)) return false; |
||||
|
return true; |
||||
|
} catch {} |
||||
|
return false; |
||||
|
} |
||||
|
|
||||
|
module.exports = { |
||||
|
validURL, |
||||
|
}; |
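A few illustrative calls (per the note above, this is a convenience filter, not an SSRF defense). The require path and URLs are placeholders.

// Illustrative sketch only - not part of the committed file.
const { validURL } = require("./utils/url"); // path from the collector root

console.log(validURL("https://example.com/docs"));   // true  - http/https with a normal hostname
console.log(validURL("ftp://example.com/file.txt")); // false - protocol not in VALID_PROTOCOLS
console.log(validURL("http://10.0.0.5/admin"));      // false - first octet 10 is in INVALID_OCTETS
console.log(validURL("http://127.0.0.1:3000"));      // true  - loopback is explicitly allowed
console.log(validURL("not a url"));                  // false - the URL constructor throws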
||||
3832 collector/yarn.lock (file diff suppressed because it is too large)
@ -0,0 +1,318 @@ |
|||||
|
SERVER_PORT=3001 |
||||
|
STORAGE_DIR="/app/server/storage" |
||||
|
UID='1000' |
||||
|
GID='1000' |
||||
|
# SIG_KEY='passphrase' # Please generate random string at least 32 chars long. |
||||
|
# SIG_SALT='salt' # Please generate random string at least 32 chars long. |
||||
|
# JWT_SECRET="my-random-string-for-seeding" # Only needed if AUTH_TOKEN is set. Please generate random string at least 12 chars long. |
||||
|
|
||||
|
########################################### |
||||
|
######## LLM API SELECTION ################ |
||||
|
########################################### |
||||
|
# LLM_PROVIDER='openai' |
||||
|
# OPEN_AI_KEY= |
||||
|
# OPEN_MODEL_PREF='gpt-4o' |
||||
|
|
||||
|
# LLM_PROVIDER='gemini' |
||||
|
# GEMINI_API_KEY= |
||||
|
# GEMINI_LLM_MODEL_PREF='gemini-pro' |
||||
|
|
||||
|
# LLM_PROVIDER='azure' |
||||
|
# AZURE_OPENAI_ENDPOINT= |
||||
|
# AZURE_OPENAI_KEY= |
||||
|
# OPEN_MODEL_PREF='my-gpt35-deployment' # This is the "deployment" on Azure you want to use. Not the base model. |
||||
|
# EMBEDDING_MODEL_PREF='embedder-model' # This is the "deployment" on Azure you want to use for embeddings. Not the base model. Valid base model is text-embedding-ada-002 |
||||
|
|
||||
|
# LLM_PROVIDER='anthropic' |
||||
|
# ANTHROPIC_API_KEY=sk-ant-xxxx |
||||
|
# ANTHROPIC_MODEL_PREF='claude-2' |
||||
|
|
||||
|
# LLM_PROVIDER='lmstudio' |
||||
|
# LMSTUDIO_BASE_PATH='http://your-server:1234/v1' |
||||
|
# LMSTUDIO_MODEL_PREF='Loaded from Chat UI' # this is a bug in LMStudio 0.2.17 |
||||
|
# LMSTUDIO_MODEL_TOKEN_LIMIT=4096 |
||||
|
|
||||
|
# LLM_PROVIDER='localai' |
||||
|
# LOCAL_AI_BASE_PATH='http://host.docker.internal:8080/v1' |
||||
|
# LOCAL_AI_MODEL_PREF='luna-ai-llama2' |
||||
|
# LOCAL_AI_MODEL_TOKEN_LIMIT=4096 |
||||
|
# LOCAL_AI_API_KEY="sk-123abc" |
||||
|
|
||||
|
# LLM_PROVIDER='ollama' |
||||
|
# OLLAMA_BASE_PATH='http://host.docker.internal:11434' |
||||
|
# OLLAMA_MODEL_PREF='llama2' |
||||
|
# OLLAMA_MODEL_TOKEN_LIMIT=4096 |
||||
|
|
||||
|
# LLM_PROVIDER='togetherai' |
||||
|
# TOGETHER_AI_API_KEY='my-together-ai-key' |
||||
|
# TOGETHER_AI_MODEL_PREF='mistralai/Mixtral-8x7B-Instruct-v0.1' |
||||
|
|
||||
|
# LLM_PROVIDER='mistral' |
||||
|
# MISTRAL_API_KEY='example-mistral-ai-api-key' |
||||
|
# MISTRAL_MODEL_PREF='mistral-tiny' |
||||
|
|
||||
|
# LLM_PROVIDER='perplexity' |
||||
|
# PERPLEXITY_API_KEY='my-perplexity-key' |
||||
|
# PERPLEXITY_MODEL_PREF='codellama-34b-instruct' |
||||
|
|
||||
|
# LLM_PROVIDER='openrouter' |
||||
|
# OPENROUTER_API_KEY='my-openrouter-key' |
||||
|
# OPENROUTER_MODEL_PREF='openrouter/auto' |
||||
|
|
||||
|
# LLM_PROVIDER='huggingface' |
||||
|
# HUGGING_FACE_LLM_ENDPOINT=https://uuid-here.us-east-1.aws.endpoints.huggingface.cloud |
||||
|
# HUGGING_FACE_LLM_API_KEY=hf_xxxxxx |
||||
|
# HUGGING_FACE_LLM_TOKEN_LIMIT=8000 |
||||
|
|
||||
|
# LLM_PROVIDER='groq' |
||||
|
# GROQ_API_KEY=gsk_abcxyz |
||||
|
# GROQ_MODEL_PREF=llama3-8b-8192 |
||||
|
|
||||
|
# LLM_PROVIDER='koboldcpp' |
||||
|
# KOBOLD_CPP_BASE_PATH='http://127.0.0.1:5000/v1' |
||||
|
# KOBOLD_CPP_MODEL_PREF='koboldcpp/codellama-7b-instruct.Q4_K_S' |
||||
|
# KOBOLD_CPP_MODEL_TOKEN_LIMIT=4096 |
||||
|
|
||||
|
# LLM_PROVIDER='textgenwebui' |
||||
|
# TEXT_GEN_WEB_UI_BASE_PATH='http://127.0.0.1:5000/v1' |
||||
|
# TEXT_GEN_WEB_UI_TOKEN_LIMIT=4096 |
||||
|
# TEXT_GEN_WEB_UI_API_KEY='sk-123abc' |
||||
|
|
||||
|
# LLM_PROVIDER='generic-openai' |
||||
|
# GENERIC_OPEN_AI_BASE_PATH='http://proxy.url.openai.com/v1' |
||||
|
# GENERIC_OPEN_AI_MODEL_PREF='gpt-3.5-turbo' |
||||
|
# GENERIC_OPEN_AI_MODEL_TOKEN_LIMIT=4096 |
||||
|
# GENERIC_OPEN_AI_API_KEY=sk-123abc |
||||
|
|
||||
|
# LLM_PROVIDER='litellm' |
||||
|
# LITE_LLM_MODEL_PREF='gpt-3.5-turbo' |
||||
|
# LITE_LLM_MODEL_TOKEN_LIMIT=4096 |
||||
|
# LITE_LLM_BASE_PATH='http://127.0.0.1:4000' |
||||
|
# LITE_LLM_API_KEY='sk-123abc' |
||||
|
|
||||
|
# LLM_PROVIDER='novita' |
||||
|
# NOVITA_LLM_API_KEY='your-novita-api-key-here' check on https://novita.ai/settings/key-management |
||||
|
# NOVITA_LLM_MODEL_PREF='deepseek/deepseek-r1' |
||||
|
|
||||
|
# LLM_PROVIDER='cohere' |
||||
|
# COHERE_API_KEY= |
||||
|
# COHERE_MODEL_PREF='command-r' |
||||
|
|
||||
|
# LLM_PROVIDER='bedrock' |
||||
|
# AWS_BEDROCK_LLM_ACCESS_KEY_ID= |
||||
|
# AWS_BEDROCK_LLM_ACCESS_KEY= |
||||
|
# AWS_BEDROCK_LLM_REGION=us-west-2 |
||||
|
# AWS_BEDROCK_LLM_MODEL_PREFERENCE=meta.llama3-1-8b-instruct-v1:0 |
||||
|
# AWS_BEDROCK_LLM_MODEL_TOKEN_LIMIT=8191 |
||||
|
|
||||
|
# LLM_PROVIDER='fireworksai' |
||||
|
# FIREWORKS_AI_LLM_API_KEY='my-fireworks-ai-key' |
||||
|
# FIREWORKS_AI_LLM_MODEL_PREF='accounts/fireworks/models/llama-v3p1-8b-instruct' |
||||
|
|
||||
|
# LLM_PROVIDER='apipie' |
||||
|
# APIPIE_LLM_API_KEY='sk-123abc' |
||||
|
# APIPIE_LLM_MODEL_PREF='openrouter/llama-3.1-8b-instruct' |
||||
|
|
||||
|
# LLM_PROVIDER='xai' |
||||
|
# XAI_LLM_API_KEY='xai-your-api-key-here' |
||||
|
# XAI_LLM_MODEL_PREF='grok-beta' |
||||
|
|
||||
|
# LLM_PROVIDER='nvidia-nim' |
||||
|
# NVIDIA_NIM_LLM_BASE_PATH='http://127.0.0.1:8000' |
||||
|
# NVIDIA_NIM_LLM_MODEL_PREF='meta/llama-3.2-3b-instruct' |
||||
|
|
||||
|
# LLM_PROVIDER='deepseek' |
||||
|
# DEEPSEEK_API_KEY='your-deepseek-api-key-here' |
||||
|
# DEEPSEEK_MODEL_PREF='deepseek-chat' |
||||
|
|
||||
|
########################################### |
||||
|
######## Embedding API SELECTION ########## |
||||
|
########################################### |
||||
|
# Only used if you are using an LLM that does not natively support embedding (openai or Azure) |
||||
|
# EMBEDDING_ENGINE='openai' |
||||
|
# OPEN_AI_KEY=sk-xxxx |
||||
|
# EMBEDDING_MODEL_PREF='text-embedding-ada-002' |
||||
|
|
||||
|
# EMBEDDING_ENGINE='azure' |
||||
|
# AZURE_OPENAI_ENDPOINT= |
||||
|
# AZURE_OPENAI_KEY= |
||||
|
# EMBEDDING_MODEL_PREF='my-embedder-model' # This is the "deployment" on Azure you want to use for embeddings. Not the base model. Valid base model is text-embedding-ada-002 |
||||
|
|
||||
|
# EMBEDDING_ENGINE='localai' |
||||
|
# EMBEDDING_BASE_PATH='http://localhost:8080/v1' |
||||
|
# EMBEDDING_MODEL_PREF='text-embedding-ada-002' |
||||
|
# EMBEDDING_MODEL_MAX_CHUNK_LENGTH=1000 # The max chunk size in chars a string to embed can be |
||||
|
|
||||
|
# EMBEDDING_ENGINE='ollama' |
||||
|
# EMBEDDING_BASE_PATH='http://host.docker.internal:11434' |
||||
|
# EMBEDDING_MODEL_PREF='nomic-embed-text:latest' |
||||
|
# EMBEDDING_MODEL_MAX_CHUNK_LENGTH=8192 |
||||
|
|
||||
|
# EMBEDDING_ENGINE='lmstudio' |
||||
|
# EMBEDDING_BASE_PATH='https://host.docker.internal:1234/v1' |
||||
|
# EMBEDDING_MODEL_PREF='nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_0.gguf' |
||||
|
# EMBEDDING_MODEL_MAX_CHUNK_LENGTH=8192 |
||||
|
|
||||
|
# EMBEDDING_ENGINE='cohere' |
||||
|
# COHERE_API_KEY= |
||||
|
# EMBEDDING_MODEL_PREF='embed-english-v3.0' |
||||
|
|
||||
|
# EMBEDDING_ENGINE='voyageai' |
||||
|
# VOYAGEAI_API_KEY= |
||||
|
# EMBEDDING_MODEL_PREF='voyage-large-2-instruct' |
||||
|
|
||||
|
# EMBEDDING_ENGINE='litellm' |
||||
|
# EMBEDDING_MODEL_PREF='text-embedding-ada-002' |
||||
|
# EMBEDDING_MODEL_MAX_CHUNK_LENGTH=8192 |
||||
|
# LITE_LLM_BASE_PATH='http://127.0.0.1:4000' |
||||
|
# LITE_LLM_API_KEY='sk-123abc' |
||||
|
|
||||
|
# EMBEDDING_ENGINE='generic-openai' |
||||
|
# EMBEDDING_MODEL_PREF='text-embedding-ada-002' |
||||
|
# EMBEDDING_MODEL_MAX_CHUNK_LENGTH=8192 |
||||
|
# EMBEDDING_BASE_PATH='http://127.0.0.1:4000' |
||||
|
# GENERIC_OPEN_AI_EMBEDDING_API_KEY='sk-123abc' |
||||
|
# GENERIC_OPEN_AI_EMBEDDING_MAX_CONCURRENT_CHUNKS=500 |
||||
|
|
||||
|
# EMBEDDING_ENGINE='gemini' |
||||
|
# GEMINI_EMBEDDING_API_KEY= |
||||
|
# EMBEDDING_MODEL_PREF='text-embedding-004' |
||||
|
|
||||
|
########################################### |
||||
|
######## Vector Database Selection ######## |
||||
|
########################################### |
||||
|
# Enable all below if you are using vector database: Chroma. |
||||
|
# VECTOR_DB="chroma" |
||||
|
# CHROMA_ENDPOINT='http://host.docker.internal:8000' |
||||
|
# CHROMA_API_HEADER="X-Api-Key" |
||||
|
# CHROMA_API_KEY="sk-123abc" |
||||
|
|
||||
|
# Enable all below if you are using vector database: Pinecone. |
||||
|
# VECTOR_DB="pinecone" |
||||
|
# PINECONE_API_KEY= |
||||
|
# PINECONE_INDEX= |
||||
|
|
||||
|
# Enable all below if you are using vector database: LanceDB. |
||||
|
# VECTOR_DB="lancedb" |
||||
|
|
||||
|
# Enable all below if you are using vector database: Weaviate. |
||||
|
# VECTOR_DB="weaviate" |
||||
|
# WEAVIATE_ENDPOINT="http://localhost:8080" |
||||
|
# WEAVIATE_API_KEY= |
||||
|
|
||||
|
# Enable all below if you are using vector database: Qdrant. |
||||
|
# VECTOR_DB="qdrant" |
||||
|
# QDRANT_ENDPOINT="http://localhost:6333" |
||||
|
# QDRANT_API_KEY= |
||||
|
|
||||
|
# Enable all below if you are using vector database: Milvus. |
||||
|
# VECTOR_DB="milvus" |
||||
|
# MILVUS_ADDRESS="http://localhost:19530" |
||||
|
# MILVUS_USERNAME= |
||||
|
# MILVUS_PASSWORD= |
||||
|
|
||||
|
# Enable all below if you are using vector database: Zilliz Cloud. |
||||
|
# VECTOR_DB="zilliz" |
||||
|
# ZILLIZ_ENDPOINT="https://sample.api.gcp-us-west1.zillizcloud.com" |
||||
|
# ZILLIZ_API_TOKEN=api-token-here |
||||
|
|
||||
|
# Enable all below if you are using vector database: Astra DB. |
||||
|
# VECTOR_DB="astra" |
||||
|
# ASTRA_DB_APPLICATION_TOKEN= |
||||
|
# ASTRA_DB_ENDPOINT= |
||||
|
|
||||
|
########################################### |
||||
|
######## Audio Model Selection ############ |
||||
|
########################################### |
||||
|
# (default) use built-in whisper-small model. |
||||
|
# WHISPER_PROVIDER="local" |
||||
|
|
||||
|
# use openai hosted whisper model. |
||||
|
# WHISPER_PROVIDER="openai" |
||||
|
# OPEN_AI_KEY=sk-xxxxxxxx |
||||
|
|
||||
|
########################################### |
||||
|
######## TTS/STT Model Selection ########## |
||||
|
########################################### |
||||
|
# TTS_PROVIDER="native" |
||||
|
|
||||
|
# TTS_PROVIDER="openai" |
||||
|
# TTS_OPEN_AI_KEY=sk-example |
||||
|
# TTS_OPEN_AI_VOICE_MODEL=nova |
||||
|
|
||||
|
# TTS_PROVIDER="generic-openai" |
||||
|
# TTS_OPEN_AI_COMPATIBLE_KEY=sk-example |
||||
|
# TTS_OPEN_AI_COMPATIBLE_VOICE_MODEL=nova |
||||
|
# TTS_OPEN_AI_COMPATIBLE_ENDPOINT="https://api.openai.com/v1" |
||||
|
|
||||
|
# TTS_PROVIDER="elevenlabs" |
||||
|
# TTS_ELEVEN_LABS_KEY= |
||||
|
# TTS_ELEVEN_LABS_VOICE_MODEL=21m00Tcm4TlvDq8ikWAM # Rachel |
||||
|
|
||||
|
# CLOUD DEPLOYMENT VARIRABLES ONLY |
||||
|
# AUTH_TOKEN="hunter2" # This is the password to your application if remote hosting. |
||||
|
# DISABLE_TELEMETRY="false" |
||||
|
|
||||
|
########################################### |
||||
|
######## PASSWORD COMPLEXITY ############## |
||||
|
########################################### |
||||
|
# Enforce a password schema for your organization users. |
||||
|
# Documentation on how to use https://github.com/kamronbatman/joi-password-complexity |
||||
|
# Default is only 8 char minimum |
||||
|
# PASSWORDMINCHAR=8 |
||||
|
# PASSWORDMAXCHAR=250 |
||||
|
# PASSWORDLOWERCASE=1 |
||||
|
# PASSWORDUPPERCASE=1 |
||||
|
# PASSWORDNUMERIC=1 |
||||
|
# PASSWORDSYMBOL=1 |
||||
|
# PASSWORDREQUIREMENTS=4 |
||||
|
|
||||
|
########################################### |
||||
|
######## ENABLE HTTPS SERVER ############## |
||||
|
########################################### |
||||
|
# By enabling this and providing the path/filename for the key and cert, |
||||
|
# the server will use HTTPS instead of HTTP. |
||||
|
#ENABLE_HTTPS="true" |
||||
|
#HTTPS_CERT_PATH="sslcert/cert.pem" |
||||
|
#HTTPS_KEY_PATH="sslcert/key.pem" |
||||
|
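# Example (an assumption for illustration, not an upstream default): a self-signed key/cert pair
# for local testing could be generated with:
#   openssl req -x509 -newkey rsa:4096 -nodes -days 365 -subj "/CN=localhost" \
#     -keyout sslcert/key.pem -out sslcert/cert.pem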

###########################################
######## AGENT SERVICE KEYS ###############
###########################################

#------ SEARCH ENGINES -------
#=============================
#------ Google Search -------- https://programmablesearchengine.google.com/controlpanel/create
# AGENT_GSE_KEY=
# AGENT_GSE_CTX=

#------ SearchApi.io ----------- https://www.searchapi.io/
# AGENT_SEARCHAPI_API_KEY=
# AGENT_SEARCHAPI_ENGINE=google

#------ Serper.dev ----------- https://serper.dev/
# AGENT_SERPER_DEV_KEY=

#------ Bing Search ----------- https://portal.azure.com/
# AGENT_BING_SEARCH_API_KEY=

#------ Serply.io ----------- https://serply.io/
# AGENT_SERPLY_API_KEY=

#------ SearXNG ----------- https://github.com/searxng/searxng
# AGENT_SEARXNG_API_URL=

#------ Tavily ----------- https://www.tavily.com/
# AGENT_TAVILY_API_KEY=

###########################################
######## Other Configurations ############
###########################################

# Disable viewing chat history from the UI and frontend APIs.
# See https://docs.anythingllm.com/configuration#disable-view-chat-history for more information.
# DISABLE_VIEW_CHAT_HISTORY=1

# Enable simple SSO passthrough to pre-authenticate users from a third party service.
# See https://docs.anythingllm.com/configuration#simple-sso-passthrough for more information.
# SIMPLE_SSO_ENABLED=1
@@ -0,0 +1,173 @@
# Setup base image
FROM ubuntu:jammy-20240627.1 AS base

# Build arguments
ARG ARG_UID=1000
ARG ARG_GID=1000

FROM base AS build-arm64
RUN echo "Preparing build of AnythingLLM image for arm64 architecture"

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Install system dependencies
# hadolint ignore=DL3008,DL3013
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends \
    unzip curl gnupg libgfortran5 libgbm1 tzdata netcat \
    libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 \
    libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libx11-6 libx11-xcb1 libxcb1 \
    libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 \
    libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release \
    xdg-utils git build-essential ffmpeg && \
    mkdir -p /etc/apt/keyrings && \
    curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
    echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_18.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list && \
    apt-get update && \
    apt-get install -yq --no-install-recommends nodejs && \
    curl -LO https://github.com/yarnpkg/yarn/releases/download/v1.22.19/yarn_1.22.19_all.deb \
    && dpkg -i yarn_1.22.19_all.deb \
    && rm yarn_1.22.19_all.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Create a group and user with specific UID and GID
RUN groupadd -g "$ARG_GID" anythingllm && \
    useradd -l -u "$ARG_UID" -m -d /app -s /bin/bash -g anythingllm anythingllm && \
    mkdir -p /app/frontend/ /app/server/ /app/collector/ && chown -R anythingllm:anythingllm /app

# Copy docker helper scripts
COPY ./docker/docker-entrypoint.sh /usr/local/bin/
COPY ./docker/docker-healthcheck.sh /usr/local/bin/
COPY --chown=anythingllm:anythingllm ./docker/.env.example /app/server/.env

# Ensure the scripts are executable
RUN chmod +x /usr/local/bin/docker-entrypoint.sh && \
    chmod +x /usr/local/bin/docker-healthcheck.sh

USER anythingllm
WORKDIR /app

# Puppeteer does not ship with an arm64-compatible build of Chromium,
# so web-scraping would be broken in arm docker containers unless we patch it
# by manually installing a compatible Chromium build.
RUN echo "Need to patch Puppeteer x Chromium support for arm64 - installing dep!" && \
    curl https://playwright.azureedge.net/builds/chromium/1088/chromium-linux-arm64.zip -o chrome-linux.zip && \
    unzip chrome-linux.zip && \
    rm -rf chrome-linux.zip

ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
ENV CHROME_PATH=/app/chrome-linux/chrome
ENV PUPPETEER_EXECUTABLE_PATH=/app/chrome-linux/chrome

RUN echo "Done running arm64 specific installation steps"

#############################################

# amd64-specific stage
FROM base AS build-amd64
RUN echo "Preparing build of AnythingLLM image for non-ARM architecture"

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# Install system dependencies
# hadolint ignore=DL3008,DL3013
RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends \
    curl gnupg libgfortran5 libgbm1 tzdata netcat \
    libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 \
    libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libx11-6 libx11-xcb1 libxcb1 \
    libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 \
    libxss1 libxtst6 ca-certificates fonts-liberation libappindicator1 libnss3 lsb-release \
    xdg-utils git build-essential ffmpeg && \
    mkdir -p /etc/apt/keyrings && \
    curl -fsSL https://deb.nodesource.com/gpgkey/nodesource-repo.gpg.key | gpg --dearmor -o /etc/apt/keyrings/nodesource.gpg && \
    echo "deb [signed-by=/etc/apt/keyrings/nodesource.gpg] https://deb.nodesource.com/node_18.x nodistro main" | tee /etc/apt/sources.list.d/nodesource.list && \
    apt-get update && \
    apt-get install -yq --no-install-recommends nodejs && \
    curl -LO https://github.com/yarnpkg/yarn/releases/download/v1.22.19/yarn_1.22.19_all.deb \
    && dpkg -i yarn_1.22.19_all.deb \
    && rm yarn_1.22.19_all.deb && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Create a group and user with specific UID and GID
RUN groupadd -g "$ARG_GID" anythingllm && \
    useradd -l -u "$ARG_UID" -m -d /app -s /bin/bash -g anythingllm anythingllm && \
    mkdir -p /app/frontend/ /app/server/ /app/collector/ && chown -R anythingllm:anythingllm /app

# Copy docker helper scripts
COPY ./docker/docker-entrypoint.sh /usr/local/bin/
COPY ./docker/docker-healthcheck.sh /usr/local/bin/
COPY --chown=anythingllm:anythingllm ./docker/.env.example /app/server/.env

# Ensure the scripts are executable
RUN chmod +x /usr/local/bin/docker-entrypoint.sh && \
    chmod +x /usr/local/bin/docker-healthcheck.sh

#############################################
# COMMON BUILD FLOW FOR ALL ARCHS
#############################################

# hadolint ignore=DL3006
FROM build-${TARGETARCH} AS build
RUN echo "Running common build flow of AnythingLLM image for all architectures"

USER anythingllm
WORKDIR /app

# Install & Build frontend layer
FROM build AS frontend-build
COPY --chown=anythingllm:anythingllm ./frontend /app/frontend/
WORKDIR /app/frontend
RUN yarn install --network-timeout 100000 && yarn cache clean
RUN yarn build && \
    cp -r dist /tmp/frontend-build && \
    rm -rf * && \
    cp -r /tmp/frontend-build dist && \
    rm -rf /tmp/frontend-build
WORKDIR /app

# Install server layer
# Also pull and build collector deps (chromium issues prevent bad bindings)
FROM build AS backend-build
COPY ./server /app/server/
WORKDIR /app/server
RUN yarn install --production --network-timeout 100000 && yarn cache clean
WORKDIR /app

# Install collector dependencies
COPY ./collector/ ./collector/
WORKDIR /app/collector
ENV PUPPETEER_DOWNLOAD_BASE_URL=https://storage.googleapis.com/chrome-for-testing-public
RUN yarn install --production --network-timeout 100000 && yarn cache clean

WORKDIR /app
USER anythingllm

# Since we are building from backend-build we just need to move the built frontend into server/public
FROM backend-build AS production-build
WORKDIR /app
COPY --chown=anythingllm:anythingllm --from=frontend-build /app/frontend/dist /app/server/public
USER root
RUN chown -R anythingllm:anythingllm /app/server && \
    chown -R anythingllm:anythingllm /app/collector
USER anythingllm

# No longer needed? (deprecated)
# WORKDIR /app/server
# RUN npx prisma generate --schema=./prisma/schema.prisma && \
#     npx prisma migrate deploy --schema=./prisma/schema.prisma
# WORKDIR /app

# Setup the environment
ENV NODE_ENV=production
ENV ANYTHING_LLM_RUNTIME=docker

# Setup the healthcheck
HEALTHCHECK --interval=1m --timeout=10s --start-period=1m \
    CMD /bin/bash /usr/local/bin/docker-healthcheck.sh || exit 1

# Run the server
# CMD ["sh", "-c", "tail -f /dev/null"] # For development: keep container open
ENTRYPOINT ["/bin/bash", "/usr/local/bin/docker-entrypoint.sh"]
@@ -0,0 +1,209 @@
# How to use Dockerized Anything LLM

Use the Dockerized version of AnythingLLM for a much faster and more complete startup of AnythingLLM.

### Minimum Requirements

> [!TIP]
> Running AnythingLLM on AWS/GCP/Azure?
> You should aim for at least 2GB of RAM. Disk storage is proportional to however much data
> you will be storing (documents, vectors, models, etc). A minimum of 10GB is recommended.

- `docker` installed on your machine
- `yarn` and `node` on your machine
- access to an LLM running locally or remotely

\*AnythingLLM by default uses a built-in vector database powered by [LanceDB](https://github.com/lancedb/lancedb)

\*AnythingLLM by default embeds text on the instance privately [Learn More](../server/storage/models/README.md)

## Recommended way to run dockerized AnythingLLM!

> [!IMPORTANT]
> If you are running another service on localhost like Chroma, LocalAI, or LMStudio,
> you will need to use http://host.docker.internal:xxxx to access the service from within
> the docker container running AnythingLLM, as `localhost:xxxx` will not resolve to the host system.
>
> **Requires** Docker v18.03+ on Win/Mac and 20.10+ on Linux/Ubuntu for host.docker.internal to resolve!
>
> _Linux_: add `--add-host=host.docker.internal:host-gateway` to the docker run command for this to resolve — see the example directly below this note.
>
> e.g.: A Chroma host URL running on localhost:8000 on the host machine needs to be http://host.docker.internal:8000
> when used in AnythingLLM.

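For instance, on Linux a minimal sketch of the run command with that flag added looks like the following (the full storage-mounted commands are in the table further below):

```shell
# Linux only: map host.docker.internal to the host gateway so services bound to the host resolve
docker run -d -p 3001:3001 \
  --add-host=host.docker.internal:host-gateway \
  mintplexlabs/anythingllm
```
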
> [!TIP]
> It is best to mount the container's storage volume to a folder on your host machine
> so that you can pull in future updates without deleting your existing data!

Pull in the latest image from docker. Supports both `amd64` and `arm64` CPU architectures.

```shell
docker pull mintplexlabs/anythingllm
```

<table>
<tr>
<th colspan="2">Mount the storage locally and run AnythingLLM in Docker</th>
</tr>
<tr>
<td>
Linux/MacOS
</td>
<td>

```shell
export STORAGE_LOCATION=$HOME/anythingllm && \
mkdir -p $STORAGE_LOCATION && \
touch "$STORAGE_LOCATION/.env" && \
docker run -d -p 3001:3001 \
--cap-add SYS_ADMIN \
-v ${STORAGE_LOCATION}:/app/server/storage \
-v ${STORAGE_LOCATION}/.env:/app/server/.env \
-e STORAGE_DIR="/app/server/storage" \
mintplexlabs/anythingllm
```

</td>
</tr>
<tr>
<td>
Windows
</td>
<td>

```powershell
# Run this in a PowerShell terminal
$env:STORAGE_LOCATION="$HOME\Documents\anythingllm"; `
If(!(Test-Path $env:STORAGE_LOCATION)) {New-Item $env:STORAGE_LOCATION -ItemType Directory}; `
If(!(Test-Path "$env:STORAGE_LOCATION\.env")) {New-Item "$env:STORAGE_LOCATION\.env" -ItemType File}; `
docker run -d -p 3001:3001 `
--cap-add SYS_ADMIN `
-v "$env:STORAGE_LOCATION`:/app/server/storage" `
-v "$env:STORAGE_LOCATION\.env:/app/server/.env" `
-e STORAGE_DIR="/app/server/storage" `
mintplexlabs/anythingllm;
```

</td>
</tr>
<tr>
<td>Docker Compose</td>
<td>

```yaml
version: '3.8'
services:
  anythingllm:
    image: mintplexlabs/anythingllm
    container_name: anythingllm
    ports:
      - "3001:3001"
    cap_add:
      - SYS_ADMIN
    environment:
      # Adjust for your environment
      - STORAGE_DIR=/app/server/storage
      - JWT_SECRET="make this a large list of random numbers and letters 20+"
      - LLM_PROVIDER=ollama
      - OLLAMA_BASE_PATH=http://127.0.0.1:11434
      - OLLAMA_MODEL_PREF=llama2
      - OLLAMA_MODEL_TOKEN_LIMIT=4096
      - EMBEDDING_ENGINE=ollama
      - EMBEDDING_BASE_PATH=http://127.0.0.1:11434
      - EMBEDDING_MODEL_PREF=nomic-embed-text:latest
      - EMBEDDING_MODEL_MAX_CHUNK_LENGTH=8192
      - VECTOR_DB=lancedb
      - WHISPER_PROVIDER=local
      - TTS_PROVIDER=native
      - PASSWORDMINCHAR=8
      # Add any other keys here for services or settings
      # you can find in the docker/.env.example file
    volumes:
      - anythingllm_storage:/app/server/storage
    restart: always

volumes:
  anythingllm_storage:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /path/on/local/disk
```

</td>
</tr>
</table>

Go to `http://localhost:3001` and you are now using AnythingLLM! All your data and progress will persist between
container rebuilds or pulls from Docker Hub.

## How to use the user interface

- To access the full application, visit `http://localhost:3001` in your browser.

## About UID and GID in the ENV

- The UID and GID are set to 1000 by default. This is the default user in the Docker container and on most host operating systems. If there is a mismatch between your host user's UID and GID and what is set in the `.env` file, you may experience permission issues. One way to match them when building from source is sketched below.

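A minimal sketch (an assumption, not the project's documented flow) for builds done with the compose file in `docker/`: record your host IDs in the same `.env` that compose reads, so they are substituted into the `ARG_UID`/`ARG_GID` build args and the runtime `user:` setting:

```shell
# Run from the docker/ directory; docker-compose substitutes ${UID}/${GID} from this .env
echo "UID=$(id -u)" >> .env
echo "GID=$(id -g)" >> .env
docker-compose up -d --build
```
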
## Build locally from source _not recommended for casual use_

- `git clone` this repo and `cd anything-llm` to get to the root directory.
- `touch server/storage/anythingllm.db` to create an empty SQLite DB file.
- `cd docker/`
- `cp .env.example .env` **you must do this before building**
- `docker-compose up -d --build` to build the image - this will take a few moments.

Your docker host will show the image as online once the build process is completed. This will serve the app at `http://localhost:3001`.

## Integrations and one-click setups

The integrations below are templates or tooling built by the community to make running AnythingLLM in Docker easier.

### Use the Midori AI Subsystem to Manage AnythingLLM

Follow the setup found on the [Midori AI Subsystem Site](https://io.midori-ai.xyz/subsystem/manager/) for your host OS.
After setting that up, install the AnythingLLM docker backend into the Midori AI Subsystem.

Once that is done, you are all set!

## Common questions and fixes

### Cannot connect to service running on localhost!

If you are in Docker and cannot connect to a service running on your host machine that is bound to a local interface or loopback address such as:

- `localhost`
- `127.0.0.1`
- `0.0.0.0`

> [!IMPORTANT]
> On Linux, `http://host.docker.internal:xxxx` does not work unless the container was started with `--add-host=host.docker.internal:host-gateway`.
> Use `http://172.17.0.1:xxxx` instead to emulate this functionality.

then you need to replace the localhost part of the URL with `host.docker.internal`. For example, if you are running Ollama on the host machine, bound to http://127.0.0.1:11434, you should put `http://host.docker.internal:11434` into the connection URL in AnythingLLM.

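For instance, a sketch of the corresponding Ollama settings in the mounted `.env` (the variable names mirror the Docker Compose example above; the model name and port are illustrative assumptions):

```shell
# In $STORAGE_LOCATION/.env (mounted into the container as /app/server/.env)
LLM_PROVIDER='ollama'
OLLAMA_BASE_PATH='http://host.docker.internal:11434'
OLLAMA_MODEL_PREF='llama2'
OLLAMA_MODEL_TOKEN_LIMIT=4096
```
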
### API is not working, cannot login, LLM is "offline"?

You are likely running the docker container on a remote machine like EC2 or some other instance where the reachable URL
is not `http://localhost:3001` and is instead something like `http://193.xx.xx.xx:3001`. In this case, all you need to do is add the following to your `frontend/.env.production` before running `docker-compose up -d --build`:

```
# frontend/.env.production
GENERATE_SOURCEMAP=false
VITE_API_BASE="http://<YOUR_REACHABLE_IP_ADDRESS>:3001/api"
```

For example, if the docker instance is available on `192.186.1.222`, your `VITE_API_BASE` would look like `VITE_API_BASE="http://192.186.1.222:3001/api"` in `frontend/.env.production`.

### Having issues with Ollama?

If you are getting errors like `llama:streaming - could not stream chat. Error: connect ECONNREFUSED 172.17.0.1:11434` then visit the README below.

[Fix common issues with Ollama](../server/utils/AiProviders/ollama/README.md)

### Still not working?

[Ask for help on Discord](https://discord.gg/6UyHPeGZAC)
@@ -0,0 +1,31 @@
name: anythingllm

networks:
  anything-llm:
    driver: bridge

services:
  anything-llm:
    container_name: anythingllm
    build:
      context: ../.
      dockerfile: ./docker/Dockerfile
      args:
        ARG_UID: ${UID:-1000}
        ARG_GID: ${GID:-1000}
    cap_add:
      - SYS_ADMIN
    volumes:
      - "./.env:/app/server/.env"
      - "../server/storage:/app/server/storage"
      - "../collector/hotdir/:/app/collector/hotdir"
      - "../collector/outputs/:/app/collector/outputs"
    user: "${UID:-1000}:${GID:-1000}"
    ports:
      - "3001:3001"
    env_file:
      - .env
    networks:
      - anything-llm
    extra_hosts:
      - "host.docker.internal:host-gateway"