Data Ingestion API

The Data Ingestion API enables bulk import of content into your knowledge bases. Use this API to programmatically ingest large volumes of data from external systems, migrate content, or integrate with content management platforms.

Overview

The Data Ingestion API provides asynchronous, job-based processing for importing content at scale. Key features include:

  • Bulk data import - Efficiently ingest hundreds or thousands of records in a single request
  • Asynchronous processing - Jobs are processed in the background with progress tracking
  • Multiple content formats - JSON payloads or multipart file uploads
  • Flexible adapters - Generic JSON adapter or specialized formats
  • Automatic deduplication - Content-hash based duplicate detection
  • Progress tracking - Monitor per-record progress by polling the job status endpoint
  • Workspace isolation - All jobs are scoped to your workspace and/or knowledge base

Authentication

All data-ingestion endpoints require bearer token authentication with specific scopes.

For complete authentication details including token refresh and security best practices, see the Authentication Guide.

Quick Start

  1. Create an access token client in your workspace with the required scopes
  2. Obtain an access token using the client credentials flow
  3. Include the token in your requests using the Authorization header

Required Scopes

Create an access token client with these scopes:

  • data-ingestion:write - Required for creating and cancelling jobs
  • data-ingestion:read - Required for listing and checking job status

Or use * (full access) for all operations.
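
As a rough illustration of how the scopes above map onto requests, the sketch below pairs each HTTP method used by this API with the scope it needs. The scope names and the * wildcard come from the list above; the helper names (requiredScope, hasScope) are hypothetical, and the server remains the authority on what a token may do.

```javascript
// Scope needed for each HTTP method used by the data-ingestion endpoints.
const SCOPE_BY_METHOD = {
  POST: "data-ingestion:write", // create a job
  DELETE: "data-ingestion:write", // cancel a job
  GET: "data-ingestion:read", // list jobs, check status, list adapters
};

// Hypothetical helper: look up the scope a request method requires.
function requiredScope(method) {
  const scope = SCOPE_BY_METHOD[method.toUpperCase()];
  if (!scope) {
    throw new Error(`No data-ingestion scope mapped for method ${method}`);
  }
  return scope;
}

// Hypothetical helper: true when the granted scopes cover the method.
// "*" is treated as full access, as described above.
function hasScope(grantedScopes, method) {
  return (
    grantedScopes.includes("*") ||
    grantedScopes.includes(requiredScope(method))
  );
}
```

A client could call hasScope(["data-ingestion:read"], "POST") before submitting a job to fail fast with a clearer message than a 403 response.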

Authentication Header

Include your access token in all requests:

{
  "Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."
}

Quick Start

1. Submit an Ingestion Job

Submit a bulk ingestion job with the generic JSON adapter:

// Submit a bulk data ingestion job with JSON payload
const response = await fetch("https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: "Bearer YOUR_ACCESS_TOKEN",
  },
  body: JSON.stringify({
    adapterType: "generic-json",
    knowledgeBaseId: "550e8400-e29b-41d4-a716-446655440000",
    data: {
      records: [
        {
          id: "doc-123",
          title: "Product Documentation",
          content: "This is the main content of the document...",
          contentType: "markdown",
          metadata: {
            author: "John Doe",
            category: "technical",
            version: "1.0",
          },
          url: "https://example.com/docs/product",
          tags: ["documentation", "product", "technical"],
        },
        {
          id: "doc-456",
          title: "API Reference",
          content: "Complete API reference documentation...",
          contentType: "markdown",
          metadata: {
            author: "Jane Smith",
            category: "api",
          },
          tags: ["api", "reference"],
        },
      ],
    },
  }),
});

const job = await response.json();
console.log(job);
// {
//   jobId: "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a",
//   status: "pending"
// }

2. Check Job Status

Monitor the job’s progress:

// Check the status of an ingestion job
const jobId = "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a";

const response = await fetch(
  `https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs/${jobId}?includePayload=false`,
  {
    method: "GET",
    headers: {
      Authorization: "Bearer YOUR_ACCESS_TOKEN",
    },
  },
);

const job = await response.json();
console.log(job);
// {
//   id: "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a",
//   status: "completed",
//   data: {
//     progress: {
//       processed: 50,
//       total: 50,
//       successful: 45,
//       failed: 2,
//       skipped: 3
//     },
//     // ... more details
//   }
// }
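
The progress object in the response above can be reduced to a few derived values for logging or dashboards. This is a minimal sketch; summarizeProgress is a hypothetical helper, and the check that successful + failed + skipped equals processed is an assumption drawn from the sample numbers (45 + 2 + 3 = 50), not a documented guarantee.

```javascript
// Hypothetical helper: derive display values from a job's progress object.
function summarizeProgress(progress) {
  const { processed, total, successful, failed, skipped } = progress;
  return {
    // Percentage of records handled so far (0 when the total is not yet known).
    percent: total > 0 ? Math.round((processed / total) * 100) : 0,
    // Assumption: every processed record is counted in exactly one bucket.
    accountedFor: successful + failed + skipped === processed,
    hasFailures: failed > 0,
  };
}
```

For the sample response, this yields percent 100 with hasFailures set to true, a reminder that a completed job can still contain failed records.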

3. Wait for Completion

Poll the job status until processing completes:

// Poll job status until completion
async function waitForJobCompletion(jobId, accessToken) {
  const maxAttempts = 60; // 5 minutes with 5-second intervals
  const pollInterval = 5000; // 5 seconds

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(
      `https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs/${jobId}`,
      {
        headers: { Authorization: `Bearer ${accessToken}` },
      },
    );

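    // Stop early on HTTP errors (for example 401 or 404) instead of polling until timeout
    if (!response.ok) {
      throw new Error(`Status check failed with HTTP ${response.status}`);
    }

    const job = await response.json();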

    if (job.status === "completed") {
      console.log("Job completed successfully!");
      console.log(`Created: ${job.result.entriesCreated} entries`);
      console.log(`Skipped: ${job.result.entriesSkipped} entries`);
      console.log(`Generated: ${job.result.embeddingsGenerated} embeddings`);
      return job;
    }

    if (job.status === "failed") {
      console.error("Job failed:", job.result.error);
      throw new Error(job.result.error);
    }

    // Still processing
    console.log(
      `Progress: ${job.data.progress.processed}/${job.data.progress.total}`,
    );
    await new Promise((resolve) => setTimeout(resolve, pollInterval));
  }

  throw new Error("Job polling timeout");
}

// Usage
const jobId = "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a";
const result = await waitForJobCompletion(jobId, "YOUR_ACCESS_TOKEN");

Endpoints

POST /v2/data-ingestion/jobs

Submit a new bulk data ingestion job.

URL: POST /api/v2/data-ingestion/jobs

Authentication: Bearer token with data-ingestion:write scope

Content-Type: application/json or multipart/form-data

Request Format (JSON)

Required Fields:

adapterType (string)

The adapter to use for processing your data. Use the /adapters endpoint to discover available adapters.

{
  "adapterType": "generic-json"
}

data (object)

Adapter-specific payload containing your records. Format depends on the adapter type.

For the generic-json adapter:

{
  "data": {
    "records": [
      {
        "id": "doc-123",
        "title": "Example Document",
        "content": "Content here...",
        "contentType": "markdown",
        "metadata": {
          "author": "John Doe"
        },
        "tags": [
          "example"
        ]
      }
    ]
  }
}

knowledgeBaseId (string, UUID)

UUID of the knowledge base to associate ingested entries with. All successfully created entries will be linked to this knowledge base.

{
  "knowledgeBaseId": "550e8400-e29b-41d4-a716-446655440000"
}

Request Format (Multipart)

For larger payloads, up to the 30MB upload limit, use multipart/form-data:

// Submit a large payload using multipart/form-data
const formData = new FormData();
formData.append("adapterType", "generic-json");
formData.append("knowledgeBaseId", "550e8400-e29b-41d4-a716-446655440000");

// Create a JSON file blob for large payloads
const payload = {
  records: [
    // ... many records
  ],
};
const jsonBlob = new Blob([JSON.stringify(payload)], {
  type: "application/json",
});
formData.append("data", jsonBlob, "data.json");

const response = await fetch("https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_ACCESS_TOKEN",
    // Content-Type is set automatically by FormData
  },
  body: formData,
});

const job = await response.json();
console.log(job.jobId);

Form Fields:

  • adapterType (text field, required)
  • knowledgeBaseId (text field, required, UUID)
  • data (file or text field, required) - Must be a JSON file or a JSON string

Response (202 Accepted)

{
  "jobId": "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a",
  "status": "pending"
}

The job is created immediately and processed asynchronously in the background.

GET /v2/data-ingestion/jobs

List ingestion jobs with optional filtering.

URL: GET /api/v2/data-ingestion/jobs

Authentication: Bearer token with data-ingestion:read scope

Query Parameters

status (optional): Filter by job status

  • Values: pending, processing, completed, failed

limit (optional): Results per page

  • Range: 1-100
  • Default: 20

offset (optional): Records to skip

  • Minimum: 0
  • Default: 0

includePayload (optional): Include raw input data

  • Default: false
  • Set to true to include the original submitted payload

startDate (optional): Filter jobs created after date (ISO 8601)

endDate (optional): Filter jobs created before date (ISO 8601)

Example

// List ingestion jobs with filtering and pagination
const params = new URLSearchParams({
  status: "completed",
  limit: "20",
  offset: "0",
  startDate: "2025-01-01",
  endDate: "2025-12-31",
});

const response = await fetch(`https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs?${params}`, {
  method: "GET",
  headers: {
    Authorization: "Bearer YOUR_ACCESS_TOKEN",
  },
});

const result = await response.json();
console.log(`Total jobs: ${result.total}`);
console.log(`Retrieved: ${result.jobs.length} jobs`);

result.jobs.forEach((job) => {
  console.log(
    `${job.id}: ${job.status} (${job.data.progress.processed}/${job.data.progress.total})`,
  );
});

Response (200 OK)

Note: The structure of ingestion results payloads is subject to change as we refine job reporting and progress tracking.

{
  "total": 150,
  "jobs": [
    {
      "id": "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a",
      "status": "completed",
      "data": {
        "adapterId": "generic-json",
        "adapterType": "generic-json",
        "recordCount": 50,
        "progress": {
          "processed": 50,
          "total": 50,
          "successful": 45,
          "failed": 2,
          "skipped": 3
        },
        "results": [
          {
            "sourceId": "doc-123",
            "status": "success",
            "entryId": "e8f5a3b1-2c4d-5e6f-7a8b-9c0d1e2f3a4b",
            "embeddingCount": 12
          }
        ]
      },
      "result": {
        "sourceId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "entriesCreated": 45,
        "entriesSkipped": 3,
        "entriesDeleted": 0,
        "embeddingsGenerated": 540
      },
      "started_at": "2025-10-30T18:05:00.000Z",
      "completed_at": "2025-10-30T18:08:30.000Z"
    }
  ]
}
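
To retrieve more jobs than one page holds, the limit and offset parameters can be combined with the total field. The sketch below assumes total counts every job matching the filters, as the sample response suggests; pageOffsets and listAllJobs are hypothetical helpers, and fetchPage stands in for whatever function you use to call the list endpoint and parse its body.

```javascript
// Compute the offsets needed to page through `total` results, `limit` at a time.
function pageOffsets(total, limit = 20) {
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new RangeError("limit must be an integer between 1 and 100");
  }
  const offsets = [];
  for (let offset = 0; offset < total; offset += limit) {
    offsets.push(offset);
  }
  return offsets;
}

// fetchPage is a caller-supplied (hypothetical) async function:
//   ({ limit, offset }) => parsed response body { total, jobs }
async function listAllJobs(fetchPage, limit = 100) {
  const first = await fetchPage({ limit, offset: 0 });
  const jobs = [...first.jobs];
  // The first page is already loaded, so skip offset 0.
  for (const offset of pageOffsets(first.total, limit).slice(1)) {
    const page = await fetchPage({ limit, offset });
    jobs.push(...page.jobs);
  }
  return jobs;
}
```

With the sample total of 150 and the default limit of 20, pageOffsets returns eight offsets (0 through 140).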

GET /v2/data-ingestion/jobs/:jobId

Get detailed status of a specific job.

URL: GET /api/v2/data-ingestion/jobs/:jobId

Authentication: Bearer token with data-ingestion:read scope

Query Parameters

includePayload (optional): Include raw input data

  • Default: false

Response (200 OK)

Returns the same structure as a single job from the list endpoint.

DELETE /v2/data-ingestion/jobs/:jobId

Cancel a pending or processing job.

URL: DELETE /api/v2/data-ingestion/jobs/:jobId

Authentication: Bearer token with data-ingestion:write scope

Example

// Cancel a pending or processing job
const jobId = "7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a";

const response = await fetch(`https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs/${jobId}`, {
  method: "DELETE",
  headers: {
    Authorization: "Bearer YOUR_ACCESS_TOKEN",
  },
});

if (response.status === 204) {
  console.log("Job cancelled successfully");
} else {
  const error = await response.json();
  console.error("Failed to cancel job:", error.message);
}

Response (204 No Content)

Empty response body on successful cancellation.

Note: Only jobs with status pending or processing can be cancelled. Completed or failed jobs cannot be cancelled.

GET /v2/data-ingestion/adapters

List available data ingestion adapters.

URL: GET /api/v2/data-ingestion/adapters

Authentication: Bearer token with data-ingestion:read scope

Example

// Get list of available data ingestion adapters
const response = await fetch("https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/adapters", {
  method: "GET",
  headers: {
    Authorization: "Bearer YOUR_ACCESS_TOKEN",
  },
});

const adapters = await response.json();
console.log(adapters);
// [
//   { name: "generic-json", format: "json", version: "1.0.0" },
//   { name: "content-hub", format: "content-hub-json", version: "1.0.0" }
// ]

Response (200 OK)

[
  {
    "name": "generic-json",
    "format": "json",
    "version": "1.0.0"
  },
  {
    "name": "content-hub",
    "format": "content-hub-json",
    "version": "1.0.0"
  }
]
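
Before submitting a job, a client might confirm that the adapter it plans to use appears in this list. The following is a small sketch over the response shape shown above; findAdapter and assertAdapterAvailable are hypothetical helper names.

```javascript
// Hypothetical helper: return the adapter entry with the given name, or null.
function findAdapter(adapters, name) {
  return adapters.find((adapter) => adapter.name === name) ?? null;
}

// Hypothetical helper: throw a descriptive error when the adapter is missing.
function assertAdapterAvailable(adapters, name) {
  const adapter = findAdapter(adapters, name);
  if (!adapter) {
    const available = adapters.map((a) => a.name).join(", ");
    throw new Error(`Adapter "${name}" not available (found: ${available})`);
  }
  return adapter;
}
```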

Adapters

Note: Ingestion adapters and their payload formats are subject to change as we iterate on integration efforts with various content sources.

Adapters transform your data into a standardized format for ingestion into knowledge bases.

Generic JSON Adapter

Adapter Type: generic-json

Use Case: Simple bulk imports where data is already in a flat, standardized JSON structure.

Input Format:

{
  "records": [
    {
      "id": "string",
      "title": "string",
      "content": "string",
      "contentType": "markdown",
      "metadata": {},
      "deleted": false,
      "url": "string",
      "tags": [
        "string"
      ]
    }
  ]
}

Field Details:

  • id (required): Unique identifier for the record
  • title (optional): Document title (auto-generated if omitted)
  • content (required): Main text content
  • contentType (optional): Content classification (text, markdown, code, or qa_pair)
  • metadata (optional): Flexible key-value pairs for custom data
  • deleted (optional): Set to true to delete the existing entry
  • url (optional): External reference URL
  • tags (optional): Array of tags for categorization
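
The field rules above can be checked on the client before a job is submitted, which avoids a round trip for obvious mistakes. This is only a sketch: validateRecord is a hypothetical helper, treating an empty string as missing is an assumption, and the server-side schema remains the authority.

```javascript
// Allowed contentType values, taken from the field list above.
const CONTENT_TYPES = ["text", "markdown", "code", "qa_pair"];

// Hypothetical helper: return a list of problems found in one record.
// An empty array means the record passed these client-side checks.
function validateRecord(record) {
  const errors = [];
  if (typeof record.id !== "string" || record.id === "") {
    errors.push("id is required");
  }
  if (typeof record.content !== "string" || record.content === "") {
    errors.push("content is required");
  }
  if (
    record.contentType !== undefined &&
    !CONTENT_TYPES.includes(record.contentType)
  ) {
    errors.push(`contentType must be one of: ${CONTENT_TYPES.join(", ")}`);
  }
  if (
    record.tags !== undefined &&
    !(Array.isArray(record.tags) && record.tags.every((t) => typeof t === "string"))
  ) {
    errors.push("tags must be an array of strings");
  }
  return errors;
}
```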

Content Hub Adapter

Adapter Type: content-hub

Use Case: Importing HTML-encoded content with inline attributes from content management systems.

Input Format:

[
  [
    {
      "content": {
        "documentIdentifier": "Page://content/acme/resource-center/article-123",
        "parseImage": false,
        "deleted": false,
        "inlineContent": {
          "textContent": {
            "data": "&lt;h1&gt;Title&lt;/h1&gt;&lt;p&gt;Content...&lt;/p&gt;"
          },
          "type": "TEXT"
        }
      },
      "metadata": {
        "inlineAttributes": [
          {
            "key": "Title",
            "value": {
              "stringValue": "Article Title",
              "type": "STRING"
            }
          },
          {
            "key": "keywords",
            "value": {
              "stringValue": "business,enterprise",
              "type": "STRING"
            }
          }
        ]
      }
    }
  ],
  {
    "timeTaken": "250ms"
  }
]

Processing:

  • HTML entities are decoded
  • HTML converted to Markdown
  • Inline attributes converted to metadata
  • Title extracted from Title attribute or first heading
  • Tags extracted from keywords and custom entity fields
  • Content type detected based on taxonomy
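
To make some of the processing steps above concrete, the sketch below shows one way they could look on the client side: decoding entities, flattening inline attributes, and deriving a title and tags. It is not the platform's implementation. All function names are hypothetical, only STRING attribute values and a handful of common entities are handled, and HTML-to-Markdown conversion and content type detection are left out.

```javascript
// A few common HTML entities; a real decoder would cover many more.
const ENTITIES = { "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#39;": "'", "&amp;": "&" };

// Decode entities in a single pass so "&amp;lt;" becomes "&lt;", not "<".
function decodeEntities(text) {
  return text.replace(/&(?:lt|gt|quot|#39|amp);/g, (entity) => ENTITIES[entity]);
}

// Flatten inlineAttributes into a plain key/value object (STRING values only).
function attributesToMetadata(inlineAttributes = []) {
  const metadata = {};
  for (const { key, value } of inlineAttributes) {
    metadata[key] = value.stringValue;
  }
  return metadata;
}

// Title comes from the Title attribute, falling back to the first heading.
function extractTitle(item) {
  const metadata = attributesToMetadata(item.metadata?.inlineAttributes);
  if (metadata.Title) return metadata.Title;
  const html = decodeEntities(item.content.inlineContent.textContent.data);
  const heading = html.match(/<h[1-6][^>]*>(.*?)<\/h[1-6]>/is);
  return heading ? heading[1].trim() : null;
}

// Tags come from the comma-separated keywords attribute.
function extractTags(item) {
  const { keywords = "" } = attributesToMetadata(item.metadata?.inlineAttributes);
  return keywords.split(",").map((tag) => tag.trim()).filter(Boolean);
}
```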

Job Processing

Lifecycle

Jobs follow this lifecycle:

  1. Pending - Job created, waiting to start
  2. Processing - Job is actively processing records
  3. Completed - All records processed successfully (or with tracked failures)
  4. Failed - Job failed due to unrecoverable error
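
The lifecycle above can be captured in two small predicates that polling and cancellation code can share. The status strings are the ones documented for the status filter; isTerminal and canCancel are hypothetical helper names.

```javascript
// Statuses after which a job will not change again.
const TERMINAL_STATUSES = new Set(["completed", "failed"]);
// Statuses in which a job is still waiting or running.
const ACTIVE_STATUSES = new Set(["pending", "processing"]);

// True once polling can stop.
function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

// True while a cancel (DELETE) request is expected to be accepted.
function canCancel(status) {
  return ACTIVE_STATUSES.has(status);
}
```

Checking canCancel(job.status) before sending a DELETE request avoids a predictable 409 Conflict, although the job may still finish between the check and the request.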

Background Processing

When you submit a job, it is processed asynchronously:

  1. Job Creation - Job is created and queued for processing
  2. Processing - Each record is validated, deduplicated, and transformed
  3. Completion - Job status is updated with results summary

You can poll the job status endpoint to track progress and retrieve results when complete.

Automatic Features

Deduplication: Content is automatically checked against existing entries to prevent duplicates.

Embedding Generation: Content is automatically chunked and embedded into vectors for semantic search, enabling AI-powered retrieval.
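
The idea behind content-hash deduplication can be illustrated with a client-side pre-filter that drops records whose content repeats within a batch. The platform's actual hash function and normalization are not documented here, so this sketch substitutes the simple 32-bit FNV-1a hash and trims whitespace as an assumption; a production system would more likely use a cryptographic hash such as SHA-256. The names fnv1a and dropDuplicateContent are hypothetical.

```javascript
// 32-bit FNV-1a hash of a string's UTF-8 bytes, returned as 8 hex digits.
function fnv1a(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Keep only the first record for each distinct (trimmed) content value.
// Note: with a 32-bit hash, rare collisions could drop a distinct record.
function dropDuplicateContent(records) {
  const seen = new Set();
  return records.filter((record) => {
    const key = fnv1a(record.content.trim());
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```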

Error Handling

400 Bad Request

Request validation failed. Common causes:

  • Invalid JSON format - Malformed request body
  • Missing required fields - adapterType, data, or knowledgeBaseId not provided
  • Unknown adapter - Adapter type not recognized
  • Payload validation failed - Data doesn’t match adapter’s schema
  • Invalid file type - Multipart data field is not JSON

Example:

{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "Unknown adapter: invalid-adapter",
  "timestamp": "2025-10-30T18:10:00.000Z",
  "path": "/v2/data-ingestion/jobs"
}

401 Unauthorized

Authentication failed. Common causes:

  • No access token provided - Missing Authorization header
  • Invalid token format - Malformed bearer token
  • Expired access token - Token has passed expiration time
  • Invalid token - Token signature verification failed

Example:

{
  "statusCode": 401,
  "error": "Unauthorized",
  "message": "Access token is required",
  "timestamp": "2025-10-30T18:10:00.000Z",
  "path": "/v2/data-ingestion/jobs"
}

403 Forbidden

Insufficient permissions. Common causes:

  • Missing scope - Access token lacks data-ingestion:write or data-ingestion:read scope
  • Workspace access denied - Resource belongs to different workspace

Example:

{
  "statusCode": 403,
  "error": "Forbidden",
  "message": "Insufficient scope: data-ingestion:write required",
  "timestamp": "2025-10-30T18:10:00.000Z",
  "path": "/v2/data-ingestion/jobs"
}

404 Not Found

Resource not found. Common causes:

  • Job not found - Job ID doesn’t exist or belongs to different workspace
  • Knowledge base not found - Specified knowledgeBaseId doesn’t exist

Example:

{
  "statusCode": 404,
  "error": "Not Found",
  "message": "Job 7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a not found",
  "timestamp": "2025-10-30T18:10:00.000Z",
  "path": "/v2/data-ingestion/jobs/7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a"
}

409 Conflict

Request conflicts with current state. Common causes:

  • Job already terminal - Attempting to cancel a completed or failed job

Example:

{
  "statusCode": 409,
  "error": "Conflict",
  "message": "Job 7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a is already completed",
  "timestamp": "2025-10-30T18:10:00.000Z",
  "path": "/v2/data-ingestion/jobs/7c9e8e8a-5b5a-4f5d-8b5e-5f5d5b5a5b5a"
}

500 Internal Server Error

Unexpected server error. This indicates a problem with the platform.

Example:

{
  "statusCode": 500,
  "error": "Internal Server Error",
  "message": "An unexpected error occurred",
  "timestamp": "2025-10-30T18:10:00.000Z",
  "path": "/v2/data-ingestion/jobs"
}
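
All of the error examples above share the same envelope (statusCode, error, message, timestamp, path), so a single helper can turn any of them into a log line. In this sketch describeApiError is a hypothetical name, and treating only 5xx responses as retryable is an assumption rather than documented platform behavior.

```javascript
// Hypothetical helper: summarize an error body in the documented envelope.
function describeApiError(body) {
  const { statusCode, error, message, path } = body;
  return {
    summary: `${statusCode} ${error}: ${message} (${path})`,
    // Assumption: 4xx errors need a changed request, so only 5xx is retried.
    retryable: statusCode >= 500,
  };
}
```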

Error Handling Best Practices

Implement robust error handling in your integration:

// Robust error handling for data ingestion
async function submitIngestionJob(payload, accessToken) {
  try {
    const response = await fetch("https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${accessToken}`,
      },
      body: JSON.stringify(payload),
    });

    if (!response.ok) {
      const error = await response.json();

      switch (response.status) {
        case 400:
          console.error("Validation error:", error.message);
          if (error.errors) {
            error.errors.forEach((err) => {
              console.error(`- ${err.path}: ${err.message}`);
            });
          }
          break;

        case 401:
          console.error("Authentication failed:", error.message);
          console.error(
            "Check that your access token is valid and not expired",
          );
          break;

        case 403:
          console.error("Permission denied:", error.message);
          console.error(
            "Ensure your access token has data-ingestion:write scope",
          );
          break;

        case 404:
          console.error("Knowledge base not found:", error.message);
          break;

        default:
          console.error("Unexpected error:", error.message);
      }

      throw new Error(error.message);
    }

    return await response.json();
  } catch (error) {
    if (error.name === "TypeError") {
      console.error("Network error - check API URL and connectivity");
    }
    throw error;
  }
}

// Usage
try {
  const job = await submitIngestionJob(
    {
      adapterType: "generic-json",
      knowledgeBaseId: "550e8400-e29b-41d4-a716-446655440000",
      data: {
        records: [
          /* ... */
        ],
      },
    },
    "YOUR_ACCESS_TOKEN",
  );
  console.log("Job submitted:", job.jobId);
} catch (error) {
  console.error("Failed to submit job:", error.message);
}

Best Practices

Job Monitoring

Poll with exponential backoff:

Start with short intervals and increase progressively to reduce API calls:

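// checkJobStatus(jobId) is a placeholder for your own status request helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

let interval = 1000; // Start with 1 second
const maxInterval = 30000; // Max 30 seconds

let job = await checkJobStatus(jobId);
// Keep polling while the job is still pending or processing
while (job.status === "pending" || job.status === "processing") {
  await sleep(interval);
  job = await checkJobStatus(jobId);
  interval = Math.min(interval * 1.5, maxInterval);
}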
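
To see how the waits grow under this policy, the schedule can be computed up front. The sketch below uses the same starting interval, growth factor, and cap as the snippet above; backoffSchedule is a hypothetical helper.

```javascript
// Return the first `attempts` wait times (in ms) for a capped exponential backoff.
function backoffSchedule(attempts, start = 1000, factor = 1.5, max = 30000) {
  const delays = [];
  let delay = start;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, max));
    delay *= factor;
  }
  return delays;
}
```

With these defaults the first four waits are 1000, 1500, 2250, and 3375 ms, and the tenth wait is the first to hit the 30-second cap.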

Store job IDs immediately:

Save the returned jobId to your database immediately after job creation to enable recovery from crashes:

const { jobId } = await submitJob(payload);
await db.jobs.create({ id: jobId, status: "pending" });

File Size Limits

Default limit: 30MB for multipart uploads

Recommendations:

  • For payloads under 1MB: Use JSON format
  • For payloads 1MB-30MB: Use multipart format
  • For payloads over 30MB: Split into multiple jobs
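
These recommendations can be turned into a small decision helper that measures the serialized payload and picks a submission strategy. This is a sketch under stated assumptions: 1MB is taken to mean 1024 × 1024 bytes, the size is measured on the UTF-8 encoded JSON, and payloadBytes and transportForSize are hypothetical names.

```javascript
const MB = 1024 * 1024; // Assumption: binary megabytes

// Size in bytes of the payload once serialized as UTF-8 JSON.
function payloadBytes(payload) {
  return new TextEncoder().encode(JSON.stringify(payload)).length;
}

// Map a payload size to the recommended submission strategy.
function transportForSize(bytes) {
  if (bytes < 1 * MB) return "json"; // under 1MB: plain JSON body
  if (bytes <= 30 * MB) return "multipart"; // 1MB-30MB: multipart upload
  return "split"; // over 30MB: split into multiple jobs
}
```

A caller would typically run transportForSize(payloadBytes(data)) and branch on the result before building the request.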

Large Datasets

Batch your records:

Submit jobs in batches of 100-500 records for optimal processing:

const batchSize = 250;
for (let i = 0; i < allRecords.length; i += batchSize) {
  const batch = allRecords.slice(i, i + batchSize);
  await submitJob({
    adapterType: "generic-json",
    knowledgeBaseId: "550e8400-e29b-41d4-a716-446655440000",
    data: { records: batch },
  });
}

Rate limiting:

Respect platform rate limits by implementing delays between job submissions:

await submitJob(payload);
await sleep(500); // 500ms delay between submissions

cURL Examples

Complete Command-Line Reference

# Submit a job
curl -X POST "https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  -d '{
    "adapterType": "generic-json",
    "knowledgeBaseId": "550e8400-e29b-41d4-a716-446655440000",
    "data": {
      "records": [
        {
          "id": "doc-123",
          "title": "Example Document",
          "content": "Document content here..."
        }
      ]
    }
  }'

# Check job status
curl -X GET "https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs/JOB_ID" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

# List jobs
curl -X GET "https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs?status=completed&limit=10" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

# Cancel a job
curl -X DELETE "https://nexus-api.uat.knowbl.com/api/v2/data-ingestion/jobs/JOB_ID" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

Next Steps