.NET developers need to efficiently process, chunk, and retrieve information from diverse document formats while preserving semantic meaning and structural context. The Microsoft.Extensions.DataIngestion libraries provide a unified approach for representing document ingestion components.
The Microsoft.Extensions.DataIngestion.Abstractions package provides the core exchange types, including IngestionDocument, IngestionChunker<T>, IngestionChunkProcessor<T>, and IngestionChunkWriter<T>. Any .NET library that provides document processing capabilities can implement these abstractions to enable seamless integration with consuming code.
The Microsoft.Extensions.DataIngestion package has an implicit dependency on the Microsoft.Extensions.DataIngestion.Abstractions package. Microsoft.Extensions.DataIngestion lets you integrate components such as enrichment processors, vector storage writers, and telemetry into your applications using familiar dependency injection and pipeline patterns. For example, it provides the SentimentEnricher, KeywordEnricher, and SummaryEnricher processors, which can be chained together in ingestion pipelines.
Libraries that provide implementations of the abstractions typically reference only Microsoft.Extensions.DataIngestion.Abstractions.
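As a rough sketch of that guidance, a library that only implements the abstractions might carry a single package reference like the one below. The [CURRENTVERSION] placeholder is kept as-is; using the same placeholder for the Abstractions package is an assumption made for illustration.

```xml
<ItemGroup>
  <!-- Implementing library: reference only the abstractions package. -->
  <!-- Assumption: the version placeholder matches the one used for the main package. -->
  <PackageReference Include="Microsoft.Extensions.DataIngestion.Abstractions" Version="[CURRENTVERSION]" />
</ItemGroup>
```

Keeping the reference this narrow means the library does not pull the higher-level utilities into every consumer.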
To also have access to higher-level utilities for working with document ingestion components, reference the Microsoft.Extensions.DataIngestion package instead (which itself references Microsoft.Extensions.DataIngestion.Abstractions). Most consuming applications and services should reference the Microsoft.Extensions.DataIngestion package along with one or more libraries that provide concrete implementations of the abstractions, such as Microsoft.Extensions.DataIngestion.MarkItDown or Microsoft.Extensions.DataIngestion.Markdig.
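A consuming application's project file could then pair the main package with one implementation package, roughly as sketched here. The package names come from the text above; choosing MarkItDown rather than Markdig is arbitrary, and reusing the [CURRENTVERSION] placeholder for it is an assumption, since the two packages may not be versioned together.

```xml
<ItemGroup>
  <!-- Main package; brings in Microsoft.Extensions.DataIngestion.Abstractions transitively. -->
  <PackageReference Include="Microsoft.Extensions.DataIngestion" Version="[CURRENTVERSION]" />
  <!-- One concrete implementation (illustrative choice; version placeholder is assumed). -->
  <PackageReference Include="Microsoft.Extensions.DataIngestion.MarkItDown" Version="[CURRENTVERSION]" />
</ItemGroup>
```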
From the command line:
dotnet add package Microsoft.Extensions.DataIngestion --prerelease

Or directly in the C# project file:
<ItemGroup>
<PackageReference Include="Microsoft.Extensions.DataIngestion" Version="[CURRENTVERSION]" />
</ItemGroup>

We welcome feedback and contributions in our GitHub repo.