A glossary for data integration
Now that we have a basic understanding of what an integration is, it's important to establish some foundational concepts before pressing forward. You can still use OpenFn without knowing any of these words before reading our documentation, but some of the most important tasks along the OpenFn journey will assume at least a basic understanding of each of these terms. In some cases, we also link to further reading if you want a deeper understanding of some part of your data integration picture.
Note: This glossary is meant to be OpenFn-agnostic. The rest of the docs help you to get a picture of the parts of OpenFn, what we call them, and why, but this glossary is really meant as a prerequisite to all those other things to aid users with no experience in this area.
API is short for "application programming interface." It's the part of a software application that has been deliberately made visible (the interface) to users outside the application itself, and made visible in a programmatic way: one that lets developers of other applications or data systems use it the same way each time.
There's no hard and fast rule about how an API gets developed, but over time, standards have emerged to make it more straightforward for a new user to interact with Platform X's API, by trying to ensure most applications use one of a few different formats. That's what an API protocol is. A few of the big names here are REST, SOAP, and GraphQL, which typically carry data in formats like JSON or XML. Rather than reinvent the wheel, here's a good primer on how protocols differ, their data formats, and why that all matters.
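To make this concrete, here's a minimal sketch of the request/response shape most REST APIs use. The endpoint and field names below are hypothetical, not a real service: you send an HTTP request to a URL, and the server replies with a JSON body your code can parse.

```javascript
// A hypothetical REST request: method + URL + headers.
const request = {
  method: 'GET',
  url: 'https://example.com/api/patients/42', // made-up endpoint
  headers: { Accept: 'application/json' },
};

// What the server might send back: JSON text...
const responseBody = '{"id": 42, "status": "active"}';

// ...which JSON.parse turns into an object your code can use,
// the same way every time. That predictability is the point of an API.
const patient = JSON.parse(responseBody);
console.log(patient.status); // "active"
```

The request object here is just illustrative; in real code an HTTP client would send it and hand you back the response body.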
Any organized collection of data can probably be safely called a database. If it's got a structure with which to reference all the stuff it's storing, and the "stuff" is data, then it's a database.
A data source is an application, database, or table that provides data to some other platform. Nothing is always a data source. For example, Google Sheets can be a data source, but it can also pull from data sources (individual CSV uploads or manual user data entry). We just call it a source when it's doing the job of sourcing data to some other place. Data sources are the starting point, temporally, for any integration.
Sometimes folks get confused about the distinction between a database, a data source, an application, and a data system. A data system is a more complex collection of these other things, usually one that allows a user to more easily interact with all of the data they should have access to. The data system often serves as an entry point to the myriad databases, applications, tables, etc. that a user would otherwise have to go 12 different places to find.
In this day and age, security is everything. Encryption is the process of taking something that is readable to anyone and making it readable only to the people we want to read it. OpenFn ensures your data is encrypted every step of the way while it's in our platform. For more on different kinds of encryption, you can look here.
A file system is to files what a data system is to data. It structures your files in a way that makes it easy for you to retrieve them in a standardized way (think of your home file system with its file paths on your home computer). File systems can exist in other contexts too, and sometimes you need to access them to retrieve a file (a Word doc, CSV, plain text file, etc. might all be relevant depending on your use case). The only real difference between file systems and data systems or databases is the kind of information stored: data vs. files.
ETL stands for extract, transform, and load. These are often thought of as the three constituent parts of a data integration. First, we extract (push or pull data from a data source). Then, we transform (make any changes to the data to make it acceptable to the destination system or application). Finally, we load (send it to the destination).
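The three steps can be sketched as three plain functions. The field names and the destination's expected format below are made up for illustration:

```javascript
// Extract: get records from a (here, in-memory) data source
function extract() {
  return [{ first_name: 'Ada', last_name: 'Lovelace', age: '36' }];
}

// Transform: reshape each record into what the destination expects
function transform(records) {
  return records.map(r => ({
    fullName: `${r.first_name} ${r.last_name}`,
    age: Number(r.age), // destination wants a number, not a string
  }));
}

// Load: send to the destination (stubbed as a local array here)
const destination = [];
function load(records) {
  destination.push(...records);
}

load(transform(extract()));
console.log(destination[0].fullName); // "Ada Lovelace"
```

In a real integration, extract and load would be API calls or database queries rather than in-memory arrays, but the shape of the pipeline is the same.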
An integration platform (e.g., OpenFn) is an application (or set of applications) that helps organizations set up, run, and maintain/manage the integrations between all of their various systems.
You may also see the acronym "iPaaS". This stands for integration platform as a service and is a type of "software as a service" (or "SaaS"). SaaS is a software purchasing model in which software is paid for only as it is used (often month-to-month), rather than purchased up front or given away for free.
Metadata is data that tells us about our data. In a table, for example, that's the names of the columns, the number of rows, etc. Metadata is often brought up in conversations about privacy—e.g., regulators might want to ensure that only metadata is moved from Ministry A to Ministry B, as opposed to personally identifiable information (PII) about individuals themselves.
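Here's a small sketch of the distinction, with a made-up table: the metadata describes the dataset's shape without containing any of its values.

```javascript
const table = {
  // metadata: describes the data without containing any of it
  metadata: {
    columns: ['name', 'dob', 'district'],
    rowCount: 2,
    lastUpdated: '2024-01-15',
  },
  // data: the actual (potentially sensitive) values
  rows: [
    ['Amina', '1990-03-02', 'North'],
    ['Brian', '1985-11-20', 'South'],
  ],
};

// Sharing table.metadata reveals the shape of the dataset,
// but none of the personally identifiable information in table.rows.
console.log(table.metadata.rowCount); // 2
```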
Push, pull, and streaming
Pushing is when a triggering action in the data source causes it to send data to the destination. Pulling is the opposite, where the destination system requests the data from the source based on some triggering action, rather than waiting for the source to send it on its own. Streaming is a bit different, and it's when a data source is essentially constantly sending data to a destination system.
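A toy sketch of push vs. pull, with plain objects standing in for real systems (in practice the "trigger" would be an event in the source or a schedule in the destination):

```javascript
const receiver = { received: [] };

const source = {
  records: ['record-1', 'record-2'],
  // Push: the source decides when to send (e.g. when a record is saved)
  pushTo(dest) {
    dest.received.push(...this.records);
  },
};

// Pull: the destination asks the source for data
// (e.g. on a schedule like "every night at midnight")
function pullFrom(src) {
  return src.records.slice();
}

source.pushTo(receiver);         // source-initiated
const pulled = pullFrom(source); // destination-initiated
console.log(receiver.received.length, pulled.length); // 2 2
```

Either way the same records move; what differs is which side initiates the transfer.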
A webhook (also called a web callback or HTTP push API — thanks SendGrid!) is a feature of an application that allows pushing. It's often configured to notify some external URL when an event occurs. A system administrator might create a "webhook" which notifies an integration platform whenever some event occurs so that the iPaaS can start executing some complex workflow.
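When that notification arrives, the platform has to decide what to do with it. Here's a hypothetical sketch of such a handler; the event types and workflow names are invented for illustration:

```javascript
// The function an integration platform might run when a source system's
// webhook POSTs an event to the platform's URL.
function handleWebhook(event) {
  // Route the incoming event to the right workflow
  if (event.type === 'patient.created') {
    return { workflow: 'sync-patient-to-crm', payload: event.data };
  }
  // Unrecognized events are acknowledged but start nothing
  return { workflow: null, payload: null };
}

const result = handleWebhook({ type: 'patient.created', data: { id: 7 } });
console.log(result.workflow); // "sync-patient-to-crm"
```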
Structured and unstructured data
Structured data is data that has metadata. Unstructured data has very little metadata (though probably still has things like time of creation, update, etc.). Without metadata about the format of the data, unstructured data is more difficult to interact with programmatically. We need different sorts of rules when doing ETL on unstructured data to do it well, whereas structured data is an easier starting point because we know what to expect from a column with a name, data type, field size, and so on.
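The same information, two ways, shows why this matters. With structured data, code knows exactly where to look; with unstructured free text, you need ad-hoc rules (here, a regular expression) that are fragile by comparison:

```javascript
// Structured: fields are named and typed, so code knows what to expect
const structured = { name: 'Amina', visitDate: '2024-01-15', temperatureC: 38.2 };
console.log(structured.temperatureC); // 38.2

// Unstructured: free text; extracting the temperature needs a hand-written rule
const unstructured = 'Amina visited on 15 Jan 2024, temperature 38.2C, felt dizzy.';
const match = unstructured.match(/temperature\s+([\d.]+)C/);
console.log(match ? Number(match[1]) : null); // 38.2
```

The regex works for this one sentence; a slightly different phrasing ("temp was 38.2") would break it, which is exactly the fragility the paragraph above describes.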
Writeback refers to a destination system making a change in a data source. When my destination application receives information from a data source and wants to do something back to the source in response, that's writeback.