When a business owner approaches me to set up a new ETL pipeline for their emerging dataset, I always ask one question: “What’s the unique identifier of this dataset?” This question gets paraphrased according to the business owner’s knowledge and understanding on data and database, and often times I help them identify a candidate (either a single-column primary key, or a multi-column composite key).
What I don’t do is I never take no for an answer.
The statement that “This dataset has no unique identifier” invites trouble1 — ETL pipelines exist to bring order into the chaotic data from various sources. If they are adamant about the data being without a unique identifier, follow up with a few more questions:
Chances are you will find that they have absolutely no clue themselves. Kindly ask them to come back later.