Exams are particularly difficult for someone with a developer background. Even though I’m more focused on PaaS and SaaS components these days (like most people), the exams mostly cover IaaS components such as VNETs and VMs, including the disaster recovery scenarios with Azure Site Recovery. They also include a lot of Hybrid-Cloud scenarios, again with Azure Site Recovery. Long story short, most questions come from Azure Site Recovery (sigh). I understand the point; as an architect, you need to be confident around network and security components, but I believe it covers a tad too much of the exams.
Another area that the exam tests heavily is Azure Active Directory and its synchronisation scenarios, especially Azure AD Connect and how to use it with Hybrid-Cloud requirements, such as multi-national corporations with multiple regions as well as multiple on-premise sites. Again, as developers, we mostly focus on how to use AD rather than how to design it, so it requires a bit of first-hand knowledge there.
However, it is manageable and achievable, and I’m happy to have proved to myself that I’m capable of covering those scenarios, which were mostly in my blind spots. I hope to encounter many more different use cases!
Azure Data Factory gives you many out-of-the-box activities, but one thing it lacks is an easy way to run custom code. The emphasis here is on easy, because it only supports custom code through Azure Batch, which is a pain to manage, let alone to get working. The other option is to use Azure Functions, but Microsoft says in its MSDN documentation that we have only 230 seconds to finish what we’re doing; otherwise ADF will time out and break the pipeline.
So, how can we run a long job (longer than 230 seconds, that is) on ADF, say extracting the contents of a tar file (their example with normal Azure Functions is here)? The solution is to run it as background work with Durable Functions.
Durable Functions run on the same platform as normal Azure Functions, but they run asynchronously. When you trigger a Durable Function, it creates a background process and then gives you a few URLs that you can use to interact with that process, including one to query its status. When you call that one, it returns the status of the background job, as well as its data if it has finished.
The idea is to trigger a Durable Function through an HTTP endpoint, wait for it to finish and then get the result. Here’s what it looks like:
And here’s how it looks inside the Until activity:
Here’s the process:
First, we trigger our Durable Function through an HTTP trigger using Azure Function activity.
Then, with the Until activity, we check the status of that function.
The Wait activity waits around 30 seconds (or a different interval, up to you) to give the function time to run.
The Web activity makes a request to the status query URL that the Azure Function activity returns, by calling @activity('StartUntar').output.statusQueryGetUri
Until activity checks the result of CheckStatus Web activity with expression @not(or(equals(activity('CheckStatus').output.runtimeStatus, 'Pending'), equals(activity('CheckStatus').output.runtimeStatus, 'Running')))
It repeats until the function is finished or failed, or until it times out (set on the Timeout property)
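Outside of Data Factory, the Wait / CheckStatus / Until loop above can be sketched in a few lines of Python. This is only an illustration of the control flow, not how ADF implements it: `wait_for_orchestration`, `is_finished` and the simulated responses are hypothetical names and data, and a real client would replace `fetch_status` with an HTTP GET against the statusQueryGetUri.

```python
import time
from typing import Callable, Dict

# Runtime states that mean the job has not finished yet; these are the two
# values tested in the Until expression.
IN_PROGRESS = {"Pending", "Running"}


def is_finished(status: Dict) -> bool:
    """Python version of @not(or(equals(..., 'Pending'), equals(..., 'Running')))."""
    return status.get("runtimeStatus") not in IN_PROGRESS


def wait_for_orchestration(
    fetch_status: Callable[[], Dict],
    wait_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Wait, check the status, and repeat until the job finishes or we time out."""
    waited = 0.0
    while True:
        sleep(wait_seconds)            # the Wait activity
        waited += wait_seconds
        status = fetch_status()        # the CheckStatus Web activity
        if is_finished(status):        # the Until condition
            return status
        if waited >= timeout_seconds:  # the Timeout property
            raise TimeoutError("orchestration did not finish in time")


# Simulated responses standing in for real calls to the status URL.
responses = iter([
    {"runtimeStatus": "Pending"},
    {"runtimeStatus": "Running"},
    {"runtimeStatus": "Completed", "output": ["file1.csv", "file2.csv"]},
])
result = wait_for_orchestration(lambda: next(responses), sleep=lambda s: None)
print(result["runtimeStatus"], result["output"])
```

Note that, just like the ADF expression, this treats every status other than Pending and Running as "done", so a failed run also ends the loop and still needs to be checked afterwards.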
In this scenario, the Durable Function untars the file, keeps the contents in a storage account, and returns those storage URLs in the HTTP response body. The next step may be to copy those files into a Data Lake using a ForEach with Copy, or whatever the process requires.
Although this example is to untar the file, you can apply the same logic whenever you need to run a custom background job on Azure Data Factory; like converting an XML file to JSON, because ADF doesn’t support XML (this will be another blog post).
There are many ways to build an ETL process on-premise, but there’s no simple way, or a single comprehensive product, to do it in Azure. One option is to adopt a product built for another purpose and try to make use of it, which quite often fails miserably (like Azure Data Lake Analytics). Another option is to develop something from scratch, which requires a world-class design and development process. Most of these ETL requirements surface within another project, which probably doesn’t have the time or resources to accomplish something like that.
Recently, Microsoft has been investing a lot in both Azure Data Factory and Azure Databricks. Databricks lets you process data with managed Spark clusters through the use of data frames. It’s a fantastic tool, but it requires a good knowledge of distributed computing, parallelization, Python or Scala. Once you grasp the idea, it propels you forward like a photon torpedo. However, it’s a pain to learn.
Azure Data Factory, on the other hand, has many connectivity features but not enough transformation capabilities. If the data is already prepared or requires minimal touch, you can use ADF to transport your data, add conditional flows, call external sources, etc. If you need something more, it’s not helpful enough. At least it wasn’t until Microsoft released the next trick up its sleeve to the public.
Azure Data Flow is an addition to Data Factory, backed by Azure Databricks. It allows you to create data flows to manipulate, join, separate, aggregate your data visually, and then runs it on Azure Databricks to create your result. Everything you do with Azure Data Flow is converted to Spark code and executed on your clusters, which makes them fast, resilient, distributed and structured.
It provides the following activities:
To achieve an ETL pipeline, you compose a data flow with the activities above. A sample flow can contain a Source (which points to a file in blob storage), a Filter (which filters the source based on a condition), and a Target (which leads to another file in blob storage).
It has many great features, but it also has its limitations. At the moment, you can only specify file-based sources (Blob and Data Lake Store) and Azure SQL Database. For everything else, you can put a Copy activity in your Data Factory pipeline before you call Data Flow, which lets you query your source and output its results to blob storage, then execute the Data Flow on that file. Similarly, that’s what you need to do if you want to write to other sinks.
Let’s carry on with our experiment.
For our experiment, we’ll use a data file that contains the football match results in England, starting from 1888. The data file is available here and is around 14MB.
We’ll load the data into our storage account and then calculate three different outputs from that:
Total matches won, lost and drawn by each team, for each season and division
Standings for each season and division
Top 10 teams with most championships
The objective of our experiment is to prove that Data Flow and Data Factory:
Can create complex data processing pipelines
Are easy to develop, deploy, orchestrate and use
Make it easy to debug and trace issues
The source code for this example can be downloaded here. It includes the Data Factory files as well as the example CSV file, which you need to upload to blob storage before configuring your linked services.
We start with loading our CSV file into blob storage that we can access in Data Factory. For this, you’ll need to create a Blob Storage and then create a linked service for it:
First Example: Team Analytics
Then, we can continue to create our Datasets. For this experiment, we’ll need 4 datasets: MatchesCsv (our input file), AnalyticsCsv (for team analytics), SeasonStandingsCsv (for each season’s standings) and MostChampionsCsv (for showing top champions). All four are available with the source code of this article, and all of them are CSV files with headers enabled. The only difference is that the output datasets don’t have a file name, only a folder. This is a requirement of Data Flow (actually, of Spark), so that it can create one file per partition under this path.
Next thing on our list is to create our Data Flow. Data Flow is built on a few concepts:
A data flow needs to read data from a source and write to a sink. At the moment, it can read files from Blob and Data Lake, as well as Azure SQL Database. I’ve only experimented with Blob files so far but will try SQL as well.
It essentially creates a Spark DataFrame and pushes it through your pipeline. You can look DataFrames up if you want to understand how it works under the hood.
You can create streams from the same source or any activity. After you add an activity, click the plus sign again and select New Branch. This way, you don’t have to recalculate everything up until that point, and Spark optimizes this pretty well.
Flows cannot work on their own; they need to be put into a pipeline.
Flows are reusable; you can reuse and chain them in many pipelines.
In the private preview, you could also define your own custom activities by uploading Jar files as external dependencies, but this was removed in the public preview.
Enough with the advertisement, let’s get our hands dirty. We’ll start with our first case: Calculating the team analytics for each season and division. To do this, we’ll create a flow that’ll follow these steps;
The dataset shows match results, so it has a Home and an Away team column. To create team-based statistics, we’re better off building a structure with a single column named Team.
We do this by separating our source stream into two: Home Scores and Away Scores. Each is a Select activity that renames a column to Team: Home for Home Scores, Away for Away Scores.
Then, we need to add three columns based on the Result column (which has three values: ‘H’ for a Home win, ‘A’ for an Away win, ‘D’ for a Draw): Winner, Loser and Draw. These are boolean fields that indicate whether the team won, lost or drew. We’ll base our statistics on these. For the Home team, Winner is when Result equals ‘H’ and Loser is when Result equals ‘A’. For Away teams, it’s the other way round. We’ll achieve this using the Derived Column activity.
Then we use Union to combine these two datasets and form a single dataset we can run stats on, AllStats. It looks like this:
Now, here’s what we want to create as an output. Just like any good SQL statement, we’ll achieve this by using Group By and then running aggregations on it. We’ll do that using the Aggregate activity (named TeamAnalytics), configuring it as below:
When executed, this will run the aggregations and create the statistics output. The next step is to dump this into blob storage; but first, we order it using the Sort activity (by Season, descending) and set its partitioning to Key. This will create a separate file per season so that we have smaller and more organized files. Unfortunately, I have no idea how to give meaningful names to the files at this point, but I’m sure there’ll be a way in the future. Now, we can use the Sink activity to dump our contents into the AnalyticsCsv dataset.
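To make the logic of this first flow easier to follow, here is a rough plain-Python sketch of the same steps: Select and Derived Column per side, Union, then Aggregate. It is not what Data Flow generates (that is Spark code); the sample rows and team names are made up, and the output column names Wins, Losses and Draws are my assumption.

```python
from collections import defaultdict

# Made-up sample rows with fictional teams; column names follow the text above.
matches = [
    {"Season": 1888, "Division": 1, "Home": "Reds", "Away": "Blues", "Result": "H"},
    {"Season": 1888, "Division": 1, "Home": "Blues", "Away": "Greens", "Result": "D"},
    {"Season": 1888, "Division": 1, "Home": "Greens", "Away": "Reds", "Result": "A"},
]


def to_team_row(match, side):
    """Select + Derived Column: one row per team with Winner/Loser/Draw flags."""
    win_code, loss_code = ("H", "A") if side == "Home" else ("A", "H")
    return {
        "Season": match["Season"],
        "Division": match["Division"],
        "Team": match[side],
        "Winner": match["Result"] == win_code,
        "Loser": match["Result"] == loss_code,
        "Draw": match["Result"] == "D",
    }


# Union of the Home Scores and Away Scores streams.
all_stats = [to_team_row(m, "Home") for m in matches] + [
    to_team_row(m, "Away") for m in matches
]


def team_analytics(rows):
    """Aggregate: group by Season, Division and Team, then count each outcome."""
    totals = defaultdict(lambda: {"Wins": 0, "Losses": 0, "Draws": 0})
    for row in rows:
        key = (row["Season"], row["Division"], row["Team"])
        totals[key]["Wins"] += row["Winner"]    # True counts as 1
        totals[key]["Losses"] += row["Loser"]
        totals[key]["Draws"] += row["Draw"]
    return dict(totals)


analytics = team_analytics(all_stats)
for key in sorted(analytics, reverse=True):  # roughly the Sort step, descending
    print(key, analytics[key])
```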
Next: Season Standings
Now that we know how each team performed in each season and division, we can create more sophisticated reports. The next one is each team’s standing in its season and division. We’ll do this by again grouping by teams and ranking them by their position in the league, which will be based on their points in descending order. Now, I know that we have a crude points calculation (3 points for a win and 1 point for a draw) and that neither overall nor individual goal difference is part of the calculation, but hey, football’s already far too complicated these days.
We branch the TeamAnalytics activity and add a Window to the new stream. The Window looks like a very complicated activity, and it is, but we already know something similar from the SQL world: things like ROW_NUMBER() OVER (ORDER BY Count DESC). The Window does the same thing, better explained by its help graphic:
The main idea is to group and sort rows and create windows for each of them, then calculate aggregations on each window. It works much like its SQL equivalent, and it’s just as powerful. We’ll use it to create groups by Season and Division, then sort by Points in descending order and calculate the row number for each row:
Range by current row offset
We’ll then proceed and dump this into another CSV dataset of ours, SeasonStandingsCsv.
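For readers who think better in code, the Window step boils down to "row number per (Season, Division), ordered by Points descending". Here is a minimal plain-Python sketch of that idea; the rows are made up, and the Rank column name is my own choice for illustration.

```python
from itertools import groupby

# Made-up TeamAnalytics-style rows with fictional teams.
analytics = [
    {"Season": 1889, "Division": 1, "Team": "Blues", "Wins": 10, "Draws": 2},
    {"Season": 1888, "Division": 1, "Team": "Reds", "Wins": 8, "Draws": 1},
    {"Season": 1888, "Division": 1, "Team": "Blues", "Wins": 9, "Draws": 3},
    {"Season": 1889, "Division": 1, "Team": "Reds", "Wins": 11, "Draws": 0},
    {"Season": 1888, "Division": 1, "Team": "Greens", "Wins": 2, "Draws": 5},
]


def window_key(row):
    """The 'Over' part of the Window: one window per season and division."""
    return (row["Season"], row["Division"])


def season_standings(rows):
    # Crude points calculation: 3 points for a win, 1 for a draw.
    scored = [{**r, "Points": 3 * r["Wins"] + r["Draws"]} for r in rows]
    # Sort by window first, then by Points descending inside each window.
    scored.sort(key=lambda r: (window_key(r), -r["Points"]))
    standings = []
    for _, window in groupby(scored, key=window_key):
        # The row number restarts at 1 for every window.
        for rank, row in enumerate(window, start=1):
            standings.append({**row, "Rank": rank})
    return standings


standings = season_standings(analytics)
for row in standings:
    print(row["Season"], row["Rank"], row["Team"], row["Points"])
```

Teams with equal points keep their input order here, which is about as crude as the points calculation itself.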
Next: Top 10 Most Champions
Our next step is to create a list of the top 10 teams with the most championships. To do that, we’ll branch SeasonStandings and apply a series of transformations to it.
Filter: To get only the champions from each season and division.
Aggregate: To calculate championship count for each team
Window: To calculate row number by sorting the entire list descending on Count
Range by current row offset
Filter: To get top 10 rows based on No row number column
Optimize: Single partition
As you can see, it’s an unusually long and complicated flow just to get the top 10 rows. Usually, after the Aggregate activity, we should’ve been able to use Filter to get the top 10 rows directly. However, Data Flow’s Filter activity does not support Top N functionality yet, so we had to improvise. The workaround is to put all the records into the same window (0*Count, which results in 0 in the ID column), sort by Count in descending order, then add the row number as a column. The trick here is to put everything in a single window; otherwise, the row number would restart for each window. I know, it’s a workaround, but hey, life is a big workaround to death anyway.
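The Filter, Aggregate, Window, Filter chain is easier to see in a few lines of plain Python. This sketch only mirrors the logic; the input rows are made up, and Rank is a hypothetical name for the standings row number, while No and Count follow the names used above.

```python
from collections import Counter

# Made-up SeasonStandings-style rows with fictional teams.
standings = [
    {"Season": 1888, "Division": 1, "Team": "Reds", "Rank": 1},
    {"Season": 1888, "Division": 1, "Team": "Blues", "Rank": 2},
    {"Season": 1889, "Division": 1, "Team": "Reds", "Rank": 1},
    {"Season": 1889, "Division": 1, "Team": "Greens", "Rank": 2},
    {"Season": 1890, "Division": 1, "Team": "Blues", "Rank": 1},
    {"Season": 1890, "Division": 1, "Team": "Reds", "Rank": 2},
]


def top_champions(rows, n=10):
    # Filter: keep only the champion of each season and division.
    champions = [r["Team"] for r in rows if r["Rank"] == 1]
    # Aggregate: championship count per team.
    counts = Counter(champions)
    # Window: a single window over all rows, sorted by Count descending,
    # with a row number column ("No") that never restarts.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    numbered = [
        {"No": no, "Team": team, "Count": count}
        for no, (team, count) in enumerate(ordered, start=1)
    ]
    # Filter: keep the first n row numbers.
    return [r for r in numbered if r["No"] <= n]


print(top_champions(standings))
print(top_champions(standings, n=1))
```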
After these, we can safely dump our result set into our MostChampionsCsv dataset. Here’s what our calculation branch looks like:
And finally, here’s our grand result with three different sinks and many streams:
I have categorized my comments on Data Flows below but, long story short, it looks like a promising product. It handled complex scenarios well, and Microsoft seems to be investing a lot in it. If it turns out to be a success, we’ll have an excellent tool for designing our ETL processes on Azure, without the need for any third-party tools.
I found Data Flow quite powerful and capable of creating complex data flows. In particular, being able to build team rankings by season with the help of the Window activity shows that most of our data processing use cases can be addressed quickly with Data Flows.
There’s a good collection of activities available for data transformation. It covers most scenarios; if not, you can always enrich it with other Data Factory (not Data Flow) capabilities, like Azure Function calls.
It looks like Microsoft is investing a lot in Data Factory and Data Flows. The dataset creation page says Data Flows will eventually support most of the sources Data Factory supports. This is good: it means we’ll be able to query the source directly with Data Flow, without having to download the data into a CSV first and then process it.
Because Data Flows live inside Data Factory pipelines, they benefit from all the features pipelines provide. You can trigger a pipeline with an event, and the pipeline can connect to a source, transform the data, and then dump the results somewhere else. Data Flows also use Data Factory’s monitoring and tracing capabilities, which are quite good. It even shows Data Flow runs in a diagram, so you can see how many rows have been processed by each activity.
The Window activity is quite powerful when creating group based calculations on the data (like rowNumber, rank, etc.).
Creating data streams and processing data through that with activities makes your flow easily readable, processable, and diagnosable.
To develop and see your activities’ outputs instantly, there’s a feature called Debug Mode. It attaches your flow to an interactive Databricks cluster and lets you click on activities, change them and see the result immediately without the need of deploying any flow or pipeline and then running it.
Even though it’s still in preview, Data Flows are quite stable. Keep in mind that even if it has a Go-Live license, things can always change until GA.
Also, given that the flow definition is just a JSON file, it’s possible to parse it and use it in a data catalog. We could parse these transformation rules to calculate data lineage reports in the future.
No “Top N Rows” support in Filter activity yet. Not essential, but annoying.
In the private preview, we were able to create a Databricks cluster and link it in Data Factory. This allowed us to control and maintain our clusters if needed, as well as to keep them up or down at specific times. Now, in the public preview, we cannot specify our own Databricks cluster, and it spawns a new cluster every time you start a job. Considering that an average cluster takes around 5 minutes to wake up, this increases job run times significantly, even for small jobs.
By removing the Databricks linked service, they also removed the support for custom code (Extend activity). It wasn’t a necessary capability, but knowing that it was there was a comfort for future complex scenarios. Now that it’s gone, we cannot create custom Spark functions anymore.
Another consideration over auto-managed clusters is the data security part. It’s not clear yet how and where Data Factory manages these clusters. There’s no documentation around it yet, so it’s a future task.
Even though you are developing through the Azure Portal rather than your regular IDE, it’s still an excellent experience. The Data Factory portal is quite reliable and feels as powerful as an IDE.
From the portal, you can download the latest version of your pipelines, datasets, and dataflows as JSON and put it into your source control later. This requires the development environment to be shared by the developers, but with proper planning and multiple dev environments, I’m sure that won’t be an issue.
Data Factory also has source control integration directly through its portal. You can hook it up to a Git repo, create branches, load your pipelines and flows, commit your changes, deploy them, create PRs, etc. It makes a good IDE experience on the portal.
You can also download the ARM template and incorporate it with your source control easily. It also allows you to import the ARM template, which can be handy in some cases.
Each pipeline, dataset, and flow has a JSON code which you can see and download. Then, with the Data Factory API, you can deploy them into any Data Factory.
One thing I didn’t like: when you add a source, point it to a dataset and import its schema into your source, it loses the data types as well as the formats. This is annoying and requires constant manual editing in both places whenever you change the dataset’s schema.
Another negative: when you have a timestamp field on your source and you define a date-time format, it always empties the format field and makes you enter it again and again. I mostly encountered this in Debug mode, where you click on the activities repeatedly.
Considering it’s backed by Databricks, it should do quite well when it comes to processing large amounts of data.
Even with a small amount of data, the execution is quite fast. It depends on how many transformations you have, but still good.
I have not recorded any successful execution time below 1 minute 30 seconds. My average run time with the current experiment is around 00:02:30 to 00:03:30. This is not a bad time, especially compared to Azure Data Lake Analytics, but it still makes you think twice about near-real-time use cases. Maybe some more optimization, like properly partitioning the data at the beginning of the flow, can decrease the time significantly, but that’s for another day.
Because they removed Bring-Your-Own-Databricks in the public preview, now it manages the clusters for you. This causes a new cluster to be kicked off every time you run a data flow pipeline, which makes the minimum run time around 6 minutes and 30 seconds.
Even though I knew beforehand that it would be difficult, it was still quite a shock. Three hours is tiring and trying enough for a normal test, but staring at a screen under strong light and the smell of damp (the test centre I used was horrible; I had a terrible headache for hours afterwards) makes it feel like an end-game boss fight.
Anyway, it is difficult but certainly doable, especially if you have good knowledge of VMs and VNETs. This was the most challenging part for me; as a developer I do have a good grasp of PaaS and SaaS components, and even VM configurations, but VNETs and network components like Virtual Appliances are not my strong suit.
I know the NDA doesn’t let me talk about the questions but here’s some vague info on the exam:
I got around 55 questions, and most of them were about VMs and VNETs. There were even some case-study questions.
There were 2 hands-on labs where you need to complete 10-12 tasks per lab using the Azure Portal or Azure Cloud Shell. It doesn’t really matter which you choose; it’s the result that counts.
The labs were the most time-consuming parts of the exam. Three hours was just enough for me, but they could easily take longer.
The next exam is AZ-301: Microsoft Azure Architect Design, which has an even broader scope. From the subtle details of Azure AD to building a secure data platform, that’s a lot of ground to cover.
The first step of that journey is Exam AZ-300 at the end of the month. It has five main categories:
Deploy and configure infrastructure
Implement workloads and security
Create and deploy apps
Implement authentication and secure data
Develop for the cloud and for Azure storage
Personally, I have more than 80% of these subjects covered. My main gaps are VM migrations/site recovery and alerts (and maybe a little bit of Kubernetes, which is relatively new on Azure). I’ll be trying to fill these holes in my knowledge until the day of the test, and I’m confident that I will pass with flying colours.
There are some sample tests available online and I took the free one from Whizlabs and failed spectacularly. The test has 15 questions but oddly more than 10 are about VM migrations, which probably doesn’t reflect the reality of the test.
I’ve managed to touch most of the available Azure services over the last few years, but sometimes it feels like swimming against a strong current. Maybe it won’t be that difficult to get this certification, but it’s certainly going to be difficult to stay at the same level.
It caught my attention that there’s still no easy way to rotate an Azure Function App’s host key, in case you want to do it regularly and put it into a Key Vault. There are many scripts available online but none of them did the work I needed, so I compiled them into a single and simple one.
We released a new video a few days ago, focusing on how the Application Lifecycle, from Development to Deployment, is affected throughout this transformation. I hope you’ll enjoy it as much as the first video!
One of the new functionalities of Azure Functions is that they can be triggered by an event. One of the use cases I encountered at one of my clients was to copy a file from Blob Storage to Azure Data Lake Store whenever a new file arrived in the blob storage. A simple way to do this was to create an Azure Event Subscription, which would listen to the events of the blob storage and then kick off an Azure Function to trigger the copy process. Most of the stuff was easy to implement: ARM templates, the Azure Function itself, key rotations, configurations, CI/CD pipeline, etc. But one thing was an issue: when deploying an Event Subscription via ARM templates, you need to create the Webhook URL yourself, which includes getting the authentication code and putting it into the Webhook URL.
If you create the subscription through the Azure Portal, it automatically resolves the auth code; but you cannot do the same thing with ARM templates. Furthermore, that code it creates is not one of the keys available on function app/function itself. Apparently it’s something called “Event Grid Extension System Key”, which can only be obtained through Kudu/Admin API of Function App. We somehow need to obtain this key and pass it as a parameter to ARM template, so we can create the Webhook URL properly and deploy the ARM template through VSTS Release.
In order to get this “system key”, we need to create a Powershell script and do some chained API calls. Here are our steps:
Obtain a Bearer token to call Kudu API to obtain Function App master key. (Explained here.)
Call Kudu API and get the master key. (Explained here and here.)
Call Kudu API and pass the master key to obtain Event Grid Extension System Key (Explained here.)
Let’s get on the road. My starting point was the Powershell script developed here; which was really great. I refactored it and added the System Key codes into it. Furthermore, it’s now ready to use directly on your VSTS pipeline. Here is the full script:
The script will do all the API calls and write the System Key into a VSTS variable called FunctionAppEventGridSystemKey. Through this variable, we can construct the Webhook URL, which we do in the following ARM template for the Event Subscription:
If you wish to add it to your deployment pipeline, first you add an Azure Powershell task and execute your Script:
Then, execute your Event Subscription ARM template by passing the System Key as a template parameter:
And, voila! You have (hopefully) successfully deployed your Event Subscription. In this particular example, your Event Subscription will pick up events from your storage account and push the event to your Azure Function.
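As a rough illustration of what "constructing the Webhook URL" means, here is a small Python sketch that assembles it from the function app name, the function name and the system key. The helper name and the sample values are hypothetical, and the URL shape is an assumption on my part: it is the one documented for the Functions v2 runtime, while the v1 runtime uses a different path, so check it against your own function app before relying on it.

```python
from urllib.parse import urlencode


def build_event_grid_webhook_url(function_app_name, function_name, system_key):
    """Assemble the Event Grid webhook URL for a function (assumed v2 runtime shape)."""
    base = f"https://{function_app_name}.azurewebsites.net/runtime/webhooks/EventGrid"
    # urlencode escapes characters such as '/' and '=' that can appear in keys.
    query = urlencode({"functionName": function_name, "code": system_key})
    return f"{base}?{query}"


# Hypothetical values; in the pipeline the key would come from the
# FunctionAppEventGridSystemKey variable.
url = build_event_grid_webhook_url("myfunctionapp", "CopyBlobToDataLake", "abc123")
print(url)
```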
There are many easy ways to do things with ARM templates, but sadly this wasn’t one of them. I am hoping in the future Microsoft will somehow embed Kudu into Azure RM API, so we can do all of this through Powershell commands.
As we mentioned in the intro of the series, we are looking at the gap between on-premise and cloud systems from different angles. The first article in the series was “Cloud: Why?”, in which we talked about why an organisation would decide to go to the cloud. In this article we will discuss how that decision changes the solution architecture and what the right approaches are.
The first question here is, why would moving to the cloud change our architecture? Aren’t there any VMs over there? Can’t we just build our existing architecture on the cloud?
Yes, you can. And yes, there are VMs over there. But why would you? Why would you want to build an on-premise look-alike system on the cloud? If that’s what you want, why move at all? Apart from cutting down some hardware maintenance costs, there is absolutely nothing else you’ll be gaining from that.
If you are moving towards the cloud, you need to adopt the design patterns and best practices that apply there. They are called “best practices” for a reason: they can shed light on your treacherous path, reducing the time and effort needed for your transformation.
Let’s take a look at what needs to be changed.
There are some things on the cloud that still work the same way. Let’s assume that you have a web application that connects to an API, which talks to a SQL database and returns the records. This still works the same; the components you use just change names:
IIS Application Pool
App Service Plan
Azure SQL Server
Azure SQL Database
In the on-premise world, you host your Web and API in two separate IIS Websites. API then connects to your SQL Database under a SQL Server, which probably is generic and has many databases on it. On the cloud, you deploy your Web and API on App Services, which are equivalent to IIS Websites. You also have an Azure SQL Server that has your Azure SQL Database.
Then what changes in this example? Well, many things, but one of them is authentication. The on-premise world (at least the part running on the Windows platform) has an Active Directory domain to take care of it. You just enable Windows Authentication and bam! No anonymous user can connect to your site. If you want to do authorisation, you can check if the user is in the correct AD group. If you want to connect to your API, you ask for service accounts and then enter those credentials into the Application Pools as Application Pool Identities. This way, your application can connect to other apps that have Windows Authentication enabled.
But when you move to PaaS on Azure, there is no domain available there. You cannot enable Windows Authentication; your website is living on a machine that is not on your domain (and probably running many other apps from other people at the same time).
Things like this, even if they’re small (especially when they’re small), can cause dramatic changes in your application architecture.
Use PaaS and SaaS
There are many tools that cloud providers give you as PaaS (Platform as a Service) and SaaS (Software as a Service). These tools allow you to focus on your application development and forget about the maintenance of these services in the background. All you have to do is know what the SLAs are and keep an eye on planned maintenance times. The picture below is #2 in my favourites (#1 is the Liskov Substitution Principle motivational picture with the duck example) and clearly shows the difference between on-premise, IaaS, PaaS and SaaS.
PaaS tools provide you with the proper runtime and a deployment strategy, which allows you to develop your applications and launch them directly on the platform. This makes a huge difference in delivery speed. It also changes most of the patterns you know from the on-premise world. Things work differently on the cloud: authentication and authorisation are different, logging is different, reliable messaging still looks the same but feels different.
These differences will change your application architecture, mostly in a good way. Let’s put these changes into concrete perspective.
Let’s visit our example again. You found out that there is no domain available on PaaS services. After some heavy drinking and whining, you realise that there is something called OAuth and that Azure Active Directory implements it. You create an Azure AD, enable OAuth and OpenID Connect on your applications, create service principals and bam! You have lift-off. (We’ll discuss this security stuff in detail in its own article.)
Azure Specific Tools
There are many things Azure (and other cloud providers) provide. Trust me, the list is endless. When I visit the Azure portal, I sometimes feel like a kid breaking free from his parents and racing towards the snacks aisle in a supermarket. For most of your use cases, there is a tool available there.
(I suggest you take a look at Azure Info Hub, it shows almost everything Azure is capable of. It also includes documentation, videos, examples, links, etc.)
You also get lots of new toys to play around with (cheeky childish smile on). If you want to do an IoT-based project, Azure gives you Azure Stream Analytics and Azure IoT Hub. These tools can process huge amounts of incoming data on the fly and produce the value your business needs from those devices.
You have very large files that you batch process through the night? Well, there is Azure Data Lake Analytics, it’s the perfect tool for the job.
Do you have analysts that slice and dice the data already? There’s Azure Databricks to help them do that with the help of Spark, Scala, Python and R.
Do they want to use some machine learning to gain analytics and insights? Azure Machine Learning has an extensive library/market already and Azure Databricks plays well with it.
You want to run some code triggered by a service bus message but don’t want to go full scale with App Services? You have Azure Functions and Logic Apps. They are both part of the Server-less Architecture that Azure provides.
You prefer containers? Azure has Azure Kubernetes Service. It’s managed and quite powerful; you can easily deploy your container and have a beer on the balcony right away.
You need a No-SQL database? There’s CosmosDB. It even has Graph DB support with Gremlin, in case you need it (I do at the moment).
I mostly mentioned Azure here but there are equivalents/alternatives on other cloud providers such as Amazon. They are equally powerful.
Bottom line is, you don’t have to develop something on your own or install the same product and use it. There are many things available out there; you just have to choose them and integrate them with your applications. These will drive changes on your traditional solution architecture but remember: Change is good.
Don’t forget to check out other articles in the series; each of them will take the matter from a different angle. Next one is going to be on Application Development and CI/CD.
On June 27th, Microsoft announced the Public Preview of Azure Data Lake Store Gen2. It’s more powerful and now equipped with many features that Gen1 didn’t have. This is thanks to the full integration with Blob storage now.
Azure Data Lake Storage Gen2 offers a no-compromise data lake. It unifies the core capabilities from the first generation of Azure Data Lake with a Hadoop compatible file system endpoint now directly integrated into Azure Blob Storage. This enhancement combines the scale and cost benefits of object storage with the reliability and performance typically associated only with on-premises file systems. This new file system includes a full hierarchical namespace that makes files and folders first class citizens, translating to faster, more reliable analytic job execution.
Azure Data Lake Storage Gen2 also includes limitless storage ensuring capacity to meet the needs of even the largest, most complex workloads. In addition, Azure Data Lake Storage Gen2 will deliver on native integration with Azure Active Directory and support POSIX compliant ACLs to enable granular permission assignments on files and folders.
As Azure Data Lake Storage Gen2 is fully integrated with Blob storage, customers can access data through the new file system-oriented APIs or the object store APIs from Blob Storage. Customers also have all the benefits of Azure Blob Storage including encryption at rest, object level tiering and lifecycle policies as well as HA/DR capabilities such as ZRS and GRS. All of this will come at a lower cost and lower overall TCO for customers’ analytics projects! Azure Data Lake Storage Gen2 is the most comprehensive data lake available anywhere.
I’m digging up more information on Gen2 and preparing to draft an in-depth article on its pros and cons; but for those who are interested, here you can watch the video on Youtube or use the resource links below: