Earlier today, I used the Power BI REST API for the first time to access some reports in my company's workspace. This API can be used with Windows PowerShell, a tool I use on occasion but rarely for any real scripting. After spending a couple of hours familiarizing myself with both PowerShell and the API, I think I've finally gotten the hang of it. If you've never used the Power BI REST API with PowerShell before, you first need to download the module onto your system. Enter the following into the PowerShell console:

Install the Power BI REST API Module
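A minimal sketch of the install step, assuming the official MicrosoftPowerBIMgmt module (the cmdlet collection Microsoft publishes for the Power BI REST API):

```powershell
# Install the Power BI management cmdlets from the PowerShell Gallery;
# -Scope CurrentUser avoids needing an elevated console
Install-Module -Name MicrosoftPowerBIMgmt -Scope CurrentUser
```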
Next, I will demonstrate how to quickly interact with the API and your published content. If you haven't yet, open up PowerShell ISE, an IDE for PowerShell scripting. I encourage you to run each line of code independently (not all at once) and to examine the structure and contents of each variable. And of course, don't forget to save your script when finished.

Accessing Report Metadata
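A sketch of what those first lines can look like, using cmdlets that ship with the MicrosoftPowerBIMgmt module; the workspace name below is a placeholder:

```powershell
# Sign in to the Power BI service interactively
Connect-PowerBIServiceAccount

# Retrieve the workspace by name (placeholder name for illustration)
$workspace = Get-PowerBIWorkspace -Name "My Company Workspace"

# Pull metadata for every report in that workspace
$reports = Get-PowerBIReport -WorkspaceId $workspace.Id

# Examine the structure and contents of the results
$reports | Format-List Id, Name, WebUrl
```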
I'll be posting more tutorials and code snippets on Power BI API implementations in the future. Until then, I would strongly recommend watching Guy in a Cube tutorials on YouTube. They have some great content for all things Power BI, including implementation of the REST API. Thanks for stopping by, and happy coding!
The applications for hosting an R Shiny web application using cloud data are huge, and learning to work with cloud services is critical for data analysts and scientists. Dashboards that rely on cloud-hosted data do not need to be constantly redeployed; as long as they point to the right file on S3 (or database on RDS, for instance), your Shiny application will use the most recently available data. This has the potential to reduce overhead and provide real-time or near-real-time access to information for your clients. In this post, I will demonstrate how to: (1) set up the bones of an Amazon Web Services (AWS) S3 bucket, (2) connect S3 to an R script, and (3) insert this connection into an R Shiny dashboard.

Setting up an Amazon S3 Bucket

1. Create an AWS account. You will be asked for an email, password, and credit card information. As long as you stay within Free Tier restrictions, you will not be charged. Only my use of the AWS Key Management Service (KMS) has incurred charges from the workflow described in this tutorial, and so far I've only been charged $0.12, a reasonable sum to learn a highly sought-after skill.

2. Create an S3 bucket. The name of your bucket must be unique; no other AWS user can have created a bucket with the same name. Choose an applicable region. The "cloud" just refers to data centers that Amazon maintains, and these data centers are located in various regions. I chose the region closest to me as of the writing of this tutorial. Make sure you remember the region you picked; transferring data between regions can incur costs, so if you anticipate needing other AWS services in the future, keeping them in one region will reduce spending.

3. Block all public access to this account. Amazon recommends this security feature be activated to prevent unrestricted access to the contents of your bucket. You can always change this setting in the future, but I recommend enabling it until you are more familiar with what your cloud needs will be and how Amazon charges for the use of its servers. You can find this feature in the Management Console.

4. Encrypt your data. I can't stress the importance of data security enough, so I recommend you enable server-side encryption. This encrypts all data at the object level as it enters the bucket. You have the option of using either SSE-S3 encryption or SSE-KMS (Key Management Service) encryption; it is up to you to decide which is more suitable for your needs. I personally chose SSE-KMS with symmetric encryption, and I assigned a specific IAM user access to the key. (For details on setting up an IAM user, see the next step; this can be done concurrently.) When you use SSE-KMS, you create an Access Key ID, a Secret Access Key, and a user for that key. Keep these credentials safe; you will need them for APIs and other services (e.g., R Shiny) to access your data.

5. Set up IAM users. IAM stands for Identity and Access Management. It allows you to give users access to certain AWS services without needing access to the root account, which improves security and reduces the risk of unauthorized access to data and cloud services.

6. Complete the set-up of your bucket and upload files. Feel free to upload some test files. I decided to upload an RDS file for ease of use with R. This RDS file contains a simple Leaflet map I created previously that I wanted to be able to showcase easily on Shinyapps.io. I ensured all my read/write permissions for the file were set to 'private'.

7. Optional: set up budget alerts.
In the Management Console under Billing, you have the option to set up email and SNS alerts if you begin to go over budget. I have mine set to $10/month, and I will receive notifications if I approach 80% of that amount.

Connect R to your S3 Bucket

Now comes the easy part! Seriously, connecting R to S3 is a piece of cake if you are already an R programmer. First, store your S3 credentials in a file named ".Renviron" in your working directory; it is good practice not to keep credentials hardcoded in your code. The format should be something like this:

AWS_ACCESS_KEY_ID = "XXXXXXXXXXX"
AWS_SECRET_ACCESS_KEY = "XXXXXXXXX"
AWS_DEFAULT_REGION = "XXXXXXX"

R will know where to find these credentials. I've also hidden the file names and paths within text files located in my working directory for the same reason. Run my code below (albeit with the correct folder paths).

Connect R to AWS S3 Bucket
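The original gist is not reproduced here; the following is a minimal sketch of the connection step, assuming the {aws.s3} package and placeholder bucket and object names:

```r
# Credentials are read from .Renviron, which R loads automatically at startup
library(aws.s3)

bucket_name <- "my-example-bucket"   # placeholder: use your bucket's name
object_key  <- "leaflet_map.rds"     # placeholder: the RDS file uploaded earlier

# List the bucket contents to confirm the connection works
get_bucket(bucket = bucket_name)

# Read the RDS object straight from S3 into an R object
my_map <- s3readRDS(object = object_key, bucket = bucket_name)
```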
Implementing this into a Shiny dashboard is relatively straightforward once you've completed the previous task. We simply need to wrap all this into a new script called app.R.

Wrap it all into a Shiny Dashboard
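In place of the original gist, here is a bare-bones sketch of what app.R can look like, assuming {shiny}, {leaflet}, and {aws.s3}, with placeholder bucket and object names:

```r
library(shiny)
library(leaflet)
library(aws.s3)

ui <- fluidPage(
  titlePanel("S3-backed Leaflet map"),
  leafletOutput("map")
)

server <- function(input, output, session) {
  # Fetch the most recent version of the map from S3 when the app starts;
  # credentials are read from .Renviron as described above
  map_obj <- s3readRDS(object = "leaflet_map.rds", bucket = "my-example-bucket")

  output$map <- renderLeaflet({
    map_obj
  })
}

shinyApp(ui = ui, server = server)
```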
This is a pretty bare-bones dashboard, merely a structure for you to build off of. But I hope you've found this tutorial helpful for merging these technologies together. And once you build out the dashboard, deploying it online is pretty simple using Shinyapps.io. Another option could be to use AWS EC2 to host the application, depending on your needs.
Please let me know how you like this tutorial; I am always receptive to constructive feedback and I would be happy to hear suggestions that may improve my approach. Thanks, and happy coding!

Introduction
For a long time, the ability of individuals and organizations to analyze geospatial data was limited to those who could afford expensive software (such as TerrSet, ERDAS, ENVI, or ArcGIS). R has emerged as an alternative to these expensive software packages. R is an open-source statistical programming language that enables users to create reproducible, easy-to-understand, and efficient workflows. Its widespread use has inspired many individuals (many, if not most of whom are researchers and data scientists) to create packages that expand the capabilities of the language, including in GIS and raster image analysis. This project highlights some of the abilities of the R programming language to work with geospatial data, all made possible through these packages. Specifically, the focus will be on the data cleaning, interpolation, and advanced analysis of time-series data.

Objectives

This tutorial, developed collaboratively between myself and Priyanka Verma, seeks to gain insights from particulate matter air pollutant trends in the New York Metropolitan Area. Specifically, we look at particulate matter that is less than 2.5 micrometers in diameter (PM2.5), and only data obtained during winter months, due to the higher levels of PM2.5 during that season.

Study Area

Our study area includes part of the New York Metropolitan Area, including New York City, Long Island, counties in Upstate New York, and large portions of New Jersey. These areas were determined by the New York Core Based Statistical Area (NY CBSA).

Data

PM2.5 data has been provided by the Environmental Protection Agency (EPA) and can be found through the open data portal: https://www.epa.gov/outdoor-air-quality-data/download-daily-data. The CBSA boundary shapefile can be obtained from the United States Census Bureau.

Workflow

The full tutorial has been hosted on GitHub, which I have found to be a much easier way of sharing code and code output. You can find the tutorial at https://freygeospatial.github.io/PM25-TimeSeries-R-Tutorial/, which contains the code and output in an easy-to-follow format. The main repository can be found at https://github.com/verma-priyanka/PM25_NYC, which contains the actual files (including raw data) used in our analysis.

Last fall semester, I took an independent study where I expanded my knowledge of SQL beyond its traditional use of querying relational databases for text and numeric values. Instead, I began using it as a standalone GIS, storing and querying geometries (points, lines, and polygons). One of the ways I initially learned to use PostGIS functions (my new RDBMS of choice) was to redo some homework assignments from a previous introductory course in GIS, but using SQL functions instead of ArcMap geoprocessing tools. This particular homework assignment sought to illustrate a use case for buffering, dissolving, and intersecting. By transliterating the workflow from ArcMap into SQL code, I was able to create a process that was reproducible, easy to understand, and more efficient to run. Plus, there was no bloat of intermediary files, a common data management problem when geoprocessing data in ArcGIS. The assignment simulates a problem easy to envision occurring in the natural resources sector: "Forest management staff have asked you to develop information to assist with the application of pesticides... Management intends to apply pesticides to areas within 200 meters of each infested point." The goals of the project include:
We are given four different sets of geometries, stored as shapefiles:
First, I uploaded each shapefile into its own table in the database under a specified schema. I used ogr2ogr through the command prompt to accomplish this. Let's take a look at our inputs:

Then, the first task was to create a buffer around the infestation points and to dissolve the boundaries. You can do this in a couple of ways. The first is to create a nested query, like so:

You can think of this method as querying the results of another query. Using nested queries when I first learned database management helped me understand more intuitively how databases function. Additionally, it was crucial for helping me understand the WITH statement (also called a common table expression, or CTE), which is very important for making large queries easier to read and understand. By themselves, however, nested queries can often be messy, lack conciseness, and become overly verbose. They are very important to learn, but when you can avoid using one, it is often best to do so. A better method would be the following:

Both result in the same output. See below:

Let's break down this process. First, we use the ST_Buffer function on a.geom (with 'a' being an alias for the table infest_points) and give it a buffer of 200 meters. 'geom' is the column that holds the geometry of infest_points; remember that we set this name originally in our ogr2ogr command. Using ST_Buffer alone (without the other functions) would produce the following result:

As you can see, we still need a dissolve, which leads us to add the function ST_Union(). The problem with this, though, is that we no longer maintain the individuality of each buffered point; each buffer has been merged into a single record, which is not what we want. To avoid this, we use ST_Dump(). According to the PostGIS documentation, ST_Dump can be thought of as the opposite of a GROUP BY statement. Remember to add '.geom' at the end of your SELECT statement, otherwise you will be returned an improperly formatted geometry column without the desired visuals.

Knowing this general methodology, let's skip ahead to my near-final product. You will find that at this stage in my coding, I resorted back to nested queries! Not the most pleasant sight. It's hard to believe, but this hard-to-read chunk identifies the forest cover within each buffered location, as well as the sections that fall within the park boundary. But how does it work? What are those inner joins doing? It's not easy to tell with the way this code is formatted...

What I am about to show will start to make the above more readable. I will introduce the WITH statement, which, as I mentioned previously, provides an easier way to write large SQL queries. See the documentation for that here. The code chunk below is equivalent to the above, and a bit easier on the eyes:

The WITH statement is akin to a nested query, but it allows users to separate their code into chunks, piping one query into the next quite easily, even on an ad hoc basis. See how we can easily add an extra line at the end of the above code to return the area of each forest type that will be impacted by the pesticide application:

*Note that I added the random() function to the code just as an assurance that no future student will copy my results.
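Because the original code embeds are not reproduced above, here is a condensed sketch of the buffer/dissolve step and its CTE equivalent. The infest_points table and its geom column follow the text; the forest table and fortype column are hypothetical names used only for illustration, not the actual assignment data.

```sql
-- Nested-query version: buffer each infestation point by 200 m, dissolve the
-- buffers with ST_Union, then ST_Dump the result back into individual rows
SELECT (ST_Dump(unioned.geom)).geom AS geom
FROM (
    SELECT ST_Union(ST_Buffer(a.geom, 200)) AS geom
    FROM infest_points AS a
) AS unioned;

-- Equivalent CTE (WITH statement) version, extended with an area summary of
-- each forest type that falls inside the application areas
WITH application_areas AS (
    SELECT (ST_Dump(ST_Union(ST_Buffer(a.geom, 200)))).geom AS geom
    FROM infest_points AS a
)
SELECT f.fortype,
       SUM(ST_Area(ST_Intersection(f.geom, b.geom))) AS area_treated
FROM forest AS f
INNER JOIN application_areas AS b
    ON ST_Intersects(f.geom, b.geom)
GROUP BY f.fortype;
```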
The WITH statement turns a convoluted chunk of code into something easily reproducible, and it makes streamlining a GIS workflow a relatively quick process. It is probably one of the most important concepts I have learned this past semester. A future blog post will discuss in more detail how intersections using ST_Intersects and INNER JOIN work in PostGIS. That tutorial will also seek to answer the two remaining questions: Do any streams pass through the application areas? What is the length of each stream segment that passes through an application area?

This semester, my peers Aaron Eubank, Bryce Stouffer, Samuel Watson, Wei Hong Loh, and I took on a directed study under our professor, Dr. Lyndon Estes, on the implementation and use of PostgreSQL/PostGIS. While I had prior experience with SQL Server and some of its spatial functions, I had never used SQL to perform a full range of geoprocessing tasks. I found my workflows to be more reproducible than using ArcGIS Pro/ArcMap and QGIS, and I found processing speeds to be faster than R's {sf} package. Our professor noted to us, however, that working on a locally hosted database has become outmoded in the face of cloud computing. He recommended that we learn to set up PostGIS on AWS and learn how to secure it against unauthorized access. I had never used AWS prior to this semester and didn't know much about what was involved. As it turns out, using these cloud services requires knowledge of the terminal/command prompt and its associated functions, which I had little prior experience with. By the end of the semester, though, I became much more proficient in its use and in how to improve data security. I am even evolving into somewhat of a Linux convert from the experience. But I digress. After uploading data to our new AWS-hosted PostGIS database, we decided to visualize those results in a web map. We found that GeoServer was important for adding an extra layer of security to our data and for improving interoperability as we pushed the data to Leaflet. Given the project's complexity, we found that creating a markdown document was helpful for keeping track of each step in our workflow. I recommend that anyone undertaking a similar project for the first time also keep notes of their progress. Our notes can be found below. And, of course, I would be more than happy to receive questions, comments, or suggestions on our methodology.

VM Creation (Elastic Compute Cloud (EC2) - Ubuntu Instance)
3. To install Postgres on the instance for the first time and add users, follow the guide here.
4. To switch to the desired database user, enter the command below:
5. Create the PostGIS spatial extension:
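The commands for steps 4 and 5 are not embedded above; a minimal sketch, assuming the default postgres account created during installation:

```bash
# Step 4: switch to the desired database user (here, 'postgres') and open psql
sudo -i -u postgres
psql

# Step 5: inside psql, enable the PostGIS spatial extension for the current database
#   CREATE EXTENSION postgis;
```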
VM GUI Installation

Postgres Data Import

First Steps (for DB on Windows Local Machine):
First Steps (for DB on AWS EC2 Linux Virtual Machine):
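The upload command itself is not embedded here; a sketch of what it can look like, assuming scp with the key pair created alongside the instance (paths and hostname are placeholders):

```bash
# Recursively copy a local data directory to the EC2 instance over SSH
scp -i ~/.ssh/my-keypair.pem -r ./data \
    ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:/home/ubuntu/data
```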
To upload a single file instead of a directory, remove the -r option.

Next Steps:
Using psql to import CSV files:

First, create the table you wish to upload data to. Then use \copy, a psql command that imports CSV files:
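A sketch of both steps; the table definition and file name are hypothetical examples, not the ones used in the original notes:

```sql
-- Create a destination table matching the CSV's columns
CREATE TABLE air_quality (
    site_id    integer,
    obs_date   date,
    pm25       numeric
);

-- Then, from within psql, \copy reads a file on the client machine:
-- \copy air_quality FROM 'air_quality.csv' WITH (FORMAT csv, HEADER);
```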
Securing the Virtual Machine and PostgreSQL Database

Accessing the EC2 Postgres Database on PgAdmin
We did the following to grant access:
Data Visualization
Leaflet Visualization
After GeoServer was in place, we utilized code built on Leaflet's library to pull our data from GeoServer and visualize it as a web map. A future post on GeoServer and Leaflet will be made available, with their use and workflow discussed in further detail.

8/10/2021 UPDATE: The code for our web map can be found here on hackmd.io, along with additional details regarding EC2 security/firewall implementation.

What is Data Fusion, and What is Pansharpening?

Data fusion, simply put, is a method of combining different modes of data (whose degrees of heterogeneity can vary) into a single output. While the purpose of data fusion varies from project to project, the unifying goal is to extract information from the fused result that could not be easily obtained from either source alone. Bear in mind the quote, "The whole is greater than the sum of its parts."

Pansharpening is one common method of data fusion. It is performed to increase the spatial resolution of remotely sensed imagery by fusing an image's panchromatic band with its multispectral bands. Due to the way in which a sensor captures an image, the panchromatic band always has a higher spatial resolution than its multispectral counterparts. With Landsat images, the panchromatic band has a spatial resolution of 15 meters, while the multispectral bands have a spatial resolution of 30 meters.

Why use R for Raster Analysis? Why not use Python or some other software?

There is nothing wrong with using Python; you can certainly perform the same tasks in that language. However, performing raster analysis and geoprocessing is not a huge leap from R's natural ability to work with matrices; R's primary function has traditionally been as a statistical, scientific, and data-driven programming language. Since raster data is simply matrix data that has been given geographic context, it should be no surprise that R also works well as an alternative to other raster GIS software like TerrSet, ERDAS, or ENVI.

The data and the methods:

The R scripts below illustrate some simple pansharpening methods in data fusion. The first attempts to follow the Intensity-Hue-Saturation (IHS) method described in the TerrSet/Idrisi help guide. The other is a Brovey transformation method described in Jixian Zhang's 2010 publication, "Multi-Source Remote Sensing Data Fusion: Status and Trends," in the International Journal of Image and Data Fusion. The sample Landsat data I used can be obtained from https://clarklabs.org/download/.

IHS Transformation
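The original script is not embedded here. As a stand-in, below is a simplified additive IHS-substitution sketch (sometimes called the fast IHS method) using the {terra} package; it is not the exact TerrSet/Idrisi procedure the post follows, and the file names are placeholders for the Clark Labs sample data.

```r
library(terra)

# Placeholder file names for the sample Landsat bands
pan <- rast("landsat_pan.tif")                                        # 15 m panchromatic band
ms  <- rast(c("landsat_b3.tif", "landsat_b2.tif", "landsat_b1.tif"))  # 30 m R, G, B bands

# Resample the multispectral bands to the panchromatic grid
ms <- resample(ms, pan, method = "bilinear")

# Intensity component of the RGB composite
intensity <- (ms[[1]] + ms[[2]] + ms[[3]]) / 3

# Substitute the intensity with the panchromatic band by adding the
# difference to every band, then view the sharpened composite
sharpened <- ms + (pan - intensity)
plotRGB(sharpened, r = 1, g = 2, b = 3, stretch = "lin")
```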
Brovey Transformation
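Again, in place of the original script, a minimal sketch of the Brovey transform with {terra} (file names are placeholders); the method scales each band by the ratio of the panchromatic band to the band sum, following the general form described in Zhang (2010):

```r
library(terra)

pan <- rast("landsat_pan.tif")                                        # 15 m panchromatic band
ms  <- rast(c("landsat_b3.tif", "landsat_b2.tif", "landsat_b1.tif"))  # 30 m R, G, B bands
ms  <- resample(ms, pan, method = "bilinear")

# Brovey: each sharpened band = band * pan / (R + G + B)
band_sum  <- ms[[1]] + ms[[2]] + ms[[3]]
sharpened <- ms * pan / band_sum

plotRGB(sharpened, r = 1, g = 2, b = 3, stretch = "lin")
```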
Comparing Outputs:
Works Referenced:
Jixian Zhang (2010). Multi-source remote sensing data fusion: status and trends. International Journal of Image and Data Fusion, 1:1, 5-24. DOI: 10.1080/19479830903561035