Happy reading!
Blog address: https://aws.amazon.com/blogs/iot/manage-iot-device-state-anywhere/
Source code: https://github.com/aws-samples/manage-IoT-device-using-device-shadow-blog
NanoPi Neo2 with LED hat in my home office, running AWS ECS Anywhere.
All source code can be found at https://github.com/linkcd/step-function-with-ecs-Anywhere-example
Because the ECS Anywhere host in this demo is a NanoPi (ARM architecture), the container image should be built on the Pi itself:
# In nano pi ssh
cd ./container-for-ecs-task
docker build -t linkcd/s3downloader:arm .
docker login
docker push linkcd/s3downloader:arm
Then push the image to a repository that the ECS cluster can pull from (public Docker Hub or a private ECR).
The S3 upload event is captured by CloudTrail, which triggers the Step Function and passes the event data to it.
The PASS step extracts the needed information (bucket name and file key). The output is:
{
"bucketName": "the_bucket_name_from_event",
"fileKey": "the_file_key_from_event"
}
The CHOICE step checks the file key and triggers the ECS task ONLY IF the file key matches “demo*.txt”.
The ECS RunTask step updates the input parameters (adding the s3:// prefix to the bucket name), then passes them to the ECS Anywhere task via environment variables.
Once the ECS Anywhere task has finished, the downloaded file can be found in the local file system of the ECS Anywhere host (in this case, in /data).
In the ECS RunTask step of Step Functions, a command override cannot pass multiple parameters. In our case we would like to use the AWS CLI Docker image for a simple aws s3 download, but if we override the command with “s3 cp x y” in the ECS RunTask step of the state machine, these four parts are NOT passed as four individual parameters but as ONE parameter containing all of them, which the AWS CLI cannot accept.
Incorrect value passed via the command override:
"Args": [
"s3 cp x y"
]
Correct call if we use the AWS CLI Docker image directly from a terminal:
"Args": [
"s3",
"cp",
"x",
"y"
]
Therefore we use environment variables to pass the parameters to the ECS container task separately (which means we have to use our own container image).
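The container's download logic is in the linked repository; as an illustration only, here is a minimal Python sketch of what such a downloader could look like, assuming boto3 and two hypothetical environment variable names (S3_BUCKET_NAME and S3_FILE_KEY) set by the RunTask override:
# Illustrative sketch only: read the parameters from environment variables and download the file.
# The variable names S3_BUCKET_NAME and S3_FILE_KEY are assumptions, not the repo's actual names.
import os
import boto3

def main():
    bucket = os.environ["S3_BUCKET_NAME"]   # set via containerOverrides in the RunTask step
    key = os.environ["S3_FILE_KEY"]
    local_path = os.path.join("/data", os.path.basename(key))
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_path)
    print(f"Downloaded s3://{bucket}/{key} to {local_path}")

if __name__ == "__main__":
    main()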
AWS Systems Manager (SSM) is an AWS service that you can use to view and control your infrastructure on AWS. It can securely connect to a managed node. The SSM Agent is installed in the EC2 operating system and comes pre-installed on many Amazon Machine Images (AMIs).
With SSM:
SSM works regardless of whether the EC2 instance is in a public or private (NAT or VPC endpoint) subnet.
Requirements for SSM to work:
In this case, the EC2 instances have no public IP, but they can still reach the internet via NAT.
In this case, the EC2 instance (no public IP) does not reach the internet via NAT but via VPC endpoints, so some extra work is required.
Once SSM is fully up and running, the EC2 instance (whether in a public or private subnet) will appear in Fleet Manager in the SSM web console.
With VSMP,
All can be assembled together easily.
Front view
Zoom in details
The back
Install the standard Raspberry Pi OS. I am using the 32-bit Bullseye desktop version, but some suggest using the Lite version for the Raspberry Pi Zero. Read more here.
To set up the Pi with headless Wi-Fi, read the how-to here:
touch /Volumes/boot/ssh
touch /Volumes/boot/wpa_supplicant.conf
Content of wpa_supplicant.conf:
country=US
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
scan_ssid=1
ssid="your_wifi_ssid"
psk="your_wifi_password"
}
Note: It is good practice to disable the default user “Pi”, but the VSMP installation script from Tom Whitwell uses a hard-coded “Pi” home path, so to keep it simple, keep the “Pi” user but DEFINITELY change the default password. I also run it on a guest Wi-Fi network that has no access to the rest of my network devices.
There are many implementations of VSMP:
Note: You can test your e-ink display by running omni-epd-test. In my case, I do the following:
omni-epd-test -e waveshare_epd.epd7in5_V2
The omni-epd is a part of the installation.
# assume you have ffmpeg installed on your mac
Read more examples here.
# assume you have ffmpeg installed on your mac
Read more about the “-an” parameter here.
# from movie folder
By default VSMP is enabled as a service.
Edit the slowmovie.conf file to specify parameters such as video locations and start frame:
# Edit the config file
vi slowmovie.conf
## Content of slowmovie.conf ##
random-frames = False
delay = 120
increment = 4
contrast = 2.0
epd = waveshare_epd.epd7in5_V2
directory = /home/pi/SlowMovie/Videos
timecode = False
## End of content ##
## Restart the service
sudo systemctl restart slowmovie
sudo systemctl enable slowmovie
If you want to run the command manually, REMEMBER to disable the service first.
If you run it manually, consider using tmux so the session continues after you log off.
Example:
# Stop services
sudo systemctl stop slowmovie
sudo systemctl disable slowmovie
# enter tmux session
tmux
# Manual run in tmux session window
cd SlowMovie
python3 slowmovie.py -f ./Videos/Kiki.mp4 -d 20 -s 19970 #delay 20 sec, start from 19970 frame
Wondering what to play? Read Content reviews: What makes a good slow movie. I am a big fan of Studio Ghibli so that is my choice.
You might also want to re-encode the videos as described here.
You can use an iPhone time-lapse to record your VSMP and see how it works. However, the time-lapse will sometimes capture an e-paper refresh, when the screen is all white or black. To remove these bad frames from your video, do the following (ref #1, #2):
# extract all frames from iphone time-lapse video
mkdir img
ffmpeg -i time-lapse.MOV -qscale:v 2 -r 30/1 img/img%03d.jpg #iphone time-lapse video is 30 fps, second best output img quality
# remove bad frames
# manual or using ML such as Amazon Lookout For Vision
# regenerate the video from frames
ffmpeg -framerate 30 -pattern_type glob -i 'img/*.jpg' output.mov
# slow it down if needed
ffmpeg -i output.mov -filter:v "setpts=1.3*PTS" output_slow.mov
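If you prefer a simple heuristic over manual cleanup or an ML service, the sketch below (my own rough approach, not from the original write-up) flags frames whose average brightness is close to pure white or pure black, using Pillow:
# Rough heuristic sketch: flag frames that are almost entirely white or black
# (typical of a captured e-paper refresh). Thresholds are guesses; tune them for your footage.
from PIL import Image, ImageStat
from pathlib import Path

def is_bad_frame(path, low=20, high=235):
    img = Image.open(path).convert("L")    # grayscale
    mean = ImageStat.Stat(img).mean[0]     # average brightness, 0-255
    return mean < low or mean > high

for frame in sorted(Path("img").glob("*.jpg")):
    if is_bad_frame(frame):
        print(f"removing {frame}")
        frame.unlink()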
There is an example of using Amazon Lookout for Vision to detect bad frames, but that is another story.
In this demo, we create the AWS Control Tower instance in a brand-new AWS account. During this process, Control Tower creates several services/components, such as AWS Organizations, AWS SSO, the default organizational unit (OU) “Security”, and two AWS accounts, “Log Archive” and “Audit”.
In the AWS SSO, some default SSO user groups are created for managing Control Tower:
The default admin user for organization management account is “AWS Control Tower Admin”.
Detailed user info
It belongs to two groups: AWSAccountFactory and AWSControlTowerAdmins.
For this demo, we are using a free developer plan of Okta.
Follow the steps in the documents below to use Okta as the IdP for AWS SSO.
Note that you need to follow the steps from both documents to make sure the integration and user provisioning work.
Steps: How to Configure SAML 2.0 for AWS Single Sign-on
Steps: Configure provisioning for Okta in AWS SSO
After the basic handshake between AWS SSO and Okta, AWS SSO is now using Okta.
In the Okta groups UI, you can see that groups identical to those in AWS SSO have been created in Okta. “Everyone” is a default Okta user group.
Note: you cannot add/remove users to it, as it says “This group is managed automatically by Okta, so you cannot edit it or modify its membership.”
Let's create some test users:
We also create user groups in Okta
However, they do not appear in the AWS SSO user list. There is still no Okta user or Okta group there.
In order to use the users from Okta, they need to be assigned to the AWS SSO application in Okta.
Go to Okta -> Application -> AWS SSO. In the Assignments tab, you can assign either individual users or user groups. In this screenshot, all users are assigned to AWS SSO via a group (see the Type column).
Soon, these 3 users appear in the AWS SSO interface.
The detailed info. Note that it was created and updated by SCIM.
Now you can assign them to AWS accounts, so the users can log in to the AWS console by logging in to Okta.
Now we can grant permissions to individual Okta users. But what about Okta groups? These new Okta groups are not available in AWS SSO yet, and the groups with identical names in AWS SSO do not help, as we cannot add users to them.
To solve this, we need to push the Okta groups to AWS SSO by setting up “Push Groups”.
Go to Okta > Application > AWS SSO. In the “Push Groups” tab, you can push groups by name or set up rules for batch pushing.
In this demo, we set up a rule named “Pust-AWS-Related-Groups” for pushing any group whose name starts with “AWS-”.
Soon, these groups were pushed to AWS SSO:
Now you can also grant permissions to groups; for example, every Okta user in AWS-CT-Admin-Okta-Group now has AWS Control Tower admin permissions.
EoF.
Now we have the data, so how do we gain some insights through data analytics? I have been using the following products and would like to share my quick thoughts.
Please note that I tested these products back in February/March 2019 and all the feedback is from that point in time. I am sure all the products have been significantly upgraded and improved since then, so you might want to check them again with the latest features.
Azure Time Series Insights (TSI) is an IoT analytics platform to monitor, analyze, and visualize industrial IoT data at scale. With native integration with Azure IoT Hub or Event Hub, it is easy to visualize and explore IoT data such as that from our connected car.
You can easily explore data by putting time series data into one screen:
(click to enlarge)
For example, you can identify the relationship between engine RPM and speed, and the increasing temperature of engine coolant.
As TSI is built for handling IoT data, it has built-in functionality for managing metadata/models of IoT data streams. This is a unique feature that only TSI offers, compared to the other general-purpose analytics products that I tried.
In other words, in order to use TSI, you have to set up the following models:
For our case, we can set up the models to represent the following:
Assets - ABC Taxi Company Carpool
├── Car 1: Feng Toyota Auris
│ ├── GPS
│ │ └── FengsDevice_GPS
│ │ ├── GPS Speed
│ │ ├── Altitude
│ │ └── ...
│ └── OBD
│ └── FengsDevice_OBD
│ ├── RPM
│ ├── SPEED
│ ├── MAF
│ ├── ENGINE_LOAD
│ └── ...
│
├── Car 2: Thomas Two Engines Monster Truck
│ ├── GPS
│ │ └── TomsDevice_GPS
│ │ ├── GPS Speed
│ │ ├── Altitude
│ │ └── ...
│ └── OBD
│ ├── TomsDevice_OBD_Engine_1
│ │ ├── RPM
│ │ ├── SPEED
│ │ ├── MAF
│ │ ├── ENGINE_LOAD
│ │ └── ...
│ └── TomsDevice_OBD_Engine_2
│ ├── RPM
│ ├── SPEED
│ ├── MAF
│ ├── ENGINE_LOAD
│ └── ...
│
└── Car 3: ...
└── ...
For our case, these model definitions can be found here.
Pro:
Con:
It was nice to visualize the time series data in TSI, but I wanted to play more with the dataset, such as calculating fuel consumption vs. speed, using Python and a Jupyter notebook. Therefore I continued the work with Azure Databricks.
By using MAF and speed, it is possible to calculate the fuel consumption, as explained in https://www.windmill.co.uk/fuel.html and https://www.wikihow.com/Convert-MPG-to-Liters-per-100km:
# Adding MPG column
# MPG=Speed(Km/h)*7.718/MAF
dfwithMPG = df.withColumn("MPG",df.series_SPEED_double/df.series_MAF_double*7.718).select("timestamp", "series_SPEED_double", "series_RPM_double", "MPG")
# Then convert from MPG to L/100km, adding Consumption column, using US gallons
# l/100km = 282.48/MPG (imperial gallons) or l/100km = 235.21/MPG (US gallons)
dfwithConsumption = dfwithMPG.withColumn("Consumption",235.21/dfwithMPG.MPG).select("timestamp", "series_SPEED_double", "series_RPM_double", "MPG", "Consumption")
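As a quick sanity check of the formula in plain Python (the numbers are made up, purely for illustration):
# Quick sanity check of the conversion with made-up numbers:
# at 60 km/h with a MAF reading of 8 g/s,
speed_kmh = 60
maf = 8
mpg = speed_kmh * 7.718 / maf      # about 57.9 MPG (US)
l_per_100km = 235.21 / mpg         # about 4.1 L/100km
print(round(mpg, 1), round(l_per_100km, 1))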
(Picture: Oversimplified calculation of eco-driving zone)
If we directly use the TSI parquet files as input for Databricks, we encounter the error message “Found duplicate column(s) in data schema: series_speed_double”.
This is because both the GPS and OBD modules report speed, but with different casing: “Speed” and “SPEED”.
TSI handles this fine, as the asset model/metadata helps, but in Databricks there is no data contextualization - all data fields are flattened out, so this type of issue comes up quite often.
As a workaround, we can set spark.sql.caseSensitive to true:
sqlContext.sql("set spark.sql.caseSensitive=true")
df = sqlContext.read.parquet(file_path).select("column1", "column2" )
Pro:
Con:
After trying the a-bit-too-simple TSI and the a-bit-too-hardcore Databricks, I was looking for a better-balanced product between the two. Therefore I started exploring Azure Data Explorer (ADX).
Long story short, I created an ADX cluster and a database for the IoT car data, and created two tables:
.create table OBDTable (timestamp: datetime, deviceId: string, speed: real, rpm: real, run_time: real, absolute_load: real, short_fuel_trim_1: real, long_fuel_trim_1: real, timing_advance: real, intake_pressure: real, intake_temp: real, throttle_pos: real, relative_throttle_pos: real, oil_temp: real, maf: real, coolant_temp: real, engine_load: real)
.create table GPSTable (timestamp: datetime, deviceId: string, gps_speed: real, altitude: real, longitude: string, latitude: string)
And create mappings as below:
"Name": GPSMapping,
"Kind": Json,
"Mapping": [{"column":"timestamp","path":"$.timestamp","datatype":"datetime","transform":"None"},{"column":"deviceId","path":"$.deviceId","datatype":"string","transform":"None"},{"column":"gps_speed","path":"$.series[0].gps_speed","datatype":"double","transform":"None"},{"column":"altitude","path":"$.series[0].altitude","datatype":"double","transform":"None"},{"column":"longitude","path":"$.series[0].longitude","datatype":"string","transform":"None"},{"column":"latitude","path":"$.series[0].latitude","datatype":"string","transform":"None"}],
"LastUpdatedOn": 2019-02-27T19:25:47.889932Z,
"Database": iotcardb,
"Table": GPSTable,
"Name": OBDMapping,
"Kind": Json,
"Mapping": [{"column":"timestamp","path":"$.timestamp","datatype":"datetime","transform":"None"},{"column":"deviceId","path":"$.deviceId","datatype":"string","transform":"None"},{"column":"speed","path":"$.series[0].SPEED","datatype":"double","transform":"None"},{"column":"rpm","path":"$.series[0].RPM","datatype":"double","transform":"None"},{"column":"run_time","path":"$.series[0].RUN_TIME","datatype":"double","transform":"None"},{"column":"absolute_load","path":"$.series[0].ABSOLUTE_LOAD","datatype":"double","transform":"None"},{"column":"short_fuel_trim_1","path":"$.series[0].SHORT_FUEL_TRIM_1","datatype":"double","transform":"None"},{"column":"long_fuel_trim_1","path":"$.series[0].LONG_FUEL_TRIM_1","datatype":"double","transform":"None"},{"column":"timing_advance","path":"$.series[0].TIMING_ADVANCE","datatype":"double","transform":"None"},{"column":"intake_pressure","path":"$.series[0].INTAKE_PRESSURE","datatype":"double","transform":"None"},{"column":"intake_temp","path":"$.series[0].INTAKE_TEMP","datatype":"double","transform":"None"},{"column":"throttle_pos","path":"$.series[0].THROTTLE_POS","datatype":"double","transform":"None"},{"column":"relative_throttle_pos","path":"$.series[0].RELATIVE_THROTTLE_POS","datatype":"double","transform":"None"},{"column":"oil_temp","path":"$.series[0].OIL_TEMP","datatype":"double","transform":"None"},{"column":"maf","path":"$.series[0].MAF","datatype":"double","transform":"None"},{"column":"coolant_temp","path":"$.series[0].COOLANT_TEMP","datatype":"double","transform":"None"},{"column":"engine_load","path":"$.series[0].ENGINE_LOAD","datatype":"double","transform":"None"}],
"LastUpdatedOn": 2019-02-27T19:17:24.3220181Z,
"Database": iotcardb,
"Table": OBDTable,
Now we are ready to query using the powerful Kusto Query Language, especially its time-series analytics.
Simple data aggregation:
//avg gps speed every 20s
let min_t = datetime(2019-03-06 12:30:00); //UTC
let max_t = datetime(2019-03-06 13:00:00); //UTC
GPSTable
| where timestamp between (min_t .. max_t)
| summarize avg(gps_speed) by bin(timestamp, 20s)
| render timechart
Inner join two tables and apply aggregation:
//join 2 tables, show obd speed, gps speed and avg maf
let min_t = datetime(2019-03-06 12:30:00); //UTC
let max_t = datetime(2019-03-06 13:00:00); //UTC
GPSTable
| where timestamp between (min_t .. max_t)
| summarize avg(gps_speed) by bin(timestamp, 20s)
| join kind=inner
(OBDTable
| where timestamp between (min_t .. max_t)
| summarize avg(speed), avg(maf) by bin(timestamp, 20s))
on timestamp
| project timestamp, avg_gps_speed, avg_speed, avg_maf
| render timechart
Apply two-segment linear regression on engine load; see the documentation here.
//Applies two segments linear regression on engine_load.
Pro:
Con:
So far I have tried several products for analytics, but none of them has great built-in visualization features, especially map support.
PowerBI is a popular tool for data visualization, but it does not support big data analytics on its own. However, by combining PowerBI and ADX, the job becomes easier.
Instead of doing the visualization in ADX, we now use a query to generate a dataset (a two-dimensional table):
//for powerbi map, CANNOT have comments!
let min_t = datetime(2019-03-06 12:40:55);
let max_t = datetime(2019-03-06 12:57:20);
GPSTable
| where timestamp between (min_t .. max_t)
| summarize any(longitude), any(latitude) by bin(timestamp, 20s)
| join kind=inner
(OBDTable
| where timestamp between (min_t .. max_t)
| summarize avg(speed), avg(maf), avg(rpm) by bin(timestamp, 20s))
on timestamp
| project timestamp, any_latitude, any_longitude, avg_speed, avg_maf, avg_rpm
Then use “Query to PowerBI” from the dropdown list.
NOTE: When I was testing this, there was an issue where the Kusto query could NOT have inline comments, otherwise these comments were mixed into the generated PowerBI query, which ruined the syntax. Keep all comments out of the Kusto query block.
Using the generated PowerBI query from above, I can easily create different visualization dashboards in PowerBI. For example, the map:
It shows one of the trips on the map, as well as the speed: greener is faster, redder is slower.
Using PowerBI addon such as Play Axis (Dynamic Slicer), it is easy to replay a trip.
Picture: Play a trip in PowerBI, with map and engine RPM.
It clearly shows where the traffic jams were (drops in speed) and where the traffic was good (peaks in speed and RPM).
Pro:
Con:
PowerBI is a good visualization tool, but it is not easy to create or update Kusto queries directly in PowerBI. Most likely you will have to run and test the query in ADX, then export it to PowerBI. We hope to overcome this issue with Grafana.
Grafana is an open-source tool mainly used for monitoring and data visualization. With the Azure Data Explorer Datasource For Grafana plugin, we can combine the power of ADX and Kusto with fancy and powerful Grafana visualizations.
docker run -p 3000:3000 -e "GF_INSTALL_PLUGINS=grafana-azure-data-explorer-datasource" grafana/grafana:latest
Then follow the plugin documentation to configure access.
Now you can directly create Kusto-enabled dashboards, including maps.
Pro:
Con:
Now we have tried several products, and my favorite setup is ADX (as the backend data storage and query engine) with Grafana (as the front-end self-service visualization). I believe this meets the most common needs of ordinary users. But of course the other products have different focus areas and can/should be used for different scenarios.
After all, the old saying is always correct: “It depends.”
Thanks for reading.
(Read Part 1 of this article series)
As there are many possible situations that can happen on the edge, such as disconnection of the OBD2 connector or loss of the GPS signal (when going through an underground tunnel), the modules are built with the following principles:
In addition, the modules are built into Docker containers, together with the Azure IoT Edge runtime, which makes them easier to deploy.
All source code can be found at https://github.com/linkcd/IoTCar
This edge device (raspberry pi + OBD connector + GPS dongle) reports the following data per second:
With a USB GPS dongle, it is quite easy to get the location information by using tools such as GPSD.
I immediately met the first challenge: the USB GPS dongle requires a good open-sky view to work well. The one I used does not have an antenna, so I needed to put the whole thing (Raspberry Pi + GPS dongle) outside the building (or at least outside the window).
Mind you, it was winter in Norway at that time, and I was not a fan of typing on a keyboard in the snow at -5 degrees.
First, I tried doing this in my car: I parked the car in an outdoor parking spot, put the Raspberry Pi on the dashboard and used remote desktop to access it. Well, it worked, and the GPS signal was strong, but it was quite difficult to type any keys behind the steering wheel :)
But soon I figured out a better solution on my balcony (see the picture below), and that worked perfectly (as long as the Wi-Fi signal was good and the power bank battery did not die from the low temperature).
Now I can work from a warm cozy place and deal with the GPS data that is collected from the “cold box”.
The GPS receiver reports data as NMEA sentences, and we are combining GGA and RMC.
# GGA
$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47
# RMC
$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A
Here we are using the Python library pynmea2 for handling the NMEA sentences; the detailed logic can be found in the source code here.
In addition, we need to do some small math to calculate the correct latitude and longitude, otherwise you will find your car driving in the ocean :)
# The latitude is formatted as DDMM.ffff and longitude is DDDMM.ffff where D is the degrees and M is minutes plus the fractional minutes.
# So, 1300.8067,N is 13 degrees 00.8067 minutes North and the longitude of 07733.0003,E is read as 77 degrees 33.0003 minutes East.
# Converting to degrees you would have to do this: 13 + 00.8067/60 for latitude and 77 + 33.0003/60 for the longitude.
# ##NMEA outputs in a human readable DDDMM.mmmm format NOT DECIMAL DEGREES
# 3746.03837
# 37 46.03837
# 37 + (46.03837 / 60)
# result = 37 + 0.7673062
segments = value.split('.')
if len(segments[0]) == 4:
    # latitude
    degree = segments[0][:2]
else:
    # longitude
    degree = segments[0][:3]
minute = round(Decimal(segments[0][-2:] + "." + segments[1])/60, 6)
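Putting the pieces together, here is a minimal sketch of the full conversion as a standalone function; the function name and the handling of the S/W hemisphere sign are my own additions, so check the linked source code for the actual logic:
# Minimal completion of the conversion above: DDMM.ffff / DDDMM.ffff -> decimal degrees.
# The sign handling for S/W directions is an assumption, not taken from the original module.
from decimal import Decimal

def nmea_to_decimal_degrees(value, direction):
    segments = value.split('.')
    if len(segments[0]) == 4:
        degree = int(segments[0][:2])     # latitude: DDMM
    else:
        degree = int(segments[0][:3])     # longitude: DDDMM
    minute = Decimal(segments[0][-2:] + "." + segments[1]) / 60
    result = degree + minute
    if direction in ("S", "W"):
        result = -result
    return round(result, 6)

print(nmea_to_decimal_degrees("4807.038", "N"))   # 48.1173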
Finally, this module reports the following data per second:
{
"series": [
{
"mag_variation": "",
"geo_sep": "39.1",
"num_sats": 5,
"fixed_time": "20:33:21",
"geo_sep_units": "M",
"horizontal_dil": "2.21",
"longitude_dir": "E",
"mag_var_dir": "",
"gps_speed": 0.242,
"altitude_units": "M",
"true_course": null,
"latitude": "11.111111",
"fixed_full_timestamp": "2019-02-26 20:33:21",
"latitude_dir": "N",
"fixed_date": "2019-02-26",
"gps_quality": 1,
"longitude": "22.222222",
"altitude": 93.4
}
],
"deviceId": "FengsDevice_GPS",
"timestamp": "2019-02-26 20:33:21"
}
Programming/debugging OBD2 can be difficult - after all, I do not want to be programming while driving. Instead of hiring a driver and typing on the keyboard in the passenger seat, it is better to use an OBD emulator to emulate all the telemetry (and error codes) of the car.
Luckily, I am not the only one with this problem during OBD development. There are professional and affordable emulators on Aliexpress and Taobao (BTW, the price on Taobao is 1/3 of the Aliexpress price!). The detailed features can be found here. My respect to the designers of this emulator - you are life savers!
Now, with the emulator and the Python obd library, it is easy to collect the telemetry of the car.
However, the library does not take care of failures and auto-healing; we need to do that ourselves, otherwise the code just throws exceptions and stops working.
Thanks to the emulator, it is easy to test all corner scenarios in a safe environment, such as disconnecting the OBD and reconnecting while “the engine” is still running. That is impossible to test/debug with a real car.
The following code snippet ensures the module works in different scenarios and self-heals:
def getVehicleTelemtries(deviceId):
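As an illustration of the reconnect-and-retry idea, here is a sketch using the python-OBD library; this is not the module's actual code (see the repository for that), and the self-healing details there may differ:
# Illustrative sketch of the self-healing idea (not the actual module code):
# keep trying to (re)connect, and fall back gracefully when a query fails.
import time
import obd

connection = None

def get_speed_and_rpm():
    global connection
    try:
        if connection is None or not connection.is_connected():
            connection = obd.OBD()          # auto-detects the OBD adapter
        speed = connection.query(obd.commands.SPEED)
        rpm = connection.query(obd.commands.RPM)
        if speed.is_null() or rpm.is_null():
            return None                     # car/emulator not responding for these PIDs
        return {"SPEED": speed.value.magnitude, "RPM": rpm.value.magnitude}
    except Exception:
        connection = None                   # force a reconnect on the next call
        return None

while True:
    print(get_speed_and_rpm())
    time.sleep(1)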
Finally, this module reports the following data per second:
{
"series": [
{
"SPEED": 56,
"RPM": 2830.75,
"RUN_TIME": 639,
"ABSOLUTE_LOAD": 0.0,
"SHORT_FUEL_TRIM_1": -21.09375,
"TIMING_ADVANCE": 0.0,
"INTAKE_PRESSURE": 0,
"LONG_FUEL_TRIM_1": 18.75,
"INTAKE_TEMP": 0,
"THROTTLE_POS": 37.64705882352941,
"OIL_TEMP": 16,
"MAF": 655.35,
"RELATIVE_THROTTLE_POS": 0.0,
"COOLANT_TEMP": 0,
"ENGINE_LOAD": 45.490196078431374
}
],
"timestamp": "2019-02-20 19:27:11.705387",
"deviceId": "FengsDevice_OBD"
}
As the Raspberry Pi does not have an RTC, the system clock is reset after each power-on. If it has an internet connection, it fetches the correct date and time from the internet, with some delay.
In the current logic, both the GPS and OBD modules use the system clock as the event timestamp. Therefore, if the Pi fails to get an internet connection (which happened often with the mobile hotspot) or sends data before the system clock has been updated, the event timestamp will be incorrect.
To overcome this issue, you can install an RTC (real-time clock) on the Raspberry Pi, such as this and this.
I ended up with a UPS-18650 Raspberry Pi UPS power expansion board with RTC. It comes with a power bank AND a built-in RTC. It is designed and built by ACE design studio in China, and I am very happy with it. I will definitely buy more from them next time.
(Picture: My to-be-tested Raspberry Pi with UPS power expansion board and LoRa/GPS HAT. Hopefully it can use LoRa network connections to replace 4G.)
Now we have 2 modules and we have built them into 2 Docker images (in the variables ${MODULES.OBDModule.arm32v7} and ${MODULES.LocatorModule.arm32v7}). I host them on Docker Hub, but they can also be hosted in any private registry.
For now we do not do any computing on the edge, but simply forward the messages to Azure IoT Hub (see here):
"$edgeHub": {
"properties.desired": {
"schemaVersion": "1.0",
"routes": {
"OBDModuleToIoTHub": "FROM /messages/modules/OBDModule/outputs/* INTO $upstream",
"LocatorModuleToIoTHub": "FROM /messages/modules/LocatorModule/outputs/* INTO $upstream"
},
"storeAndForwardConfiguration": {
"timeToLiveSecs": 7200
}
}
}
More info can be found in the deployment.template.json.
Now we have two module Docker images running on the Raspberry Pi and sending data to the Azure IoT Edge runtime. The Raspberry Pi has a Wi-Fi connection to a mobile phone 4G hotspot and forwards the data to Azure IoT Hub in real time.
With Azure IoT Hub and Azure Time Series Insights (TSI), we can now visualize the data:
This is a quick example of data analytics for the IoT car. In the second part of the series, I will talk more about the data analytics part (including TSI, Databricks, and more) in the cloud.
Continue reading part 2
Of course we understand that hamsters are nocturnal animals, which means they sleep during the day and become more active at night. But I started wondering how she was doing during the nights, especially how much she ran on the hamster wheel.
Let’s do something about it.
Picture: Qiuqiu with her wheel
There are many possible ways to track the hamster wheel.
Carefully place the sensor on the wheel assembly, making sure that when the wheel spins, the magnet on the wheel passes the sensor body with a small but close enough gap. I used a Lego part for some adjustments.
Before we continue, I would like to test it in action to make sure the gap is OK. It is possible to monitor real-time readings of the sensor by using the Conbee API.
I wrote a simple web app (source code) with JavaScript and WebSocket to visualize the real-time readings. The WebSocket API is provided by the Conbee application; see the documentation here.
Under the hood:
I made a protective shell from a spare plastic box and mounted it on the wheel, so Qiuqiu cannot chew on the sensor. I even made a small hole in the shell so I can use a stick to press the sensor reset button without removing the whole thing.
The Conbee II is a USB-based Zigbee gateway that can be attached to a PC or Raspberry Pi. It talks to the Zigbee mesh network and receives signals from the sensors. For example, the sensor on the wheel sends the following JSON payloads, one for the “close” event (magnet and sensor are together) and another for the “open” event (magnet and sensor are apart). Logically, one open-close event pair indicates a finished cycle:
{
"e": "changed",
"id": "3",
"r": "sensors",
"state": {
"lastupdated": "2020-08-05T17:32:37.102",
"open": false
},
"t": "event",
"uniqueid": "00:15:8d:00:04:5c:d8:d3-01-0006"
}
{
"e": "changed",
"id": "3",
"r": "sensors",
"state": {
"lastupdated": "2020-08-05T17:32:37.227",
"open": true
},
"t": "event",
"uniqueid": "00:15:8d:00:04:5c:d8:d3-01-0006"
}
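The test page above is written in JavaScript; as a rough Python equivalent (a sketch only - the host and port are placeholders, and the real websocket port should be read from the deCONZ configuration API), you can listen to the same events like this:
# Rough Python sketch of listening to the deCONZ websocket; host/port are placeholders.
import asyncio
import json
import websockets

async def listen(url="ws://deconz.local:443"):
    async with websockets.connect(url) as ws:
        async for message in ws:
            event = json.loads(message)
            state = event.get("state", {})
            if "open" in state:
                print(event["uniqueid"],
                      "open" if state["open"] else "closed",
                      state.get("lastupdated"))

asyncio.run(listen())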
By connecting the Conbee gateway to Home Assistant via the deCONZ integration, it is fairly easy to export the data as a CSV file. (I plan to build a data pipeline with a time series database at a later stage, but for now let's stay with manual data export.)
Picture: exported csv, with 4 columns
Now it is time for some Python/Jupyter notebook fun. Here we are going to use https://www.kaggle.com/. You can read more comparisons of online Jupyter notebook hosting here.
The above code snippet does:
Let's take a look at the raw data. The first thing I noticed is the “noise” in each cycle. As the door/window close sensor is not designed for tracking a spin, whenever a cycle finishes, instead of reporting two simple events (on and off), it actually generates a sequence of events: on-off-on-off-on-off. This is noise that we need to take care of.
It is worth noting that not all cycles follow the same pattern. For example, the 3rd red circle in the screenshot shows an exception: it only has one “on-off” event pair.
We need a way to “group” the multiple events (“on-off-on-off-on-off”) into one event that indicates a cycle, but we cannot group by a fixed pattern, as there are exceptions (as mentioned above).
After some quick research and testing, and without diving into the hard-core data science part, I found that a rolling window calculation can be a solution for our case.
Let's set the rolling window to 150 ms - a “magic number” that works well with this raw data. It purely depends on how fast the hamster runs.
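To illustrate the idea with a small self-contained sketch (my own toy data, assuming each raw “closed” event is encoded as a 1 on a datetime index):
# Toy illustration of the rolling-window grouping (not the notebook's actual preprocessing).
import pandas as pd

# A burst of bounce events for one cycle, then a single-event cycle.
events = pd.to_datetime([
    "2020-08-05 17:32:37.102", "2020-08-05 17:32:37.160", "2020-08-05 17:32:37.227",
    "2020-08-05 17:32:38.500",
])
df_toy = pd.DataFrame({"finshedOneRound": 1}, index=events)

# 150 ms trailing window: the first event of each burst sums to 1,
# the bounce duplicates land on 2, 3, ... so "== 1" marks one event per cycle.
df_toy["finshedOneRound_rolled"] = df_toy["finshedOneRound"].rolling("150ms").sum()
print(df_toy)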
Let's visualize the rolling calculation results:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=df.index, y=df['finshedOneRound'], name='raw'))
fig.add_trace(go.Scatter(x=df.index, y=df['finshedOneRound_rolled'], mode='lines+markers', name='rolling'))
fig.update_xaxes(rangeslider_visible=True)
fig.update_yaxes(tick0=0, dtick=1)
fig.show()
Now you can see that the result of rolling calculation does generate unique markers for each cycle (the green circles), and it works for different patterns in the raw data!
Extract the markers (where the rolling result == 1) into a new dataframe df_cycle_log for the next step:
df_cycle_log = df.loc[df["finshedOneRound_rolled"] == 1]
I would like to know:
Let's do some math and populate the results:
import math
def get_distance_by_wheel_cycle_count(cycle_count):
    diameter = 0.2  # the wheel diameter is 20cm
    return cycle_count * diameter * math.pi

def get_speed_in_KMh(traveled_range_in_m, run_time_in_sec):
    return traveled_range_in_m / run_time_in_sec * 3.6  # (1 m/s = 3.6 km/h)

def get_speed_by_cycle_count(cycle_count, run_time_in_sec):
    distance = get_distance_by_wheel_cycle_count(cycle_count)
    return get_speed_in_KMh(distance, run_time_in_sec)

# Aggregate the cycle counts every 30 sec and populate the data
run_time_segment = "30s"
df_result = pd.DataFrame()
df_result["cycle_count"] = df_cycle_log["finshedOneRound_rolled"].resample(run_time_segment).count()
df_result["distance"] = df_result["cycle_count"].apply(get_distance_by_wheel_cycle_count)
df_result["speed_km"] = df_result["cycle_count"].apply(lambda count: get_speed_by_cycle_count(count, 30))
Then plot:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
#fig.add_trace(go.Scatter(x=df_result.index, y=df_result['cycle_count'], name="wheel count"))
fig.add_trace(go.Scatter(x=df_result.index, y=df_result['speed_km'], name="speed(km/h)"), secondary_y=False)
fig.add_trace(go.Scatter(x=df_result.index, y=df_result["distance"].cumsum()/1000, name="distance(km)"), secondary_y=True)
fig.update_xaxes(rangeslider_visible=True)
fig.show()
Conclusion from the result:
According to the internet, Qiuqiu is not the fastest runner (a hamster can run at up to 5-9 km/h), and she also ran slightly less than the average range of 9 km that evening.
Of course the speed and range can vary from hamster to hamster, and this is data for just one evening. The next step is to build a fully automated data pipeline with a time series database and create some Grafana dashboards with daily/weekly baselines for long-term tracking.
Thanks for reading.
# Used Measure-Command for measuring performance
The data.json file looks perfectly fine, but during import into ADX, it reported the error “invalid json format”.
Using an online validation tool such as https://jsonlint.com/ and copy-pasting the content from data.json, the JSON objects are valid.
Using the local tool jsonlint, an error is reported. It shows that the data.json file has an encoding issue.
PS C:\Users\lufeng\Desktop> jsonlint .\data.json
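If you want to inspect the encoding yourself, here is a quick sketch in Python that checks the first bytes of the file for a byte-order mark:
# Check the first bytes of the file for a byte-order mark (BOM).
# A UTF-16 BOM (ff fe / fe ff) would explain why some JSON parsers reject the file.
with open("data.json", "rb") as f:
    head = f.read(4)
print(head.hex(" "))

if head.startswith(b"\xff\xfe") or head.startswith(b"\xfe\xff"):
    print("UTF-16 BOM detected")
elif head.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 BOM detected")
else:
    print("No BOM in the first bytes")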
Switching to a different PowerShell command solved the problem:
Invoke-WebRequest -Uri 'THE_API_END_POINT' -OutFile data.json
EOF
Recently our QA reported an interesting issue regarding the native apps and our website: when the webpage was shared in the LinkedIn iOS app and/or the Facebook iOS app, the built-in browsers could not show it correctly, only a blank page.
Native App | Platform | Result
---|---|---
LinkedIn | iOS | Not OK
Facebook | iOS | Not OK
Facebook messenger | iOS | Not OK
Slack | iOS | OK
Skype for Business | iOS | OK
LinkedIn | Android | OK
Facebook | Android | OK
Facebook messenger | Android | OK
Slack | Android | OK
Skype for Business | Android | OK
Safari | iOS | OK
Chrome for iOS | iOS | OK
Any desktop browser | Win 10 | OK
So the problem is about the iOS in-app browser in some native apps. Unfortunately, these apps (LinkedIn and Facebook) are too important to ignore, so we will have to fix it.
It is challenging to debug this issue, as it only happens in some of the iOS apps and cannot be reproduced in Safari or other browsers. Possible approaches are:
Approaches #1 and #2 are long shots, so I will continue with approach #3. The following diagram shows the architecture.
There are many ways to place a “man-in-the-middle” between the mobile device and the internet. For example, the famous Fiddler can do it.
Follow the documentation https://docs.telerik.com/fiddler/Configure-Fiddler/Tasks/ConfigureForiOS. The key steps are:
I started comparing the HTTPS responses for the same URL from different apps, and quickly narrowed down the cause to different values of the Content-Security-Policy (CSP) response header.
Content-Security-Policy in the apps that have the problem:
#Added line break for better readability
Content-Security-Policy: script-src az416426.vo.msecnd.net
veracitycdn.azureedge.net 'unsafe-inline' https://tagmanager.google.com
https://www.googletagmanager.com www.google-analytics.com
sjs.bizographics.com/insight.min.js https://px.ads.linkedin.com/
https://*.hotjar.com https://*.hotjar.io; connect-src
dc.services.visualstudio.com https://*.hotjar.com:* https://*.hotjar.io;
frame-src https://*.hotjar.com https://*.hotjar.io
https://www.googletagmanager.com/ns.html; img-src www.google-analytics.com
stats.g.doubleclick.net ssl.gstatic.com www.gstatic.com
https://px.ads.linkedin.com/ www.google.no www.google.com px.ads.linkedin.com
www.linkedin.com; font-src data: fonts.gstatic.com; style-src
tagmanager.google.com fonts.googleapis.com
'sha256-SvLgADqEePEV9RNxBrRQXSBJafFHcVNG7cPzHz6h9eA='
Content-Security-Policy in the apps that do NOT have the problem:
#Added line break for better readability
Content-Security-Policy: default-src 'self' veracitystatic.azureedge.net
veracitycdn.azureedge.net veracity-cdn.azureedge.net
veracity-static.azureedge.net veracity.azureedge.net; style-src 'self'
'sha256-UTjtaAWWTyzFjRKbltk24jHijlTbP20C1GUYaWPqg7E=' tagmanager.google.com
fonts.googleapis.com 'sha256-SvLgADqEePEV9RNxBrRQXSBJafFHcVNG7cPzHz6h9eA=';
img-src 'self' data: veracityprod.blob.core.windows.net
veracitycdn.azureedge.net veracitystatic.azureedge.net
veracity-cdn.azureedge.net veracity-static.azureedge.net
veracitytest.azureedge.net veracity.azureedge.net brandcentral.dnvgl.com
devtestdevprofile.blob.core.windows.net testdevprofile.blob.core.windows.net
stagdevprofile.blob.core.windows.net cdn.sanity.io
devprofile.blob.core.windows.net www.google-analytics.com
stats.g.doubleclick.net ssl.gstatic.com www.gstatic.com
https://px.ads.linkedin.com/ www.google.no www.google.com px.ads.linkedin.com
www.linkedin.com; script-src 'self' veracitycdn.azureedge.net
veracity.azureedge.net https://localhost:3010 az416426.vo.msecnd.net
'unsafe-inline' https://tagmanager.google.com https://www.googletagmanager.com
www.google-analytics.com sjs.bizographics.com/insight.min.js
https://px.ads.linkedin.com/ https://*.hotjar.com https://*.hotjar.io;
media-src 'self' veracityprod.blob.core.windows.net
veracitystatic.azureedge.net veracitycdn.azureedge.net
veracity-cdn.azureedge.net veracity-static.azureedge.net veracity.azureedge.net
cdn.sanity.io brandcentral.dnvgl.com; connect-src 'self'
veracitystatic.azureedge.net veracitycdn.azureedge.net
veracity-cdn.azureedge.net veracity-static.azureedge.net veracity.azureedge.net
cdn.sanity.io wss://localhost:3011 dc.services.visualstudio.com
https://*.hotjar.com:* https://*.hotjar.io; font-src veracitycdn.azureedge.net
data: fonts.gstatic.com; report-uri
https://veracitycommon.report-uri.com/r/d/csp/enforce; report-to
https://veracitycommon.report-uri.com/a/d/g; frame-src https://*.hotjar.com
https://*.hotjar.io https://www.googletagmanager.com/ns.html
It is pretty clear that the incorrect (much shorter) value of Content-Security-Policy caused the problem.
Now we need to check what caused the different CSP values. By comparing the requests these apps were sending in Fiddler, I quickly identified that the “User-Agent” request header is the key.
User-Agent values from apps that cause the wrong CSP:
#Linkedin
Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [LinkedInApp]
#Facebook
Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [FBAN/FBIOS;FBDV/iPhone11,2;FBMD/iPhone;FBSN/iOS;FBSV/13.3.1;FBSS/3;FBID/phone;FBLC/en_US;FBOP/5;FBCR/Telenor]
#Facebook Messenger
Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 LightSpeed [FBAN/MessengerLiteForiOS;FBAV/256.0.1.26.113;FBBV/203261359;FBDV/iPhone11,2;FBMD/iPhone;FBSN/iOS;FBSV/13.3.1;FBSS/3;FBCR/;FBID/phone;FBLC/en_NO;FBOP/0]
User-Agent values from apps that cause the correct CSP:
#Slack
Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1
#Skype for Business
Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1
Although we cannot change the logic of these apps, we can still easily manipulate the request or response to simulate the different behaviors.
Head to Fiddler and go to the “Filters” tab, where you can:
Some findings are:
Original LinkedIn User-Agent (with issue) | Updated LinkedIn User-Agent (without issue) |
---|---|
Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [LinkedInApp] | Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 Version/13.0.5 [LinkedInApp] |
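Outside of Fiddler, the same comparison can be scripted. Below is a rough sketch (THE_PAGE_URL is a placeholder) that requests the page with both User-Agent strings and prints the length of the CSP header each one receives:
# Rough sketch: request the same page with different User-Agent values and
# compare the Content-Security-Policy header. THE_PAGE_URL is a placeholder.
import requests

user_agents = {
    "linkedin-in-app": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) "
                       "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 [LinkedInApp]",
    "with-version-token": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) "
                          "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 "
                          "Version/13.0.5 [LinkedInApp]",
}

for name, ua in user_agents.items():
    response = requests.get("THE_PAGE_URL", headers={"User-Agent": ua})
    csp = response.headers.get("Content-Security-Policy", "")
    print(f"{name}: CSP length = {len(csp)}")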
Generally, the website should return the same CSP in most cases, so this is an issue that we should fix on the website.
The investigation led us to the opensource library Helmet where we reported a bug https://github.com/helmetjs/csp/issues/105.
We have now fixed this issue locally, and once Helmet merges the PR, we are ready to go.
Docker Desktop (or Docker for Windows) is a nice environment for developers on Windows. The community stable version of Docker Desktop is good enough for this jump-start; just make sure the version you install includes Kubernetes 1.14.x or higher. (I am using Docker Desktop Community 2.1.0.3.)
Once installed, you can enable Kubernetes in Settings (see detailed info here).
Then you can verify it by running “kubectl version” in PowerShell (or a command window).
In my case, I got an error while connecting to [::1]:8080:
PS C:\> kubectl version
#Output:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"windows/amd64"}
Unable to connect to the server: dial tcp [::1]:8080: connectex: No connection could be made because the target machine actively refused it.
This is because I was missing the environment variable “KUBECONFIG”. Set this variable to the config file in your user directory, such as “C:\Users\YOUR_USER_NAME\.kube\config”.
After adding this and restarting PowerShell, it should work:
PS C:\> Get-Item -Path Env:KUBECONFIG
#Output:
Name Value
---- -----
KUBECONFIG C:\Users\lufeng\.kube\config
PS C:\> kubectl version
#Output:
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:44:30Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.3", GitCommit:"5e53fd6bc17c0dec8434817e69b04a25d8ae0ff0", GitTreeState:"clean", BuildDate:"2019-06-06T01:36:19Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
PS C:\> kubectl get namespaces
#Output:
NAME STATUS AGE
default Active 18h
docker Active 18h
kube-node-lease Active 18h
kube-public Active 18h
kube-system Active 18h
It is always nice to have a GUI for a complicated system such as Kubernetes, so let's install the dashboard: https://github.com/kubernetes/dashboard.
PS C:\> kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml
First of all, we need to enable the proxy so you can access the dashboard from localhost:
PS C:\> kubectl proxy
#Output:
Starting to serve on 127.0.0.1:8001
Once the proxy is up and running, visit the dashboard URL: http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/
Normally you will see this [login view](https://github.com/kubernetes/dashboard/blob/master/docs/user/access-control/README.md#login-view).
You can find more info about access control in the dashboard GitHub repo, but here we will do it in a simpler way (this is for demo purposes; do not apply the same setup in your production environment).
Get the default token name:
PS C:\> kubectl get secrets
#Output:
NAME TYPE DATA AGE
default-token-n92hz kubernetes.io/service-account-token 3 18h
Then get the token:
PS C:\> kubectl describe secrets default-token-n92hz
#Output:
Name: default-token-n92hz
Namespace: default
Labels: <none>
Annotations: kubernetes.io/service-account.name: default
kubernetes.io/service-account.uid: c56ad00e-e5e5-11e9-91a0-00155d3a9005
Type: kubernetes.io/service-account-token
Data
====
ca.crt: 1025 bytes
namespace: 7 bytes
token: eyJhbGciOiJSUzI1NiIsImt3NlcnZpY2UtYWNjb......CIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjfv4TPDVZoOrLWHZecEw-8XBQ
PS C:\>
Use the token in the login form, then you are in.
Helm is a tool for managing Kubernetes charts. Charts are packages of pre-configured Kubernetes resources. You can read more at https://helm.sh/.
According to the installation guide, we are going to:
PS C:\> scoop install helm
PS C:\> Get-Item -Path Env:HELM_HOME
#Check current kubernetes cluster context
Istio is a microservice mesh management framework that provides traffic management, policy enforcement, and telemetry collection.
We are going to:
Simply follow the steps in https://istio.io/docs/setup/install/helm/ and remember to configure Docker Desktop as mentioned. Unzip the downloaded package into “c:\Istio”, as we might want to update some files there.
PS C:\> helm repo add istio.io https://storage.googleapis.com/istio-release/releases/1.3.1/charts/
Then select a configuration profile. We go with “demo” as it includes some nice add-ons such as Kiali.
#Installation
PS C:\istio> helm install install/kubernetes/helm/istio --name istio --namespace istio-system --values install/kubernetes/helm/istio/values-istio-demo.yaml
#Verify
PS C:\istio> kubectl get pods -n istio-system
#Output:
NAME READY STATUS RESTARTS AGE
grafana-6fc987bd95-zj4kn 1/1 Running 0 98s
istio-citadel-55646d8965-wvflc 1/1 Running 0 97s
istio-egressgateway-7bdb7bf7b5-ck4k6 1/1 Running 0 98s
istio-galley-56bf6b7497-c9szw 1/1 Running 0 98s
istio-ingressgateway-64dbd4b954-64gj8 1/1 Running 0 98s
istio-init-crd-10-1.3.1-tvnr4 0/1 Completed 0 4h1m
istio-init-crd-11-1.3.1-qz4fh 0/1 Completed 0 4h1m
istio-init-crd-12-1.3.1-6rk5w 0/1 Completed 0 4h1m
istio-pilot-5d4c86d576-crn2k 2/2 Running 0 97s
istio-policy-759d4988df-c7tnb 2/2 Running 1 97s
istio-sidecar-injector-5d6ff6d758-8tlrx 1/1 Running 0 97s
istio-telemetry-7c88764b9c-245mk 2/2 Running 1 97s
istio-tracing-669fd4b9f8-gmlh9 1/1 Running 0 97s
kiali-94f8cbd99-zwz8z 1/1 Running 0 98s
prometheus-776fdf7479-jwnvh 1/1 Running 0 97s
You can also verify these pods via the dashboard.
As we installed the demo configuration profile of Istio, Kiali was also installed. Kiali is an observability console for Istio with service mesh configuration capabilities. (Read more at https://istio.io/docs/tasks/telemetry/kiali/.)
To open the Kiali UI, run:
PS C:\istio> kubectl -n istio-system port-forward $(kubectl -n istio-system get pod -l app=kiali -o jsonpath='{.items[0].metadata.name}') 20001:20001
#Output:
Forwarding from 127.0.0.1:20001 -> 20001
Forwarding from [::1]:20001 -> 20001
Then go to http://localhost:20001 to visit the Kiali UI.
Again, it asks for a login. As Kiali was installed as part of the demo configuration profile in this case, you can use the default username “admin” and password “admin” to log in.
Now, let's deploy a demo application composed of four separate microservices. The detailed documentation can be found at https://istio.io/docs/examples/bookinfo/.
Start the application services
#1. Set automatic sidecar injection
Establish the gateway for the Bookinfo app
#1. Apply gateway
Confirm the app is accessible from outside the cluster
Go to http://localhost/productpage to verify that you can open the page. You can refresh the page several times to generate telemetry.
Kiali Visualization
Assuming the 20001 port forwarding is still running, you can visualize the service relationships in Kiali at http://localhost:20001/
Let's deploy a single-container application (Grafana) to the cluster, as described at https://grafana.com/docs/installation/docker/
1. Docker version:
docker run -d -p 3000:3000 grafana/grafana
2. Kubernetes kubectl command version:
# 1. Deployment
PS C:\> kubectl run grafana-test --generator=run-pod/v1 --image=grafana/grafana --port=3000
#Output:
pod/grafana-test created
# 2. Check the name of the grafana pod. Note it is sitting in "default" namespace
PS C:\> kubectl -n default get pod
#Output:
NAME READY STATUS RESTARTS AGE
details-v1-c5b5f496d-sgr6w 2/2 Running 0 29h
grafana-test 2/2 Running 0 97s
kubernetes-bootcamp-b94cb9bff-vsprh 2/2 Running 0 3h6m
productpage-v1-c7765c886-6cpr9 2/2 Running 0 29h
ratings-v1-f745cf57b-87m7q 2/2 Running 0 29h
reviews-v1-75b979578c-vmzn2 2/2 Running 0 29h
reviews-v2-597bf96c8f-plml7 2/2 Running 0 29h
reviews-v3-54c6c64795-x67ss 2/2 Running 0 29h
# 4. Enable port forwarding.
# In case you wanna use select as the pod name contains random string,
# Use "kubectl -n default port-forward $(kubectl -n default get pod -l run=grafana-test -o jsonpath='{.items[0].metadata.name}') 3000:3000"
PS C:\> kubectl -n default port-forward grafana-test 3000:3000
#Output:
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000
3. Kubernetes YAML deployment version
It is recommended to use a YAML file for defining a deployment. See the documentation at https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
Create a deployment file grafana-deployment.yaml as below:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-yaml-deployment
  labels:
    app: grafana-yaml
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-yaml
  template:
    metadata:
      labels:
        app: grafana-yaml
    spec:
      containers:
      - name: grafana-yaml
        image: grafana/grafana
        ports:
        - containerPort: 3000
Then apply the YAML file and run:
#1. Deployment
PS C:\> kubectl apply -f .\grafana-deployment.yaml
#Output:
deployment.apps/grafana-yaml-deployment created
#2. Verify
PS C:\> kubectl get deployments
#Output:
NAME READY UP-TO-DATE AVAILABLE AGE
details-v1 1/1 1 1 29h
grafana-yaml-deployment 1/1 1 1 40s
kubernetes-bootcamp 1/1 1 1 3h27m
productpage-v1 1/1 1 1 29h
ratings-v1 1/1 1 1 29h
reviews-v1 1/1 1 1 29h
reviews-v2 1/1 1 1 29h
reviews-v3 1/1 1 1 29h
#3. Enable forward port, by using selector app=grafana-yaml
PS C:\> kubectl -n default port-forward $(kubectl -n default get pod -l app=grafana-yaml -o jsonpath='{.items[0].metadata.name}') 3000:3000
#4. Expose the service via nodeport
PS C:\> kubectl expose deployment grafana-yaml-deployment --type=NodePort --port=3000
#Output:
service/grafana-yaml-deployment exposed
#5. Get the external ip and port
PS C:\> kubectl get services
#Output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
details ClusterIP 10.110.165.24 <none> 9080/TCP 3d8h
grafana-yaml-deployment NodePort 10.98.52.86 <none> 3000:30857/TCP 9s
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 3d15h
productpage ClusterIP 10.97.123.119 <none> 9080/TCP 3d8h
ratings ClusterIP 10.111.216.40 <none> 9080/TCP 3d8h
reviews ClusterIP 10.109.244.28 <none> 9080/TCP 3d8h
PS C:\> kubectl describe service grafana-yaml-deployment
Name: grafana-yaml-deployment
Namespace: default
Labels: app=grafana-yaml
Annotations: <none>
Selector: app=grafana-yaml
Type: NodePort
IP: 10.98.52.86
LoadBalancer Ingress: localhost
Port: <unset> 3000/TCP
TargetPort: 3000/TCP
NodePort: <unset> 30857/TCP
Endpoints: 10.1.0.208:3000
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Then you can access the Grafana pod via http://localhost:30857
Now you should have a Kubernetes environment up and running, with Istio and Kiali enabled. It can be used as your sandbox for developing and testing applications in Kubernetes. With Istio and Kiali, you can also play with a service mesh. Everything runs locally in “one box”, so you do not need to worry about any cloud running costs.
Have fun.
Nowadays it is pretty common to share articles on social media such as Facebook and LinkedIn. Thanks to the widely implemented Open Graph protocol, sharing is no longer just a dry URL, but comes with rich text and thumbnails.
However, there are still some web pages that do not have Open Graph implemented, which significantly reduces readers' willingness to click.
In addition, even if you introduce the Open Graph tags as a hotfix, you sometimes have to wait approximately 7 days for the LinkedIn crawler to refresh the preview cache, as mentioned in the LinkedIn documentation:
The first time that LinkedIn’s crawlers visit a webpage when asked to share content via a URL, the data it finds (Open Graph values or our own analysis) will be cached for a period of approximately 7 days.
This means that if you subsequently change the article’s description, upload a new image, fix a typo in the title, etc., you will not see the change represented during any subsequent attempts to share the page until the cache has expired and the crawler is forced to revisit the page to retrieve fresh content.
There are some solutions here and here, but they are more like workarounds.
We can overcome this issue by using the LinkedIn API, which provides huge flexibility for customizing the sharing experience.
Head to https://www.linkedin.com/developers/ and create an application. As shown in the screenshot, I created an application named “Linkedin Poster”. Take note of the Client ID and Client Secret, and set the Redirect URL to https://www.getpostman.com/oauth2/callback.
Use the Postman application to generate an OAuth 2.0 token (Authorization Code flow). The detailed documentation is here.
Log in to generate the token.
In order to post articles on LinkedIn via the API, we need to provide the user ID.
Make a GET request to the API https://api.linkedin.com/v2/me (see the document), making sure the token from step 2 is included. The result is something like this:
{
"localizedLastName": "Lu",
"profilePicture": {
"displayImage": "urn:li:digitalmediaAsset:BACABCqwPVej-w"
},
"firstName": {
"localized": {
"en_US": "Feng"
},
"preferredLocale": {
"country": "US",
"language": "en"
}
},
"lastName": {
"localized": {
"en_US": "Lu"
},
"preferredLocale": {
"country": "US",
"language": "en"
}
},
"id": "ABC123-ab1",
"localizedFirstName": "Feng"
}
Referring to the documentation, it is pretty straightforward to customize the shared content.
In my case, I would like to share http://feng.lu/archives/ (which does not have Open Graph) with a nice archive picture.
POST to https://api.linkedin.com/v2/shares with body:
{
"content": {
"contentEntities": [
{
"entityLocation": "http://feng.lu/archives/",
"thumbnails": [
{
"resolvedUrl": "http://feng.lu/2019/02/06/Customize-social-sharing-on-Linkedin-via-API/archives.jpg"
}
]
}
],
"title": "Article archives of feng.lu"
},
"distribution": {
"linkedInDistributionTarget": {}
},
"owner": "urn:li:person:MY_LINKEDIN_ID",
"text": {
"text": "Checkout my blog archives! Hopefully you will find it useful. :)"
}
}
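The same call can also be made from a small script instead of Postman. Here is a sketch using the Python requests library (the access token and person URN are placeholders):
# Sketch of the same POST via Python requests. ACCESS_TOKEN and MY_LINKEDIN_ID are placeholders.
import requests

ACCESS_TOKEN = "YOUR_OAUTH2_TOKEN"

payload = {
    "content": {
        "contentEntities": [{
            "entityLocation": "http://feng.lu/archives/",
            "thumbnails": [{"resolvedUrl": "http://feng.lu/2019/02/06/Customize-social-sharing-on-Linkedin-via-API/archives.jpg"}],
        }],
        "title": "Article archives of feng.lu",
    },
    "distribution": {"linkedInDistributionTarget": {}},
    "owner": "urn:li:person:MY_LINKEDIN_ID",
    "text": {"text": "Checkout my blog archives! Hopefully you will find it useful. :)"},
}

response = requests.post(
    "https://api.linkedin.com/v2/shares",
    json=payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(response.status_code, response.json())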
Check out the result:
By using the LinkedIn API, we can easily customize the sharing experience with your professional network. It not only overcomes challenges such as a missing Open Graph implementation, but can also improve the social media campaign experience and enable better integration with a CMS.
In the second part of this series, we went through the detailed technical design based on IOTA. A quick recap:
Although the core data schema is quite easy to implement, companies and developers might meet some challenges to get started, such as:
We want to address the above challenges and help everyone gain the benefits of data integrity and lineage. Therefore, we have built the “Data Lineage Service” application. Developers and companies can apply this technology without a deep understanding of IOTA and the MAM protocol. It can be used either as a standalone application, or as a microservice that integrates with existing systems.
The key functions are:
Also, for anyone who simply wants to try it out in a live environment, we are hosting this service connected to the live DLT environment (the IOTA Tangle mainnet).
As a live environment, it allows anyone to:
The source code is hosted in Github: https://github.com/veracity/data-lineage-service
The live demo environment can be found at https://datalineage-viewer.azurewebsites.net
This live environment is backed by the public IOTA network (public IOTA nodes). Feel free to use it (either the GUI or the API Swagger) to store your integrity and lineage data on the IOTA mainnet, as well as to visualize existing data.
The API Swagger is at https://datalineage-viewer.azurewebsites.net/swagger/
Screenshot:
By using this service, an IoT device can ensure the integrity of its IoT data stream. As a demo, I have a Raspberry Pi with a Sense HAT that reports temperature and saves the integrity information to the DLT. The integrity information can be read here from the DLT.
The source code of this demo is at https://github.com/linkcd/data-integrity-on-pi
From day one, the performance of DLT has been a known issue. When expanding this technology into the IoT and real-time data exchange world, performance can become a blocking issue. This is also the reason we started looking into IOTA in the first place, hoping its performance could meet the need.
We have conducted the performance testing in 3 iterations:
In each iteration, we tested the performance of both reading and writing. The testing code is also open-sourced on GitHub: https://github.com/veracity/IOTA-MAM-performance-testing
Test results (on 20.09.2018)
Conclusion:
In Veracity we are researching and building Data Integrity and Lineage as a Service (DILAAS) to bring down the barriers for both data providers and data consumers. DILAAS offers:
Other articles in this series:
]]>In my previous article, we discussed different approaches for solving the data integrity and lineage challenges, and concluded that the “Hashing with DLT“ solution is the direction we will move forward with. In this article, we will take a deep dive into it. Please note that Veracity’s work on data integrity and data lineage is testing many technologies in parallel. We utilise and test proven centralized technologies as well as new distributed ledger technologies like Tangle and Blockchain. This article series uses the IOTA Tangle as the distributed ledger technology. The use cases described can be solved with other technologies. This article does not necessarily reflect the technologies used in Veracity production environments.
As Veracity is part of an Open Industry Ecosystem, we have focused our data integrity and data lineage work on public DLT and open-sourced technologies. We believe that to succeed in providing transparency from the user to the origin of data, many technology vendors must collaborate around common standards and technologies. The organizational setup and philosophies of some of the public distributed ledgers provide the right environment to learn and develop fast with an adaptive ecosystem.
There are many public DLT platforms nowadays, but not all of them (such as Bitcoin and Ethereum) are suitable for Big Data or IoT scenarios, for reasons such as:
We have been watching the technology evolution of distributed ledgers closely and exploring different possibilities. Currently we are exploring IOTA, which is a new type of DLT that is foundationally different from other blockchain-based technologies. A high-level comparison can be found in the IOTA FAQs, question “How is IOTA different from Blockchain?”
We decided to test our solution on top of IOTA, due to the following key features that IOTA offers:
This is not an article introducing IOTA, but you can learn more from https://www.iota.org
In addition, IOTA provides a protocol named Masked Authenticated Messaging (MAM) that fits easily into our solution. MAM provides an abstract data structure layer (channels) on top of regular transactions. In our solution, all reads and writes to the DLT (the Tangle) revolve around MAM channels. Check the article appendix for more MAM resources.
Therefore, Alice can publish the hash values as shown in the following diagram.
In the above case, Alice creates one channel with her private seed. Then she sends messages into this channel; each address holds one message.
There is sample code for sending a message into the IOTA tangle at https://github.com/linkcd/IOTAPoC/blob/master/tangleWriter.js in my repository.
This code simply:
First, let’s agree on some design principles and conceptual entities.
The verification process for both data integrity and data lineage should be self-service. This means that all verification information should be available to the public, and the data provider should not be bothered by this process.
(Technically it is possible to have permission control over the verification process by using private or restricted MAM channels, but it means that the data provider has to respond to ad-hoc verification requests.)
It means that data lineage will not impact the existing data flow, nor become a bottleneck.
An atomic unit in data flow from one data source to another data source. For example:
The unique ID of a data package in the scope of a data source. A typical data package ID is a number, a GUID or a time-stamp.
A data stream is a series of data packages from the same data source. It contains one or more packages and their IDs.
Goal: Data consumers can verify the integrity of data packages from a data source.
The high level overview of data integrity workflow is as following:
The data source creates a MAM public channel by using its private seed, then shares the root address of this channel with the public. This can be done, for instance, via the data source’s web site.
The data source can, of course, publish all individual addresses for all messages, but that would be too many. As long as consumers have the channel root address, they can go through all addresses from the root to find the specific address/message to verify. See step 5.
In order to allow the consumer to verify the integrity, the data source needs to provide enough information to make it possible. Therefore, you need to decide what information should be stored in the tangle as a JSON object.
All objects must have the following core fields. All of them are mandatory.
{
datapackageId: string,
wayofProof:string,
valueOfProof:string
}
| Field | Description | Example |
|---|---|---|
| datapackageId | The package ID is used for querying the data lineage info from the channel. Data source decides the ID format, such as integer or GUID. Different channels can have the same package ID. | “123456” |
| wayofProof | Information about how to verify the integrity based on valueOfProof. For example, it explains the used hash algorithms (SHA1 or SHA2 or others), or it simply copied the data package content into field valueOfProof. | “SHA256(packageId, data-content)” |
| valueOfProof | The value of the proof, such as hash value, or the copy of the data content in clear text. | (hash value or data itself) |
Example
An application (aka the data source) generates big CSV files and passes them to the downstream (aka the data consumer). All CSV files have a unique file name. The application decides to hash the file content together with the file name. The hash function can be one of the Secure Hash Algorithms, such as SHA-512/256.
Therefore, for file “file201.csv”, the application computes the hash based on SHA512(“201”, filecontent.string()), which is “7EC8E…AAFAA”:
{
datapackageId: "201",
wayofProof:"SHA512(201, filecontent.string())",
valueOfProof:""7EC8E...AAFAA"
}
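For reference, a small Node.js sketch of how such an integrity object could be produced. The file path and hashing convention simply follow the example above; adapt them to your own scheme.

```javascript
// Build an integrity object for a CSV file by hashing the package id together with the file content.
const crypto = require('crypto');
const fs = require('fs');

function buildIntegrityObject(packageId, filePath) {
  const fileContent = fs.readFileSync(filePath, 'utf8');

  // SHA-512 over the package id followed by the file content, hex-encoded
  const hash = crypto.createHash('sha512')
    .update(packageId)
    .update(fileContent)
    .digest('hex');

  return {
    datapackageId: packageId,
    wayofProof: 'SHA512(packageId, filecontent.string())',
    valueOfProof: hash,
  };
}

console.log(buildIntegrityObject('201', './file201.csv'));
```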
Use hash for reducing calls to tangle
The hash is also useful if you want to reduce the amount of data sent to the tangle. For example, a data source may be generating a small file every second, and pushing data to the tangle every second can be a performance bottleneck. If the data source packs all files from every 10 minutes into one, assigns an ID and computes the hash value of this data chunk, it can still publish integrity data to the tangle, but with a much lower frequency.
In addition to the above mandatory fields, you can extend the JSON object with additional fields to fit your logic.
For example, you can add an “applicationName” field for storing the application name, and an “applicationOwner” field for the application owner. These fields will be tightly coupled with the core fields and stored in the tangle:
{
datapackageId: "201",
wayofProof:"SHA512(201, filecontent.string())",
valueOfProof:""7EC8E...AAFAA"
applicationName:"temperature reporter v2.1",
applicationOwner:"feng.lu@veracity.com",
...
additionalField:...
...
}
Note
“timestamp” is not a mandatory field, as all transactions in the Tangle already have a system timestamp that shows when the data was submitted to the tangle. You can add a “package-received-timestamp” field that shows when the original data package was collected.
The data source sends the above JSON object to the MAM channel (IOTA tangle). The JSON object will be stored at a MAM address inside the channel. This can be done by using the demo code shown above; a minimal sketch is also included below.
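A minimal sketch of this step, assuming the mam.client.js packages (@iota/mam and @iota/converter) and a public node URL; both are assumptions, and the repository linked above contains the actual demo code.

```javascript
// Publish an integrity object to a public MAM channel on the tangle.
const Mam = require('@iota/mam');
const { asciiToTrytes } = require('@iota/converter');

const provider = 'https://nodes.thetangle.org:443'; // assumed public node
const seed = process.env.MAM_SEED;                  // the data source's private seed

async function publishIntegrityObject(obj) {
  // Initialize the MAM state (public mode by default) with our seed
  let state = Mam.init(provider, seed);

  // Encode the JSON object as trytes and create the next MAM message
  const message = Mam.create(state, asciiToTrytes(JSON.stringify(obj)));
  state = message.state; // keep the updated state if more messages will follow

  // Attach the message to the tangle (depth / minimum weight magnitude depend on the network)
  await Mam.attach(message.payload, message.address, 3, 14);
  return message.root;
}

publishIntegrityObject({
  datapackageId: '201',
  wayofProof: 'SHA512(201, filecontent.string())',
  valueOfProof: '7EC8E...AAFAA',
}).then(root => console.log('Published to MAM channel, root:', root));
```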
At this point, the data source has completed all the needed tasks.
The data consumer goes to the website from step 1 to get the root address of the MAM channel that belongs to the data source.
The data consumer goes through the MAM channel, address by address, to find the JSON object for the specific package by using its package ID. It then obtains the wayOfProof and valueOfProof for that package.
The pseudo-code is:
// get root address of the channel, from data source website
currentAddress = channel-root-address-got-from-data-source-website
// to find message for package #201
targetPackageId = 201
// start from the root address, go through all messages in this channel, to find the target message
while(currentAddress != null)
{
var currentInfo = MAM.GetInfoFromAddress(currentAddress)
if(currentInfo.PackageId == targetPackageId)
{
// found the verification info for the target package, return
return currentInfo
}
else
{
//the current address is not for the target package, go to next address
currentAddress = currentAddress.nextAddress
}
}
//check currentInfo.wayOfProof and currentInfo.valueOfProof
The data consumer reads the “wayOfProof” field to understand how to check the “valueOfProof” field. For example, compute the hash by using the same hash function “SHA512(201, filecontent.string())” for package 201.
The data consumer compares the hash value from the MAM channel with the locally computed one.
Let’s look at a real-life case:
You, as an American tourist, were having a vacation in Norway. You were driving a car and had a great experience of the fjords. Unfortunately, you had a small accident outside a gas station and the car windshield was damaged (luckily, no one was injured). The local police station was informed and issued a form (in Norwegian!) about this accident.
Now you would like to report this to your insurance company.
Most likely the insurance company would like to know if they can trust the damage report. You can of course explain that the data flow is:
If we can store and verify this flow (data lineage of the report), it will:
On top of the data integrity layer that we discussed above, it is easy to extend the format to build the data lineage layer.
We extend the format with an optional field “inputs“, which is an array of MAM addresses. These addresses represent the data integrity information of all inputs of the current data package. A MAM address is a globally unique identifier in the Tangle, regardless of which channel it belongs to.
The “inputs“ field is optional, depending on whether the package has any inputs: you can omit the field, or include it with a null value. An illustrative extended object is sketched below.
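As an illustration (the field values here are made up; only the structure matters), an extended object with two inputs could look like:

```json
{
  "datapackageId": "201",
  "wayofProof": "SHA512(201, filecontent.string())",
  "valueOfProof": "7EC8E...AAFAA",
  "inputs": [
    "MAM_ADDRESS_OF_INPUT_PACKAGE_1",
    "MAM_ADDRESS_OF_INPUT_PACKAGE_2"
  ]
}
```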
The illustration is as below:
Then, for the above insurance report case, by using the additional inputs field, it is easy to establish the following data lineage flow:
It means that
Q1: Can I use IOTA and MAM protocol without IOTA token?
A1: Yes. Technically all MAM messages are 0(zero) value transactions. You can create unlimited MAM messages without any IOTA token.
Q2: Can I use IOTA and MAM protocol without hosting an IOTA node?
A2: Yes. You can use publicly hosted nodes on the mainnet. For example, check https://thetangle.org/nodes or google “iota public nodes”. However, for production-ready solutions, having a managed node is recommended, which can offer you, for example, the capabilities of a permanode; see Q3.
Q3: What are Public nodes and Permanode? Which one should I use?
A3: Some explanation can be found here, for example. The short answer is: if you need to keep historical MAM messages safe from IOTA snapshots, go with a permanode. Veracity is planning to host permanode(s) for the platform and its partners/customers.
Q4: Is it free to create a private seed and send messages to MAM channels?
A4: Yes, you can simply create a seed (a string) locally and store data into MAM channel. Feel free to generate seeds for your sensors/applications.
Q5: I do not want to use public MAM channels that anyone can take a look at, even if I know the messages only contain hash values. How can I protect the channels?
A5: MAM channels support 3 access levels: public, private and restricted. In our solution, in order to make verification self-service, we decided to use public channels. But it is possible to switch to private or restricted channels and grant the selected data consumers access to the channel.
Q6: I have an application that is sending out data to consumers. Do I need to do anything if a new consumer starts using my data and build the lineage on top of it?
A6: No. As a data source sitting in the upstream, you do not need to do anything for downstream consumers.
Q7: I am a data consumer. What information do I need to create the whole data lineage covering all inputs in different levels? For example, if the data flow is Alice->Bob->Carol->myself, do I need to know the MAM root address of Alice, Bob and Carol?
A7: No, you only need the MAM root of Carol. As long as you follow the inputs fields recursively, you can check the integrity and lineage of Bob (Carol’s upstream) and Alice (Bob’s upstream). In the above insurance case, the insurance company can also follow the inputs fields to check, for example, the translator’s message and the Norwegian police station’s message.
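To make the recursion concrete, here is a rough sketch of walking a lineage tree. The fetchIntegrityObject argument is a hypothetical helper: it stands for whatever call you use to read a single MAM message by address (for example via mam.client.js or the Data Lineage Service API).

```javascript
// Recursively walk the lineage tree starting from one MAM address.
// fetchIntegrityObject(address) is a hypothetical helper that returns the parsed
// integrity object ({ datapackageId, wayofProof, valueOfProof, inputs? }) stored at that address.
async function walkLineage(address, fetchIntegrityObject, depth = 0) {
  const obj = await fetchIntegrityObject(address);
  console.log(`${'  '.repeat(depth)}package ${obj.datapackageId} (${obj.wayofProof})`);

  // Follow every upstream input; each entry is itself a MAM address
  for (const inputAddress of obj.inputs || []) {
    await walkLineage(inputAddress, fetchIntegrityObject, depth + 1);
  }
}
```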
Q8: This solution sounds great, but it can take some efforts to build it, such as build the UX for data lineage visualization, API for read/write MAM messages, manage seeds properly, etc. Is there anything we can reuse?
A8: I am glad that you asked. In Veracity we are building Data Integrity and Lineage as a Service (DILAAS) to bring down the barriers for both data providers and data consumers. DILAAS offers:
In this article, we discussed the detailed design of the verification data schema, covering the crucial fields “datapackageId”, “wayofProof”, “valueOfProof” and “inputs”. We also implemented the solution on the selected DLT: IOTA and its MAM protocol. In the next article of this series, we will put this solution into action and have a closer look at some components of DILAAS.
Masked Authenticated Messaging (MAM) was introduced by IOTA in Nov 2017. The high-level description can be found here. In addition, some deep-dive information on Tangle transactions and MAM can be found at:
Other articles in this series:
]]>With the proliferation of data – collecting and storing it, sharing it, mining it for gains – a basic question goes unanswered: is this data even good? The quality of data is of utmost concern, because you cannot do meaningful analysis on data you cannot trust. Here in Veracity, we are trying to address this very concern. This is a 3-part series, going all the way from concept to a working implementation using DLT (Distributed Ledger Technology).
Side note: Veracity is designed to help companies unlock, qualify, combine and prepare data for analytics and benchmarking. It helps data providers easily onboard data to the platform, and enables data consumers to access and mine value. The data can be from various sources, such as sensors and edge devices, production systems, historical databases and human inputs. Data is generated, transferred, processed and stored, from one system to another system, one company to another company.
Veracity is by DNV GL, which has held a strong brand for more than 150 years as a trusted 3rd party; yet it is still pretty common to hear questions from data consumers such as:
In order to answer these questions and bring more transparency to the data process lifecycle, we must address both data integrity and data lineage. Both are the foundation of trust.
In this series of articles, we are going to look at different challenges of data integrity and lineage, and evolve the solution. (Note that integrity is one of the 3 parts of CIA triad: confidentiality, integrity and availability, but we will not cover confidentiality and availability in this series.)
Let’s start with a basic example:
Alice sends messages (i.e. files) to Bob. The messages are sent via an insecure channel, such as HTTP-based data transfer, FTP, a file share or even a USB stick.
There are 2 basic requirements for any data communication:
There are mainly 2 ways to ensure this: encryption and/or hashing. (A nice article comparing hashing and encryption can be found here.)
In iteration 1 we focus on solving requirement #1: the messages were not tampered with by a man-in-the-middle. We use either encryption or hashing.
We can address the con by introducing a trusted area for Alice. For example, Alice also publishes the hash values of the messages on https://alice.com. Bob can verify a message by comparing the hash values. It is also OK to make the trusted area public, as the hash value is irreversible: nobody can obtain the data from the hash value; they can only check the message integrity.
This solution is sort of adding a secured “safeguard” track on the side, to help verify the data flowing through the insecure channel.
In iteration #2, in addition to requirement #1, we also need to fulfill requirement #2: the messages that Bob received are indeed from Alice.
This normally requires asymmetric cryptography: Alice encrypts (signs) the message with her private key, and Bob decrypts (verifies) it with Alice’s public key. Therefore, Bob is confident that Alice is the message author; a small sketch follows below.
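A minimal Node.js sketch of this idea using digital signatures (the built-in crypto module; the key handling is simplified for illustration only):

```javascript
// Alice signs a message with her private key; Bob verifies it with her public key.
const crypto = require('crypto');

// For illustration only: generate a throwaway RSA key pair for "Alice"
const { publicKey, privateKey } = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });

const message = Buffer.from('message #2 from Alice');

// Alice: produce a signature over the message
const signature = crypto.sign('sha256', message, privateKey);

// Bob: verify the signature using Alice's public key
const isFromAlice = crypto.verify('sha256', message, publicKey, signature);
console.log('Message is from Alice:', isFromAlice); // true
```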
The hashing solution with a trusted area can also meet this requirement, simply by relying on the fact that ONLY Alice can write into the trusted area.
To ensure the basic requirements, both solutions work:
Now, an interesting real challenge: once Alice sends out a message, she should be able to deny neither that the message was sent, nor its origin, nor its original content. In other words, the challenge is about accountability and non-repudiation.
It can be explained by the following example:
(click to enlarge the picture)
At this point, the solutions we have so far cannot help Bob. For example, Alice can replace both the message and the hash value on https://alice.com.
With the encryption solution, although Bob can prove that the buggy version of message #2 is from Alice, he cannot prove that Alice sent out the buggy version on Monday.
In general, Bob (and we) need an immutable history that provides immutable traceability of data, such as when and what data was sent and processed.
It definitely helps data consumers like Bob, but there are benefits for well-behaved data providers as well: by offering an immutable history, a provider increases the acceptance of its data, as well as the value of and trust in the provider itself.
Distributed Ledger Technology (DLT) shows its potential capacity to become a natural place for storing data integrity and data lineage information, as it has the following key features:
As DLT supports built-in authentication, there is no need to use asymmetric (public/private) keys for identity purposes. You can still use a symmetric key to protect the message from unauthorized access.
However, there are some limitations to using DLT as the secured channel. The biggest one is the size limitation of the message. For example, Bitcoin’s block size limit is 1 MB, and Ethereum has similar constraints. For lots of cases, this limitation is a show-stopper.
Therefore, the hashing solution with DLT is more feasible. See below.
By only putting hash value of the messages into DLT, we can solve the size limitation issue.
It means:
The goal of data lineage is to track data over its entire lifecycle, to gain a better understanding of what happens to data as it moves through the course of its life. It increases trust in and acceptance of the results of data processing. It also helps to trace errors back to the root cause, and to comply with laws and regulations. You can easily compare this with the traditional supply chain of raw materials in the manufacturing and/or logistics industries.
For example, Bob is running a data process. This process takes inputs from Alice, then produces results. The results are sent to Carol.
For Carol, some typical questions are:
Now we continue building on top of the hashing solution with DLT. Whenever a data provider (for example, Bob) sends out data, he writes an entry into the DLT that contains:
This means that the DLT contains the end-to-end data lifecycle information. Carol (and anyone else) only needs to query the public information from the DLT to build the lineage diagram.
With this solution, Carol can:
In the above process, Bob is a data processor that accepts inputs from Alice (upstream), processes them and sends results to Carol (downstream).
This solution also provides extra protection for Bob. For example, if Bob sent data to Carol based on an incorrect input from Alice, Bob can simply show that the root cause of the error lies not with him but with Alice, and Alice cannot deny it.
This also means the solution can greatly simplify tracing errors back to the root cause, even when the whole process involves different parties/organizations.
Now we have gone through different requirements and evolved the solution accordingly. In the end, we believe the hashing solution with DLT can solve both the data integrity and the data lineage challenges. If the ecosystem (data sources, data processors and the platform) follows the same design, it will significantly increase the trust of data consumers as well as build more value into the data.
In the next article, we will look at this solution in action, by using IOTA as the selected DLT.
Other articles in this series:
]]>(picture copyright: www.dreamhost.com)
And, (after googling), yes! It is possible! .lu is the Internet country code top-level domain for Luxembourg. OK… (continuing googling) “Can I register a .lu domain without being a Luxembourger?” “No problem!” Great!
Long story short, after some quick research on vendors and paying 24 euros, I got the brand new feng.lu domain! :)
The remaining is pretty straightforward:
Happy blogging!
]]>2018.09.25
This article has now been expanded into an article series, with more detailed discussion and open-source code; check them out!
2018.08.26 - Updated the data schema:
If we say “Data is the new oil”, then data lineage is an issue that we must solve. Various data sets are generated (most likely by sensors), transferred, processed, aggregated and flowed from upstream to downstream.
The goal of data lineage is to track data over its entire lifecycle, to gain a better understanding of what happens to data as it moves through the course of its life. It increases trust and acceptance of result of data process. It also helps to trace errors back to the root cause, and comply with laws and regulations.
You can easily compare this with the traditional supply chain of raw materials in the manufacturing and/or logistics industries. However, compared to those traditional industries, data lineage faces new challenges.
Some top challenges are:
In addition,
Distributed Ledger Technology (DLT) shows its potential capacity to become the neutral and trustworthy 3rd party in data lineage world, as it has the following key features:
But not all DLTs are suitable for Big Data or IoT scenarios when we have, for example, the following requirements:
Therefore, IOTA stands out as a DLT compared to other blockchain platforms, by offering the following features:
This is not an article introducing IOTA, but you can learn more from https://www.iota.org/ and https://blog.iota.org and the IOTA channel on Discord.
But most importantly, it brings Masked Authenticated Messaging (MAM), which fits our need for data integrity and data lineage.
Masked Authenticated Messaging (MAM) was introduced by IOTA in Nov 2017. The high-level description can be found here.
In addition, some deep-dive information on Tangle transactions and MAM can be found at:
Data Integrity is the prerequisite of Data Lineage, and they can be addressed separately.
The verification process for both data integrity and data lineage should be self-service. This means that all verification information should be available to the public, and the data provider should not be bothered by this process.
(Technically it is possible to have permission control over the verification process, but it means that the data provider has to respond to ad-hoc verification requests.)
It means that data lineage will not impact the existing data flow, nor become a bottleneck.
An atomic unit in data flow from one data source to another data source. For example:
The unique ID of a data package in the scope of a data source. A typical data package ID is a number, a GUID or a timestamp.
A data stream is a series of data packages from the same data source. It contains one or more packages and their IDs.
Goal: One can verify the integrity of data packages from a data source.
The data source creates a public MAM channel by using its private seed. The private seed ensures that only the data source can publish information into that channel, so the channel is trusted by others.
In order to allow the consumer to verify the integrity, you need to provide enough information to make it possible. Therefore you need to decide what information should be stored in the tangle as a JSON object.
All objects must have the following core fields. All of them are mandatory.
{
datapackageId: string,
wayofProof:string,
valueOfProof:string
}
| Field | Description | Example |
|---|---|---|
| datapackageId | The package ID is used for querying the data lineage info from the channel. Data source decides the ID format, such as integer or GUID. Different channels can have the same package ID. | “123456” |
| wayofProof | Information about how to verify the integrity based on valueOfProof. For example, it explains the used hash algorithms (SHA1 or SHA2 or others), or it simply copied the data package content into field valueOfProof. | “SHA256(packageId, original-data-content)” |
| valueOfProof | The value of the proof, such as hash value, or the copy of the data content in clear text. | (hash value or data itself) |
Case 1
A temperature sensor decides to use the timestamp as the package ID, and since the data point is small and not confidential, it decides to put the data point as clear text in the integrity information object.
Therefore, at 2012-08-29 11:38:22, the temperature is 20 degrees. It sends the integrity JSON into its own MAM channel:
{
datapackageId: "1346236702",
wayofProof:"copy of original data",
valueOfProof:"20"
}
Case 2
An application generates big CSV files and passes them to the downstream. All CSV files have a unique file name. Since we do not have to expose the CSV file itself, either due to confidentiality or huge file size, the application decides to use a hash value in the integrity JSON. The hash function can be one of the Secure Hash Algorithms, such as SHA-512/256. This application decides to hash the file content together with the filename.
Therefore, for the file with unique name “file075.csv”, the application computes the hash based on SHA256(“file075.csv”+”:”+filecontent.string()), which is “8c20f3d24…43a6cfb7c4”:
{
datapackageId: "file075.csv",
wayofProof:"SHA256("file075.csv"+":"+filecontent.string())",
valueOfProof:""8c20f3d24...43a6cfb7c4"
}
The hash is also useful if you want to reduce the amount of data sent to the tangle. For example, a data source may be generating a small file every second, and pushing data to the tangle every second can be a performance bottleneck. If the data source packs all files from every 10 minutes into one, assigns an ID and computes the hash value of this data chunk, it can still publish integrity data to the tangle, but with a much lower frequency.
In addition to the above mandatory fields, you can extend the JSON object with additional fields to fit your logic.
For example, for case 1, you can add a “location” field for storing the location of that sensor, and a “sensorType” field for the sensor type. These fields will be tightly coupled with the core fields and stored in the tangle:
{
datapackageId: "1346236702",
wayofProof:"copy of original data",
valueOfProof:"20",
location:"Oslo",
sensorType:"temperature sensor XY200",
...
additionalField:...
...
}
Note
“timestamp” is not a mandatory field, as all transactions in the Tangle already have a system timestamp that shows when the data was submitted to the tangle. You can add a “timestamp” field to store when the original data was collected.
The data source sends the data, or the hash value of the data, to the MAM channel (IOTA tangle), which ensures:
You can have a look at the sample code for sending a message into the IOTA tangle at https://github.com/linkcd/IOTAPoC/blob/master/tangleWriter.js in my repository.
This code simply:
The data source publishes (on a web site or equivalent) the following information for anyone who wants to verify the integrity:
If a data consumer would like to verify that the data he/she got from the data source has not been tampered with, the consumer can follow the steps sketched below:
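A rough sketch of the consumer side, assuming the mam.client.js packages (@iota/mam and @iota/converter), a public node, and the core fields and case 2 hashing convention described above (all of these are assumptions for illustration):

```javascript
// Walk a public MAM channel from its root, find the integrity object for a given package id,
// and compare the published hash with a locally computed one.
const Mam = require('@iota/mam');
const { trytesToAscii } = require('@iota/converter');
const crypto = require('crypto');

async function verifyPackage(channelRoot, targetPackageId, localFileContent) {
  Mam.init('https://nodes.thetangle.org:443'); // assumed public node

  // Fetch all messages in the public channel starting from the root
  const { messages } = await Mam.fetch(channelRoot, 'public');
  const objects = messages.map(m => JSON.parse(trytesToAscii(m)));

  const info = objects.find(o => o.datapackageId === targetPackageId);
  if (!info) throw new Error(`No integrity object found for package ${targetPackageId}`);

  // Recompute the hash locally, following the case 2 convention: SHA256(filename + ":" + content)
  const localHash = crypto.createHash('sha256')
    .update(`${targetPackageId}:${localFileContent}`)
    .digest('hex');

  return localHash === info.valueOfProof;
}
```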
Let’s look at a real-life case:
You, as an American tourist, were having a vacation in Norway. You were driving a car and had a great experience of the fjords. Unfortunately, you had a small accident outside a gas station and the car windshield was damaged (luckily, no one was injured). The local police station was informed and issued a form (in Norwegian!) about this accident.
Now you would like to report this to your insurance company.
Most likely the insurance company would like to know if they can trust the damage report. You can of course explain that the data flow is:
If we can store and verify this flow (data lineage of the report), it will:
On the top of the data integrity layer that we discussed above, it is easy to extend the format to build the data lineage layer.
Now we extend the format to include an optional field inputs, which is an array of MAM addresses. These addresses represent the data integrity information of all inputs of the current data package.
The inputs field is optional, depending on whether the package has any inputs: you can omit the field, or include it with a null value.
By using the additional inputs field, it is easy to establish the following data lineage flow:
It means that
Data integrity and data lineage play important roles in the coming data-first era. By using DLT, especially IOTA, it is possible to build the infrastructure for them. However, we have to keep in mind that even though IOTA looks promising, it is still under development and not production ready. We will continue our investigation and collaboration with the IOTA team and communities on this journey.
]]>Installing an IOTA light wallet is pretty straightforward, but running a full node is not. Thanks to the great playbook, though, I managed to set up a Virtual Private Server to run as an IOTA full node.
There are lots of things you need to think about when you are hosting a 24/7 server on the internet. This blog and Security Hardening section provides a good guideline.
In addition, if you are using the playbook installer, you basically have the default user name and ports for your full node. IT IS IMPORTANT TO CHANGE THEM! Otherwise an attacker only needs to crack the password, as they already know your user name (iotapm) and your ports.
nano /opt/iri-playbook/group_vars/all/iotapm.yml
You can perform the following steps after you completed the installer.
htpasswd -D /etc/nginx/.htpasswd iotpm
htpasswd /etc/nginx/.htpasswd new_user_account
systemctl stop grafana-server
rm -f /var/lib/grafana/grafana.db
systemctl start grafana-server
Overview of connected neighbors
The node in the map: http://field.carriota.com/
Also, connect the wallet to our node
If you are looking for neighbors, or would like to connect your wallet to this node, please feel free to let me know.
If you would like to donate, please use the following address. :)
LPQRSZKJM9IRXHMUYJZQLKMAKJHJQDERJWIPSLKCYAPXVZPGEWG9QDXQUNTXCMZYLLIHPHGULVGFIAZAWDFECWYKGC
EoF.
]]>It has been a while since my last post. That is because I was quite busy leading a team in a program delivering veracity.com, the open industry data platform from DNV GL. It is a pretty exciting project: building an open, independent data platform with bleeding-edge technologies, to serve a large user base (100 000 registered users). You can read more about Veracity here and here.
There is actually a long and interesting story behind Veracity (and its predecessor), together with all the challenges that we encountered on this journey. Hopefully I can share them with you in the future.
Anyway, today I would like to talk about what Infrastructure-as-Code looks like in the real world, together with Azure and VSTS.
There are tons of Azure templates, which are a great starting point for using Infrastructure-as-Code in Azure. However, in real-world projects, we always need to do a lot of extra work due to:
The above introduces complexity into the CI/CD process, so it is important to have some best practices and a common understanding in the team.
Let’s start with something simple, then evolve it over time to address different challenges.
Let’s say we are going to build a simple Node.js web application as follows, and host it in Azure. This application is named “MyWords”.
It has 3 components:
As a developer, you can simply go to the Azure portal and create them manually. That is perfectly OK, especially when you are building a PoC.
Now, as usual, when the project becomes serious, we need multiple environments for better control. In this case, they are Nightlybuild, Testing and Production.
For now, these 3 environments are identical (of course, they are 3 different web apps with different URLs, 3 storage accounts and 3 Application Insights instances).
Now the manual steps from the previous stage become tedious and time-consuming, so we would like to automate them.
This can easily be achieved by using an Azure Resource Manager template and a VSTS task.
Tips:
At the end, we have our infra-as-code for our applications.
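For example, each environment typically gets its own parameters file next to the shared template. A minimal sketch of what azuredeploy.nightlybuild.parameters.json might contain (the parameter value here is an assumption for illustration; only webAppName is referenced by the template shown later):

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "webAppName": { "value": "mywords-nightlybuild" }
  }
}
```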
The result of provisioning is
It will take several iterations before you get the template right, but if you are using VS2017, you can use its GUI for debugging.
VS2017 is simply calling the following command (Deploy-AzureResourceGroup.ps1 is a standard PowerShell script that VS generates for you; you can also download it here):
Deploy-AzureResourceGroup.ps1 -ResourceGroupName 'Real-life-infra-as-code-manual-testing' -ResourceGroupLocation 'northeurope' -TemplateFile 'azuredeploy.json' -TemplateParametersFile 'azuredeploy.nightlybuild.parameters.json' -ValidateOnly
Pay attention to the switch parameter: -ValidateOnly. Without it, you can actually provision resources.
As an alternative, you can run:
Test-AzureRmResourceGroupDeployment -ResourceGroupName 'Real-life-infra-as-code-manual-testing' -TemplateFile $Env:BUILD_SOURCESDIRECTORY\Real-life-infra-as-code\azuredeploy.json -TemplateParameterFile $Env:BUILD_SOURCESDIRECTORY\Real-life-infra-as-code\azuredeploy.nightlybuild.parameters.json
It is nice to let the script create resources for us, but only the hard-coded values that we specified in the JSON template end up in the application settings. Therefore we still have to manually copy keys and connection strings from Application Insights and the storage account into the web site app settings.
To automate this process, we can look into Azure RM template functions. The functions listkeys and listvalue are useful for fetching values of a resource.
Now we use this function to pass keys from the storage account, and directly use the InstrumentationKey property to get the key from Application Insights:
{
"name": "[parameters('webAppName')]",
"type": "Microsoft.Web/sites",
....
"appSettings": [
{
"name": "WEBSITE_NODE_DEFAULT_VERSION",
"value": "6.11.1"
},
{
"name": "STORAGE_ACCOUNT",
"value": "[variables('storageAccountName')]"
},
{
"name": "STORAGE_ACCESSKEY",
"value": "[listKeys(resourceId('Microsoft.Storage/storageAccounts', variables('storageAccountName')), '2016-01-01').keys[0].value]"
},
{
"name": "APPINSIGHTS_INSTRUMENTATIONKEY",
"value": "[reference(resourceId('microsoft.insights/components', variables('appinsightName')), '2015-05-01').InstrumentationKey]"
}
]
....
In addition, make sure we add dependencies to the website resource, to ensure that the storage account and Application Insights are created before the website:
{
"name": "[parameters('webAppName')]",
"type": "Microsoft.Web/sites",
....
"dependsOn": [
"[concat('Microsoft.Web/serverfarms/', variables('serverFarmName'))]",
"[concat('Microsoft.Storage/storageAccounts/', variables('storageAccountName'))]",
"[concat('microsoft.insights/components/', variables('appinsightName'))]"
]
.....
Double check in application settings:
The updated code can be found here.
In the coming articles, we will continue addressing the following challenges:
To be continued.
]]>This article demonstrates the basic steps for setting up both the server side (Web API) and the client application.
First of all, let’s create an AAD B2C tenant with domain luconsultingb2c.onmicrosoft.com by following the steps in this document.
Then you can switch by using the top-right menu
If you want, you can connect the AAD to an existing Azure subscription
Now you can start using this tenant
Follow the step in Azure Active Directory B2C: Provide sign-up and sign-in to consumers with LinkedIn accounts.
Create a Linkedin App to generate the client id and secret
Add Linkedin as an identity provider in AAD B2C, together with Email as local accounts
Remember to give your LinkedIn identity provider a meaningful name, as the name will be used on the login page. (Do not use “LI” as the Microsoft article suggests.)
Follow the steps in Azure Active Directory B2C: Register your application to register a web api named B2CEchoWebAPI
Note:
Once Web API is registered, open the app’s Published Scopes blade and add any extra scopes you want.
Note:
Write down the AppID URI and Published Scopes values; you will need them in your client application code. The format for calling will be “https://{tenant}/{AppID URI}/{scope value}”, for example “https://luconsultingb2c.onmicrosoft.com/B2CEchoWebAPI/performXYZ".
In this case, I am using a forked version of the Microsoft sample code, with small modifications. You can access it at https://github.com/linkcd/active-directory-b2c-javascript-nodejs-webapi.
It uses the common passport and passport-azure-ad packages for the AAD strategies.
(full code is at https://github.com/linkcd/active-directory-b2c-javascript-nodejs-webapi/blob/master/index.js)
var express = require("express");
var passport = require("passport");
var BearerStrategy = require('passport-azure-ad').BearerStrategy;
//our tenent
var tenantID = "luconsultingb2c.onmicrosoft.com";
//client id of registered web api: "B2CEchoWebAPI"
var clientID = "f40734c1-5990-47fc-91b5-deceebac0089";
//our defined policy, include Linkedin
var policyName = "B2C_1_SiUpIn";
var options = {
identityMetadata: "https://login.microsoftonline.com/" + tenantID + "/v2.0/.well-known/openid-configuration/",
clientID: clientID,
policyName: policyName,
isB2C: true,
validateIssuer: true,
loggingLevel: 'info',
passReqToCallback: false
};
var bearerStrategy = new BearerStrategy(options,
function (token, done) {
// Send user info using the second argument
done(null, {}, token);
}
);
var app = express();
app.use(passport.initialize());
passport.use(bearerStrategy);
Then define the API endpoint:
app.use(function (req, res, next) {
res.header("Access-Control-Allow-Origin", "*");
res.header("Access-Control-Allow-Headers", "Authorization, Origin, X-Requested-With, Content-Type, Accept");
next();
});
app.get("/hello",
passport.authenticate('oauth-bearer', {session: false}),
function (req, res) {
var claims = req.authInfo;
console.log('User info: ', req.user);
console.log('Validated claims: ', claims);
//do this ONLY if the required scope include "read"
if (claims['scp'].split(" ").indexOf("read") >= 0) {
// Service relies on the name claim.
res.status(200).json({'name': claims['name']});
} else {
console.log("Invalid Scope, 403");
res.status(403).json({'error': 'insufficient_scope'});
}
}
);
Now run it locally
And confirm that the endpoint is protected
Now the Web API part is done, let’s move to the client part.
Follow the steps in register your single page application in your B2C tenant, so that your client has its own Application/client ID.
Note:
Again, I am using a forked version of the Microsoft sample code; you can find it at https://github.com/linkcd/active-directory-b2c-javascript-msal-singlepageapp
<script class="pre">
// The current application coordinates were pre-registered in a B2C tenant.
var applicationConfig = {
clientID: 'df8f3cb5-b668-4e11-a8ca-ad4f78cb87f4',
authority: "https://login.microsoftonline.com/tfp/luconsultingb2c.onmicrosoft.com/b2c_1_siupin",
//use scope "read", as it is required in the Web API (see the webapi code in above)
b2cScopes: ["https://luconsultingb2c.onmicrosoft.com/B2CEchoWebAPI/read"],
webApi: 'http://localhost:5000/hello',
};
</script>
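Once MSAL has acquired an access token for the b2cScopes above, calling the protected endpoint is just a matter of sending it as a bearer token. A minimal sketch (the accessToken variable is assumed to come from the MSAL login/acquireToken flow in the sample):

```javascript
// Call the protected Web API with the access token acquired by MSAL.
// `accessToken` is assumed to be the token returned for the "read" scope.
async function callEchoApi(accessToken) {
  const response = await fetch('http://localhost:5000/hello', {
    headers: { Authorization: `Bearer ${accessToken}` },
  });
  if (response.status === 403) throw new Error('Token is missing the required scope');
  return response.json(); // e.g. { "name": "Feng Lu" }
}
```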
Look at the claim properties
As you can see, most of the user properties come from my LinkedIn profile. Since this is the first time I have signed in, newUser is true.
Also verify that the new user is created in AAD B2C
EOF.