Tutorial: Data Ingest for CSV Sources

DarkLight can ingest data in either JSON or CSV (table) format. This tutorial walks through the process of going from a CSV file to a graph.

Topics Covered:

  • Creating a Folder Data Feed
  • Creating a CSV Table Reify Configuration
  • Using a Playbook to process the file through the reify config

Just to keep everything clean while you work on this tutorial, let's work in a separate workspace.

  1. Choose File→Switch Workspace
  2. In the dialog that comes up, type in the name of the folder you want to hold the tutorial, or click the Browse button and choose a new empty folder.

  3. Click Yes to allow DarkLight to create the new directory
  4. Click Yes to restart DarkLight
    1. DarkLight will restart and create your new workspace
  5. Close the Welcome window
Creating a Folder Data Feed

  1. Open the PRO Playbooks Perspective from the top-right corner

  2. Create a new Data Feed with Connection Type of "Folder" from the Data Feeds view.

  3. Click on the new item in the Data Feeds list

  4. Enter a new name in the Name field. For this tutorial, we'll use the name "CSV Folder"
  5. Click the Browse button and choose or create an empty folder on your system where you will drop files you want ingested into DarkLight

    CAUTION

    DarkLight does not discriminate among the files dropped into this folder. It will ingest any file in the folder and try to send it to a PRO Playbook. If no playbook ingests the file, it will be deleted.
  6. Check the Archive button if you want the files dropped into the folder to be copied into an "Archive" folder before they are ingested.
  7. Check the box next to the folder name to activate the Feed. You should see a white cloud icon show up to the left of it. This indicates that the feed is configured correctly but no data has been processed with the feed yet.
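
To make the folder feed's behavior concrete, here is a minimal Python sketch of the lifecycle described above: every file in the folder is optionally copied to an archive, handed to a playbook, and then removed. This is only an illustration, not DarkLight's implementation. The function name, the `playbook` callback, and the `archive_dir` parameter are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def process_folder_once(
    monitor_dir: Path,
    playbook: Callable[[str, bytes], bool],
    archive_dir: Optional[Path] = None,
) -> List[str]:
    """Handle every file currently in monitor_dir, then remove it.

    `playbook` is a hypothetical stand-in for a PRO Playbook: it receives
    the file name and contents. The file is removed whether or not the
    playbook does anything with it, mirroring the CAUTION above.
    """
    handled = []
    for path in sorted(monitor_dir.iterdir()):
        if not path.is_file():
            continue  # ignore subfolders
        if archive_dir is not None:
            # Assumption: archiving is a plain copy made before ingest.
            archive_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, archive_dir / path.name)
        playbook(path.name, path.read_bytes())
        path.unlink()  # the dropped file does not stay in the folder
        handled.append(path.name)
    return handled
```

The takeaway is the same as the CAUTION: only drop files into this folder that you are happy to lose, or turn on archiving first.
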
Creating a CSV Table Reify Configuration

  1. Open the Reify Perspective from the top-right corner

  2. Create a new Reify config with Source Type of "TABLE"

  3. Download this sample file of employee data: tut-sample-employees.csv
    1. Note: This data contains synthetic user information generated at http://generatedata.com.
  4. Click the Load/Paste button and browse to the file

  5. Make sure Comma and 1st Row Header remain selected, and click OK
  6. Change the "Reifier Name" to "Employees"

  7. Click <Choose Class and ID> and set the class to ent:Employee. This will be the type of object created, and will be at the center of the graph.

  8. Choose the Class+JSONPath Value option from IRI Options, and pick $.ID from the list. Since the ID value uniquely identifies this employee, it will give the Employee object a name something like "Employee-Donovan06156".
  9. Click the Save Reifier button at the bottom of the window in the orange bar

    A Note About Ontologies

    To keep this tutorial concise, we're using an ontology that comes with DarkLight called tag:champtc:enterprise, with a shorthand prefix of ent. When you do this process with your own data, you will either use a different ontology or make your own. To learn more about ontologies, see Ontology: Planning Your Data Model.
  10. Add the first field in the CSV file to the configuration side by dragging it to the right.

  11. Now we need to tell DarkLight what kind of association this property is to the Employee. Click <Assign Data Property> and choose ent:hasFirstName. Note that you can type in the filter box and narrow your choices.

  12. Repeat the previous steps for the remaining fields (we won't be using the "number" field; it was just used to generate the ID in the sample data). Instead of dragging over one item at a time, however, use the Shift key to select all three then drag them all at the same time.

  13. Click the <Assign Data Property> buttons next to each of the properties you just added. Choose ent:hasLastName, ent:hasEmailAddress, and ent:hasAccountName for the Data Properties. These data properties (links in the graph) will point to the values that describe the Employee object.

  14. Click the Save Reifier button to save your changes.
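
If it helps to see the reify config as data, the sketch below mimics in Python what the configuration above describes: one CSV row becomes one ent:Employee node whose name is built from the class and the ID value, with each mapped column attached as a data property. It is a rough illustration under assumptions, not DarkLight's reifier. The column names other than ID, the sample values other than the ID, and the JSON-LD-style keys are all made up for the example.

```python
import json
from typing import Dict

# Column-to-property mapping chosen in the steps above.
# The column names on the left are assumptions; check your CSV header row.
PROPERTY_MAP = {
    "first": "ent:hasFirstName",
    "last": "ent:hasLastName",
    "email": "ent:hasEmailAddress",
    "account": "ent:hasAccountName",
}


def reify_row(row: Dict[str, str]) -> Dict[str, str]:
    """Turn one header-keyed CSV row into a JSON-LD-style Employee node."""
    # "Class+JSONPath Value" naming: class name plus the $.ID value.
    node = {"@id": f"Employee-{row['ID']}", "@type": "ent:Employee"}
    for column, data_property in PROPERTY_MAP.items():
        if column in row:
            node[data_property] = row[column]
    return node  # unmapped columns such as "number" are simply left out


# Invented sample row; only the ID value comes from the tutorial text.
sample = {
    "ID": "Donovan06156",
    "number": "06156",
    "first": "Alex",
    "last": "Donovan",
    "email": "adonovan@example.com",
    "account": "adonovan",
}
print(json.dumps(reify_row(sample), indent=2))
```

The printed node has the Employee at the center and four property links pointing at plain values, which is the shape the reify config is building.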

Using a Playbook to process the file through the reify config

Now we need to put all the pieces together and tell DarkLight what to do with the file that gets added to the monitored folder. This playbook will also publish the reified data into the Contextual Memory database.

We are headed toward something that looks like this:

  1. Switch to the PRO Playbooks perspective

  2. Create a new Playbook in the PRO Playbook Manager view. A new, blank window called "Untitled" will open up next to the PRO Playbook Manager.

  3. Change the playbook name from "Untitled" to "Employee Ingest"
  4. Add the first step in the playbook by clicking the + icon and choosing Ingest

  5. Click the Ingest step in the playbook to select it. The settings for it will show up to the right in the Step Editor view. Set the Choose Data Feed to the "CSV Folder" made at the beginning of this tutorial. Any files dropped into the folder of that feed will be processed by this playbook.


    See Also: Step: Ingest

  6. Add a new step to the True side of the Ingest step either with the + icon in the toolbar or with the Add a Step button in the Step Editor. When the Create Step dialog opens, choose "Convert CSV to Table (multi-line)". This step will convert the CSV text into something the playbook can use, called a multi-dimensional array, but that's a mouthful, so we'll just call it a table.

  7. Click on the Convert CSV to Table step and enter rawInput into the Input Variable field. Note that there is a capital letter I in the name, and variable names are case-sensitive. Oh, by the way, do you see the icon at the end of that input box? That tells you that the data that will come out of this step will be a table (the icon will make more sense in the next steps).
  8. The next step only works if the playbook has been saved once, so click on Save PRO Playbook in the orange bar at the bottom of the Playbook view.

  9. Now it's time to test our work so far. Anytime you start creating a playbook that uses variables it's a good idea to put a piece of sample data into the playbook and check it with the Inventory view. Click on the triangle menu of the Playbook and choose Load Sample Input.
  10. In the Load/Paste Sample Input dialog, click the Browse button and load in the same CSV file downloaded earlier. This is the file that will be copied into the monitor folder when the playbook is running so we also want to use it to test with.

  11. Both of the steps should turn bright green, indicating that the data package was successfully processed by both steps. (If the steps aren't green, make sure you have saved the playbook at least once.) Now we need to use the Inventory view to see the details about the package traveling through these steps. The Inventory tab is probably tucked behind the Step Editor view tab. Click the Inventory tab to bring it to the front.

  12. Click on the Convert CSV to Table step and notice that the Inventory view now has a "Package In" and a "Package Out" section with content in it. A package is a JSON-formatted object that is used to contain all of the information necessary to process a piece of data through the steps of the playbook. The package includes a section that holds tabular information, called miscData, and a section that holds graphs, called graphData.

    This "raw" option is nice for seeing the full package, but a more useful way to use the Inventory view is to choose the Data Out option by clicking on Data Out at the top of the Inventory view.

  13. The Data Out option of the Inventory view shows that the output of the Convert CSV to Table step is a variable called "result" and it is a list of multi-value rows (a table), indicated by the "[*][*]" label (and as seen in the icon used above). We'll need that name in the next step. Notice that the first row of header information is not there because one of the settings of the Convert CSV step was to ignore the first line.

  14. We ultimately want to reify (turn into a graph) this data, but we only want one row at a time to go through the reifier. To do that, we need to split the package into one new package for each row in the "result" table. Add the Split Package step to the True + side of the Convert CSV to Table step.

  15. In order to configure the Split Package step, we need to see the Step Editor again (which is behind the Inventory view now). To see both the Step Editor and Inventory view, grab the Inventory tab and drag it into the Tutorial view.
  16. Click on the Split Package step and type in "result" in the Input Variable to Split.

    Note: The step configurations keep the text you've entered while you switch to other steps. For example, you can click on the Convert CSV step and look at the Data Out option of the Inventory view if you need to remember the name of the variable that holds the data you need.
  17. Click the Save PRO Playbook button, and look at the Data Out option of the Inventory view. Each time you save the Playbook, the sample data will run through the playbook again and update the Inventory view. This is showing that the Split Package step took the first line from the result variable and stored the values in a new variable called singleItem. The [*] hint means that each row can be addressed by using an index number (0,1,2, etc.). This way of representing data is called a single-dimensional array, or we also call it a list.

  18. Bonus Tip: We won't be using this feature in this tutorial, but there are times when you will want to reference values in a list like this. DarkLight does this by using the FreeMarker template specification. The Inventory view helps you get the syntax of the FreeMarker right in the Template section at the bottom. Click on a value, like the last name, and the Template will show you the FreeMarker variable you'd use to refer to that value in a step. You'll know you need to use this syntax when the step asks for a "value" instead of a "variable" and it has the icon.

  19. You may have noticed that the Inventory view no longer shows the "result" variable and only has details for one employee now. This is a result of using the Split Package step. The top of the Inventory view now shows the total number of packages that were created by the split, and the arrow buttons let you step through each new package and see the variables for that package in the rest of the Inventory view.

  20. Add a Reify Table Row step to the True + side of the Split Package Step.

  21. Click on the Reify Table Row step. From the dropdown menu for Reify Configuration, choose the "Employees" reify config made previously in this tutorial and then enter "singleItem" (with a capital I) into the Input Variable field. This tells the step to take the value of the singleItem variable and use it as the source for the Employee reify config. It will add a graph called "_default_" to the package.
  22. Click the Save PRO Playbook button to save your changes and automatically run the sample file through the steps again. The Inventory view should now show a graph called _default_. It won't be shown as the kind of graph you may be used to, with nodes and edges. Instead, it is shown as text in JSON-LD (https://json-ld.org).

    1. You may have noticed that the "Primary Individual" of the graph (the center node we configured as an ent:Employee in the reify config) does not have the ID value as a part of its name, like it should. This is a bug in DarkLight 3.7.0 for the CSV reifier.
  23. Add a new step of Publish to Knowledge Base to the True + side of the Reify Table Row step, and configure it to Publish to Contextual Memory.

  24. Click the Save PRO Playbook button to save your changes and automatically run the sample file through the steps again. Click on the Review perspective in the upper-right, then look in the bottom-left view called "Contextual Memory." There should be an item in there called "Employee (100)".



  25. Click on the Employee (100) label in Contextual Memory, and the 100 employee objects will load into the Results views. Click on any Employee ID in the table to see its details.


    See Also: Viewing Results
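
The data flow through the first three playbook steps can be sketched in plain Python. This is a rough model under assumptions, not DarkLight code: the package is shown as a dictionary with miscData and graphData sections, the variable names rawInput, result, and singleItem follow the steps above, and everything else (function names, where rawInput lives inside the package, what a split package carries over) is hypothetical. The sample rows are invented.

```python
import csv
import io
from typing import Dict, List


def csv_to_table(raw_input: str, skip_header: bool = True) -> List[List[str]]:
    """Convert CSV text into a table: a list of rows, each a list of values."""
    rows = list(csv.reader(io.StringIO(raw_input)))
    return rows[1:] if skip_header else rows


def split_package(package: Dict, variable: str) -> List[Dict]:
    """Create one new package per row of the table stored in `variable`."""
    return [
        # Assumption: each new package holds only its own row as singleItem.
        {"miscData": {"singleItem": row}, "graphData": {}}
        for row in package["miscData"][variable]
    ]


# Invented two-row sample; the real tutorial file has 100 employees.
raw = "ID,first,last\nDonovan06156,Alex,Donovan\nLee00042,Sam,Lee\n"

package = {"miscData": {"rawInput": raw}, "graphData": {}}
package["miscData"]["result"] = csv_to_table(package["miscData"]["rawInput"])

# result is [*][*]: a row index, then a column index.
print(package["miscData"]["result"][1][0])  # Lee00042

packages = split_package(package, "result")
print(len(packages))  # 2

# singleItem is [*]: one index picks a value out of the row.
print(packages[0]["miscData"]["singleItem"][2])  # Donovan
```

Each of the split packages would then go through the Reify Table Row step on its own, which is why the Inventory view shows one employee at a time after the split.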

Now it's time to get everything turned on and see it work like it will in production.

  1. Let's clear out the sample data from Contextual Memory so we know it works when we use the folder we set up. In the Contextual Memory view, right-click on the Employee label and choose Clear. Click OK in the confirmation dialog that will appear.

  2. Switch back to the PRO Playbooks perspective and make sure that both the "CSV Folder" Data Feed and the "Employee Ingest" PRO Playbook have checkmarks next to them.

  3. From your file system, copy the CSV file downloaded earlier, and paste the file into your monitor folder. DarkLight will ingest and then delete the file, moving it to the Archive folder if you chose that option in the Data Feed configuration. The file should then flow through the playbook and the end result should be 100 Employee objects in Contextual Memory.
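
If you would rather script the last step, the sketch below counts the data rows in a CSV file (so you know how many Employee objects to expect) and copies the file into the monitor folder. It uses only the Python standard library and does not call DarkLight. The function names are hypothetical, and the paths are whatever you chose earlier in this tutorial.

```python
import csv
import shutil
from pathlib import Path


def count_data_rows(csv_path: Path) -> int:
    """Count non-empty rows after the header row."""
    with csv_path.open(newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]
    return max(len(rows) - 1, 0)


def drop_into_monitor_folder(csv_path: Path, monitor_dir: Path) -> Path:
    """Copy the file into the monitored folder and return the new path."""
    return Path(shutil.copy2(csv_path, monitor_dir / csv_path.name))
```

Copying rather than moving keeps your original download around, since the copy in the monitor folder will be removed after ingest.
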
  • Last modified: 2019/03/29 22:59