ETL

Data Catalog: Sources

386 views September 11, 2018 May 29, 2019 admin 0

Our mission at UDM is not just to be the premier one-stop shop for all external data, served timely and accurately, but to also educate and empower our users on data they use in their analysis and reporting. We also strive for maximum transparency, down to a level of every record. To achieve these two objectives, we maintain a collection of Sources. Each Source belongs to one, and only one Report, and uniquely identifies one internet address where data from a particular Report can be obtained. For example, the following are five sources for the report Announcement of Class and Component Prices, published by the U.S. Department of Agriculture:

  • Announcement of Class and Component Prices — Archived Report at https://www.ams.usda.gov/market-news/search-market-news
  • Announcement of Class and Component Prices — Datamart at https://mpr.datamart.ams.usda.gov/menu.do?path=Products\Dairy\All%20Dairy\(DY_CL102)%20National%20Dairy%20Products%20Prices%20-%20Monthly
  • Announcement of Class and Component Prices — PDF File at https://www.ams.usda.gov/mnreports/dymclassprices.pdf
  • Announcement of Class and Component Prices — PDF File at http://usda.mannlib.cornell.edu/usda/ams/DYMCLASSPRICES.pdf
  • Announcement of Class and Component Prices — TXT File at https://www.ams.usda.gov/mnreports/wa_da300.txt

Each Source belongs to one, and only one Report, but report may have multiple Sources.

In the Excel version of our Data Catalog, our Excel users can see where we obtain data for a particular Report or Publisher by filtering on those columns in the tab Sources:

In each record in our database, we store the information on the source from which the data was obtained. SourceID field in Value_* feeds also helps to unique tie each value record back to a single Report. Each value stored in the database belongs to one, and only one Series. Each Series, however, may belong to multiple Reports, so information on Source is necessary to unique identify the Report at the record level. Consider for example the series FMMO Butterfat Test of Producer Milk. This series is published in separate Reports by each of the 10 Federal Milk Marketing Orders in the United States. Federal Milk Order 32 publishes this variable in the report Federal Milk Order 32 – Producer Price Differential, Component Prices & Average Component Tests. Federal Milk Order 1 publishes it in Federal Milk Order 1 – Pool Price Announcement, etc. In total, this Product belongs to more than 10 Reports. Each record, on every Value_* feed on UDMDirect API has a unique SourceID identifying precisely the online source from which the value has been obtained.

The data catalog endpoint with the full list of all Measures is /UDM_Catalog_Sources. The fields are:

  • SourceID (int, not null)
  • ReportID (int, not null)
  • SourceName (nvarchar(400), not null)
  • LastModified (datetime2(6), not null)
  • Active (bit, not null)

SourceID is the unique identifying number assigned to each Source, i.e. the primary key for the table where this data resides. SourceName is the brief name for the Source, typically containing the exact URL from which the data can be obtained. ReportID is the foreign key to identify the Report to which a Source belongs.

The feed UDM_Catalog_Sources contains the timestamp field LastModified which can be used to facilitate incremental pulls. No record is ever hard-deleted from the underlying UDM database table. Instead, soft-deleted records have Active field set to 0. Soft-delete principle in conjunction with the timestamp of last change allows the users to optimize their data pulls by only requesting those fields that have been modified since the last time the data feed was pulled. For example, the request to the server may be:

http://udmdirect.dairymarkets.com/Universal/UDM_Catalog_Sources?$filter=LastModified gt DateTime’2018-09-01T04:30:00.0000000′

or more simply:

http://udmdirect.dairymarkets.com/Universal/UDM_Catalog_Sources?$filter=LastModified gt DateTime’2018-06-01′

ETL users should pull in the cross-reference catalog feed /UDM_Catalog_XRef to obtain information on all relevant many-to-many relationships in the UDM Database.

To add this table to your database:

CREATE TABLE [UDM].[Catalog.Sources]( 
		[SourceID] [int] NOT NULL,
		[ReportID] [int] NOT NULL,
		[SourceName] [nvarchar](400) NOT NULL,
		[LastModified] DATETIME2(6) NOT NULL, 
		[Active] BIT NOT NULL, 
    CONSTRAINT [PK_Catalog.Sources] PRIMARY KEY ([SourceID])
) 
GO 

ALTER TABLE [UDM].[Catalog.Sources] ADD CONSTRAINT [FK_ReportID_Catalog_Sources] FOREIGN KEY([ReportID])
REFERENCES [UDM].[Catalog.Reports] ([ReportID])
GO

Was this helpful?