Warehousing semi-relational data

Your non-relational data has a lot of relationships.

Unless you’ve got a time series in a vacuum, all that data in your NoSQL database relates to other data sources in your general universe. Take your log data, which records how individual users interact with their personal data. What happens when you want to segment that user activity by country, by gender, or by any of the other variables that live in your core relational data?
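To make that concrete, here is a minimal sketch of segmenting log events by country. Everything in it is an assumption for illustration: the JSON log lines, the `users` and `events` tables, and their columns are hypothetical, and an in-memory SQLite database stands in for whatever warehouse you actually use.

```python
import json
import sqlite3

# Hypothetical JSON log lines, as they might be exported from a NoSQL store.
raw_logs = [
    '{"user_id": 1, "event": "login"}',
    '{"user_id": 1, "event": "view_profile"}',
    '{"user_id": 2, "event": "login"}',
    '{"user_id": 3, "event": "login"}',
    '{"user_id": 3, "event": "export_data"}',
    '{"user_id": 3, "event": "logout"}',
]

conn = sqlite3.connect(":memory:")

# Hypothetical core relational data: one row per user.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, country TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "US", "F"), (2, "DE", "M"), (3, "US", "M")],
)

# Land the parsed log events next to the relational data.
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (:user_id, :event)",
    [json.loads(line) for line in raw_logs],
)

# Segment user activity by a variable that only exists in the relational data.
events_by_country = dict(
    conn.execute(
        """
        SELECT u.country, COUNT(*)
        FROM events AS e
        JOIN users AS u ON u.id = e.user_id
        GROUP BY u.country
        """
    ).fetchall()
)
print(events_by_country)
```

Swapping `u.country` for `u.gender` gives the other segmentation. The point is that the logs only become useful for this question once they sit beside the user table.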

Your relational data isn’t always relational.

The most common operation in a user-facing relational database is usually to retrieve dozens of rows joined across several tables. Indexes and highly evolved query engines make these operations relatively simple to do in real time even with thousands of connections. But that doesn’t mean your relational data is always relational.

Data scientists typically aggregate millions of rows (i.e., full table scans) in large tables. Relative table sizes in many applications follow a log-linear relationship, with the largest tables containing millions of rows. Moreover, the questions we ask as data scientists have a more column-oriented flavor than the ones the product team needs answered: what is the trend, not what are the specific values for a single user. We often end up analyzing relational data with column-oriented methods, and non-relational data by combining it with our other relational data stores.
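The difference between the two kinds of question can be sketched in a few lines. The records, field names, and dates below are made up for illustration, and plain Python lists stand in for a real row store and column store.

```python
from collections import defaultdict

# Hypothetical row-oriented records, one dict per row.
rows = [
    {"user_id": 1, "day": "2024-01-01", "purchases": 2},
    {"user_id": 2, "day": "2024-01-01", "purchases": 1},
    {"user_id": 1, "day": "2024-01-02", "purchases": 4},
    {"user_id": 3, "day": "2024-01-02", "purchases": 3},
]

# Product-style question: what are the specific values for a single user?
# This touches a handful of rows and needs every field of each one.
user_1_rows = [r for r in rows if r["user_id"] == 1]

# Column-oriented layout: one list per field instead of one dict per row.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Analyst-style question: what is the trend?
# This touches every row but only two of the columns.
daily_totals = defaultdict(int)
for day, purchases in zip(columns["day"], columns["purchases"]):
    daily_totals[day] += purchases

print(user_1_rows)
print(dict(daily_totals))
```

The trend query never reads `user_id` at all, which is why a layout that stores each column separately suits this kind of scan.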


