Jim Scott is head of developer relations, Data Science, at NVIDIA (www.nvidia.com).
Over his career, Scott has held positions running operations, engineering, architecture, and QA teams in the big data, regulatory, digital advertising, retail analytics, IoT, financial services, manufacturing, healthcare, chemicals, and geographical information systems industries. He has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.
Articles by Jim Scott
Securing information systems and data is foundational for any organization. Detecting insider threats can be a considerable challenge for threat detection systems and security analysts because it is difficult to distinguish abnormal actions from normal ones in internal system behavior data.
Posted April 05, 2021
The tectonic technology shifts we saw with the advent of Hadoop will not be seen again for quite a long time, but don't get too comfortable because the development of new tools and technologies to support the data science space isn't slowing down any time soon.
Posted January 18, 2021
When talking about data science, most people feel as if they are in one of two camps as far as data size. The first is really small data—hundreds of megabytes to a few gigabytes. The second is gigabytes to terabytes. Notice I didn't say "big data," nor did I say "petabytes." The source datasets may start at petabyte-scale, but keep in mind that data is often very raw, and most of it is ignored.
Posted September 11, 2020
The pandemic is revealing gaps in our critical infrastructure security, fragility in our supply chains, and shortfalls in our use of modern technologies to help communities around the world mitigate and recover. Although advanced technology platforms have been used by large international corporations, the pandemic is exposing the fact that emergency management and public health agencies are behind the curve in adopting, or are underutilizing, data science, open source software, and high-performance computing resources.
Posted May 28, 2020
As data sizes have grown over the last decade, so has the amount of time it takes to run ETL processes to support the myriad downstream workloads. A decade ago, most people were only thinking about making their KPI dashboards faster. As time rolled forward, they started to think about getting more intelligent analytics out of their data, and the data sizes quickly grew from gigabytes to terabytes.
Posted May 18, 2020
Accelerating the Data Science Ecosystem
Posted December 16, 2019
It is well known that data scientists spend about 90% of their time performing data logistics-related tasks. Anything that a data scientist can do to reduce that overhead is a good use of their time, and a benefit to the organization as a whole. Enter RAPIDS, a data science framework offering support for executing an end-to-end data science pipeline entirely on the GPU.
Posted September 03, 2019
Apache Airflow is turning heads these days. It integrates with many different systems and it is quickly becoming as full-featured as anything that has been around for workflow management over the last 30 years. This is predominantly attributable to the hundreds of operators for tasks such as executing Bash scripts, executing Hadoop jobs, and querying data sources with SQL.
Posted May 16, 2019
Kubeflow is a workflow tool that prides itself on making machine learning workflows simple to build, scalable, and portable. It provides graphical end-user tools to set up and define the steps in a pipeline. Most importantly, as data scientists build out their use cases, they add more and more steps and, when using Kubeflow, they end up with a documented, repeatable process.
Posted April 24, 2019
The Importance of Data for Applications and AI
Posted January 09, 2019
Thanks to the dramatic uptick in GPU capabilities, gone are the days when data scientists created and ran models in a one-off manual process. This is good news because the one-off model was typically not optimized to get the best results. The bad news is that with an increase in the total number of models created—including iterations over time—the amount of data used as inputs and generated by the models quickly spirals out of control. The additional bad news is that there are a variety of complexities associated with model, data, job, and workflow management.
Posted September 26, 2018
Blockchain has been one of the most loudly trumpeted new technologies on the enterprise database scene in recent history. However, the concept of a blockchain is not really a new notion. It is more of a repackaging of existing constructs to deliver a new set of benefits to any organization leveraging it for their use cases. It provides the benefit of irrevocable proof, and it reduces friction with information exchange.
Posted May 11, 2018
Data, data, data. With the exponential data growth that has occurred in the last 15 years, a need has arisen to sift through that data to find answers to questions.
Posted March 26, 2018
In a world where new technologies are often presented to the industry as rainbows and unicorns, there is always someone in a cubicle trying to figure out how to solve business problems and just make these great new technologies work together. The truth is that all of these technologies take time to learn, and it also takes time to identify the problems that can be solved by each of them.
Posted January 05, 2018
Many people are unsure of the differences between deep learning, machine learning, and artificial intelligence. Generally speaking, and with minimal debate, it is reasonably well-accepted that artificial intelligence can most easily be categorized as that which we have not yet figured out how to solve, while machine learning is a practical application with the know-how to solve problems, such as with anomaly detection.
Posted September 20, 2017
When people talk about the next generation of applications or infrastructure, what is often echoed throughout the industry is the cloud. On the application side, the concept of "serverless" is becoming less of a pipe dream and more of a reality. The infrastructure side has already proven that it is possible to deliver the ability to pay for compute on an hourly or more granular basis.
Posted May 15, 2017
It is difficult to find someone not talking about or considering using containers to deploy and manage their enterprise applications. A container just looks like another process running on a system; a dedicated CPU and pre-allocated memory aren't required in order to run a container. The simplicity of building, deploying, and managing containers is among the reasons that containers are growing rapidly in popularity.
Posted April 07, 2017
Hadoop Fundamentals and Key Technologies in the Evolving Hadoop Ecosystem
Posted February 03, 2017
Choosing when to leverage cloud infrastructure is a topic that should not be taken lightly. There are a few issues that should be considered when debating cloud as part of a business strategy.
Posted October 04, 2016
7 Key Technologies in the Evolving Hadoop Ecosystem
Posted June 03, 2016
NoSQL databases were born out of the need to scale transactional persistence stores more efficiently. In a world where the relational database management system (RDBMS) was king, this was easier said than done.
Posted March 29, 2016
Hadoop and Its Accompanying Ecosystem Are Here to Stay
Posted January 19, 2016
It's Always a Good Time for Real Time Data Access
Posted October 13, 2015
Google white papers have inspired many great open source projects. What has been missing until now, however, has been a way of bringing these technologies together such that any data-centric organization can benefit from the capabilities of each technology across its entire data center, and in new ways not documented by any single white paper. This is called the "Zeta Architecture."
Posted May 19, 2015
In order to truly appreciate Apache Drill, it is important to understand the history of the projects in this space, as well as the design principles and the goals of its implementation.
Posted April 08, 2015