Summary: PySpark DataFrames have a join method that takes three parameters: the DataFrame on the right side of the join, the fields being joined on, and the type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left-side DataFrame object, e.g. df1.join(df2, df1.col1 == df2.col1, 'inner'). One of […]
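A minimal sketch of that call pattern, assuming a running SparkSession; the toy data and the df1/df2/col1 names mirror the summary and are illustrative only:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Toy DataFrames; col1 is the shared join key.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["col1", "val1"])
df2 = spark.createDataFrame([(1, "x"), (3, "y")], ["col1", "val2"])

# join(other, on, how): right-side DataFrame, join condition, join type.
joined = df1.join(df2, df1.col1 == df2.col1, "inner")
joined.show()
```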
Summary: Logistic regression produces coefficients that are log odds. Raise e to a coefficient to express it in odds. Odds grow exponentially, rather than linearly, with every one-unit increase in x: a two-unit increase in x multiplies the odds by the square of the odds coefficient. To get […]
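A quick worked example of that conversion; the coefficient value here is hypothetical:

```python
import math

beta = 0.7                     # hypothetical log-odds coefficient from a fitted model
odds = math.exp(beta)          # odds ratio per one-unit increase in x (~2.01)
two_unit = math.exp(2 * beta)  # two-unit increase: the odds ratio squared (~4.06)
assert abs(two_unit - odds ** 2) < 1e-9
```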
Summary: The simplest way of getting a data.frame into a transactions object is by reading it from a CSV into R. An alternative is to convert it to a logical matrix and coerce it into a transactions object. I occasionally use the arules package to do some light association rule mining. The biggest frustration has […]
Summary: Before adding a person to your analytics team, it’s important to create templates for reporting, centralize data access, and automate recurring reports. I believe a lot of software development practices can be applied to business and analysis. I was inspired by this post on how one programmer built up his position and a team. […]
Lately, I’ve written a few iterations of PySpark code to develop a recommender system (I’ve had some practice building recommender systems in PySpark). I ran into a situation where I needed to generate recommendations on several different datasets. My problem was that I had to decipher some of the prediction documentation. Because of my struggles, […]
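A minimal sketch of generating recommendations in PySpark, assuming the DataFrame-based pyspark.ml ALS API; the column names and toy ratings are illustrative, not the post's actual data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-example").getOrCreate()

# Toy (user, item, rating) triples.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0)],
    ["user", "item", "rating"],
)

als = ALS(userCol="user", itemCol="item", ratingCol="rating", rank=5, maxIter=5, seed=42)
model = als.fit(ratings)

# transform() scores user-item pairs; recommendForAllUsers() returns top-N lists per user.
model.transform(ratings).show()
model.recommendForAllUsers(2).show()
```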
Summary: Writing better-quality data mining code requires you to write code that is self-explanatory and does one thing at a time, well. In terms of analysis, you should be cross-validating and watching for slowly changing relationships in the data. Methodology Quality: Even before you think about writing a piece of code, you should be […]
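A minimal sketch of the cross-validation advice; the post doesn't name a library, so scikit-learn and the iris dataset here are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold is held out once for scoring,
# which guards against judging a model on the data it was fit to.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```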
Summary: XGBoost and ensembles take the Kaggle cake, but they’re mainly used for classification tasks. Some tools, like factorization machines and Vowpal Wabbit, make occasional appearances. Algorithm Frequency: An important question we can ask of this data set is “What tools are being used?” And we can answer this question better than my original anecdote-driven […]
Summary: To stay on top of your personal development, try learning new things: a programming language, an instrument, or exposure to a new field (e.g., biology or accounting). Exposure to new ideas helps you avoid confirmation bias and increases your willingness to explore your analysis further. As an analyst, it’s easy to think you’re […]
Summary: The foreach package provides parallel operations for many packages (including randomForest). Packages like gbm and caret have parallelization built into their functions. Other tools, like bigmemory and ff, handle large datasets through memory management. foreach is the workhorse of parallel processing in R: it uses %dopar% to parallelize tasks and returns […]
My friend, Josh Jacquet, and I competed in the DMA’s 2016 Analytics Challenge (powered by EY) and placed 4th out of the 50 entrants. Given that the majority of the other contestants were agencies vying for a little exposure, I think we did well.