Indy Navarro post about tech

Using EvoTrees.jl for time series prediction

Thu, 02 Mar 2023 00:00:00 +0900

1. Introduction

In this post, I want to show an analysis of a time series that I've been working on. When dealing with time series, it is not common to reach for machine learning algorithms without at least trying more traditional models first, such as the ARIMA family. Still, I wanted to test how well a GBM (gradient boosting machine) model fits this popular kind of problem.

NOTE: I don't recommend starting with models of this type for time series problems. There are simpler models that are easier to understand and less computationally expensive.

2. Dataset Preparation

2.1. Data Extraction

You can find the repository here. I prototyped the code you will see in this post in notebooks/tutorial.jl.

Now we start by making the corresponding imports.

using DataFrames
using Plots
using MLJ
using EvoTrees
using UrlDownload
using ZipFile
using HTTP
using CSV
using Dates
using Statistics
using MLJClusteringInterface
using Clustering
using FreqTables
using StatsPlots
using RollingFunctions
using StatsBase
using ShiftedArrays

There are several libraries in this section, and I must admit it took me some time to learn how to use each one. To start, we can read the dataframe directly from its repository.

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip"
# Download the zip archive to a temporary file
f = download(data_url)
z = ZipFile.Reader(f)
# Index the archive entries by file name
z_by_filename = Dict(file.name => file for file in z.files)
data = CSV.read(z_by_filename["household_power_consumption.txt"], DataFrame)

The dataframe looks more or less like this:

     Row │ Date        Time      Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
         │ String15    Time      String7              String7                String7  String7           String7         String7         Float64?
─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       1 │ 16/12/2006  17:24:00  4.216                0.418                  234.840  18.400            0.000           1.000                     17.0
       2 │ 16/12/2006  17:25:00  5.360                0.436                  233.630  23.000            0.000           1.000                     16.0
       3 │ 16/12/2006  17:26:00  5.374                0.498                  233.290  23.000            0.000           2.000                     17.0
       4 │ 16/12/2006  17:27:00  5.388                0.502                  233.740  23.000            0.000           1.000                     17.0
       5 │ 16/12/2006  17:28:00  3.666                0.528                  235.680  15.800            0.000           1.000                     17.0
       6 │ 16/12/2006  17:29:00  3.520                0.522                  235.020  15.000            0.000           2.000                     17.0
       7 │ 16/12/2006  17:30:00  3.702                0.520                  235.090  15.800            0.000           1.000                     17.0
       8 │ 16/12/2006  17:31:00  3.700                0.520                  235.220  15.800            0.000           1.000                     17.0
       9 │ 16/12/2006  17:32:00  3.668                0.510                  233.990  15.800            0.000           1.000                     17.0
      10 │ 16/12/2006  17:33:00  3.662                0.510                  233.860  15.800            0.000           2.000                     16.0
      11 │ 16/12/2006  17:34:00  4.448                0.498                  232.860  19.600            0.000           1.000                     17.0
      12 │ 16/12/2006  17:35:00  5.412                0.470                  232.780  23.200            0.000           1.000                     17.0
      13 │ 16/12/2006  17:36:00  5.224                0.478                  232.990  22.400            0.000           1.000                     16.0
      14 │ 16/12/2006  17:37:00  5.268                0.398                  232.910  22.600            0.000           2.000                     17.0
    ⋮    │     ⋮          ⋮               ⋮                     ⋮               ⋮            ⋮                ⋮               ⋮               ⋮
 2075246 │ 26/11/2010  20:49:00  0.948                0.000                  238.160  4.000             0.000           1.000                      0.0
 2075247 │ 26/11/2010  20:50:00  1.198                0.128                  238.110  5.000             0.000           1.000                      0.0
 2075248 │ 26/11/2010  20:51:00  1.024                0.106                  238.840  4.200             0.000           1.000                      0.0
 2075249 │ 26/11/2010  20:52:00  0.946                0.000                  239.050  4.000             0.000           0.000                      0.0
 2075250 │ 26/11/2010  20:53:00  0.944                0.000                  238.720  4.000             0.000           0.000                      0.0
 2075251 │ 26/11/2010  20:54:00  0.946                0.000                  239.310  4.000             0.000           0.000                      0.0
 2075252 │ 26/11/2010  20:55:00  0.946                0.000                  239.740  4.000             0.000           0.000                      0.0
 2075253 │ 26/11/2010  20:56:00  0.942                0.000                  239.410  4.000             0.000           0.000                      0.0
 2075254 │ 26/11/2010  20:57:00  0.946                0.000                  240.330  4.000             0.000           0.000                      0.0
 2075255 │ 26/11/2010  20:58:00  0.946                0.000                  240.430  4.000             0.000           0.000                      0.0
 2075256 │ 26/11/2010  20:59:00  0.944                0.000                  240.000  4.000             0.000           0.000                      0.0
 2075257 │ 26/11/2010  21:00:00  0.938                0.000                  239.820  3.800             0.000           0.000                      0.0
 2075258 │ 26/11/2010  21:01:00  0.934                0.000                  239.700  3.800             0.000           0.000                      0.0
 2075259 │ 26/11/2010  21:02:00  0.932                0.000                  239.550  3.800             0.000           0.000                      0.0
                                                                                                                                   2075231 rows omitted

2.2. Dataset Cleaning

As can be seen, it is quite a large dataset. We also notice that the measurement variables are in String format and that Date is not yet a Date, so we first put them in the appropriate form.

# Assumption: rows without measurements hold "?" (missing in Sub_metering_3); drop them so parse does not fail
dropmissing!(data)
for i in 3:8
    data[!,i] = parse.(Float64, data[!,i])
end
data[!,1] = replace.(data[!,1], "/" => "-")
data[!,1] = Date.(data[!,1], "d-m-y")

With the types in place, we can take the opportunity to create new variables, so we have the possibility to obtain relevant information.

#Create a variable combining date and time
date_time = [DateTime(d, t) for (d,t) in zip(data[!,1], data[!,2])]

data[!,:date_time] = date_time

#Create variable for date
data[!,:year] = Dates.value.(Year.(data[!,1]))
data[!,:month] = Dates.value.(Month.(data[!,1]))
data[!,:day] = Dates.value.(Day.(data[!,1]))

#Create variable for time
data[!, :hour] = Dates.value.(Hour.(data[!,2]))
data[!, :minute] = Dates.value.(Minute.(data[!,2]))

#Create variable for weekends
data[!, :dayofweek] = [dayofweek(date) for date in data.Date]
data[!, :weekend] = [day in [6, 7] for day in data.dayofweek]
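To see what these date and time variables look like for a single row, here is a minimal sketch that uses only the Dates standard library. The values come from the first row of the table above; the names raw_date, raw_time and features are only illustrative and are not part of the original notebook.

```julia
using Dates

# Values from the first row of the dataset
raw_date = "16/12/2006"
raw_time = Time(17, 24, 0)

# Parse the day/month/year string and combine it with the time
d = Date(raw_date, dateformat"d/m/y")
dt = DateTime(d, raw_time)

# Calendar features similar to the columns created for the dataframe
features = (
    year = year(d),
    month = month(d),
    day = day(d),
    hour = hour(raw_time),
    minute = minute(raw_time),
    dayofweek = dayofweek(d),            # 1 = Monday, ..., 7 = Sunday
    weekend = dayofweek(d) in (6, 7),
)

println(dt)
println(features)
```

Passing the date format directly to Date avoids the intermediate replace step, although both approaches should give the same result.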

3. Preliminary Visualizations

A classic way to plot all the variables is with the following code:

plot([plot(data[1:50000, :date_time],data[1:50000,col]; label = col, xrot=30) for col in ["Global_active_power",  "Global_reactive_power", "Global_intensity", "Voltage", "Sub_metering_1",  "Sub_metering_2", "Sub_metering_3"]]...)

Figure 1: Line Plot All

Note that we only take a sample of 50,000 data points to avoid overloading the graphs with information. In the same way, we can create histograms.

plot([histogram(data[1:50000, col],label = col, bins = 20 ) for col in ["Global_active_power",  "Global_reactive_power", "Global_intensity", "Voltage", "Sub_metering_1",  "Sub_metering_2", "Sub_metering_3"]]...)

Figure 2: Histogram All

For now, we can recognize that the global variables of the time series show a white-noise-like behavior, and so does Voltage. However, Voltage is the only one whose distribution looks similar to a normal distribution, while the sub-metering variables reflect the use of household appliances.
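A visual impression of white noise can be checked with the sample autocorrelation, which should stay close to zero at every non-zero lag for a white noise series. The function below is a small sketch written for illustration (autocorrelation is a hypothetical helper, not something used in the original notebook), and the two toy series are made up to show the extremes.

```julia
using Statistics

# Sample autocorrelation of x at lag k
# (the denominator uses all n points, as in the usual biased estimator)
function autocorrelation(x::AbstractVector{<:Real}, k::Integer)
    n = length(x)
    μ = mean(x)
    num = sum((x[t] - μ) * (x[t + k] - μ) for t in 1:(n - k))
    den = sum(abs2, x .- μ)
    return num / den
end

# A pure trend is strongly positively autocorrelated
trend = collect(1.0:100.0)
# A series that flips sign at every step is strongly negatively autocorrelated
alternating = [isodd(t) ? 1.0 : -1.0 for t in 1:100]

println(autocorrelation(trend, 1))
println(autocorrelation(alternating, 1))
```

Applied to a column such as data.Voltage, values far from zero at small lags would suggest that the series is not really white noise.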

4. A brief clustering with kmeans

In this section, we are interested in building a clustering model on the time series. The purpose? It is simply a way of evaluating behavior patterns over time. One hypothesis is that consumption patterns are irregular over time, with greater consumption at specific periods of the day or seasons.

Something I was unaware of is that time series clustering is possible and that you can use k-means for it. However, time series cannot be treated from the same perspective as ordinary tabular data, and variants of these algorithms that consider the temporality of neighboring observations should be used. But since this project is just a toy, and the technique is only used for EDA, we will stick with the classical algorithm.

If you want to know more about this topic, you can read this article.

Continuing with the problem, we can cluster by applying the following code.

X = data[!, 3:9]
transformer_instance = Standardizer()
transformer_model = machine(transformer_instance, X)
fit!(transformer_model)
X = MLJ.transform(transformer_model, X);
KMeans = @load KMeans pkg=Clustering
# cluster X into 3 clusters using K-means
kmeans = KMeans(k=3)

mach = machine(kmeans, X) |> fit!

Xsmall = MLJ.transform(mach);
selectrows(Xsmall, 1:4) |> pretty
yhat = MLJ.predict(mach)
data[!,:cluster] = yhat
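The standardization step matters because k-means relies on distances, so a variable on a large scale such as Voltage (around 240) would otherwise dominate variables like Global_reactive_power (below 1). A minimal sketch of the z-score rescaling, using the first five Voltage values from the table above; standardize is an illustrative helper, not the MLJ implementation.

```julia
using Statistics

# Rescale a vector to zero mean and unit standard deviation
standardize(x) = (x .- mean(x)) ./ std(x)

voltage = [234.84, 233.63, 233.29, 233.74, 235.68]
z = standardize(voltage)

println(z)
println((mean(z), std(z)))
```

After this transformation every variable contributes on a comparable scale to the distance between two observations.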

In this case, we have 3 clusters that are ordered as follows.

 Row │ cluster  nrow
     │ Cat…     Int64
─────┼──────────────────
   1 │ 1         741077
   2 │ 2        1257309
   3 │ 3          50894
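To put these counts in perspective, the share of each cluster can be computed directly from the numbers above. This is only a small arithmetic sketch; counts and shares are illustrative names.

```julia
# Cluster sizes as reported in the table above
counts = [741077, 1257309, 50894]

total = sum(counts)
shares = round.(100 .* counts ./ total, digits = 1)

println(total)
println(shares)
```

The third cluster is very small compared with the other two, which is worth keeping in mind when reading the plots that follow.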

And if we try to plot the clusters, we would have the following.

plot([scatter(data[1:20000, :date_time],data[1:20000,col]; group=data[1:20000,:].cluster, size=(1200, 1000), title = col, xrot=30) for col in ["Global_active_power",  "Global_reactive_power", "Global_intensity", "Voltage", "Sub_metering_1",  "Sub_metering_2", "Sub_metering_3"]]...)

Figure 3: scatter plot cluster

It looks a bit confusing, although if we look at the voltage variable, we can already make out a certain trend. For now, let's consider a boxplot of the main variables, split by cluster.

b1 = @df data boxplot(string.(:cluster), :Global_active_power, fillalpha=0.75, linewidth=2, title = "Global active power")
b2 = @df data boxplot(string.(:cluster), :Global_reactive_power, fillalpha=0.75, linewidth=2, title = "Global reactive power")
b3 = @df data boxplot(string.(:cluster), :Global_intensity, fillalpha=0.75, linewidth=2, title = "Global intensity")
b4 = @df data boxplot(string.(:cluster), :Voltage, fillalpha=0.75, linewidth=2, title = "Voltage")


plot(b1, b2, b3, b4 ,layout=(2,2), legend=false)

Figure 4: boxplot cluster

The truth is that we notice only slight differences between the clusters. Each category shows certain consumption patterns, but for some variables these do not necessarily lead us to any conclusion. However, as mentioned at the beginning, the idea of clustering was to study consumption patterns during time intervals, so we add the following.

h1 = heatmap(freqtable(data,:cluster,:dayofweek)./freqtable(data,:cluster), title = "day of week")
h2 = heatmap(freqtable(data,:cluster,:hour)./freqtable(data,:cluster), title = "hour")
h3 = heatmap(freqtable(data,:cluster,:month)./freqtable(data,:cluster), title = "month")
h4 = heatmap(freqtable(data,:cluster,:day)./freqtable(data,:cluster), title = "day")

plot(h1, h2, h3, h4 ,layout=(2,2), legend=false)

Figure 5: heatmap cluster

It might be a bit confusing initially, so let me give an example that might help. Cluster 2 corresponds to the lowest global intensity. If we go to the heatmap that represents the hours, we will see that this pattern of behavior is most present at night, which corresponds to the hours when we are usually sleeping. I hope this makes more sense.

This gives us a slight hint that time frames might be useful, so we'll take this information into account for feature engineering later. Now let's start with the next phase.
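What the heatmaps show is a frequency table where each row (cluster) is divided by its own total, so every row adds up to 1. Here is a minimal sketch of that normalization in plain Julia, with made-up cluster and hour values; row_shares is a hypothetical helper and does not use FreqTables.

```julia
# Toy data: cluster label and hour of six observations (made up for illustration)
clusters = [1, 1, 2, 2, 2, 2]
hours    = [9, 20, 2, 3, 3, 20]

# Count co-occurrences and divide each row by its total
function row_shares(rows, cols)
    rlevels = sort(unique(rows))
    clevels = sort(unique(cols))
    counts = zeros(Int, length(rlevels), length(clevels))
    for (r, c) in zip(rows, cols)
        counts[findfirst(==(r), rlevels), findfirst(==(c), clevels)] += 1
    end
    return counts ./ sum(counts, dims = 2), rlevels, clevels
end

shares, rlevels, clevels = row_shares(clusters, hours)
println(clevels)
println(shares)
```

In this toy example, half of the observations of cluster 2 fall at hour 3, which is the kind of concentration that shows up as a bright cell in the heatmap.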

5. Using EvoTrees for prediction.

For now, we want to predict voltage. I'm not an expert in the field of electricity and consumption, but for a simple exercise, we will use the MLJ library (for Python users, it would be the equivalent of Scikit-Learn). Due to the amount of data and the algorithm we are going to use, training with cross-validation is not practical because it would take too much time, so we will only use a train/test split as a strategy.

Let's generate a lag and cut the data in the following way:

# Shift Voltage by 30 rows; coalesce fills the first 30 values with 0 and keeps a plain Float64 column
data[!, :lag_30] = coalesce.(ShiftedArray(data.Voltage, 30), 0.0)

And to assign the training and testing sets, we use the following.
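To make the lag operation concrete, here is a minimal sketch in plain Julia of what shifting a vector by k positions means. lag_vector is an illustrative helper written for this example; it is not the ShiftedArrays implementation.

```julia
# Shift x forward by k positions; the first k entries have no past value
function lag_vector(x::AbstractVector, k::Integer)
    out = Vector{Union{Missing, eltype(x)}}(missing, length(x))
    out[(k + 1):end] = x[1:(end - k)]
    return out
end

voltage = [234.84, 233.63, 233.29, 233.74]
lagged = lag_vector(voltage, 2)

# Same filling strategy as in the post: replace the missing values with 0
filled = coalesce.(lagged, 0.0)

println(lagged)
println(filled)
```

Note that filling with 0 places those first rows far from the typical voltage level; dropping them instead would be another reasonable option.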

train = copy(filter(x -> x.Date < Date(2010,10,01), data))
test = copy(filter(x -> x.Date >= Date(2010,10,01), data))
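The important property of this split is that it follows time: every training observation comes before every test observation, so the model is never evaluated on the past. A small sketch with a handful of made-up dates around the same cutoff:

```julia
using Dates

# Six consecutive days around the cutoff used in the post
dates = collect(Date(2010, 9, 28):Day(1):Date(2010, 10, 3))
cutoff = Date(2010, 10, 1)

train_idx = findall(<(cutoff), dates)
test_idx = findall(>=(cutoff), dates)

println(dates[train_idx])
println(dates[test_idx])
```

A random split would mix future and past observations, which tends to make the test score look better than it would be in real use.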

Then, we remove some variables that we won't use to train the model, and we save our voltage variable.

select!(train, Not([:Date, :Time, :date_time, :cluster, ]))
select!(test, Not([:Date, :Time, :date_time, :cluster, ]))
y_train = copy(train[!,:Voltage])
y_test = copy(test[!,:Voltage])

Now we are going to apply a cyclical encoder to the data. We have several new variables related to time (month, day, hour, among others), and all of them will be more helpful if we extract their cyclical character. That is why we use a trigonometric transformation.

function cyclical_encoder(df::DataFrame, columns::Union{Array, Symbol}, max_val::Union{Array, Int} )
    for (column, max) in zip(columns, max_val)

        df[:, Symbol(string(column) * "_sin")] = sin.(2*pi*df[:, column]/max)
        df[:, Symbol(string(column) * "_cos")] = cos.(2*pi*df[:, column]/max)
    end
    return df
end

Finally, we can apply this new function to our dataset.

columns_selected = [:day, :year, :month, :hour, :minute, :dayofweek]
max_val = [31, 2010, 12, 23, 59, 7]
train_cyclical = cyclical_encoder(train, columns_selected, max_val)
test_cyclical = cyclical_encoder(test, columns_selected, max_val)
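To illustrate why the trigonometric transformation helps, the sketch below encodes a few hours on the unit circle and compares distances. encode and distance are illustrative helpers. The sketch assumes the period is the number of distinct values (24 for hours); dividing by the largest value (23) instead maps hour 0 and hour 23 to the same point.

```julia
# Map a value with a given period onto the unit circle
encode(value, period) = (sin(2π * value / period), cos(2π * value / period))

# Euclidean distance between two encoded points
distance(a, b) = hypot(a[1] - b[1], a[2] - b[2])

# With a period of 24, 23:00 and 00:00 are neighbors, while 12:00 is opposite to 00:00
d_midnight = distance(encode(23, 24), encode(0, 24))
d_noon = distance(encode(12, 24), encode(0, 24))

# With a period of 23, hour 23 and hour 0 collapse onto the same point
d_collapsed = distance(encode(23, 23), encode(0, 23))

println((d_midnight, d_noon, d_collapsed))
```

On the raw scale, hours 23 and 0 are as far apart as possible, even though they are only one hour away from each other; the encoding restores that closeness.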

Now, let's train the model.

EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees verbosity=0
etr_start = EvoTreeRegressor(max_depth =15)

machreg = machine(etr_start, train_cyclical[!,14:end], y_train);
fit!(machreg);


pred_etr_train = MLJ.predict(machreg, train_cyclical[!,14:end]);
rms_score_train = rms(pred_etr_train, y_train)
println("The rms in train is $rms_score_train")

pred_etr = MLJ.predict(machreg, test_cyclical[!,14:end]);
rms_score = rms(pred_etr, y_test)
println("The rms in test is $rms_score")

This is our result:

The rms in train is 2.5364451392238085
The rms in test is 3.438565163838837

Below we plot the residuals left by our model. We can detect some signs of overfitting, considering that our model has a much better score on the training dataset than on the test dataset. On the other hand, the plots show that our model is biased in its predictions: it is not able to recognize trends.
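For reference, the rms measure is the square root of the mean squared difference between predictions and observed values. The sketch below writes it out by hand on two made-up values (rmse is an illustrative helper, not the MLJ function) and also compares the two scores reported above.

```julia
using Statistics

# Root mean squared error written out explicitly
rmse(pred, y) = sqrt(mean(abs2, pred .- y))

# Toy example: errors of 1 and -3 volts
toy = rmse([240.0, 238.0], [239.0, 241.0])

# Ratio between the test and train scores reported in the post
ratio = 3.438565163838837 / 2.5364451392238085

println(toy)
println(ratio)
```

The ratio shows that the test error is roughly a third larger than the training error.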

Figure 6: prediction

Finally, we can see how the predictions compare to the test data.

Figure 7: pred-vs-real

As confirmed earlier, the prediction does not seem to capture the magnitudes of the voltage in the test set, despite the fact that our variable is fairly stable over time. I trained the model with different parameters, but ultimately none of the options showed a significant improvement.

6. Conclusions

With this small exercise, we only wanted to show that GBM, while a powerful tool and popular in places like Kaggle, requires a certain level of expertise both in the model and in the use case to achieve good performance. A naive approach may not generate results that satisfy the users. Among other things, this requires:

Understanding how to perform feature engineering for a time series, such as obtaining the decomposition of the time series. This can help capture trends that cannot always be obtained solely with the time horizon.
Applying smoothing strategies like moving averages could help recognize the underlying pattern, but then you will need to estimate that moving average into the future.
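As an illustration of the smoothing idea, here is a minimal sketch of a trailing moving average in plain Julia. moving_average is a hypothetical helper written for this example (the RollingFunctions package imported earlier offers similar functionality), and the input series is made up.

```julia
using Statistics

# Trailing moving average with window w: one value per complete window
moving_average(x, w) = [mean(@view x[(i - w + 1):i]) for i in w:length(x)]

series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = moving_average(series, 3)

println(smoothed)
```

The result is shorter than the input because the first w - 1 points do not have a complete window, and to use it as a feature at prediction time the smoothed value itself has to be available or estimated for future timestamps.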

Overall, time series analysis requires a deep understanding of the data, proper preprocessing techniques, feature engineering, and selecting appropriate models that can capture the specific patterns and dynamics of the data.

World Happiness Report - EDA and clustering with Julia

Wed, 23 Nov 2022 00:00:00 +0900

The purpose of this post is to show Julia as a language for data analysis and Machine Learning. Sadly Kaggle does not support Julia Kernels (hopefully, they will add it in the future). Therefore I wanted to take advantage of this space to show a reimplementation of Python/R Notebooks to Julia. In this context, I took data on happiness in countries in 2021 and some factors considered in this exciting survey.

You can get the dataset in Kaggle
The full code is in my Github

1. Packages used

I'm using Julia version 1.8.0 in this project, and the library versions are in the Project.toml. There are some installed packages that I didn't end up using for this analysis, but these are the important ones:

using DataFrames
using DataFramesMeta
using CSV
using Plots
using StatsPlots
using Statistics
using HypothesisTests
Plots.theme(:ggplot2)

Let's start reading the file.

df_2021 = DataFrame(CSV.File("./data/2021.csv", normalizenames=true))

You can see the dataset in the REPL.

julia> df_2021 = DataFrame(CSV.File("./data/2021.csv", normalizenames=true))
149×20 DataFrame
 Row │ Country_name    Regional_indicator            Ladder_score  Standard_error_of_ladder_score  upperwhi ⋯
     │ String31        String                        Float64       Float64                         Float64  ⋯
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Finland         Western Europe                       7.842                           0.032         7 ⋯
   2 │ Denmark         Western Europe                       7.62                            0.035         7
   3 │ Switzerland     Western Europe                       7.571                           0.036         7
   4 │ Iceland         Western Europe                       7.554                           0.059         7
   5 │ Netherlands     Western Europe                       7.464                           0.027         7 ⋯
   6 │ Norway          Western Europe                       7.392                           0.035         7
   7 │ Sweden          Western Europe                       7.363                           0.036         7
   8 │ Luxembourg      Western Europe                       7.324                           0.037         7
   9 │ New Zealand     North America and ANZ                7.277                           0.04          7 ⋯
  10 │ Austria         Western Europe                       7.268                           0.036         7
  11 │ Australia       North America and ANZ                7.183                           0.041         7
  12 │ Israel          Middle East and North Africa         7.157                           0.034         7
  13 │ Germany         Western Europe                       7.155                           0.04          7 ⋯
  14 │ Canada          North America and ANZ                7.103                           0.042         7
  ⋮  │       ⋮                      ⋮                     ⋮                      ⋮                      ⋮   ⋱
 136 │ Togo            Sub-Saharan Africa                   4.107                           0.077         4
 137 │ Zambia          Sub-Saharan Africa                   4.073                           0.069         4
 138 │ Sierra Leone    Sub-Saharan Africa                   3.849                           0.077         4 ⋯
 139 │ India           South Asia                           3.819                           0.026         3
 140 │ Burundi         Sub-Saharan Africa                   3.775                           0.107         3
 141 │ Yemen           Middle East and North Africa         3.658                           0.07          3
 142 │ Tanzania        Sub-Saharan Africa                   3.623                           0.071         3 ⋯
 143 │ Haiti           Latin America and Caribbean          3.615                           0.173         3
 144 │ Malawi          Sub-Saharan Africa                   3.6                             0.092         3
 145 │ Lesotho         Sub-Saharan Africa                   3.512                           0.12          3
 146 │ Botswana        Sub-Saharan Africa                   3.467                           0.074         3 ⋯
 147 │ Rwanda          Sub-Saharan Africa                   3.415                           0.068         3
 148 │ Zimbabwe        Sub-Saharan Africa                   3.145                           0.058         3
 149 │ Afghanistan     South Asia                           2.523                           0.038         2

To see the column names, simply use

names(df_2021)

which returns a vector with all the column names:

julia> names(df_2021)
20-element Vector{String}:
 "Country_name"
 "Regional_indicator"
 "Ladder_score"
 "Standard_error_of_ladder_score"
 "upperwhisker"
 "lowerwhisker"
 "Logged_GDP_per_capita"
 "Social_support"
 "Healthy_life_expectancy"
 "Freedom_to_make_life_choices"
 "Generosity"
 "Perceptions_of_corruption"
 "Ladder_score_in_Dystopia"
 "Explained_by_Log_GDP_per_capita"
 "Explained_by_Social_support"
 "Explained_by_Healthy_life_expectancy"
 "Explained_by_Freedom_to_make_life_choices"
 "Explained_by_Generosity"
 "Explained_by_Perceptions_of_corruption"
 "Dystopia_residual"

The features of this dataset are as follows:

Country_name: Name of the country
Regional_indicator: The region to which the country belongs.
Ladder_score: The English wording of the question is "Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?". This metric represents the average of this response by country.
Standard_error_of_ladder_score: This metric represents the standard error of the Ladder_score metric.
upperwhisker: Refers to the upper part of a box plot of the Ladder_score metric
lowerwhisker: Refers to the lower part of a box plot of the Ladder_score metric
Logged_GDP_per_capita: Logarithm of the GDP per capita
Social_support: Average of the question based on: 'If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?' The response is 0 for 'no' and 1 for 'yes'.
Healthy_life_expectancy: Average lifespan by country, information extracted from the World Health Organization's (WHO) Global Health Observatory data repository.
Freedom_to_make_life_choices: National average of responses to the GWP question "Are you satisfied or dissatisfied with your freedom to choose what you do with your life?"
Generosity: Is the residual of regressing national average of response to the GWP question "Have you donated money to a charity in the past month?" on GDP per capita.
Perceptions_of_corruption: "Is corruption widespread throughout the government or not" and "Is corruption widespread within businesses or not?" The overall perception is just the average of the two 0-or-1 responses.
The variable 'Dystopia' and the explained variables that come from a built-in regression model are not taken into consideration for this project.

To see what is a regional indicator, we can see how every country is grouped.

julia> unique(df_2021.Regional_indicator)
10-element Vector{String}:
 "Western Europe"
 "North America and ANZ"
 "Middle East and North Africa"
 "Latin America and Caribbean"
 "Central and Eastern Europe"
 "East Asia"
 "Southeast Asia"
 "Commonwealth of Independent States"
 "Sub-Saharan Africa"
 "South Asia"

Let's do a simple operation with the dataframe: getting the number of countries by regional indicator and sorting the result.

sort(
    combine(groupby(df_2021, :Regional_indicator), nrow), 
    :nrow
)

Getting this output

julia> sort(
           combine(groupby(df_2021, :Regional_indicator), nrow),
           :nrow
       )
10×2 DataFrame
 Row │ Regional_indicator                 nrow
     │ String                             Int64
─────┼──────────────────────────────────────────
   1 │ North America and ANZ                  4
   2 │ East Asia                              6
   3 │ South Asia                             7
   4 │ Southeast Asia                         9
   5 │ Commonwealth of Independent Stat…     12
   6 │ Middle East and North Africa          17
   7 │ Central and Eastern Europe            17
   8 │ Latin America and Caribbean           20
   9 │ Western Europe                        21
  10 │ Sub-Saharan Africa                    36

With this, we can see that Sub-Saharan Africa has the largest number of countries, while North America and ANZ is the smallest group.

Now, let's try to slice our data. We will create a data frame called float_df that contains only the Float64 variables but excludes the "Explained_" variables. This new dataframe will help us with some operations later.

#Get all columns Float64
float_df = select(df_2021, findall(col -> eltype(col) <: Float64, eachcol(df_2021)))

#Take away the Explained variables
float_df = select(float_df, Not(r"Explained"))

Let's make our first plot.

scatter(
    df_2021.Social_support,
    df_2021.Ladder_score,
    size = (1000,800),
    label="country",
    xaxis = "Social Support",
    yaxis = "Ladder Score",
    title = "Relation between Social Support and Happiness Index Score by country"
)

Figure 1: scatterplot with ladder score and social support

If we want a view of all float variables in several histograms, we can add this code using StatsPlots.

N = ncol(float_df)
numerical_cols = Symbol.(names(float_df,Real))
@df float_df Plots.histogram(cols();
                             layout=N,
                             size=(1400,800),
                             title=permutedims(numerical_cols),
                             label = false)

Figure 2: Histogram of all variables

And if we want to compare it with boxplots:

@df float_df boxplot(cols(), 
                     fillalpha=0.75, 
                     linewidth=2,
                     title = "Comparing distribution for all variables in dataset",
                     legend = :topleft)

Figure 3: Boxplot all variables

Without going into too much detail, we can affirm that the Ladder Score is the variable related to the result of the survey on the degree of happiness in the country (our dependent variable). The Explained variables correspond to the preprocessing used to build the Ladder Score; for this reason, we remove them from the dataframe and keep only the raw data.

What are the top 5 countries and bottom 5?

# Top 5 and bottom 5 countries by ladder score
sort!(df_2021, :Ladder_score, rev=true)
plot(
    bar(
        first(df_2021.Country_name, 5 ),
        first(df_2021.Ladder_score, 5 ),
        color= "green",
        title = "Top 5 countries by Happiness score",
        legend = false,
    ),
    bar(
        last(df_2021.Country_name, 5 ),
        last(df_2021.Ladder_score, 5 ),
        color ="red",
        title = "Bottom 5 countries by Happiness score",
        legend = false,
    ),
    size=(1000,800),
    yaxis = "Happiness Score",
)

Figure 4: top5 and bottom 5

And the classic heatmap for correlation with the following function.

function heatmap_cor(df)
    cm = cor(Matrix(df))
    cols = Symbol.(names(df))

    (n,m) = size(cm)
    display(
    heatmap(cm, 
        fc = cgrad([:white,:dodgerblue4]),
        xticks = (1:m,cols),
        xrot= 90,
        size= (800, 800),
        yticks = (1:m,cols),
        yflip=true))
    display(
    annotate!([(j, i, text(round(cm[i,j],digits=3),
                       8,"Computer Modern",:black))
           for i in 1:n for j in 1:m])
    )
end

Figure 5: heatmap
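The matrix that the heatmap displays comes from cor applied to the columns of a numeric matrix. A tiny sketch with made-up numbers shows its basic properties: it is symmetric, has ones on the diagonal, and a column that is an exact multiple of another one has a correlation of 1 with it.

```julia
using Statistics

# Toy matrix: the second column is exactly twice the first one
X = [1.0 2.0 9.0;
     2.0 4.0 7.0;
     3.0 6.0 8.0;
     4.0 8.0 1.0]

cm = cor(X)

println(round.(cm, digits = 3))
```

For the real data, the function defined above would be called as heatmap_cor(float_df), assuming float_df contains only numeric columns without missing values.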

And now, we can build a function where we can get the mean ladder score by regional indicator and compare it with the distribution of all countries.

function distribution_plot(df)
    display(
        @df df density(:Ladder_score,
        legend = :topleft, size=(1000,800) , 
        fill=(0, .3,:yellow),
        label="Distribution" ,
        xaxis="Happiness Index Score", 
        yaxis ="Density", 
        title ="Comparison Happiness Index Score by Region 2021") 
    )
    display(
        plot!([mean(df.Ladder_score)],
        seriestype="vline",
        line = (:dash), 
        lw = 3,
        label="Mean")
    )
    for element in unique(df.Regional_indicator)
        display(
            plot!(
            [mean(filter(row->row["Regional_indicator"]==element, df).Ladder_score)],
            seriestype="vline",
            lw = 3,
            label="$element") 
        )
    end
end

Figure 6: distribution region

Suppose we want to try the same idea but with countries. In that case, we can take advantage of multiple dispatch and create a function that receives a list of countries and creates a variation of the distribution with countries.

function distribution_plot(df, var_filter, list_elements)
    display(
        @df df density(:Ladder_score,
        legend = :topleft, size=(1000,800) , 
        fill=(0, .3,:yellow),
        label="Distribution" ,
        xaxis="Happiness Index Score", 
        yaxis ="Density", 
        title ="Happiness index score compare by countries 2021") 
    )
    display(
        plot!([mean(df.Ladder_score)],
        seriestype="vline",
        line = (:dash), 
        lw = 3,
        label="Mean")
    )
    for element in list_elements
        display(
            plot!(
            [mean(filter(row->row[var_filter]==element, df).Ladder_score)],
            seriestype="vline",
            lw = 3,
            label="$element") 
        )
    end
end
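As a reminder of how multiple dispatch works here: both definitions share the name distribution_plot, and Julia picks the method from the number and types of the arguments. A minimal sketch of the same idea with a toy function (describe_scores is made up for illustration):

```julia
# One-argument method
describe_scores(scores) = "mean = $(sum(scores) / length(scores))"

# Three... no, two-argument method with the same name; it reuses the first one
describe_scores(scores, label) = "$label: " * describe_scores(scores)

println(describe_scores([1.0, 3.0]))
println(describe_scores([1.0, 3.0], "Chile"))
println(length(methods(describe_scores)))
```

The function has two methods, and each call is routed to the one whose signature matches.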

Let's test our new function, comparing three countries.

distribution_plot(df_2021, "Country_name", ["Chile",
                                            "United States",
                                            "Japan",
                                           ])

Figure 7: distribution countries

Here we can see how the USA has the highest score, followed by Chile and Japan.

To end the first part, let's apply some statistical tests. We will use an equal variance t-test to compare the distributions of different regions. The function is as follows.

# Perform a simple test to compare distributions
# This function performs a two-sample t-test of the null hypothesis that s1 and s2 
# come from distributions with equal means and variances 
# against the alternative hypothesis that the distributions have different means 
# but equal variances.
function t_test_sample(df, var, x , y)
    x = filter(row ->row[var] == x, df).Ladder_score
    y = filter(row ->row[var] == y, df).Ladder_score
    EqualVarianceTTest(vec(x), vec(y))
end

We will have this output if we compare Western Europe and North America and ANZ.

t_test_sample(df_2021, "Regional_indicator", "Western Europe", "North America and ANZ")

julia> t_test_sample(df_2021, "Regional_indicator", "Western Europe", "North America and ANZ")
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -0.213595
    95% confidence interval: (-0.9068, 0.4796)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.5301

Details:
    number of observations:   [21,4]
    t-statistic:              -0.6374218416101513
    degrees of freedom:       23
    empirical standard error: 0.3350924366753546

We don't have enough evidence to reject the hypothesis that these samples come from distributions with equal means and variances. On the other hand, if we compare Western Europe with South Asia, we see this:

julia> t_test_sample(df_2021, "Regional_indicator", "South Asia", "Western Europe")
Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -2.47305
    95% confidence interval: (-3.144, -1.802)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           <1e-07

Details:
    number of observations:   [7,21]
    t-statistic:              -7.576776118465833
    degrees of freedom:       26
    empirical standard error: 0.32639840222022687

In this case, we can reject that hypothesis.
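
For reference, the quantities in both outputs are consistent with the standard pooled (equal-variance) two-sample t statistic. This is the textbook formula rather than anything taken from the package source, and the symbols $n_x$, $n_y$, $s_x^2$, $s_y^2$ are notation introduced here for the sample sizes and sample variances:

```latex
s_p^2 = \frac{(n_x - 1)\, s_x^2 + (n_y - 1)\, s_y^2}{n_x + n_y - 2},
\qquad
t = \frac{\bar{x} - \bar{y}}{s_p \sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}},
\qquad
\mathrm{df} = n_x + n_y - 2
```

As a sanity check against the first output: $\mathrm{df} = 21 + 4 - 2 = 23$, and dividing the point estimate by the empirical standard error gives $-0.213595 / 0.335092 \approx -0.6374$, which is the reported t-statistic.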

2. Clustering

Now we will cluster the countries using the popular k-means algorithm. My first option was Clustering.jl. However, to determine the ideal number of clusters with the elbow method I needed the WCSS (within-cluster sum of squares), so I used the scikit-learn wrapper instead. I also include an issue about this. Let's continue with the last part. I started by adding some libraries.

using Random
using ScikitLearn
using PyCall

@sk_import preprocessing: StandardScaler
@sk_import cluster: KMeans

Let's take out from the float_df all the variables related to Ladder_score, and keep only the variables considered in the survey.

select!(float_df, Not([:Standard_error_of_ladder_score, 
                           :Ladder_score, 
                           :Ladder_score_in_Dystopia, 
                           :Dystopia_residual]))

To train our model, we need to standardize the data, and then we will create a list to retrieve the wcss in every iteration. The function is as follows:

function kmeans_train(df)
    # Standardize every column before clustering
    X = fit_transform!(StandardScaler(), Matrix(df))

    # WCSS (scikit-learn's inertia_) for k = 1, ..., 10
    wcss = Float64[]
    for n in 1:10

        Random.seed!(123)
        cluster = KMeans(n_clusters=n,
                         init = "k-means++",
                         max_iter = 20,
                         n_init = 10,
                         random_state = 0)
        cluster.fit(X)
        push!(wcss, cluster.inertia_)
    end
    return wcss
end

Let's invoke the function and plot the wcss.

wcss = kmeans_train(float_df)

plot(wcss, title = "WCSS by number of clusters",
    xaxis = "Number of clusters",
    yaxis = "WCSS")

Figure 8: Elbow Method
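
The WCSS plotted above is what scikit-learn exposes as inertia_: the sum of squared distances from each sample to its closest cluster centre. In the notation below (introduced here, not present in the original code), $C_k$ is the set of points assigned to cluster $k$ and $\mu_k$ is its centroid:

```latex
\mathrm{WCSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

WCSS generally decreases as $K$ grows, so the elbow method looks for the value of $K$ after which the decrease flattens out.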

In this case, I decided to go for three clusters. We can make use of multiple dispatch again, adding an argument n for a defined number of clusters.

function kmeans_train(df, n)
    X = fit_transform!(StandardScaler(), Matrix(df))

    Random.seed!(123)
    cluster =KMeans(n_clusters=n,
                    init = "k-means++",
                    max_iter = 20,
                    n_init = 10,
                    random_state = 0)
    cluster.fit(X)
    return cluster
end

cluster= kmeans_train(float_df, 3)

If we take the first plot we did at the beginning of the post, but now we add the cluster labels, we have this plot.

scatter(filter(row ->row.cluster ==1,df).Social_support, filter(row ->row.cluster ==1,df).Ladder_score, title = "Distribution of Happiness Score by Cluster", xaxis = "Social Support", yaxis="Ladder Score", label = "Cluster 1", legend = :topleft)
scatter!(filter(row ->row.cluster ==3,df).Social_support, filter(row ->row.cluster ==3,df).Ladder_score,  label = "Cluster 2")
scatter!(filter(row ->row.cluster ==2,df).Social_support, filter(row ->row.cluster ==2,df).Ladder_score,  label = "Cluster 3")

Figure 9: Scatter with cluster

Here are the countries in each of the 3 clusters:

Cluster 1: Australia, Austria, Canada, Denmark, Estonia, Finland, France, Germany, Hong Kong S.A.R. of China, Iceland, Ireland, Luxembourg, Malta, Netherlands, New Zealand, Norway, Singapore, Sweden, Switzerland, United Arab Emirates, United Kingdom, United States, Uzbekistan.

Cluster 2: Albania, Argentina, Armenia, Azerbaijan, Bahrain, Belarus, Belgium, Bolivia, Bosnia and Herzegovina, Brazil, Bulgaria, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czech Republic, Dominican Republic, Ecuador, El Salvador, Greece, Guatemala, Honduras, Hungary, Israel, Italy, Jamaica, Japan, Kazakhstan, Kosovo, Kuwait, Kyrgyzstan, Latvia, Libya, Lithuania, Malaysia, Maldives, Mauritius, Mexico, Moldova, Mongolia, Montenegro, Nicaragua, North Cyprus, North Macedonia, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Romania, Russia, Saudi Arabia, Serbia, Slovakia, Slovenia, South Korea, Spain, Taiwan Province of China, Tajikistan, Thailand, Turkey, Turkmenistan, Ukraine, Uruguay, Venezuela, Vietnam.

Cluster 3: Afghanistan, Algeria, Bangladesh, Benin, Botswana, Burkina Faso, Burundi, Cambodia, Cameroon, Chad, Comoros, Congo (Brazzaville), Egypt, Ethiopia, Gabon, Gambia, Georgia, Ghana, Guinea, Haiti, India, Indonesia, Iran, Iraq, Ivory Coast, Jordan, Kenya, Laos, Lebanon, Lesotho, Liberia, Madagascar, Malawi, Mali, Mauritania, Morocco, Mozambique, Myanmar, Namibia, Nepal, Niger, Nigeria, Pakistan, Palestinian Territories, Rwanda, Senegal, Sierra Leone, South Africa, Sri Lanka, Swaziland, Tanzania, Togo, Tunisia, Uganda, Yemen, Zambia, Zimbabwe.

histogram(filter(row ->row.cluster ==1,df).Ladder_score, label = "cluster 1", title = "Distribution of Happiness Score by Cluster", xaxis = "Ladder Score", yaxis="n° countries")
histogram!(filter(row ->row.cluster ==3,df).Ladder_score, label = "cluster 2")
histogram!(filter(row ->row.cluster ==2,df).Ladder_score, label = "cluster 3")

Figure 10: histogram happiness cluster

Finally, we can compare how each variable is distributed across the clusters.

@df float_df Plots.density(cols();
                             layout=N,
                             size=(1600,1200),
                             title=permutedims(numerical_cols),
                             group = df.cluster,
                             label = false)

Figure 11: Distribution by variables with cluster

3. Conclusions

From my experience using Python for about two years in data analysis and recently dabbling with Julia, I can say that the ecosystem generally seems quite mature for this purpose. I had some questions that the community immediately answered on Julia Discourse. More content like this is needed so that the data science community can more widely adopt this technology.

Creating your own blog with Julia and Franklin

Wed, 16 Aug 2023 00:00:00 +0900

In this post, we are going to discuss how to build your own blog with Julia and Franklin.jl, a popular static site generator among Julia users who create their own blogs or even build websites for tutorials. I hope that if you are reading this entry and you don't have your own space, it can motivate you to build your own website.

1. Some Reasons to Create Your Own Blog

Blogs may sound old-fashioned, something created by people who are still living in the 90s, typing with passion about the political system while listening to Soundgarden in the background and drinking some kind of cheap beer… or programmers. Since you are reading this, you probably belong at least to the second group, so you should consider that having a blog is a nice way to:

Track your progress in your field
Generate content that can be useful for somebody else
Help the open-source community with diffusion, tutorials, etc.
Create your own space and adapt it to your needs
Build your personal brand and help you to find a job

But why Franklin? Franklin is one of the most popular libraries for this purpose in Julia. It offers seamless integration with running Julia scripts, so you can use Julia for demonstrations in your blog, which could be harder with other static site generators. If you only want to create basic entries with some code and images, Franklin.jl might not be that different from Hugo or Jekyll.

2. Installation

The first step is to create a folder where you will save your project. Once you are ready, open the Julia REPL in the location where the folder should be. When it's ready, type ] to activate the package manager and then type:

(@v1.9) pkg> add Franklin

Then, return to the Julia REPL (press backspace to leave the package manager) and import the library:

julia> using Franklin

Remember to make sure you have successfully installed the Franklin library before trying to import it.

3. First Steps

To create your website, you can choose one of the templates available. In my case, I just used the basic one, but if you have a different preference, feel free to go ahead; they all follow similar structures. You can also import another template that you like more and adapt it to your website. Please read the documentation for instructions on how to do this.

3.1. Selecting a template

Once you have decided on your template, type the following instruction in the REPL:

julia> newsite("myBlog", template="basic") #you can choose another name and template

This will create a folder with various directories and elements. It will also activate the environment inside the project. So, if you verify the project with ], it should display the name of your project.

.
├── 404.md            # Page for error 404
├── Manifest.toml     # The typical toml files for Julia development project
├── Project.toml
├── __site            # The generated full website
├── _assets           # You can add pictures and images here
├── _css              # Everything related to styling your website
├── _layout           # Everything related to the structure of your website
├── _libs             # Libraries used by the website, like KaTeX, the search bar, etc.
├── _rss              # A couple of files related to the RSS feed
├── config.md         # Set Global variables for your website
├── index.md          # Main landing page
├── pages.md          # All your pages / you can create your folder or organize in different way
└── utils.jl          # Julia File for setting some configurations

Finally type:

julia> serve()

It should open your website locally in the browser, and it should look exactly the same as the template website you chose.

Figure 1: starting template

From this point, it's time to delete some files and content. You might also want to add some pages for your projects, about, contact, etc. This is up to you, but for now, we are going to keep just 2 pages: one for the main "about" page and another to host all your posts.

3.2. Cleaning the template

Now, go to the "index.md" page and delete all its content. This page will become your main page, and you can mix HTML and Markdown in this file to add whatever you want to it.

# Welcome to my blog
## I am using Franklin

~~~
    
    
    

~~~


This is an introductory message

You might have noticed that in our main page, there are four links to different pages. You can choose to keep those links or delete them all. However, for the purpose of creating a blog section, let's use one of those links. To do that, follow these steps:

Go to the "header.html" file located in the "_layout" folder.
Modify the code in the "header.html" file to something like this:

<header>
<div class="blog-name"><a href="/">Amazing Blog</a></div>
<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/menu1/">Blog</a></li>
  </ul>
  <img src="/assets/hamburger.svg" id="menu-icon">
</nav>
</header>

If you're looking to change the background color to something more interesting than white, now is the time to showcase your frontend skills. Follow these steps:

Navigate to the "franklin.css" file.
In the first block of code, add the background color that you prefer. For instance:

:root {
  --block-background: hsl(0, 0%, 94%);
  --output-background: hsl(0, 0%, 98%);
  --small: 14px;
  --normal: 19px;
  --text-color: hsv(0, 0%, 20%);
  background-color: aqua;
}

Finally, after making these modifications, the result should look something like this:

Figure 2: frontend

3.3. Creating your first post

Now, if you're ready to start your own blog, here's how you can set up the "posts" folder for your articles: create a new folder named "posts" in the same root directory as your other folders. It is important to consider the following:

Inside the "posts" folder, you can add all your articles. You have the flexibility to use both Markdown files and HTML files for your articles.
If you're doing literate programming with tools like Pluto or Jupyter, you can export your notebooks to HTML format and place them in the "posts" folder. This way, anyone can easily view your data science projects.

For now, let's add a file called test1.md inside the posts folder and add some text to it:

# This is a title in my first post

So I can write anything

## Here is an introduction

We are going to write some code:

using LinearAlgebra
a = [1, 2, 3, 3, 4, 5, 2, 2]
@show dot(a, a)
println(dot(a, a))

Then, go to the menu1.md file, erase the remaining content, and add a Markdown link that points to the test1 post (/posts/test1/).

If you save it, and navigate to http://localhost:8000/posts/test1/, you should see your post displayed clearly. This page will include your "about" section and the space to write your blog content. Congratulations! You now have a basic understanding of how Franklin works and can make any further edits or modifications you desire.

If you wish to further style your website, please go ahead and customize it to your heart's content.

4. Deployment

Now it's time to host your website somewhere. One of the most straightforward options is using GitHub. Here's how you can do it:

Create a Repository: Go to your GitHub account and create an empty repository. When entering the name of your project, you have two paths to choose from:
1. If this is a personal website or organization, the name of your project should be something like username.github.io.
2. You can create your own custom name for your project, like myblog.

If you're unsure which option to choose, I recommend going with option 1 because it's more straightforward. If you choose option 2, you'll need to define a prepath variable in your config.md with the name of that project. For instance: @def prepath = "myblog".

Upload Your Project: Now upload your project to GitHub, following the instructions in your repository.
Configure GitHub Pages: Once you've pushed your project, go to the Settings tab in your repository. Then navigate to GitHub Pages. In the Source dropdown, select gh-pages. If you see a message indicating success, your project is now live.
Check Your Website: You can now open your web browser and enter the link of your project, which would be username.github.io. If you can see your website, congratulations! Your blog is now live on the internet.

By following these steps, you've successfully hosted your Franklin-generated website on GitHub Pages. It's now accessible to anyone with the link, and you can share your content with the world.

4.1. Hosting in a different domain (optional)

If you're hesitant to share your GitHub username due to its lengthy or unconventional extension, or if you prefer a more professional-looking link, you might want to consider an alternative domain, such as .com or .dev. You can purchase a domain and link it to your website. For example, you can use services like Google Domains to find and purchase a domain that suits your preference.

Once you've found and acquired the domain you like, you can proceed to link it to your website. To do this, you need to configure the DNS settings. You can find detailed explanations about custom domains and GitHub Pages in the documentation. In a nutshell, follow these steps:

Go to Google Domains, select your domain, and navigate to the DNS section.
Configure the DNS records, as shown below:

Figure 3: dns_setup

After correctly setting up the DNS records, go to your GitHub project repository's settings, then navigate to Pages and enter your custom domain:

Figure 4: custom_domain

If everything is set up correctly, GitHub will confirm the configuration. In a few minutes, your website should become accessible via your new custom domain.

By following these steps, you'll be able to link a custom domain to your Franklin-generated website, providing a more personalized and professional web presence.

5. RSS and Tags

Now that your website is up and running, setting up an RSS feed is important for people who want to stay updated on your new articles without having to visit your website daily. Tools like Newsboat or Inoreader help users keep track of updates from various websites, making an RSS feed a valuable addition to your blog.

Thankfully, Franklin makes setting up an RSS feed quite simple. All you need to do is go to each page in your "posts" folder and add a small description within +++ brackets, like this:

+++
tags = ["Julia", "Writing"] 

rss_title = "Creating your own blog with Julia and Franklin"
rss_description = "Describing the steps to create your own blog, so you can stop posting your code on Instagram"
rss_pubdate = Date(2023, 8, 10) 
+++

The RSS fields you add will be included in the information extracted by applications like Newsboat. From these applications, a reader can see the title, a brief description, the publication date, and the full content if it's available. Additionally, you'll notice a "tags" field. This is also important because it allows users to filter by topic. For example, if you write posts about topics ranging from Julia programming to analysis of Shakira's new songs, users can select the topics they're specifically interested in.

To share your blog's RSS feed, you'll need a URL like https://www.yourdomain.com/feed.xml. Make sure to display this URL prominently on your website so that readers can easily find and subscribe to your feed.

5.1. Submit your feed to JuliaBloggers (optional)

Lastly, if you're considering writing about Julia and want to contribute to the community, don't hesitate to share your work. Whether it's a calculator project, a website, a 2D game, or a cutting-edge machine learning algorithm, your contributions will help the Julia community grow and provide valuable insights for others to learn from.

Visit the JuliaBloggers website and add your information. In the "Feed URL" field, you can use a URL like:

http://indymnv.dev/tag/julia/feed/

Once you've submitted this information, every time you publish a new post on your website, the community will be able to see it. If you want to test this process first, you can use an RSS reader like Newsboat or Inoreader to ensure that your updates are being picked up as expected.

6. Conclusions

I hope you enjoyed reading this article. If you haven't yet created your own website, I hope it serves as motivation to get started, whether you choose to use Franklin or another static site generator. Having your own online space to write about your interests and dive as deep as you like is a rewarding endeavor. Don't hesitate to embark on this journey and create a platform that showcases your passion and expertise. Happy blogging!

7. Acknowledgment

I also want to thank Thibaut Lienart, who is the main developer of Franklin. His work has been incredibly beneficial for the community.

How to scrape data with Python using selenium and Pandas

Thu, 15 Dec 2022 00:00:00 +0900

1. Introduction

In this tutorial, I will explain how to do web scraping on a platform that requires dynamic interaction with the web application. This is quite useful when the data has to be obtained from different links within the platform and the front-end components must be driven to reach it.

There are mainly two essential libraries here. The first is selenium, a framework that is available for multiple languages and serves to automate and control the browser. The second is Pandas, for data manipulation, which will allow us to read data tables directly.

The Beautiful Soup library is often used to extract HTML elements from the web, but as we will see, it is not necessary in this case.

For this example, I am going to use the Chilean dairy production platform, which is used to obtain information on the production of dairy products from different factories nationwide.

2. Requirements

To start, you must have Python installed; in my case, I am using version 3.9. You also need a browser (Firefox or Chrome) installed. In this project I will use Chrome, but the code should be similar for other browsers. To work with selenium, you also have to download the [executable](https://chromedriver.chromium.org/downloads) (driver) that corresponds to your browser and its version.

If you use pip you can install it using:

pip install -U selenium

Note that the snippets below use the find_element_by_* helpers, which were removed in Selenium 4.3, so you need an older version of selenium to run them unchanged.

Then import the libraries:

from selenium import webdriver
import pandas as pd
import lxml
from selenium.webdriver.support.ui import Select
import sys
import time

Once you have imported the corresponding libraries, we will perform the first test with the chromedriver executable (the one you downloaded earlier). For simplicity, I recommend keeping it in the same directory as this scraper.

driver = webdriver.Chrome('/Your/path/to/the/project/chromedriver')
driver.get("http://aplicativos.odepa.cl/recepcion-industria-lactea.do")

This should open the web page in the Chrome browser. The driver variable, to which we have assigned the chromedriver, will control the state of our browser. We can now add this code snippet:

time.sleep(5)
driver.quit()

With this, we add a wait of 5 seconds, and with driver.quit() we close the browser. The reason for adding waiting times is that, while we operate within the browser, internet connections or the latency of the web platform may delay the page, so we have to wait for the elements we need to become available.
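
A fixed time.sleep either waits longer than needed or not long enough. A common alternative is to poll until a condition holds, which is the idea behind Selenium's own WebDriverWait. The following is a minimal, stdlib-only sketch of that idea: wait_until and element_is_ready are hypothetical names used only for illustration, and no browser is involved.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the element is available" (an assumption for this demo):
# it becomes true on the third check.
attempts = {"count": 0}


def element_is_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(element_is_ready, timeout=2.0, poll_interval=0.01)
print(attempts["count"])  # 3
```

With a real driver, the condition could be something like lambda: driver.find_elements_by_id('planta'), which returns an empty (falsy) list until the element exists.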

It is time to see how we can start interacting with the web page elements. For example, if we want to click on a certain component, we right-click it on the web page, choose Inspect, and then work out how the element can be identified: by id, name, XPath, etc. I often use the XPath, which you can copy and paste into your code.

#Select elements
driver.find_element_by_id('tipoConsulta2').click()
driver.find_element_by_id('filterByRegionOrPlanta2').click()
driver.find_element_by_id('filterByRegionOrPlanta2').click()

#Extract the list of years
driver.find_element_by_xpath('//*[@id="divFechaDetalleMensual"]/img').click()
driver.find_element_by_xpath('//*[@id="ui-datepicker-div"]/div[1]/div/select').click()
years = driver.find_elements_by_tag_name("option")

Here what we have done is open the web page and make the necessary selections and filters to access the data, we end up creating a list called years, where we will have all the years available in this web application.

With this, we can get the values of the elements using the following code.

list_years = []
for year in years:
    list_years.append(year.get_attribute('value'))

#here I added a filter by year which is optional (you can delete it)
list_years = [element for element in list_years if element != '' and int(element)> 2000]

Now we have the list of all the years, which we can iterate over. Then, if we want to get the plants, we can use the following:

#Extract all the factory names:
plantasposibles=driver.find_element_by_id('planta')
plantasposibles=plantasposibles.find_elements_by_tag_name("option")
valoresplantas=[]
nombresplantas=[]

for option in plantasposibles:
    valoresplantas.append(option.get_attribute("value"))
    nombresplantas.append(option.get_attribute("text"))

We locate the dropdown that corresponds to the list of available plants, with this, we take the elements and build the list of plants. This will allow us to perform the following iteration:

tabla=pd.DataFrame() #Here we create the dataframe

driver.find_element_by_xpath('//*[@id="divFechaDetalleMensual"]/img').click()

for lastyear in list_years:
    for i in range(1,len(valoresplantas)):
        ...

We need to start controlling the options and generate the report with the data. From there, we read and extract the data, and this is where Pandas shines. If we remember the last double loop, what should go inside is the following.

#Select options
driver.execute_script("document.getElementById('planta').value="+ valoresplantas[i])
driver.find_element_by_xpath("//*[@id='divFechaDetalleMensual']/img").click()
time.sleep(1)
select=Select(driver.find_element_by_xpath("//*[@id='ui-datepicker-div']/div[1]/div/select"))
select.select_by_visible_text(str(lastyear))
driver.find_element_by_xpath("//*[@id='ui-datepicker-div']/div[2]/button").click()
driver.find_element_by_id('fechaDetalleMensual').send_keys(lastyear)
timeout=15  # note: this only assigns a variable, it does not pause the script
driver.find_element_by_id('btnVerInforme').click()
timeout=20  # use time.sleep(...) here if the report needs time to load

############################## PANDAS #######################################

prueba_html=driver.page_source
df = pd.read_html(prueba_html, flavor='html5lib')[0]
df=df.drop(df.columns[14:397],axis=1)
df=df.drop(df.index[0:8],axis=0)
df=df.drop(df.index[1],axis=0)
df=df.drop(df.index[8:9],axis=0)
df['Year']=lastyear
df['Factory_Name']=nombresplantas[i]
tabla=pd.concat([tabla,df])

In case it fails, which is typical when working with selenium, try/except blocks are the best way to handle exceptions. How to use them depends a lot on the case and the nature of the project, but here I simply reload the application, redo the selections, and continue with the next iteration. To summarize this point, the double loop would look like this:

for lastyear in list_years:
  for i in range(1,len(valoresplantas)):
      try: 
          driver.execute_script("document.getElementById('planta').value="+ valoresplantas[i])
          driver.find_element_by_xpath("//*[@id='divFechaDetalleMensual']/img").click()
          time.sleep(1)
          select=Select(driver.find_element_by_xpath("//*[@id='ui-datepicker-div']/div[1]/div/select"))
          select.select_by_visible_text(str(lastyear))        
          driver.find_element_by_xpath("//*[@id='ui-datepicker-div']/div[2]/button").click()
          driver.find_element_by_id('fechaDetalleMensual').send_keys(lastyear)
          timeout=15
          driver.find_element_by_id('btnVerInforme').click()
          timeout=20


          prueba_html=driver.page_source
          df = pd.read_html(prueba_html, flavor='html5lib')[0]
          df=df.drop(df.columns[14:397],axis=1)
          df=df.drop(df.index[0:8],axis=0)
          df=df.drop(df.index[1],axis=0)
          df=df.drop(df.index[8:9],axis=0)
          df['Year']=lastyear
          df['Factory_Name']=nombresplantas[i]
          tabla=pd.concat([tabla,df])

      except Exception:

          #If it fails, reload the page and redo the selections
          #driver.quit()
          time.sleep(5)
          driver.get("http://aplicativos.odepa.cl/recepcion-industria-lactea.do")
          time.sleep(5)
          driver.find_element_by_id('tipoConsulta2').click()
          driver.find_element_by_id('filterByRegionOrPlanta2').click()
          driver.find_element_by_id('filterByRegionOrPlanta2').click()
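
In the loop above, a combination that fails is skipped once the page has been reloaded. If you would rather retry the same year and plant, one option is to wrap the scraping step in a small helper. This is a sketch and not part of the original script: run_with_retry, flaky_scrape and reload_page are hypothetical names, and the flaky function only simulates a selenium failure.

```python
import time


def run_with_retry(task, recover, max_attempts=3, pause=0.0):
    """Call `task()`; on failure call `recover()` and try again.

    Returns the task's result, or re-raises the last error once
    `max_attempts` attempts have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            recover()          # e.g. reload the page and redo the selections
            time.sleep(pause)  # give the page time to settle


# Simulated flaky step (an assumption for this demo): fails twice, then succeeds.
calls = {"task": 0, "recover": 0}


def flaky_scrape():
    calls["task"] += 1
    if calls["task"] < 3:
        raise RuntimeError("element not found")
    return "table for plant 1"


def reload_page():
    calls["recover"] += 1


result = run_with_retry(flaky_scrape, reload_page)
print(result, calls)
```

In the real scraper, task would contain the body of the try block and recover the body of the except block.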

3. Final steps with Pandas

With this, we are almost done with the process. The only thing left is to tidy the final data and save the dataframe. Since the months are in separate columns, we will reshape them into a single Month column so that each record is tied to its period.

tabla=tabla[['Year', 'Factory_Name', 'Product', 'Unit','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']]

lista=range(len(tabla.index))
tabla.index=lista

tablafinal=pd.DataFrame()
tablaparcial=tabla.drop(tabla.columns[4:],axis=1)

for month in tabla.columns[4:len(tabla.columns)]:

    tablaparcial['Month']=month
    tablaparcial['Quantity']=tabla[month]
    tablafinal=pd.concat([tablafinal,tablaparcial])

tablafinal.to_csv("data.csv", index = False)

With this, we have done the extraction and a simple preprocessing step, leaving the data ready for analysis or for saving to a database.

4. Conclusions

In this project, we showed how to perform scraping using selenium and pandas. This is possible thanks to the pandas tools for reading tables from HTML, which simplify the extraction. Selenium is an excellent tool for this kind of automation and for testing web pages, so I also recommend it when developing web apps, for example to check for failures in the results or scenarios where bugs are likely.

Notes about Functional Programming with Julia

Sat, 10 Aug 2024 00:00:00 +0900

I am writing down here some general ideas about functional programming, taken from sources like boot.dev. Many of these sources were written in Python, and in most cases I just rewrote the examples in Julia. Because Julia is a language well suited to FP, I considered it a good long-term exercise to translate the concepts I am learning about this paradigm.

1. What is Functional Programming

Compose functions instead of mutating state.
Describe what you want to happen rather than how you want it to happen.

1.1. Immutability

Once a value is created it can't be changed, which can make programs easier to debug.

1.2. Declarative

Functional programming aims to be declarative rather than imperative.

1.3. Math Style

imperative style

function get_average(nums)
    total = 0
    for num in nums
        total += num
    end
    return total / length(nums)
end

functional style

function get_average(nums)
    return sum(nums) / length(nums)
end

In general, to write in a more functional style, we should avoid loops and avoid mutating variables.

Classes encourage you to think about the world as a hierarchical collection of objects. Objects bundle behavior, data, and state together in a way that draws boundaries between instances of things, like chess pieces on a board.

Functions encourage you to think about the world as a series of data transformations. Functions take data as input and return a transformed output. For example, a function might take the entire state of a chess board and a move as inputs, and return the new state of the board as output.

OOP is not exactly the opposite of FP, but of its four pillars (abstraction, encapsulation, inheritance and polymorphism), inheritance is the one that tends to produce changes in classes, which breaks the rule of immutability in FP.

1.4. Functions are First Class

We can treat functions as values

function add(x,y)
    return x+ y
end

addition = add

println(addition(2,7))

# prints 9

1.4.1. Anonymous Functions

Basically, these are functions that don't have a name, similar to lambda functions in Python.

function filter_var(df, value)
    return filter!(row -> row.column != value, df)
end

In the last case, row -> row.column != value is an anonymous function.

1.4.2. Higher Order Functions

When a programming language treats functions like any other value, that is, when functions are first class, we can pass functions as arguments to other functions.

function square(x)
    return x * x
end

function my_map(func, arg_list)
    result = []
    for i in arg_list
        push!(result, func(i))
    end
    return result
end

squares = my_map(square, [1, 2, 3, 4, 5])
println(squares)
# [1, 4, 9, 16, 25]

In this example, my_map() is a higher-order function.

Map, Filter and Reduce
Map, filter and reduce are three typical examples of higher-order functions that are quite useful. A map function takes a function and an iterable (an object capable of returning its members one at a time) and applies the function to every element of that iterable.
```
function say_hello(name)
    return "Hello " * name
end

list_names = ["Chris", "Hector", "Benito"]

map(say_hello, list_names)
# ["Hello Chris", "Hello Hector", "Hello Benito"]
```
Filter was already shown in an earlier example: it takes a function and an iterable, and returns an iterable that is a subset of the original.
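A small sketch of the non-mutating filter, assuming a plain array of integers as input (is_even is a helper defined just for this example):

```julia
numbers = [1, 2, 3, 4, 5, 6, 7, 8]

# The predicate returns true for the elements we want to keep.
is_even(n) = n % 2 == 0

# filter (without the !) returns a new array and leaves the input untouched.
evens = filter(is_even, numbers)

println(evens)    # [2, 4, 6, 8]
println(numbers)  # [1, 2, 3, 4, 5, 6, 7, 8]
```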

Finally, the reduce function takes the same arguments, but it reduces everything to a single value, as in the following example:
```
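function add(sum_so_far, x)
    println("sum_so_far: $sum_so_far, x: $x")
    return sum_so_far + x
end

numbers = [1, 2, 3, 4]
# named total so that it does not shadow Base.sum;
# foldl(add, numbers) would guarantee the left-to-right order shown here
total = reduce(add, numbers)

# sum_so_far: 1, x: 2
# sum_so_far: 3, x: 3
# sum_so_far: 6, x: 4

println(total)

# 10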
```
These higher-order functions allow us to write code without loops in some cases, avoiding stateful iteration and mutation of variables.
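To illustrate, here is a rough sketch that chains the three functions to compute the sum of the squares of the even numbers, with no explicit loop and no variable that gets mutated (the input array is made up for the example):

```julia
numbers = [1, 2, 3, 4, 5, 6]

# Each step builds a new value instead of modifying an existing one.
evens = filter(iseven, numbers)   # keep the even numbers: [2, 4, 6]
squares = map(x -> x^2, evens)    # square each one: [4, 16, 36]
total = reduce(+, squares)        # add everything up

println(total)  # 56
```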

1.5. Pure Functions

Pure functions must satisfy two properties:

They always return the same value given the same arguments.
Running them causes no side effects.

Figure 1: pure

function findMax(nums)
    max_val = -Inf
    for num in nums
        if max_val < num
            max_val = num
        end
    end
    return max_val
end

Let's compare with this other case

# instead of returning a value
# this function modifies a global variable
global_max = -Inf

function findMax(nums)
    global global_max
    for num in nums
        if global_max < num
            global_max = num
        end
    end
end

In the first case we have a function that clearly defines an input and returns an output, while in the second case the function changes the state of a global variable (breaking the rule of immutability) and returns nothing. In summary, pure functions:

Return the same result when given the same input, so they are deterministic (no randomness is involved in producing the result). Such functions are also called referentially transparent.
Do not change the external state of the program. For example, they do not change any variables outside of their scope.
Do not perform any I/O operations such as printing, accessing data via HTTP or reading files.
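A minimal sketch of the first property, using two made-up functions (area and noisy_area are only for illustration):

```julia
# Pure: the result depends only on the arguments.
area(width, height) = width * height

# Impure: the result also depends on hidden random state, so two calls
# with the same arguments can return different values.
noisy_area(width, height) = width * height + rand()

# Referential transparency: a call to a pure function can be replaced
# by its value without changing the meaning of the program.
println(area(3, 4))                             # 12
println(area(3, 4) + area(3, 4) == 12 + 12)     # true
```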

1.6. Reference and Value

Some functions receive their arguments by reference: the arguments are mutable and the function has access to the original value, as you can see when appending values to a list. On the other hand, a function that receives its arguments by value gets a copy of the original and does not change the original (immutability). In Julia you can use deepcopy(var) to create copies.
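The difference can be sketched as follows. Julia passes arguments by sharing, so a mutable argument such as an array can be changed by the callee; the two functions below are hypothetical helpers written for this example:

```julia
# Mutating version: push! changes the caller's array.
# By Julia convention, the names of mutating functions end with `!`.
function add_item!(items, item)
    push!(items, item)
    return items
end

# Non-mutating version: works on a copy and leaves the input alone.
function add_item(items, item)
    new_items = copy(items)
    push!(new_items, item)
    return new_items
end

cart = ["apple", "pear"]

bigger = add_item(cart, "kiwi")
println(cart)    # ["apple", "pear"]
println(bigger)  # ["apple", "pear", "kiwi"]

add_item!(cart, "kiwi")
println(cart)    # ["apple", "pear", "kiwi"]
```

For a flat array of strings copy is enough; deepcopy is the safer choice when the elements themselves are mutable.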

1.6.1. Pass by Reference Impurity

To avoid side effects, we can create copies of the variables inside a function, so that nothing outside its scope is changed (this includes the inputs of the function).

function remove_format(default_formats, old_format)
    new_formats = deepcopy(default_formats)
    new_formats[old_format] = false
    return new_formats
end

With this we avoid mutating any input or global variable, which makes the code easier to debug and test.
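A quick way to check this, assuming default_formats is a Dict that maps a format name to whether it is enabled (the keys "csv" and "json" are made up for the example):

```julia
function remove_format(default_formats, old_format)
    new_formats = deepcopy(default_formats)
    new_formats[old_format] = false
    return new_formats
end

# Hypothetical input data for the sketch.
default_formats = Dict("csv" => true, "json" => true)

new_formats = remove_format(default_formats, "csv")

println(default_formats["csv"])  # true: the input was not modified
println(new_formats["csv"])      # false
```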

1.7. Input and Output

While I/O operations make functions impure, they are necessary (otherwise our program would be completely useless), so try to use them only when necessary.
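One common way to do that, sketched here with two hypothetical functions, is to keep the computation in a pure function and push the I/O to a thin outer function:

```julia
# Pure core: no printing, no file access, easy to test.
function summarize(values)
    return (count = length(values), mean = sum(values) / length(values))
end

# Impure shell: the only place that performs I/O.
function report(values)
    stats = summarize(values)
    println("n = $(stats.count), mean = $(stats.mean)")
end

report([2, 4, 6])  # prints: n = 3, mean = 4.0
```

With this split, summarize can be tested without capturing any printed output.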

1.8. NO-OP

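Functions that do nothing, or better said, that don't return anything, are probably impure functions.

function square(x)
    x * x
    return nothing
end

That function doesn't do anything useful: it computes a value and then discards it. (In Julia the last expression of a function is returned implicitly, so the explicit return nothing is what makes this one a no-op.) There are also functions that return nothing useful but perform some side effect: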

y = 5
function add_to_y(x)
    global y
    y += x
end

add_to_y(3)
# y = 8

Even the print() function is technically impure, because printing is a side effect.

1.9. Memoization

Memoization means storing a copy of the result of a computation so we don't have to compute it again in the future. It is a trade-off between memory and speed, and it can only be applied safely to pure functions.

const fibmem = Dict{Int,Int}()
function fib(n)
    get!(fibmem, n) do
        n < 3 ? 1 : fib(n-1) + fib(n-2)
    end
end
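The same idea can be written as a higher-order function. The sketch below is only an assumption of how one might do it: memoize is a hypothetical helper (not a library function), and the call counter exists only to show how often the wrapped function really runs:

```julia
# Wrap a one-argument pure function and cache its results by argument.
function memoize(f)
    cache = Dict{Any,Any}()
    # get! runs the zero-argument function only when the key is missing.
    return x -> get!(() -> f(x), cache, x)
end

# Demo-only counter; a real pure function would not have it.
call_count = Ref(0)

function slow_square(x)
    call_count[] += 1
    return x * x
end

fast_square = memoize(slow_square)

println(fast_square(12))  # 144, computed
println(fast_square(12))  # 144, read from the cache
println(call_count[])     # 1
```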

2. Recursion

A recursive function is a function that calls itself, for example the classic factorial. This kind of function is quite useful for traversing tree structures of unknown depth.

function factorial_rec(x)
    if x == 0 
        return 1
    else
        return x * factorial_rec(x - 1)
    end
end

julia> factorial_rec(0)
1

julia> factorial_rec(3)
6

Recursive functions have some dangerous edge cases that deserve attention:

They require a base case to avoid infinite recursion.
Each function call requires a bit of memory, so very deep tree structures can cause a stack overflow and crash your program.
In some languages, such as Python, recursion is slower than loops. Tail call optimization, where the language supports it, can help with that.
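As a sketch of recursion over a structure of unknown depth, the following function adds up every number in an arbitrarily nested array (nested_sum and the sample data are made up for this example):

```julia
function nested_sum(x)
    if x isa Number
        # Base case: a leaf of the tree.
        return x
    else
        # Recursive case: sum the results of each child.
        # init = 0 makes empty arrays return 0 instead of raising an error.
        return mapreduce(nested_sum, +, x; init = 0)
    end
end

tree = [1, [2, 3], [4, [5, [6]]]]
println(nested_sum(tree))  # 21
```

The base case is what stops the recursion; without it every call would try to iterate further.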

2.1. Function Transformations

Function transformations are a specific type of higher-order function that receives functions as input and returns functions as output, which is especially useful for code reuse.

function multiply(x, y)
    return x * y
end

function add(x, y)
    return x + y
end

# self_math is a higher order function
# input: a function that takes two arguments and returns a value
# output: a new function that takes one argument and returns a value
function self_math(math_func)
    function inner_func(x)
        return math_func(x, x)
    end
    return inner_func
end

square_func = self_math(multiply)
double_func = self_math(add)

println(square_func(5))
# prints 25

println(double_func(5))
# prints 10

2.2. Closures

A closure is a function that references variables from outside its own function body. The function definition and its environment are bundled together into a single entity, so a closure can change values defined outside its body.

Figure 2: closure

julia> function make_adder(amount)
           function add(x)
               return x + amount
           end
       end;

julia> add_one = make_adder(1);

julia> add_two = make_adder(2);

julia> 10 |> add_one
11

julia> 10 |> add_two
12

In the case of Julia, global variables can cause type instability, and there are some discussions about avoiding closures when performance is required. However, that doesn't mean that closures should be avoided completely. There are a lot of discussions here, and also interesting content here.

Naturally, if a function can change a non-local variable then it is not a pure function, so in many cases closures are not pure functions, because they can mutate state outside of their scope and have side effects.
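A minimal sketch of such an impure closure (make_counter is a made-up example): the inner function reassigns a variable captured from the enclosing function, so calling it twice with the same arguments gives different results.

```julia
function make_counter()
    calls = 0
    function increment()
        calls += 1      # reassigns the captured variable
        return calls
    end
    return increment
end

counter = make_counter()
println(counter())  # 1
println(counter())  # 2

# Each closure has its own environment.
other = make_counter()
println(other())    # 1
```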

Note that some languages, such as Python, also have the concept of decorators, which are just syntactic sugar for higher-order functions.
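Julia has no decorator syntax, but the same effect can be sketched with a plain higher-order function. Here with_logging is a hypothetical wrapper written for this example, not something from a library:

```julia
# Takes a function and returns a new function that logs its arguments
# before delegating to the original one.
function with_logging(f)
    function wrapped(args...)
        println("calling with arguments: ", args)
        return f(args...)
    end
    return wrapped
end

logged_add = with_logging(+)
println(logged_add(2, 3))
# calling with arguments: (2, 3)
# 5
```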

2.3. Currying

Function currying is a specific kind of function transformation where we translate a single function that accepts multiple arguments into multiple functions that each accept a single argument.

Figure 3: currying

This is a normal function without currying (named add rather than sum, to avoid clashing with Julia's built-in sum):

function add(a, b)
    return a + b
end

With currying:

function curried_add(a)
    function inner_sum(b)
        return a + b
    end
    return inner_sum
end

With this option we can now return a function as a value (inner_sum) and change its signature so that it conforms to a specific parameter list.
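A short usage sketch, with the curried function named curried_add for this example: the first call fixes one argument, and the function it returns can then be passed around like any other value.

```julia
function curried_add(a)
    function inner_sum(b)
        return a + b
    end
    return inner_sum
end

# Call it one argument at a time.
println(curried_add(2)(3))  # 5

# Or keep the partially applied function and reuse it, for example with map.
add_ten = curried_add(10)
println(map(add_ten, [1, 2, 3]))  # [11, 12, 13]
```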

2.4. Wrapping up

These are just basic ideas about functional programming. There are more concepts to deal with, but at least this is a starting point for people like me who are not CS people…