Study Notes

Proof of Concept: Visualization Methods

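Before building any charts, I run a proof of concept for the visualization methods. The goal is to confirm that the data can be visualized meaningfully and to work out which chart types convey the intended message best.
For the initial browsing and checking of the data I use Power BI, which lets me focus on the content of the data without being distracted by plotting syntax. Once the data structure is confirmed, the actual visualization stage switches to Python with Plotly + Panel, to support interactivity and embedding in web pages.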

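Side note: some personal experience
- Power BI is really convenient for early data exploration; dragging fields into charts to check distributions is very intuitive.
- When I later wanted to put the result on a website, I found that embedding Power BI requires a paid account, so I dropped it.
- Plotly's interactivity works well, and with Panel I can switch between data and chart pages, which gives much more flexibility in a live demo.
- Following other people's suggestions, I settled on three tools that cover most visualization scenarios: static (matplotlib), interactive (plotly), and GUI (Power BI).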

Full code

Environment & Data

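import pandas as pd
import sqlite3
import panel as pn # Builds the interactive dashboard and handles data updates
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots # Builds the multi-chart subplot canvas

# Datasets already loaded (brief listing only)
responses                        # Raw response data
responses_single_choice          # Single-choice question data (preprocessed)
kaggle_question_reference_table  # Question reference table (maps columns across years)
salary_order                     # Salary ordering
country_area                     # Country and region classification
coding_exp_years_order           # Ordering of coding-experience ranges
prog_lang_skill_group            # Programming-language skill groups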

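Univariate

Data processing

This step takes a category name, pulls every column that belongs to that category, and reshapes the result into long format. To make later plotting easier, it merges in some grouping columns (such as job title group and salary group), keeps only responses from the Data-related job group, and truncates long answer strings into a short_label for display in charts. Each question's data is stored in a dict so it can be retrieved individually for plotting.

Processing steps:
1. Extract the question columns for the given category
2. Merge in the job-title group and salary group
3. Filter down to the Data-related job group
4. Reshape into long format and split it into one table per question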

# Split the response data by question within a category
def data(c): # c is a category name such as '基礎輪廓分析'; the function returns the answers to the questions in that category

    # One DataFrame per question, stored in a dict so each can be retrieved for plotting later
    contents_dict  = {}

    # Question columns that belong to the category
    col = kaggle_question_reference_table.loc[kaggle_question_reference_table['分類'] == c, 'col_eng']

    # Select the matching responses
    df = responses[responses['question_index'].isin(col)]

    # Merge in the group columns
    df = df.merge(responses_single_choice[['id', 'job_title_group', 'salary_group']], on='id', how='left')

    # Keep only one group (here: Data-related)
    df = df[df['job_title_group'] == 'Data-related']

    # salary_group only applies to the salary question, so drop wrong_info rows for that question only
    df = df[~((df['salary_group'] == 'wrong_info') & (df['question_index'] == 'salary'))]

    # Drop the two group columns (merged in last), then count respondents per answer, year, and question
    df = df.iloc[:,:-2].groupby(['response', 'surveyed_in', 'question_index']).count().reset_index().rename(columns={'id':'count'}).sort_values('count')

    # Shorten long answers into short_label
    df['short_label'] = df['response'].apply(lambda x: x[:10] + '...' if len(x) > 12 else x)

    # Store one DataFrame per question, keyed by its display name in the '欄位' column
    for i in col:
        x = kaggle_question_reference_table.loc[kaggle_question_reference_table['col_eng'] == i, '欄位'].values[0]
        contents_dict[x] = df[df['question_index'] == i]

    return contents_dict
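
Since data() depends on the survey tables, the four processing steps can also be tried on toy data. The sketch below is only an illustration under stated assumptions: toy_responses, toy_groups, and toy_split are hypothetical names, and every value is made up.

```python
import pandas as pd

# Toy stand-ins for the survey tables (hypothetical; all values are made up).
toy_responses = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "surveyed_in": [2021, 2021, 2021, 2021, 2022, 2022],
    "question_index": ["country", "education"] * 3,
    "response": ["Taiwan", "Master's degree", "Taiwan",
                 "Bachelor's degree", "Japan", "Master's degree"],
})
toy_groups = pd.DataFrame({
    "id": [1, 2, 3],
    "job_title_group": ["Data-related", "Data-related", "Other"],
})


def toy_split(responses, groups, questions):
    # 1. Keep only the requested questions
    df = responses[responses["question_index"].isin(questions)]
    # 2. Merge in the group column
    df = df.merge(groups, on="id", how="left")
    # 3. Keep only the Data-related group
    df = df[df["job_title_group"] == "Data-related"]
    # 4. Count respondents per answer, year, and question
    counts = (
        df.groupby(["response", "surveyed_in", "question_index"])["id"]
        .count()
        .reset_index(name="count")
        .sort_values("count")
    )
    # Shorten long answers, using the same rule as short_label in the notes
    counts["short_label"] = counts["response"].apply(
        lambda s: s[:10] + "..." if len(s) > 12 else s
    )
    # One table per question
    return {q: counts[counts["question_index"] == q] for q in questions}


tables = toy_split(toy_responses, toy_groups, ["country", "education"])
print(tables["country"])
print(tables["education"])
```

Respondent 3 is not in the Data-related group, so the 2022 answers drop out and only the 2021 rows remain.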

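The surveyed_in column determines whether the data spans multiple years. Multi-year data is aggregated across years and drawn as a pie chart; single-year data keeps that year's responses and is drawn as a horizontal bar chart.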

# Data processing
def data_process(d):
    d = d.reset_index(drop=True)

    is_multi_year = len(d['surveyed_in'].drop_duplicates()) > 1
    # Pie chart: with multiple years, aggregate across years and keep the top 9 + other
    if is_multi_year:
        d = d.groupby(['response','question_index','short_label']).sum().sort_values('count').reset_index()
        if len(d) > 9:
            # Fold everything outside the top 9 into a single 'other' row
            x = {}
            for i in d.columns:
                if i == 'response':
                    x[i] = ['other']
                elif i == 'count':
                    x[i] = [d[i][:-9].sum()]
                else:
                    x[i] = [d[i][0]]
            d = pd.concat([d[-9:], pd.DataFrame(x)], axis=0)
        d = d.reset_index(drop=True).sort_values('count')

        # Percentage of the total
        d['count_pct'] = round((d['count'] / d['count'].sum()) * 100, 2)

        # Build the text label as a string, e.g. 1000人 (50.0 %)
        d['text'] = d.apply(lambda row: f"{row['count']}人 ({row['count_pct']:.1f} %)", axis=1)
        return d.drop(columns='surveyed_in').reset_index(drop=True)

    # Bar chart: with a single year, take the top 9 directly
    else:
        # Percentage of the total
        d['count_pct'] = round((d['count'] / d['count'].sum()) * 100, 2)

        # Build the text label as a string, e.g. 1000人 (50.0 %)
        d['text'] = d.apply(lambda row: f"{row['count']}人 ({row['count_pct']:.1f} %)", axis=1)
        return d[-9:]
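
The "top 9 + other" step can also be written as a small standalone helper. This is a sketch under assumptions, not the code used in these notes: top_n_with_other is a hypothetical name, and it only expects a table with response and count columns. The label format mirrors the text column above.

```python
import pandas as pd


def top_n_with_other(counts: pd.DataFrame, n: int = 9) -> pd.DataFrame:
    """Keep the n largest rows by 'count' and fold the rest into one 'other' row."""
    ordered = counts.sort_values("count", ascending=False).reset_index(drop=True)
    top = ordered.head(n)[["response", "count"]]
    rest = ordered.iloc[n:]

    # Only add the 'other' row when something is actually left over
    if not rest.empty:
        other = pd.DataFrame({"response": ["other"], "count": [rest["count"].sum()]})
        top = pd.concat([top, other], ignore_index=True)

    # Ascending order, so the largest bar ends up at the top of a horizontal bar chart
    out = top.sort_values("count").reset_index(drop=True)
    out["count_pct"] = (out["count"] / out["count"].sum() * 100).round(2)
    out["text"] = out.apply(
        lambda r: f"{r['count']}人 ({r['count_pct']:.1f} %)", axis=1
    )
    return out


# Made-up counts for illustration
toy_counts = pd.DataFrame({
    "response": ["A", "B", "C", "D", "E"],
    "count": [50, 30, 10, 6, 4],
})
result = top_n_with_other(toy_counts, n=3)
print(result)
```

With n=3, rows D and E are merged into one other row, and the percentages are computed after merging, so they still add up to 100.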

Proof of concept: horizontal bar chart

Chart 1: Single-year horizontal bar chart

```python
# Load the data dict
df = data('基礎輪廓分析')

# Data for this chart: 2021 responses to the 「國家」 (country) question, counted and drawn as a horizontal bar chart of the top 9 answers.
df_country = df['國家']
df_2021 = df_country[df_country['surveyed_in'] == 2021]
process_df = data_process(df_2021)

# Build the chart (horizontal bar chart)
fig = px.bar(
    process_df,
    x='count',
    y='response',
    orientation='h',      # Horizontal bars
    title='2021年',       # Chart title
    width=800,            # Canvas width
    height=500            # Canvas height
)

# update_traces: the data layers of the chart
fig.update_traces(
    texttemplate=process_df['text'],            # Show the data labels
    textposition='auto'                         # Let Plotly place the text inside or outside the bar
)

# update_layout: overall layout, axes, title, legend, and size
fig.update_layout(
    xaxis_title=None, # Remove the x-axis label
    yaxis_title=None  # Remove the y-axis label
)

# Show the chart
fig.show()

# Export the chart as HTML
fig.write_html('data_scientists_toolbox/橫條圖.html', auto_open=True)
```

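Proof of concept: pie chart

The responses to the 「國家」 (country) question are combined across all years, totalled, and drawn as a pie chart to show the overall distribution. This chart helps compare the share of respondents from each country in the whole sample.

Chart 2: Multi-year total pie chart (圓餅圖.html)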

# Aggregate every year of the 「國家」 (country) question (multi-year input, so the pie chart logic applies automatically)
data_all = data_process(df['國家'])

fig = px.pie(data_all,  
            names='response', 
            values='count', 
            hole=.5,              # Donut style, for a cleaner look
            width=800,            # Canvas width
            height=500            # Canvas height
            )

fig.update_traces(
    texttemplate=data_all['text'], # Label format: 1000人 (50.0 %), from the text column precomputed by data_process()
    textposition='outside'         # Place the labels outside the slices
)

# Show the chart
fig.show()

# Export the chart as HTML
fig.write_html('data_scientists_toolbox/圓餅圖.html', auto_open=True)

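Proof of concept: combined bar and pie charts

To show year-to-year differences and the overall structure at the same time, the bar charts focus on the top answers of each year, while the pie chart shows the cross-year total. The figure uses make_subplots() to build a 2x2 grid: the first three cells hold the horizontal bar charts for 2020 to 2022, and the bottom-right cell holds a pie chart of the combined distribution. subplot_titles gives every cell its own title so each chart is easy to identify.

Chart 3: Per-year charts + combined pie (多圖合併.html)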

# Create the canvas
fig = make_subplots(
    rows=2, cols=2,
    specs=[[{'type': 'xy'}, {'type': 'xy'}],
          [{'type': 'xy'}, {'type': 'domain'}]],          # Type of each cell (xy for line and bar charts, domain for pie charts)
    subplot_titles=["2020", "2021", "2022", "總計"],      # Subplot titles
    horizontal_spacing = 0.1,                             # Horizontal gap between subplots (0 to 1)
    vertical_spacing = 0.1                                # Vertical gap between subplots (0 to 1)
)

'''
| type value  | Description                                    |
| ----------- | ---------------------------------------------- |
| `'xy'`      | Line, bar, scatter, and other common 2D charts |
| `'domain'`  | Pie charts (Pie), indicators (Indicator)       |
| `'scene'`   | 3D charts (scatter, surface)                   |
| `'polar'`   | Polar charts                                   |
| `'ternary'` | Ternary charts                                 |
'''

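# Per-year tables for the 「國家」 question
# (assumed to be built the same way as the 2021 table in Chart 1; data_all comes from Chart 2)
df_country = df['國家']
data_2020 = data_process(df_country[df_country['surveyed_in'] == 2020])
data_2021 = data_process(df_country[df_country['surveyed_in'] == 2021])
data_2022 = data_process(df_country[df_country['surveyed_in'] == 2022])

# Add the bar charts
fig.add_trace(go.Bar(y=data_2020['response'], 
                    x=data_2020['count'], 
                    orientation='h', 
                    text=data_2020['text'],
                    hovertemplate = ''), 
              row=1, col=1)

fig.add_trace(go.Bar(y=data_2021['response'],
                    x=data_2021['count'], 
                    orientation='h', 
                    text=data_2021['text'],
                    hovertemplate = ''), 
              row=1, col=2)

fig.add_trace(go.Bar(y=data_2022['response'],
                    x=data_2022['count'], 
                    orientation='h', 
                    text=data_2022['text'],
                    hovertemplate = ''), 
              row=2, col=1)


# Add the pie chart
fig.add_trace(go.Pie(labels=data_all['response'], 
                    values=data_all['count'], 
                    text=data_all['count'],
                    hole=.5, # Donut style, for a cleaner look
                    hovertemplate = ''), 
              row=2, col=2)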

fig.update_traces(
    textposition='outside', # Pie chart text position: move the value labels outside the slices
    row=2, col=2
)

fig.update_layout(
    xaxis_title=None,   # Remove the x-axis label
    yaxis_title=None    # Remove the y-axis label
)

# Update the layout
fig.update_layout(title_text="年度資料分析互動圖表", title_font_size = 24)

# Show the chart
fig.show()

# Export the chart as HTML
fig.write_html('data_scientists_toolbox/多圖合併.html', auto_open=True)

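Data update test

I first tried to operate on the original dataframe (responses) directly, but the chart failed to update during Panel interaction. Preprocessing the data into a dict solved the problem, and it also taught me about Panel's limits on data binding and a better way to work with it.

Chart 4: Interactive bar chart with year switching (interactive_dashboard.html)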

    # Mock data
    data_dict = {
        '2020': pd.DataFrame({'category': ['A', 'B', 'C'], 'value': [100, 200, 300]}),
        '2021': pd.DataFrame({'category': ['A', 'B', 'C'], 'value': [120, 180, 260]}),
        '2022': pd.DataFrame({'category': ['A', 'B', 'C'], 'value': [90, 150, 240]})
    }

    # Dropdown widget
    year_selector = pn.widgets.Select(name='年份', options=list(data_dict.keys()), value='2020')

    # Chart display area
    plot_pane = pn.pane.Plotly()

    # Callback: rebuild the chart whenever the dropdown changes
    def update_plot(event):
        df = data_dict[year_selector.value]
        fig = px.bar(df, x='category', y='value', title=f"{year_selector.value} 年資料")
        plot_pane.object = fig # Assign the new figure to the chart pane

    # Register the callback
    year_selector.param.watch(update_plot, 'value')  # Run update_plot() whenever the value of year_selector changes

    # Draw the chart once at start-up
    update_plot(None)

    # Assemble the page
    app = pn.Column("# 年度資料分析互動圖表", year_selector, plot_pane)

    app.show()
    # Export as HTML (embed=True stores the widget states in the file)
    app.save('練習專案三：資料科學家的工具箱/data_scientists_toolbox/interactive_dashboard.html', embed=True)

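Multivariate

Line chart

This chart shows how the median salary in each continent changes with coding experience. The X axis is coding experience (years), the Y axis is the median salary, and each continent is drawn as its own line. The data comes from respondents with valid salary answers and Data-related jobs, restricted to countries classified as 「高收入」 (high income) or 「中高收入」 (upper-middle income). The chart can be used to check whether the relationship between coding experience and median salary differs by continent, and whether some regions show fast early growth followed by early saturation.

Chart 5: Line chart (洲別_經驗_薪資.html)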

# Build the line chart with go.Scatter, one trace per continent
def go_scatter(line_df_group, file_path_name):
    areas = ['Asia', 'Europe', 'North America', 'Oceania', 'South America']

    # Every line uses the same styling: markers plus the median value as a text label
    traces = [
        go.Scatter(
            x = line_df_group['coding_exp_years'],
            y = line_df_group[area],
            mode = "lines+markers+text",
            name = area,
            textposition = "top center",
            line = dict(width=3),
            text = line_df_group[area]
        )
        for area in areas
    ]

    layout = go.Layout(
        title = '薪資 & 經驗 & 洲別',
        title_font_size = 30,
        xaxis = dict(title='程式經驗 (年)', tickfont=dict(size=10)),
        yaxis = dict(title='年薪(中位數)', tickfont=dict(size=10)),  # The y axis is the salary median
        margin = dict(l=50, r=50, t=60, b=60),
        showlegend = True
    )

    fig = go.Figure(data = traces, layout = layout)
    fig.show()
    fig.write_html(file_path_name, auto_open=True)

# Load and prepare the data
def line_data():
    # Select the columns needed
    line_df = responses_single_choice[['id','surveyed_in', 'coding_exp_years','country', 'salary', 'salary_group', 'job_title_group']]

    # Conditions: salary_group = correct_info and job_title_group = Data-related
    col = (line_df['salary_group'] == 'correct_info') & (line_df['job_title_group'] == 'Data-related')

    # Merge salary_order
    line_df = line_df[col].merge(salary_order, on='salary', how='left').rename(columns={'rank':'salary_rank'})

    # Merge country_area
    line_df = line_df.merge(country_area[['country', 'gdp_group', 'area']], on='country', how='left')

    # Condition: gdp_group in ['高收入', '中高收入']
    line_df = line_df[line_df['gdp_group'].isin(['高收入', '中高收入'])]

    # Merge coding_exp_years_order (its rank column orders the experience ranges)
    line_df = line_df.merge(coding_exp_years_order, on='coding_exp_years', how='left')

    # Median salary per experience range and continent
    line_df_group = line_df[['rank', 'coding_exp_years', 'area', 'salary_mean']].groupby(['rank', 'coding_exp_years', 'area']).median().reset_index().rename(columns={'salary_mean':'salary_median'})

    # Pivot to wide format: one column per continent, used to draw one line each
    line_df_group = line_df_group.pivot(index=['rank', 'coding_exp_years'], columns='area', values='salary_median').reset_index()

    return line_df_group

file_path_name = 'data_scientists_toolbox/洲別_經驗_薪資.html'
go_scatter(line_data(), file_path_name)
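
The median and pivot steps at the end of line_data() are easier to follow on a tiny table. The sketch below uses made-up numbers. toy, long_df, and wide_df are hypothetical names, and the column names follow the ones used in the notes.

```python
import pandas as pd

# Made-up respondents: experience range (with its sort rank), continent, and salary
toy = pd.DataFrame({
    "rank": [1, 1, 1, 2, 2, 2],
    "coding_exp_years": ["< 1 years"] * 3 + ["1-3 years"] * 3,
    "area": ["Asia", "Asia", "Europe", "Asia", "Europe", "Europe"],
    "salary_mean": [10000, 20000, 30000, 25000, 40000, 60000],
})

# Long format: one row per (experience range, continent) with the median salary
long_df = (
    toy.groupby(["rank", "coding_exp_years", "area"])["salary_mean"]
    .median()
    .reset_index(name="salary_median")
)

# Wide format: one column per continent, so each column can become one line
wide_df = long_df.pivot(
    index=["rank", "coding_exp_years"], columns="area", values="salary_median"
).reset_index()

print(wide_df)
```

Keeping rank in the index means the rows stay in experience order rather than in the alphabetical order of the range labels.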

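Bubble chart

This is a multivariate bubble chart that shows, for each job title (job_title) within each annual salary range, the average number of skills, the median annual salary, and the sample size.
- X axis: average number of programming languages learned per job title (lang_count_mean)
- Y axis: median annual salary (salary_median)
- Bubble size: sample size of the group (count)
- Color: job title (job_title)
- Hover text: average years of coding experience (coding_exp_year_mean)
The chart can be used to check whether job titles with more skills earn higher salaries, and how salary relates to experience and skill count across job titles.

Chart 6: Bubble chart (年薪_技能(數量)_職稱.html)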

def px_scatter(scatter_df, file_path_name):
    fig = px.scatter(
            scatter_df,                       # Source DataFrame
            x="lang_count_mean",              # Number of skills (mean)
            y="salary_median",                # Annual salary (median)
            size="count",                     # Bubble size follows the number of respondents
            color="job_title",                # One color per job title
            hover_name="text",                # Shown on hover (經驗 = 3.5 year)
            size_max=50,                      # Maximum bubble size
            range_x=[1.5, 5.5],               # x-axis range
            range_y=[0, 200000],              # y-axis range
            log_x=True                        # Log scale on the x axis, which spreads out densely packed values
        )

    fig.update_layout(title_text='年薪 & 技能(數量) & 職稱', 
                    title_font_size = 24,
                    xaxis=dict(
                        title=dict(text='學習程式語言數量 的平均'),
                        gridcolor='white',
                        type='log',
                        gridwidth=2,
                    ), 
                    yaxis_title=None) # Hide the y-axis label
    fig.show()
    fig.write_html(file_path_name, auto_open=True) # Save the file

def scatter_data():
    # Select the columns needed
    scatter_df = responses_single_choice[['id','surveyed_in', 'coding_exp_years', 'job_title','country', 'salary', 'salary_group', 'job_title_group']]

    # Conditions: salary_group = correct_info and job_title_group = Data-related
    col = (scatter_df['salary_group'] == 'correct_info') & (scatter_df['job_title_group'] == 'Data-related')

    # Merge salary_order
    scatter_df = scatter_df[col].merge(salary_order, on='salary', how='left')

    # Merge country_area
    scatter_df = scatter_df.merge(country_area[['country', 'gdp_group', 'area']], on='country', how='left')

    # Condition: gdp_group in ['高收入', '中高收入']
    scatter_df = scatter_df[scatter_df['gdp_group'].isin(['高收入', '中高收入'])]

    # Merge coding_exp_years_order
    scatter_df = scatter_df.merge(coding_exp_years_order, on='coding_exp_years', how='left')

    # Merge prog_lang_skill_group
    scatter_df = scatter_df.merge(prog_lang_skill_group, on='id', how='left')

    # Keep the columns used below
    scatter_df = scatter_df[['id', 'salary','job_title','count', 'coding_exp_years_mean', 'salary_mean']]

    # Remove bad data: an annual salary in a high or upper-middle income country should not be below 5,000 USD (10,000 USD may be the more accurate threshold)
    scatter_df = scatter_df[scatter_df['salary_mean'] > 5000]

    # Build the respondent count, skill count (mean), years of coding experience (mean), and annual salary (median)
    scatter_df = scatter_df.groupby(['job_title','salary']).agg(count = ('id', 'count'),
                                                                lang_count_mean=('count', 'mean'),
                                                                coding_exp_year_mean=('coding_exp_years_mean', 'mean'),
                                                                salary_median=('salary_mean', 'median')
                                                                ).round(2).reset_index()

    # Build the hover text
    scatter_df['text'] = '經驗 = ' + scatter_df['coding_exp_year_mean'].astype(str) + ' year'

    return scatter_df

file_path_name = '練習專案三：資料科學家的工具箱/data_scientists_toolbox/年薪_技能(數量)_職稱.html'
px_scatter(scatter_data(), file_path_name)
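
The named aggregation in scatter_data() can be checked on a toy table. This sketch uses made-up values. toy and bubble_df are hypothetical names, and the columns mirror the ones the function keeps just before grouping.

```python
import pandas as pd

# Made-up respondents; 'count' is the number of languages each respondent knows
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "job_title": ["Data Scientist"] * 3 + ["Data Analyst"] * 2,
    "salary": ["50,000-59,999"] * 3 + ["30,000-39,999"] * 2,
    "count": [3, 5, 4, 2, 3],
    "coding_exp_years_mean": [4.0, 8.0, 6.0, 2.0, 3.0],
    "salary_mean": [55000, 55000, 55000, 35000, 35000],
})

# Named aggregation: each keyword is an output column built from (input column, function)
bubble_df = (
    toy.groupby(["job_title", "salary"])
    .agg(
        count=("id", "count"),                         # respondents in the group
        lang_count_mean=("count", "mean"),             # average number of languages
        coding_exp_year_mean=("coding_exp_years_mean", "mean"),
        salary_median=("salary_mean", "median"),
    )
    .round(2)
    .reset_index()
)

# Hover text in the same format as the notes
bubble_df["text"] = "經驗 = " + bubble_df["coding_exp_year_mean"].astype(str) + " year"

print(bubble_df)
```

The output column count (number of respondents) reuses the name of the input column count (languages per respondent). pandas accepts this, but the two meanings are easy to mix up when reading the result.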

