[00:00] *** closure has joined #archiveteam-bs
[00:47] SketchCow: any news?
[01:01] *** closure has quit IRC (Read error: Connection reset by peer)
[01:04] *** closure has joined #archiveteam-bs
[01:26] *** m007a83 has joined #archiveteam-bs
[01:44] so i'm looking through the news archives of ina.fr
[01:46] i think it's a mix of different news sources with their archive, or at least from what i can tell with journal du page here: http://www.ina.fr/journal-anniversaire
[01:48] if it's a mix then i will have to do it based on what i get from them and put ids as ina-journal-du-${y}-${m}-${d} with the original file name most likely put into a text file
[01:48] only say that cause french chars may cause some problems with uploading the item
[01:58] *** closure has quit IRC (Read error: Operation timed out)
[02:01] *** closure has joined #archiveteam-bs
[02:28] the video file name will be put into a text file, or maybe the id page will be put into a text file
[02:29] i say that cause i used --add-metadata on all of these videos with youtube-dl, so the full original name is in the video's metadata
[03:00] *** closure has quit IRC (Ping timeout: 252 seconds)
[03:01] *** closure has joined #archiveteam-bs
[03:30] *** bitBaron has quit IRC (Quit: Bye.)
[03:36] why do websites force you to interact with javascript to load site data?
[03:42] *** archodg_ has joined #archiveteam-bs
[03:43] moufu, are you still around?
I'm unsure what you meant by printing the url in the download_child_p callback
[03:44] *** odemg has quit IRC (Ping timeout: 246 seconds)
[03:44] *** archodg__ has quit IRC (Ping timeout: 252 seconds)
[03:48] kyounko: reducing initial page render time (whether intentionally or because of bad architecture), javascript-first developers, getting better information on what people are actually reading
[03:50] the javascript developer rationale goes something like "we'll need some of these dynamic loading features anyway, might as well require javascript to avoid duplicating this rendering logic on the server"
[03:58] *** odemg has joined #archiveteam-bs
[03:59] *** closure has quit IRC (Read error: Operation timed out)
[04:04] *** closure has joined #archiveteam-bs
[05:00] *** closure has quit IRC (Read error: Connection reset by peer)
[05:00] *** closure_ has joined #archiveteam-bs
[05:28] *** Odd0002_ has joined #archiveteam-bs
[05:33] *** Odd0002 has quit IRC (Read error: Operation timed out)
[05:33] *** Odd0002_ is now known as Odd0002
[06:00] *** closure_ has quit IRC (Read error: Connection reset by peer)
[08:04] ivan: would it be insane to do things like the 1995 web? that still works
[08:05] i haven't been to craigslist in years, but in 2014 it seemed "ancient"
[09:32] *** BlueMax has quit IRC (Quit: Leaving)
[11:00] *** REiN^ has joined #archiveteam-bs
[11:05] *** plue has quit IRC (Quit: leaving)
[11:06] *** plue has joined #archiveteam-bs
[12:15] *** Mateon1 has quit IRC (Ping timeout: 633 seconds)
[12:18] *** Mateon1 has joined #archiveteam-bs
[12:36] adinbied: https://github.com/alard/wget-lua/wiki/Wget-with-Lua-hooks#download_child_p
[12:37] adinbied: something like if not verdict then io.stdout:write(urlpos["url"]["url"].." rejected ("..reason..")\n"); io.stdout:flush() end might work, haven't tested though
[13:25] *** closure has joined #archiveteam-bs
[14:00] *** closure has quit IRC (Read error: Connection reset by peer)
[14:05] *** closure has joined #archiveteam-bs
[14:25] kyounko: there's a whole new generation of 'web developers' who genuinely do not understand what a browser can do out of the box, and who believe that writing 'frontend JS' is the only way to build a website today
[14:25] and that is not an exaggeration
[14:25] I have to talk people off this ledge on an almost-daily basis
[14:25] it's fucking infuriating
[14:26] It makes archiving... _difficult_
[14:26] not just that
[14:26] it makes everything worse
[14:26] contrary to popular belief, such JS-heavy sites are actually *considerably* slower than plain HTML and CSS with forms and links
[14:26] because you're breaking half the browser's optimizations
[14:27] it also typically breaks the browser's standard behaviour and controls (and why wouldn't it? new generation of web devs doesn't even know they *exist*)
[14:27] effectively reimplementing half a browser client-side, poorly
[15:00] *** closure has quit IRC (Ping timeout: 252 seconds)
[15:01] *** closure has joined #archiveteam-bs
[15:06] *** plue has quit IRC (Quit: leaving)
[15:07] *** plue has joined #archiveteam-bs
[15:42] *** Sanky is now known as Sanqui
[16:00] *** closure has quit IRC (Ping timeout: 268 seconds)
[16:18] *** closure has joined #archiveteam-bs
[16:32] I'm still quite unfamiliar with Lua - I'm still wanting to learn more at some point, but for the moment, moufu, would you be able to create some sort of implementation?
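The one-liner from the 12:37 message can be expanded into a complete hook roughly as follows. This is only a sketch, not the contents of any diff linked in this log: it assumes the download_child_p callback receives (urlpos, parent, depth, start_url_parsed, iri, verdict, reason) as described on the wget-lua wiki page linked above, and format_rejection is a hypothetical helper name introduced here for illustration. The stub on the first lines exists only so the file also loads under a plain Lua interpreter.

```lua
-- Stub so the sketch also loads outside wget-lua;
-- inside wget-lua the global `wget` table is already provided.
wget = wget or { callbacks = {} }
wget.callbacks = wget.callbacks or {}

-- Hypothetical helper (not part of wget-lua): build one log line
-- for a URL that wget decided not to follow.
local function format_rejection(url, reason)
  return url .. " rejected (" .. tostring(reason) .. ")\n"
end

-- Assumed signature, taken from the wget-lua hooks wiki.
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason)
  if not verdict then
    -- Log the rejected URL together with wget's stated reason.
    io.stdout:write(format_rejection(urlpos["url"]["url"], reason))
    io.stdout:flush()
  end
  -- Return wget's own decision unchanged; this hook only logs.
  return verdict
end
```

Because the hook returns verdict untouched, it should not change what gets downloaded; it only makes the rejections visible in the output so they can be inspected.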
[17:05] *** closure has quit IRC (Read error: Connection reset by peer)
[17:08] *** closure_ has joined #archiveteam-bs
[17:48] adinbied: https://0x0.st/s36_.diff
[17:53] *** plue has quit IRC (Quit: leaving)
[17:59] *** closure_ has quit IRC (Read error: Connection reset by peer)
[18:23] *** wp494 has quit IRC (Read error: Operation timed out)
[18:24] *** wp494 has joined #archiveteam-bs
[18:45] moufu, that didn't do it either - it's still not grabbing images/site resources
[18:49] *** closure has joined #archiveteam-bs
[18:53] *** schbirid has joined #archiveteam-bs
[18:59] *** closure has quit IRC (Ping timeout: 246 seconds)
[19:01] what site are you testing it on?
[19:03] it seems to work when I run wget-lua with the arguments from pipeline.py manually (not sure how to test the pipeline itself since it returns tracker error) on some random sites found on google
[19:06] *** plue has joined #archiveteam-bs
[19:06] So I've got a tracker dev-env VM running with the items as defined here: https://github.com/adinbied/angelfire-items
[19:08] Maybe it's something to do with the way the items and queue are being handled in my particular case?
[19:15] *** tuluu_ has quit IRC (Read error: Connection refused)
[19:16] *** tuluu has joined #archiveteam-bs
[19:17] okay I can reproduce it when grabbing from sitemap.xml
[19:19] oh I might know what the problem is
[19:20] yeah works now, forgot to pass link_expect_html
[19:20] https://0x0.st/s3IO.diff
[19:21] adinbied: ↑
[19:22] Aha!
[19:22] Thank you so much! I'm still learning as I go, and was getting hung up on why it just wouldn't work
[19:23] Thanks again for the help moufu! It's greatly appreciated!
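For readers following the link_expect_html fix from 19:20, here is a hedged sketch of what passing that field can look like. The linked diff is not reproduced in this log, so everything here is an assumption: that the URLs were being queued from sitemap.xml through the wget-lua get_urls callback, that the callback takes (file, url, is_css, iri), and that it returns a list of tables with url and link_expect_html fields as described on the wget-lua hooks wiki. urls_from_sitemap is a hypothetical helper name introduced for illustration.

```lua
-- Stub so the sketch also loads outside wget-lua;
-- inside wget-lua the global `wget` table is already provided.
wget = wget or { callbacks = {} }
wget.callbacks = wget.callbacks or {}

-- Hypothetical helper: collect every <loc>...</loc> value from sitemap text.
function urls_from_sitemap(xml)
  local urls = {}
  for loc in string.gmatch(xml, "<loc>%s*(.-)%s*</loc>") do
    -- Assumption: link_expect_html = 1 marks the queued URL as an HTML page,
    -- so that wget parses it for further links once it is fetched.
    table.insert(urls, { url = loc, link_expect_html = 1 })
  end
  return urls
end

-- Assumed signature, taken from the wget-lua hooks wiki.
wget.callbacks.get_urls = function(file, url, is_css, iri)
  -- Only the sitemap itself is handled here; other pages queue nothing extra.
  if not string.match(url, "sitemap%.xml$") then
    return {}
  end
  local f = io.open(file, "rb")
  if not f then
    return {}
  end
  local xml = f:read("*a")
  f:close()
  return urls_from_sitemap(xml)
end
```

Without the link_expect_html field, pages queued this way would be downloaded but, as the 19:25 message puts it, not parsed, which would match the symptom of images and other site resources never being grabbed.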
[19:25] np, I should've remembered html pages don't get parsed without that option
[19:46] *** closure_ has joined #archiveteam-bs
[20:01] *** closure_ has quit IRC (Read error: Connection reset by peer)
[20:03] *** schbirid has quit IRC (Remote host closed the connection)
[21:56] *** closure has joined #archiveteam-bs
[22:01] *** closure has quit IRC (Read error: Connection reset by peer)
[22:01] *** closure_ has joined #archiveteam-bs
[22:04] *** zhongfu has quit IRC (Quit: No Ping reply in 180 seconds.)
[22:05] *** zhongfu has joined #archiveteam-bs
[22:18] *** BlueMax has joined #archiveteam-bs
[22:32] *** closure_ has quit IRC (Read error: Connection reset by peer)
[22:34] *** closure has joined #archiveteam-bs
[22:37] *** Stiletto has quit IRC (Read error: Operation timed out)
[23:00] *** closure_ has joined #archiveteam-bs
[23:00] *** closure has quit IRC (Read error: Connection reset by peer)
[23:33] *** closure_ has quit IRC (Read error: Connection reset by peer)
[23:38] *** closure has joined #archiveteam-bs