[00:03] *** Stilett0 has quit IRC (Ping timeout: 252 seconds)
[00:18] *** Stilett0 has joined #archiveteam-bs
[00:20] *** BlueMax has joined #archiveteam-bs
[00:52] *** Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
[00:56] *** Odd0002 has joined #archiveteam-bs
[01:06] *** ndiddy has quit IRC (Read error: Connection reset by peer)
[01:07] *** ndiddy has joined #archiveteam-bs
[01:08] *** Yurume has quit IRC (Read error: Operation timed out)
[01:10] *** Yurume has joined #archiveteam-bs
[01:16] *** Stilett0 has quit IRC (Read error: Operation timed out)
[01:17] *** ndiddy has quit IRC (Remote host closed the connection)
[01:17] *** ndiddy has joined #archiveteam-bs
[01:18] *** ndiddy has quit IRC (Client Quit)
[01:18] *** Jusque has quit IRC (Ping timeout: 260 seconds)
[01:24] *** Jusque has joined #archiveteam-bs
[01:28] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[01:30] *** BlueMax has joined #archiveteam-bs
[01:40] *** Stilett0 has joined #archiveteam-bs
[01:41] *** icedice has joined #archiveteam-bs
[03:08] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:20] *** odemg has joined #archiveteam-bs
[03:22] *** icedice has quit IRC (Quit: Leaving)
[03:23] *** tsr has quit IRC (Read error: Operation timed out)
[03:30] *** tsr has joined #archiveteam-bs
[04:07] i already have wget, i use it to mirror open directories, so learning new switches to use with that sounds fairly simple. but, `wpull', i tried `brew install wpull' (on macos high sierra here) and got nothing. do i need to download the binary from somewhere special?
[04:15] w0rmhole: Install it with pip
[04:41] w0rmhole I use sitesucker
[04:42] Sorry my bad deep vacuum
[04:48] I use grab-site!! please help me stop using grab-site
[04:49] JAA you know what browser-based WARC thing could do, patch chromium to preserve the entire header block (or at least enough to satisfy IA)
[04:49] proxies are super annoying
[05:17] *** nuc is now known as Somebody2
[05:31] dxrt: pip3? is that okay?
[05:31] yeah
[05:32] thanks ill give it a shot
[05:34] ok, downloaded it. seems to have worked i guess. i just typed `wpull' and it spit out a python error at me. i dont speak python. mind if i pm you the output? just wondering if its working okay or not
[05:35] sure
[05:45] *** nyaomin is now known as nyaomi
[08:05] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[08:05] *** Mateon1 has joined #archiveteam-bs
[08:20] ivan: I doubt extensions can patch Chromium's network stack though...
[08:20] w0rmhole: Use wpull 1.2.3 or 2.0.3 (from FalconK's fork or mine on GitHub), not 2.0.1.
[08:22] i got wpull from somewhere, the version's 1.2.3. i believe it was grabbed with py 2
[08:22] I don't think wpull 1.2.3 supports Python 2. It's all Python 3 (fortunately).
[08:23] JAA: well yes you'd have to compile chromium
[08:24] oh wait nvm it was py 3 i got it with
[08:24] ivan: Yeah, that would work. You just need a 64-core machine with 128 TB of RAM and 10 PB of disk space. Or was that Firefox? I forget. :-)
[08:25] JAA what browser do you use?
[08:25] Firefox, why?
[08:26] Just curious. I use Firefox at home and chrome at school cause I hate safari and Firefox isn’t supported at school
[08:29] ivan: I've also been thinking about whether it would be worth patching the JS engine to record non-deterministic values from there. E.g. timestamp, user agent, random values, etc. That might improve playback when such values are used to determine URLs, for example. But it'd be a lot of work for fairly little gain.
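A rough sketch of the install check behind the pip3/wpull exchange earlier in the log: it assumes wpull accepts the usual --version flag (it mirrors wget's options), and the check against 1.2.3/2.0.3 simply restates the advice given at 08:20. The script itself is illustrative, not something from the conversation.

    import subprocess
    import sys

    RECOMMENDED = {"1.2.3", "2.0.3"}  # releases recommended in the channel; 2.0.1 is the one to avoid

    try:
        # Ask the installed wpull for its version; assumes it supports --version like wget does.
        result = subprocess.run(["wpull", "--version"], capture_output=True, text=True)
    except FileNotFoundError:
        sys.exit("no wpull on the PATH; try `pip3 install wpull` first")

    version = (result.stdout or result.stderr).strip()
    if version in RECOMMENDED:
        print("wpull", version, "is one of the recommended releases")
    else:
        print("wpull", version, "found; 1.2.3 or 2.0.3 were recommended instead")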
[08:37] afaik it takes ~1.5 hours on a normal machine with ~4 cores
[08:38] I'm too sleepy to think about the playback problem, I generally don't even want to play back someone's JavaScript
[08:38] unless someone has written some nifty JavaScript game
[08:41] ivan: javascript games!?!?! http://netives.com
[08:43] https://www.chiark.greenend.org.uk/~sgtatham/puzzles/ too
[08:43] ivan: Me neither. Unfortunately, more and more websites are relying on it in a ridiculous way. MEGA is a great example. You can't even read their press releases without launching the entire JS app and allowing cookies (otherwise it fails with a security error).
[08:44] <3 Puzzles
[08:45] well I would hope you can capture the press releases as DOM
[08:45] inB4 mega is a security error
[08:46] Yeah, probably.
[08:47] Flashfire: Funnily enough, they are/were: https://mega.nz/blog_47
[08:50] ivan: One example that wouldn't work without JS is Instagram. User profiles only have the currently visible posts (plus a few above and below it) in the DOM. So if you capture the DOM, you'll only get those few posts, not the entire post history.
[08:55] didnt read the link yet, but jaa, is that talking about the mega chrome extension?
[08:56] Yes
[08:57] iirc i saw that on either /r/piracy or /r/privacy
[08:57] probably both lol
[09:03] JAA: in many cases I think you can just scroll down to capture everything (with a javascript shim to avoid unloading images if necessary), then to play back the DOM, maybe optionally lazy-load images to prevent memory consumption from getting out of hand
[09:04] oh I see what you mean now
[09:04] yeah preventing unloading of DOM might be another troublesome problem
[09:04] You'd have to compile an artificial DOM while scrolling.
[09:04] the web was better before these React people got their hands on it
[09:04] No doubt
[09:06] I'm scrolling down on instagram and it doesn't seem too crazy, the images are getting unloaded but all of the page is there
[09:06] so, with warc capturing, ignores can only be defined while the page/domain is being captured? and they take place immediately?
[09:06] Is it? I haven't checked in a while, but it broke my method of extracting all post links back before I wrote my scraper.
[09:07] earlier I mentioned my desire for a generic out-of-viewport DOM change canceler
[09:08] w0rmhole: You're mixing things up. WARC just stores whatever you throw into it. You can always filter stuff out afterwards as well. Ignores only affect the retrieval, i.e. they prevent the crawler from descending to certain URLs (for whatever reason, e.g. to prevent infinite recursion like on calendars, or because URLs are inaccessible anyway).
[09:12] so, if im understanding this correctly, i would need to run some other command afterwards to remove the stuff i dont need from the warc? i was just hoping to use `grab-site' and instruct it to capture everything from a single page, except for some facebook .js files under a single url.
[09:13] No, you can specify the relevant ignore at the start of (or during) the crawl. Then it'll never end up in the WARC to begin with.
[09:14] ok i see. my apologies, this is my first time working with warcs
[09:15] how would i go about specifying the ignores at the start?
[09:16] With wpull, --reject-regex (and some other options, e.g. if you want to ignore entire domains, which I don't remember right now). I don't know how it works with grab-site as I've never used it. Should be in the documentation though.
[09:17] ok thanks, i'll keep it in mind. it's 04:16 over here so i need to get to bed. see you all later
[09:17] Good night
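As a concrete illustration of the crawl-time ignores discussed just above, a bare wpull invocation for the single-page-minus-facebook-scripts case might look roughly like the sketch below. Only --reject-regex is taken from the conversation; --page-requisites and --warc-file are the standard wpull/wget-style options, and the target URL, WARC name and regex are placeholders. grab-site exposes the same idea through its own ignore sets.

    import subprocess

    # Placeholders throughout: the page URL, the WARC name and the facebook regex are illustrative only.
    cmd = [
        "wpull",
        "https://example.com/the-one-page-i-want.html",
        "--page-requisites",               # also fetch the images/CSS/JS the page pulls in
        "--warc-file", "one-page-grab",    # record whatever does get fetched into a WARC
        "--reject-regex", r"https?://[^/]*facebook\.[a-z]+/.*\.js",  # the crawl-time ignore
    ]
    subprocess.run(cmd, check=True)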
[09:26] *** VADemon_ has joined #archiveteam-bs
[09:30] *** VADemon has quit IRC (Read error: Operation timed out)
[09:50] *** VADemon_ has quit IRC (Read error: Connection reset by peer)
[09:54] *** VADemon has joined #archiveteam-bs
[10:27] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[13:24] *** redlizard has quit IRC (Remote host closed the connection)
[14:07] *** wp494 has quit IRC (Ping timeout: 506 seconds)
[14:08] *** wp494 has joined #archiveteam-bs
[14:18] *** ndiddy has joined #archiveteam-bs
[14:54] *** ndiddy has quit IRC (Ping timeout: 268 seconds)
[15:19] *** RichardG_ has joined #archiveteam-bs
[15:19] *** RichardG has quit IRC (Read error: Connection reset by peer)
[15:36] so i need some help
[15:36] i found out about euscreen.eu
[15:36] but they're using a ton of JavaScript to make their pages
[15:37] i can't download the videos that are hosted there using youtube-dl
[15:37] and can't use the custom phantomjs scripts i have to download the page either
[15:37] Let's throw it into archivebot!
[15:37] http://euscreen.eu/item.html?id=TN_1988-07-25
[15:38] kiska: JS-heavy pages don't work well with ArchiveBot.
[15:39] We can try
[15:39] Yeah, that grabbed about what I expected it to grab.
[15:40] btw JAA does the dashboard show ao's
[15:40] Of course.
[15:41] I am only asking since the 3rd version isn't showing it for me
[15:42] Maybe you're lagging behind. I've seen that before. (Yes, even on fast connections.)
[15:44] Wow, that site is disgusting. It sends a POST request to get an XML blob, which it then PUTs to another URL to get a 400 KiB blob of HTML + scripts back.
[15:45] xD
[15:45] Well for maximum uncrawlability that's something people would do
[15:48] JAA: it's a very weird site
[15:48] i was hoping i could just customize my scripts for grabbing animeheaven.eu videos
[15:49] i know it was a long shot but it did use phantomjs to save the page
[15:50] Anyway here is the video link for that video http://stream18.noterik.com/progressive/stream18//domain/euscreenxl/user/eu_ctv/video/EUS_5800A98E71C22960834A8DBEE8F9F1A8/rawvideo/1/raw.mp4?ticket=21513250
[15:50] Oh... it's doing a 403 on that
[15:51] the trick is the page has to be open for the grab
[15:52] It was open
[15:52] Referrer, cookies?
[15:52] Most likely
[15:53] Maybe this "smt__sessionid"
[15:59] *** ndiddy has joined #archiveteam-bs
[16:07] *** RichardG_ is now known as RichardG
[16:43] so i found out about this : https://www.radio.cz/en/broadcast-archive/2004-09-30
[16:44] it's in RealMedia format going back to 2002-03-25
[17:31] *** chferfa has joined #archiveteam-bs
[17:56] started uploading Radio Prague: https://archive.org/details/radio-prague-english-2002-03-25
[18:46] Would it be worthwhile uploading news from Lithuanian Panorama going back to 2013-12-20? It's 611 GB and not in immediate danger of disappearing.
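A loose sketch of the referrer/cookie theory from the euscreen.eu exchange, using the requests library: load the item page first so the session picks up whatever cookies it sets (possibly the smt__sessionid one mentioned at 15:53), then retry the raw.mp4 URL with a Referer header. Whether that is actually enough, or whether the ticket parameter simply expires, is exactly what was left open above.

    import requests

    PAGE = "http://euscreen.eu/item.html?id=TN_1988-07-25"
    VIDEO = ("http://stream18.noterik.com/progressive/stream18//domain/euscreenxl/"
             "user/eu_ctv/video/EUS_5800A98E71C22960834A8DBEE8F9F1A8/rawvideo/1/"
             "raw.mp4?ticket=21513250")

    with requests.Session() as session:
        session.get(PAGE)  # pick up any cookies the player page sets
        response = session.get(VIDEO, headers={"Referer": PAGE}, stream=True)
        print(response.status_code)  # another 403 here would point at the ticket itself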
[19:58] *** VADemon has quit IRC (Read error: Connection reset by peer)
[20:51] *** ndiddy has quit IRC (Read error: Operation timed out)
[21:25] JAA can we grab some Burt Reynolds stuff? he has died
[22:08] *** vectr0n has quit IRC (Remote host closed the connection)
[22:09] *** vectr0n has joined #archiveteam-bs
[22:18] *** Stilett0 has quit IRC (Ping timeout: 246 seconds)
[22:35] *** ndiddy has joined #archiveteam-bs
[23:23] *** Sk2d has joined #archiveteam-bs
[23:23] SketchCow: i think the vhsvault needs a sub-collection called Godane vhs rips
[23:26] *** Sk1d has quit IRC (Read error: Operation timed out)
[23:26] *** Sk2d is now known as Sk1d
[23:28] i only say that cause now my videos may be hard to find for me in that collection
[23:28] anyways here are the latest digitized rips uploaded: https://www.patreon.com/posts/digitize-tapes-21256310
[23:29] Ha ha
[23:29] Maybbbeeeee
[23:30] at least i have a full list of the item ids for my videos
[23:31] if nothing else, if flemishdog can get a collection then maybe i should have one too
[23:32] :-D
[23:45] *** Stilett0 has joined #archiveteam-bs