#archiveteam-bs 2018-09-06,Thu


Time Nickname Message
00:03 πŸ”— Stilett0 has quit IRC (Ping timeout: 252 seconds)
00:18 πŸ”— Stilett0 has joined #archiveteam-bs
00:20 πŸ”— BlueMax has joined #archiveteam-bs
00:52 πŸ”— Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
00:56 πŸ”— Odd0002 has joined #archiveteam-bs
01:06 πŸ”— ndiddy has quit IRC (Read error: Connection reset by peer)
01:07 πŸ”— ndiddy has joined #archiveteam-bs
01:08 πŸ”— Yurume has quit IRC (Read error: Operation timed out)
01:10 πŸ”— Yurume has joined #archiveteam-bs
01:16 πŸ”— Stilett0 has quit IRC (Read error: Operation timed out)
01:17 πŸ”— ndiddy has quit IRC (Remote host closed the connection)
01:17 πŸ”— ndiddy has joined #archiveteam-bs
01:18 πŸ”— ndiddy has quit IRC (Client Quit)
01:18 πŸ”— Jusque has quit IRC (Ping timeout: 260 seconds)
01:24 πŸ”— Jusque has joined #archiveteam-bs
01:28 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
01:30 πŸ”— BlueMax has joined #archiveteam-bs
01:40 πŸ”— Stilett0 has joined #archiveteam-bs
01:41 πŸ”— icedice has joined #archiveteam-bs
03:08 πŸ”— odemg has quit IRC (Ping timeout: 260 seconds)
03:20 πŸ”— odemg has joined #archiveteam-bs
03:22 πŸ”— icedice has quit IRC (Quit: Leaving)
03:23 πŸ”— tsr has quit IRC (Read error: Operation timed out)
03:30 πŸ”— tsr has joined #archiveteam-bs
04:07 πŸ”— w0rmhole i already have wget, i use it to mirror open directories, so learning new switches to use with that sounds fairly simple. but, `wpull', i tried `brew install wpull' (on macos high sierra here) and got nothing. do i need to download the binary from somewhere special?
04:15 πŸ”— dxrt w0rmhole: Install it with pip
04:41 πŸ”— Flashfire w0rmhole I use sitesucker
04:42 πŸ”— Flashfire Sorry my bad deep vacuum
04:48 πŸ”— ivan I use grab-site!! please help me stop using grab-site
04:49 πŸ”— ivan JAA you know what browser-based WARC thing could do, patch chromium to preserve the entire header block (or at least enough to satisfy IA)
04:49 πŸ”— ivan proxies are super annoying
05:17 πŸ”— nuc is now known as Somebody2
05:31 πŸ”— w0rmhole dxrt: pip3? is that okay?
05:31 πŸ”— dxrt yeah
05:32 πŸ”— w0rmhole thanks ill give it a shot
05:34 πŸ”— w0rmhole ok, downloaded it. seems to have worked i guess. i just typed `wpull' and it spit out a python error at me. i dont speak python. mind if i pm you the output? just wondering if its working okay or not
05:35 πŸ”— dxrt sure
05:45 πŸ”— nyaomin is now known as nyaomi
08:05 πŸ”— Mateon1 has quit IRC (Ping timeout: 268 seconds)
08:05 πŸ”— Mateon1 has joined #archiveteam-bs
08:20 πŸ”— JAA ivan: I doubt extensions can patch Chromium's network stack though...
08:20 πŸ”— JAA w0rmhole: Use wpull 1.2.3 or 2.0.3 (from FalconK's fork or mine on GitHub), not 2.0.1.
08:22 πŸ”— w0rmhole i got wpull from somewhere, the version's 1.2.3. i believe it was grabbed with py 2
08:22 πŸ”— JAA I don't think wpull 1.2.3 supports Python 2. It's all Python 3 (fortunately).
08:23 πŸ”— ivan JAA: well yes you'd have to compile chromium
08:24 πŸ”— w0rmhole oh wait nvm it was py 3 i got it with
08:24 πŸ”— JAA ivan: Yeah, that would work. You just need a 64-core machine with 128 TB of RAM and 10 PB of disk space. Or was that Firefox? I forget. :-)
08:25 πŸ”— Flashfire JAA what browser do you use?
08:25 πŸ”— JAA Firefox, why?
08:26 πŸ”— Flashfire Just curious. I use Firefox at home and chrome at school cause I hate safari and Firefox isn’t supported at school
08:29 πŸ”— JAA ivan: I've also been thinking about whether it would be worth patching the JS engine to record non-deterministic values from there. E.g. timestamp, user agent, random values, etc. That might improve playback when such values are used to determine URLs, for example. But it'd be a lot of work for fairly little gain.
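[Editor's note: JAA's record-and-replay idea above can be sketched as follows. This is a hypothetical illustration, not JAA's actual design — the values are recorded to a "tape" during capture and replayed verbatim during playback, so code that derives URLs from timestamps or random numbers produces identical results both times.]

```python
# Hedged sketch: record non-deterministic values (time, random) during
# capture, replay the same sequence during playback. All names here are
# illustrative assumptions, not part of any real archiving tool.
import json
import random
import time


class NondeterminismTape:
    """Records values on capture; replays the same sequence on playback."""

    def __init__(self, tape=None):
        self.tape = tape if tape is not None else []
        self.pos = 0
        self.recording = tape is None

    def _next(self, producer):
        if self.recording:
            value = producer()
            self.tape.append(value)
            return value
        value = self.tape[self.pos]
        self.pos += 1
        return value

    def now(self):
        return self._next(time.time)

    def rand(self):
        return self._next(random.random)

    def dump(self):
        return json.dumps(self.tape)


# Capture run: real values are produced and recorded.
capture = NondeterminismTape()
url = f"/api?t={capture.now()}&r={capture.rand()}"

# Playback run: the recorded values come back, so the same URL is derived.
playback = NondeterminismTape(tape=json.loads(capture.dump()))
replayed_url = f"/api?t={playback.now()}&r={playback.rand()}"
assert url == replayed_url
```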
08:37 πŸ”— ivan afaik it takes ~1.5 hours on a normal machine with ~4 cores
08:38 πŸ”— ivan I'm too sleepy to think about the playback problem, I generally don't even want to playback someone's JavaScript
08:38 πŸ”— ivan unless someone has written some nifty JavaScript game
08:41 πŸ”— w0rmhole ivan: javascript games!?!?! http://netives.com
08:43 πŸ”— ivan https://www.chiark.greenend.org.uk/~sgtatham/puzzles/ too
08:43 πŸ”— JAA ivan: Me neither. Unfortunately, more and more websites are relying on it in a ridiculous way. MEGA is a great example. You can't even read their press releases without launching the entire JS app and allowing cookies (otherwise it fails with a security error).
08:44 πŸ”— JAA <3 Puzzles
08:45 πŸ”— ivan well I would hope you can capture the press releases as DOM
08:45 πŸ”— Flashfire inB4 mega is a security error
08:46 πŸ”— JAA Yeah, probably.
08:47 πŸ”— JAA Flashfire: Funnily enough, they are/were: https://mega.nz/blog_47
08:50 πŸ”— JAA ivan: One example that wouldn't work without JS is Instagram. User profiles only have the currently visible posts (plus a few above and below it) in the DOM. So if you capture the DOM, you'll only get those few posts, not the entire post history.
08:55 πŸ”— w0rmhole didnt read the link yet, but jaa, is that talking about the mega chrome extension?
08:56 πŸ”— JAA Yes
08:57 πŸ”— w0rmhole iirc i saw that on either /r/piracy or /r/privacy
08:57 πŸ”— w0rmhole probably both lol
09:03 πŸ”— ivan JAA: in many cases I think you can just scroll down to capture everything (with a javascript shim to avoid unloading images if necessary), then to playback the DOM, maybe optionally lazy-load images to prevent memory consumption from getting out of hand
09:04 πŸ”— ivan oh I see what you mean now
09:04 πŸ”— ivan yeah preventing unloading of DOM might be another troublesome problem
09:04 πŸ”— JAA You'd have to compile an artificial DOM while scrolling.
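[Editor's note: the "artificial DOM" idea above — rebuilding a full feed from partial viewport snapshots taken while scrolling — might look like this minimal sketch. Post ids and fragments are hypothetical; a real implementation would key on whatever stable identifier the site exposes.]

```python
# Hedged sketch: each snapshot contains only the currently visible posts
# (virtualized DOM), so we merge snapshots keyed by a stable post id to
# reconstruct the whole feed.
def merge_snapshots(snapshots):
    """snapshots: list of dicts mapping post_id -> HTML fragment."""
    merged = {}
    for snapshot in snapshots:
        for post_id, fragment in snapshot.items():
            # Keep the first fragment seen for each post.
            merged.setdefault(post_id, fragment)
    return merged


# Three viewport snapshots taken while scrolling (ids are made up).
snaps = [
    {"p1": "<div>post 1</div>", "p2": "<div>post 2</div>"},
    {"p2": "<div>post 2</div>", "p3": "<div>post 3</div>"},
    {"p3": "<div>post 3</div>", "p4": "<div>post 4</div>"},
]
full_feed = merge_snapshots(snaps)
assert sorted(full_feed) == ["p1", "p2", "p3", "p4"]
```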
09:04 πŸ”— ivan the web was better before these React people got their hands on it
09:04 πŸ”— JAA No doubt
09:06 πŸ”— ivan I'm scrolling down on instagram and it doesn't seem too crazy, the images are getting unloaded but all of the page is there
09:06 πŸ”— w0rmhole so, with warc capturing, ignores can only be defined while the page/domain is being captured? and they take place immediately?
09:06 πŸ”— JAA Is it? I haven't checked in a while, but it broke my method of extracting all post links back before I wrote my scraper.
09:07 πŸ”— ivan earlier I mentioned my desire for a generic out-of-viewport DOM change canceler
09:08 πŸ”— JAA w0rmhole: You're mixing things up. WARC just stores whatever you throw into it. You can always filter stuff out afterwards as well. Ignores only affect the retrieval, i.e. they prevent the crawler from descending to certain URLs (for whatever reason, e.g. to prevent infinite recursion like on calendars, or because URLs are inaccessible anyway).
09:12 πŸ”— w0rmhole so, if im understanding this correctly, i would need to run some other command afterwards to remove the stuff i dont need from the warc? i was just hoping to use `grab-site' and instruct it to capture everything from a single page, except for some facebook .js files under a single url.
09:13 πŸ”— JAA No, you can specify the relevant ignore at the start of (or during) the crawl. Then it'll never end up in the WARC to begin with.
09:14 πŸ”— w0rmhole ok i see. my apologies, this is my first time working with warcs
09:15 πŸ”— w0rmhole how would i go about specifying the ignores at the start?
09:16 πŸ”— JAA With wpull, --reject-regex (and some other options, e.g. if you want to ignore entire domains, which I don't remember right now). I don't know how it works with grab-site as I've never used it. Should be in the documentation though.
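[Editor's note: the ignore mechanism JAA describes can be illustrated like this. The pattern below is a hypothetical ignore for w0rmhole's "skip the Facebook .js files" use case; wpull's `--reject-regex` applies such a pattern to candidate URLs before fetching, so ignored URLs never enter the WARC.]

```python
# Hedged illustration of a reject-regex style ignore: filter candidate
# URLs before retrieval. The pattern and URLs are assumptions for the
# example, not taken from a real crawl.
import re

# Hypothetical ignore: skip Facebook JavaScript assets.
reject = re.compile(r"https?://[^/]*facebook\.com/.*\.js")

candidates = [
    "https://example.com/index.html",
    "https://connect.facebook.com/sdk.js",
    "https://example.com/style.css",
]
to_fetch = [u for u in candidates if not reject.search(u)]
assert to_fetch == [
    "https://example.com/index.html",
    "https://example.com/style.css",
]
```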
09:17 πŸ”— w0rmhole ok thanks, i'll keep it in mind. it's 04:16 over here so i need to get to bed. see you all later
09:17 πŸ”— JAA Good night
09:26 πŸ”— VADemon_ has joined #archiveteam-bs
09:30 πŸ”— VADemon has quit IRC (Read error: Operation timed out)
09:50 πŸ”— VADemon_ has quit IRC (Read error: Connection reset by peer)
09:54 πŸ”— VADemon has joined #archiveteam-bs
10:27 πŸ”— BlueMax has quit IRC (Read error: Connection reset by peer)
13:24 πŸ”— redlizard has quit IRC (Remote host closed the connection)
14:07 πŸ”— wp494 has quit IRC (Ping timeout: 506 seconds)
14:08 πŸ”— wp494 has joined #archiveteam-bs
14:18 πŸ”— ndiddy has joined #archiveteam-bs
14:54 πŸ”— ndiddy has quit IRC (Ping timeout: 268 seconds)
15:19 πŸ”— RichardG_ has joined #archiveteam-bs
15:19 πŸ”— RichardG has quit IRC (Read error: Connection reset by peer)
15:36 πŸ”— godane so i need some help
15:36 πŸ”— godane i found out about euscreen.eu
15:36 πŸ”— godane but they're using a ton of JavaScript to make their pages
15:37 πŸ”— godane i can't download the videos that are hosted there using youtube-dl
15:37 πŸ”— godane and i can't use the custom phantomjs scripts i have to download the page either
15:37 πŸ”— kiska Lets throw it into archivebot!
15:37 πŸ”— godane http://euscreen.eu/item.html?id=TN_1988-07-25
15:38 πŸ”— JAA kiska: JS-heavy pages don't work well with ArchiveBot.
15:39 πŸ”— kiska We can try
15:39 πŸ”— JAA Yeah, that grabbed about what I expected it to grab.
15:40 πŸ”— kiska btw JAA does the dashboard show ao's
15:40 πŸ”— JAA Of course.
15:41 πŸ”— kiska I am only asking since the 3rd version isn't showing it for me
15:42 πŸ”— JAA Maybe you're lagging behind. I've seen that before. (Yes, even on fast connections.)
15:44 πŸ”— JAA Wow, that site is disgusting. It sends a POST request to get an XML blob, which it then PUTs to another URL to get a 400 KiB blob of HTML + scripts back.
15:45 πŸ”— kiska xD
15:45 πŸ”— kiska Well, for maximum uncrawlability, that's something people would do
15:48 πŸ”— godane JAA: its a very weird site
15:48 πŸ”— godane i was hoping i could have just customized my scripts for grabbing animeheaven.eu videos
15:49 πŸ”— godane i knew it was a long shot but they did use phantomjs to save the page
15:50 πŸ”— kiska Anyway here is the video link for that video http://stream18.noterik.com/progressive/stream18//domain/euscreenxl/user/eu_ctv/video/EUS_5800A98E71C22960834A8DBEE8F9F1A8/rawvideo/1/raw.mp4?ticket=21513250
15:50 πŸ”— kiska Oh... its doing a 403 on that
15:51 πŸ”— godane the trick is the page has to be opening for the grab
15:51 πŸ”— godane *open
15:52 πŸ”— kiska It was open
15:52 πŸ”— JAA Referrer, cookies?
15:52 πŸ”— kiska Most likely
15:53 πŸ”— kiska Maybe this "smt__sessionid"
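[Editor's note: a 403 on a direct media URL like the one above often means the server checks the Referer header and/or a session cookie. This sketch only builds such a request (no network traffic); the shortened URL, header values, and the `smt__sessionid` placeholder value are assumptions for illustration.]

```python
# Hedged sketch: attach the page as Referer plus the session cookie
# kiska mentioned. The cookie value is a PLACEHOLDER, not a real session.
import urllib.request

# Shortened stand-in for the stream18.noterik.com raw.mp4 URL above.
video_url = "http://stream18.noterik.com/progressive/raw.mp4"

req = urllib.request.Request(video_url, headers={
    "Referer": "http://euscreen.eu/item.html?id=TN_1988-07-25",
    "Cookie": "smt__sessionid=PLACEHOLDER",  # copy from a live page session
})
assert req.get_header("Referer") == "http://euscreen.eu/item.html?id=TN_1988-07-25"
```

Whether this actually clears the 403 is untested here — it only shows how the suspected headers would be attached.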
15:59 πŸ”— ndiddy has joined #archiveteam-bs
16:07 πŸ”— RichardG_ is now known as RichardG
16:43 πŸ”— godane so i found out about this : https://www.radio.cz/en/broadcast-archive/2004-09-30
16:44 πŸ”— godane its in real media format going back to 2002-02-25
17:01 πŸ”— godane *2002-03-25
17:31 πŸ”— chferfa has joined #archiveteam-bs
17:56 πŸ”— godane starting uploading Radio Prague : https://archive.org/details/radio-prague-english-2002-03-25
18:46 πŸ”— jut Would it be worthwhile uploading news from Lithuanian Panorama going back to 2013-12-20? It's 611 GB and not in immediate danger of disappearing.
19:58 πŸ”— VADemon has quit IRC (Read error: Connection reset by peer)
20:51 πŸ”— ndiddy has quit IRC (Read error: Operation timed out)
21:25 πŸ”— Flashfire JAA, can we grab some Burt Reynolds stuff? He has died
22:08 πŸ”— vectr0n has quit IRC (Remote host closed the connection)
22:09 πŸ”— vectr0n has joined #archiveteam-bs
22:18 πŸ”— Stilett0 has quit IRC (Ping timeout: 246 seconds)
22:35 πŸ”— ndiddy has joined #archiveteam-bs
23:23 πŸ”— Sk2d has joined #archiveteam-bs
23:23 πŸ”— godane SketchCow: i think the vhsvault needs a sub-collection all Godane vhs rips
23:23 πŸ”— godane *all=call
23:26 πŸ”— Sk1d has quit IRC (Read error: Operation timed out)
23:26 πŸ”— Sk2d is now known as Sk1d
23:28 πŸ”— godane i only say that cause now my videos may be hard for me to find in that collection
23:28 πŸ”— godane anyways here are the latest digitized rips uploaded: https://www.patreon.com/posts/digitize-tapes-21256310
23:29 πŸ”— SketchCow Ha ha
23:29 πŸ”— SketchCow Maybbbeeeee
23:30 πŸ”— godane at least i have full list of the items ids for my videos
23:31 πŸ”— godane if anything, if flemishdog can get a collection then maybe i should have one too
23:32 πŸ”— godane :-D
23:45 πŸ”— Stilett0 has joined #archiveteam-bs
