#archiveteam-bs 2018-09-06,Thu

Time	Nickname	Message
00:03 ^🔗		Stilett0 has quit IRC (Ping timeout: 252 seconds)
00:18 ^🔗		Stilett0 has joined #archiveteam-bs
00:20 ^🔗		BlueMax has joined #archiveteam-bs
00:52 ^🔗		Odd0002 has quit IRC (Quit: ZNC - http://znc.in)
00:56 ^🔗		Odd0002 has joined #archiveteam-bs
01:06 ^🔗		ndiddy has quit IRC (Read error: Connection reset by peer)
01:07 ^🔗		ndiddy has joined #archiveteam-bs
01:08 ^🔗		Yurume has quit IRC (Read error: Operation timed out)
01:10 ^🔗		Yurume has joined #archiveteam-bs
01:16 ^🔗		Stilett0 has quit IRC (Read error: Operation timed out)
01:17 ^🔗		ndiddy has quit IRC (Remote host closed the connection)
01:17 ^🔗		ndiddy has joined #archiveteam-bs
01:18 ^🔗		ndiddy has quit IRC (Client Quit)
01:18 ^🔗		Jusque has quit IRC (Ping timeout: 260 seconds)
01:24 ^🔗		Jusque has joined #archiveteam-bs
01:28 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
01:30 ^🔗		BlueMax has joined #archiveteam-bs
01:40 ^🔗		Stilett0 has joined #archiveteam-bs
01:41 ^🔗		icedice has joined #archiveteam-bs
03:08 ^🔗		odemg has quit IRC (Ping timeout: 260 seconds)
03:20 ^🔗		odemg has joined #archiveteam-bs
03:22 ^🔗		icedice has quit IRC (Quit: Leaving)
03:23 ^🔗		tsr has quit IRC (Read error: Operation timed out)
03:30 ^🔗		tsr has joined #archiveteam-bs
04:07 ^🔗	w0rmhole	i already have wget, i use it to mirror open directories, so learning new switches to use with that sounds fairly simple. but, `wpull', i tried `brew install wpull' (on macos high sierra here) and got nothing. do i need to download the binary from somewhere special?
04:15 ^🔗	dxrt	w0rmhole: Install it with pip
04:41 ^🔗	Flashfire	w0rmhole I use sitesucker
04:42 ^🔗	Flashfire	Sorry my bad deep vacuum
04:48 ^🔗	ivan	I use grab-site!! please help me stop using grab-site
04:49 ^🔗	ivan	JAA you know what browser-based WARC thing could do, patch chromium to preserve the entire header block (or at least enough to satisfy IA)
04:49 ^🔗	ivan	proxies are super annoying
05:17 ^🔗		nuc is now known as Somebody2
05:31 ^🔗	w0rmhole	dxrt: pip3? is that okay?
05:31 ^🔗	dxrt	yeah
05:32 ^🔗	w0rmhole	thanks ill give it a shot
05:34 ^🔗	w0rmhole	ok, downloaded it. seems to have worked i guess. i just typed `wpull' and it spit out a python error at me. i dont speak python. mind if i pm you the output? just wondering if its working okay or not
05:35 ^🔗	dxrt	sure
05:45 ^🔗		nyaomin is now known as nyaomi
08:05 ^🔗		Mateon1 has quit IRC (Ping timeout: 268 seconds)
08:05 ^🔗		Mateon1 has joined #archiveteam-bs
08:20 ^🔗	JAA	ivan: I doubt extensions can patch Chromium's network stack though...
08:20 ^🔗	JAA	w0rmhole: Use wpull 1.2.3 or 2.0.3 (from FalconK's fork or mine on GitHub), not 2.0.1.
08:22 ^🔗	w0rmhole	i got wpull from somewhere, the version's 1.2.3. i believe it was grabbed with py 2
08:22 ^🔗	JAA	I don't think wpull 1.2.3 supports Python 2. It's all Python 3 (fortunately).
08:23 ^🔗	ivan	JAA: well yes you'd have to compile chromium
08:24 ^🔗	w0rmhole	oh wait nvm it was py 3 i got it with
08:24 ^🔗	JAA	ivan: Yeah, that would work. You just need a 64-core machine with 128 TB of RAM and 10 PB of disk space. Or was that Firefox? I forget. :-)
08:25 ^🔗	Flashfire	JAA what browser do you use?
08:25 ^🔗	JAA	Firefox, why?
08:26 ^🔗	Flashfire	Just curious. I use Firefox at home and chrome at school cause I hate safari and Firefox isn’t supported at school
08:29 ^🔗	JAA	ivan: I've also been thinking about whether it would be worth patching the JS engine to record non-deterministic values from there. E.g. timestamp, user agent, random values, etc. That might improve playback when such values are used to determine URLs, for example. But it'd be a lot of work for fairly little gain.
08:37 ^🔗	ivan	afaik it takes ~1.5 hours on a normal machine with ~4 cores
08:38 ^🔗	ivan	I'm too sleepy to think about the playback problem, I generally don't even want to playback someone's JavaScript
08:38 ^🔗	ivan	unless someone has written some nifty JavaScript game
08:41 ^🔗	w0rmhole	ivan: javascript games!?!?! http://netives.com
08:43 ^🔗	ivan	https://www.chiark.greenend.org.uk/~sgtatham/puzzles/ too
08:43 ^🔗	JAA	ivan: Me neither. Unfortunately, more and more websites are relying on it in a ridiculous way. MEGA is a great example. You can't even read their press releases without launching the entire JS app and allowing cookies (otherwise it fails with a security error).
08:44 ^🔗	JAA	<3 Puzzles
08:45 ^🔗	ivan	well I would hope you can capture the press releases as DOM
08:45 ^🔗	Flashfire	inB4 mega is a security error
08:46 ^🔗	JAA	Yeah, probably.
08:47 ^🔗	JAA	Flashfire: Funnily enough, they are/were: https://mega.nz/blog_47
08:50 ^🔗	JAA	ivan: One example that wouldn't work without JS is Instagram. User profiles only have the currently visible posts (plus a few above and below it) in the DOM. So if you capture the DOM, you'll only get those few posts, not the entire post history.
08:55 ^🔗	w0rmhole	didnt read the link yet, but jaa, is that talking about the mega chrome extension?
08:56 ^🔗	JAA	Yes
08:57 ^🔗	w0rmhole	iirc i saw that on either /r/piracy or /r/privacy
08:57 ^🔗	w0rmhole	probably both lol
09:03 ^🔗	ivan	JAA: in many cases I think you can just scroll down to capture everything (with a javascript shim to avoid unloading images if necessary), then to playback the DOM, maybe optionally lazy-load images to prevent memory consumption from getting out of hand
09:04 ^🔗	ivan	oh I see what you mean now
09:04 ^🔗	ivan	yeah preventing unloading of DOM might be another troublesome problem
09:04 ^🔗	JAA	You'd have to compile an artificial DOM while scrolling.
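
One way to picture the "artificial DOM" idea from the exchange above, as a minimal sketch only: take a snapshot of the visible posts at each scroll position and merge them, keeping the first copy of each post. The post IDs, the HTML strings, and the function name `merge_snapshots` are all hypothetical; nothing here reflects how Instagram or any existing tool actually works.

```python
def merge_snapshots(snapshots):
    """Merge per-scroll-position snapshots into one ordered list of posts.

    Each snapshot is a list of (post_id, html) pairs for the posts that
    were in the DOM at that scroll position (hypothetical data shape).
    """
    merged = {}  # dicts keep insertion order, so first-seen order is preserved
    for snapshot in snapshots:
        for post_id, html in snapshot:
            # Keep the first copy we saw; later duplicates are skipped.
            merged.setdefault(post_id, html)
    return list(merged.items())


if __name__ == "__main__":
    # Two overlapping viewports, as you might see while scrolling down.
    first = [("p1", "<article>1</article>"), ("p2", "<article>2</article>")]
    second = [("p2", "<article>2</article>"), ("p3", "<article>3</article>")]
    for post_id, html in merge_snapshots([first, second]):
        print(post_id, html)
```

The overlap between consecutive snapshots is what makes deduplication by ID necessary; a real capture would also have to deal with images being unloaded, as discussed above.
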
09:04 ^🔗	ivan	the web was better before these React people got their hands on it
09:04 ^🔗	JAA	No doubt
09:06 ^🔗	ivan	I'm scrolling down on instagram and it doesn't seem too crazy, the images are getting unloaded but all of the page is there
09:06 ^🔗	w0rmhole	so, with warc capturing, ignores can only be defined while the page/domain is being captured? and they take place immediately?
09:06 ^🔗	JAA	Is it? I haven't checked in a while, but it broke my method of extracting all post links back before I wrote my scraper.
09:07 ^🔗	ivan	earlier I mentioned my desire for a generic out-of-viewport DOM change canceler
09:08 ^🔗	JAA	w0rmhole: You're mixing things up. WARC just stores whatever you throw into it. You can always filter stuff out afterwards as well. Ignores only affect the retrieval, i.e. they prevent the crawler from descending to certain URLs (for whatever reason, e.g. to prevent infinite recursion like on calendars, or because URLs are inaccessible anyway).
09:12 ^🔗	w0rmhole	so, if im understanding this correctly, i would need to run some other command afterwards to remove the stuff i dont need from the warc? i was just hoping to use `grab-site' and instruct it to capture everything from a single page, except for some facebook .js files under a single url.
09:13 ^🔗	JAA	No, you can specify the relevant ignore at the start of (or during) the crawl. Then it'll never end up in the WARC to begin with.
09:14 ^🔗	w0rmhole	ok i see. my apologies, this is my first time working with warcs
09:15 ^🔗	w0rmhole	how would i go about specifying the ignores at the start?
09:16 ^🔗	JAA	With wpull, --reject-regex (and some other options, e.g. if you want to ignore entire domains, which I don't remember right now). I don't know how it works with grab-site as I've never used it. Should be in the documentation though.
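
As a rough illustration of what an ignore does conceptually (this is not wpull's or grab-site's actual code), the sketch below filters a crawl queue with a regular expression before anything is fetched, so rejected URLs never reach the WARC. The pattern, the hostnames, and the helper names `should_fetch` and `filter_queue` are assumptions made up for the example.

```python
import re

# Hypothetical ignore pattern: .js files on Facebook-owned hosts.
# The hostnames are assumptions for illustration only.
REJECT = re.compile(
    r"^https?://([^/]+\.)?(facebook\.com|fbcdn\.net)/.*\.js(\?.*)?$"
)


def should_fetch(url: str) -> bool:
    """Return True when the URL does not match the ignore pattern."""
    return REJECT.search(url) is None


def filter_queue(urls):
    """Drop ignored URLs before retrieval; they are never requested."""
    return [u for u in urls if should_fetch(u)]


if __name__ == "__main__":
    queue = [
        "https://example.com/index.html",
        "https://connect.facebook.com/en_US/sdk.js",
        "https://example.com/static/app.js",
    ]
    for url in filter_queue(queue):
        print(url)
```

A pattern along these lines is the kind of thing one would pass to wpull's `--reject-regex`; how grab-site takes its ignores should be checked in its documentation.
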
09:17 ^🔗	w0rmhole	ok thanks, i'll keep it in mind. it's 04:16 over here so i need to get to bed. see you all later
09:17 ^🔗	JAA	Good night
09:26 ^🔗		VADemon_ has joined #archiveteam-bs
09:30 ^🔗		VADemon has quit IRC (Read error: Operation timed out)
09:50 ^🔗		VADemon_ has quit IRC (Read error: Connection reset by peer)
09:54 ^🔗		VADemon has joined #archiveteam-bs
10:27 ^🔗		BlueMax has quit IRC (Read error: Connection reset by peer)
13:24 ^🔗		redlizard has quit IRC (Remote host closed the connection)
14:07 ^🔗		wp494 has quit IRC (Ping timeout: 506 seconds)
14:08 ^🔗		wp494 has joined #archiveteam-bs
14:18 ^🔗		ndiddy has joined #archiveteam-bs
14:54 ^🔗		ndiddy has quit IRC (Ping timeout: 268 seconds)
15:19 ^🔗		RichardG_ has joined #archiveteam-bs
15:19 ^🔗		RichardG has quit IRC (Read error: Connection reset by peer)
15:36 ^🔗	godane	so i need some help
15:36 ^🔗	godane	i found out about euscreen.eu
15:36 ^🔗	godane	but they're using a ton of javascript to make their pages
15:37 ^🔗	godane	i can't download the hosted videos using youtube-dl
15:37 ^🔗	godane	and i can't use my custom phantomjs scripts to download the page either
15:37 ^🔗	kiska	Lets throw it into archivebot!
15:37 ^🔗	godane	http://euscreen.eu/item.html?id=TN_1988-07-25
15:38 ^🔗	JAA	kiska: JS-heavy pages don't work well with ArchiveBot.
15:39 ^🔗	kiska	We can try
15:39 ^🔗	JAA	Yeah, that grabbed about what I expected it to grab.
15:40 ^🔗	kiska	btw JAA does the dashboard show ao's
15:40 ^🔗	JAA	Of course.
15:41 ^🔗	kiska	I am only asking since the 3rd version isn't showing it for me
15:42 ^🔗	JAA	Maybe you're lagging behind. I've seen that before. (Yes, even on fast connections.)
15:44 ^🔗	JAA	Wow, that site is disgusting. It sends a POST request to get an XML blob, which it then PUTs to another URL to get a 400 KiB blob of HTML + scripts back.
15:45 ^🔗	kiska	xD
15:45 ^🔗	kiska	Well for maximum uncrawlability thats something people would do
15:48 ^🔗	godane	JAA: its a very weird site
15:48 ^🔗	godane	i was hoping i could have just customized my scripts for grabbing animeheaven.eu videos
15:49 ^🔗	godane	i know it was a long shot but it did use phantomjs to save the page
15:50 ^🔗	kiska	Anyway here is the video link for that video http://stream18.noterik.com/progressive/stream18//domain/euscreenxl/user/eu_ctv/video/EUS_5800A98E71C22960834A8DBEE8F9F1A8/rawvideo/1/raw.mp4?ticket=21513250
15:50 ^🔗	kiska	Oh... its doing a 403 on that
15:51 ^🔗	godane	the trick is the page has to be opening for the grab
15:51 ^🔗	godane	*open
15:52 ^🔗	kiska	It was open
15:52 ^🔗	JAA	Referrer, cookies?
15:52 ^🔗	kiska	Most likely
15:53 ^🔗	kiska	Maybe this "smt__sessionid"
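
If the 403 really is caused by a missing referrer or session cookie, a request along the following lines might be worth trying. This is an untested sketch: the two URLs and the cookie name `smt__sessionid` come from the messages above, while the cookie value is a placeholder and the helper name `build_request` is made up. Nothing in the log confirms that these headers are what the server checks, or that the ticket in the URL is still valid.

```python
from urllib.request import Request

# URLs as posted in the channel; the ticket parameter may have expired.
VIDEO_URL = (
    "http://stream18.noterik.com/progressive/stream18//domain/euscreenxl"
    "/user/eu_ctv/video/EUS_5800A98E71C22960834A8DBEE8F9F1A8"
    "/rawvideo/1/raw.mp4?ticket=21513250"
)
PAGE_URL = "http://euscreen.eu/item.html?id=TN_1988-07-25"


def build_request(video_url: str, page_url: str, session_id: str) -> Request:
    """Build (but do not send) a request carrying a Referer and a cookie."""
    headers = {
        "Referer": page_url,
        # Cookie name taken from the log; the value must come from a
        # real browser session and is only a placeholder here.
        "Cookie": f"smt__sessionid={session_id}",
    }
    return Request(video_url, headers=headers)


if __name__ == "__main__":
    req = build_request(VIDEO_URL, PAGE_URL, "PLACEHOLDER")
    print(req.full_url)
    print(req.header_items())
    # To actually send it: urllib.request.urlopen(req)
```
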
15:59 ^🔗		ndiddy has joined #archiveteam-bs
16:07 ^🔗		RichardG_ is now known as RichardG
16:43 ^🔗	godane	so i found out about this : https://www.radio.cz/en/broadcast-archive/2004-09-30
16:44 ^🔗	godane	its in real media format going back to 2002-02-25
17:01 ^🔗	godane	*2002-03-25
17:31 ^🔗		chferfa has joined #archiveteam-bs
17:56 ^🔗	godane	started uploading Radio Prague : https://archive.org/details/radio-prague-english-2002-03-25
18:46 ^🔗	jut	Would it be worthwhile uploading news from Lithuanian Panorama going back to 2013-12-20? It's 611 GB and not in immediate danger of disappearing.
19:58 ^🔗		VADemon has quit IRC (Read error: Connection reset by peer)
20:51 ^🔗		ndiddy has quit IRC (Read error: Operation timed out)
21:25 ^🔗	Flashfire	JAA can we grab some Burt Reynolds stuff? he has died
22:08 ^🔗		vectr0n has quit IRC (Remote host closed the connection)
22:09 ^🔗		vectr0n has joined #archiveteam-bs
22:18 ^🔗		Stilett0 has quit IRC (Ping timeout: 246 seconds)
22:35 ^🔗		ndiddy has joined #archiveteam-bs
23:23 ^🔗		Sk2d has joined #archiveteam-bs
23:23 ^🔗	godane	SketchCow: i think the vhsvault needs a sub-collection all Godane vhs rips
23:23 ^🔗	godane	*all=call
23:26 ^🔗		Sk1d has quit IRC (Read error: Operation timed out)
23:26 ^🔗		Sk2d is now known as Sk1d
23:28 ^🔗	godane	i only say that cause now my videos may be hard to find for me in that collection
23:28 ^🔗	godane	anyways here are the latest digitize rips uploaded: https://www.patreon.com/posts/digitize-tapes-21256310
23:29 ^🔗	SketchCow	Ha ha
23:29 ^🔗	SketchCow	Maybbbeeeee
23:30 ^🔗	godane	at least i have a full list of the item ids for my videos
23:31 ^🔗	godane	if nothing else, if flemishdog can get a collection then maybe i should have one too
23:32 ^🔗	godane	:-D
23:45 ^🔗		Stilett0 has joined #archiveteam-bs
