#warrior 2018-12-17,Mon

Time	Nickname	Message
00:04 ^🔗		Magirot has quit IRC (Quit: Magirot)
00:34 ^🔗		GitHub181 has joined #warrior
00:34 ^🔗		GitHub181 has left
00:34 ^🔗		GitHub163 has joined #warrior
00:34 ^🔗		GitHub163 has left
02:09 ^🔗		SentientC has joined #warrior
02:11 ^🔗		SentientC has left
02:30 ^🔗		bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
03:05 ^🔗		balrog has quit IRC (Quit: Bye)
03:14 ^🔗		balrog has joined #warrior
03:14 ^🔗		swebb sets mode: +o balrog
05:40 ^🔗		Kaz has quit IRC (Ping timeout: 265 seconds)
05:40 ^🔗		Kaz has joined #warrior
06:10 ^🔗		Magirot has joined #warrior
07:47 ^🔗		Atom has quit IRC (Read error: Operation timed out)
07:51 ^🔗		bibop has joined #warrior
07:51 ^🔗		bibop has left Leaving
07:53 ^🔗		Atom has joined #warrior
08:12 ^🔗		Atom has quit IRC (Ping timeout: 252 seconds)
08:13 ^🔗		Atom has joined #warrior
08:55 ^🔗		chferfa has quit IRC (Quit: Connection closed for inactivity)
09:23 ^🔗		AnEggFore has quit IRC (Ping timeout: 255 seconds)
10:02 ^🔗		surewhyno has joined #warrior
10:03 ^🔗	surewhyno	hey folks. i'm curious, is there a reason why there isn't a version of warrior for web browsers using javascript? i bet it would be easier to convince people who have no technical knowledge to help with archiving that way
10:04 ^🔗	surewhyno	is there a technical reason why it hasn't been done?
10:22 ^🔗	kiska	Because Chrome and all other browsers don't give us the raw headers. WARCs rely on request-response pairs, which you can't capture in a browser
10:22 ^🔗	kiska	And also wanna imagine the performance of running a hypervisor in a browser?
10:22 ^🔗	surewhyno	not sure how warcs work. but javascript can see the response headers of a url it fetches
10:22 ^🔗	surewhyno	why run a hypervisor in a browser? O_o
11:10 ^🔗	kpcyrd	surewhyno: unless the target website disables the same origin policy with cross origin resource sharing you can't archive from a browser
11:11 ^🔗	surewhyno	ah...
11:12 ^🔗	surewhyno	i'd have thought that the archiver website would be able to disable that and ignore any policies the archivee sets. i guess i don't know much about browsers heh
11:18 ^🔗	kpcyrd	surewhyno: the target needs to do that, not the source :) but if eg. tumblr does that we would get a large number of clean ips and you can use some javascript black magic to cloak the source
11:19 ^🔗	surewhyno	then why not a browser plugin?
11:19 ^🔗	surewhyno	that should be able to bypass any SOP nastiness that prevents JS from doing the archival, and would be easier for people to use and install than a VM
11:21 ^🔗	kpcyrd	a browser plugin should work if you rewrite everything to javascript
11:22 ^🔗	surewhyno	would that be a useful thing to work on? i could probably do that
11:28 ^🔗	kpcyrd	surewhyno: I've looked at tumblr specifically, you can't crawl the blogs unless you have an api key, but you could theoretically use javascript as a proxy to ban evade on 66.media.tumblr.com
11:30 ^🔗	surewhyno	well i was thinking of having it interface with the warrior tracker
11:31 ^🔗	surewhyno	not thinking about tumblr specifically, but for future cases
11:31 ^🔗	surewhyno	especially when there's a short period of time. i mean i asked a few people to run their own warrior but all but one did not want to because either they were worried about running a whole vm or it was too complex
13:49 ^🔗		surewhyno has quit IRC (Quit: leaving)
16:03 ^🔗		Chewable has joined #warrior
16:05 ^🔗		Chewable has quit IRC (Client Quit)
16:06 ^🔗		Chewable has joined #warrior
16:08 ^🔗		Chewable has quit IRC (Client Quit)
16:08 ^🔗		Chewable has joined #warrior
16:08 ^🔗		Chewable has quit IRC (Client Quit)
16:08 ^🔗		Chewable has joined #warrior
16:09 ^🔗		Chewable has quit IRC (Client Quit)
16:09 ^🔗		Chewable has joined #warrior
16:09 ^🔗		Chewable has quit IRC (Client Quit)
16:09 ^🔗		Chewable has joined #warrior
16:09 ^🔗		Chewable has quit IRC (Client Quit)
16:09 ^🔗		Chewable has joined #warrior
16:09 ^🔗		Chewable has quit IRC (Client Quit)
16:09 ^🔗		Chewable has joined #warrior
16:09 ^🔗		Chewable has quit IRC (Client Quit)
16:10 ^🔗		Chewable has joined #warrior
16:10 ^🔗		Chewable has quit IRC (Client Quit)
16:10 ^🔗		Chewable has joined #warrior
16:10 ^🔗		Chewable has quit IRC (Client Quit)
16:10 ^🔗		Chewable has joined #warrior
16:10 ^🔗		Chewable has quit IRC (Client Quit)
16:10 ^🔗		Chewable has joined #warrior
16:10 ^🔗		Chewable has quit IRC (Client Quit)
16:10 ^🔗		Chewable has joined #warrior
16:10 ^🔗		Chewable has quit IRC (Client Quit)
16:12 ^🔗		Chewable has joined #warrior
16:12 ^🔗		Chewable has quit IRC (Client Quit)
16:12 ^🔗		Chewable has joined #warrior
16:12 ^🔗		Chewable has quit IRC (Client Quit)
16:17 ^🔗		Chewable has joined #warrior
16:18 ^🔗	Chewable	can anyone see me?
16:23 ^🔗	Chewable	The item I was working on is downloading thousands of URLs and it's taking hours.
16:34 ^🔗	chfoo	item for what project?
16:36 ^🔗	WantedFre	Chewable, I've got two items with 13000+ urls and have been going for over 9 hours.
16:36 ^🔗	WantedFre	This will take time
16:38 ^🔗	Chewable	is it ok if I shut down the warrior when the project is still running?
16:38 ^🔗	Chewable	It was tumblr btw
16:41 ^🔗	chfoo	shutting down the warrior without letting it finish will abandon the item. you can save and resume the virtual machine if you need to pause for a few hours
16:42 ^🔗	Chewable	Got it. Thank you for the response guys :)
16:46 ^🔗	teej_	Chewable: Also check the #tumbledown channel for Tumblr stuff.
16:47 ^🔗	teej_	You'll get answers faster.
16:47 ^🔗	Chewable	Thank you!
17:13 ^🔗		chimyatta has joined #warrior
17:20 ^🔗	JAA	surewhyno, kiska, kpcyrd: Besides SOP, one big issue is that you can't get the raw data as sent across the wire, which is what's supposed to be stored in the WARCs. Browsers simply don't expose such low-level APIs. In the best case, you get a normalised version of it with transfer encoding removed, capitalisation of the header fields standardised, whitespace in header fields trimmed, etc. That's not
17:20 ^🔗	JAA	acceptable for proper archival.
18:08 ^🔗	kpcyrd	JAA: is there a reason we care that much about the headers? I would assume a correct archival of the response body is sufficient for most users.
18:39 ^🔗		JL421_2 has joined #warrior
18:47 ^🔗		Chewable has quit IRC (Leaving)
18:52 ^🔗	chfoo	there's a difference between an archive and just hosting files. it would be like stripping the cover pages off a book. you'd still have the story intact, but you'd lose info about the author, publisher, etc., things that you need to catalog it
20:05 ^🔗		JL421_2 has quit IRC (Quit: http://www.kiwiirc.com/ - A hand crafted IRC client)
20:26 ^🔗		WantedFre has quit IRC (Quit: brb)
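
Editor's note on JAA's 17:20 message: the point about normalised headers can be illustrated with a rough sketch. The function below is hypothetical and only imitates the kind of cleanup JAA describes (header names case-folded, surrounding whitespace trimmed); it is not any browser's actual algorithm, and the two sample responses are invented for illustration. It shows that two responses that differ on the wire become indistinguishable once normalised, so the original bytes that a WARC is supposed to store cannot be recovered.

```python
from typing import List, Tuple


def normalise_header_block(raw: bytes) -> List[Tuple[str, str]]:
    """Imitate a normalised header view: lowercased names, trimmed values.

    Hypothetical helper for illustration only; real browser behaviour
    differs in detail.
    """
    lines = raw.decode("iso-8859-1").split("\r\n")
    headers = []
    for line in lines[1:]:  # skip the status line
        if not line:
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers.append((name.strip().lower(), value.strip()))
    return headers


# Two invented responses whose bytes on the wire are different.
wire_a = b"HTTP/1.1 200 OK\r\nContent-Type:  text/html \r\nX-Served-By: cache-1\r\n\r\n"
wire_b = b"HTTP/1.1 200 OK\r\ncontent-type: text/html\r\nx-served-by:cache-1\r\n\r\n"

print("same bytes on the wire:", wire_a == wire_b)
print("same after normalisation:",
      normalise_header_block(wire_a) == normalise_header_block(wire_b))
```

Since both inputs collapse to the same list of pairs, an archiver that only sees the normalised form has no way to tell which byte sequence the server actually sent.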
