[00:04] *** Magirot has quit IRC (Quit: Magirot)
[00:34] *** GitHub181 has joined #warrior
[00:34] *** GitHub181 has left
[00:34] *** GitHub163 has joined #warrior
[00:34] *** GitHub163 has left
[02:09] *** SentientC has joined #warrior
[02:11] *** SentientC has left
[02:30] *** bithippo has quit IRC (Textual IRC Client: www.textualapp.com)
[03:05] *** balrog has quit IRC (Quit: Bye)
[03:14] *** balrog has joined #warrior
[03:14] *** swebb sets mode: +o balrog
[05:40] *** Kaz has quit IRC (Ping timeout: 265 seconds)
[05:40] *** Kaz has joined #warrior
[06:10] *** Magirot has joined #warrior
[07:47] *** Atom has quit IRC (Read error: Operation timed out)
[07:51] *** bibop has joined #warrior
[07:51] *** bibop has left Leaving
[07:53] *** Atom has joined #warrior
[08:12] *** Atom has quit IRC (Ping timeout: 252 seconds)
[08:13] *** Atom has joined #warrior
[08:55] *** chferfa has quit IRC (Quit: Connection closed for inactivity)
[09:23] *** AnEggFore has quit IRC (Ping timeout: 255 seconds)
[10:02] *** surewhyno has joined #warrior
[10:03] hey folks. i'm curious is there a reason why there isn't a version of warrior for web browsers using javascript? i bet it would be easier to convince people who have no technical knowledge to help with archiving that way
[10:04] is there a technical reason why it hasn't been done?
[10:22] Because chrome and all other browsers don't give us the raw headers. Warcs rely on request - response pairs. Which don't happen in a browser
[10:22] And also wanna imagine the performance of running a hypervisor in a browser?
[10:22] not sure how warcs work. but javascript can see the response headers of a url it fetches
[10:22] why run a hypervisor in a browser? O_o
[11:10] surewhyno: unless the target website disables the same origin policy with cross origin resource sharing you can't archive from a browser
[11:11] ah...
[11:12] i'd have thought that the archiver website would be able to disable that and ignore any policies the archivee sets. i guess i don't know much about browsers heh
[11:18] surewhyno: the target needs to do that, not the source :) but if eg. tumblr does that we would get a large number of clean ips and you can use some javascript black magic to cloak the source
[11:19] then why not a browser plugin?
[11:19] that should be able to bypass any SOP nastiness that prevents JS from doing the archival, and would be easier for people to use and install than a VM
[11:21] a browser plugin should work if you rewrite everything to javascript
[11:22] would that be a useful thing to work on? i could probably do that
[11:28] surewhyno: I've looked at tumblr specifically, you can't crawl the blogs unless you have an api key, but you could theoretically use javascript as a proxy to ban evade on 66.media.tumblr.com
[11:30] well i was thinking of having it interface with the warrior tracker
[11:31] not thinking about tumblr specifically, but for future cases
[11:31] especially when there's a short period of time. i mean i asked a few people to run their own warrior but all but one did not want to because either they were worried about running a whole vm or it was too complex
[13:49] *** surewhyno has quit IRC (Quit: leaving)
[16:03] *** Chewable has joined #warrior
[16:05] *** Chewable has quit IRC (Client Quit)
[16:06] *** Chewable has joined #warrior
[16:08] *** Chewable has quit IRC (Client Quit)
[16:08] *** Chewable has joined #warrior
[16:08] *** Chewable has quit IRC (Client Quit)
[16:08] *** Chewable has joined #warrior
[16:09] *** Chewable has quit IRC (Client Quit)
[16:09] *** Chewable has joined #warrior
[16:09] *** Chewable has quit IRC (Client Quit)
[16:09] *** Chewable has joined #warrior
[16:09] *** Chewable has quit IRC (Client Quit)
[16:09] *** Chewable has joined #warrior
[16:09] *** Chewable has quit IRC (Client Quit)
[16:09] *** Chewable has joined #warrior
[16:09] *** Chewable has quit IRC (Client Quit)
[16:10] *** Chewable has joined #warrior
[16:10] *** Chewable has quit IRC (Client Quit)
[16:10] *** Chewable has joined #warrior
[16:10] *** Chewable has quit IRC (Client Quit)
[16:10] *** Chewable has joined #warrior
[16:10] *** Chewable has quit IRC (Client Quit)
[16:10] *** Chewable has joined #warrior
[16:10] *** Chewable has quit IRC (Client Quit)
[16:10] *** Chewable has joined #warrior
[16:10] *** Chewable has quit IRC (Client Quit)
[16:12] *** Chewable has joined #warrior
[16:12] *** Chewable has quit IRC (Client Quit)
[16:12] *** Chewable has joined #warrior
[16:12] *** Chewable has quit IRC (Client Quit)
[16:17] *** Chewable has joined #warrior
[16:18] can anyone see me?
[16:23] The item I was working on is downloading thousands of URLs and it's taking hours.
[16:34] item for what project?
[16:36] Chewable, I've got two items with 13000+ urls and have been going for over 9 hours.
[16:36] This will take time
[16:38] is it ok if I shutdown the warrior when the project is still running?
[16:38] It was tumblr btw
[16:41] shutting down the warrior without letting it finish will abandon the item. you can save and resume the virtual machine if you need to pause for a few hours
[16:42] Got it. Thank you for the response guys :)
[16:46] Chewable: Also check the #tumbledown channel for Tumblr stuff.
[16:47] You'll get answers faster.
[16:47] Thank you!
[17:13] *** chimyatta has joined #warrior
[17:20] surewhyno, kiska, kpcyrd: Besides SOP, one big issue is that you can't get the raw data as sent across the wire, which is what's supposed to be stored in the WARCs. Browsers simply don't expose such low-level APIs. In the best case, you get a normalised version of it with transfer encoding removed, capitalisation of the header fields standardised, whitespace in header fields trimmed, etc. That's not
[17:20] acceptable for proper archival.
[18:08] JAA: is there a reason we care that much about the headers? I would assume a correct archival of the response body is sufficient for most users.
[18:39] *** JL421_2 has joined #warrior
[18:47] *** Chewable has quit IRC (Leaving)
[18:52] there's a difference between an archive and just hosting files. it would be like stripping the cover pages off a book. you'd still have the story intact, but you'd lose info about the author, publisher, etc things that you need to catalog it
[20:05] *** JL421_2 has quit IRC (Quit: http://www.kiwiirc.com/ - A hand crafted IRC client)
[20:26] *** WantedFre has quit IRC (Quit: brb)
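
Editor's note on the [17:20] point about raw data: the sketch below is only an illustration, not code from the Warrior or any WARC tool. It takes a made-up response (`RAW` is hypothetical) and applies the kinds of normalisation described in the log: header names lowercased, header whitespace trimmed, chunked transfer encoding removed. The helper names `normalise` and `dechunk` are assumptions introduced here, and real browsers may differ in detail.

```python
# Hypothetical bytes as a server might send them over the wire.
RAW = (
    b"HTTP/1.1 200 OK\r\n"
    b"content-TYPE:   text/plain  \r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
    b"0\r\n\r\n"
)


def dechunk(body: bytes) -> bytes:
    """Decode a chunked body (no trailers handled; enough for this sketch)."""
    out = b""
    while body:
        size_line, _, rest = body.partition(b"\r\n")
        size = int(size_line.split(b";")[0], 16)  # chunk size is hexadecimal
        if size == 0:
            break
        out += rest[:size]
        body = rest[size + 2:]  # skip the chunk data and its trailing CRLF
    return out


def normalise(raw: bytes):
    """Return (headers, body) roughly as a high-level client API would."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        # Original capitalisation and padding are thrown away here.
        headers[name.decode("ascii").strip().lower()] = value.decode("ascii").strip()
    if headers.get("transfer-encoding", "").lower() == "chunked":
        body = dechunk(body)
    return headers, body


headers, body = normalise(RAW)
print(headers)
print(body)
```

From `headers` and `body` alone there is no way to tell that the server wrote `content-TYPE` with extra spaces, or where the chunk boundaries were, so the exact bytes in `RAW` cannot be rebuilt. That lost detail is what the WARC is supposed to keep.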