[00:02] *** rejon has quit IRC (Read error: Operation timed out)
[00:31] *** garyrh has quit IRC (Remote host closed the connection)
[00:58] *** Smiley has quit IRC (Ping timeout: 370 seconds)
[00:59] *** garyrh has joined #archiveteam-bs
[01:02] *** mistym has quit IRC (Remote host closed the connection)
[01:19] *** mistym has joined #archiveteam-bs
[01:20] *** DFJustin has joined #archiveteam-bs
[01:20] *** swebb sets mode: +o DFJustin
[01:24] *** primus104 has quit IRC (Leaving.)
[01:34] *** egg_ has quit IRC (quit)
[01:38] *** nico_ is now known as nico_32
[01:44] *** Smiley has joined #archiveteam-bs
[02:21] *** APerti has joined #archiveteam-bs
[02:47] *** APerti_ has joined #archiveteam-bs
[02:50] *** APerti has quit IRC (Read error: Operation timed out)
[03:49] http://thedailywh.at/2015/01/distraction-of-the-day-you-can-now-play-oregon-trail-and-other-ms-dos-games-online/
[04:33] *** aaaaaaaaa has quit IRC (Leaving)
[05:21] *** S_aus_Eur has joined #archiveteam-bs
[05:21] *** S_aus_Eur has left
[05:29] so some of the npr morning radio episodes are going to be in real media
[05:29] these real media files don't derive right at all
[05:29] old ones don't have this problem
[05:30] i hope someone can at least look at them to see what the problem is with IA deriving them
[05:31] these are in real media only: https://archive.org/details/npr-morning-edition-01-02-2003
[05:34] *** mistym has quit IRC (Remote host closed the connection)
[05:37] *** mistym has joined #archiveteam-bs
[05:59] huh, never knew there was a DOS Oregon Trail
[07:13] *** mistym has quit IRC (Remote host closed the connection)
[07:25] *** APerti_ has quit IRC (Read error: Operation timed out)
[07:44] I've started work on a tumblr archiver, here is the code so far: https://mega.co.nz/#!bxJFzL4Z!8h1TQHKJT7WvJRkgiZTPkgbO2gDw7a4VbxFSa1Go-k4
[08:06] *** primus104 has joined #archiveteam-bs
[08:38] godane: fwiw, it has derived now
[08:38] Ctrl-S: use some sort of git hosting, please :D
[08:38] especially since you're already using git...
[08:39] (or at the very least tar.gz, zip isn't very good at unix perms)
[08:51] *** GLaDOS has quit IRC (Ping timeout: 272 seconds)
[08:51] *** GLaDOS has joined #archiveteam-bs
[08:51] *** swebb sets mode: +o GLaDOS
[09:07] *** brayden has quit IRC (Ping timeout: 607 seconds)
[09:54] *** schbirid has joined #archiveteam-bs
[09:55] *** brayden has joined #archiveteam-bs
[10:01] *** kvieta has quit IRC (Read error: Operation timed out)
[10:12] *** kvieta has joined #archiveteam-bs
[11:43] *** primus104 has quit IRC (Leaving.)
[11:47] *** yan has joined #archiveteam-bs
[12:03] is this good enough for you? https://github.com/woodenphone/tumblr-to-db
[12:03] still WIP
[12:03] looks like the 20100919 marshill HD video doesn't work
[12:04] Ctrl-S: yes, git is good :D
[12:04] so i'll try to get the tv_sd_progressive version of that video
[12:04] goal is to save tumblr blogs to a db so i can scrape remotely and retrieve to my metered home connection
[12:04] HTTrack automation just doesn't cut it
[12:05] also HTTrack does not remember where it has been
[12:05] or rather, it does not understand the difference between posts and the listings
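A minimal sketch of the "remember where it has been" behaviour Ctrl-S wants and HTTrack lacks: walk a blog's /page/N listings and keep a small database of post URLs already fetched, so re-runs only grab new posts. Everything here (the regex, the schema, the use of urllib2 to match the Python 2 era) is a hypothetical illustration, not the actual tumblr-to-db code:

```python
import re
import sqlite3
import urllib2

# Hypothetical post-URL pattern; real markup may need a proper HTML parser.
POST_LINK = re.compile(r'https?://[\w-]+\.tumblr\.com/post/\d+[^"\'\s]*')

def crawl_blog(blog, db_path='seen.sqlite'):
    """Walk a blog's /page/N listings, fetching only posts not seen before."""
    db = sqlite3.connect(db_path)
    db.execute('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)')
    page = 1
    while True:
        listing = urllib2.urlopen(
            'http://%s.tumblr.com/page/%d' % (blog, page)).read()
        post_urls = set(POST_LINK.findall(listing))
        if not post_urls:
            break  # ran off the end of the blog
        for url in post_urls:
            if db.execute('SELECT 1 FROM seen WHERE url = ?',
                          (url,)).fetchone():
                continue  # already grabbed on an earlier run
            html = urllib2.urlopen(url).read()
            # ... extract the post body and media into the DB here ...
            db.execute('INSERT INTO seen (url) VALUES (?)', (url,))
            db.commit()
        page += 1
```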
[12:13] Ctrl-S: shouldn't you be using WARC, though?
[12:14] Filesize must be minimised
[12:14] Purpose is to save the blogs, not to shove into IA
[12:14] Most important things are the posts and the media
[12:15] WARC overhead is negligible, really
[12:15] WARC isn't just for IA either :)
[12:16] basically the problem this software is supposed to address is: Tumblr makes it really easy to get a blog deleted
[12:16] Ctrl-S: thing is, if you're making HTTP requests anyway, you might as well dump them into a WARC?
[12:16] mm
[12:16] I have a metered home connection
[12:17] ok?
[12:17] so unless WARC can handle lots of compression, it's not going to be suitable
[12:17] wha
[12:17] if it can, i can switch over
[12:17] Ctrl-S: WARC is a storage format
[12:17] I know
[12:17] it has nothing to do with your connection
[12:17] at all
[12:18] it stores data that your client has *anyway*
[12:18] I plan to run it in another country where data is cheaper
[12:18] then pull once it's finished
[12:18] Ctrl-S: what does 'data cap' have to do with WARC? you keep referring to it, but I don't see where it comes into the picture
[12:18] to get the data from a remote machine that runs the script to my machine
[12:18] ...?
[12:19] I still don't get it...
[12:19] me with metered home connection <-> friend with big unmetered pipe <-> internet
[12:19] yes?
[12:20] again, what does this have to do with WARC?
[12:20] If i extract data into a DB there is less data to move
[12:20] half the HTML will be removed
[12:20] or more
[12:20] move from where to where?
[12:21] The data follows this path: Tumblr -> Scraper machine -> my machine
[12:22] that second link is the bottleneck
[12:22] why would you need to move the WARC to your machine?
[12:22] Can't trust remote storage
[12:22] ....?
[12:22] Much better to have a HDD i can hold myself
[12:23] unless WARC is more than HTML with metadata
[12:23] Ctrl-S: I don't really understand where you're seeing a problem
[12:23] you are *already* extracting the content and storing it locally
[12:23] storing the WARC elsewhere doesn't make you lose anything
[12:23] at best it will make you have a WARC in a remote location
[12:23] I'm sorry, I don't understand
[12:23] extracting the butane from it all into a world class mma fighter how is that bullshit
[12:23] at worst the WARC will be lost and you'll still have the same data as when you're not making a WARC
[12:23] I can make it dump to warc
[12:24] That's probably easier than using a db
[12:24] Ctrl-S: I'm not saying to replace one with the other
[12:24] it's just that I want as small a file size as possible after the download has finished
[12:24] I'm saying that you can *also* dump to WARC
[12:24] to replace the police
[12:24] I intend to try for both if i add warc stuff
[12:24] can somebody kick that markov bot please
[12:25] markov bot?
[12:25] balrog: closure: DFJustin: ersi: Famicoman: Kenshin: SadDM: SketchCow: swebb: underscor: yipdw: sorry for the mass highlight, but we have a markov bot misbehaving (snuffy)
[12:25] see above
[12:25] I don't have +o
[12:26] I'll look at libraries for WARC now
[12:26] Ctrl-S: pseudo-AI bot, absorbs what people say then starts randomly outputting vaguely related-seeming sentences
[12:26] can be amusing, but not in discussions...
[12:26] you could tell that from one message?
[12:26] yes
[12:26] they have fairly predictable patterns
[12:26] look carefully
[12:26] [13:23] you are *already* extracting the content and storing it locally
[12:26] [13:23] extracting the butane from it all into a world class mma fighter how is that bullshit
[12:26] oh
[12:26] yeah
[12:26] i see
[12:26] nonsensical sentence, valid grammar, copying an unusual word
[12:27] very typical markov bot pattern :P
[12:27] it uses word associations, basically
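For reference, the "word associations" technique joepie91 describes is a first-order Markov chain: record which words follow which in the input, then take a random walk through those pairs. A toy sketch (obviously not snuffy's actual code):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Record, for every word, the words observed to follow it."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def babble(chain, length=12):
    """Random-walk the chain: each word picks a random recorded successor."""
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        successors = chain.get(word)
        if not successors:
            break  # dead end: the word only ever appeared last
        word = random.choice(successors)
        output.append(word)
    return ' '.join(output)

# e.g. babble(build_chain(channel_log_text)) yields grammatical-looking
# nonsense that reuses the channel's unusual words -- exactly the pattern
# spotted above.
```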
[12:27] anyway
[12:27] back to the topic
[12:27] so to make a WARC, what data do I need?
[12:27] Ctrl-S: extracting into a DB is fine for personal copies, but it's probably a good idea to just remotely store a copy of the WARC.. there's a python lib for it afaik
[12:27] ATM I use mechanize for web requests
[12:27] my friend requests
[12:28] the request headers and body (usually just headers), and the response headers and body
[12:28] that's it really
[12:28] warc lib should tell you the specific data needed
[12:28] hopefully
[12:30] is there an easy way to tell HTTrack to output to WARC?
[12:31] httrack doesn't understand warc, as far as I am aware
[12:31] that is why I recommend wget to people :P
[12:31] windows
[12:31] Ctrl-S: wget for windows is a thing
[12:32] I think I had problems with the filename handling?
[12:32] http://gnuwin32.sourceforge.net/packages/wget.htm
[12:32] no idea
[12:33] know of anything that uses both WARC and mechanize in python?
[12:33] example code makes everything easier
[12:35] *** rejon has joined #archiveteam-bs
[12:36] Ctrl-S: no clue
[12:36] I would honestly rather get this working than search for information on linking the warc stuff to mechanize, but once it's done i'll consider doing it
[12:37] everything goes through a single get() function for web requests, so i suppose i could slip something into that afterwards
[12:38] something that works now, perfection later
[12:38] mhmm
[12:39] snuffy: Destination Drigible
[12:40] snuffy: Last broken maid harvey clam, bring destination forgotten grass-fed.
[12:40] *** SketchCow sets mode: +b *!*bkr@*.mindhackers.org
[12:40] *** snuffy was kicked by SketchCow (snuffy)
[12:40] WARC doesn't need context, just URL, metadata for both directions, and the response, right?
[12:41] if that is true, I can just change one function afterwards to set it up
[12:43] Ctrl-S: also request body, but if you're only doing GET requests that doesn't really matter
[12:43] SketchCow: hehe, poisoning its word association cache? :P
[12:43] also, thanks
[12:44] the function is named get(), it takes a URL and returns the page/file
[12:44] it hides the cookies etc from the rest of the code
[12:44] yes, you'll need to capture the request headers also
[12:45] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:45] sounds doable, once i learn how to work with the libs.
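One Python library that handles the record layout is warcio; a sketch of slipping WARC output into a single get()-style function, as discussed above, might look like the following. The mechanize usage and the hard-coded status line are assumptions, not code from the actual project:

```python
from io import BytesIO

import mechanize
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

browser = mechanize.Browser()

def get(url, warc_path='tumblr.warc.gz'):
    """Fetch a URL; append the response to a gzipped WARC as a side effect."""
    response = browser.open(url)
    body = response.read()
    with open(warc_path, 'ab') as fh:
        writer = WARCWriter(fh, gzip=True)  # one gzip member per record
        # Assumed: mechanize's response.info() yields (name, value) pairs,
        # and a successful open means 200; real code should use the actual
        # status line and also write a matching 'request' record.
        http_headers = StatusAndHeaders('200 OK', response.info().items(),
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(url, 'response',
                                           payload=BytesIO(body),
                                           http_headers=http_headers)
        writer.write_record(record)
    return body
```

Since warcio gzips each record individually, the earlier "can WARC handle lots of compression" worry is largely handled by the format itself.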
[12:46] eurgh, I have to get the date of the post from the archive page, rather than the post itself
[12:47] I was hoping to pass a single numerical string
[12:51] is anybody grabbing the coverage from Paris?
[12:51] what coverage?
[12:52] Ctrl-S: http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo
[12:52] .t
[12:52] Wed, 07 Jan 2015 12:52:09 GMT
[12:52] ...
[12:52] .title http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo
[12:52] joepie91: Charlie Hebdo shooting: twelve dead at Paris offices of satirical magazine – live updates | World news | The Guardian
[12:53] do we have archives of this satirical newspaper?
[12:53] I don't know, but we should
[12:53] *** ersi sets mode: +o joepie91
[12:53] ivan`: ?
[12:53] what's the status on that?
[12:53] ersi: thanks
[12:54] uh oh
[12:54] Ctrl-S: https://t.co/bHl4vKTZUg
[12:54] does this load for you
[12:54] slowly
[12:54] blank page so far
[12:54] connected...
[12:55] i'm in wa.au, btw
[12:55] perth
[12:55] might want to ask someone in france
[12:55] 504
[12:56] :/
[12:56] yeah, it's down I think...
[13:03] joepie91: works here
[13:03] via ovh proxy
[13:04] works here, .uk
[13:07] yeah, works here now as well, but slow
[13:09] yep
[13:11] *** primus104 has joined #archiveteam-bs
[13:14] *** Ravenloft has quit IRC (Ping timeout: 370 seconds)
[14:06] I can't check this right now, apparently it's a video of the shooting.. http://www.liveleak.com/view?i=bc6_1420632668
[14:06] probably nsfw/l, don't click if you don't want to.
[14:13] Kazzy: contains one person shot to death :(
[14:14] sigh :(
[14:15] what's the name of the magazine?
[14:16] charlie hebdo
[14:18] is someone archiving the video?
[14:18] the liveleak video was grabbed through archivebot
[14:20] the video? or just the page?
[14:20] *** APerti has joined #archiveteam-bs
[14:21] i have no idea if it grabbed the video too, if someone has stuff on hand to grab it, please do.
[14:22] Kazzy: youtube-dl'ing it
[14:22] looks like youtube-dl groks liveleak, so that's good
[14:27] *** sankin has joined #archiveteam-bs
[14:28] *** garyrh has quit IRC (Read error: Operation timed out)
[14:57] *** norbert79 has quit IRC (Quit: leaving)
[15:00] chfoo: how feasible would it be for wpull to feed youtube links into youtube-dl or something like that?
[15:06] *** bauruine has joined #archiveteam-bs
[15:08] what is wpull?
[15:09] this is possible: https://github.com/woodenphone/Youtube-dl-runner
[15:09] Ctrl-S: it's a drop-in replacement (with some changes) for wget, written in Python
[15:09] no idea about the wpull side
[15:11] Ctrl-S: https://github.com/chfoo/wpull if you're interested
[15:18] if someone can grab a copy of this, please do soon.. it's live-updating so probably not worth grabbing just yet http://www.bbc.com/news/live/world-europe-30710777
[15:18] HTTrack with a new output dir each run?
[15:19] shell script to run it at a 5-10 min interval?
[15:21] I'm stuck on a chromebook with 10% battery, can't do much from here :p
[15:24] I have a linux box; you write a script to install and run whatever it is to download the stuff, i'll run it
[15:25] I thought that chmod -R 777 * was a good idea
[15:25] chmod -R 777 /
[15:25] so i'm not the guy that should write it
[15:25] anddd run
[15:25] it did help fix my problem
[15:25] maybe
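A sketch of the "new output dir each run, every 5-10 minutes" idea above, using wget (which writes WARCs natively) rather than HTTrack; the URL comes from the log, but the naming scheme and interval are arbitrary choices:

```python
import subprocess
import time
from datetime import datetime

URL = 'http://www.bbc.com/news/live/world-europe-30710777'

while True:
    # New output name per run, so every snapshot of the live page is kept.
    stamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
    subprocess.call(['wget', '--page-requisites',
                     '--warc-file=bbc-live-%s' % stamp,
                     '--directory-prefix=bbc-live-%s' % stamp,
                     URL])
    time.sleep(600)  # ten minutes between snapshots
```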
[15:34] *** mistym has joined #archiveteam-bs
[15:35] *** garyrh has joined #archiveteam-bs
[15:37] *** mistym has quit IRC (Remote host closed the connection)
[15:39] *** norbert79 has joined #archiveteam-bs
[15:46] can we grab this? https://www.youtube.com/watch?v=LeIy0zH77MM#t=1624 livestream on YT
[15:46] (dump the timemarker btw)
[15:51] *** aaaaaaaaa has joined #archiveteam-bs
[15:55] *** mistym has joined #archiveteam-bs
[16:16] *** bauruine has quit IRC (Ping timeout: 265 seconds)
[16:19] *** godane has quit IRC (Read error: Operation timed out)
[16:21] *** bauruine has joined #archiveteam-bs
[16:22] *** Start is now known as StartAway
[16:22] *** StartAway is now known as Start
[16:31] *** godane has joined #archiveteam-bs
[16:40] *** dashcloud has quit IRC (Remote host closed the connection)
[16:41] *** dashcloud has joined #archiveteam-bs
[16:54] *** rejon has quit IRC (Ping timeout: 335 seconds)
[16:58] *** mistym has quit IRC (Remote host closed the connection)
[17:09] *** Kassia19 has joined #archiveteam-bs
[17:10] *** Kassia19 has quit IRC (Read error: Connection reset by peer)
[17:14] *** mistym has joined #archiveteam-bs
[17:19] Ctrl-S: fyi archivebot does tumblr archiving ok
[17:22] *** rejon has joined #archiveteam-bs
[17:34] woot, i found a bug on github
[17:34] too dumb to figure out if it is a vulnerability though
[17:36] schbirid: it's Ruby, I think? so yes, probably
[17:36] :P
[17:36] do they have a bounty program?
[17:37] yeah
[17:45] hm, seems to escape one element too many
[17:45] not one too few
[17:59] *** midas1 has joined #archiveteam-bs
[18:12] *** rejon has quit IRC (Read error: Operation timed out)
[18:18] *** Coderjoe_ has joined #archiveteam-bs
[18:21] *** primus104 has quit IRC (hub.se irc.efnet.pl)
[18:21] *** schbirid has quit IRC (hub.se irc.efnet.pl)
[18:21] *** primus has quit IRC (hub.se irc.efnet.pl)
[18:21] *** Coderjoe has quit IRC (hub.se irc.efnet.pl)
[18:22] *** primus_ has joined #archiveteam-bs
[18:27] *** schbirid2 has joined #archiveteam-bs
[19:15] *** rejon has joined #archiveteam-bs
[19:37] *** rejon has quit IRC (Ping timeout: 335 seconds)
[19:55] *** Ravenloft has joined #archiveteam-bs
[20:12] *** mistym has quit IRC (Remote host closed the connection)
[20:36] *** mistym has joined #archiveteam-bs
[20:42] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[21:04] *** mistym has quit IRC (Remote host closed the connection)
[21:07] *** aaaaaaaaa has joined #archiveteam-bs
[21:20] *** mistym has joined #archiveteam-bs
[21:27] *** bsmith093 has quit IRC (Read error: Connection reset by peer)
[21:34] *** abartov has quit IRC (Ping timeout: 258 seconds)
[21:39] *** bsmith093 has joined #archiveteam-bs
[21:43] *** yipdw has quit IRC (Quit: yipdw)
[21:43] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:43] *** yipdw has joined #archiveteam-bs
[21:45] *** schbirid2 has quit IRC (Quit: Leaving)
[21:47] *** dashcloud has joined #archiveteam-bs
[21:49] *** abartov has joined #archiveteam-bs
[21:57] *** sankin has quit IRC (Leaving.)
[22:10] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:11] balrog: if it works using an http proxy, it should be doable
[22:11] chfoo: it would involve detecting a supported URL and feeding it to the program, I think
[22:11] I'm a little worried that archivebot doesn't capture youtube videos themselves
[22:12] oh, it's in python
[22:13] balrog: it could be done, I'd prefer to have a working replay solution first
[22:13] replay?
[22:13] that's why I pointed out that pywb-webrecorder can do it
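The "detect a supported URL and feed it to the program" step could be as simple as the sketch below. The pattern list is illustrative only (youtube-dl's real extractor table is far larger), and nothing here reflects wpull's actual plugin/hook API:

```python
import re
import subprocess

# Illustrative patterns only; youtube-dl supports many more sites.
VIDEO_PAGE = re.compile(
    r'https?://(?:www\.)?(?:youtube\.com/watch|youtu\.be/|liveleak\.com/view)')

def maybe_grab_video(url):
    """Hand anything that looks like a video page off to youtube-dl."""
    if VIDEO_PAGE.search(url):
        subprocess.call(['youtube-dl', '--no-overwrites', url])
```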
[22:14] doesn't archive.org already have some method of grabbing some youtube stuff?
[22:14] maybe, but as far as I can tell it's not documented
[22:14] ah :/
[22:14] anyway, pywb seems to have Deep Magic From Before The Dawn of Time to do this, so I keep thinking it might be interesting to use its proxy + wpull
[22:15] *** dashcloud has joined #archiveteam-bs
[22:15] another problem is making this not cause WARC sizes to blow up any more than they do in the default !a case
[22:22] Deep Magic From Before The Dawn of Time where?
[22:22] https://github.com/ikreymer/pywb/blob/4c08a6a06404388e673ed37a6969023712d91c18/pywb/static/vidrw.js
[22:22] it's doing a bunch of transformation
[22:42] yeah
[22:42] also injecting flowplayer, etc.
[23:04] *** APerti has quit IRC (Read error: Operation timed out)
[23:13] *** APerti has joined #archiveteam-bs
[23:13] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:13] *** dashcloud has joined #archiveteam-bs
[23:18] *** BlueMaxim has joined #archiveteam-bs
[23:22] *** APerti has quit IRC (Read error: Operation timed out)
[23:33] *** abartov has quit IRC (Ping timeout: 258 seconds)
[23:34] *** Ebony27 has joined #archiveteam-bs
[23:35] *** Ebony27 has quit IRC (Read error: Connection reset by peer)
[23:42] http://techcrunch.com/2015/01/07/is-youtube-the-yahoo-of-2015/
[23:58] Even BuzzFeed knows point No. 5, and they are the intellectual toilet of the Internet.
[23:58] ouch
[23:58] *flush*