[00:02] *** rejon has quit IRC (Read error: Operation timed out)
[00:31] *** garyrh has quit IRC (Remote host closed the connection)
[00:58] *** Smiley has quit IRC (Ping timeout: 370 seconds)
[00:59] *** garyrh has joined #archiveteam-bs
[01:02] *** mistym has quit IRC (Remote host closed the connection)
[01:19] *** mistym has joined #archiveteam-bs
[01:20] *** DFJustin has joined #archiveteam-bs
[01:20] *** swebb sets mode: +o DFJustin
[01:24] *** primus104 has quit IRC (Leaving.)
[01:34] *** egg_ has quit IRC (quit)
[01:38] *** nico_ is now known as nico_32
[01:44] *** Smiley has joined #archiveteam-bs
[02:21] *** APerti has joined #archiveteam-bs
[02:47] *** APerti_ has joined #archiveteam-bs
[02:50] *** APerti has quit IRC (Read error: Operation timed out)
[03:49] http://thedailywh.at/2015/01/distraction-of-the-day-you-can-now-play-oregon-trail-and-other-ms-dos-games-online/
[04:33] *** aaaaaaaaa has quit IRC (Leaving)
[05:21] *** S_aus_Eur has joined #archiveteam-bs
[05:21] *** S_aus_Eur has left
[05:29] so some of the npr morning radio episodes are going to be in real media
[05:29] these real media files don't derive right at all
[05:29] old ones don't have this problem
[05:30] i hope someone can at least look at them to see what the problem is with IA deriving them
[05:31] these are in real media only: https://archive.org/details/npr-morning-edition-01-02-2003
[05:34] *** mistym has quit IRC (Remote host closed the connection)
[05:37] *** mistym has joined #archiveteam-bs
[05:59] huh, never knew there was a DOS Oregon Trail
[07:13] *** mistym has quit IRC (Remote host closed the connection)
[07:25] *** APerti_ has quit IRC (Read error: Operation timed out)
[07:44] I've started work on a tumblr archiver, here is the code so far: https://mega.co.nz/#!bxJFzL4Z!8h1TQHKJT7WvJRkgiZTPkgbO2gDw7a4VbxFSa1Go-k4
[08:06] *** primus104 has joined #archiveteam-bs
[08:38] godane: fwiw, it has derived now
[08:38] Ctrl-S: use some sort of git hosting, please :D
[08:38] especially since you're already using git...
[08:39] (or at the very least tar.gz, zip isn't very good at unix perms)
[08:51] *** GLaDOS has quit IRC (Ping timeout: 272 seconds)
[08:51] *** GLaDOS has joined #archiveteam-bs
[08:51] *** swebb sets mode: +o GLaDOS
[09:07] *** brayden has quit IRC (Ping timeout: 607 seconds)
[09:54] *** schbirid has joined #archiveteam-bs
[09:55] *** brayden has joined #archiveteam-bs
[10:01] *** kvieta has quit IRC (Read error: Operation timed out)
[10:12] *** kvieta has joined #archiveteam-bs
[11:43] *** primus104 has quit IRC (Leaving.)
[11:47] *** yan has joined #archiveteam-bs
[12:03] is this good enough for you? https://github.com/woodenphone/tumblr-to-db
[12:03] still WIP
[12:03] looks like the 20100919 marshill HD video doesn't work
[12:04] Ctrl-S: yes, git is good :D
[12:04] so i'll try to get the tv_sd_progressive version of that video
[12:04] goal is to save tumblr blogs to a db so i can scrape remotely and retrieve to my metered home connection
[12:04] HTTrack automation just doesn't cut it
[12:05] also HTTrack does not remember where it has been
[12:05] or rather, it does not understand the difference between posts and the listings
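A minimal sketch of the "remember where it has been" behaviour Ctrl-S wants and HTTrack lacks: walk a blog's /page/N listings and keep a small database of post URLs already fetched, so re-runs only grab new posts. Everything here (the regex, the schema, the use of urllib2 to match the Python 2 era) is a hypothetical illustration, not the actual tumblr-to-db code:

```python
import re
import sqlite3
import urllib2

# Hypothetical post-URL pattern; real markup may need a proper HTML parser.
POST_LINK = re.compile(r'https?://[\w-]+\.tumblr\.com/post/\d+[^"\'\s]*')

def crawl_blog(blog, db_path='seen.sqlite'):
    """Walk a blog's /page/N listings, fetching only posts not seen before."""
    db = sqlite3.connect(db_path)
    db.execute('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)')
    page = 1
    while True:
        listing = urllib2.urlopen(
            'http://%s.tumblr.com/page/%d' % (blog, page)).read()
        post_urls = set(POST_LINK.findall(listing))
        if not post_urls:
            break  # ran off the end of the blog
        for url in post_urls:
            if db.execute('SELECT 1 FROM seen WHERE url = ?',
                          (url,)).fetchone():
                continue  # already grabbed on an earlier run
            html = urllib2.urlopen(url).read()
            # ... extract the post body and media into the DB here ...
            db.execute('INSERT INTO seen (url) VALUES (?)', (url,))
            db.commit()
        page += 1
```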
[12:13] Ctrl-S: shouldn't you be using WARC, though?
[12:14] Filesize must be minimised
[12:14] Purpose is to save the blogs, not to shove into IA
[12:14] Most important things are the posts and the media
[12:15] WARC overhead is negligible, really
[12:15] WARC isn't just for IA either :)
[12:16] basically the problem this software is supposed to address is: Tumblr makes it really easy to get a blog deleted
[12:16] Ctrl-S: thing is, if you're making HTTP requests anyway, you might as well dump them into a WARC?
[12:16] mm
[12:16] I have a metered home connection
[12:17] ok?
[12:17] so unless WARC can handle lots of compression, it's not going to be suitable
[12:17] wha
[12:17] if it can, i can switch over
[12:17] Ctrl-S: WARC is a storage format
[12:17] I know
[12:17] it has nothing to do with your connection
[12:17] at all
[12:18] it stores data that your client has *anyway*
[12:18] I plan to run it in another country where data is cheaper
[12:18] then pull once it's finished
[12:18] Ctrl-S: what does 'data cap' have to do with WARC? you keep referring to it, but I don't see where it comes into the picture
[12:18] to get the data from a remote machine that runs the script to my machine
[12:18] ...?
[12:19] I still don't get it...
[12:19] me with metered home connection <-> friend with big unmetered pipe <-> internet
[12:19] yes?
[12:20] again, what does this have to do with WARC?
[12:20] If i extract data into a DB there is less data to move
[12:20] half the HTML will be removed
[12:20] or more
[12:20] move from where to where?
[12:21] The data follows this path: Tumblr -> Scraper machine -> my machine
[12:22] that second link is the bottleneck
[12:22] why would you need to move the WARC to your machine?
[12:22] Can't trust remote storage
[12:22] ....?
[12:22] Much better to have a HDD i can hold myself
[12:23] unless WARC is more than HTML with metadata
[12:23] Ctrl-S: I don't really understand where you're seeing a problem
[12:23] you are *already* extracting the content and storing it locally
[12:23] storing the WARC elsewhere doesn't make you lose anything
[12:23] at best it will make you have a WARC in a remote location
[12:23] I'm sorry, I don't understand
[12:23] extracting the butane from it all into a world class mma fighter how is that bullshit
[12:23] at worst the WARC will be lost and you'll still have the same data as when you're not making a WARC
[12:23] I can make it dump to warc
[12:24] That's probably easier than using a db
[12:24] Ctrl-S: I'm not saying to replace one with the other
[12:24] it's just that I want as small a file size as possible after the download has finished
[12:24] I'm saying that you can *also* dump to WARC
[12:24] to replace the police
[12:24] I intend to try for both if i add warc stuff
[12:24] can somebody kick that markov bot please
[12:25] markov bot?
[12:25] balrog: closure: DFJustin: ersi: Famicoman: Kenshin: SadDM: SketchCow: swebb: underscor: yipdw: sorry for the mass highlight, but we have a markov bot misbehaving (snuffy)
[12:25] see above
[12:25] I don't have +o
[12:26] I'll look at libraries for WARC now
[12:26] Ctrl-S: pseudo-AI bot, absorbs what people say then starts randomly outputting vaguely related-seeming sentences
[12:26] can be amusing, but not in discussions...
[12:26] you could tell that from one message?
[12:26] yes
[12:26] they have fairly predictable patterns
[12:26] look carefully
[12:26] [13:23] you are *already* extracting the content and storing it locally
[12:26] [13:23] extracting the butane from it all into a world class mma fighter how is that bullshit
[12:26] oh
[12:26] yeah
[12:26] i see
[12:26] nonsensical sentence, valid grammar, copying an unusual word
[12:27] very typical markov bot pattern :P
[12:27] it uses word associations, basically
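For reference, the "word associations" technique joepie91 describes is a first-order Markov chain: record which words follow which in the input, then take a random walk through those pairs. A toy sketch (obviously not snuffy's actual code):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Record, for every word, the words observed to follow it."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def babble(chain, length=12):
    """Random-walk the chain: each word picks a random recorded successor."""
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        successors = chain.get(word)
        if not successors:
            break  # dead end: the word only ever appeared last
        word = random.choice(successors)
        output.append(word)
    return ' '.join(output)

# e.g. babble(build_chain(channel_log_text)) yields grammatical-looking
# nonsense that reuses the channel's unusual words -- exactly the pattern
# spotted above.
```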
[12:27] anyway
[12:27] back to the topic
[12:27] so to make a WARC, what data do I need?
[12:27] Ctrl-S: extracting into a DB is fine for personal copies, but it's probably a good idea to just remotely store a copy of the WARC.. there's a python lib for it afaik
[12:27] ATM I use mechanize for web requests
[12:27] my friend requests
[12:28] the request headers and body (usually just headers), and the response headers and body
[12:28] that's it really
[12:28] warc lib should tell you the specific data needed
[12:28] hopefully
[12:30] is there an easy way to tell HTTrack to output to WARC?
[12:31] httrack doesn't understand warc, as far as I am aware
[12:31] that is why I recommend wget to people :P
[12:31] windows
[12:31] Ctrl-S: wget for windows is a thing
[12:32] I think I had problems with the filename handling?
[12:32] http://gnuwin32.sourceforge.net/packages/wget.htm
[12:32] no idea
[12:33] know of anything that uses both WARC and mechanize in python?
[12:33] example code makes everything easier
[12:35] *** rejon has joined #archiveteam-bs
[12:36] Ctrl-S: no clue
[12:36] I would honestly rather get this working than search for information on linking the warc stuff to mechanize, but once it's done i'll consider doing it
[12:37] everything goes through a single get() function for web requests, so i suppose i could slip something into that afterwards
[12:38] something that works now, perfection later
[12:38] mhmm
[12:39] snuffy: Destination Drigible
[12:40] snuffy: Last broken maid harvey clam, bring destination forgotten grass-fed.
[12:40] *** SketchCow sets mode: +b *!*bkr@*.mindhackers.org
[12:40] *** snuffy was kicked by SketchCow (snuffy)
[12:40] WARC doesn't need context, just URL, metadata for both directions, and the response, right?
[12:41] if that is true, I can just change one function afterwards to set it up
[12:43] Ctrl-S: also request body, but if you're only doing GET requests that doesn't really matter
[12:43] SketchCow: hehe, poisoning its word association cache? :P
[12:43] also, thanks
[12:44] the function is named get(), it takes a URL and returns the page/file
[12:44] it hides the cookies etc from the rest of the code
[12:44] yes, you'll need to capture the request headers also
[12:45] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:45] sounds doable, once i learn how to work with the libs.
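One Python library that handles the record layout is warcio; a sketch of slipping WARC output into a single get()-style function, as discussed above, might look like the following. The mechanize usage and the hard-coded status line are assumptions, not code from the actual project:

```python
from io import BytesIO

import mechanize
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

browser = mechanize.Browser()

def get(url, warc_path='tumblr.warc.gz'):
    """Fetch a URL; append the response to a gzipped WARC as a side effect."""
    response = browser.open(url)
    body = response.read()
    with open(warc_path, 'ab') as fh:
        writer = WARCWriter(fh, gzip=True)  # one gzip member per record
        # Assumed: mechanize's response.info() yields (name, value) pairs,
        # and a successful open means 200; real code should use the actual
        # status line and also write a matching 'request' record.
        http_headers = StatusAndHeaders('200 OK', response.info().items(),
                                        protocol='HTTP/1.1')
        record = writer.create_warc_record(url, 'response',
                                           payload=BytesIO(body),
                                           http_headers=http_headers)
        writer.write_record(record)
    return body
```

Since warcio gzips each record individually, the earlier "can WARC handle lots of compression" worry is largely handled by the format itself.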
[12:46] eurgh, I have to get the date of the post from the archive page, rather than the post itself
[12:47] I was hoping to pass a single numerical string
[12:51] is anybody grabbing the coverage from Paris?
[12:51] what coverage?
[12:52] Ctrl-S: http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo
[12:52] .t
[12:52] Wed, 07 Jan 2015 12:52:09 GMT
[12:52] ...
[12:52] .title http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo
[12:52] joepie91: Charlie Hebdo shooting: twelve dead at Paris offices of satirical magazine – live updates | World news | The Guardian
[12:53] do we have archives of this satirical newspaper?
[12:53] I don't know, but we should
[12:53] *** ersi sets mode: +o joepie91
[12:53] ivan`: ?
[12:53] what's the status on that?
[12:53] ersi: thanks
[12:54] uh oh
[12:54] Ctrl-S: https://t.co/bHl4vKTZUg
[12:54] does this load for you
[12:54] slowly
[12:54] blank page so far
[12:54] connected...
[12:55] i'm in wa.au, btw
[12:55] perth
[12:55] might want to ask someone in france
[12:55] 504
[12:56] :/
[12:56] yeah, it's down I think...
[13:03] joepie91: works here
[13:03] via ovh proxy
[13:04] works here, .uk
[13:07] yeah, works here now as well, but slow
[13:09] yep
[13:11] *** primus104 has joined #archiveteam-bs
[13:14] *** Ravenloft has quit IRC (Ping timeout: 370 seconds)
[14:06] I can't check this right now, apparently it's a video of the shooting.. http://www.liveleak.com/view?i=bc6_1420632668
[14:06] probably nsfw/l, don't click if you don't want to.
[14:13] Kazzy: contains one person shot to death :(
[14:14] sigh :(
[14:15] what's the name of the magazine?
[14:16] charlie hebdo
[14:18] is someone archiving the video?
[14:18] the liveleak video was grabbed through archivebot
[14:20] the video? or just the page?
[14:20] *** APerti has joined #archiveteam-bs
[14:21] i have no idea if it grabbed the video too, if someone has stuff on hand to grab it, please do.
[14:22] Kazzy: youtube-dl'ing it
[14:22] looks like youtube-dl groks liveleak, so that's good
[14:27] *** sankin has joined #archiveteam-bs
[14:28] *** garyrh has quit IRC (Read error: Operation timed out)
[14:57] *** norbert79 has quit IRC (Quit: leaving)
[15:00] chfoo: how feasible would it be for wpull to feed youtube links into youtube-dl or something like that?
[15:06] *** bauruine has joined #archiveteam-bs
[15:08] what is wpull?
[15:09] this is possible: https://github.com/woodenphone/Youtube-dl-runner
[15:09] Ctrl-S: it's a drop-in replacement (with some changes) for wget, written in Python
[15:09] no idea about the wpull side
[15:11] Ctrl-S: https://github.com/chfoo/wpull if you're interested
[15:18] if someone can grab a copy of this, please do soon.. it's live-updating so probably not worth grabbing just yet http://www.bbc.com/news/live/world-europe-30710777
[15:18] HTTrack with a new output dir each run?
[15:19] shell script to run it at a 5-10 min interval?
[15:21] I'm stuck on a chromebook with 10% battery, can't do much from here :p
[15:24] I have a linux box; you write a script to install and run whatever it is to download the stuff, i'll run it
[15:25] I thought that chmod -R 777 * was a good idea
[15:25] chmod -R 777 /
[15:25] so i'm not the guy that should write it
[15:25] anddd run
[15:25] it did help fix my problem
[15:25] maybe
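A sketch of the "new output dir each run, every 5-10 minutes" idea above, using wget (which writes WARCs natively) rather than HTTrack; the URL comes from the log, but the naming scheme and interval are arbitrary choices:

```python
import subprocess
import time
from datetime import datetime

URL = 'http://www.bbc.com/news/live/world-europe-30710777'

while True:
    # New output name per run, so every snapshot of the live page is kept.
    stamp = datetime.utcnow().strftime('%Y%m%d-%H%M%S')
    subprocess.call(['wget', '--page-requisites',
                     '--warc-file=bbc-live-%s' % stamp,
                     '--directory-prefix=bbc-live-%s' % stamp,
                     URL])
    time.sleep(600)  # ten minutes between snapshots
```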
[15:34] *** mistym has joined #archiveteam-bs
[15:35] *** garyrh has joined #archiveteam-bs
[15:37] *** mistym has quit IRC (Remote host closed the connection)
[15:39] *** norbert79 has joined #archiveteam-bs
[15:46] can we grab this? https://www.youtube.com/watch?v=LeIy0zH77MM#t=1624 livestream on YT
[15:46] (dump the timemarker btw)
[15:51] *** aaaaaaaaa has joined #archiveteam-bs
[15:55] *** mistym has joined #archiveteam-bs
[16:16] *** bauruine has quit IRC (Ping timeout: 265 seconds)
[16:19] *** godane has quit IRC (Read error: Operation timed out)
[16:21] *** bauruine has joined #archiveteam-bs
[16:22] *** Start is now known as StartAway
[16:22] *** StartAway is now known as Start
[16:31] *** godane has joined #archiveteam-bs
[16:40] *** dashcloud has quit IRC (Remote host closed the connection)
[16:41] *** dashcloud has joined #archiveteam-bs
[16:54] *** rejon has quit IRC (Ping timeout: 335 seconds)
[16:58] *** mistym has quit IRC (Remote host closed the connection)
[17:09] *** Kassia19 has joined #archiveteam-bs
[17:10] *** Kassia19 has quit IRC (Read error: Connection reset by peer)
[17:14] *** mistym has joined #archiveteam-bs
[17:19] Ctrl-S: fyi archivebot does tumblr archiving ok
[17:22] *** rejon has joined #archiveteam-bs
[17:34] woot, i found a bug on github
[17:34] too dumb to figure out if it is a vulnerability though
[17:36] schbirid: it's Ruby, I think? so yes, probably
[17:36] :P
[17:36] do they have a bounty program?
[17:37] yeah
[17:45] hm, seems to escape one element too many
[17:45] not one too few
[17:59] *** midas1 has joined #archiveteam-bs
[18:12] *** rejon has quit IRC (Read error: Operation timed out)
[18:18] *** Coderjoe_ has joined #archiveteam-bs
[18:21] *** primus104 has quit IRC (hub.se irc.efnet.pl)
[18:21] *** schbirid has quit IRC (hub.se irc.efnet.pl)
[18:21] *** primus has quit IRC (hub.se irc.efnet.pl)
[18:21] *** Coderjoe has quit IRC (hub.se irc.efnet.pl)
[18:22] *** primus_ has joined #archiveteam-bs
[18:27] *** schbirid2 has joined #archiveteam-bs
[19:15] *** rejon has joined #archiveteam-bs
[19:37] *** rejon has quit IRC (Ping timeout: 335 seconds)
[19:55] *** Ravenloft has joined #archiveteam-bs
[20:12] *** mistym has quit IRC (Remote host closed the connection)
[20:36] *** mistym has joined #archiveteam-bs
[20:42] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[21:04] *** mistym has quit IRC (Remote host closed the connection)
[21:07] *** aaaaaaaaa has joined #archiveteam-bs
[21:20] *** mistym has joined #archiveteam-bs
[21:27] *** bsmith093 has quit IRC (Read error: Connection reset by peer)
[21:34] *** abartov has quit IRC (Ping timeout: 258 seconds)
[21:39] *** bsmith093 has joined #archiveteam-bs
[21:43] *** yipdw has quit IRC (Quit: yipdw)
[21:43] *** dashcloud has quit IRC (Read error: Operation timed out)
[21:43] *** yipdw has joined #archiveteam-bs
[21:45] *** schbirid2 has quit IRC (Quit: Leaving)
[21:47] *** dashcloud has joined #archiveteam-bs
[21:49] *** abartov has joined #archiveteam-bs
[21:57] *** sankin has quit IRC (Leaving.)
[22:10] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:11] balrog: if it works using an http proxy, it should be doable
[22:11] chfoo: it would involve detecting a supported URL and feeding it to the program, I think
[22:11] I'm a little worried that archivebot doesn't capture youtube videos themselves
[22:12] oh, it's in python
[22:13] balrog: it could be done, I'd prefer to have a working replay solution first
[22:13] replay?
[22:13] that's why I pointed out that pywb-webrecorder can do it
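The "detect a supported URL and feed it to the program" step could be as simple as the sketch below. The pattern list is illustrative only (youtube-dl's real extractor table is far larger), and nothing here reflects wpull's actual plugin/hook API:

```python
import re
import subprocess

# Illustrative patterns only; youtube-dl supports many more sites.
VIDEO_PAGE = re.compile(
    r'https?://(?:www\.)?(?:youtube\.com/watch|youtu\.be/|liveleak\.com/view)')

def maybe_grab_video(url):
    """Hand anything that looks like a video page off to youtube-dl."""
    if VIDEO_PAGE.search(url):
        subprocess.call(['youtube-dl', '--no-overwrites', url])
```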
[22:14] doesn't archive.org already have some method of grabbing some youtube stuff?
[22:14] maybe, but as far as I can tell it's not documented
[22:14] ah :/
[22:14] anyway, pywb seems to have Deep Magic From Before The Dawn of Time to do this, so I keep thinking it might be interesting to use its proxy + wpull
[22:15] *** dashcloud has joined #archiveteam-bs
[22:15] another problem is making this not cause WARC sizes to blow up any more than they do in the default !a case
[22:22] Deep Magic From Before The Dawn of Time where?
[22:22] https://github.com/ikreymer/pywb/blob/4c08a6a06404388e673ed37a6969023712d91c18/pywb/static/vidrw.js
[22:22] it's doing a bunch of transformation
[22:42] yeah
[22:42] also injecting flowplayer, etc.
[23:04] *** APerti has quit IRC (Read error: Operation timed out)
[23:13] *** APerti has joined #archiveteam-bs
[23:13] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:13] *** dashcloud has joined #archiveteam-bs
[23:18] *** BlueMaxim has joined #archiveteam-bs
[23:22] *** APerti has quit IRC (Read error: Operation timed out)
[23:33] *** abartov has quit IRC (Ping timeout: 258 seconds)
[23:34] *** Ebony27 has joined #archiveteam-bs
[23:35] *** Ebony27 has quit IRC (Read error: Connection reset by peer)
[23:42] http://techcrunch.com/2015/01/07/is-youtube-the-yahoo-of-2015/
[23:58] Even BuzzFeed knows point No. 5, and they are the intellectual toilet of the Internet.
[23:58] ouch
[23:58] *flush*