#archiveteam-bs 2015-01-07,Wed

Time Nickname Message
00:02 🔗 rejon has quit IRC (Read error: Operation timed out)
00:31 🔗 garyrh has quit IRC (Remote host closed the connection)
00:58 🔗 Smiley has quit IRC (Ping timeout: 370 seconds)
00:59 🔗 garyrh has joined #archiveteam-bs
01:02 🔗 mistym has quit IRC (Remote host closed the connection)
01:19 🔗 mistym has joined #archiveteam-bs
01:20 🔗 DFJustin has joined #archiveteam-bs
01:20 🔗 swebb sets mode: +o DFJustin
01:24 🔗 primus104 has quit IRC (Leaving.)
01:34 🔗 egg_ has quit IRC (quit)
01:38 🔗 nico_ is now known as nico_32
01:44 🔗 Smiley has joined #archiveteam-bs
02:21 🔗 APerti has joined #archiveteam-bs
02:47 🔗 APerti_ has joined #archiveteam-bs
02:50 🔗 APerti has quit IRC (Read error: Operation timed out)
03:49 🔗 chfoo http://thedailywh.at/2015/01/distraction-of-the-day-you-can-now-play-oregon-trail-and-other-ms-dos-games-online/
04:33 🔗 aaaaaaaaa has quit IRC (Leaving)
05:21 🔗 S_aus_Eur has joined #archiveteam-bs
05:21 🔗 S_aus_Eur has left
05:29 🔗 godane so some of the npr morning radio episodes are going to be in real media
05:29 🔗 godane these real media files don't derive right at all
05:29 🔗 godane old ones don't have this problem
05:30 🔗 godane i hope someone can at least look at them to see what the problem is with IA deriving them
05:31 🔗 godane these are in real media only: https://archive.org/details/npr-morning-edition-01-02-2003
05:34 🔗 mistym has quit IRC (Remote host closed the connection)
05:37 🔗 mistym has joined #archiveteam-bs
05:59 🔗 DFJustin huh never knew there was a dos oregon trail
07:13 🔗 mistym has quit IRC (Remote host closed the connection)
07:25 🔗 APerti_ has quit IRC (Read error: Operation timed out)
07:44 🔗 Ctrl-S I've started work on a tumblr archiver, here is code so far: https://mega.co.nz/#!bxJFzL4Z!8h1TQHKJT7WvJRkgiZTPkgbO2gDw7a4VbxFSa1Go-k4
08:06 🔗 primus104 has joined #archiveteam-bs
08:38 🔗 joepie91 godane: fwiw, it has derived now
08:38 🔗 joepie91 Ctrl-S: use some sort of git hosting, please :D
08:38 🔗 joepie91 especially since you're already using git...
08:39 🔗 joepie91 (or at the very least tar.gz, zip isn't very good at unix perms)
08:51 🔗 GLaDOS has quit IRC (Ping timeout: 272 seconds)
08:51 🔗 GLaDOS has joined #archiveteam-bs
08:51 🔗 swebb sets mode: +o GLaDOS
09:07 🔗 brayden has quit IRC (Ping timeout: 607 seconds)
09:54 🔗 schbirid has joined #archiveteam-bs
09:55 🔗 brayden has joined #archiveteam-bs
10:01 🔗 kvieta has quit IRC (Read error: Operation timed out)
10:12 🔗 kvieta has joined #archiveteam-bs
11:43 🔗 primus104 has quit IRC (Leaving.)
11:47 🔗 yan has joined #archiveteam-bs
12:03 🔗 Ctrl-S is this good enough for you? https://github.com/woodenphone/tumblr-to-db
12:03 🔗 Ctrl-S still WIP
12:03 🔗 godane looks like 20100919 marshill hd video doesn't work
12:04 🔗 joepie91 Ctrl-S: yes, git is good :D
12:04 🔗 godane so i try to get the tv_sd_progressive version of that video
12:04 🔗 Ctrl-S goal is to save tumblr blogs to a db so i can scrape remotely and retrieve to my metered home connection
12:04 🔗 Ctrl-S HTTrack automation just doesn't cut it
12:05 🔗 Ctrl-S also HTTrack does not remember where it has been
12:05 🔗 Ctrl-S or rather, it does not understand the difference between posts and the listings
12:13 🔗 joepie91 Ctrl-S: shouldn't you be using WARC, though?
12:14 🔗 Ctrl-S Filesize must be minimised
12:14 🔗 Ctrl-S Purpose is to save the blogs, not to shove into IA
12:14 🔗 Ctrl-S Most important things are the posts and the media
12:15 🔗 joepie91 WARC overhead is negligible, really
12:15 🔗 joepie91 WARC isn't just for IA either :)
12:16 🔗 Ctrl-S basically the problem this software is supposed to address is: Tumblr makes it really easy to get a blog deleted
12:16 🔗 joepie91 Ctrl-S: thing is, if you're making HTTP requests anyway, you might as well dump them into a WARC?
12:16 🔗 joepie91 mm
12:16 🔗 Ctrl-S I have a metered home connection
12:17 🔗 joepie91 ok?
12:17 🔗 Ctrl-S so unless WARC can handle lots of compression, it's not going to be suitable
12:17 🔗 joepie91 wha
12:17 🔗 Ctrl-S if it can, i can switch over
12:17 🔗 joepie91 Ctrl-S: WARC is a storage format
12:17 🔗 Ctrl-S I know
12:17 🔗 joepie91 it has nothing to do with your connection
12:17 🔗 joepie91 at all
12:18 🔗 joepie91 it stores data that your client has *anyway*
12:18 🔗 Ctrl-S I plan to run it in another country where data is cheaper
12:18 🔗 Ctrl-S then pull once it's finished
12:18 🔗 joepie91 Ctrl-S: what does 'data cap' have to do with WARC? you keep referring to it, but I don't see where it comes into the picture
12:18 🔗 Ctrl-S to get the data from a remote machine that runs the script to my machine
12:18 🔗 joepie91 ...?
12:19 🔗 joepie91 I still don't get it...
12:19 🔗 Ctrl-S me with metered home connection <-> friend with big unmetered pipe <-> internet
12:19 🔗 joepie91 yes?
12:20 🔗 joepie91 again, what does this have to do with WARC?
12:20 🔗 Ctrl-S If i extract data into a DB there is less data to move
12:20 🔗 Ctrl-S half the HTML will be removed
12:20 🔗 Ctrl-S or more
12:20 🔗 joepie91 move from where to where?
12:21 🔗 Ctrl-S The data follows this path: Tumblr -> Scraper machine -> my machine
12:22 🔗 Ctrl-S that second link is the bottleneck
12:22 🔗 joepie91 why would you need to move the WARC to your machine?
12:22 🔗 Ctrl-S Can't trust remote storage
12:22 🔗 joepie91 ....?
12:22 🔗 Ctrl-S Much better to have a HDD i can hold myself
12:23 🔗 Ctrl-S unless WARC is more than HTML with metadata
12:23 🔗 joepie91 Ctrl-S: I don't really understand where you're seeing a problem
12:23 🔗 joepie91 you are *already* extracting the content and storing it locally
12:23 🔗 joepie91 storing the WARC elsewhere doesn't make you lose anything
12:23 🔗 joepie91 at best it will make you have a WARC in a remote location
12:23 🔗 Ctrl-S I'm sorry, I don't understand
12:23 🔗 snuffy extracting the butane from it all into a world class mma fighter how is that bullshit
12:23 🔗 joepie91 at worst the WARC will be lost and you'll still have the same data as when you're not making a WARC
12:23 🔗 Ctrl-S I can make it dump to warc
12:24 🔗 Ctrl-S That's probably easier than using a db
12:24 🔗 joepie91 Ctrl-S: I'm not saying to replace one with the other
12:24 🔗 Ctrl-S it's just that I want as small a file size as possible after the download has finished
12:24 🔗 joepie91 I'm saying that you can *also* dump to WARC
12:24 🔗 snuffy to replace the police
12:24 🔗 Ctrl-S I intend to try for both if i add warc stuff
12:24 🔗 joepie91 can somebody kick that markov bot please
12:25 🔗 Ctrl-S markov bot?
12:25 🔗 joepie91 balrog: closure: DFJustin: ersi: Famicoman: Kenshin: SadDM: SketchCow: swebb: underscor: yipdw: sorry for the mass highlight, but we have a markov bot misbehaving (snuffy)
12:25 🔗 joepie91 see above
12:25 🔗 joepie91 I don't have +o
12:26 🔗 Ctrl-S I'll look at libraries for WARC now
12:26 🔗 joepie91 Ctrl-S: pseudo-AI bot, absorbs what people say then starts randomly outputting vaguely related-seeming sentences
12:26 🔗 joepie91 can be amusing, but not in discussions...
12:26 🔗 Ctrl-S you could tell that from one message?
12:26 🔗 joepie91 yes
12:26 🔗 joepie91 they have fairly predictable patterns
12:26 🔗 joepie91 look carefully
12:26 🔗 joepie91 [13:23] <joepie91> you are *already* extracting the content and storing it locally
12:26 🔗 joepie91 [13:23] <snuffy> extracting the butane from it all into a world class mma fighter how is that bullshit
12:26 🔗 Ctrl-S oh
12:26 🔗 Ctrl-S yeah
12:26 🔗 Ctrl-S i see
12:26 🔗 joepie91 nonsensical sentence, valid grammar, copying an unusual word
12:27 🔗 joepie91 very typical markov bot pattern :P
12:27 🔗 joepie91 it uses word associations, basically
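The word-association behaviour joepie91 describes can be sketched roughly as below. This is a minimal illustration, not snuffy's actual code: the function names `build_chain` and `babble` are hypothetical, and a real bot would likely weight or filter its vocabulary.

```python
import random
from collections import defaultdict

def build_chain(lines):
    """Map each word to the list of words seen directly after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def babble(chain, seed, max_words=12, rng=random):
    """Start from a seed word and keep picking a random known follower."""
    word, out = seed, [seed]
    # Stop when the length limit is hit or the current word has no follower.
    while len(out) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        out.append(word)
    return " ".join(out)
```

Seeding `babble` with an unusual word from the last message (here "extracting") is what gives the output its "vaguely related" feel.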
12:27 🔗 joepie91 anyway
12:27 🔗 joepie91 back to the topic
12:27 🔗 Ctrl-S so to make a WARC, what data do I need?
12:27 🔗 joepie91 Ctrl-S: extracting into a DB is fine for personal copies, but it's probably a good idea to just remotely store a copy of the WARC.. there's a python lib for it afaik
12:27 🔗 Ctrl-S ATM I use mechanize for web requests
12:27 🔗 snuffy my friend requests
12:28 🔗 joepie91 the request headers and body (usually just headers), and the response headers and body
12:28 🔗 joepie91 that's it really
12:28 🔗 joepie91 warc lib should tell you the specific data needed
12:28 🔗 joepie91 hopefully
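As a rough illustration of "the request headers and body, and the response headers and body", one WARC/1.0 record can be serialized with the standard library alone. This is a sketch under assumptions: `warc_record` is a hypothetical helper, the header fields are the common WARC 1.0 ones as I understand them, and a proper WARC library would also write a `warcinfo` record, digests, and links between the request and response records.

```python
import uuid
from datetime import datetime, timezone

def warc_record(record_type, target_uri, http_bytes):
    """Serialize one WARC/1.0 record wrapping raw HTTP bytes (headers + body).

    record_type is "request" or "response"; http_bytes is the HTTP message
    exactly as sent or received.
    """
    if record_type not in ("request", "response"):
        raise ValueError("record_type must be 'request' or 'response'")
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: application/http; msgtype={record_type}",
        f"Content-Length: {len(http_bytes)}",
    ]
    # Header block, blank line, payload, then two CRLFs end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_bytes + b"\r\n\r\n"
```

Appending the request record and then the response record for each fetch to one file gives an uncompressed WARC. Records are commonly gzipped one by one, which bears on the file-size concern raised earlier.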
12:30 🔗 Ctrl-S is there an easy way to tell HTTrack to output to WARC?
12:31 🔗 joepie91 httrack doesn't understand warc, as far as I am aware
12:31 🔗 joepie91 that is why I recommend wget to people :P
12:31 🔗 Ctrl-S windows
12:31 🔗 joepie91 Ctrl-S: wget for windows is a thing
12:32 🔗 Ctrl-S I think I had problems with the filename handling?
12:32 🔗 joepie91 http://gnuwin32.sourceforge.net/packages/wget.htm
12:32 🔗 joepie91 no idea
12:33 🔗 Ctrl-S know of anything that uses both WARC and mechanize in python?
12:33 🔗 Ctrl-S example code makes everything easier
12:35 🔗 rejon has joined #archiveteam-bs
12:36 🔗 joepie91 Ctrl-S: no clue
12:36 🔗 Ctrl-S I would honestly rather get this working than search for information on linking the warc stuff to mechanize, but once it's done i'll consider doing it
12:37 🔗 Ctrl-S everything goes through a single get() function for web requests, so i suppose i could slip something into that afterwards
12:38 🔗 Ctrl-S something that works now, perfection later
12:38 🔗 joepie91 mhmm
12:39 🔗 SketchCow snuffy: Destination Drigible
12:40 🔗 SketchCow snuffy: Last broken maid harvey clam, bring destination forgotten grass-fed.
12:40 🔗 SketchCow sets mode: +b *!*bkr@*.mindhackers.org
12:40 🔗 snuffy was kicked by SketchCow (snuffy)
12:40 🔗 Ctrl-S WARC doesn't need context, just URL, metadata for both directions, and the response, right?
12:41 🔗 Ctrl-S if that is true, I can just change one function afterwards to set it up
12:43 🔗 joepie91 Ctrl-S: also request body, but if you're only doing GET requests that doesn't really matter
12:43 🔗 joepie91 SketchCow: hehe, poisoning its word association cache? :P
12:43 🔗 joepie91 also, thanks
12:44 🔗 Ctrl-S the function is named get(), it takes a URL and returns the page/file
12:44 🔗 Ctrl-S it hides the cookies etc. from the rest of the code
12:44 🔗 joepie91 yes, you'll need to capture the request headers also
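The single `get()` choke point discussed here could capture both directions of each exchange along these lines. This is a sketch, not Ctrl-S's code: it uses the standard library's `urllib.request` in place of mechanize, and the `captured` list, the `opener` parameter, and the `example-archiver` User-Agent are assumptions made for illustration. Note that it records parsed headers, whereas a faithful WARC needs the raw bytes off the wire.

```python
import urllib.request

def get(url, captured, opener=urllib.request.urlopen):
    """Fetch url, remember request/response metadata, and return the body."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-archiver"}
    )
    with opener(request) as response:
        body = response.read()
        captured.append({
            "url": url,
            # Headers as stored on the Request object after the fetch.
            "request_headers": dict(request.header_items()),
            "status": response.status,
            "response_headers": list(response.getheaders()),
            "body": body,
        })
    return body
```

Callers keep passing a URL and getting the page back, so the rest of the scraper does not change; whatever writes the archive only has to drain `captured`.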
12:45 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:45 🔗 Ctrl-S sounds doable, once i learn how to work with the libs.
12:46 🔗 Ctrl-S eurgh, I have to get the date of the post from the archive page, rather than the post itself
12:47 🔗 Ctrl-S I was hoping to pass a single numerical string
12:51 🔗 joepie91 is anybody grabbing the coverage from Paris?
12:51 🔗 Ctrl-S what coverage?
12:52 🔗 joepie91 Ctrl-S: http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo
12:52 🔗 joepie91 .t
12:52 🔗 botpie91 Wed, 07 Jan 2015 12:52:09 GMT
12:52 🔗 joepie91 ...
12:52 🔗 joepie91 .title http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo
12:52 🔗 botpie91 joepie91: Charlie Hebdo shooting: twelve dead at Paris offices of satirical magazine – live updates | World news | The Guardian
12:53 🔗 Ctrl-S do we have archives of this satirical newspaper?
12:53 🔗 joepie91 I don't know, but we should
12:53 🔗 ersi sets mode: +o joepie91
12:53 🔗 joepie91 ivan`: ?
12:53 🔗 joepie91 what's the status on that?
12:53 🔗 joepie91 ersi: thanks
12:54 🔗 joepie91 uh oh
12:54 🔗 joepie91 Ctrl-S: https://t.co/bHl4vKTZUg
12:54 🔗 joepie91 does this load for you
12:54 🔗 Ctrl-S slowly
12:54 🔗 Ctrl-S blank page so far
12:54 🔗 Ctrl-S connected...
12:55 🔗 Ctrl-S i'm in wa.au, btw
12:55 🔗 Ctrl-S perth
12:55 🔗 Ctrl-S might want to ask someone in france
12:55 🔗 Ctrl-S 504
12:56 🔗 joepie91 :/
12:56 🔗 joepie91 yeah, it's down I think...
13:03 🔗 midas joepie91: works here
13:03 🔗 midas via ovh proxy
13:04 🔗 raylee works here, .uk
13:07 🔗 joepie91 yeah, works here now as well, but slow
13:09 🔗 midas yep
13:11 🔗 primus104 has joined #archiveteam-bs
13:14 🔗 Ravenloft has quit IRC (Ping timeout: 370 seconds)
14:06 🔗 Kazzy I can't check this right now, apparently it's a video of the shooting.. http://www.liveleak.com/view?i=bc6_1420632668
14:06 🔗 Kazzy probably nsfw/l, don't click if you don't want to.
14:13 🔗 joepie91 Kazzy: contains one person shot to death :(
14:14 🔗 Kazzy sigh :(
14:15 🔗 godane whats the name of the magazine?
14:16 🔗 Kazzy charlie hebdo
14:18 🔗 Ctrl-S is someone archiving the video?
14:18 🔗 Kazzy liveleak video was grabbed through archivebot
14:20 🔗 joepie91 the video? or just the page?
14:20 🔗 APerti has joined #archiveteam-bs
14:21 🔗 Kazzy i have no idea if it grabbed the video too, if someone has stuff on hand to grab it, please do.
14:22 🔗 joepie91 Kazzy: youtube-dl'ing it
14:22 🔗 joepie91 looks like youtube-dl groks liveleak, so that's good
14:27 🔗 sankin has joined #archiveteam-bs
14:28 🔗 garyrh has quit IRC (Read error: Operation timed out)
14:57 🔗 norbert79 has quit IRC (Quit: leaving)
15:00 🔗 balrog chfoo: how feasible would it be for wpull to feed youtube links into youtube-dl or something like that?
15:06 🔗 bauruine has joined #archiveteam-bs
15:08 🔗 Ctrl-S what is wpull?
15:09 🔗 Ctrl-S this is possible: https://github.com/woodenphone/Youtube-dl-runner
15:09 🔗 joepie91 Ctrl-S: it's a drop-in replacement (with some changes) for wget written in Python
15:09 🔗 Ctrl-S no idea about the wpull side
15:11 🔗 Kazzy Ctrl-S: https://github.com/chfoo/wpull if you're interested
15:18 🔗 Kazzy if someone can grab a copy of this, please do soon.. it's liveupdating so probably not worth grabbing just yet http://www.bbc.com/news/live/world-europe-30710777
15:18 🔗 Ctrl-S Httrack with new output dir each run?
15:19 🔗 Ctrl-S shell script run it at 5-10 min interval?
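The "new output dir each run, every 5-10 minutes" idea might look roughly like this. This is a sketch only: `snapshot_command` and `run_snapshots` are hypothetical names, the timestamped directory layout is an assumption, and it presumes `httrack` is installed and takes its output path via `-O`.

```python
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

def snapshot_command(url, base_dir, now=None):
    """Build an httrack command that mirrors url into a fresh timestamped dir."""
    now = now or datetime.now(timezone.utc)
    out_dir = Path(base_dir) / now.strftime("%Y%m%dT%H%M%SZ")
    return ["httrack", url, "-O", str(out_dir)]

def run_snapshots(url, base_dir, interval_seconds=300, runs=3):
    """Take `runs` snapshots, sleeping interval_seconds between them."""
    for i in range(runs):
        # check=False: one failed grab should not stop later snapshots.
        subprocess.run(snapshot_command(url, base_dir), check=False)
        if i < runs - 1:
            time.sleep(interval_seconds)
```

Because each run gets its own directory, later snapshots of a live-updating page never overwrite earlier ones.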
15:21 🔗 Kazzy I'm stuck on a chromebook with 10% battery, can't do much from here :p
15:24 🔗 Ctrl-S I have a linux box, you write a script to install and run the whatever it is to download the stuff, i'll run it
15:25 🔗 Ctrl-S I thought that chmod -R 777 * was a good idea
15:25 🔗 midas chmod -R 777 /
15:25 🔗 Ctrl-S so i'm not the guy that should write it
15:25 🔗 midas anddd run
15:25 🔗 Ctrl-S it did help fix my problem
15:25 🔗 Ctrl-S maybe
15:34 🔗 mistym has joined #archiveteam-bs
15:35 🔗 garyrh has joined #archiveteam-bs
15:37 🔗 mistym has quit IRC (Remote host closed the connection)
15:39 🔗 norbert79 has joined #archiveteam-bs
15:46 🔗 midas can we grab this? https://www.youtube.com/watch?v=LeIy0zH77MM#t=1624 livestream on YT
15:46 🔗 midas (dump the timemarker btw)
15:51 🔗 aaaaaaaaa has joined #archiveteam-bs
15:55 🔗 mistym has joined #archiveteam-bs
16:16 🔗 bauruine has quit IRC (Ping timeout: 265 seconds)
16:19 🔗 godane has quit IRC (Read error: Operation timed out)
16:21 🔗 bauruine has joined #archiveteam-bs
16:22 🔗 Start is now known as StartAway
16:22 🔗 StartAway is now known as Start
16:31 🔗 godane has joined #archiveteam-bs
16:40 🔗 dashcloud has quit IRC (Remote host closed the connection)
16:41 🔗 dashcloud has joined #archiveteam-bs
16:54 🔗 rejon has quit IRC (Ping timeout: 335 seconds)
16:58 🔗 mistym has quit IRC (Remote host closed the connection)
17:09 🔗 Kassia19 has joined #archiveteam-bs
17:10 🔗 Kassia19 has quit IRC (Read error: Connection reset by peer)
17:14 🔗 mistym has joined #archiveteam-bs
17:19 🔗 yipdw Ctrl-S: fyi archivebot does tumblr archiving ok
17:22 🔗 rejon has joined #archiveteam-bs
17:34 🔗 schbirid woot, i found a bug on github
17:34 🔗 schbirid too dumb to figure out if it is a vulnerability though
17:36 🔗 joepie91 schbirid: it's Ruby, I think? so yes, probably
17:36 🔗 joepie91 :P
17:36 🔗 aaaaaaaaa do they have a bounty program?
17:37 🔗 schbirid yeah
17:45 🔗 schbirid hm, seems just to escape one element too many
17:45 🔗 schbirid not one too few
17:59 🔗 midas1 has joined #archiveteam-bs
18:12 🔗 rejon has quit IRC (Read error: Operation timed out)
18:18 🔗 Coderjoe_ has joined #archiveteam-bs
18:21 🔗 primus104 has quit IRC (hub.se irc.efnet.pl)
18:21 🔗 schbirid has quit IRC (hub.se irc.efnet.pl)
18:21 🔗 primus has quit IRC (hub.se irc.efnet.pl)
18:21 🔗 Coderjoe has quit IRC (hub.se irc.efnet.pl)
18:22 🔗 primus_ has joined #archiveteam-bs
18:27 🔗 schbirid2 has joined #archiveteam-bs
19:15 🔗 rejon has joined #archiveteam-bs
19:37 🔗 rejon has quit IRC (Ping timeout: 335 seconds)
19:55 🔗 Ravenloft has joined #archiveteam-bs
20:12 🔗 mistym has quit IRC (Remote host closed the connection)
20:36 🔗 mistym has joined #archiveteam-bs
20:42 🔗 aaaaaaaaa has quit IRC (Read error: Operation timed out)
21:04 🔗 mistym has quit IRC (Remote host closed the connection)
21:07 🔗 aaaaaaaaa has joined #archiveteam-bs
21:20 🔗 mistym has joined #archiveteam-bs
21:27 🔗 bsmith093 has quit IRC (Read error: Connection reset by peer)
21:34 🔗 abartov has quit IRC (Ping timeout: 258 seconds)
21:39 🔗 bsmith093 has joined #archiveteam-bs
21:43 🔗 yipdw has quit IRC (Quit: yipdw)
21:43 🔗 dashcloud has quit IRC (Read error: Operation timed out)
21:43 🔗 yipdw has joined #archiveteam-bs
21:45 🔗 schbirid2 has quit IRC (Quit: Leaving)
21:47 🔗 dashcloud has joined #archiveteam-bs
21:49 🔗 abartov has joined #archiveteam-bs
21:57 🔗 sankin has quit IRC (Leaving.)
22:10 🔗 dashcloud has quit IRC (Read error: Operation timed out)
22:11 🔗 chfoo balrog: if it works using a http proxy, it should be doable
22:11 🔗 balrog chfoo: it would involve detecting a supported URL and feeding it to the program I think
22:11 🔗 balrog I'm a little worried that archivebot doesn't capture youtube videos themselves
22:12 🔗 balrog oh, it's in python
22:13 🔗 yipdw balrog: it could be done, I'd prefer to have a working replay solution first
22:13 🔗 balrog replay?
22:13 🔗 yipdw that's why I pointed out that pywb-webrecorder can do it
22:14 🔗 balrog doesn't archive.org already have some method of grabbing some youtube stuff?
22:14 🔗 yipdw maybe, but as far as I can tell it's not documented
22:14 🔗 balrog ah :/
22:14 🔗 yipdw anyway, pywb seems to have Deep Magic From Before The Dawn of Time to do this, so I keep thinking it might be interesting to use its proxy + wpull
22:15 🔗 dashcloud has joined #archiveteam-bs
22:15 🔗 yipdw another problem is making this not cause WARC size to blow up any more than they do in the default !a case
22:22 🔗 balrog Deep Magic From Before The Dawn of Time where?
22:22 🔗 balrog https://github.com/ikreymer/pywb/blob/4c08a6a06404388e673ed37a6969023712d91c18/pywb/static/vidrw.js
22:22 🔗 balrog it's doing a bunch of transformation
22:42 🔗 yipdw yeah
22:42 🔗 yipdw also injecting flowplayer, etc.
23:04 🔗 APerti has quit IRC (Read error: Operation timed out)
23:13 🔗 APerti has joined #archiveteam-bs
23:13 🔗 dashcloud has quit IRC (Read error: Operation timed out)
23:13 🔗 dashcloud has joined #archiveteam-bs
23:18 🔗 BlueMaxim has joined #archiveteam-bs
23:22 🔗 APerti has quit IRC (Read error: Operation timed out)
23:33 🔗 abartov has quit IRC (Ping timeout: 258 seconds)
23:34 🔗 Ebony27 has joined #archiveteam-bs
23:35 🔗 Ebony27 has quit IRC (Read error: Connection reset by peer)
23:42 🔗 Start http://techcrunch.com/2015/01/07/is-youtube-the-yahoo-of-2015/
23:58 🔗 joepie91 Even BuzzFeed knows point No. 5, and they are the intellectual toilet of the Internet.
23:58 🔗 joepie91 ouch
23:58 🔗 BlueMaxim *flush*
