Time |
Nickname |
Message |
00:02
🔗
|
|
rejon has quit IRC (Read error: Operation timed out) |
00:31
🔗
|
|
garyrh has quit IRC (Remote host closed the connection) |
00:58
🔗
|
|
Smiley has quit IRC (Ping timeout: 370 seconds) |
00:59
🔗
|
|
garyrh has joined #archiveteam-bs |
01:02
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
01:19
🔗
|
|
mistym has joined #archiveteam-bs |
01:20
🔗
|
|
DFJustin has joined #archiveteam-bs |
01:20
🔗
|
|
swebb sets mode: +o DFJustin |
01:24
🔗
|
|
primus104 has quit IRC (Leaving.) |
01:34
🔗
|
|
egg_ has quit IRC (quit) |
01:38
🔗
|
|
nico_ is now known as nico_32 |
01:44
🔗
|
|
Smiley has joined #archiveteam-bs |
02:21
🔗
|
|
APerti has joined #archiveteam-bs |
02:47
🔗
|
|
APerti_ has joined #archiveteam-bs |
02:50
🔗
|
|
APerti has quit IRC (Read error: Operation timed out) |
03:49
🔗
|
chfoo |
http://thedailywh.at/2015/01/distraction-of-the-day-you-can-now-play-oregon-trail-and-other-ms-dos-games-online/ |
04:33
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
05:21
🔗
|
|
S_aus_Eur has joined #archiveteam-bs |
05:21
🔗
|
|
S_aus_Eur has left |
05:29
🔗
|
godane |
so some of the npr morning radio episodes are going to be in real media |
05:29
🔗
|
godane |
these real media files don't derive right at all |
05:29
🔗
|
godane |
old ones don't have this problem |
05:30
🔗
|
godane |
i hope someone can at least look at them to see what is the problem with IA deriving them |
05:31
🔗
|
godane |
these are in real media only: https://archive.org/details/npr-morning-edition-01-02-2003 |
05:34
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
05:37
🔗
|
|
mistym has joined #archiveteam-bs |
05:59
🔗
|
DFJustin |
huh never knew there was a dos oregon trail |
07:13
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
07:25
🔗
|
|
APerti_ has quit IRC (Read error: Operation timed out) |
07:44
🔗
|
Ctrl-S |
I've started work on a tumblr archiver, here is code so far: https://mega.co.nz/#!bxJFzL4Z!8h1TQHKJT7WvJRkgiZTPkgbO2gDw7a4VbxFSa1Go-k4 |
08:06
🔗
|
|
primus104 has joined #archiveteam-bs |
08:38
🔗
|
joepie91 |
godane: fwiw, it has derived now |
08:38
🔗
|
joepie91 |
Ctrl-S: use some sort of git hosting, please :D |
08:38
🔗
|
joepie91 |
especially since you're already using git... |
08:39
🔗
|
joepie91 |
(or at the very least tar.gz, zip isn't very good at unix perms) |
08:51
🔗
|
|
GLaDOS has quit IRC (Ping timeout: 272 seconds) |
08:51
🔗
|
|
GLaDOS has joined #archiveteam-bs |
08:51
🔗
|
|
swebb sets mode: +o GLaDOS |
09:07
🔗
|
|
brayden has quit IRC (Ping timeout: 607 seconds) |
09:54
🔗
|
|
schbirid has joined #archiveteam-bs |
09:55
🔗
|
|
brayden has joined #archiveteam-bs |
10:01
🔗
|
|
kvieta has quit IRC (Read error: Operation timed out) |
10:12
🔗
|
|
kvieta has joined #archiveteam-bs |
11:43
🔗
|
|
primus104 has quit IRC (Leaving.) |
11:47
🔗
|
|
yan has joined #archiveteam-bs |
12:03
🔗
|
Ctrl-S |
is this good enough for you? https://github.com/woodenphone/tumblr-to-db |
12:03
🔗
|
Ctrl-S |
still WIP |
12:03
🔗
|
godane |
looks like 20100919 marshill hd video doesn't work |
12:04
🔗
|
joepie91 |
Ctrl-S: yes, git is good :D |
12:04
🔗
|
godane |
so i try to get the tv_sd_progressive version of that video |
12:04
🔗
|
Ctrl-S |
goal is to save tumblr blogs to a db so i can scrape remotely and retrieve to my metered home connection |
12:04
🔗
|
Ctrl-S |
HTTrack automation just doesn't cut it |
12:05
🔗
|
Ctrl-S |
also HTTrack does not remember where it has been |
12:05
🔗
|
Ctrl-S |
or rather, it does not understand the difference between posts and the listings |
12:13
🔗
|
joepie91 |
Ctrl-S: shouldn't you be using WARC, though? |
12:14
🔗
|
Ctrl-S |
Filesize must be minimised |
12:14
🔗
|
Ctrl-S |
Purpose is to save the blogs, not to shove into IA |
12:14
🔗
|
Ctrl-S |
Most important things are the posts and the media |
12:15
🔗
|
joepie91 |
WARC overhead is negligible, really |
12:15
🔗
|
joepie91 |
WARC isn't just for IA either :) |
12:16
🔗
|
Ctrl-S |
basically the problem this software is supposed to address is: Tumblr makes it really easy to get a blog deleted |
12:16
🔗
|
joepie91 |
Ctrl-S: thing is, if you're making HTTP requests anyway, you might as well dump them into a WARC? |
12:16
🔗
|
joepie91 |
mm |
12:16
🔗
|
Ctrl-S |
I have a metered home connection |
12:17
🔗
|
joepie91 |
ok? |
12:17
🔗
|
Ctrl-S |
so unless WARC can handle lots of compression, it's not going to be suitable |
12:17
🔗
|
joepie91 |
wha |
12:17
🔗
|
Ctrl-S |
if it can, i can switch over |
12:17
🔗
|
joepie91 |
Ctrl-S: WARC is a storage format |
12:17
🔗
|
Ctrl-S |
I know |
12:17
🔗
|
joepie91 |
it has nothing to do with your connection |
12:17
🔗
|
joepie91 |
at all |
12:18
🔗
|
joepie91 |
it stores data that your client has *anyway* |
12:18
🔗
|
Ctrl-S |
I plan to run it in another country where data is cheaper |
12:18
🔗
|
Ctrl-S |
then pull once it's finished |
12:18
🔗
|
joepie91 |
Ctrl-S: what does 'data cap' have to do with WARC? you keep referring to it, but I don't see where it comes into the picture |
12:18
🔗
|
Ctrl-S |
to get the data from a remote machine that runs the script to my machine |
12:18
🔗
|
joepie91 |
...? |
12:19
🔗
|
joepie91 |
I still don't get it... |
12:19
🔗
|
Ctrl-S |
me with metered home connection <-> friend with big unmetered pipe <-> internet |
12:19
🔗
|
joepie91 |
yes? |
12:20
🔗
|
joepie91 |
again, what does this have to do with WARC? |
12:20
🔗
|
Ctrl-S |
If i extract data into a DB there is less data to move |
12:20
🔗
|
Ctrl-S |
half the HTML will be removed |
12:20
🔗
|
Ctrl-S |
or more |
12:20
🔗
|
joepie91 |
move from where to where? |
12:21
🔗
|
Ctrl-S |
The data follows this path: Tumblr -> Scraper machine -> my machine |
12:22
🔗
|
Ctrl-S |
that second link is the bottleneck |
12:22
🔗
|
joepie91 |
why would you need to move the WARC to your machine? |
12:22
🔗
|
Ctrl-S |
Can't trust remote storage |
12:22
🔗
|
joepie91 |
....? |
12:22
🔗
|
Ctrl-S |
Much better to have a HDD i can hold myself |
12:23
🔗
|
Ctrl-S |
unless WARC is more than HTML with metadata |
12:23
🔗
|
joepie91 |
Ctrl-S: I don't really understand where you're seeing a problem |
12:23
🔗
|
joepie91 |
you are *already* extracting the content and storing it locally |
12:23
🔗
|
joepie91 |
storing the WARC elsewhere doesn't make you lose anything |
12:23
🔗
|
joepie91 |
at best it will make you have a WARC in a remote location |
12:23
🔗
|
Ctrl-S |
I'm sorry, I don't understand |
12:23
🔗
|
snuffy |
extracting the butane from it all into a world class mma fighter how is that bullshit |
12:23
🔗
|
joepie91 |
at worst the WARC will be lost and you'll still have the same data as when you're not making a WARC |
12:23
🔗
|
Ctrl-S |
I can make it dump to warc |
12:24
🔗
|
Ctrl-S |
That's probably easier than using a db |
12:24
🔗
|
joepie91 |
Ctrl-S: I'm not saying to replace one with the other |
12:24
🔗
|
Ctrl-S |
it's just that I want as small a file size as possible after the download has finished |
12:24
🔗
|
joepie91 |
I'm saying that you can *also* dump to WARC |
12:24
🔗
|
snuffy |
to replace the police |
12:24
🔗
|
Ctrl-S |
I intend to try for both if i add warc stuff |
12:24
🔗
|
joepie91 |
can somebody kick that markov bot please |
12:25
🔗
|
Ctrl-S |
markov bot? |
12:25
🔗
|
joepie91 |
balrog: closure: DFJustin: ersi: Famicoman: Kenshin: SadDM: SketchCow: swebb: underscor: yipdw: sorry for the mass highlight, but we have a markov bot misbehaving (snuffy) |
12:25
🔗
|
joepie91 |
see above |
12:25
🔗
|
joepie91 |
I don't have +o |
12:26
🔗
|
Ctrl-S |
I'll look at libraries for WARC now |
12:26
🔗
|
joepie91 |
Ctrl-S: pseudo-AI bot, absorbs what people say then starts randomly outputting vaguely related-seeming sentences |
12:26
🔗
|
joepie91 |
can be amusing, but not in discussions... |
12:26
🔗
|
Ctrl-S |
you could tell that from one message? |
12:26
🔗
|
joepie91 |
yes |
12:26
🔗
|
joepie91 |
they have fairly predictable patterns |
12:26
🔗
|
joepie91 |
look carefully |
12:26
🔗
|
joepie91 |
[13:23] <joepie91> you are *already* extracting the content and storing it locally |
12:26
🔗
|
joepie91 |
[13:23] <snuffy> extracting the butane from it all into a world class mma fighter how is that bullshit |
12:26
🔗
|
Ctrl-S |
oh |
12:26
🔗
|
Ctrl-S |
yeah |
12:26
🔗
|
Ctrl-S |
i see |
12:26
🔗
|
joepie91 |
nonsensical sentence, valid grammar, copying an unusual word |
12:27
🔗
|
joepie91 |
very typical markov bot pattern :P |
12:27
🔗
|
joepie91 |
it uses word associations, basically |
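Editor's sketch of the "word associations" idea described above: a first-order Markov chain that records which word followed which, then walks those links at random. This is an illustration only and not the code behind snuffy; `build_chain`, `babble`, and the second and third training lines are made up for the example.

```python
import random
from collections import defaultdict


def build_chain(lines):
    """Map each word to the list of words seen directly after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def babble(chain, seed_word, max_words=12, rng=None):
    """Walk the chain from seed_word, picking a random follower each step."""
    rng = rng or random.Random()
    output = [seed_word]
    # Stop at the length limit or when the last word has no known follower.
    while len(output) < max_words and chain.get(output[-1]):
        output.append(rng.choice(chain[output[-1]]))
    return " ".join(output)


# Hypothetical "overheard" channel lines; only the first is from the log.
heard = [
    "you are already extracting the content and storing it locally",
    "extracting the butane from the lighter",
    "the content is a world class fighter",
]
chain = build_chain(heard)
print(babble(chain, "extracting", rng=random.Random(0)))
```

Seeding the walk with an unusual word from the last message ("extracting") is what produces the pattern joepie91 points out: grammatical-looking output that reuses one distinctive word and is otherwise unrelated.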
12:27
🔗
|
joepie91 |
anyway |
12:27
🔗
|
joepie91 |
back to the topic |
12:27
🔗
|
Ctrl-S |
so to make a WARC, what data do I need? |
12:27
🔗
|
joepie91 |
Ctrl-S: extracting into a DB is fine for personal copies, but it's probably a good idea to just remotely store a copy of the WARC.. there's a python lib for it afaik |
12:27
🔗
|
Ctrl-S |
ATM I use mechanize for web requests |
12:27
🔗
|
snuffy |
my friend requests |
12:28
🔗
|
joepie91 |
the request headers and body (usually just headers), and the response headers and body |
12:28
🔗
|
joepie91 |
that's it really |
12:28
🔗
|
joepie91 |
warc lib should tell you the specific data needed |
12:28
🔗
|
joepie91 |
hopefully |
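Editor's sketch of what the answer above amounts to on disk, assuming the WARC/1.0 record layout (a block of named headers, a blank line, the captured bytes, then two CRLFs) and the common practice of gzipping each record as its own member, which is also where the compression Ctrl-S asks about would come from. The helper names `warc_record` and `append_exchange` are hypothetical. A maintained WARC library would additionally write a `warcinfo` record and link each request to its response, which this sketch skips.

```python
import gzip
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, block, content_type):
    """Serialise one WARC/1.0 record: headers, blank line, block, two CRLFs."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(block)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + block + b"\r\n\r\n"


def append_exchange(path, url, raw_request, raw_response):
    """Append a request/response pair, each compressed as its own gzip member."""
    with open(path, "ab") as out:
        out.write(gzip.compress(warc_record(
            "request", url, raw_request, "application/http; msgtype=request")))
        out.write(gzip.compress(warc_record(
            "response", url, raw_response, "application/http; msgtype=response")))
```

Here `raw_request` and `raw_response` are the HTTP messages as bytes (request line or status line, headers, blank line, body), which matches the "request headers and body, response headers and body" list given above.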
12:30
🔗
|
Ctrl-S |
is there an easy way to tell HTTrack to output to WARC? |
12:31
🔗
|
joepie91 |
httrack doesn't understand warc, as far as I am aware |
12:31
🔗
|
joepie91 |
that is why I recommend wget to people :P |
12:31
🔗
|
Ctrl-S |
windows |
12:31
🔗
|
joepie91 |
Ctrl-S: wget for windows is a thing |
12:32
🔗
|
Ctrl-S |
I think I had problems with the filename handling? |
12:32
🔗
|
joepie91 |
http://gnuwin32.sourceforge.net/packages/wget.htm |
12:32
🔗
|
joepie91 |
no idea |
12:33
🔗
|
Ctrl-S |
know of anything that uses both WARC and mechanize in python? |
12:33
🔗
|
Ctrl-S |
example code makes everything easier |
12:35
🔗
|
|
rejon has joined #archiveteam-bs |
12:36
🔗
|
joepie91 |
Ctrl-S: no clue |
12:36
🔗
|
Ctrl-S |
I would honestly rather get this working than search for information on linking the warc stuff to mechanize, but once it's done i'll consider doing it |
12:37
🔗
|
Ctrl-S |
everything goes through a single get() function for web requests, so i suppose i could slip something into that afterwards |
12:38
🔗
|
Ctrl-S |
something that works now, perfection later |
12:38
🔗
|
joepie91 |
mhmm |
12:39
🔗
|
SketchCow |
snuffy: Destination Drigible |
12:40
🔗
|
SketchCow |
snuffy: Last broken maid harvey clam, bring destination forgotten grass-fed. |
12:40
🔗
|
|
SketchCow sets mode: +b *!*bkr@*.mindhackers.org |
12:40
🔗
|
|
snuffy was kicked by SketchCow (snuffy) |
12:40
🔗
|
Ctrl-S |
WARC doesn't need context, just URL, metadata for both directions, and the response, right? |
12:41
🔗
|
Ctrl-S |
if that is true, I can just change one function afterwards to set it up |
12:43
🔗
|
joepie91 |
Ctrl-S: also request body, but if you're only doing GET requests that doesn't really matter |
12:43
🔗
|
joepie91 |
SketchCow: hehe, poisoning its word association cache? :P |
12:43
🔗
|
joepie91 |
also, thanks |
12:44
🔗
|
Ctrl-S |
the function is named get(), it takes a URL and returns the page/file |
12:44
🔗
|
Ctrl-S |
it hides the cookies etc. from the rest of the code |
12:44
🔗
|
joepie91 |
yes, you'll need to capture the request headers also |
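Editor's sketch of the single `get()` choke point discussed above, extended with an optional hook that receives the whole exchange so it can be archived later. Assumptions: the log says the real code uses mechanize, whereas this uses the standard library's `urllib.request` as a stand-in; the `on_exchange` and `opener` parameters and the User-Agent string are hypothetical. It records parsed headers rather than the raw bytes on the wire, so it is a starting point and not a faithful WARC capture.

```python
import urllib.request


def get(url, on_exchange=None, opener=urllib.request.urlopen):
    """Fetch url and return the body; report the full exchange to on_exchange."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "tumblr-archiver-sketch"})
    with opener(request) as response:
        body = response.read()
        if on_exchange is not None:
            # Hand over both directions of the exchange, as the advice above asks.
            on_exchange({
                "url": url,
                "request_headers": dict(request.header_items()),
                "status": response.status,
                "response_headers": dict(response.headers.items()),
                "body": body,
            })
    return body
```

Callers that only want the page keep calling `get(url)`; the archiving side passes a callback, so the rest of the scraper does not need to change.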
12:45
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
12:45
🔗
|
Ctrl-S |
sounds doable, once i learn how to work with the libs. |
12:46
🔗
|
Ctrl-S |
eurgh, I have to get the date of the post from the archive page, rather than the post itself |
12:47
🔗
|
Ctrl-S |
I was hoping to pass a single numerical string |
12:51
🔗
|
joepie91 |
is anybody grabbing the coverage from Paris? |
12:51
🔗
|
Ctrl-S |
what coverage? |
12:52
🔗
|
joepie91 |
Ctrl-S: http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo |
12:52
🔗
|
joepie91 |
.t |
12:52
🔗
|
botpie91 |
Wed, 07 Jan 2015 12:52:09 GMT |
12:52
🔗
|
joepie91 |
... |
12:52
🔗
|
joepie91 |
.title http://www.theguardian.com/world/live/2015/jan/07/shooting-paris-satirical-magazine-charlie-hebdo |
12:52
🔗
|
botpie91 |
joepie91: Charlie Hebdo shooting: twelve dead at Paris offices of satirical magazine – live updates | World news | The Guardian |
12:53
🔗
|
Ctrl-S |
do we have archives of this satirical newspaper? |
12:53
🔗
|
joepie91 |
I don't know, but we should |
12:53
🔗
|
|
ersi sets mode: +o joepie91 |
12:53
🔗
|
joepie91 |
ivan`: ? |
12:53
🔗
|
joepie91 |
what's the status on that? |
12:53
🔗
|
joepie91 |
ersi: thanks |
12:54
🔗
|
joepie91 |
uh oh |
12:54
🔗
|
joepie91 |
Ctrl-S: https://t.co/bHl4vKTZUg |
12:54
🔗
|
joepie91 |
does this load for you |
12:54
🔗
|
Ctrl-S |
slowly |
12:54
🔗
|
Ctrl-S |
blank page so far |
12:54
🔗
|
Ctrl-S |
connected... |
12:55
🔗
|
Ctrl-S |
i'm in wa.au, btw |
12:55
🔗
|
Ctrl-S |
perth |
12:55
🔗
|
Ctrl-S |
might want to ask someone in france |
12:55
🔗
|
Ctrl-S |
504 |
12:56
🔗
|
joepie91 |
:/ |
12:56
🔗
|
joepie91 |
yeah, it's down I think... |
13:03
🔗
|
midas |
joepie91: works here |
13:03
🔗
|
midas |
via ovh proxy |
13:04
🔗
|
raylee |
works here, .uk |
13:07
🔗
|
joepie91 |
yeah, works here now as well, but slow |
13:09
🔗
|
midas |
yep |
13:11
🔗
|
|
primus104 has joined #archiveteam-bs |
13:14
🔗
|
|
Ravenloft has quit IRC (Ping timeout: 370 seconds) |
14:06
🔗
|
Kazzy |
I can't check this right now, apparently it's a video of the shooting.. http://www.liveleak.com/view?i=bc6_1420632668 |
14:06
🔗
|
Kazzy |
probably nsfw/l, don't click if you don't want to. |
14:13
🔗
|
joepie91 |
Kazzy: contains one person shot to death :( |
14:14
🔗
|
Kazzy |
sigh :( |
14:15
🔗
|
godane |
whats the name of the magazine? |
14:16
🔗
|
Kazzy |
charlie hebdo |
14:18
🔗
|
Ctrl-S |
is someone archiving the video? |
14:18
🔗
|
Kazzy |
liveleak video was grabbed through archivebot |
14:20
🔗
|
joepie91 |
the video? or just the page? |
14:20
🔗
|
|
APerti has joined #archiveteam-bs |
14:21
🔗
|
Kazzy |
i have no idea if it grabbed the video too, if someone has stuff on hand to grab it, please do. |
14:22
🔗
|
joepie91 |
Kazzy: youtube-dl'ing it |
14:22
🔗
|
joepie91 |
looks like youtube-dl groks liveleak, so that's good |
14:27
🔗
|
|
sankin has joined #archiveteam-bs |
14:28
🔗
|
|
garyrh has quit IRC (Read error: Operation timed out) |
14:57
🔗
|
|
norbert79 has quit IRC (Quit: leaving) |
15:00
🔗
|
balrog |
chfoo: how feasible would it be for wpull to feed youtube links into youtube-dl or something like that? |
15:06
🔗
|
|
bauruine has joined #archiveteam-bs |
15:08
🔗
|
Ctrl-S |
what is wpull? |
15:09
🔗
|
Ctrl-S |
this is possible: https://github.com/woodenphone/Youtube-dl-runner |
15:09
🔗
|
joepie91 |
Ctrl-S: it's a drop-in replacement (with some changes) for wget written in Python |
15:09
🔗
|
Ctrl-S |
no idea about the wpull side |
15:11
🔗
|
Kazzy |
Ctrl-S: https://github.com/chfoo/wpull if you're interested |
15:18
🔗
|
Kazzy |
if someone can grab a copy of this, please do soon.. it's liveupdating so probably not worth grabbing just yet http://www.bbc.com/news/live/world-europe-30710777 |
15:18
🔗
|
Ctrl-S |
Httrack with new output dir each run? |
15:19
🔗
|
Ctrl-S |
shell script run it at 5-10 min interval? |
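Editor's sketch of the idea floated above (re-run a mirroring tool every few minutes, each time into a fresh output directory), written in Python rather than shell. Assumptions: the helper names, the `snapshots` base directory, and the 300-second interval are hypothetical, and the command assumes HTTrack's `-O` output-path option; substitute whatever downloader is actually used.

```python
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path


def snapshot_dir(base, now=None):
    """Return a fresh, timestamped output directory path under base."""
    now = now or datetime.now(timezone.utc)
    return Path(base) / now.strftime("%Y%m%dT%H%M%SZ")


def build_command(url, out_dir):
    """Assumed mirroring command; swap in the tool that is actually used."""
    return ["httrack", url, "-O", str(out_dir)]


def run_forever(url, base="snapshots", interval_seconds=300):
    """Take one snapshot per interval until interrupted (Ctrl-C)."""
    while True:
        out_dir = snapshot_dir(base)
        out_dir.mkdir(parents=True, exist_ok=True)
        # check=False: one failed run should not stop later snapshots.
        subprocess.run(build_command(url, out_dir), check=False)
        time.sleep(interval_seconds)
```

Calling `run_forever("http://www.bbc.com/news/live/world-europe-30710777")` would then leave one directory per run, so successive states of a live-updating page are kept side by side instead of overwriting each other.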
15:21
🔗
|
Kazzy |
I'm stuck on a chromebook with 10% battery, can't do much from here :p |
15:24
🔗
|
Ctrl-S |
I have a linux box, you write a script to install and run the whatever it is to download the stuff, i'll run it |
15:25
🔗
|
Ctrl-S |
I thought that chmod -R 777 * was a good idea |
15:25
🔗
|
midas |
chmod -R 777 / |
15:25
🔗
|
Ctrl-S |
so i'm not the guy that should write it |
15:25
🔗
|
midas |
anddd run |
15:25
🔗
|
Ctrl-S |
it did help fix my problem |
15:25
🔗
|
Ctrl-S |
maybe |
15:34
🔗
|
|
mistym has joined #archiveteam-bs |
15:35
🔗
|
|
garyrh has joined #archiveteam-bs |
15:37
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
15:39
🔗
|
|
norbert79 has joined #archiveteam-bs |
15:46
🔗
|
midas |
can we grab this? https://www.youtube.com/watch?v=LeIy0zH77MM#t=1624 livestream on YT |
15:46
🔗
|
midas |
(dump the timemarker btw) |
15:51
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
15:55
🔗
|
|
mistym has joined #archiveteam-bs |
16:16
🔗
|
|
bauruine has quit IRC (Ping timeout: 265 seconds) |
16:19
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
16:21
🔗
|
|
bauruine has joined #archiveteam-bs |
16:22
🔗
|
|
Start is now known as StartAway |
16:22
🔗
|
|
StartAway is now known as Start |
16:31
🔗
|
|
godane has joined #archiveteam-bs |
16:40
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
16:41
🔗
|
|
dashcloud has joined #archiveteam-bs |
16:54
🔗
|
|
rejon has quit IRC (Ping timeout: 335 seconds) |
16:58
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
17:09
🔗
|
|
Kassia19 has joined #archiveteam-bs |
17:10
🔗
|
|
Kassia19 has quit IRC (Read error: Connection reset by peer) |
17:14
🔗
|
|
mistym has joined #archiveteam-bs |
17:19
🔗
|
yipdw |
Ctrl-S: fyi archivebot does tumblr archiving ok |
17:22
🔗
|
|
rejon has joined #archiveteam-bs |
17:34
🔗
|
schbirid |
woot, i found a bug on github |
17:34
🔗
|
schbirid |
too dumb to figure out if it is a vulnerability though |
17:36
🔗
|
joepie91 |
schbirid: it's Ruby, I think? so yes, probably |
17:36
🔗
|
joepie91 |
:P |
17:36
🔗
|
aaaaaaaaa |
do they have a bounty program? |
17:37
🔗
|
schbirid |
yeah |
17:45
🔗
|
schbirid |
hm, seems just to escape one element too many |
17:45
🔗
|
schbirid |
not one too few |
17:59
🔗
|
|
midas1 has joined #archiveteam-bs |
18:12
🔗
|
|
rejon has quit IRC (Read error: Operation timed out) |
18:18
🔗
|
|
Coderjoe_ has joined #archiveteam-bs |
18:21
🔗
|
|
primus104 has quit IRC (hub.se irc.efnet.pl) |
18:21
🔗
|
|
schbirid has quit IRC (hub.se irc.efnet.pl) |
18:21
🔗
|
|
primus has quit IRC (hub.se irc.efnet.pl) |
18:21
🔗
|
|
Coderjoe has quit IRC (hub.se irc.efnet.pl) |
18:22
🔗
|
|
primus_ has joined #archiveteam-bs |
18:27
🔗
|
|
schbirid2 has joined #archiveteam-bs |
19:15
🔗
|
|
rejon has joined #archiveteam-bs |
19:37
🔗
|
|
rejon has quit IRC (Ping timeout: 335 seconds) |
19:55
🔗
|
|
Ravenloft has joined #archiveteam-bs |
20:12
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
20:36
🔗
|
|
mistym has joined #archiveteam-bs |
20:42
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Operation timed out) |
21:04
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
21:07
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
21:20
🔗
|
|
mistym has joined #archiveteam-bs |
21:27
🔗
|
|
bsmith093 has quit IRC (Read error: Connection reset by peer) |
21:34
🔗
|
|
abartov has quit IRC (Ping timeout: 258 seconds) |
21:39
🔗
|
|
bsmith093 has joined #archiveteam-bs |
21:43
🔗
|
|
yipdw has quit IRC (Quit: yipdw) |
21:43
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
21:43
🔗
|
|
yipdw has joined #archiveteam-bs |
21:45
🔗
|
|
schbirid2 has quit IRC (Quit: Leaving) |
21:47
🔗
|
|
dashcloud has joined #archiveteam-bs |
21:49
🔗
|
|
abartov has joined #archiveteam-bs |
21:57
🔗
|
|
sankin has quit IRC (Leaving.) |
22:10
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:11
🔗
|
chfoo |
balrog: if it works using a http proxy, it should be doable |
22:11
🔗
|
balrog |
chfoo: it would involve detecting a supported URL and feeding it to the program I think |
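Editor's sketch of the "detect a supported URL and feed it to the program" step, assuming only YouTube watch links are of interest and that the `youtube-dl` executable is on the PATH. It does not use or represent wpull's actual plugin interface; `is_video_url`, `strip_fragment`, and `downloader_command` are hypothetical names. It also drops the `#t=` time marker, as midas asks earlier in the log.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit


def is_video_url(url):
    """Return True for YouTube watch pages and youtu.be short links."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "youtu.be":
        return len(parts.path) > 1
    if host in ("youtube.com", "www.youtube.com"):
        return parts.path == "/watch" and "v" in parse_qs(parts.query)
    return False


def strip_fragment(url):
    """Drop the #fragment (for example a #t= time marker) from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def downloader_command(url):
    """Build the argument list a crawler could hand to subprocess.run."""
    return ["youtube-dl", strip_fragment(url)]


for link in ["https://www.youtube.com/watch?v=LeIy0zH77MM#t=1624",
             "http://www.bbc.com/news/live/world-europe-30710777"]:
    if is_video_url(link):
        print(downloader_command(link))
```

A crawler would call `is_video_url` on each discovered link and queue the resulting command separately, which keeps the large video files out of the page crawl itself.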
22:11
🔗
|
balrog |
I'm a little worried that archivebot doesn't capture youtube videos themselves |
22:12
🔗
|
balrog |
oh, it's in python |
22:13
🔗
|
yipdw |
balrog: it could be done, I'd prefer to have a working replay solution first |
22:13
🔗
|
balrog |
replay? |
22:13
🔗
|
yipdw |
that's why I pointed out that pywb-webrecorder can do it |
22:14
🔗
|
balrog |
doesn't archive.org already have some method of grabbing some youtube stuff? |
22:14
🔗
|
yipdw |
maybe, but as far as I can tell it's not documented |
22:14
🔗
|
balrog |
ah :/ |
22:14
🔗
|
yipdw |
anyway, pywb seems to have Deep Magic From Before The Dawn of Time to do this, so I keep thinking it might be interesting to use its proxy + wpull |
22:15
🔗
|
|
dashcloud has joined #archiveteam-bs |
22:15
🔗
|
yipdw |
another problem is making this not cause WARC size to blow up any more than they do in the default !a case |
22:22
🔗
|
balrog |
Deep Magic From Before The Dawn of Time where? |
22:22
🔗
|
balrog |
https://github.com/ikreymer/pywb/blob/4c08a6a06404388e673ed37a6969023712d91c18/pywb/static/vidrw.js |
22:22
🔗
|
balrog |
it's doing a bunch of transformation |
22:42
🔗
|
yipdw |
yeah |
22:42
🔗
|
yipdw |
also injecting flowplayer, etc. |
23:04
🔗
|
|
APerti has quit IRC (Read error: Operation timed out) |
23:13
🔗
|
|
APerti has joined #archiveteam-bs |
23:13
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
23:13
🔗
|
|
dashcloud has joined #archiveteam-bs |
23:18
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:22
🔗
|
|
APerti has quit IRC (Read error: Operation timed out) |
23:33
🔗
|
|
abartov has quit IRC (Ping timeout: 258 seconds) |
23:34
🔗
|
|
Ebony27 has joined #archiveteam-bs |
23:35
🔗
|
|
Ebony27 has quit IRC (Read error: Connection reset by peer) |
23:42
🔗
|
Start |
http://techcrunch.com/2015/01/07/is-youtube-the-yahoo-of-2015/ |
23:58
🔗
|
joepie91 |
Even BuzzFeed knows point No. 5, and they are the intellectual toilet of the Internet. |
23:58
🔗
|
joepie91 |
ouch |
23:58
🔗
|
BlueMaxim |
*flush* |