Time |
Nickname |
Message |
00:45
🔗
|
|
JesseW has joined #archiveteam-bs |
01:52
🔗
|
|
toad2 has joined #archiveteam-bs |
01:54
🔗
|
|
no2pencil has quit IRC (Read error: Operation timed out) |
01:54
🔗
|
|
toad1 has quit IRC (Read error: Operation timed out) |
02:20
🔗
|
JesseW |
ivan`: I'm going to dump a bunch of Youtube channels of folk music into your form: https://docs.google.com/forms/d/1_kkpBe6abFQ5sznrMfWHhP7ZhdktKejJEpvCCcqVues/viewform -- lemme know if you'd like them as a single email instead (probably about a dozen channels or so). |
02:23
🔗
|
kyan |
ivan`, are those being mirrored to IA? Is a list of items you've archived available, so that I could mirror them to IA off youtube if I wanted? |
02:24
🔗
|
JesseW |
kyan: I know they aren't being mirrored to IA, because one of the points was to avoid burdening IA's servers with stuff they can't/don't want to hold. |
02:25
🔗
|
kyan |
Ah, hmm |
02:25
🔗
|
kyan |
I didn't know they didn't want to hold things |
02:25
🔗
|
JesseW |
I don't know if a public list is available, but I'd be surprised if ivan` would mind privately sending you a list of what he has. |
02:25
🔗
|
kyan |
thought they were more like, expanding as use expanded, or something |
02:26
🔗
|
kyan |
Also, some of my most viewed uploads have been youtube videos I've mirrored to IA |
02:26
🔗
|
kyan |
so I think that it at least generates more traffic for IA to have discoverable content? |
02:26
🔗
|
kyan |
Then again, they don't have ads so I guess traffic ≠ money |
02:27
🔗
|
JesseW |
I don't know that IA minds -- more that I remember seeing ivan` mention he was specifically intending to provide a home for stuff unable to be mirrored at IA. |
02:28
🔗
|
kyan |
Huh ok |
02:28
🔗
|
kyan |
I'm not sure how "unable" anything could be, but whatever |
02:28
🔗
|
kyan |
I mean, if the issue is with copyright, I guess |
02:34
🔗
|
|
schbirid2 has joined #archiveteam-bs |
02:35
🔗
|
JesseW |
I have no idea about why. |
02:36
🔗
|
kyan |
ah :P |
02:37
🔗
|
|
schbirid has quit IRC (Read error: Operation timed out) |
02:39
🔗
|
kyan |
Also a question — how to sort search results on IA by size? |
02:39
🔗
|
JesseW |
I'll be curious what Ivan has to say. |
02:39
🔗
|
JesseW |
size of what, individual files, whole items, something else? |
02:40
🔗
|
kyan |
Whole items |
02:40
🔗
|
JesseW |
I don't *think* that's available through the Advanced Search. |
02:40
🔗
|
JesseW |
Probably extracting it from the census data is your best bet. |
02:40
🔗
|
kyan |
I don't see anything that looks promising there |
02:40
🔗
|
kyan |
Ah, ok. Thanks. |
02:40
🔗
|
* |
kyan can't be bothered atm |
02:40
🔗
|
JesseW |
Heh |
02:40
🔗
|
JesseW |
What were you interested in looking for? |
02:40
🔗
|
kyan |
Also the census wouldn't fit on my drive lol |
02:41
🔗
|
kyan |
I've got a bunch of WARCs uploaded |
02:41
🔗
|
kyan |
the ones from one account have lots of views (10000+ per item, generally) |
02:41
🔗
|
kyan |
while the ones from the other account have like 10–50 views |
02:41
🔗
|
kyan |
I'd like to see if there's something wrong with the ones from the other account |
02:42
🔗
|
kyan |
but a lot of the items are small and only have a few URLs in them, making it understandable that they'd have few views. |
02:42
🔗
|
kyan |
By sorting by item size, I could see which ones have tens of thousands of URLs and try to see if there's something about them that's making them have few views. |
02:43
🔗
|
kyan |
Namely these: https://archive.org/search.php?query=uploader%3A%22worldpeacehaven%40gmail.com%22+mediatype%3Aweb&sort=-downloads&page=2 |
02:43
🔗
|
kyan |
46 views for the most viewed item, and going down from there |
02:43
🔗
|
JesseW |
Ah, if you are only interested in a limited number of identifiers, I'd just hack up curl to download http://archive.org/metadata/{id} for each one, then sort them locally. |
02:43
🔗
|
JesseW |
I thought you wanted to sort the whole corpus |
02:44
🔗
|
JesseW |
s/hack up curl/hack up a shell script *using* curl/ |
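A minimal sketch of the approach JesseW describes, written in Python rather than as a shell script around curl. It fetches the metadata JSON for each identifier and sorts the items locally by size. The top-level `item_size` field and the per-file `size` fallback are assumptions about the metadata response, and the helper names are hypothetical.

```python
import json
import urllib.request

# Same endpoint as mentioned in the chat, one request per identifier.
METADATA_URL = "https://archive.org/metadata/{id}"


def fetch_metadata(identifier):
    """Download and parse the metadata JSON for a single item."""
    with urllib.request.urlopen(METADATA_URL.format(id=identifier)) as resp:
        return json.load(resp)


def item_size(meta):
    """Return the item's size in bytes.

    Assumption: the response carries a top-level "item_size"; if it does
    not, fall back to summing the "size" of each entry in "files".
    """
    if "item_size" in meta:
        return int(meta["item_size"])
    return sum(int(f.get("size", 0)) for f in meta.get("files", []))


def sort_by_size(metas):
    """Return (identifier, size) pairs, largest item first."""
    pairs = [
        (m.get("metadata", {}).get("identifier", "?"), item_size(m))
        for m in metas
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

With a list of identifiers in hand, something like `sort_by_size(fetch_metadata(i) for i in identifiers)` would give the largest items first, which can then be compared against their view counts.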
02:44
🔗
|
kyan |
Compare to my other account https://archive.org/search.php?query=uploader%3A%22kolubat%40gmail.com%22+mediatype%3Aweb&sort=-downloads |
02:44
🔗
|
kyan |
most views is 143K |
02:44
🔗
|
kyan |
makes me think something might be wrong. |
02:44
🔗
|
* |
JesseW needs to get around to uploading my census results |
02:45
🔗
|
kyan |
JesseW, cool, that sounds promising! Thanks! :D |
02:45
🔗
|
JesseW |
(and various shell commands) |
02:45
🔗
|
JesseW |
but I need to figure out what exactly the next step is, too. |
02:57
🔗
|
|
JetBalsa has quit IRC (hub.efnet.us irc.colosolutions.net) |
02:57
🔗
|
|
SadDM has quit IRC (hub.efnet.us irc.colosolutions.net) |
02:57
🔗
|
|
jspiros has quit IRC (hub.efnet.us irc.colosolutions.net) |
02:57
🔗
|
|
matthusby has quit IRC (hub.efnet.us irc.colosolutions.net) |
03:00
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
03:59
🔗
|
|
SN4T14 has quit IRC (Read error: Operation timed out) |
03:59
🔗
|
|
SN4T14 has joined #archiveteam-bs |
03:59
🔗
|
|
MrRadar has quit IRC (Read error: Operation timed out) |
03:59
🔗
|
|
arkiver has quit IRC (Ping timeout: 360 seconds) |
04:00
🔗
|
|
signius has quit IRC (Read error: Operation timed out) |
04:00
🔗
|
|
joepie91 has quit IRC (Read error: Operation timed out) |
04:01
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
04:01
🔗
|
|
phuzion has joined #archiveteam-bs |
04:01
🔗
|
|
zenguy has quit IRC (Ping timeout: 360 seconds) |
04:01
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
04:01
🔗
|
|
atlogbot has quit IRC (Ping timeout: 360 seconds) |
04:02
🔗
|
|
arkiver has joined #archiveteam-bs |
04:02
🔗
|
|
joepie91 has joined #archiveteam-bs |
04:02
🔗
|
|
signius has joined #archiveteam-bs |
04:03
🔗
|
|
atlogbot has joined #archiveteam-bs |
04:04
🔗
|
|
zenguy has joined #archiveteam-bs |
04:04
🔗
|
|
dashcloud has joined #archiveteam-bs |
04:04
🔗
|
|
phuzion has quit IRC (Read error: Operation timed out) |
04:04
🔗
|
|
beardicus has quit IRC (Read error: Operation timed out) |
04:06
🔗
|
|
phuzion has joined #archiveteam-bs |
04:09
🔗
|
|
beardicus has joined #archiveteam-bs |
04:14
🔗
|
|
kvieta has quit IRC (Ping timeout: 633 seconds) |
04:14
🔗
|
|
kvieta has joined #archiveteam-bs |
04:18
🔗
|
|
RedType has quit IRC (Remote host closed the connection) |
04:21
🔗
|
|
MrRadar has joined #archiveteam-bs |
04:22
🔗
|
|
beardicus has quit IRC (Read error: Operation timed out) |
04:26
🔗
|
|
kvieta has quit IRC (Read error: Operation timed out) |
04:27
🔗
|
|
SimpBrain has quit IRC (Ping timeout: 633 seconds) |
04:36
🔗
|
|
SimpBrain has joined #archiveteam-bs |
04:44
🔗
|
|
JesseW has joined #archiveteam-bs |
04:46
🔗
|
|
toad2 has quit IRC (Ping timeout: 864 seconds) |
04:47
🔗
|
|
kvieta has joined #archiveteam-bs |
04:47
🔗
|
|
toad1 has joined #archiveteam-bs |
04:47
🔗
|
|
beardicus has joined #archiveteam-bs |
04:50
🔗
|
|
Swizzle has joined #archiveteam-bs |
04:54
🔗
|
|
zerkalo has joined #archiveteam-bs |
04:54
🔗
|
|
lbft_ has joined #archiveteam-bs |
04:57
🔗
|
|
zerkalo_ has quit IRC (hub.efnet.us irc.Prison.NET) |
04:57
🔗
|
|
chfoo has quit IRC (hub.efnet.us irc.Prison.NET) |
04:57
🔗
|
|
achip has quit IRC (hub.efnet.us irc.Prison.NET) |
04:57
🔗
|
|
lbft has quit IRC (hub.efnet.us irc.Prison.NET) |
04:59
🔗
|
|
chfoo0 has joined #archiveteam-bs |
05:04
🔗
|
|
achip has joined #archiveteam-bs |
05:05
🔗
|
|
pikhq_ has quit IRC (hub.dk irc.homelien.no) |
05:05
🔗
|
|
PurpleSym has quit IRC (hub.dk irc.homelien.no) |
05:05
🔗
|
|
PotcFdk has quit IRC (hub.dk irc.homelien.no) |
05:05
🔗
|
|
coretx has quit IRC (hub.dk irc.homelien.no) |
05:05
🔗
|
|
altlabel has quit IRC (hub.dk irc.homelien.no) |
05:05
🔗
|
|
limebyte has quit IRC (hub.dk irc.homelien.no) |
05:05
🔗
|
|
i0npulse has quit IRC (hub.dk irc.homelien.no) |
05:06
🔗
|
|
Rotab has quit IRC (hub.se irc.du.se) |
05:10
🔗
|
|
coretx_ has joined #archiveteam-bs |
05:11
🔗
|
|
vitzli has joined #archiveteam-bs |
05:16
🔗
|
xmc |
DFJustin: does wayback not support ftp at all, or can you construct ftp warcs and access them somehow |
05:35
🔗
|
|
SmileyG has quit IRC (Read error: Connection reset by peer) |
05:35
🔗
|
|
Smiley has joined #archiveteam-bs |
05:35
🔗
|
|
will has quit IRC (Ping timeout: 252 seconds) |
05:35
🔗
|
|
Rye has quit IRC (Ping timeout: 252 seconds) |
05:38
🔗
|
|
will has joined #archiveteam-bs |
05:40
🔗
|
|
useretail has quit IRC (Ping timeout: 252 seconds) |
05:43
🔗
|
|
will has quit IRC (Ping timeout: 252 seconds) |
05:45
🔗
|
|
will has joined #archiveteam-bs |
05:45
🔗
|
|
Rye has joined #archiveteam-bs |
05:45
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
05:47
🔗
|
JesseW |
Regarding ftp.esri.com, there are 11 wayback machine records from 2013, all of which returned 502 statuscodes. |
05:47
🔗
|
JesseW |
(there are *only* those 11 records) |
05:48
🔗
|
|
useretail has joined #archiveteam-bs |
05:52
🔗
|
|
Sk1d has joined #archiveteam-bs |
05:53
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |
06:14
🔗
|
SimpBrain |
ok... |
06:15
🔗
|
SimpBrain |
i get my nl vps suspended due to high loads, they send out a message saying they are going to do some work on the server. they send out another update saying they will put that server on a new server |
06:15
🔗
|
SimpBrain |
win win for them i think |
06:37
🔗
|
|
pikhq has joined #archiveteam-bs |
06:37
🔗
|
|
i0npulse has joined #archiveteam-bs |
06:37
🔗
|
|
altlabel has joined #archiveteam-bs |
06:37
🔗
|
|
PurpleSym has joined #archiveteam-bs |
06:37
🔗
|
|
PotcFdk has joined #archiveteam-bs |
06:37
🔗
|
|
limebyte has joined #archiveteam-bs |
06:38
🔗
|
vitzli |
JesseW, on IA census, I did IA.BAK census and found something, I don't know if it is only my 'ia search'/my mining script bug or it is common to all - a) both ia-mine and "ia search" return duplicate items b) they miss some items (about 10 or 15 on 600 item collection). I found this when I was doing "parallel --jobs 1" requests, and it happened to cli calls of ia too (week ago). |
06:39
🔗
|
vitzli |
Right now ia-mine --search --itemlist seems to behave better - no item drops, but ia search returned one duplicate record |
06:40
🔗
|
JesseW |
vitzli: https://archive.org/download/ia-bak-census_20150304/metamgr-norm-ids-20150304205357.txt.gz has a single duplicate (see http://archiveteam.org/index.php?title=Internet_Archive_Census#Contents_of_the_Census ) |
06:40
🔗
|
JesseW |
The other census files do seem to have a bunch of duplication -- I'm not sure why. |
06:40
🔗
|
JesseW |
I found that IA search was ... unreliable. |
06:41
🔗
|
vitzli |
it dropped one item and returned one duplicate, to be precise |
06:41
🔗
|
JesseW |
For getting a definitive census of items in larger collections. |
06:41
🔗
|
vitzli |
BUT - doing search multiple times and then sort|uniq it - worked and returned all elements in the collection |
06:41
🔗
|
JesseW |
Feel free to drop a note to jake about it -- I can certainly confirm I've seen the same issue. |
06:42
🔗
|
JesseW |
How many searches did you need to do? |
06:43
🔗
|
JesseW |
I tried to get all the items with addeddates in a particular year, and gave up when I couldn't get consistent results from the search. I should hack up something to retry and combine results until the total matches the provided number (because searches do generate a total number even before any individual results are requested). |
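The retry-and-combine idea could be sketched roughly as below. This is an illustration only, not anyone's actual script: `run_search` is a hypothetical callable that performs one search and returns the identifiers it got back, and `expected_total` is the total the search reports up front.

```python
def collect_until_complete(run_search, expected_total, max_attempts=10):
    """Repeat a flaky search and merge results until enough distinct IDs appear.

    run_search: hypothetical callable returning an iterable of identifiers
                (possibly with duplicates and omissions).
    expected_total: the total number of results reported by the search.
    Returns (set_of_identifiers, attempts_used).
    """
    seen = set()
    for attempt in range(1, max_attempts + 1):
        # A set union drops duplicates within and across runs,
        # the same effect as `sort | uniq` over the concatenated output.
        seen.update(run_search())
        if len(seen) >= expected_total:
            return seen, attempt
    # Gave up: the caller can compare len(seen) with expected_total.
    return seen, max_attempts
```

The cap on attempts is a design choice to avoid looping forever if the reported total is itself wrong.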
06:44
🔗
|
vitzli |
maybe 3 on 162 item collection, 3 or 4 on bigger collections (I think it was walnutcreekcdrom collection) |
06:44
🔗
|
JesseW |
hm |
06:46
🔗
|
vitzli |
got 162 items on the first 'ia search' run, but maybe 5 were duplicates, and did it again |
06:46
🔗
|
DFJustin |
xmc: as far as I know wayback doesn't support it at all. it is possible to construct ftp warcs but I don't know what tools are able to use them |
06:46
🔗
|
|
chfoo0 is now known as chfoo |
07:31
🔗
|
vitzli |
JesseW, right now: text file from ia search --itemlist 'collection:(prelingeritems)' : |
07:31
🔗
|
vitzli |
sort prelingeritems.txt | wc -l : 6533 |
07:31
🔗
|
vitzli |
sort -u prelingeritems.txt | wc -l: 4895 |
07:31
🔗
|
JesseW |
ha |
07:31
🔗
|
JesseW |
yeah, that's ... less than ideal |
07:33
🔗
|
vitzli |
uh, just prelinger collection, not prelingeritems |
07:42
🔗
|
JesseW |
vitzli: I got all 6533 distinct values the first time I made the search |
07:43
🔗
|
vitzli |
'ia search'? |
07:45
🔗
|
vitzli |
JesseW, https://paste.ee/p/vkBQU |
07:46
🔗
|
JesseW |
I was using the python interface. |
07:50
🔗
|
|
robink has quit IRC (Ping timeout: 190 seconds) |
07:50
🔗
|
vitzli |
a is a list of identifiers in collection, len(a): 6533; len(set(a)): 5071 |
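The `len(a)` versus `len(set(a))` comparison can be extended to show which identifiers are repeated and how often. A small sketch, assuming `a` is a plain list of identifier strings as in the message above; the function name is hypothetical.

```python
from collections import Counter


def duplicate_report(identifiers):
    """Return (total, distinct, duplicates) for a list of identifiers.

    duplicates maps each identifier seen more than once to its count.
    """
    counts = Counter(identifiers)
    duplicates = {ident: n for ident, n in counts.items() if n > 1}
    return len(identifiers), len(counts), duplicates
```

Comparing the duplicates from two separate runs would show whether the same identifiers repeat each time or whether the duplication is random.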
07:51
🔗
|
vitzli |
is my install somehow broken? |
07:51
🔗
|
|
robink has joined #archiveteam-bs |
07:52
🔗
|
JesseW |
I'm not sure. I have to head to sleep now. Good luck. |
07:52
🔗
|
vitzli |
good night |
07:52
🔗
|
|
JesseW has quit IRC (Quit: Leaving.) |
07:53
🔗
|
vitzli |
JesseW, on python/IA search results: https://paste.ee/p/iXqQo |
08:23
🔗
|
|
kyan has quit IRC (Ping timeout: 260 seconds) |
08:47
🔗
|
|
RedType has joined #archiveteam-bs |
09:58
🔗
|
HCross2 |
Hmm. What's the best way of getting a csv of a collection, listing the file name and the date it was uploaded, as well as the views? |
10:00
🔗
|
|
Rotab has joined #archiveteam-bs |
10:41
🔗
|
|
lytv has quit IRC (Ping timeout: 250 seconds) |
10:41
🔗
|
|
vtyl has joined #archiveteam-bs |
11:00
🔗
|
|
achip has quit IRC (hub.efnet.us irc.Prison.NET) |
11:07
🔗
|
|
signius has quit IRC (Read error: Operation timed out) |
11:17
🔗
|
|
achip has joined #archiveteam-bs |
11:20
🔗
|
|
signius has joined #archiveteam-bs |
13:01
🔗
|
|
arkiver3 has joined #archiveteam-bs |
13:28
🔗
|
joepie91 |
SketchCow: https://www.youtube.com/watch?v=cPaij2G3wTQ |
13:32
🔗
|
|
arkiver3 has quit IRC (Ping timeout: 252 seconds) |
14:13
🔗
|
|
arkiver3 has joined #archiveteam-bs |
14:24
🔗
|
|
arkiver3 has quit IRC (Ping timeout: 252 seconds) |
14:29
🔗
|
|
arkiver3 has joined #archiveteam-bs |
14:56
🔗
|
SketchCow |
Yeah, I've seen it. |
15:05
🔗
|
ersi |
what an annoying voice |
15:05
🔗
|
ersi |
but yes yes and yes for everything in it |
15:11
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
15:17
🔗
|
joepie91 |
ersi: haha, exactly my thoughts |
15:17
🔗
|
joepie91 |
watched a few eps so far |
15:17
🔗
|
joepie91 |
"jesus that voice is annoying, but he is so damn right about every single thing he says" |
15:17
🔗
|
joepie91 |
- every ep |
15:31
🔗
|
ersi |
I wouldn't watch more than that single episode |
15:36
🔗
|
|
wednesday has quit IRC (Ping timeout: 252 seconds) |
15:37
🔗
|
|
wednesday has joined #archiveteam-bs |
15:49
🔗
|
|
Start has joined #archiveteam-bs |
15:50
🔗
|
|
wednesday has quit IRC (Ping timeout: 252 seconds) |
15:53
🔗
|
|
arkiver3 has quit IRC (Quit: Nettalk6 - www.ntalk.de) |
16:48
🔗
|
schbirid2 |
https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en cookies.txt |
16:51
🔗
|
joepie91 |
schbirid2: handy |
16:51
🔗
|
schbirid2 |
btw if you wget -x the same url you spent the night downloading it will start from 0 again \o/ |
17:07
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
17:19
🔗
|
|
Start has joined #archiveteam-bs |
17:24
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
17:51
🔗
|
|
vitzli has quit IRC (Leaving) |
17:52
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
17:55
🔗
|
|
dashcloud has joined #archiveteam-bs |
18:20
🔗
|
|
Swizzle has joined #archiveteam-bs |
18:43
🔗
|
|
Start has joined #archiveteam-bs |
19:00
🔗
|
|
signius has quit IRC (Ping timeout: 300 seconds) |
19:08
🔗
|
|
espes__ has quit IRC (Ping timeout: 252 seconds) |
19:12
🔗
|
|
signius has joined #archiveteam-bs |
19:14
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
19:20
🔗
|
|
Start has joined #archiveteam-bs |
19:24
🔗
|
|
acridAxid has quit IRC (Quit: marauder) |
19:29
🔗
|
|
kyan has joined #archiveteam-bs |
19:29
🔗
|
kyan |
my grab-site server is out of disk space :( |
19:29
🔗
|
kyan |
downloads faster than it can upload |
19:33
🔗
|
kyan |
Also TIL don't leave a grab-site --1 of a page that mentions Pinterest running over night unattended if you turned off the dupechecker. 56.5GB downloaded, 193k responses, almost all Pinterest 404s |
19:33
🔗
|
kyan |
ffs |
19:33
🔗
|
|
acridAxid has joined #archiveteam-bs |
19:34
🔗
|
godane |
SketchCow: i'm grabbing more of Network World from google books |
19:34
🔗
|
godane |
cause in part of what you said about googlebooks twitter account going private |
19:42
🔗
|
joepie91 |
http://trumpdonald.org/ |
19:47
🔗
|
ivan` |
kyan: https://gist.github.com/ivan/5779ac8d43817092aca6 |
19:47
🔗
|
|
Swizzle has quit IRC (Quit: Leaving) |
19:48
🔗
|
ivan` |
verify the df line before deploying |
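For readers without the gist at hand, the underlying check is "how much free space is left on the download volume". The sketch below is not ivan`'s script; it only illustrates a free-space test using the standard library, with a hypothetical function name and a threshold chosen by the caller. What to do when space is low (for example pausing the crawl) is left out.

```python
import shutil


def low_on_disk(path, min_free_bytes):
    """Return True if the filesystem holding `path` has less free space
    than `min_free_bytes`."""
    return shutil.disk_usage(path).free < min_free_bytes
```

As with the `df` line in the gist, it is worth confirming that `path` really sits on the volume the crawl writes to before relying on the result.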
19:48
🔗
|
kyan |
ivan`: Ooh, cool, thanks! :D |
19:49
🔗
|
ivan` |
kyan: not mirroring my 2M YouTube videos to IA. The plan is to scan my collection for deleted/private/unlisted videos and upload those. Just need to write software to check all the IDs and upload to IA. |
19:50
🔗
|
midas |
lol joepie91 |
19:50
🔗
|
kyan |
ivan`, Aah, cool, that's a good solution |
19:51
🔗
|
kyan |
Might make sense to add that gist to the readme for grab-site too, that's handy |
19:51
🔗
|
ivan` |
yeah |
19:53
🔗
|
ivan` |
does anyone have some existing infrastructure to hit a site through many proxies? |
19:53
🔗
|
ivan` |
http://crawlera.com/ besides this commercial offering that I don't want to pay for |
19:54
🔗
|
|
Silvan has quit IRC (Read error: Operation timed out) |
19:54
🔗
|
kyan |
(Well, the warrior kind of does that) |
19:55
🔗
|
kyan |
Wow, $25 for 150k requests per month. That's pretty expensive |
20:01
🔗
|
joepie91 |
lol |
20:01
🔗
|
joepie91 |
kyan: you want to see expensive? |
20:01
🔗
|
joepie91 |
kyan: https://luminati.io/ |
20:02
🔗
|
kyan |
HAHAHAHAhaha ha .... ha? |
20:02
🔗
|
kyan |
do they get any customers? |
20:03
🔗
|
joepie91 |
kyan: yeah. |
20:03
🔗
|
joepie91 |
quite a few |
20:04
🔗
|
joepie91 |
kyan: it's used by companies scraping prices and shit |
20:04
🔗
|
joepie91 |
from competitors |
20:04
🔗
|
joepie91 |
their peers are Hola users |
20:04
🔗
|
joepie91 |
so, almost all residential |
20:04
🔗
|
kyan |
Hm, interesting |
20:04
🔗
|
kyan |
I'm not sure how well it would work against sophisticated crawler prevention |
20:05
🔗
|
kyan |
e.g. if they're scraping sequential IDs, that could be tracked between IP addresses |
20:05
🔗
|
kyan |
captchas could be required on suspicious requests |
20:05
🔗
|
kyan |
and also Bing would have search results as good as Google if it worked |
20:17
🔗
|
|
SilSte has joined #archiveteam-bs |
20:22
🔗
|
ivan` |
https://github.com/ludios/grab-site#automatically-pausing-grab-site-processes-when-free-disk-is-low |
20:26
🔗
|
ivan` |
I have a 3.5TB grab-site of http://digitalcollections.nypl.org/ going |
20:26
🔗
|
ivan` |
and 2.5TB of http://downloads.dell.com/ |
20:28
🔗
|
SimpBrain |
nice old driver/software downloads is good to have |
20:28
🔗
|
SimpBrain |
only a matter of time before they remove old downloads |
20:36
🔗
|
yipdw |
joepie91: wtf on luminati |
20:37
🔗
|
yipdw |
oh it's Hola |
20:38
🔗
|
|
kyan has quit IRC (Quit: This computer has gone to sleep) |
20:38
🔗
|
yipdw |
I thought they were actively infecting computers or some shit |
20:41
🔗
|
|
JW_work has joined #archiveteam-bs |
20:42
🔗
|
|
kyan has joined #archiveteam-bs |
20:43
🔗
|
JW_work |
ivan`: Here is a list of 127 youtube channels of contra dance music (with various other random crap mixed in), if you'd like to archive them: https://0bin.net/paste/0MMv2M-eh1hSTydI#SbPWz+5z+HxWt4YurJDQQUEjI7iWskKNDbNLGOBF0ik |
20:44
🔗
|
|
Start has quit IRC (Quit: Disconnected.) |
20:44
🔗
|
JW_work |
It's in OPML XML format — I'm glad to work on transforming it into an easier to use format if that'd be helpful. |
20:45
🔗
|
JW_work |
(the random crap is because various of the channels are their owner's personal channels, so they also uploaded various home video-type stuff — all the channels should have at least some contra dance music, and there aren't any channels focused on other topics, IIRC) |
20:48
🔗
|
ivan` |
I can transform XML with my mad sublime text skills, don't worry about that part |
20:48
🔗
|
ivan` |
that sure is a lot of channels |
20:49
🔗
|
JW_work |
yep, I've been collecting them for a while |
20:50
🔗
|
JW_work |
I've been (very slowly) working on indexing the contra dance videos on them on to MusicBrainz — when I heard about your archiving effort, I thought I'd send it over. |
20:50
🔗
|
JW_work |
I can also give you a smaller list of higher value ones, if you'd like. |
20:51
🔗
|
JW_work |
currently listening to https://www.youtube.com/watch?v=pthkg4f2HAo |
20:51
🔗
|
yipdw |
the more I use curl, the more I am disgusted by HTTP libraries |
20:52
🔗
|
ivan` |
JW_work: I will add all of them if you think they're all worth archiving |
20:52
🔗
|
JW_work |
I think they are all worth archiving. |
20:52
🔗
|
ivan` |
my script will work through them over about a month |
20:53
🔗
|
JW_work |
Great! None of them are particularly in danger of vanishing right now, so a month should work fine. |
20:53
🔗
|
* |
ivan` goes to write a program to turn channels into usernames |
20:54
🔗
|
JW_work |
Yeah, the XML is just the output of https://www.youtube.com/subscription_manager?action_takeout=1 |
20:55
🔗
|
JW_work |
so if you write something to handle it, it will likely be generally useful |
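A rough sketch of turning such an OPML export into a channel list. This is not the program ivan` wrote. It assumes each subscription is an `outline` element whose `xmlUrl` attribute is a feed URL carrying a `channel_id` query parameter; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse


def channels_from_opml(opml_text):
    """Extract (title, channel_id) pairs from an OPML subscription export.

    Assumption: feed URLs look like ...?channel_id=<ID>. Outlines without
    an xmlUrl (such as the enclosing folder) are skipped.
    """
    root = ET.fromstring(opml_text)
    channels = []
    for outline in root.iter("outline"):
        feed = outline.get("xmlUrl")
        if not feed:
            continue
        ids = parse_qs(urlparse(feed).query).get("channel_id")
        if ids:
            title = outline.get("title") or outline.get("text") or ""
            channels.append((title, ids[0]))
    return channels
```

Mapping channel IDs to usernames, which is what ivan` mentions needing, would be a separate step and is not covered here.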
21:21
🔗
|
ivan` |
JW_work: OK, all of your subscriptions and spreadsheet submissions are queued |
21:22
🔗
|
ivan` |
beware my youtube archiver is a stochastic process |
21:22
🔗
|
ivan` |
and something like 1% of videos fail to download without manual intervention which I almost never bother with |
21:22
🔗
|
ivan` |
youtube is great. announces formats that it fails to serve. |
21:28
🔗
|
JW_work |
that shouldn't be a problem — the ones I *have* indexed on MusicBrainz should have already been grabbed by the musicbrainz external links warrior project recently, and I'll likely grab my high-value targets myself too; but it's very good to have another copy elsewhere, so thank you! |
21:29
🔗
|
ivan` |
np |
21:44
🔗
|
joepie91 |
yipdw: they are |
21:44
🔗
|
ersi |
ivan`: Jeez, that's some huge fucking grabs! |
21:44
🔗
|
joepie91 |
yipdw: with hola |
21:44
🔗
|
joepie91 |
lol |
21:44
🔗
|
joepie91 |
yipdw: http://adios-hola.org/ |
21:45
🔗
|
ersi |
yipdw: What in particular are you disgusted about with curl? |
21:51
🔗
|
|
kyan_ has joined #archiveteam-bs |
21:53
🔗
|
|
kyan has quit IRC (Ping timeout: 258 seconds) |
21:53
🔗
|
|
kyan_ is now known as kyan |
22:04
🔗
|
kyan |
don't have time to look into it right now but these ftp://ftp.us.dell.com/video/ just got posted to /r/opendirectories. Lots of drivers |
22:04
🔗
|
arkiver |
ooooh |
22:05
🔗
|
arkiver |
will check that in for the ftp project |
22:05
🔗
|
kyan |
might be stuff in the parent dir too |
22:06
🔗
|
arkiver |
yep, will get that too |
22:16
🔗
|
ersi |
http://www.bloomberg.com/features/2016-solar-power-buffett-vs-musk/img/buffett_vs_musk.gif |
22:16
🔗
|
ersi |
hehehe |
22:27
🔗
|
|
kyan has quit IRC (This computer has gone to sleep) |
22:28
🔗
|
Smiley |
is there any chance of getting major to autovoice me in #archivebot ? |
22:46
🔗
|
HCross |
Watching my warrior archive the Friends Reunited page for my old school is so satisfying |
22:53
🔗
|
joepie91 |
wow |
22:53
🔗
|
joepie91 |
when you think you've seen everything |
22:53
🔗
|
joepie91 |
http://www.nieuwsbladtransport.nl/Nieuws/Article/tabid/85/ArticleID/40874/ArticleName/Samskipgaatreorganiseren/Default.aspx |
22:53
🔗
|
joepie91 |
cc arkiver |
22:54
🔗
|
joepie91 |
"Dear reader, After one year, the pictures in our articles are removed from the site. The texts themselves, however, will remain unchanged." |
22:54
🔗
|
HCross |
.... |
22:54
🔗
|
HCross |
wow |
22:54
🔗
|
joepie91 |
so yeah, one for your newsbot |
22:54
🔗
|
joepie91 |
lol |
22:54
🔗
|
joepie91 |
amazing, though |
22:54
🔗
|
joepie91 |
never seen this before, boggles the mind |
22:54
🔗
|
HCross |
joepie91, ill add it soon. Not atm though |
22:55
🔗
|
HCross |
Im not popular atm with the datacenter |
22:55
🔗
|
joepie91 |
HCross: haha, how come |
22:55
🔗
|
HCross |
Bandwidth, ALL OF THE BANDWIDTH |
22:55
🔗
|
joepie91 |
lol |
22:55
🔗
|
joepie91 |
HCross: oh, they paywall too |
22:55
🔗
|
HCross |
feck |
22:55
🔗
|
joepie91 |
might need to make sure you're grabbing it without cookies |
22:56
🔗
|
|
vtyl has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
RedType has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
SimpBrain has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
phuzion has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
atlogbot has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
schbirid2 has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
Infreq has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
JW_work has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
mistym has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
dxrt has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
swebb has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
slyphic has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
|
chazchaz has quit IRC (hub.efnet.us irc.servercentral.net) |
22:56
🔗
|
joepie91 |
yeah, this site is a bit special |
22:56
🔗
|
joepie91 |
lol |
22:57
🔗
|
|
RedType_ has joined #archiveteam-bs |
22:57
🔗
|
|
phuzion_ has joined #archiveteam-bs |
22:59
🔗
|
|
mistym- has joined #archiveteam-bs |
23:00
🔗
|
|
lytv has joined #archiveteam-bs |
23:00
🔗
|
|
dxrt_ has joined #archiveteam-bs |
23:01
🔗
|
ersi |
joepie91: I guess they.. only license the images for one year? |
23:01
🔗
|
|
Infreq_ has joined #archiveteam-bs |
23:01
🔗
|
ersi |
Incredibly stupid though |
23:01
🔗
|
|
SimpBrai1 has joined #archiveteam-bs |
23:01
🔗
|
joepie91 |
very much so |
23:01
🔗
|
joepie91 |
lol |
23:01
🔗
|
joepie91 |
ersi: also, image licensing for news in NL is not usually time-limited... |
23:02
🔗
|
|
schbirid has joined #archiveteam-bs |
23:02
🔗
|
joepie91 |
they must've gotten the short end of the stick with their licensing agency :P |
23:02
🔗
|
ersi |
or they just wanted cheaper pics |
23:06
🔗
|
|
swebb has joined #archiveteam-bs |
23:07
🔗
|
|
JW_work2 has joined #archiveteam-bs |
23:08
🔗
|
|
chazchaz has joined #archiveteam-bs |
23:09
🔗
|
|
slyphic has joined #archiveteam-bs |
23:10
🔗
|
|
Start has joined #archiveteam-bs |
23:11
🔗
|
HCross |
from argparse import ArgumentParser |
23:11
🔗
|
HCross |
ImportError: No module named argparse |
23:11
🔗
|
HCross |
arkiver, ^^ |
23:12
🔗
|
HCross |
nvm |
23:22
🔗
|
xmc |
or they don't like paying for storage |
23:45
🔗
|
Smiley |
k, my warrior stats are miles off |
23:45
🔗
|
Smiley |
I'm on 30Mbit |
23:45
🔗
|
Smiley |
it's telling me 280MB/s |
23:45
🔗
|
Smiley |
D: |
23:45
🔗
|
Smiley |
Oh wait reading the total XD |
23:46
🔗
|
HCross |
I'm waiting to get Debian installed on this server then I will have a clue what I am doing |
23:50
🔗
|
|
dxrt_ is now known as dxrt |