Time |
Nickname |
Message |
00:03
🔗
|
|
ranav has quit IRC (Read error: Connection reset by peer) |
00:14
🔗
|
|
ranavalon has joined #archiveteam-bs |
00:14
🔗
|
|
ranavalon has quit IRC (Remote host closed the connection) |
00:15
🔗
|
|
ranavalon has joined #archiveteam-bs |
00:18
🔗
|
|
BlueMaxim has quit IRC (Leaving) |
01:00
🔗
|
|
ranavalon has quit IRC (Quit: Leaving) |
01:15
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
01:42
🔗
|
|
yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac) |
01:42
🔗
|
|
yuitimoth has joined #archiveteam-bs |
01:54
🔗
|
|
yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac) |
01:54
🔗
|
|
yuitimoth has joined #archiveteam-bs |
02:12
🔗
|
|
DFJustin has quit IRC (Remote host closed the connection) |
02:15
🔗
|
|
DFJustin has joined #archiveteam-bs |
02:15
🔗
|
|
swebb sets mode: +o DFJustin |
02:38
🔗
|
bithippo |
Is it possible to change which collection an item belongs to after creating said item? |
02:43
🔗
|
|
bithippo has quit IRC (Ping timeout: 260 seconds) |
03:29
🔗
|
|
atlogbot has quit IRC (Read error: Operation timed out) |
03:29
🔗
|
|
swebb has quit IRC (Read error: Operation timed out) |
03:30
🔗
|
|
swebb has joined #archiveteam-bs |
03:30
🔗
|
|
atlogbot has joined #archiveteam-bs |
03:30
🔗
|
|
svchfoo3 sets mode: +o swebb |
03:30
🔗
|
|
svchost03 sets mode: +v atlogbot |
04:46
🔗
|
|
jdude104 has quit IRC (Read error: Operation timed out) |
04:49
🔗
|
|
qw3rty14 has joined #archiveteam-bs |
04:53
🔗
|
|
qw3rty13 has quit IRC (Read error: Operation timed out) |
05:05
🔗
|
|
K4k has quit IRC (Read error: Connection reset by peer) |
05:42
🔗
|
|
w0rp has quit IRC (Ping timeout: 245 seconds) |
05:45
🔗
|
|
w0rp has joined #archiveteam-bs |
06:28
🔗
|
|
zyphlar has joined #archiveteam-bs |
07:04
🔗
|
|
sekolyn has joined #archiveteam-bs |
07:05
🔗
|
|
octothorp has quit IRC (Read error: Operation timed out) |
07:28
🔗
|
|
octothorp has joined #archiveteam-bs |
07:29
🔗
|
|
sekolyn has quit IRC (Read error: Operation timed out) |
07:29
🔗
|
|
kpz has joined #archiveteam-bs |
07:30
🔗
|
|
kpz has left |
07:44
🔗
|
|
Asparagir has joined #archiveteam-bs |
08:38
🔗
|
|
zyphlar has quit IRC (Quit: Connection closed for inactivity) |
08:46
🔗
|
|
Asparagir has quit IRC (Asparagir) |
09:49
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
11:07
🔗
|
|
slyphic has quit IRC (Read error: Operation timed out) |
11:13
🔗
|
|
slyphic has joined #archiveteam-bs |
11:42
🔗
|
|
ZexaronS- has quit IRC (Quit: Leaving) |
12:31
🔗
|
|
altlabel has joined #archiveteam-bs |
12:35
🔗
|
JAA |
Anyone else getting a lot of errors when accessing the Wayback Machine? I get "unable to connect", timeouts, pages which never finish loading, etc. |
12:46
🔗
|
|
sep332_ has joined #archiveteam-bs |
12:47
🔗
|
|
sep332 has quit IRC (Read error: Operation timed out) |
13:30
🔗
|
|
jacketcha has quit IRC (Read error: Connection reset by peer) |
13:48
🔗
|
JAA |
Seems to be better now. |
15:20
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 255 seconds) |
15:20
🔗
|
|
Mateon1 has joined #archiveteam-bs |
15:33
🔗
|
|
jdude104 has joined #archiveteam-bs |
15:46
🔗
|
|
jdude104 has quit IRC (Quit: Leaving) |
16:37
🔗
|
|
schbirid has joined #archiveteam-bs |
17:33
🔗
|
|
RichardG_ has joined #archiveteam-bs |
17:33
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
17:47
🔗
|
|
RichardG_ has quit IRC (Read error: Connection reset by peer) |
17:50
🔗
|
|
jschwart has joined #archiveteam-bs |
17:53
🔗
|
|
RichardG has joined #archiveteam-bs |
17:54
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
17:57
🔗
|
|
RichardG has joined #archiveteam-bs |
18:13
🔗
|
SketchCow |
I'm cleaning WARCs still |
18:13
🔗
|
SketchCow |
https://archive.org/details/archiveteam_miiverse is getting that massive miiverse grab |
18:14
🔗
|
SketchCow |
https://archive.org/details/warczone now exists. It is "outsider" WARCs, WARCs where we have no idea who is sending them. There's a good chance they won't go directly into Wayback. |
18:15
🔗
|
|
ReimuHaku has quit IRC (Ping timeout: 250 seconds) |
18:17
🔗
|
|
ReimuHaku has joined #archiveteam-bs |
18:17
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
18:17
🔗
|
|
RichardG has joined #archiveteam-bs |
18:22
🔗
|
|
K4k has joined #archiveteam-bs |
18:23
🔗
|
jrwr |
SketchCow: that's a damn fine pun you made there |
18:46
🔗
|
SketchCow |
https://archive.org/details/archiveteam_yahoogroups is about to get super huge |
18:48
🔗
|
|
adinbied has joined #archiveteam-bs |
18:50
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
18:50
🔗
|
adinbied |
Hi all, I seem to have lost the link to the Discord server, can anyone send it to me? A while back I asked about archiving Gazelle-based sites, and got linked to the discord server to talk to -Archivist-, as he/she was working on that at the time. Thanks! |
18:51
🔗
|
|
RichardG has joined #archiveteam-bs |
18:53
🔗
|
JAA |
adinbied: If it was posted in here, try searching the logs: http://archive.fart.website/bin/irclogger_logs |
18:55
🔗
|
adinbied |
Found it, thanks! |
18:59
🔗
|
|
adinbied has quit IRC (Quit: Page closed) |
19:35
🔗
|
|
ndiddy_ has quit IRC () |
19:38
🔗
|
|
ndiddy has joined #archiveteam-bs |
20:01
🔗
|
|
REiN^ has quit IRC (no.money.no.love) |
20:02
🔗
|
|
purplebot has quit IRC (Ping timeout: 248 seconds) |
20:02
🔗
|
PurpleSym |
SketchCow: Can I get permission to upload to that collection? |
20:02
🔗
|
|
HCross2 has quit IRC (Ping timeout: 248 seconds) |
20:03
🔗
|
|
Rai-chan has quit IRC (Ping timeout: 248 seconds) |
20:03
🔗
|
|
i0npulse has quit IRC (Ping timeout: 248 seconds) |
20:09
🔗
|
|
RichardG has quit IRC (Ping timeout: 248 seconds) |
20:10
🔗
|
|
RichardG has joined #archiveteam-bs |
20:17
🔗
|
|
AeonG_ has joined #archiveteam-bs |
20:22
🔗
|
|
Caz has quit IRC (Read error: Operation timed out) |
20:30
🔗
|
|
purplebot has joined #archiveteam-bs |
20:30
🔗
|
|
i0npulse has joined #archiveteam-bs |
20:31
🔗
|
|
HCross2 has joined #archiveteam-bs |
20:31
🔗
|
|
svchfoo1 sets mode: +o HCross2 |
20:33
🔗
|
|
Rai-chan has joined #archiveteam-bs |
20:33
🔗
|
|
odemg has quit IRC (Read error: Operation timed out) |
20:36
🔗
|
|
odemg has joined #archiveteam-bs |
20:41
🔗
|
SketchCow |
I don't see why not, you're one of the processes I've got cleaning up |
20:43
🔗
|
SketchCow |
You now have archiveteam_yahoogroups. You might need to log out of your browser to get it noticed. |
20:50
🔗
|
DrasticAc |
SketchCow: Thanks for moving those miiverse files |
20:52
🔗
|
DrasticAc |
Kinda realized part of the way through making them that I _probably_ should have less of them, rather than 10,000 post chunks. |
20:52
🔗
|
DrasticAc |
But, hey, it's easier for people to download a 200 MB warc than multiple terabytes if they just need one post ;) |
20:55
🔗
|
|
octothorp has quit IRC (Read error: Connection reset by peer) |
20:59
🔗
|
|
Rai-chan has quit IRC (Ping timeout: 248 seconds) |
21:01
🔗
|
|
HCross2 has quit IRC (Ping timeout: 248 seconds) |
21:01
🔗
|
|
purplebot has quit IRC (Ping timeout: 248 seconds) |
21:02
🔗
|
godane |
SketchCow: my cat threw up on one of your boxes |
21:03
🔗
|
godane |
I NEED TO GET LABELS NOW SO I CAN MAIL THEM BEFORE THE CAT RUINS YOUR STUFF |
21:06
🔗
|
|
i0npulse has quit IRC (Ping timeout: 248 seconds) |
21:06
🔗
|
godane |
tapes are fine but box has dry cat vomit on it |
21:08
🔗
|
|
SketchCow changes topic to: Lengthy Archive Team related discussions here | General archiving & offtopic: #archiveteam-ot | < godane> SketchCow: my cat threw up on one of your boxes |
21:08
🔗
|
SketchCow |
Let me get on that |
21:08
🔗
|
SketchCow |
DrasticAc: yes, if I'd had more of a say on your project, I'd have said you should have 50 GB per item |
21:09
🔗
|
DrasticAc |
Yeah, it was one of those things I didn't know until it was too late to switch. |
21:10
🔗
|
DrasticAc |
But next time, I have a better idea of what to do. |
21:10
🔗
|
Igloo |
godane: I am glad I am not the only one with that problem. My cats puke on stuff all the time ¬_¬ |
21:11
🔗
|
DrasticAc |
I don't know if it'll be useful, but I was thinking of making a mini-archivebot for stuff like Slack or Discord. |
21:12
🔗
|
SketchCow |
https://archive.org/details/archiveteam_verizon |
21:12
🔗
|
SketchCow |
You can see my script slowly adding a filler logo to all the items |
21:12
🔗
|
DrasticAc |
Since it seems like a portion of stuff that gets submitted to archivebot are one-off sites (like twitter links), having something like that available more widely might be useful. |
21:13
🔗
|
godane |
Igloo: luckily most of my stuff is in my room |
21:13
🔗
|
DrasticAc |
Although I guess you can use the IA extension for that. |
21:13
🔗
|
SketchCow |
The problem is that people are not very good at assessing archivebot |
21:13
🔗
|
godane |
and the cat doesn't come into my room |
21:13
🔗
|
godane |
but there is no room for boxes in my room |
21:13
🔗
|
SketchCow |
And we get people doing things like "hurr durr The Onion is pretty amazeballs, I better kick off a million-url job with one line because just in case" |
21:14
🔗
|
SketchCow |
"Hey, someone mirrored a mirror of a mirror we mirror, better get THAT copy too" |
21:14
🔗
|
Igloo |
We are trying to police that much better though.... |
21:14
🔗
|
SketchCow |
We are |
21:14
🔗
|
SketchCow |
Adding it to random discords or slacks would not be smart |
21:14
🔗
|
DrasticAc |
Could keep a database to check against that though. |
21:14
🔗
|
SketchCow |
I'd kill any link |
21:14
🔗
|
DrasticAc |
Like, if x link was already archived, don't do it again. |
21:14
🔗
|
SketchCow |
Drop to a whitelist of people who can kick off jobs |
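The two safeguards mentioned above (skip links that were already archived, and only let whitelisted people start jobs) can be sketched roughly as follows. This is only an illustration, not how archivebot actually works: `JobGate`, `normalize`, and the in-memory set standing in for the "database" are all hypothetical names and assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lower-case scheme and host and drop the fragment so trivial variants match."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class JobGate:
    """Hypothetical gate: whitelist of nicks plus a set of already-seen URLs."""

    def __init__(self, allowed_nicks):
        self.allowed = {nick.lower() for nick in allowed_nicks}
        self.seen = set()  # stand-in for a persistent database

    def should_archive(self, nick: str, url: str) -> bool:
        # Reject anyone who is not on the whitelist.
        if nick.lower() not in self.allowed:
            return False
        # Reject links that were already accepted once.
        key = normalize(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```

A real deployment would presumably need persistent storage and a smarter notion of "same link" than this normalization.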
21:15
🔗
|
Igloo |
DrasticAc we do that. But it's just a bit broken at the moment. If you want to help us fix it we'd appreciate it ;-) |
21:15
🔗
|
Igloo |
AB is a victim of its own success. |
21:16
🔗
|
SketchCow |
Just saying. Don't make more links to archivebot |
21:16
🔗
|
godane |
in other news i got my archivebox rpi project to broadcast a 'honeypot' wifi |
21:16
🔗
|
SketchCow |
Or things that can kick off archivebot to an even larger set of feel-no-pain instigators |
21:17
🔗
|
DrasticAc |
Oh no, I'm not saying make a slack bot that talks to _our_ archivebot. |
21:17
🔗
|
godane |
next part of my project is to add a local wayback machine to it |
21:17
🔗
|
DrasticAc |
I'm saying "make something totally different that offers a limited set of its functions" |
21:17
🔗
|
SketchCow |
Oh, here's a project I was thinking about that someone should do. |
21:17
🔗
|
SketchCow |
Ready? |
21:17
🔗
|
SketchCow |
You seem to all be quite capable of this. |
21:17
🔗
|
SketchCow |
A little package, that if you drop it in a directory, and the directory has WARCs, you get a little mini wayback for it |
21:18
🔗
|
SketchCow |
With maybe a navigatron option for the family of URLs it covers |
21:19
🔗
|
Igloo |
So, something that can run on any server and provide a Wayback feel for the WARCs in that directory? |
21:19
🔗
|
SketchCow |
Yes. |
21:19
🔗
|
SketchCow |
Or a subdirectory, I guess |
21:19
🔗
|
SketchCow |
WARCS/ |
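A possible first step for such a package is an index of which URLs the WARCs in a directory cover. The sketch below uses only the Python standard library and reads just the WARC record headers; it does not replay pages, and the function names (`open_warc`, `iter_records`, `build_index`) are made up for illustration. It assumes each record carries a correct Content-Length header.

```python
import gzip
from pathlib import Path


def open_warc(path):
    """Open a .warc or .warc.gz file for binary reading."""
    path = Path(path)
    return gzip.open(path, "rb") if path.suffix == ".gz" else open(path, "rb")


def iter_records(stream):
    """Yield (headers, body) for each WARC record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers.get("content-length", "0")))
        yield headers, body


def build_index(directory):
    """Map each captured URL to a list of (WARC-Date, file name) captures."""
    index = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.name.endswith((".warc", ".warc.gz")):
            continue
        with open_warc(path) as stream:
            for headers, _body in iter_records(stream):
                if headers.get("warc-type") != "response":
                    continue
                url = headers.get("warc-target-uri", "").strip("<>")
                index.setdefault(url, []).append(
                    (headers.get("warc-date", ""), path.name)
                )
    return index
```

Calling `build_index("WARCS/")` would give the data a small web front end needs to list captures per URL; serving the stored responses would be a separate piece of work.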
21:19
🔗
|
Igloo |
Interesting, I like the idea of that |
21:20
🔗
|
DrasticAc |
Yeah, that sounds very useful |
21:20
🔗
|
SketchCow |
Do it |
21:20
🔗
|
SketchCow |
waiting |
21:20
🔗
|
* |
SketchCow taps watch |
21:21
🔗
|
DrasticAc |
Just wait till I get off of work, have dinner, etc. |
21:24
🔗
|
SketchCow |
https://www.youtube.com/watch?v=af3mlZ28MzI |
21:24
🔗
|
Igloo |
<< I love that film >> |
21:36
🔗
|
|
purplebot has joined #archiveteam-bs |
21:43
🔗
|
|
purplebot has quit IRC (hub.dk irc.underworld.no) |
22:12
🔗
|
|
k_o has joined #archiveteam-bs |
22:14
🔗
|
|
Jon has joined #archiveteam-bs |
22:16
🔗
|
Jon |
hmm. I've got a blu ray, CC-BY-SA-NC, but it is DRM protected. I would like to put it on archive.org but not sure whether to put it up with or without the DRM. Also a prior upload by someone else years back got deleted without explanation |
22:17
🔗
|
astrid |
do you have a link to this prior upload? it was probably darked because the copyright holder complained. i can check though. |
22:22
🔗
|
|
octothorp has joined #archiveteam-bs |
22:22
🔗
|
JAA |
k_o: VSCO will be quite annoying to archive with all that JS going on. If you could write up a summary of what the site structure is like and how the content can be accessed, that would be great. |
22:23
🔗
|
JAA |
Looks like they don't use numeric IDs though, so iterating over everything won't be easy. |
22:23
🔗
|
k_o |
Oh, the site is one of the worst things I've ever seen. |
22:23
🔗
|
k_o |
I've got two scripts that can download it, though. |
22:24
🔗
|
JAA |
That's definitely also helpful, yes. |
22:24
🔗
|
k_o |
The one I prefer is from github and it's written in ruby |
22:24
🔗
|
k_o |
Lemme find the link |
22:24
🔗
|
JAA |
(Ugh, Ruby. ;-) ) |
22:24
🔗
|
k_o |
https://github.com/HuggableSquare/vsco-dl Well, the other one I wrote in Python, but it's a good deal slower than this one, and doesn't get nearly as much metadata |
22:25
🔗
|
k_o |
This puts everything in a folder, but the naming is pretty crap, so I wrote a Python script to rename the files to the year, month, and day |
22:26
🔗
|
k_o |
After that I run packjpg to compress everything to about 75% and then pack it into .tar.bz2 archives |
22:26
🔗
|
JAA |
Well, we usually archive in the WARC format if possible. |
22:26
🔗
|
k_o |
I'm not too familiar with WARC, so some changes would probably be necessary there |
22:27
🔗
|
JAA |
What vsco-dl does should be fairly easy to do with a plugin for wget-lua or wpull. |
22:27
🔗
|
k_o |
Yeah, the problem is that I'm averaging 220MB/user right now |
22:27
🔗
|
k_o |
My current list is 150,000 names and growing, so it's already in the 30 TB range, which is more space than I have |
22:27
🔗
|
JAA |
Any idea how large it is in total? |
22:27
🔗
|
JAA |
Ah |
22:28
🔗
|
k_o |
The thing is, VSCO reported 30 million active monthly users last year |
22:28
🔗
|
k_o |
So it's probably in the petabytes range at least |
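The back-of-the-envelope estimate above can be written out as follows. The figures are the ones quoted in the discussion; the big assumption (which JAA questions) is that the 220 MB average seen so far holds for every active user.

```python
MB_PER_USER = 220           # average observed so far, per k_o
LISTED_USERS = 150_000      # names collected so far
ACTIVE_USERS = 30_000_000   # monthly active users reported by VSCO


def total_tb(users: int, mb_per_user: float = MB_PER_USER) -> float:
    """Estimated total size in terabytes (decimal units: 1 TB = 1,000,000 MB)."""
    return users * mb_per_user / 1_000_000


print(total_tb(LISTED_USERS))          # 33.0 (TB for the current list)
print(total_tb(ACTIVE_USERS) / 1000)   # 6.6 (PB if every active user were average)
```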
22:28
🔗
|
|
jschwart has quit IRC (Konversation terminated!) |
22:28
🔗
|
JAA |
Hmm, that seems way too large for a photo sharing website. |
22:29
🔗
|
JAA |
Vidme and SoundCloud are in that range. |
22:29
🔗
|
JAA |
(Well, Vidme was and SC is.) |
22:29
🔗
|
k_o |
Exactly, vidme *was* |
22:29
🔗
|
k_o |
and SC was threatening to go under |
22:29
🔗
|
JAA |
Yeah |
22:29
🔗
|
k_o |
hence my concern |
22:30
🔗
|
k_o |
what happened to SC, anyway? did they find new funding? |
22:30
🔗
|
JAA |
Right, but I can't believe that VSCO gets even close to 1 PB. |
22:31
🔗
|
JAA |
I'm not sure what IA thinks about grabbing a copy of them though. |
22:31
🔗
|
k_o |
I mean the 30 million thing is pretty widely reported https://finance.yahoo.com/news/vsco-now-30-million-active-170002551.html |
22:31
🔗
|
k_o |
That's actually the only info I can find about their stats. No user info since then, no size info, no quarterly reports. |
22:32
🔗
|
k_o |
I'm not even really sure how they make money, there are no articles about it on the first pages of search. |
22:32
🔗
|
k_o |
But yeah, there's the issue of privacy and all that. I remember the Instagram project got a lot of bad press |
22:32
🔗
|
k_o |
IA may not want that |
22:35
🔗
|
k_o |
Anyways, I thought I'd float the idea to archiveteam, see if anyone was interested |
22:35
🔗
|
|
purplebot has joined #archiveteam-bs |
22:36
🔗
|
|
Rai-chan has joined #archiveteam-bs |
22:36
🔗
|
JAA |
Looks like you can purchase something called "VSCO Film"? |
22:36
🔗
|
k_o |
There's no immediate danger, but I remember how short notice on vidme meant we couldn't save all of it |
22:36
🔗
|
k_o |
Hard to imagine how one product could bring in enough cash to host as much data as they do |
22:37
🔗
|
k_o |
Who knows, though, they don't seem to post earnings or anything |
22:37
🔗
|
JAA |
Yeah, it's nice to have an idea of how the site works etc. already so we can grab it quickly when they announce the shutdown. |
22:37
🔗
|
Jon |
astrid, yeah, thanks -- it was http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz$ |
22:37
🔗
|
Jon |
minus the $ http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz |
22:37
🔗
|
|
i0npulse has joined #archiveteam-bs |
22:37
🔗
|
astrid |
right |
22:37
🔗
|
Jon |
astrid: the album is widely available in 16/44.1 (including several times on archive.org); in 24/96 (as on the BD) it's much rarer. I just sourced one after 10 years or so, and it cost me £50 |
22:38
🔗
|
Jon |
despite that it's still clearly marked as CC-BY-SA-NC |
22:38
🔗
|
astrid |
that was darked in december 2014 with the comment "possible rights issues" |
22:38
🔗
|
astrid |
email info@archive.org and maybe they'll un-dark it |
22:38
🔗
|
Jon |
thanks, I shall. Can you tell if I was the original uploader? I've completely forgotten. My username is jmtd on archive.org |
22:38
🔗
|
Jon |
thanks for all your help |
22:38
🔗
|
JAA |
k_o: Apparently you can also buy filters and possibly other stuff through an in-app store. The famous microtransactions scheme. |
22:39
🔗
|
astrid |
original uploader was someone with email address 893productions@gmail.com |
22:39
🔗
|
Jon |
ok yeah that wasn't me. Thanks :> |
22:39
🔗
|
k_o |
In that case, their business model may be sound |
22:39
🔗
|
Jon |
I'll still email |
22:39
🔗
|
astrid |
sure thing Jon |
22:39
🔗
|
k_o |
I figured it's a website worth keeping an eye on though |
22:39
🔗
|
JAA |
k_o: Sure. Are you willing to share your code for scraping users? |
22:39
🔗
|
* |
Jon goes to bed |
22:40
🔗
|
k_o |
Sure, it's written in python and uses selenium |
22:40
🔗
|
k_o |
I can put it up on pastebin |
22:41
🔗
|
k_o |
It's probably not the most efficient way to go about it, but I don't know how else to render their crappy website except for a headless browser |
22:42
🔗
|
JAA |
Yeah, it should be a lot faster to just do the relevant API requests directly. |
22:43
🔗
|
JAA |
I'm interested in seeing the code anyway, also because I wanted to look into headless browsers for archiving before. |
22:45
🔗
|
|
k_o_ has joined #archiveteam-bs |
22:45
🔗
|
k_o_ |
internet crashed |
22:45
🔗
|
k_o_ |
idk if the message got through, I'll upload the code to pastebin |
22:45
🔗
|
|
k_o has quit IRC (Ping timeout: 260 seconds) |
22:46
🔗
|
godane |
i get to have fun setting up my new comcast cable modem later |
22:46
🔗
|
JAA |
k_o_: Here's what happened: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2018-01-09,Tue&sel=229#l225 |
22:47
🔗
|
k_o_ |
Alright, that's all the messages I sent |
22:47
🔗
|
k_o_ |
Gimme a sec to cut out the code and put it up |
22:50
🔗
|
k_o_ |
https://pastebin.com/au6eSN39 |
22:51
🔗
|
k_o_ |
You start it off by creating a file vsco.txt with |
22:51
🔗
|
k_o_ |
At least one username and a "|" before the first username |
22:51
🔗
|
k_o_ |
It searches the collection for each user and adds those names to the file, going through all of the new names, so theoretically it will eventually scrape every non-orphan user on the site |
22:52
🔗
|
k_o_ |
If you need to break the script, just move the | back to the point you want it, and it won't search through the first names again |
22:52
🔗
|
k_o_ |
It also checks for duplicates and won't add those, so each username is unique |
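The approach described above (a growing list of usernames, a "|" marking where to resume, and a duplicate check) might look roughly like this. It is not the pastebin script: the one-name-per-line file layout is an assumption, and `fetch_collection_users` is a hypothetical callable standing in for whatever actually renders a user's collection page.

```python
from typing import Callable, Iterable, List, Optional, Tuple

MARKER = "|"  # assumed to prefix the next username still to be processed


def load_frontier(text: str) -> Tuple[List[str], int]:
    """Parse the list file into (names, index of the next name to process)."""
    names: List[str] = []
    cursor = 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(MARKER):
            cursor = len(names)
            line = line[len(MARKER):].strip()
            if not line:
                continue  # marker alone on a line: everything before it is done
        names.append(line)
    return names, cursor


def dump_frontier(names: List[str], cursor: int) -> str:
    """Serialize the list again, putting the marker at the resume point."""
    lines = [MARKER + n if i == cursor else n for i, n in enumerate(names)]
    if cursor >= len(names):
        lines.append(MARKER)
    return "\n".join(lines) + "\n"


def crawl(
    names: List[str],
    cursor: int,
    fetch_collection_users: Callable[[str], Iterable[str]],
    max_steps: Optional[int] = None,
) -> Tuple[List[str], int]:
    """Breadth-first discovery: append unseen users found in each collection."""
    seen = set(names)
    steps = 0
    while cursor < len(names) and (max_steps is None or steps < max_steps):
        for found in fetch_collection_users(names[cursor]):
            if found not in seen:  # duplicate check keeps each username unique
                seen.add(found)
                names.append(found)
        cursor += 1
        steps += 1
    return names, cursor
```

Saving `dump_frontier(names, cursor)` after each step would let an interrupted run resume without revisiting earlier names, which matches the behaviour described for moving the "|".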
22:53
🔗
|
JAA |
Ah, collections, I see. |
22:53
🔗
|
JAA |
Thanks |
22:54
🔗
|
k_o_ |
My vsco.txt is slightly over 157,000 lines currently, but with 30 million active users, that's barely half a percent |
22:55
🔗
|
k_o_ |
It's been running for about a day, so given a few weeks, it could probably build up a pretty good list |
22:55
🔗
|
k_o_ |
I figured it would be helpful to have around if/when there's a shutdown notice |
22:56
🔗
|
JAA |
Indeed |
23:07
🔗
|
JAA |
Another idea to discover users would be to search for tags appearing on the individual photo pages. |
23:10
🔗
|
k_o_ |
I think most of the people who are tagged also appear on the collection, but I could be wrong |
23:11
🔗
|
k_o_ |
If the script I'm running finishes with a lot of users missing, I could try that |
23:29
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
23:57
🔗
|
|
wbradley has quit IRC (WeeChat 1.4) |