#archiveteam-bs 2018-01-09,Tue


Who What When
***ranav has quit IRC (Read error: Connection reset by peer) [00:03]
ranavalon has joined #archiveteam-bs
ranavalon has quit IRC (Remote host closed the connection)
ranavalon has joined #archiveteam-bs
BlueMaxim has quit IRC (Leaving)
[00:14]
......... (idle for 42mn)
ranavalon has quit IRC (Quit: Leaving) [01:00]
.... (idle for 15mn)
BlueMaxim has joined #archiveteam-bs [01:15]
...... (idle for 27mn)
yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
yuitimoth has joined #archiveteam-bs
[01:42]
yuitimoth has quit IRC (Read error: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac)
yuitimoth has joined #archiveteam-bs
[01:54]
.... (idle for 18mn)
DFJustin has quit IRC (Remote host closed the connection)
DFJustin has joined #archiveteam-bs
swebb sets mode: +o DFJustin
[02:12]
..... (idle for 23mn)
bithippoIs it possible to edit an item's collection it belongs to after creating said item? [02:38]
***bithippo has quit IRC (Ping timeout: 260 seconds) [02:43]
.......... (idle for 46mn)
atlogbot has quit IRC (Read error: Operation timed out)
swebb has quit IRC (Read error: Operation timed out)
swebb has joined #archiveteam-bs
atlogbot has joined #archiveteam-bs
svchfoo3 sets mode: +o swebb
svchost03 sets mode: +v atlogbot
[03:29]
................ (idle for 1h16mn)
jdude104 has quit IRC (Read error: Operation timed out)
qw3rty14 has joined #archiveteam-bs
qw3rty13 has quit IRC (Read error: Operation timed out)
[04:46]
K4k has quit IRC (Read error: Connection reset by peer) [05:05]
........ (idle for 37mn)
w0rp has quit IRC (Ping timeout: 245 seconds)
w0rp has joined #archiveteam-bs
[05:42]
......... (idle for 43mn)
zyphlar has joined #archiveteam-bs [06:28]
........ (idle for 36mn)
sekolyn has joined #archiveteam-bs
octothorp has quit IRC (Read error: Operation timed out)
[07:04]
..... (idle for 23mn)
octothorp has joined #archiveteam-bs
sekolyn has quit IRC (Read error: Operation timed out)
kpz has joined #archiveteam-bs
kpz has left
[07:28]
Asparagir has joined #archiveteam-bs [07:44]
........... (idle for 54mn)
zyphlar has quit IRC (Quit: Connection closed for inactivity) [08:38]
Asparagir has quit IRC (Asparagir) [08:46]
............. (idle for 1h3mn)
BlueMaxim has quit IRC (Read error: Connection reset by peer) [09:49]
................ (idle for 1h18mn)
slyphic has quit IRC (Read error: Operation timed out) [11:07]
slyphic has joined #archiveteam-bs [11:13]
...... (idle for 29mn)
ZexaronS- has quit IRC (Quit: Leaving) [11:42]
.......... (idle for 49mn)
altlabel has joined #archiveteam-bs [12:31]
JAAAnyone else getting a lot of errors when accessing the Wayback Machine? I get "unable to connect", timeouts, pages which never finish loading, etc. [12:35]
***sep332_ has joined #archiveteam-bs
sep332 has quit IRC (Read error: Operation timed out)
[12:46]
......... (idle for 43mn)
jacketcha has quit IRC (Read error: Connection reset by peer) [13:30]
.... (idle for 18mn)
JAASeems to be better now. [13:48]
................... (idle for 1h32mn)
***Mateon1 has quit IRC (Ping timeout: 255 seconds)
Mateon1 has joined #archiveteam-bs
[15:20]
jdude104 has joined #archiveteam-bs [15:33]
jdude104 has quit IRC (Quit: Leaving) [15:46]
........... (idle for 51mn)
schbirid has joined #archiveteam-bs [16:37]
............ (idle for 56mn)
RichardG_ has joined #archiveteam-bs
RichardG has quit IRC (Read error: Connection reset by peer)
[17:33]
RichardG_ has quit IRC (Read error: Connection reset by peer)
jschwart has joined #archiveteam-bs
RichardG has joined #archiveteam-bs
RichardG has quit IRC (Read error: Connection reset by peer)
RichardG has joined #archiveteam-bs
[17:47]
.... (idle for 16mn)
SketchCowI'm cleaning WARCs still
https://archive.org/details/archiveteam_miiverse is getting that massive miiverse grab
https://archive.org/details/warczone now exists. It is "outsider" WARCs, WARCs where we have no idea who is sending them. There's a good chance they won't go directly into Wayback.
[18:13]
***ReimuHaku has quit IRC (Ping timeout: 250 seconds)
ReimuHaku has joined #archiveteam-bs
RichardG has quit IRC (Read error: Connection reset by peer)
RichardG has joined #archiveteam-bs
[18:15]
K4k has joined #archiveteam-bs [18:22]
jrwrSketchCow: thats a damn fine pun you made there [18:23]
..... (idle for 23mn)
SketchCow https://archive.org/details/archiveteam_yahoogroups is about to get super huge [18:46]
***adinbied has joined #archiveteam-bs
RichardG has quit IRC (Read error: Connection reset by peer)
[18:48]
adinbiedHi all, I seem to have lost the link to the Discord server, can anyone send it to me? A while back I asked about archiving Gazelle-based sites, and got linked to the discord server to talk to -Archivist-, as he/she was working on that at the time. Thanks! [18:50]
***RichardG has joined #archiveteam-bs [18:51]
JAAadinbied: If it was posted in here, try searching the logs: http://archive.fart.website/bin/irclogger_logs [18:53]
adinbiedFound it, thanks! [18:55]
***adinbied has quit IRC (Quit: Page closed) [18:59]
........ (idle for 36mn)
ndiddy_ has quit IRC ()
ndiddy has joined #archiveteam-bs
[19:35]
..... (idle for 23mn)
REiN^ has quit IRC (no.money.no.love)
purplebot has quit IRC (Ping timeout: 248 seconds)
[20:01]
PurpleSymSketchCow: Can I get permission to upload to that collection? [20:02]
***HCross2 has quit IRC (Ping timeout: 248 seconds)
Rai-chan has quit IRC (Ping timeout: 248 seconds)
i0npulse has quit IRC (Ping timeout: 248 seconds)
[20:02]
RichardG has quit IRC (Ping timeout: 248 seconds)
RichardG has joined #archiveteam-bs
[20:09]
AeonG_ has joined #archiveteam-bs [20:17]
Caz has quit IRC (Read error: Operation timed out) [20:22]
purplebot has joined #archiveteam-bs
i0npulse has joined #archiveteam-bs
HCross2 has joined #archiveteam-bs
svchfoo1 sets mode: +o HCross2
Rai-chan has joined #archiveteam-bs
odemg has quit IRC (Read error: Operation timed out)
odemg has joined #archiveteam-bs
[20:30]
SketchCowI don't see why not, you're one of the processes I've got cleaning up
You now have archiveteam_yahoogroups. You might need to log out of your browser to get it noticed.
[20:41]
DrasticAcSketchCow: Thanks for moving those miiverse files
Kinda realized part of the way through making them that I _probably_ should have less of them, rather than 10,000 post chunks.
But, hey, it's easier for people to download a 200 MB warc than multiple terabytes if they just need one post ;)
[20:50]
***octothorp has quit IRC (Read error: Connection reset by peer)
Rai-chan has quit IRC (Ping timeout: 248 seconds)
HCross2 has quit IRC (Ping timeout: 248 seconds)
purplebot has quit IRC (Ping timeout: 248 seconds)
[20:55]
godaneSketchCow: my cat throw up on one of your boxes
I NEED TO GET LABELS NOW SO I CAN MAIL THEM BEFORE THE CAT RUINS YOUR STUFF
[21:02]
***i0npulse has quit IRC (Ping timeout: 248 seconds) [21:06]
godanetapes are fine but box has dry cat vomit on it [21:06]
***SketchCow changes topic to: Lengthy Archive Team related discussions here | General archiving & offtopic: #archiveteam-ot | < godane> SketchCow: my cat throw up on one of your boxes [21:08]
SketchCowLet me get on that
DrasticAc: yes, if I'd had more of a say on your project, I'd have said you should have 50gb per item
[21:08]
DrasticAcYeah, it was one of those things I didn't know until it was too late to switch.
But next time, I have a better idea of what to do.
[21:09]
Igloogodane: i am glad I am not the only one with that problem. My cats puke on stuff all the time ¬_¬ [21:10]
DrasticAcI don't know if it'll be useful, but I was thinking of making a mini-archivebot for stuff like Slack or Discord. [21:11]
SketchCow https://archive.org/details/archiveteam_verizon
You can see my script slowly adding a filler logo to all the items
[21:12]
DrasticAcSince it seems like a portion of stuff that gets submitted to archivebot are one-off sites (like twitter links), having something like that available more widely might be useful. [21:12]
godaneIgloo: luckily most of my stuff is in my room [21:13]
DrasticAcAlthough I guess you can use the IA extension for that. [21:13]
SketchCowThe problem is that people are not very good at assessing archivebot [21:13]
godaneand the cat doesn't come into my room
but there is no room for boxes in my room
[21:13]
SketchCowAnd we get people doing things like "hurr durr The Onion is pretty amazeballs, I better kick off a million-url job with one line because just in case"
"Hey, someone mirrored a mirror of a mirror we mirror, better get THAT copy too"
[21:13]
IglooWe are trying to police that much better though.... [21:14]
SketchCowWe are
Adding it to random discords or slacks would not be smart
[21:14]
DrasticAcCould keep a database to check against that though. [21:14]
SketchCowI'd kill any link [21:14]
DrasticAcLike, if x link was already archived, don't do it again. [21:14]
SketchCowDrop to a whitelist of people who can kick off jobs [21:14]
IglooDrasticAc we do that. But it's just a bit broken at the moment. If you want to help us fix it we'd appreciate it ;-)
AB is a victim of its own success.
[21:15]
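The "if x link was already archived, don't do it again" check discussed above can be sketched roughly as follows. This is only an illustration and not how archivebot actually does it: the `SeenUrls` class, the `normalize` helper, and the table layout are all made up for the example, and the normalization is deliberately crude.

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Crude normalization (an assumption): lowercase the host, drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class SeenUrls:
    """Hypothetical 'already archived' database backed by SQLite."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def should_archive(self, url):
        """Return True the first time a URL is offered, False on repeats."""
        cur = self.db.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (normalize(url),)
        )
        self.db.commit()
        # rowcount is 1 when the row was inserted, 0 when it already existed.
        return cur.rowcount == 1
```

A bot would call `should_archive(url)` before queueing a job and skip anything that returns False. A real check would also need an expiry so that pages can be re-archived after some time.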
SketchCowJust saying. Don't make more links to archivebot [21:16]
godanein other news i got my archivebox rpi project to broadcast a 'honeypot' wifi [21:16]
SketchCowOr things that can kick off archivebot to an even larger set of feel-no-pain instigators [21:16]
DrasticAcOh no, I'm not saying make a slack bot that talks to _our_ archivebot. [21:17]
godanenext part of my project is to add a local wayback machine to it [21:17]
DrasticAcI'm saying "make something totally different that offers a limited set of its functions" [21:17]
SketchCowOh, here's a project I was thinking about that someone should do.
Ready?
You seem to all be quite capable of this.
A little package, that if you drop it in a directory, and the directory has WARCs, you get a little mini wayback for it
With maybe a navigatron option for the family of URLs it covers
[21:17]
IglooSo, something that can run on any server and provide a Wayback feel for the WARCs in that directory? [21:19]
SketchCowYes.
Or a subdirectory, I guess
WARCS/
[21:19]
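The first building block of such a "drop it in a directory" mini Wayback would be an index from URL to record location. The sketch below is a rough illustration under loud assumptions: it only handles uncompressed `.warc` files (most real WARCs are `.warc.gz`, which would need per-record gzip handling), it only looks at `response` records, and the function names `iter_records` and `build_index` are invented here. It does no replay or URL rewriting.

```python
import os


def iter_records(path):
    """Yield (offset, headers) for each record in an uncompressed WARC file."""
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                return  # end of file
            if not line.strip():
                continue  # blank separator lines between records
            if not line.startswith(b"WARC/"):
                raise ValueError(f"not a WARC record at byte {offset} in {path}")
            headers = {}
            while True:
                line = f.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # blank line ends the WARC header block
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip().lower()] = value.strip()
            # Skip the record body using the declared Content-Length.
            f.seek(int(headers.get("content-length", "0")), os.SEEK_CUR)
            yield offset, headers


def build_index(directory):
    """Map each captured URL to a list of (filename, byte offset) pairs."""
    index = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".warc"):
            continue  # assumption: compressed files are ignored in this sketch
        for offset, headers in iter_records(os.path.join(directory, name)):
            if headers.get("warc-type") == "response":
                uri = headers.get("warc-target-uri", "")
                index.setdefault(uri, []).append((name, offset))
    return index
```

A small web server could then look up a requested URL in the index, seek to the stored offset, and return the HTTP response stored in that record. The "navigatron" would be little more than a sorted listing of the index keys.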
IglooInteresting, I like the idea of that [21:19]
DrasticAcYeah, that sounds very useful [21:20]
SketchCowDo it
waiting
SketchCow taps watch
[21:20]
DrasticAcJust wait till I get off of work, have dinner, etc. [21:21]
SketchCow https://www.youtube.com/watch?v=af3mlZ28MzI [21:24]
Igloo<< I love that film >> [21:24]
***purplebot has joined #archiveteam-bs [21:36]
purplebot has quit IRC (hub.dk irc.underworld.no) [21:43]
...... (idle for 29mn)
k_o has joined #archiveteam-bs
Jon has joined #archiveteam-bs
[22:12]
Jonhmm. I've got a blu ray, CC-BY-SA-NC, but it is DRM protected. I would like to put it on archive.org but not sure whether to put it up with or without the DRM. Also a prior upload by someone else years back got deleted without explanation [22:16]
astriddo you have a link to this prior upload? it was probably darked because the copyright holder complained. i can check though. [22:17]
***octothorp has joined #archiveteam-bs [22:22]
JAAk_o: VSCO will be quite annoying to archive with all that JS going on. If you could write up a summary of what the site structure is like and how the content can be accessed, that would be great.
Looks like they don't use numeric IDs though, so iterating over everything won't be easy.
[22:22]
k_oOh, the site is one of the worst things I've ever seen.
I've got two scripts that can download it, though.
[22:23]
JAAThat's definitely also helpful, yes. [22:24]
k_oThe one I prefer is from github and it's written in ruby
Lemme find the link
[22:24]
JAA(Ugh, Ruby. ;-) ) [22:24]
k_o https://github.com/HuggableSquare/vsco-dl Well, the other one I wrote in Python, but it's a good deal slower than this one, and doesn't get nearly as much metadata
This puts everything in a folder, but the naming is pretty crap, so I wrote a Python script to rename the files to the year, month, and day
After that I run packjpg to compress everything to about 75% and then pack it into .tar.bz2 archives
[22:24]
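The rename-and-pack steps described above might look roughly like this. k_o's actual script is not shown in the log, so everything here is a guess: the date is taken from the file's modification time as a stand-in for whatever metadata the real script uses, the names `dated_name`, `rename_by_date`, and `pack` are hypothetical, and the packjpg step is left out because it is an external tool.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def dated_name(path):
    """Prefix the file name with its mtime as YYYY-MM-DD (assumed date source)."""
    stamp = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return f"{stamp:%Y-%m-%d}_{path.name}"


def rename_by_date(folder):
    """Rename every regular file in folder; return the new paths."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):  # sorted() snapshots the listing
        if path.is_file():
            target = path.with_name(dated_name(path))
            path.rename(target)
            renamed.append(target)
    return renamed


def pack(folder, archive):
    """Write folder into a bzip2-compressed tarball; archive must live outside folder."""
    folder = Path(folder)
    with tarfile.open(archive, "w:bz2") as tar:
        tar.add(folder, arcname=folder.name)
    return Path(archive)
```

Calling `rename_by_date("some_user")` followed by `pack("some_user", "some_user.tar.bz2")` would produce one archive per user folder.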
JAAWell, we usually archive in the WARC format if possible. [22:26]
k_oI'm not too familiar with WARC, so some changes would probably be necessary there [22:26]
JAAWhat vsco-dl does should be fairly easy to do with a plugin for wget-lua or wpull. [22:27]
k_oYeah, the problem is that I'm averaging 220MB/user right now
My current list is 150,000 names and growing, so it's already in the 30 TB range, which is more space than I have
[22:27]
JAAAny idea how large it is in total?
Ah
[22:27]
k_oThe thing is, VSCO reported 30 million active monthly users last year
So it's probably in the petabytes range at least
[22:28]
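The arithmetic behind the "30 TB" and "petabytes" figures above, written out. The inputs are the numbers quoted in the channel (220 MB per user, 150,000 listed names, 30 million monthly active users). Treating every active user as if they stored the sample average is an assumption, and probably the weakest link in the estimate.

```python
MB_PER_USER = 220  # average quoted in the channel, from k_o's sample


def estimate_tb(users, mb_per_user=MB_PER_USER):
    """Total size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return users * mb_per_user / 1_000_000


def estimate_pb(users, mb_per_user=MB_PER_USER):
    """Total size in decimal petabytes (1 PB = 1,000,000,000 MB)."""
    return users * mb_per_user / 1_000_000_000


if __name__ == "__main__":
    print("listed users:", estimate_tb(150_000), "TB")
    print("all active users:", estimate_pb(30_000_000), "PB")
```

With these inputs the listed users come to 33 TB and the full user base to 6.6 PB, which matches the rough figures in the conversation. If most accounts are far smaller than the sampled ones, the real total could be much lower.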
***jschwart has quit IRC (Konversation terminated!) [22:28]
JAAHmm, that seems way too large for a photo sharing website.
Vidme and SoundCloud are in that range.
(Well, Vidme was and SC is.)
[22:28]
k_oExactly, vidme *was*
and SC was threatening to go under
[22:29]
JAAYeah [22:29]
k_ohence my concern
what happened to SC, anyway? did they find new funding?
[22:29]
JAARight, but I can't believe that VSCO gets even close to 1 PB.
I'm not sure what IA thinks about grabbing a copy of them though.
[22:30]
k_oI mean the 30 million thing is pretty widely reported https://finance.yahoo.com/news/vsco-now-30-million-active-170002551.html
That's actually the only info I can find about their stats. No user info since then, no size info, no quarterly reports.
I'm not even really sure how they make money, there are no articles about it on the first pages of search.
But yeah, there's the issue of privacy and all that. I remember the Instagram project got a lot of bad press
IA may not want that
Anyways, I thought I'd float the idea to archiveteam, see if anyone was interested
[22:31]
***purplebot has joined #archiveteam-bs
Rai-chan has joined #archiveteam-bs
[22:35]
JAALooks like you can purchase something called "VSCO Film"? [22:36]
k_oThere's no immediate danger, but I remember how short notice on vidme meant we couldn't save all of it
Hard to imagine how one product could bring in enough cash to host as much data as they do
Who knows, though, they don't seem to post earnings or anything
[22:36]
JAAYeah, it's nice to have an idea of how the site works etc. already so we can grab it quickly when they announce the shutdown. [22:37]
Jonastrid, yeah, thanks -- it was http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz$
minus the $ http://archive.org/details/NineInchNailsGhostsI-Ivblu-ray24bit96khz
[22:37]
***i0npulse has joined #archiveteam-bs [22:37]
astridright [22:37]
Jonastrid: the album is widely available in 16/44.1 (including several times on archive.org); in 24/96 (as on the BD) it's much rarer. I just sourced one after 10 years or so, and it cost me £50
despite that it's still clearly marked as CC-BY-SA-NC
[22:37]
astridthat was darked in december 2014 with the comment "possible rights issues"
email info@archive.org and maybe they'll un-dark it
[22:38]
Jonthanks, I shall. Can you tell if I was the original uploader? I've completely forgotten. My username is jmtd on archive.org
thanks for all your help
[22:38]
JAAk_o: Apparently you can also buy filters and possibly other stuff through an in-app store. The famous microtransactions scheme. [22:38]
astridoriginal uploader was someone with email address 893productions@gmail.com [22:39]
Jonok yeah that wasn't me. Thanks :> [22:39]
k_oIn that case, their business model may be sound [22:39]
JonI'll still email [22:39]
astridsure thing Jon [22:39]
k_oI figured it's a website worth keeping an eye on though [22:39]
JAAk_o: Sure. Are you willing to share your code for scraping users? [22:39]
JonJon goes to bed [22:39]
k_oSure, it's written in python and uses selenium
I can put it up on pastebin
It's probably not the most efficient way to go about it, but I don't know how else to render their crappy website except for a headless browser
[22:40]
JAAYeah, it should be a lot faster to just do the relevant API requests directly.
I'm interested in seeing the code anyway, also because I wanted to look into headless browsers for archiving before.
[22:42]
***k_o_ has joined #archiveteam-bs [22:45]
k_o_internet crashed
idk if the message got through, I'll upload the code to pastebin
[22:45]
***k_o has quit IRC (Ping timeout: 260 seconds) [22:45]
godanei get to have fun setting up my new comcast cable modem later [22:46]
JAAk_o_: Here's what happened: http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2018-01-09,Tue&sel=229#l225 [22:46]
k_o_Alright, that's all the messages I sent
Gimme a sec to cut out the code and put it up
https://pastebin.com/au6eSN39
You start it off by creating a file vsco.txt with
At least one username and a "|" before the first username
It searches the collection for each user and adds those names to the file, going through all of the new names, so theoretically it will eventually scrape every non-orphan user on the site
If you need to break the script, just move the | back to the point you want it, and it won't search through the first names again
It also checks for duplicates and won't add those, so each username is unique
[22:47]
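The `vsco.txt` bookkeeping described above (a list of unique usernames plus a `|` marker for where to resume) can be sketched like this. It is not the code from the pastebin link: the exact file layout is guessed as one username per line with `|` prefixed to the next name to process, and `fetch_collection` is a hypothetical stand-in for the Selenium scraping step.

```python
def parse(text):
    """Return (names, cursor) where cursor indexes the next name to process."""
    names, cursor = [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("|"):
            cursor = len(names)  # the marker sits before this position
            line = line[1:].strip()
            if not line:
                continue  # a bare "|" line means everything before it is done
        names.append(line)
    return names, cursor


def crawl(names, cursor, fetch_collection, max_steps=None):
    """Walk the list from cursor, appending unseen names found in each collection."""
    seen = set(names)
    steps = 0
    while cursor < len(names) and (max_steps is None or steps < max_steps):
        for found in fetch_collection(names[cursor]):
            if found not in seen:  # duplicate check keeps each username unique
                seen.add(found)
                names.append(found)
        cursor += 1
        steps += 1
    return names, cursor


def serialize(names, cursor):
    """Write the list back out with the marker in front of the next name."""
    lines = [("|" + n if i == cursor else n) for i, n in enumerate(names)]
    if cursor >= len(names):
        lines.append("|")
    return "\n".join(lines) + "\n"
```

Because the list only grows at the end and the cursor only moves forward, interrupting and resuming amounts to saving `serialize(names, cursor)` and parsing it again later. This is effectively a breadth-first walk over users reachable through collections.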
JAAAh, collections, I see.
Thanks
[22:53]
k_o_My vsco.txt is slightly over 157,000 lines currently, but with 30 million active users, that's barely half a percent
It's been running for about a day, so given a few weeks, it could probably build up a pretty good list
I figured it would be helpful to have around if/when there's a shutdown notice
[22:54]
JAAIndeed [22:56]
Another idea to discover users would be to search for tags appearing on the individual photo pages. [23:07]
k_o_I think most of the people who are tagged also appear on the collection, but I could be wrong
If the script I'm running finishes with a lot of users missing, I could try that
[23:10]
.... (idle for 18mn)
***BlueMaxim has joined #archiveteam-bs [23:29]
...... (idle for 28mn)
wbradley has quit IRC (WeeChat 1.4) [23:57]
