Time |
Nickname |
Message |
00:36
🔗
|
JAA |
Average FBO response time is now close to 30 seconds. Sigh... |
00:36
🔗
|
JAA |
At least the FTP data is safe. |
00:37
🔗
|
JAA |
https://archive.org/details/ftp.fbo.gov_20191111 |
00:42
🔗
|
JAA |
The first pass should be done in 4 hours or so, then there'll be another round that will probably take another 15-ish hours. And then obviously the actual entries. I hope they don't shut down too early in the morning on the 12th. |
00:43
🔗
|
JAA |
Though I'm not actually sure when fbo.gov will really be shut down. The notice at the top isn't very clear. |
00:51
🔗
|
|
robogoat has quit IRC (Ping timeout: 258 seconds) |
01:01
🔗
|
JAA |
My SuperiorPics forums grab is done. Now I just need to deal with the ~26 million images and outlinks it found. (Cc ibachandl) |
01:03
🔗
|
JAA |
(That includes lots of duplicates most likely. Proper numbers soon.) |
01:03
🔗
|
ibachandl |
nice! |
01:03
🔗
|
ibachandl |
was it with archivebot or a warrior? |
01:04
🔗
|
ibachandl |
or something else |
01:04
🔗
|
JAA |
Neither. I use qwarc for many of my independent archivals nowadays. |
01:05
🔗
|
JAA |
If there are no rate limits and the site can handle it, that lets me easily do 10k to 20k requests per minute with a single CPU core. |
01:06
🔗
|
JAA |
I love it when I'm limited by disk or network I/O. :-) |
01:09
🔗
|
|
robogoat has joined #archiveteam-bs |
01:10
🔗
|
JAA |
Welp, the FBO pagination just died. |
01:11
🔗
|
JAA |
That's a shame. No easy way to resume it either because of how shitty that site is. |
01:12
🔗
|
JAA |
Looks like their search also just broke. |
01:15
🔗
|
JAA |
"Your search resulted in an error. Please try again or modify your search criteria" |
01:16
🔗
|
JAA |
I did discover about 2705269 entries though, which should be about 86 %. |
01:17
🔗
|
JAA |
s/about // |
01:18
🔗
|
JAA |
I wonder how much will break when I start retrieving those... |
01:22
🔗
|
JAA |
That lovely site also has permalinks that aren't permanent. |
01:23
🔗
|
JAA |
Ah no, it's just more madness: https://www.fbo.gov/spg/DON/NAVSUP/N000104/N0010419RK167/listing.html -> "The requested url/solicitation number is found in multiple base notices." + the same search error as above |
01:38
🔗
|
|
Ivy has joined #archiveteam-bs |
02:06
🔗
|
|
HP_Archiv has joined #archiveteam-bs |
02:06
🔗
|
HP_Archiv |
Hey guys. I forgot which one of you was trying to help me the other night for archiving HP-Games.net and then associated/linked out files in a GDrive account. Is that person here? |
02:12
🔗
|
JAA |
So I'm on track to grab those 2.7M entries on FBO in about 2 days now. Which is too slow, but I can't go much faster as the response time is already elevated. Shitty government sites be shitty. |
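For a sense of scale, the figures quoted in this log can be turned into request rates. This is only back-of-the-envelope arithmetic, assuming one request per FBO entry (an assumption; the real grab may need more requests per entry). The function name `required_rate` is made up for this sketch.

```python
SECONDS_PER_DAY = 86_400

def required_rate(entries, days):
    """Average requests per second needed to fetch `entries` in `days`."""
    return entries / (days * SECONDS_PER_DAY)

# 2.7M entries in about 2 days, as stated in the message above.
fbo_rate = required_rate(2_700_000, 2)

# 10k to 20k requests per minute, the qwarc figure quoted earlier in the log.
qwarc_low, qwarc_high = 10_000 / 60, 20_000 / 60

print(f"FBO grab: {fbo_rate:.1f} requests/s")
print(f"qwarc on a healthy site: {qwarc_low:.0f} to {qwarc_high:.0f} requests/s")
```

Under that assumption the FBO grab averages roughly 15 to 16 requests per second, about a tenth of the low end of the qwarc figure, which fits the remark that the site's response time is the limiting factor.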
02:26
🔗
|
HP_Archiv |
@JAA, do you think you can help with a workaround for what I'm trying to do? |
02:27
🔗
|
JAA |
HP_Archiv: Sorry, no experience with Google Drive downloads. There should be tools for that out there though. |
02:31
🔗
|
HP_Archiv |
@JAA, no worries. I forget the handle of the person that was helping me with this the other night. He was going to see if a script might pull them down and re-upload into archivebot. Not sure how to do it though, or where to look for said tools |
03:08
🔗
|
|
manjaro-u has quit IRC (Read error: Operation timed out) |
03:25
🔗
|
markedL |
Looks like you were talking with betamax and Igloo |
03:35
🔗
|
HP_Archiv |
@markedL thank you, couldn't remember their names.. |
03:36
🔗
|
HP_Archiv |
@betamax and Igloo, do either of you think you can assist further with archiving HP-Games.net? |
04:12
🔗
|
HP_Archiv |
@markedL, by the way, how were you able to find previous chat history? |
04:14
🔗
|
markedL |
I didn't note what kind of computer you're on, but most IRC clients will have an option to keep a log of your prior communications |
04:19
🔗
|
Raccoon |
(and short of power outages, i bet more than a few can simply scroll up the last 5000, 20,000 lines.) |
04:20
🔗
|
astrid |
i have in fact closed my irc client at least once this year |
04:29
🔗
|
|
synm0nger has joined #archiveteam-bs |
04:30
🔗
|
|
odemgi has joined #archiveteam-bs |
04:30
🔗
|
|
SynMonger has quit IRC (Ping timeout: 246 seconds) |
04:34
🔗
|
|
odemgi_ has quit IRC (Read error: Operation timed out) |
04:36
🔗
|
|
qw3rty has joined #archiveteam-bs |
04:43
🔗
|
|
qw3rty2 has quit IRC (Ping timeout: 745 seconds) |
04:53
🔗
|
|
icedice2 has joined #archiveteam-bs |
04:53
🔗
|
|
fredgido_ has joined #archiveteam-bs |
04:53
🔗
|
|
Damme_ has joined #archiveteam-bs |
04:54
🔗
|
|
yano_ has joined #archiveteam-bs |
04:55
🔗
|
|
benjinsmi has joined #archiveteam-bs |
04:56
🔗
|
|
TC01_ has joined #archiveteam-bs |
04:56
🔗
|
|
af10b3e5e has joined #archiveteam-bs |
04:56
🔗
|
|
girst_ has joined #archiveteam-bs |
04:57
🔗
|
|
odemgi_ has joined #archiveteam-bs |
04:57
🔗
|
|
Maylay_ has joined #archiveteam-bs |
04:57
🔗
|
|
Maylay_ has quit IRC (Remote host closed the connection!) |
04:57
🔗
|
|
thejsa_ has joined #archiveteam-bs |
04:58
🔗
|
|
Maylay_ has joined #archiveteam-bs |
04:58
🔗
|
|
Dark_Star has joined #archiveteam-bs |
04:58
🔗
|
|
tuluu_ has joined #archiveteam-bs |
05:00
🔗
|
|
chfoo_ has joined #archiveteam-bs |
05:00
🔗
|
|
Fusl__ sets mode: +o chfoo_ |
05:00
🔗
|
|
Fusl sets mode: +o chfoo_ |
05:00
🔗
|
|
Fusl_ sets mode: +o chfoo_ |
05:00
🔗
|
|
omglolba- has joined #archiveteam-bs |
05:05
🔗
|
|
ibachandl has quit IRC (Quit: Page closed) |
05:12
🔗
|
|
odemgi has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
tuluu has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
icedice has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
omglolbah has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
Damme has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
halt_ has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
d5f4a3622 has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
dashcloud has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
benjins has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
girst has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
thejsa has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
fredgido has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
ctrl has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
Maylay has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
nepeat has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
fuzzy8021 has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
chfoo has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
ndiddy has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
wp494 has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
zerkalo has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
TC01 has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
Dark-Star has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
yuitimoth has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:12
🔗
|
|
yano has quit IRC (irc.efnet.nl efnet.deic.eu) |
05:15
🔗
|
|
zerkalo_ has joined #archiveteam-bs |
05:18
🔗
|
|
IAmbience has quit IRC (Quit: Connection closed for inactivity) |
05:28
🔗
|
|
yuitimoth has joined #archiveteam-bs |
05:29
🔗
|
|
fuzzy8021 has joined #archiveteam-bs |
05:31
🔗
|
HP_Archiv |
Odd, unless I'm missing something, I can't scroll up |
05:32
🔗
|
HP_Archiv |
I've left the IRC and come back a few times since then. So maybe it's only current-session only? |
05:33
🔗
|
astrid |
yes probably. depending on your client. some load history from before; some do not. |
05:35
🔗
|
HP_Archiv |
Hm, okay. No matter in this chat or in #archivebot, I can't go past my actual login point to view history. |
05:35
🔗
|
markedL |
What client are you using now, and what type of computer are you on? People here will set you up with something |
05:35
🔗
|
HP_Archiv |
I'm using Chrome, on a Windows 10 machine |
05:38
🔗
|
jake_test |
Are you using the basic web client? That wouldn't store history at all. |
05:38
🔗
|
|
ndiddy has joined #archiveteam-bs |
05:39
🔗
|
HP_Archiv |
Yeah I am |
05:39
🔗
|
HP_Archiv |
And I figured as much ^^ |
05:39
🔗
|
HP_Archiv |
I didn't know there were other ways to sign into the chat other than through the web client |
05:42
🔗
|
HP_Archiv |
How else do I sign in? |
05:42
🔗
|
jake_test |
You would have to grab an IRC client. There are so many, for practically every operating system; the people here may have some better suggestions? |
05:45
🔗
|
HP_Archiv |
I'm seeing HexChat as one option, also mIRC is another |
05:45
🔗
|
HP_Archiv |
I didn't realize I'd have to use my own chat client, but it's fine. @jake_text, which one do you use? |
05:48
🔗
|
HP_Archiv |
@jake_test* |
05:48
🔗
|
jodizzle |
Can we take this to #archiveteam-ot? |
05:50
🔗
|
HP_Archiv |
Done ^^ Thanks @jodizzle |
05:56
🔗
|
HP_Archiv |
Also, I'd like to get privileges for proper site-wide archiving and ingestion into archivebot. I was told previously that voice and something else commands are required. How do I become authorized to do that myself? |
06:21
🔗
|
|
ctrl has joined #archiveteam-bs |
06:56
🔗
|
jodizzle |
It's usually up to someone with ops (an '@' next to their name, at least on my IRC client) in the #archivebot channel to decide whether to give you the necessary permissions. |
06:56
🔗
|
|
nepeat has joined #archiveteam-bs |
06:57
🔗
|
HP_Archiv |
@jodizzle, yeah, as I'm finding out ^^ |
06:57
🔗
|
jodizzle |
Which you usually get by hanging around and wanting to archive things. |
07:41
🔗
|
|
Deewiant has quit IRC (Ping timeout: 186 seconds) |
08:19
🔗
|
|
markedH has joined #archiveteam-bs |
09:58
🔗
|
godane |
SketchCow: so i looked at Success Magazine and i may be able to get a back from july 2011 to now |
10:03
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
10:03
🔗
|
|
BlueMax has joined #archiveteam-bs |
10:04
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
10:05
🔗
|
|
BlueMax has joined #archiveteam-bs |
10:19
🔗
|
|
tuluu_ has quit IRC (Quit: No Ping reply in 180 seconds.) |
10:19
🔗
|
|
tuluu has joined #archiveteam-bs |
10:46
🔗
|
|
icedice2 has quit IRC (Leaving) |
11:35
🔗
|
|
Deewiant has joined #archiveteam-bs |
11:38
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
11:38
🔗
|
|
BlueMax has joined #archiveteam-bs |
11:54
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
11:54
🔗
|
|
BlueMax has joined #archiveteam-bs |
11:55
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
11:56
🔗
|
|
BlueMax has joined #archiveteam-bs |
12:04
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
12:05
🔗
|
|
BlueMax has joined #archiveteam-bs |
12:09
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
12:09
🔗
|
|
BlueMax has joined #archiveteam-bs |
12:10
🔗
|
|
BlueMax has quit IRC (Remote host closed the connection) |
12:11
🔗
|
|
BlueMax has joined #archiveteam-bs |
12:47
🔗
|
|
mls_ has quit IRC (Remote host closed the connection) |
13:06
🔗
|
|
mtntmnky_ has quit IRC (Remote host closed the connection) |
13:06
🔗
|
|
mtntmnky_ has joined #archiveteam-bs |
13:33
🔗
|
|
synm0nger has quit IRC (Quit: Wait, what?) |
13:34
🔗
|
|
SynMonger has joined #archiveteam-bs |
13:39
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
13:55
🔗
|
|
yano_ is now known as yano |
13:59
🔗
|
|
Damme_ has quit IRC (Read error: Connection reset by peer) |
13:59
🔗
|
|
Damme_ has joined #archiveteam-bs |
14:08
🔗
|
|
synm0nger has joined #archiveteam-bs |
14:08
🔗
|
|
SynMonger has quit IRC (Read error: Operation timed out) |
14:08
🔗
|
|
HP_Archiv has quit IRC (Quit: Page closed) |
14:51
🔗
|
|
phillipsj has quit IRC (Remote host closed the connection) |
14:51
🔗
|
|
phillipsj has joined #archiveteam-bs |
15:12
🔗
|
|
omglolba- has quit IRC (Read error: No route to host) |
15:16
🔗
|
|
omglolbah has joined #archiveteam-bs |
15:28
🔗
|
|
jc86035 has joined #archiveteam-bs |
15:31
🔗
|
JAA |
FBO just shut down a few minutes ago. |
15:33
🔗
|
JAA |
Looks like I was able to retrieve only about 390k entries of the over 3 million total/2.7 million discovered. |
15:33
🔗
|
JAA |
And none of the downloads. |
15:34
🔗
|
|
prq has joined #archiveteam-bs |
15:39
🔗
|
JAA |
Also, I have a list of about 8.8 million images and 9.2 million outlinks (deduped) from the SuperiorPics forums. That's going to take a while... |
15:41
🔗
|
|
akierig has joined #archiveteam-bs |
15:41
🔗
|
|
akierig_ has joined #archiveteam-bs |
15:42
🔗
|
|
markedH has quit IRC (Read error: Operation timed out) |
15:42
🔗
|
|
markedH has joined #archiveteam-bs |
15:46
🔗
|
|
akierig has quit IRC (Read error: Operation timed out) |
16:04
🔗
|
prq |
I'm coming up to speed on the various archiveteam projects. Which piece of software are you using for a project that large? Is this being handled by the warrior distributed archive thing that people can run? |
16:05
🔗
|
JAA |
prq: For those images and outlinks from SuperiorPics? I'll probably throw them into ArchiveBot since they're not really urgent, so it doesn't matter much if it takes a month or two. |
16:07
🔗
|
prq |
http://dashboard.at.ninjawedding.org/3 - this is the dashboard for the irc archivebot you're referring to, right? |
16:07
🔗
|
astrid |
yea |
16:07
🔗
|
prq |
neat. |
16:08
🔗
|
prq |
https://www.archiveteam.org/index.php?title=ArchiveBot this says that it'll eventually be injected to the wayback machine-- is that possible because the archive.org folks trust the archiveteam? I had been looking for a way to inject a warc to the wayback machine and it seems that any random joe isn't able to. |
16:08
🔗
|
JAA |
Correct |
16:09
🔗
|
prq |
cool cool. |
16:11
🔗
|
prq |
my story is that I'm going through the process of leaving a high control religion, and I've come across tons of dead links on older resources-- wayback machine is helpful, but doesn't have everything of course. I'm here trying to learn about all the tools available for site archival. I may be up for running my own mini-archive for my special interest, but I'm one person with a homelab freenas server. |
16:11
🔗
|
|
akierig has joined #archiveteam-bs |
16:12
🔗
|
astrid |
oofh |
16:13
🔗
|
|
jc86035 has quit IRC (Quit: Leaving.) |
16:13
🔗
|
|
jc86035 has joined #archiveteam-bs |
16:14
🔗
|
|
jc86035 has quit IRC (Client Quit) |
16:14
🔗
|
prq |
(high control religion is a nice way to say cult-- those groups like to control information, which turns into editing stuff they publish online (1984 style)) |
16:16
🔗
|
astrid |
yea |
16:16
🔗
|
astrid |
im aware of the concept, thankfully haven't gotten tangled up in any of that |
16:16
🔗
|
astrid |
sounds like a hell of a thing |
16:17
🔗
|
|
akierig_ has quit IRC (Read error: Operation timed out) |
16:21
🔗
|
prq |
one interesting thing in this exittor community is there are lots of podcasts. Those tend to not be in the wayback machine. |
16:22
🔗
|
prq |
listening to old episodes of ones that are still around, they'll promote some other podcast and it's just gone off the face of the internet. :/ |
16:22
🔗
|
prq |
this is happening more and more as I dig deeper, hence my interest in archiving. |
16:26
🔗
|
prq |
it is looking more and more like I'll end up needing to do some coding to get podcast archival to be a thing. wouldn't take too much to make it happen. |
16:29
🔗
|
astrid |
there's a lot of podcasts uploaded via https://archive.org/upload/ : https://archive.org/search.php?query=podcast |
16:36
🔗
|
prq |
those seem to be original content producers who have opted to on purpose host their podcast via archive.org instead of doing a self-host or a paid podcast hoster. |
16:36
🔗
|
prq |
I didn't see a way to drop a podcast rss file into the wayback machine to be indexed though. |
16:36
🔗
|
prq |
I did do a little experiment, and I can request the individual rss file, and even the linked mp3 files be indexed by the wayback machine. |
16:37
🔗
|
prq |
but I can't turn around and point a podcast app at the wayback copy. it takes a little more doing to rehost a podcast. |
17:05
🔗
|
astrid |
hm yeah |
17:05
🔗
|
astrid |
youd have to edit the urls in the rss file |
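The URL editing mentioned above could look roughly like the sketch below: it rewrites every `<enclosure url="...">` in an RSS feed to point at a Wayback Machine copy. The `https://web.archive.org/web/<timestamp>/<url>` form is the usual Wayback URL layout; the function name `rewrite_enclosures` and the timestamp are assumptions for illustration, and whether a given podcast app will accept the rewritten feed is untested here.

```python
import xml.etree.ElementTree as ET

WAYBACK_PREFIX = "https://web.archive.org/web/"

def rewrite_enclosures(rss_text, timestamp):
    """Return the feed with each enclosure URL prefixed by a Wayback URL.

    `timestamp` is a Wayback-style timestamp string (YYYYMMDDhhmmss).
    URLs that already point at the Wayback Machine are left alone.
    """
    root = ET.fromstring(rss_text)
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url")
        if url and not url.startswith(WAYBACK_PREFIX):
            enclosure.set("url", f"{WAYBACK_PREFIX}{timestamp}/{url}")
    return ET.tostring(root, encoding="unicode")
```

Feeds that use namespaced tags (for example the iTunes extensions) would need `ET.register_namespace` calls first, otherwise ElementTree emits generic `ns0:` prefixes when serializing.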
17:10
🔗
|
prq |
I have managed to "rescue" one podcast. its audio was still in stitcher and its show notes were in player.fm and stitcher. I host it in my homelab for just me. I would like to make a tool for podcast authors to help them preserve their content way past the day they stop paying for libsyn. maybe a tool they can run that will grab all the episodes and put them in the archive.org free podcast hosting. |
17:18
🔗
|
|
manjaro-u has joined #archiveteam-bs |
17:46
🔗
|
hook54321 |
prq: depending on the specific group, you might get lots of pushback, and possibly legal threats. |
17:47
🔗
|
prq |
right-- that's a major concern in all of this for sure. |
17:47
🔗
|
prq |
the podcast I "rescued" I cannot rehost due to those concerns. |
17:48
🔗
|
prq |
I have talked to a few podcasters about this, and it is a fairly common concern that they have. I'm hoping to get some tools and resources to make it much much easier for them to help preserve their content. |
18:06
🔗
|
|
Damme_ has quit IRC (Read error: Connection reset by peer) |
18:06
🔗
|
|
Damme_ has joined #archiveteam-bs |
18:09
🔗
|
|
DogsRNice has joined #archiveteam-bs |
18:23
🔗
|
|
zhongfu has quit IRC (Ping timeout: 745 seconds) |
18:24
🔗
|
|
zhongfu has joined #archiveteam-bs |
18:35
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
18:35
🔗
|
|
X-Scale` has joined #archiveteam-bs |
18:36
🔗
|
|
X-Scale` is now known as X-Scale |
18:41
🔗
|
|
akierig has quit IRC (Remote host closed the connection) |
18:58
🔗
|
|
katocala has quit IRC () |
19:00
🔗
|
|
katocala has joined #archiveteam-bs |
19:16
🔗
|
|
katocala has quit IRC () |
19:21
🔗
|
|
katocala has joined #archiveteam-bs |
19:25
🔗
|
prq |
hook54321▸ is there a good resource to be intelligent about archiving content with regard to archiving different sites like that? I imagine that's a pretty regular concern for archiveteam and archive.org. |
19:34
🔗
|
JAA |
prq: There is no "one-size-fits-all" archival approach. Every site is different, and while you can usually get quite far just doing a recursive crawl of all links (which is what we do for example with ArchiveBot and what the Internet Archive does in their web-wide crawls), that won't necessarily result in a complete archive and might have all sorts of other issues. Basically, web archival is somewhere |
19:34
🔗
|
JAA |
between computer science and art. There is no way really to learn how to do it; you do it, you learn from your inevitable mistakes, and eventually you get an intuition on how you need to proceed with a particular site. |
19:35
🔗
|
prq |
sorry, I should have included more context in my question. I was building on a comment about content owners pushing back on archival efforts and legal concerns. |
19:35
🔗
|
JAA |
Ah |
19:36
🔗
|
prq |
and of course there's no one-size-fits-all on that either. |
19:36
🔗
|
prq |
but I'd like to understand the issue better |
19:36
🔗
|
JAA |
One good approach is "archive it, keep copies safe in private, make it public at some point in the future when the copyright owners no longer care". |
19:37
🔗
|
prq |
that's definitely something I've considered for some content. |
19:39
🔗
|
|
katocala has quit IRC () |
19:44
🔗
|
hook54321 |
prq: If it's content owned by, for example, the church of scientology and it's stuff they don't want public, I wouldn't re-publish it unless you want there to be a potential you'll have to deal with serious legal stuff. |
19:44
🔗
|
hook54321 |
It depends on who owns it and what it is for the most part. |
19:45
🔗
|
prq |
fortunately, it isn't the church of scientology. The organization I have in mind is fortunately a lot less aggressive in those regards. They do put copyright notices on stuff, but don't seem to request takedowns from the wayback machine. |
19:47
🔗
|
hook54321 |
There's some that afaik for the most part don't care as long as the content was public in the first place. |
19:47
🔗
|
prq |
they rely more on manipulation tactics of the followers (like gaslighting) rather than trying to take on the whole internet. it does seem they don't quite know what to do in the internet age (there are well documented cases of collecting books, burning them, and distributing edited ones back in the 19th century) |
19:47
🔗
|
prq |
that tactic just won't work as well these days |
19:47
🔗
|
prq |
and if I have to sit on a collection privately to hedge against it, I'm prepared to, but I don't think that needs to be the only thing I'd do. |
19:51
🔗
|
prq |
The fact that I'm only interested in their publicly available stuff is helpful. there are people that do work with insiders to leak private information. I might archive *their* site too, but I'm not going to do any leaking myself. |
19:53
🔗
|
|
katocala has joined #archiveteam-bs |
20:09
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:11
🔗
|
|
akierig has joined #archiveteam-bs |
20:12
🔗
|
jc86035 |
hi. I've been archiving stuff to the wayback machine on my own for a while and was wondering if archive team would want to host some of those things. currently some of it runs on the wikimedia servers and pings web.archive.org directly, the rest is on my laptop (also mostly pinging web.archive.org directly) and I run it at irregular intervals. |
20:12
🔗
|
jc86035 |
some of them are fairly small scale (e.g. apple music chart playlists), some of them are quite a bit bigger |
20:14
🔗
|
jc86035 |
most of it is stuff that's ephemeral (i.e. it changes daily/hourly and isn't archived by the host website), not stuff that's very likely to disappear soon, so I'm not sure if it would fit within the archive team's scope |
20:16
🔗
|
markedL |
you work at wikimedia? |
20:16
🔗
|
|
ShellyRol has quit IRC (Read error: Connection reset by peer) |
20:17
🔗
|
|
ShellyRol has joined #archiveteam-bs |
20:18
🔗
|
|
X-Scale` has joined #archiveteam-bs |
20:19
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
20:19
🔗
|
|
X-Scale` is now known as X-Scale |
20:23
🔗
|
|
jc86035 has quit IRC (Quit: Leaving.) |
20:23
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:24
🔗
|
jc86035 |
markedL: no, I host things on Wikimedia Toolforge, which is technically something that anyone can sign up for but it's supposed to be only used for Wikimedia-related stuff |
20:24
🔗
|
jc86035 |
https://wikitech.wikimedia.org/, https://tools.wmflabs.org/ |
20:29
🔗
|
|
jc86035 has quit IRC (Quit: Leaving.) |
20:31
🔗
|
|
X-Scale has quit IRC (Ping timeout: 252 seconds) |
20:33
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:38
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:38
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:40
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:41
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:43
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:43
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:44
🔗
|
|
mls_ has joined #archiveteam-bs |
20:46
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:46
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:48
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:49
🔗
|
|
X-Scale has joined #archiveteam-bs |
20:49
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:54
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:54
🔗
|
|
jc86035 has joined #archiveteam-bs |
20:56
🔗
|
|
akierig_ has joined #archiveteam-bs |
20:57
🔗
|
|
akierig has quit IRC (Read error: Operation timed out) |
20:58
🔗
|
|
jc86035 has quit IRC (Client Quit) |
20:58
🔗
|
|
jc86035 has joined #archiveteam-bs |
21:11
🔗
|
jc86035 |
did anyone respond to me in the last hour? I can't see because I kept disconnecting from the server and I don't know where to find the logs |
21:12
🔗
|
prq |
nope |
21:15
🔗
|
jc86035 |
anyway, I guess the main question is whether archiveteam does stuff like scheduled periodic archiving (e.g. once an hour/week/month etc) |
21:16
🔗
|
jc86035 |
since this is primarily what I've been doing but I'm aware it might not really fall into the scope (since iirc the archivebot instructions sort of discouraged that sort of thing) |
21:17
🔗
|
astrid |
we don't have any tooling around that; it keeps coming up so maybe we should. want to work on a project? :) |
21:18
🔗
|
jc86035 |
I'd like to, though I'm constantly being distracted by other projects in completely unrelated areas so I'm not going to guarantee anything |
21:19
🔗
|
jc86035 |
(and also have irl commitments other than those, ofc) |
21:30
🔗
|
jc86035 |
astrid: how would this sort of project work, exactly? would one just, like, make a github and start putting code into it? |
21:31
🔗
|
astrid |
pretty much |
21:31
🔗
|
jc86035 |
I've pretty much done archival alone excluding my ArchiveTeam Warrior usage so idk how this actually works |
21:31
🔗
|
astrid |
github.com/archiveteam has a bunch of examples of stuff |
21:32
🔗
|
jc86035 |
I imagine we wouldn't just directly use the warrior architecture? I suppose you could just distribute it but it might be overkill |
21:33
🔗
|
astrid |
it really depends on what you're doing! |
21:33
🔗
|
jc86035 |
For the once-an-hour stuff I just used a cron job and wget/xargs/url list so most of that stuff wasn't very complicated at all |
21:33
🔗
|
astrid |
warrior is a great way to run the same archiving tool on a bunch of different machines and collect all the results |
21:34
🔗
|
astrid |
you could run an archivebot pipeline and restrict it to your periodic jobs, and then set up some kind of cron job to feed that |
21:34
🔗
|
jc86035 |
On the other hand stuff like musescore.com actually requires getting multiple url fragments out of the page source (there aren't any direct links), the scale is a lot smaller than some other websites (there's only about 500k–600k public scores) so I'd probably favour something more akin to a bash script for that |
21:35
🔗
|
astrid |
sounds like you might need several different things :) |
21:35
🔗
|
jc86035 |
Yeah it would, I've had to write several different scripts for different things (e.g. the Spotify website is probably bigger than archivebot can handle, so I had to do outlinks one round at a time) |
21:36
🔗
|
jc86035 |
[yes, Spotify is ephemeral, sometimes even artists get deleted, not to mention the constantly changing playlists and such] |
21:39
🔗
|
prq |
this wget with delay command I started back on friday is still going, so I'm starting to read up on the archivebot/warrior pipeline stuff a bit. |
21:40
🔗
|
Kaz |
that's kinda what we did with newsgrabber, but that's very out-of-action for the time being |
21:41
🔗
|
jc86035 |
unfortunately I did almost everything in bash so some of the stuff actually ended up breaking my computer's maximum process limit, I eventually had to build in retrying sets of urls from the parent script just to work around the issue |
21:41
🔗
|
JAA |
Spotify probably uses a bunch of JS and wouldn't work in ArchiveBot at all. |
21:42
🔗
|
jc86035 |
On the contrary, if you use a different user agent you end up on their old site, which still has all the metadata and outlinks and such |
21:42
🔗
|
jc86035 |
It's definitely not an accurate picture of the Spotify interface, but it gives a very good view of the Spotify library |
21:42
🔗
|
jc86035 |
(* user agent: basically anything that the web player won't work with, wget for example) |
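As a small illustration of the user-agent trick described above, this sketch builds a request that identifies itself as wget. The URL is a placeholder and the exact user-agent string is only an example value; nothing here is specific to Spotify, and what any particular site serves in response is not something this sketch can show.

```python
import urllib.request

def build_request(url, user_agent="Wget/1.20.3 (linux-gnu)"):
    """Build (but do not send) a request with a non-browser User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = build_request("https://example.com/some/page")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the result to `urllib.request.urlopen(req)` would perform the actual fetch.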
21:43
🔗
|
markedL |
were you using savepagenow? |
21:43
🔗
|
jc86035 |
yes |
21:43
🔗
|
jc86035 |
for pretty much everything, I did automate archive.is for a while (for a very small number of tasks) but toolforge wasn't cooperative so it just stopped working |
21:46
🔗
|
markedL |
so what kind of jobs did you find make sense? ignoring the technical questions |
21:47
🔗
|
jc86035 |
make sense in terms of what? |
21:47
🔗
|
jc86035 |
like, as in, what was I archiving with what process? |
21:48
🔗
|
markedL |
yeah |
21:50
🔗
|
markedL |
or what content at what frequency and scope |
21:52
🔗
|
jc86035 |
txt/xargs/wget/cron: youtube trending (e.g. https://web.archive.org/web/*/youtube.com/feed/trending?gl=AE, 91 national plus gaming/movies/etc), apple music charts (all once a day), youtube music playlists (don't remember), socialblade/youtube data (once a day), wikipedia music charts (https://en.wikipedia.org/wiki/Wikipedia:Record_charts/List), youtube most viewed (from wikipedia articles) and some others I think |
21:53
🔗
|
jc86035 |
I personally think some of it was overkill and I didn't really do it properly so I would definitely do something less intensive if I started those again |
21:53
🔗
|
jc86035 |
they all stopped working at the end of August because IA introduced the 15/min rate limit |
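A 15-per-minute limit works out to one request every 4 seconds. Below is a minimal sketch of a paced submission loop, assuming the plain `https://web.archive.org/save/<url>` form of Save Page Now. The function name `save_all` and the injectable `fetch`/`sleep` parameters are inventions for this sketch (they make it testable without network access); this is not the script used in the log.

```python
import time
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"
MAX_PER_MINUTE = 15  # the limit mentioned in the message above

def save_all(urls, fetch=None, sleep=time.sleep, per_minute=MAX_PER_MINUTE):
    """Submit each URL in turn, pausing so the rate stays at or below the limit."""
    if fetch is None:
        # Default: a real HTTP request, returning the response status code.
        fetch = lambda u: urllib.request.urlopen(u, timeout=60).status
    interval = 60.0 / per_minute
    results = []
    for i, url in enumerate(urls):
        if i:
            sleep(interval)  # no pause is needed before the first request
        results.append((url, fetch(SAVE_ENDPOINT + url)))
    return results
```

A fixed pause is the simplest approach; it does not account for the time each request itself takes, so the real rate ends up somewhat below the limit.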
21:54
🔗
|
jc86035 |
also possibly they banned toolforge's IP addresses because I didn't figure out that the rate limit was being hit for a while and didn't fix it for a few weeks |
21:54
🔗
|
jc86035 |
only apple music and wikipedia stuff are running right now |
21:56
🔗
|
jc86035 |
specialized scripts: musescore (particularly successful, it's not even supposed to work but they made their image server URL structure predictable so I managed to archive every single public score in July), |
21:57
🔗
|
jc86035 |
new alexa.com website (recursive from page links and also from other url lists, once every few months, in April or so I also fed a few million urls in from external sources and used their image server to test if they were worth archiving, no images need to be archived because there aren't any unique images on any pages) |
21:58
🔗
|
jc86035 |
(musescore runs every hour and basically hoovers up any new scores, unfortunately they limited the public score indexes to 101 pages but the scores have sequential identifiers) |
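Sequential identifiers mean new items can be found by probing the next few IDs after the last one seen, without relying on index pages. A hedged sketch of that idea follows; the URL template is a placeholder on example.com, not the real musescore URL structure, and `candidate_urls` is a hypothetical name.

```python
def candidate_urls(last_seen_id, probe_ahead, template="https://example.com/score/{id}"):
    """Yield URLs for the next `probe_ahead` identifiers after `last_seen_id`."""
    for score_id in range(last_seen_id + 1, last_seen_id + 1 + probe_ahead):
        yield template.format(id=score_id)

for url in candidate_urls(100, 3):
    print(url)
```

An hourly job would remember the highest ID that actually existed and start the next run from there.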
22:01
🔗
|
jc86035 |
(technically speaking the socialblade/youtube stuff also used some specialized scripts to select youtube channel IDs but it's still a cron job, and I've done musescore runs without cron separate to the cron job) |
22:01
🔗
|
jc86035 |
(also now that we have 11 months' worth of trending data we could potentially get all the youtube channel ids out of that data instead, it would be a lot of downloading though) |
22:02
🔗
|
JAA |
That sounds like something ivan and #youtubearchive might be interested in. |
22:02
🔗
|
jc86035 |
and more recently I've also tried to script the new save page now on wayback, primarily for alexa and youtube (screenshots are nice to have, also the outlinks function is very useful) |
22:04
🔗
|
jc86035 |
so if you couldn't use archivebot and wanted to upload to web.archive.org you could technically just put a few urls in a list, open firefox and make spn go to town. not totally sure how reliable it is but it definitely seems to work |
22:06
🔗
|
jc86035 |
I think Jason disapproves (he sent me here from the IA discord server) but I haven't really done much of it anyway because it's a lot more energy intensive than wget, and I kind of took a break after the server change in September |
22:06
🔗
|
|
jc86035 has quit IRC (Quit: Leaving.) |
22:07
🔗
|
|
jc86035 has joined #archiveteam-bs |
22:08
🔗
|
jc86035 |
my internet connection stopped working a few minutes ago so here's what I last tried to send |
22:08
🔗
|
jc86035 |
> and more recently I've also tried to script the new save page now on wayback, primarily for alexa and youtube (screenshots are nice to have, also the outlinks function is very useful) |
22:08
🔗
|
jc86035 |
> so if you couldn't use archivebot and wanted to upload to web.archive.org you could technically just put a few urls in a list, open firefox and make spn go to town. not totally sure how reliable it is but it definitely seems to work |
22:08
🔗
|
jc86035 |
> I think Jason disapproves (he sent me here from the IA discord server) but I haven't really done much of it anyway because it's a lot more energy intensive than wget, and I kind of took a break after the server change in September |
22:09
🔗
|
astrid |
aye |
22:11
🔗
|
jc86035 |
I don't use IRC a lot so would it be better if I give out my Discord ID or something? if this stuff is something that warrants further discussion |
22:12
🔗
|
JAA |
Well, ArchiveTeam is on IRC, not on Discord, fortunately. |
22:13
🔗
|
jc86035 |
I know, but I'm not really familiar with it (I've only sent PMs once IIRC) |
22:13
🔗
|
hook54321 |
there's some clients that are pretty easy to use. |
22:14
🔗
|
jc86035 |
should I just continue to discuss it here? I might go soon so I might just pop in later and then keep discussing it I guess |
22:14
🔗
|
jc86035 |
I'm using Adium right now |
22:14
🔗
|
astrid |
you're doing just fine |
22:14
🔗
|
JAA |
^ |
22:14
🔗
|
jc86035 |
astrid: thanks for the validation lol |
22:15
🔗
|
astrid |
:) |
22:24
🔗
|
jc86035 |
(also if anyone was wondering I scripted spn by using a bash script to create a temporary file and then send a POST form via a JS one-liner, credit where credit is due to https://unix.stackexchange.com/questions/375857/) |
22:28
🔗
|
|
BlueMax has joined #archiveteam-bs |
22:42
🔗
|
|
DogsRNice has quit IRC (Ping timeout: 252 seconds) |
22:42
🔗
|
|
akierig_ has quit IRC (Quit: later_gator) |
22:57
🔗
|
|
jc86035 has quit IRC (Quit: Leaving.) |
23:08
🔗
|
|
BartoCH has quit IRC (Remote host closed the connection) |
23:09
🔗
|
|
BartoCH has joined #archiveteam-bs |
23:23
🔗
|
|
wp494 has joined #archiveteam-bs |
23:38
🔗
|
Raccoon |
What are some handy tools for parsing grabbed pages, such that I can create a template to scrape metadata into columns |
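One low-tech option for the question above is the Python standard library alone. The sketch below pulls the `<title>` and `<meta name=... content=...>` values out of already-downloaded pages and writes one CSV row per page. The class and function names are made up for this example, and real pages will usually need site-specific extraction rules beyond title and meta tags.

```python
import csv
import io
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the page title and named <meta> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.fields[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.fields["title"] = self.fields.get("title", "") + data

def pages_to_csv(pages, columns):
    """`pages` maps a source name to HTML text; returns CSV text with one row per page."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["source"] + columns, extrasaction="ignore")
    writer.writeheader()
    for source, html in pages.items():
        parser = MetaExtractor()
        parser.feed(html)
        parser.close()
        row = {"source": source}
        row.update({c: parser.fields.get(c, "") for c in columns})
        writer.writerow(row)
    return buf.getvalue()
```

Calling `pages_to_csv({"a.html": html_text}, ["title", "author"])` yields a header line followed by one line per page; missing fields come out as empty cells.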
23:41
🔗
|
|
BartoCH has quit IRC (Ping timeout: 615 seconds) |
23:42
🔗
|
|
foureyes has quit IRC (Quit: brb) |
23:44
🔗
|
|
foureyes has joined #archiveteam-bs |
23:55
🔗
|
markedL |
https://www.import.io/ |
23:56
🔗
|
JAA |
lol |