#archiveteam-bs 2020-06-26,Fri


Time Nickname Message
00:02 🔗 scorche` has joined #archiveteam-bs
00:03 🔗 achip has quit IRC (Ping timeout: 255 seconds)
00:03 🔗 scorche` has quit IRC (hub.efnet.us irc.Prison.NET)
00:03 🔗 sivoais_ has quit IRC (hub.efnet.us irc.Prison.NET)
00:03 🔗 scorche has quit IRC (hub.efnet.us irc.Prison.NET)
00:03 🔗 Somebody2 has quit IRC (hub.efnet.us irc.Prison.NET)
00:07 🔗 sivoais has joined #archiveteam-bs
00:13 🔗 achip has joined #archiveteam-bs
00:13 🔗 scorche has joined #archiveteam-bs
00:13 🔗 Somebody2 has joined #archiveteam-bs
00:30 🔗 godane has quit IRC (Read error: Operation timed out)
00:31 🔗 godane has joined #archiveteam-bs
01:11 🔗 lennier2 has joined #archiveteam-bs
01:15 🔗 lennier1 has quit IRC (Read error: Operation timed out)
01:15 🔗 lennier2 is now known as lennier1
01:47 🔗 larryv has joined #archiveteam-bs
02:01 🔗 mgrandi has joined #archiveteam-bs
02:19 🔗 lennier2 has joined #archiveteam-bs
02:22 🔗 lennier1 has quit IRC (Read error: Connection reset by peer)
02:22 🔗 lennier2 is now known as lennier1
02:54 🔗 mgrandi has quit IRC (Leaving)
03:04 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
03:36 🔗 qw3rty_ has joined #archiveteam-bs
03:43 🔗 qw3rty has quit IRC (Read error: Operation timed out)
04:55 🔗 benjins has quit IRC (Read error: Operation timed out)
05:24 🔗 justcool3 has joined #archiveteam-bs
05:50 🔗 HP_Archiv has joined #archiveteam-bs
06:02 🔗 atomicthu any ideas why wpull won't resume from a crawl that finished, with new regexes to get it to pull more?
06:02 🔗 atomicthu i rerun the command and just get
06:02 🔗 atomicthu INFO FINISHED.
06:02 🔗 atomicthu INFO Duration: 0:00:00. Speed: -- B/s.
06:02 🔗 atomicthu INFO Downloaded: 0 files, 0.0 B.
06:03 🔗 atomicthu and it creates a 4kb additional numbered warc
07:24 🔗 larryv_ has joined #archiveteam-bs
07:24 🔗 larryv_ has quit IRC (Client Quit)
07:29 🔗 larryv has quit IRC (Read error: Operation timed out)
09:20 🔗 achip has quit IRC (Read error: Operation timed out)
09:25 🔗 achip has joined #archiveteam-bs
09:44 🔗 bsmith093 has quit IRC (Ping timeout: 272 seconds)
10:00 🔗 bsmith093 has joined #archiveteam-bs
10:10 🔗 benjins has joined #archiveteam-bs
10:35 🔗 JAA atomicthu: As mentioned yesterday, that doesn't work. The URLs are already marked as "skipped" in the DB, so adjusting the options doesn't do anything.
10:35 🔗 atomicthu woops
10:35 🔗 atomicthu i misinterpreted or forgot. thank you
10:37 🔗 HP_Archiv has quit IRC (Quit: Leaving)
11:34 🔗 ranma_ has joined #archiveteam-bs
11:36 🔗 ranma has quit IRC (Ping timeout: 272 seconds)
11:43 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
11:46 🔗 Ravenloft has joined #archiveteam-bs
12:14 🔗 dashcloud has quit IRC (Read error: Operation timed out)
13:10 🔗 dashcloud has joined #archiveteam-bs
13:11 🔗 HP_Archiv has joined #archiveteam-bs
13:19 🔗 sembiance has quit IRC (Remote host closed the connection)
13:51 🔗 HP_Archiv has quit IRC (Quit: Leaving)
14:41 🔗 JAA So apparently Tigris.org already nuked everything even though the deadline is 1 July.
14:43 🔗 JAA Their shutdown notice advised people to "copy and archive any data you wish to keep before that date [2020-07-01]"...
14:43 🔗 JAA Great job, millions of issue comments and discussion posts destroyed prematurely just when I was getting ready to start my grab.
14:44 🔗 Jake :(
15:05 🔗 JAA The Center for Election Science's forums are shutting down. They're entering read-only mode on 30 July and will remain online until the end of the year or so. https://forum.electionscience.org/t/archiving-the-forum-july-30th/697
15:05 🔗 JAA Fairly small, just an AB job in August.
15:10 🔗 Arcorann has quit IRC (Read error: Connection reset by peer)
15:27 🔗 JAA Has anyone done anything with regards to the Something Awful forums so far?
15:28 🔗 JAA I mean apart from atomicthu. :-)
15:28 🔗 JAA I'm considering emailing the mods to see if we can work something out with them.
15:29 🔗 JAA To get all ~178M posts going back to the early years.
15:41 🔗 JAA atomicthu: Also, can you clarify please what you meant when you said the archives were inaccessible even with the archive upgrade?
15:41 🔗 Rapix has joined #archiveteam-bs
16:01 🔗 dashcloud has quit IRC (Ping timeout: 272 seconds)
16:03 🔗 katocala anyone interested in a time sensitive archiving project? Find any sites/stores etc that are "Dixie" themed. Many stores are, such as https://dixieoutfitters.com/
16:04 🔗 katocala If history is any indicator, WordPress will shut some down, and some will just go under from outside pressure
16:05 🔗 dashcloud has joined #archiveteam-bs
16:06 🔗 nicolas17 yikes
16:08 🔗 katocala it's already happened to many Confederate groups, now the target is on the stores: https://fox2now.com/news/missouri/confederate-store-in-branson-missouri-at-protests-center/
16:08 🔗 katocala I have too much going on to run this one as a project, but I think it would be great for someone to do so
16:43 🔗 Ctrl has quit IRC (Ping timeout: 857 seconds)
17:00 🔗 larryv has joined #archiveteam-bs
17:01 🔗 jodizzle Interesting. Any ideas on where and how to look?
17:01 🔗 jodizzle I'm guessing that there isn't a simple directory of them.
17:04 🔗 VoynichCr has left
17:05 🔗 katocala jodizzle no, probably more of a hunt for them with Google searches
17:06 🔗 katocala there are probably links to similar sites at the back of some websites, like so many sites do, but haven't looked deep enough to know if there is a larger list of them
17:07 🔗 jodizzle Hm, alright. I might look into it
17:07 🔗 katocala Nice :)
17:08 🔗 VoynichCr has joined #archiveteam-bs
17:23 🔗 Ctrl has joined #archiveteam-bs
17:45 🔗 DogsRNice has joined #archiveteam-bs
17:47 🔗 DogsRNice this community is getting shut down soon http://www.skvrs.com/ https://steamcommunity.com/groups/skiver
17:48 🔗 lunik13 has quit IRC (Quit: :x)
17:50 🔗 lunik13 has joined #archiveteam-bs
17:50 🔗 maxfan8_ has joined #archiveteam-bs
17:51 🔗 maxfan8_ has quit IRC (Client Quit)
17:51 🔗 DogsRNice these are the only two places they have
17:56 🔗 DogsRNice sorry i didn't put this in the archivebot channel btw, it's kinda busy there right now so this probably would have been buried quickly
17:57 🔗 maxfan8_ has joined #archiveteam-bs
18:01 🔗 maxfan8 has quit IRC (Ping timeout: 745 seconds)
18:01 🔗 maxfan8_ is now known as maxfan8
18:05 🔗 Pixi has quit IRC (Quit: Leaving)
18:14 🔗 Ryz DogsRNice, did a run of that community website, although it might need a second attempt depending on whether it gets to the forums somehow
18:18 🔗 Pixi has joined #archiveteam-bs
18:28 🔗 DogsRNice i can't really tell if it grabbed the forums or not
18:47 🔗 mtntmnky has quit IRC (Remote host closed the connection)
18:48 🔗 mtntmnky has joined #archiveteam-bs
18:48 🔗 mtntmnky has quit IRC (Client Quit)
18:57 🔗 mtntmnky has joined #archiveteam-bs
20:25 🔗 t3 has joined #archiveteam-bs
20:39 🔗 RichardG_ has joined #archiveteam-bs
20:40 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
20:48 🔗 RichardG_ is now known as RichardG
21:56 🔗 dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
22:25 🔗 wyatt8750 has joined #archiveteam-bs
22:26 🔗 wyatt8740 has quit IRC (Read error: Operation timed out)
22:27 🔗 Nikchemny has joined #archiveteam-bs
22:27 🔗 Nikchemny https://archive.org/details/archiveteam_videobot_twitter_com_907101280488837121 Please delete this item
22:29 🔗 JAA Nikchemny: #archiveteam is only intended for very important messages.
22:29 🔗 JAA Not random chat.
22:30 🔗 Nikchemny Ok
22:30 🔗 Nikchemny JAA: Btw, I took two photos of election posters in my city. Are they needed?
22:30 🔗 JAA Yeah, that sounds like something we should archive.
22:30 🔗 JAA Upload them to https://transfer.notkiska.pw/ ?
22:30 🔗 Nikchemny Emm
22:30 🔗 Nikchemny Nope
22:31 🔗 Nikchemny I meant upload them on wiki
22:31 🔗 JAA Ah
22:31 🔗 Nikchemny For the article
22:31 🔗 JAA Yeah, that's also good. :-)
22:31 🔗 Nikchemny So, is videobot random?
22:32 🔗 Nikchemny Or
22:32 🔗 JAA It was on request. I think there were ways to give it individual tweets, users, hashtags, search terms, etc.
22:33 🔗 JAA It has been broken for 2 years or so now.
22:33 🔗 maxfan8 has quit IRC (Quit: WeeChat 2.8)
22:33 🔗 maxfan8 has joined #archiveteam-bs
22:34 🔗 Nikchemny Wow
22:34 🔗 Nikchemny #archivebot
22:36 🔗 Nikchemny Wow. American politician liked this porn video http://web.archive.org/web/20170912042432im_/https://pbs.twimg.com/media/DJfxCREUMAAg9GU.jpg
22:40 🔗 JAA Ah yes, that's probably why it was archived then.
22:42 🔗 Nikchemny So, all videos (on yt) about Elections can be found here: https://www.youtube.com/results?search_query=%D0%BF%D0%BE%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8
22:43 🔗 JAA #youtubearchive on hackint for YouTube stuff
22:44 🔗 arkiver /join #youtubearchive
22:44 🔗 arkiver whoops
22:45 🔗 ramolg has joined #archiveteam-bs
22:48 🔗 Nikchemny I'm sorry, I can't connect to hackint; I'll just create another account on IA and mirror the videos to it
22:49 🔗 arkiver uh
22:49 🔗 arkiver Nikchemny: what videos?
22:52 🔗 Nikchemny arkiver These videos https://www.youtube.com/results?search_query=%D0%BF%D0%BE%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8
22:54 🔗 arkiver Nikchemny: why should they be archived?
23:03 🔗 Rapix has quit IRC (Read error: Connection reset by peer)
23:09 🔗 lennier1 has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
23:11 🔗 lennier1 has joined #archiveteam-bs
23:16 🔗 ranma_ has quit IRC ()
23:22 🔗 Nikchemny arkiver: Almost all of them are about this "great" event: https://en.wikipedia.org/wiki/2020_amendments_to_the_Constitution_of_Russia
23:26 🔗 ramolg Hi all. I've been working on a project to both archive and consolidate US government technical report databases like DTIC, OSTI, and NTRS.
23:27 🔗 ramolg I'm planning to launch a website later this year.
23:27 🔗 ramolg I'm a competent bot programmer but am at the limits of my web programming knowledge. I'm particularly clueless about how to write a good search engine for this database.
23:28 🔗 ramolg Right now I have about 2 GB in a sqlite file, DTIC citations only.
23:28 🔗 ramolg Anyone have any recommendations about how to write a good search engine? Preferably in Python.
23:28 🔗 Nikchemny arkiver: also there are these videos: https://www.youtube.com/results?search_query=%D0%BF%D0%BE%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8+%D1%80%D0%B5%D0%BA%D0%BB%D0%B0%D0%BC%D0%B0
23:28 🔗 Nikchemny About election advertisements
23:29 🔗 ramolg DTIC's current search engine is pretty bad so I want to make a good one.
23:30 🔗 ramolg Also: I'm getting a lot of non-public information via FOIA requests. At present I have declassified citations to thousands of classified reports, though obviously none of the reports.
23:32 🔗 JAA ramolg: You probably don't want to write your own search engine. Look into Solr, Elasticsearch, etc.
23:42 🔗 ramolg JAA: Thanks, I'm looking at Solr right now but it's clear that there's a lot I need to learn before I can approach this.
23:43 🔗 JAA Yeah, large-scale text indexing is not easy.
23:45 🔗 ramolg How much easier would it be if I only searched the metadata/citations rather than full text of the reports?
23:47 🔗 ramolg Also: I've been very slowly spidering this little-known website from Los Alamos National Labs: https://permalink.lanl.gov/object/view?what=info:lanl-repo/lareport/LA-00001
23:47 🔗 ramolg The only "interface" to the website that I'm aware of is changing the report number in the URL string.
23:48 🔗 ramolg I've downloaded every page listed on Google and the Internet Archive.
23:48 🔗 ramolg I've been running a bot to systematically try almost every report number, but if I try more than about one a minute they IP ban me.
23:48 🔗 ramolg Any ideas about how to get a comprehensive list of pages on that site?
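The enumeration ramolg describes (walking report numbers in the URL while staying under roughly one request a minute) could be sketched as below. This is only an illustration, not ramolg's actual bot: the five-digit zero padding is inferred from the single example URL above, the 60-second interval comes from the "about one a minute" remark, and `report_url` and `throttled` are hypothetical helper names. No request is actually sent.

```python
import time
from typing import Callable, Iterable, Iterator

# URL prefix taken from the example link in the log.
BASE = "https://permalink.lanl.gov/object/view?what=info:lanl-repo/lareport/"


def report_url(number: int) -> str:
    """Build a candidate URL; 5-digit zero padding is an assumption."""
    return f"{BASE}LA-{number:05d}"


def throttled(
    items: Iterable[str],
    min_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield items no faster than one per min_interval seconds."""
    last = None
    for item in items:
        if last is not None:
            # Wait out whatever remains of the interval since the last item.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


if __name__ == "__main__":
    # Print three candidate URLs, pausing 60 seconds between them.
    candidates = (report_url(n) for n in range(1, 4))
    for url in throttled(candidates, min_interval=60.0):
        print(url)  # a real bot would fetch the URL here
```

The clock and sleep functions are parameters so the pacing can be checked without waiting in real time.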
23:51 🔗 JAA You can certainly build a basic search function with any DB and some LIKE conditions or similar, but at least with SQLite, you'll quickly run into performance issues if you run that at scale. Proper DBs will do better, but for free-form text, not so sure.
23:51 🔗 ramolg Noted.
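The "basic search function with any DB and some LIKE conditions" that JAA mentions might look roughly like this. It is a minimal sketch under stated assumptions: the `citations` table, its `title` and `abstract` columns, and the sample rows are invented for illustration and are not ramolg's schema. Every LIKE with a leading wildcard forces a full table scan, which is the scaling problem JAA points out.

```python
import sqlite3
from typing import List, Sequence


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical citations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE citations (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    conn.executemany(
        "INSERT INTO citations (title, abstract) VALUES (?, ?)",
        [
            ("Heat transfer in reactor cores", "Thermal analysis of fuel rods."),
            ("Radar cross-section measurements", "Field tests of antenna arrays."),
            ("Reactor shielding materials", "Neutron attenuation in concrete."),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, terms: Sequence[str]) -> List[str]:
    """Return titles of rows where every term occurs in title or abstract.

    Terms are passed as bound parameters; '%' and '_' inside a term are
    not escaped here and would act as LIKE wildcards.
    """
    clauses = []
    params = []
    for term in terms:
        pattern = f"%{term}%"
        clauses.append("(title LIKE ? OR abstract LIKE ?)")
        params.extend([pattern, pattern])
    where = " AND ".join(clauses) if clauses else "1"
    rows = conn.execute(
        f"SELECT id, title FROM citations WHERE {where} ORDER BY id", params
    )
    return [title for _id, title in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(search(conn, ["reactor", "neutron"]))
```

SQLite's LIKE is case-insensitive for ASCII characters by default, so `REACTOR` and `reactor` match the same rows.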
23:51 🔗 JAA You'll want more IPs, i.e. rent cheap virtual/cloud servers etc.
23:53 🔗 Nikchemny has quit IRC (Ping timeout: 252 seconds)
23:53 🔗 JAA I know that godane did some things regarding DTIC before. Not sure about any of the other databases you mentioned.
23:55 🔗 ramolg I'm really interested in any technical report database, and also things that haven't been digitized yet.
